Spectral norm of products of random and deterministic matrices

Roman Vershynin

Introduction

This paper grew out of an attempt to understand the class of random matrices with non-independent entries, but which can be factorized through random matrices with independent entries. Equivalently, we are interested in sample covariance matrices of a wide class of random vectors – the linear transformations of vectors with independent entries.

For random matrices with independent and identically distributed entries, the spectral norm is well studied. Let WW be an m×nm\times n matrix whose entries are real independent and identically distributed random variables with mean zero, variance 11 and finite fourth moment. Estimates of the type

are known to hold (and are sharp) in both the limit regime for dimensions increasing to infinity, and the non-limit regime where the dimensions are fixed. The meaning of (1.1) in the limit regime is that, for a family of matrices as above whose dimensions mm and nn increase to infinity and whose aspect ratio m/nm/n converges to a constant, the ratio W/(n+m)\|W\|/(\sqrt{n}+\sqrt{m}) converges to 11 almost surely .

In the non-limit regime, i.e. for arbitrary dimensions nn and mm, variants of (1.1) were proved by Y. Seginer and R. Latala . If WW is an m×nm\times n matrix whose entries are i.i.d. mean zero random variables, then denoting the rows of WW by XiX_{i} and the columns by YjY_{j}, the result of Y. Seginer states that

where CC is an absolute constant. This estimate is sharp because W\|W\| is obviously bounded below by the Euclidean norm of any row and any column of WW. Furthermore, if the entries wijw_{ij} of the matrix WW are not necessarily identically distributed, then R. Latala’s result states that

In particular, if WW is an m×nm\times n matrix whose entries are independent random variables with mean zero and fourth moments bounded by 11, then one can deduce from either Y. Seginer’s or R. Latala’s result that

This is a variant of (1.1) in the non-limit regime.

The fourth moment hypothesis is known to be necessary. Consider again a family of matrices whose dimensions mm and nn increase to infinity, and whose aspect ratio m/nm/n converges to a constant. If the entries are independent and identically distributed random variables with mean zero and infinite fourth moment, then the upper limit of the ratio W/(n+m)\|W\|/(\sqrt{n}+\sqrt{m}) is infinite almost surely .

2. The main result

The main result of this paper is an extension of the optimal bound (1.2) to the class of random matrices with non-independent entries, but which can be factored through a matrix with independent entries.

Let ε(0,1)\varepsilon\in(0,1) and let m,n,Nm,n,N be positive integers. Consider a random m×nm\times n matrix W=BAW=BA, where AA is an N×nN\times n random matrix whose entries are independent random variables with mean zero and (4+ε)(4+\varepsilon)-th moment bounded by 11, and BB is an m×Nm\times N non-random matrix such that B1\|B\|\leq 1. Then

where C(ε)C(\varepsilon) is a function that depends only on ε\varepsilon.

1. An important feature of this result is that its conclusion is independent of the dimension NN.

2. The proof of Theorem 1.1 yields the stronger estimate

4. Under the stronger subgaussian moment assumption (1.6) on the entries, Theorem 1.1 is easy to prove using standard concentration and an ε\varepsilon-net argument. In contrast, if only some finite moment is assumed, we do not know any simple proof.

3. The smallest singular value

Our main motivation for Theorem 1.1 was to complete the analysis of the smallest singular value of random rectangular matrices carried out by M. Rudelson and the author in . The smallest singular value smin(W)s_{\min}(W) of a matrix WW can be equivalently described as smin(W)=infxWx2/x2s_{\min}(W)=\inf_{x}\|Wx\|_{2}/\|x\|_{2}.

Analyzing the smallest singular value is generally harder than analyzing the largest one (the spectral norm). The analogue of (1.1) for the smallest singular value of random m×nm\times n matrices WW (for m>nm>n) is

The optimal limit version of this result proved in holds under exactly the same hypotheses as (1.1) – for i.i.d. entries with mean zero, variance 11 and finite fourth moment.

Many papers addressed (1.5) for fixed dimensions nn, mm. Sufficiently tall matrices (mCnm\geq Cn for sufficiently large CC) were studied in ; extensions to genuinely rectangular matrices (m>(1+ε)nm>(1+\varepsilon)n for some ε>0\varepsilon>0) were studied in , with gradually improving dependence on ε\varepsilon. An optimal version of (1.5) for all dimensions was obtained in . All these works put somewhat stronger moment assumptions than the fourth moment of the entries wijw_{ij} of the matrix WW. A convenient assumption is that the entries wijw_{ij} are subgaussian random variables. This means that all their moments are bounded by the corresponding moments of the standard normal random variable, i.e.

where MM is called the subgaussian moment. It was proved in that if the entries of WW are i.i.d. mean zero subgaussian random variables with unit variance, then for every t>0t>0 one has

where C,c>0C,c>0 depend only on the subgaussian moment MM. In particular, for such matrices we have

where c1>0c_{1}>0 depends only on the desired probability and the subgaussian moment. This result encompasses the case of square matrices where m=nm=n and hence (1.8) yields smin(W)c2/ns_{\min}(W)\geq c_{2}/\sqrt{n}. For Gaussian square matrices this optimal bound was obtained in and ; for general square matrices a weaker bound n3/2n^{-3/2} was obtained in and the best bound as above in ; the estimate is shown to be optimal in .

Whether (1.8) holds under weaker moment assumptions was only known in the case of square matrices. It was proved in using (1.2) that (1.8) holds under the fourth moment assumption for square matrices, i.e. for m=nm=n. Whether the same is true for arbitrary rectangular matrices under the fourth moment assumption was left open in . The bottleneck of the argument occurred in Proposition 7.3 on where we needed a correct bound on the spectral norm of a product of a random matrix and a fixed orthogonal projection. Such a bound was easy to get only under the subgaussian hypothesis. Theorem 1.1 of the present paper extends the argument of for random matrices with bounded (4+ε)(4+\varepsilon)-th moment. It follows directly from the argument of and Theorem 1.1.

Let ε(0,1)\varepsilon\in(0,1) and mnm\geq n be positive integers. Let AA be a random m×nm\times n matrix whose entries are i.i.d. random variables with mean zero, unit variance and (4+ε)(4+\varepsilon)-th moment bounded by MM. Then, for every δ>0\delta>0 there exist t>0t>0 and n0n_{0} which depend only on ε\varepsilon, δ\delta and MM, and such that

This result follows by the argument in , where one considers probability estimates conditional on the event that the norm of a product WW of a random matrix and a non-random orthogonal projection is small (see [28, Proposition 7.3]).

After this paper was written, two important related results appeared on the universality of the smallest singular value in two extreme regimes – for almost square matrices and for genuinely rectangular matrices. One of these results, by T. Tao and V. Vu works for square and almost square matrices where the the defect mnm-n is constant. It is valid for matrices with i.i.d. entries with mean zero, unit variance and bounded CC-th moment where CC is a sufficiently large absolute constant. The result states that the smallest singular value of such m×nm\times n matrices AA is asymptotically the same as of the Gaussian matrix GG of the same dimensions and with i.i.d. standard normal entries. Specifically,

This universality result, combined with the known asymptotic estimates of the smallest singular value of Gaussian matrices smin(G)s_{\min}(G) allows one to obtain bounds sharper than in Corollary 1.2. However, the universality result of is only known in the almost square regime mn=O(1)m-n=O(1) (and under stronger moment assumptions), while Corollary 1.2 is valid for all dimensions mnm\geq n.

Another recent universality result was obtained by O. Feldheim and S. Sodin for genuinely rectangular matrices, i.e. with aspect ratio m/nm/n separated from 11 by a constant, and with subgaussian i.i.d. entries. In particular they proved the inequality

Deviation inequalities (1.7) and (1.10) complement each other – the former is multiplicative (and is valid for arbitrary dimensions) while the latter is additive (and is applicable for genuinely rectangular matrices). Each of these two inequalities clearly has the regime where it is stronger.

4. Outline of the argument

This bound is already independent of the dimension NN, but is off by logn\sqrt{\log n} from being optimal. The logarithmic term is unfortunately a limitation of this method. This term comes from M. Rudelson’s result, Theorem 3.1 below, where it is needed in full generality. It would be useful to understand the situations where the logarithmic term can be removed from M. Rudelson’s theorem. So far, only one such situation is known from where the independent random vectors XjX_{j} are uniformly distributed in a convex body.

The advantage of almost square matrices is that the magnitude of their entries is easy to control. A simple consequence of the (4+ε)(4+\varepsilon)-th moment hypothesis and Markov’s inequality yields that the entries of A=(aij)A=(a_{ij}) satisfy maxi,jaijn\max_{i,j}|a_{ij}|\leq\sqrt{n} with high probability. Note that the same estimate holds for square matrices (N=nN=n) under the fourth moment assumption. So, in regard to the magnitude of entries, almost square matrices are similar to exactly square matrices, for which the desired bound follows from R. Latala’s result (1.2).

This prompts us to construct the proof of Theorem 1.1 for almost square matrices similarly to R. Latala’s argument in , i.e. using fairly standard concentration of measure results in the Gauss space, coupled with delicate constructions of nets. We first decompose AA into a sum of matrices which contain entries of similar magnitude. As the magnitude increases, these matrices become sparser. This quickly reduces the problem to random sparse matrices, whose entries are i.i.d. random variables valued in {1,0,1}\{-1,0,1\}. The spectral norm of random sparse matrices was studied in as a development of the work of Z. Furedi and J. Komlos . However, we need to bound the spectral norm of the matrix W=BAW=BA rather than AA. Independence of entries is not available for WW, which makes it difficult to use the known combinatorial methods based on the bounding trace of high powers of WW.

Acknowledgement

The author is grateful for the referee for careful reading of the manuscript, and for many suggestions which greatly improved the presentation.

Preliminaries

Throughout the paper, the results are stated and proved over the field of real numbers. They are easy to generalize to complex numbers.

We denote by C,C1,c,c1C,C_{1},c,c_{1}\ldots positive absolute constants, and by C(ε),C1(ε),C(\varepsilon),C_{1}(\varepsilon),\ldots positive quantities that may depend only on the parameter ε\varepsilon. Their values can change from line to line.

2. Concentration of measure

The method that we carry out in Section 4 uses concentration in the Gauss space in combination with constructions of ε\varepsilon-nets. Here we recall some basic facts we need.

where c0(0,1)c_{0}\in(0,1) is an absolute constant.

As a very restrictive but useful example, Theorem 2.1 implies the following deviation inequality for sums of independent exponential random variables gi2g_{i}^{2} (which can also be derived by the more standard approach via moment generating functions).

Let d=(d1,,dm)d=(d_{1},\ldots,d_{m}) be a vector of real numbers, and let g1,,gmg_{1},\ldots,g_{m} be independent standard normal random variables. Then, for every t>0t>0 we have

Another classical deviation inequality we will need is Bennett’s inequality, see e.g. [9, Theorem 2]:

Let X1,,XNX_{1},\ldots,X_{N} be independent mean zero random variables such that Xi1|X_{i}|\leq 1 for all ii. Consider the sum S=X1++XNS=X_{1}+\cdots+X_{N} and let σ2:=Var(S)\sigma^{2}:=\operatorname{Var}(S). Then, for every t>0t>0 we have

We will also need M. Talagrand’s concentration inequality for convex Lipschitz funcitons from [31, Theorem 6.6]; see also [18, Corollary 4.10] and the discussion below it.

3. Nets

Consider a subset UU of a normed space XX, and let ε>0\varepsilon>0. Recall that an ε\varepsilon-net of UU is a subset N\mathcal{N} of UU such that the distance from any point of UU to N\mathcal{N} is at most ε\varepsilon. In other words, for every xUx\in U there exists yNy\in\mathcal{N} such that xyXε\|x-y\|_{X}\leq\varepsilon.

The following estimate follows by a volumetric argument, see e.g. the proof of Lemma 9.5 in .

When computing norms of linear operators, ε\varepsilon-nets provide a convenient discretization of the problem. We formalize it in the next proposition.

Let A:XYA:X\to Y be a linear operator between normed spaces XX and YY, and let N\mathcal{N} be an ε\varepsilon-net of either the unit sphere S(X)S(X) or the unit ball B(X)B(X) of XX for some ε(0,1)\varepsilon\in(0,1). Then

We give the proof for an ε\varepsilon-net of the unit sphere; the case of the unit ball is similar. Every zS(X)z\in S(X) has the form z=x+hz=x+h, where xNx\in\mathcal{N} and hXε\|h\|_{X}\leq\varepsilon. Since A=supzS(X)AzY\|A\|=\sup_{z\in S(X)}\|Az\|_{Y}, the triangle inequality yields

The last term in the right hand side is bounded by εA\varepsilon\|A\|. Thus we have shown that

4. Symmetrization

We will use the standard symmetrization technique as was done in ; see more general inequalities in e.g. [19, Section 6.1]. To this end, let the matrices A=(aij)A=(a_{ij}) and BB be as in Theorem 1.1. Let A=(aij)A^{\prime}=(a^{\prime}_{ij}) be an independent copy of AA, and let εij\varepsilon_{ij} be independent symmetric Bernoulli random variables. Then, by Jensen’s inequality,

Therefore, we can assume without loss of generality in Theorem 1.1 that aija_{ij} are symmetric random variables. Furthermore, let gijg_{ij} be independent standard normal random variables. Then, again by Jensen’s inequality,

Conditioning on aija_{ij}, we thus reduce the problem to random gaussian matrices.

We will use a similar symmetrization technique several times in our argument. In particular, in the proof of Lemma 3.8 we apply the following observation, which can be deduced from standard symmetrization lemma ( Lemma 6.3) and the contraction principle ( Theorem 4.4). For the reader’s convenience we include a direct proof.

Consider independent mean zero random variables ZijZ_{ij} such that Zij1|Z_{ij}|\leq 1, independent symmetric Bernoulli random variables εij\varepsilon_{ij}, and vectors xijx_{ij} in some Banach space, where both ii and jj range in some finite index sets. Then

To be specific, we can assume that both indices ii and jj range in the interval {1,,n}\{1,\ldots,n\} for some integer nn. Let (Zij)(Z^{\prime}_{ij}) denote an independent copy of the sequence of random variables (Zij)(Z_{ij}). Then ZijZijZ_{ij}-Z^{\prime}_{ij} are symmetric random variables. We have

is a convex function. Therefore, on the compact convex set n2^{n^{2}} it attains its maximum on the extreme points, where all aij=±1a_{ij}=\pm 1. By symmetry, the function takes the same value at each extreme point, which equals

5. Truncation and conditioning

We will need some elementary observations related to truncation and conditioning of random variables.

Let XX be a non-negative random variable, and let M>0M>0, p1p\geq 1. Then

We will also need two elementary conditioning lemmas. In Section 4, we will need to control the maximal magnitude of the entries M0=maxijaijM_{0}=\max_{ij}|a_{ij}| of the random matrix AA. Conditioning on M0M_{0} will unfortunately destroy the independence of the entries. So, we will instead condition on an event {M0t}\{M_{0}\leq t\} for fixed tt, which will clearly preserve the independence. This conditional argument used in the proof of Corollary 4.11 relies on the following two elementary lemmas.

Let XX be a random variable and KK be a real number. Then

Let XX, YY be non-negative random variables. Assume there exists K,L>0K,L>0 such that one has for every t1t\geq 1:

Without loss of generality we can assume that K=1K=1 by rescaling XX to X/KX/K. Thus we have for every t1t\geq 1:

By (2.3) and Hölder’s inequality, the first term is bounded as

Further terms can be estimated by Cauchy-Schwarz inequality and using (2.3) and the second inequality in (2.2). Indeed,

6. On the deterministic matrix B𝐵B in Theorem 1.1.

We start with two initial observations that will make our proof of Theorem 1.1 more transparent. By adding an appropriate number of zero rows to BB or zero columns to AA we can assume without loss of generality that n=mn=m, thus BB is an n×Nn\times N matrix.

where HS\|\cdot\|_{\rm{HS}} denotes the Hilbert-Schmidt norm. Throughout the argument, we will only have access to the matrix BB through inequalities (2.4). This explains Remark 2 following Theorem 1.1, which states that the range space of BB is irrelevant as long as we control the spectral and Hilbert-Schmidt norms of BB.

Approach via M. Rudelson’s theorem

In particular, for every t>0t>0, with probability at least 12mect21-2me^{-ct^{2}} one has

The first estimate is taken from [22, inequality (3.4)]. The second estimate can be easily derived from it using the following elementary lemma:

Suppose a non-negative random variable XX satisfies for some m1m\geq 1 that

Next, if t<max(1,logm)t<\max(1,\sqrt{\log m}) then by choosing the absolute constant c>0c>0 sufficiently small right hand side of (3.1) is larger than 11 for a sufficiently small absolute constant cc . Therefore, for every t>0t>0 one has

because if t<max(1,logm)t<\max(1,\sqrt{\log m}) then the right hand side of (3.1) is larger than one, which makes the inequality trivial. This completes the proof. ∎

The next lemma is a consequence of M. Rudelson’s Theorem 3.1 and a standard symmetrization argument.

Let ε1,,εn\varepsilon_{1},\ldots,\varepsilon_{n} be independent symmetric Bernoulli random variables. By the triangle inequality, the standard symmetrization argument (see e.g. [19, Lemma 6.3]), and the assumption, we have

Now we take expectation with respect to X1,,XnX_{1},\ldots,X_{n} and use Cauchy-Schwarz inequality to get

2. Theorem 1.1 up to a logarithmic term

We now state a version of Theorem 1.1 with a logarithmic factor.

Let N,nN,n be positive integers. Consider an N×nN\times n random matrix AA whose entries are independent random variables with mean zero and 44-th moment bounded by 11. Let BB be an n×Nn\times N matrix such that B1\|B\|\leq 1. Then

The proof will need two auxiliary lemmas. Recall that B1,,BNB_{1},\ldots,B_{N} denote the columns of the matrix BB.

The estimate on the expectation follows easily from (2.4):

To estimate the variance, we need to compute

By independence and the mean zero assumption, the only nonzero terms in this sum are those for which i=j;k=li=j;k=l or i=k;j=li=k;j=l or i=l;j=ki=l;j=k. Therefore

By the fourth moment assumption and using (2.4) we have

This result says that all columns of the matrix BABA have norm O(n)O(\sqrt{n}) with high probability. Since the spectral norm of a matrix is bounded below by the norm of any column, this result is a necessary step in proving our desired estimate BA=O(n)\|BA\|=O(\sqrt{n}).

Let us fix j{1,,n}j\in\{1,\ldots,n\} and use Lemma 3.5. This gives

Now we use Chebychev’s inequality, which states that for a random variable ZZ with σ2=Var(Z)\sigma^{2}=\operatorname{Var}(Z) and for an arbitrary k>0k>0, one has

Let t>0t>0 be arbitrary. Using Chebychev’s inequality along with (3.5) for Z=Xj22Z=\|X_{j}\|_{2}^{2}, k=tnk=t\sqrt{n}, we obtain

Taking the union bound over all j=1,,nj=1,\ldots,n, we conclude that

This shows that condition (3.2) holds. Lemma 3.3 then gives

Estimating the maximum in the right hand side using Lemma 3.6, we conclude that

3. Tradeoff between the matrix norm and the magnitude of entries

We would like now to gain more control over the logarithmic factor than we have in Proposition 3.4. Our next result establishes a tradeoff between the logarithmic factor and the magnitude of the matrices AA, BB. It will be used in the proof of Theorem 3.9.

Let a,b0a,b\geq 0 and N,nN,n be positive integers. Let AA be an N×nN\times n matrix whose entries are random independent variables aija_{ij} with mean zero and such that

Let BB be an n×Nn\times N matrix such that B1\|B\|\leq 1, and whose columns satisfy

The proof will again be based on M. Rudelson’s Theorem 3.1, although this time we use Rudelson’s theorem in a more delicate way:

Under the assumptions of Proposition 3.7, we have

Next, clearly μij2a2\mu_{ij}^{2}\leq a^{2}, so

where εij\varepsilon_{ij} denote independent symmetric Bernoulli random variables.

Let t>0t>0. By the second part of M. Rudelson’s Theorem 3.1 and taking the union bound over nn random variables, we conclude that, with probability at least 12n2ect21-2n^{2}e^{-ct^{2}}, we have

The second estimate follows from (3.7) and since maxiBi2b\max_{i}\|B_{i}\|_{2}\leq b by the hypothesis.

Let s>0s>0 be arbitrary. We apply the above estimate for tt chosen so that 2n2ect2=es22n^{2}e^{-ct^{2}}=e^{-s^{2}}. This shows that, with probability at least 1es21-e^{-s^{2}}, one has

Putting this into (3.9) and, together with (3.8), back into (3.6), we complete the proof. ∎

which does not depend on the random variables (gij)(g_{ij}), has expectation

We condition on the random variables (aij)(a_{ij}); this fixes a value of KK.

Consider a (1/2)(1/2)-net N\mathcal{N} of the unit Euclidean sphere Sn1S^{n-1} of cardinality N5n|\mathcal{N}|\leq 5^{n}, which exists by Lemma 2.5. Using Proposition 2.6, we have

Fix xNx\in\mathcal{N}. For every j=1,,nj=1,\ldots,n, the random variable

is a Gaussian random variable with mean zero and variance

(To obtain the first inequality, take the supremum over xSn1x\in S^{n-1}). Therefore, by Corollary 2.2 with di=(VarXi,x)1/2Kd_{i}=(\operatorname{Var}\langle X_{i},x\rangle)^{1/2}\leq K, we have for every t>0t>0:

Let s>0s>0 be arbitrary. The previous estimate for t=sKnt=sK\sqrt{n} gives

Taking the union bound over xNx\in\mathcal{N} and using (3.11), we obtain

Finally, we take expectation with respect to the random variables (aij)(a_{ij}) and use (3.10) to conclude that

4. Theorem 1.1 for logarithmically small columns

Our next step is to combine Propositions 3.4 and 3.7 and obtain a weaker version of the main Theorem 1.1 – this time with the correct bound O(n)O(\sqrt{n}) on the norm, but under the additional assumption that the columns of the matrix BB are logarithmically small.

Let ε(0,1)\varepsilon\in(0,1) and let N,nN,n be positive integers. Consider an N×nN\times n random matrix AA whose entries are independent random variables with mean zero and (4+ε)(4+\varepsilon)-th moment bounded by 11. Let BB be an n×Nn\times N matrix such that B1\|B\|\leq 1, and whose columns satisfy for some M1M\geq 1 that

By the symmetrization argument described in Section 2, we can assume without loss of generality that all entries aija_{ij} of the matrix A=(aij)A=(a_{ij}) are symmetric random variables. Let

We decompose every entry of the matrix AA according to its absolute value as

The norm of BAˉB\bar{A} can be bounded using Proposition 3.7, which we can apply with aa as above and b=Mlog121ε(2n)b=M\log^{-\frac{1}{2}-\frac{1}{\varepsilon}}(2n). This gives

where the last inequality follows by our choice of aa and bb.

Putting the two estimates together, we conclude by the triangle inequality that

The factor M1/2M^{1/2} in the conclusion of Theorem 3.9 can easily be improved to about Mε/2M^{\varepsilon/2} by choosing a=tlog12ε(2n)a=t\log^{\frac{1}{2\varepsilon}}(2n) in the proof and optimizing in tt. We will not need this improvement in our argument.

Approach via concentration

In this section, we develop an alternative way to bound the norm of BABA, which rests on Gaussian concentration inequalities and elaborate choice of ε\varepsilon-nets. The main technical result of this section is the following theorem, which, like Theorem 3.9, gives the correct bound O(n)O(\sqrt{n}) under some boundedness assumptions on the entries of AA.

Let ε(0,1)\varepsilon\in(0,1), M1M\geq 1 and let NnN\geq n be positive integers such that log(2N)Mn\log(2N)\leq Mn. Consider an N×nN\times n random matrix AA whose entries are independent random variables aija_{ij} with mean zero and such that

Let BB be an n×Nn\times N matrix such that B1\|B\|\leq 1. Then

where C(ε)C(\varepsilon) depends only on ε\varepsilon.

1. If the entries aija_{ij} have bounded (4+ε)(4+\varepsilon)-th moment, it is easy to check that maxijaij(nN)14+ε\max_{ij}a_{ij}\sim(nN)^{\frac{1}{4+\varepsilon}} holds with high probability. Therefore, under the (4+ε)(4+\varepsilon)-th moment assumption, the hypotheses of Theorem 4.1 are satisfied for almost square matrices, i.e. those for which Nn1+cεN\leq n^{1+c\varepsilon}. This will quickly yield the main Theorem 1.1 for almost square matrices, see Corollary 4.11 below.

3. Using M. Talagrand’s concentration result, Theorem 2.4, one can also obtains tail bounds for the norm BA\|BA\|:

Under the assumptions of Theorem 4.1, one has for every t>0t>0:

In particular, one has for every q1q\geq 1:

Theorem 4.1 will follow from our analysis of sparse matrices. We will decompose the entries aija_{ij} according to their magnitude. As the magnitude increases, the moment assumptions will ensure that there will be fewer such entries, i.e. the resulting matrix becomes sparser.

We start with an elementary lemma, which will help us analyze the magnitude of the rows and columns of the matrix BABA when AA is a sparse matrix.

Let N,nN,n be positive integers. Consider independent random variables aija_{ij}, i=1,,Ni=1,\ldots,N, j=1,,nj=1,\ldots,n. Let p(0,1]p\in(0,1], and suppose that

Let BB be an n×Nn\times N matrix such that B1\|B\|\leq 1, whose columns are denoted BiB_{i}. Then

We will only prove inequality (4.2); the proof of inequality (4.1) is similar. By the assumptions, we have

Consider the sums of independent random variables

The above estimates show that for every jj we have

Taking the union bound over all jj, we conclude that

Now let s1s\geq 1 be arbitrary, and use the last inequality for t=(np+log(2n))st=(np+\log(2n))s. We obtain

The estimates in Lemma 4.3 motivate us to consider the class of N×nN\times n matrices A=(aij)A=(a_{ij}) whose entries satisfy the following inequalities for some parameters p(0,1]p\in(0,1] and K1K\geq 1:

2. Concentration for a fixed vector

Our goal will be to estimate the magnitude of BA\|BA\| for matrices of the form A=(gijaij)A=(g_{ij}a_{ij}), where gijg_{ij} are independent standard normal random variables, and aija_{ij} are fixed numbers that satisfy conditions (4.4). Such an estimate will be established in Proposition 4.8 below. By the standard symmetrization, the same estimate will hold true if A=(aij)A=(a_{ij}) is a random matrix with entries as in Lemma 4.3. This will be done in Corollary 4.9. Finally, Theorem 4.1 will be deduced from this by decomposing the entries of a random matrix according to their magnitude.

Our first step toward this goal is to check the magnitude of BAx2\|BAx\|_{2} for a fixed vector xx.

Let N,nN,n be positive integers. Consider an N×nN\times n random matrix A=(gijaij)A=(g_{ij}a_{ij}) where gijg_{ij} are independent standard normal random variables and aija_{ij} are numbers that satisfy conditions (4.4). Let BB be an n×Nn\times N matrix such that B1\|B\|\leq 1. Then, for every vector xB2nx\in B_{2}^{n} we have

Denoting as usual the columns of BB by BiB_{i}, we have

Since x21\|x\|_{2}\leq 1 and using the last condition in (4.4), we have

We will now strengthen Lemma 4.4 into a deviation inequality for BAx2\|BAx\|_{2}. This is a simple consequence of the Gaussian concentration, Theorem 2.1. This deviation inequality is universal in that it holds for any vector xx; in the sequel we will need more delicate inequalities that depend on the distribution of the coordinates in xx.

Let AA and BB be matrices as in Lemma 4.4. Then, for every vector xB2nx\in B_{2}^{n} and every t>0t>0 we have

where BiB_{i} are the columns of the matrix BB. Therefore, the random vector BAxBAx is distributed identically with the random vector

and where gig_{i} are independent standard normal random variables. Since all aij1|a_{ij}|\leq 1 by conditions (4.4), and x21\|x\|_{2}\leq 1 by the assumptions, we have

Then the Gaussian concentration, Theorem 2.1, gives for every t>0t>0:

where g=(g1,,gN)g=(g_{1},\ldots,g_{N}). Since as we noted above, f(g)f(g) is distributed identically with BAx2\|BAx\|_{2}, Lemma 4.4 completes the proof. ∎

3. Control of sparse vectors

Nevertheless, the bound in Lemma 4.5 can be made uniform over a set of sparse vectors, whose metric entropy is smaller than that of the whole sphere:

Let AA and BB be matrices as in Lemma 4.4. There exists an absolute constant c>0c>0 such that the following holds. Consider the set of vectors

Let c>0c>0 be a constant to be determined later, and let λ:=cp/log(e/p)\lambda:=cp/\log(e/p). Then

Using Proposition 2.6 and taking the union bound over all xNJx\in\mathcal{N}_{J}, we obtain

Since there are (nλn)(e/λ)λn\binom{n}{\lfloor\lambda n\rfloor}\leq(e/\lambda)^{\lambda n} ways to choose the subset JJ, by taking the union bound over all JJ we conclude that

Finally, if the absolute constant c>0c>0 in the definition of λ\lambda is chosen sufficiently small, we have λlog(e/λ)n+2λnc0np\lambda\log(e/\lambda)n+2\lambda n\leq c_{0}np. Thus the right hand side of (4.6) is at most

4. Control of spread vectors

Although we now have a good control of sparse vectors, they unfortunately comprise a small part of the unit ball B2nB_{2}^{n}. More common but harder to deal with are “spread vectors” – those having many coordinates that are not close to zero. The next result gains control of the spread vectors.

Let AA and BB be matrices as in Lemma 4.4 with NnN\geq n. Let M2M\geq 2. Consider the set of vectors

This time we will need to work with multiple nets to account for different possible distributions of the magnitude of the coordinates of vectors xB2,x\in B_{2,\infty}. Since xx2\|x\|_{\infty}\leq\|x\|_{2}, without loss of generality we can assume that MnM\leq\sqrt{n}.

A standard calculation shows that N\mathcal{N} is an 12\frac{1}{2}-net of B2,B_{2,\infty} in the B2,B_{2,\infty}-norm, i.e. for every xB2,x\in B_{2,\infty} there exists yNy\in\mathcal{N} such that xy12B2,x-y\in\frac{1}{2}B_{2,\infty}. Therefore, by Proposition 2.6,

Fix xNx\in\mathcal{N}. Since x21\|x\|_{2}\leq 1, the number of coordinates of xx that satisfy xj=hk|x_{j}|=h_{k} is at most hk2\lfloor h_{k}^{-2}\rfloor, for every kk. Decomposing xx according to the coordinates whose absolute value is hkh_{k}, we have by the triangle inequality that

Fix kk and assume that Nk\mathcal{N}_{k}\neq\emptyset. Since hkM/nh_{k}\leq M/\sqrt{n}, we have

To estimate the cardinality of Nk\mathcal{N}_{k}, note that there are at most min(m,n)\min(m,n) ways to choose x0:=l\|x\|_{0}:=l; there are (nl)\binom{n}{l} ways to choose the support of xx; and there are 2l2^{l} ways to choose the (signs of) nonzero coordinates of xx. Hence by Stirling’s approximation and using (4.8), we have

Step 2: control of a fixed vector. Fix mm and fix xNkx\in\mathcal{N}_{k}. As we saw in the proof of Lemma 4.5,

and where gig_{i} are independent standard normal random variables. Since xNkx\in\mathcal{N}_{k}, we have x=hk1m\|x\|_{\infty}=h_{k}\leq\frac{1}{\sqrt{m}}. This and the second condition in (4.4) yield

Repeating the estimate in the proof of Lemma 4.5, we bound the Lipschitz norm as

Then the Gaussian concentration, Theorem 2.1, gives for every t>0t>0:

where g=(g1,,gN)g=(g_{1},\ldots,g_{N}). Since as we noted above, f(g)f(g) is distributed identically with BAx2\|BAx\|_{2}, Lemma 4.4 yields that

Let u>0u>0 be arbitrary. Applying the above estimate for t=uKnp+log(2N)t=uK\sqrt{np+\log(2N)} and using NnN\geq n we conclude that

Step 3: union bound. Taking the union bound in (4.10) over all xNkx\in\mathcal{N}_{k} and using estimate (4.9) on the cardinality of Nk\mathcal{N}_{k}, we have for all u>0u>0:

Let s1s\geq 1. We choose u=C1slogMu=C_{1}s\sqrt{\log M}, where C1:=C/c0C_{1}:=\sqrt{C/c_{0}}. Since u1u\geq 1 and m1m\geq 1, M2M\geq 2, we obtain from the above estimate that

Putting this back in (4.7), we conclude that

5. Norms of sparse matrices, and proof of Theorem 4.1

Propositions 4.6 and 4.7 together handle all vectors in the unit ball, and yield the following norm estimate:

Let AA and BB be matrices as in Lemma 4.4 with NnN\geq n. Then

Let cc be the absolute constant as in Proposition 4.6; we can clearly assume that c1/4c\leq 1/4. We define

Note that M2M\geq 2 as required in Proposition 4.6.

Fix a vector xB2nx\in B_{2}^{n}. We decompose it according to the magnitude of the coordinates, as follows:

Clearly, y2x21\|y\|_{2}\leq\|x\|_{2}\leq 1, z2x21\|z\|_{2}\leq\|x\|_{2}\leq 1. By Markov’s inequality, we have

Then yB2,0y\in B_{2,0} as in Proposition 4.6. On the other hand, zM/n\|z\|_{\infty}\leq M/\sqrt{n} by definition, so zB2,z\in B_{2,\infty} as in Proposition 4.7. Therefore, by Propositions 4.6 and 4.7 we have

Our choice of MM and the assumption NnN\geq n completes the proof. ∎

Finally, a standard symmetrization argument yields the following norm estimate, which we shall use for sparse random matrices.

Let p(0,1]p\in(0,1] and let NnN\geq n be positive integers. Consider an N×nN\times n random matrix AA whose entries are independent random variables aija_{ij} with mean zero and such that

Let BB be an n×Nn\times N matrix such that B1\|B\|\leq 1. Then

It would be interesting to remove the logarithmic term from this estimate.

By Lemma 4.3, conditions (4.4) hold with some random parameter K1K\geq 1 which only depends on the random variables (aij)(a_{ij}) and not on (gij)(g_{ij}), and which satisfies

Condition on the random variables (aij)(a_{ij}). Proposition 4.8 then yields

Therefore, when we remove the conditioning, we obtain by (4.12) that

Then we have a decomposition A=k=02kA(k)A=\sum_{k=0}^{\infty}2^{k}A^{(k)}. This sum is actually finite because of the boundedness assumption on aija_{ij}. Indeed, we have

where k0k_{0} is the maximal integer such that

Using Corollary 4.9 for the matrix A(0)A^{(0)} and p=1p=1, we obtain

where the last line follows because log(2N)Mn\log(2N)\leq Mn and M1M\geq 1 by the hypothesis.

Now we fix 1kk01\leq k\leq k_{0}. Using the (2+ε)(2+\varepsilon)-th moment assumption, we have by Markov’s inequality that

By the definition of pkp_{k} and by (4.14), we have

Therefore, npk+log(2N)(1+M)npk2Mnpknp_{k}+\log(2N)\leq(1+M)np_{k}\leq 2Mnp_{k}, so

Using (4.13) and the triangle inequality, then using (4.15) and (4.5), we conclude that

This completes the proof of Theorem 4.1. ∎

6. Almost square matrices

The main application of Theorem 4.1 is for almost square matrices – those for which N=n1+o(1)N=n^{1+o(1)}. The next lemma verifies the hypotheses of Theorem 4.1 for such matrices.

Let ε(0,1)\varepsilon\in(0,1) and let N,nN,n be positive integers satisfying Nn1+ε/10N\leq n^{1+\varepsilon/10}. Let AA be an N×nN\times n random matrix whose entries are independent random variables with (4+ε)(4+\varepsilon)-th moment bounded by 11. Define the random variable MM by the equation

By Markov’s inequality, we have for every i,ji,j that

Taking the union bound over all nNnN random variables aija_{ij}, we obtain

The assumption Nn1+ε/10N\leq n^{1+\varepsilon/10} yields that

Therefore, since 2+ε/84+ε12+ε/4\frac{2+\varepsilon/8}{4+\varepsilon}\leq\frac{1}{2+\varepsilon/4} and t1t\geq 1, we have

We are now ready to state and prove a partial case of Theorem 1.1 for almost square matrices.

Let ε(0,1)\varepsilon\in(0,1) and let N,nN,n be positive integers satisfying Nn1+ε/10N\leq n^{1+\varepsilon/10}. Let AA be an N×nN\times n random matrix whose entries are independent random variables with mean zero and (4+ε)(4+\varepsilon)-th moment bounded by 11. Let BB be an n×Nn\times N matrix such that B1\|B\|\leq 1. Then

Without loss of generality we may assume that NnN\geq n by adding an appropriate number of zero rows to AA and zero columns to BB. Also, using the standard symmetrization, we can assume that the random variables aija_{ij} are symmetric. Let MM be the random variable as in Lemma 4.10, and let t1t\geq 1. By the definition, {Mt}\{M\leq t\} is the product event. Therefore, conditioning on this event (i) preserves the independence of the entries of AA; (ii) makes all these entries bounded as in (4.17); (iii) can only reduce their moments by Lemma 2.9, thus for all i,ji,j we have

Therefore, we can apply Corollary 4.2 conditionally, with ε/4\varepsilon/4 and with MM replaced by max(M,10)\max(M,10), which gives

Completion of the proof of Theorem 1.1

By adding an appropriate number of zero rows to BB or zero columns to AA we can assume that m=nm=n, thus BB is an n×Nn\times N matrix. Consider the exponent

As usual, let B1,,BNB_{1},\ldots,B_{N} be the columns of the matrix BB. Consider the subset I{1,,N}I\subset\{1,\ldots,N\} of large columns defined as

Here we choose C0(ε)C_{0}(\varepsilon) sufficiently large so that, by (2.4) and Markov’s inequality, we have

Denote by AIA_{I} the N0×nN_{0}\times n submatrix of AA whose rows are in II, by BIB_{I} the n×N0n\times N_{0} submatrix of BB whose columns are in II (and similarly for IcI^{c}). The decomposition BA=BIAI+BIcAIcBA=B_{I}A_{I}+B_{I^{c}}A_{I^{c}} implies by the triangle inequality that

This splits our problem into two subproblems, one for II and one for IcI^{c}. Of course, if II or IcI^{c} is empty then the corresponding matrix is zero and we can skip its estimation.

The matrices AIA_{I}, BIB_{I} are almost square, so Corollary 4.11 applies for them, giving

On the other hand, the columns of the matrix BIcB_{I^{c}} are small by the definition of II:

Therefore, Theorem 3.9 applies to the matrices AIcA_{I^{c}}, BIcB_{I^{c}}, which gives

Putting estimates (5.2) and (5.3) into (5.1), we conclude that

References