The smallest singular value of random rectangular matrices with no moment assumptions on entries

Konstantin E. Tikhomirov

Introduction

In last years, spectral properties of random matrices with fixed dimensions (the corresponding theory is often called non-asymptotic) have attracted considerable attention of researchers, whose efforts have been mostly concentrated on studying distributions of the largest and the smallest singular values. For detailed information on the development of the subject, we refer the reader to surveys , .

Let NnN\geq n. Given an N×nN\times n random matrix AA, we employ a usual notation s1(A):=maxySn1Ays_{1}(A):=\max\limits_{y\in S^{n-1}}\|Ay\|; sn(A):=infySn1Ays_{n}(A):=\inf\limits_{y\in S^{n-1}}\|Ay\|. A limiting result of Z.D. Bai and Y.Q. Yin suggests that for an N×nN\times n matrix with i.i.d. mean zero entries with unit variance and a finite fourth moment, its largest and smallest singular values should “concentrate” near N+n\sqrt{N}+\sqrt{n} and Nn\sqrt{N}-\sqrt{n}, respectively. In the non-asymptotic setting one is interested, in particular, in finding the weakest possible conditions on random matrices that would imply s1N+ns_{1}\lesssim\sqrt{N}+\sqrt{n} and snNns_{n}\gtrsim\sqrt{N}-\sqrt{n} with a large probability.

Various estimates for the extremal singular values were obtained when studying the problem of approximating covariance matrix of a random vector by the empirical covariance matrix. Answering a question of R. Kannan, L. Lovász and M. Simonovits, the authors of treated log-concave random vectors. Later, the log-concavity was replaced by weaker assumptions (see, in particular, , , , ).

The assumption of isotropicity of a random vector or, more generally, boundedness of variance of its coordinates is quite natural and appears as part of requirements on a matrix’ rows in all the aforementioned papers. However, for a deeper understanding of non-asymptotic characteristics of random matrices, an important question is whether any moment assumptions on entries are really necessary in order to get satisfactory lower estimates for the smallest singular value.

Unlike in and where the matrix entries within a given row are not necessarily independent, in our paper we consider the classical setting when a rectangular matrix has i.i.d. entries. However, in contrast with all the mentioned results, the lower estimate for the smallest singular value that we prove does not use any moment assumptions; the only requirement is that the distribution of entries satisfies a “spreading” condition given in terms of the Levy concentration function. Moreover, compared to and , we significantly relax the assumptions on the aspect ratio of the matrix.

Given a real random variable ξ\xi, the concentration function of ξ\xi is defined as

The main result of our paper is the following theorem:

Then for any non-random N×nN\times n matrix BB we have

Adding a non-random component BB in the theorem does not increase complexity of the proof; on the other hand, it demonstrates “shift-invariance” of the lower estimate. Note that the problem of estimating the smallest singular value of non-random shifts of square matrices is important in analysis of algorithms , , , .

Our proof of Theorem 1 is based on two key elements: on a modification of a standard ε\varepsilon-net argument for matrices (Proposition 3) and on estimates of the distance between a random vector and a fixed linear subspace that follow from a result of (Theorem 4 and Corollary 6 of our paper). Our method is similar in many aspects to the approach developed in and later in , . In particular, as in the mentioned papers, we decompose the unit sphere Sn1S^{n-1} into several subsets which are studied separately from one another. On the other hand, our modification of the ε\varepsilon-net argument and its technical realization in regard to splitting a random matrix into “regular” and “non-regular” parts are apparently new.

We will discuss the main idea of the proof more concretely and in more detail at the end of the next section, after we define notation and state the modified ε\varepsilon-net argument.

Preliminaries

Choose any ySn1y\in S^{n-1} and yNy^{\prime}\in{\mathcal{N}} such that yyε\|y-y^{\prime}\|\leq\varepsilon. Then

By taking the infimum over all ySn1y\in S^{n-1}, we obtain the result. ∎

Note that Lemma 2 cannot be used to handle matrices with the aspect ratio less than 22. Indeed, the lower estimate s_{n}(D)\geq\inf\limits_{y\in S^{n-1}}{\rm d}\bigl{(}{D_{1}}y,{\rm span}{D_{2}}\bigr{)} is non-trivial only if spanD1spanD2=0{\rm span}{D_{1}}\cap{\rm span}{D_{2}}=0, which is not true when N<2nN<2n and both D1{D_{1}} and D2{D_{2}} have full rank. The following strengthening of Lemma 2 resolves the problem:

and 3)3) for any ySy\in S there is yNy^{\prime}\in{\mathcal{N}} such that

Take any ySy\in S and let yNy^{\prime}\in{\mathcal{N}} be such that ProjEy(y)yε\|{\rm Proj}_{E_{y^{\prime}}}(y)-y^{\prime}\|\leq\varepsilon. Then

Taking the infimum over SS, we get the result. ∎

Note that for N=1N=1 the above definition is consistent with that given in the introduction. The following result is proved by M. Rudelson and R. Vershynin in :

where C\refRVconclemma>0C_{\ref{RV conc lemma}}>0 is a (sufficiently large) universal constant.

This theorem gives a nontrivial estimate for concentration only for η\eta sufficiently close to zero. Below, we provide an elementary estension of this result covering the case of “more concentrated” coordinates. First, let us recall a theorem of B. Rogozin:

where C\refrogozinlemma>0C_{\ref{rogozin lemma}}>0 is a universal constant.

Now, an easy application of Theorems 4 and 5 gives

Let X=(X1,X2,,Xm)X=(X_{1},X_{2},\dots,X_{m}) be a random vector with independent coordinates such that

and via the definition of SS we get the statement. ∎

where the h\refpeakylemma,w\refpeakylemma>0h_{\ref{peaky lemma}},w_{\ref{peaky lemma}}>0 depend only on γ\gamma and δ\delta.

for some h,w>0h,w>0 depending only on γ\gamma. Let

We adopt the following notation: For any subset W{1,2,,N}×{1,2,,n}W\subset\{1,2,\dots,N\}\times\{1,2,\dots,n\} let

Note that for all ωΩW\omega\in\Omega_{W} we have (aij)H(ω)=0{(a_{ij})}_{\overline{H}}(\omega)=0 for (i,j)W(i,j)\in W and (aij)H(ω)=0{(a_{ij})}_{H}(\omega)=0 for (i,j)W(i,j)\notin W. Hence, to verify (4) it is sufficient to prove that nNnN variables

so aija_{ij} (1iN1\leq i\leq N, 1jn1\leq j\leq n) are conditionally independent given ΩW\Omega_{W}. ∎

Define MM as the collection of all subsets W{1,2,,N}×{1,2,,n}W\subset\{1,2,\dots,N\}\times\{1,2,\dots,n\} satisfying

with \tau=\frac{1}{2}\bigl{(}\delta^{-1/4}-\delta^{-1/3}\bigr{)}. Then

where w\refconcinmatrix>0w_{\ref{conc in matrix}}>0 depends only on δ\delta.

For each i=1,2,,Ni=1,2,\dots,N and J{1,2,,n}J\subset\{1,2,\dots,n\} let

It is not difficult to see that the events EiΩ\mathcal{E}_{i}\subset\Omega (i=1,2,,Ni=1,2,\dots,N) are independent in view of independence of the entries of AA.

Fix for a moment any i{1,2,,N}i\in\{1,2,\dots,N\}. One can verify that for any j{1,2,,n}j\in\{1,2,\dots,n\} and J{1,2,,n}J\subset\{1,2,\dots,n\} the variables (aij)H{(a_{ij})}_{H} and (aij)H{(a_{ij}^{\prime})}_{H} are i.i.d. given event ΩJi\Omega_{J}^{i}. It follows that

Take any subset J{1,2,,n}J\subset\{1,2,\dots,n\} satisfying

Thus, any JJ satisfying (7) belongs to LiL_{i}. Clearly,

hence WMW\subset M. The argument implies EWMΩW\mathcal{E}\subset\bigcup_{W\in M}\Omega_{W} and the result follows. ∎

Next, we combine the result of Lemma 9 with Corollary 6:

Let N,n,δN,n,\delta, HH, A,AA,A^{\prime}, yy and ss be exactly as in Lemma 9 and BB be a non-random N×nN\times n matrix. Then

where E=span{ej}jsuppyE={\rm span}\{e_{j}\}_{j\in{\rm supp}y} and h\refwrapsignumlemma>0h_{\ref{wrap signum lemma}}>0, w\refwrapsignumlemma>0w_{\ref{wrap signum lemma}}>0 depend only on δ\delta.

Let MM and τ\tau be defined as in Lemma 9 and take any WMW\in M. Let

By Lemma 8, the subspace VA,B(H,E)=(A+B)(E)+(AH+B)(E)V_{A,B}(H,E)=(A+B)(E^{\perp})+({A}_{\overline{H}}+B)(E) and the vector AHy{A}_{H}y are conditionally independent given ΩW\Omega_{W}, hence the above estimate immediately implies

Since the relation holds for all WMW\in M, in view of Lemma 9 we obtain

Finally, we can prove the main result of the section:

Let A=(aij)A^{\prime}=(a_{ij}^{\prime}) be an N×nN\times n random matrix having the same distribution as AA such that 22-dimensional vectors (aij,aij)(a_{ij},a_{ij}^{\prime}) (1iN1\leq i\leq N, 1jn1\leq j\leq n) are i.i.d. and for any admissible ii and jj the variables aija_{ij} and aija_{ij}^{\prime} are conditionally i.i.d. given event {ωΩ:aij(ω)H}\{\omega\in\Omega:\,a_{ij}(\omega)\in H\} and identical on {ωΩ:aij(ω)H}\{\omega\in\Omega:\,a_{ij}(\omega)\in\overline{H}\}. For every i=1,2,,Ni=1,2,\dots,N and j=1,2,,nj=1,2,\dots,n, by the formula for the joint distribution of aija_{ij} and aija_{ij}^{\prime} we get

hence, in view of symmetric distribution of (aij)H(aij)H{(a_{ij})}_{H}-{(a_{ij}^{\prime})}_{H}, we have {\mathcal{Q}}\bigl{(}{(a_{ij})}_{H}-{(a_{ij}^{\prime})}_{H},\frac{d}{2}\bigr{)}\leq 1-\frac{r}{2}. Clearly, h\refdistanceestimatedyj2h_{\ref{distance estimate}}\geq\frac{d|y_{j}|}{2} for every coordinate yjy_{j} of the vector yy, hence by Theorem 5 for all i=1,2,,Ni=1,2,\dots,N

Thus, vector yy satisfies condition (5) with s:=h\refdistanceestimates:=h_{\ref{distance estimate}}. Then, by Lemma 10,

In our proof of Theorem 1, we represent Sn1S^{n-1} as the union of three subsets:

where θ\theta is a function of the parameters β\beta and δ\delta of the theorem. Then the smallest singular value of A+BA+B can be estimated by bounding separately infyAy+By\inf\limits_{y}\|Ay+By\| over each of the three subsets.

In our representation of Sn1S^{n-1}, we follow an idea from , where the unit sphere was split into sets of “close to sparse” and “far from sparse” vectors. A similar splitting was also employed in , , where the terms “compressible” and “incompressible” were used instead. On the other hand, our “borderline” N\sqrt{N} is smaller by the order of magnitude than in the mentioned papers.

The next elementary lemma shall be used in conjunction with Proposition 3.

Then there is a finite set NT{\mathcal{N}}\subset T of cardinality at most \bigl{(}\frac{C_{\ref{net sparsification lemma}}n}{\varepsilon m}\bigr{)}^{m} such that for any ySy\in S there is y=y(y)Ny^{\prime}=y^{\prime}(y)\in{\mathcal{N}} with yχsuppyyε\|y\chi_{{\rm supp}y^{\prime}}-y^{\prime}\|\leq\varepsilon.

For any J{1,2,,n}J\subset\{1,2,\dots,n\} with Jm|J|\leq m, let NJ{\mathcal{N}}_{J} be an ε\varepsilon-net for Tspan{ei}iJT\cap{\rm span}\{e_{i}\}_{i\in J} of cardinality at most \bigl{(}\frac{3}{\varepsilon}\bigr{)}^{m}. Define N{\mathcal{N}} as the union of NJ{\mathcal{N}}_{J} for all admissible JJ. Then, obviously,

Next, fix any ySy\in S and let xTx\in T be such that yχsuppx=xy\chi_{{\rm supp}x}=x. Since suppxm|{\rm supp}x|\leq m, there is yNsuppxNy^{\prime}\in{\mathcal{N}}_{{\rm supp}x}\subset{\mathcal{N}} with xyε\|x-y^{\prime}\|\leq\varepsilon. It remains to note that since suppysuppx{\rm supp}y^{\prime}\subset{\rm supp}x, necessarily yχsuppyyyχsuppxy=xyε\|y\chi_{{\rm supp}y^{\prime}}-y^{\prime}\|\leq\|y\chi_{{\rm supp}x}-y^{\prime}\|=\|x-y^{\prime}\|\leq\varepsilon. ∎

Then for the set S=San1(N)Spn1(θ\refcompressiblelemma)S=S_{a}^{n-1}(\sqrt{N})\setminus S_{p}^{n-1}(\theta_{\ref{compressible lemma}}) and any non-random N×nN\times n matrix BB

Fix any γ>0\gamma>0 and δ>1\delta>1 and define d:=2d:=2, r:=γr:=\gamma, t:=12t:=\frac{1}{2}; let h\refdistanceestimateh_{\ref{distance estimate}} be as in Proposition 11 and N\refcompressiblelemma=N\refcompressiblelemma(γ,δ)N_{\ref{compressible lemma}}=N_{\ref{compressible lemma}}(\gamma,\delta) be the smallest integer greater than 2h\refwrapsignumlemmah\refdistanceestimate\frac{2}{h_{\ref{wrap signum lemma}}h_{\ref{distance estimate}}} such that for all NN\refcompressiblelemmaN\geq N_{\ref{compressible lemma}}

Let Ey=span{ej}jsuppyE_{y^{\prime}}={\rm span}\{e_{j}\}_{j\in{\rm supp}y^{\prime}} (yNy^{\prime}\in{\mathcal{N}}) and define an event

In view of Proposition 11, the upper estimate for N|{\mathcal{N}}| and the definition of N\refcompressiblelemmaN_{\ref{compressible lemma}}

Take any ωE\omega\in\mathcal{E} and define D1=AH(ω){D_{1}}={A}_{H}(\omega), D2=AH(ω)+B{D_{2}}={A}_{\overline{H}}(\omega)+B, D=D1+D2D={D_{1}}+{D_{2}}. Since all entries of D1{D_{1}} are bounded by N\sqrt{N} by absolute value, we get D1N3/2\|{D_{1}}\|\leq N^{3/2}; next, for every yNy^{\prime}\in{\mathcal{N}}

(note that D(Ey)+D2(Ey)=VA,B(H,Ey)(ω)D(E_{y^{\prime}}^{\perp})+{D_{2}}(E_{y^{\prime}})=V_{A,B}(H,E_{y^{\prime}})(\omega)). Hence, by Proposition 3, we get

Finally, applying the above argument to all ωE\omega\in\mathcal{E}, we get the result. ∎

As we noted before, construction of the set HH corresponding to Sn1San1(N)S^{n-1}\setminus S^{n-1}_{a}(\sqrt{N}) is not so trivial as in the case of almost N\sqrt{N}-sparse vectors. The reason is that in general the set Sn1San1(N)S^{n-1}\setminus S^{n-1}_{a}(\sqrt{N}) is much larger than San1(N)S^{n-1}_{a}(\sqrt{N}), and we have to apply more delicate arguments to get a satisfactory probabilistic estimate. The construction of HH for the set of “far from N\sqrt{N}-sparse” vectors is contained in the following lemma:

Let us recall a folklore estimate of the norm of a random matrix with bounded mean zero entries (see, for example, [11, Proposition 2.4]):

Let W=(wij)W=(w_{ij}) be an N×nN\times n (NnN\geq n) random matrix with i.i.d. mean zero entries; R>0R>0 and assume that wijR|w_{ij}|\leq R a.s. Then for a universal constant C\reflaplacetransformlemma>0C_{\ref{laplace transform lemma}}>0

The next lemma highlights a useful property of the vectors from Sn1San1(N)S^{n-1}\setminus S_{a}^{n-1}(\sqrt{N}):

For any integer Nnm1N\geq n\geq m\geq 1 and any ySn1San1(N)y\in S^{n-1}\setminus S_{a}^{n-1}(\sqrt{N}) there is a set J=J(y){1,2,,n}J=J(y)\subset\{1,2,\dots,n\} such that Jm|J|\leq m, yχJ12mn\|y\chi_{J}\|\geq\frac{1}{2}\sqrt{\frac{m}{n}} and yχJ1N1/4\|y\chi_{J}\|_{\infty}\leq\frac{1}{\lfloor N^{1/4}\rfloor}.

Take any Nnm1N\geq n\geq m\geq 1 and y=(y1,y2,,yn)Sn1San1(N)y=(y_{1},y_{2},\dots,y_{n})\in S^{n-1}\setminus S_{a}^{n-1}(\sqrt{N}) and let

Obviously, JnN>0|J^{\prime}|\geq n-\sqrt{N}>0 and, since yy is not almost N\sqrt{N}-sparse, yχJ3/4\|y\chi_{J^{\prime}}\|\geq\sqrt{3/4}. Let {J1,J2,,Jp}\{J^{\prime}_{1},J^{\prime}_{2},\dots,J^{\prime}_{p}\} be any partition of JJ^{\prime} into pairwise disjoint subsets of cardinality at most mm with pn/mp\leq\lceil n/m\rceil. Then, clearly, for some q{1,2,,p}q\in\{1,2,\dots,p\}, yχJqyχJ/p>12mn\|y\chi_{J_{q}}\|\geq\|y\chi_{J^{\prime}}\|/\sqrt{p}>\frac{1}{2}\sqrt{\frac{m}{n}}. Setting, J(y)=JqJ(y)=J_{q}, we get the result. ∎

Fix any γ>0\gamma>0 and δ>1\delta>1. To make the notation more compact, denote f0:=(1δ1/4)c\refintervaldetectionlemmaγC\refrogozinlemmaf_{0}:=\frac{(1-\delta^{-1/4})\sqrt{c_{\ref{interval detection lemma}}\gamma}}{C_{\ref{rogozin lemma}}} and let τ0=τ0(γ,δ)\tau_{0}=\tau_{0}(\gamma,\delta) be the largest number in (0,1](0,1] such that for all s0s\geq 0

(it is not difficult to see that τ0\tau_{0} is well defined). Then, take N\refincompressiblevectorslemma=N\refincompressiblevectorslemma(γ,δ)N_{\ref{incompressible vectors lemma}}=N_{\ref{incompressible vectors lemma}}(\gamma,\delta) to be the smallest positive integer such that for all NN\refincompressiblevectorslemmaN\geq N_{\ref{incompressible vectors lemma}}

Let NN\refincompressiblevectorslemmaN\geq N_{\ref{incompressible vectors lemma}}, NδnN\geq\delta n and let AA be an N×nN\times n random matrix with entries satisfying conditions of the lemma and BB be any non-random N×nN\times n matrix.

where h\refdistanceestimateh_{\ref{distance estimate}} is defined as in Proposition 11. Assume that SS is non-empty and let TB2nT\subset B_{2}^{n} consist of all mm-sparse vectors yB2ny\in B_{2}^{n} with yt\|y\|\geq t and y2h\refdistanceestimated\|y\|_{\infty}\leq\frac{2h_{\ref{distance estimate}}}{d}. The first inequality in (9) and a simple calculation show that 1N1/42h\refdistanceestimated\frac{1}{\lfloor N^{1/4}\rfloor}\leq\frac{2h_{\ref{distance estimate}}}{d}. Hence, in view of Lemma 16, TT is non-empty and satisfies (8). By Lemma 12, there is a finite subset NT{\mathcal{N}}\subset T of cardinality at most \bigl{(}\frac{nC_{\ref{net sparsification lemma}}}{m\varepsilon}\bigr{)}^{m} such that for any ySy\in S there is y=y(y)Ny^{\prime}=y^{\prime}(y)\in{\mathcal{N}} with yχsuppyyε\|y\chi_{{\rm supp}y^{\prime}}-y^{\prime}\|\leq\varepsilon.

For each yNy^{\prime}\in{\mathcal{N}} denote Ey=span{ej}jsuppyE_{y^{\prime}}={\rm span}\{e_{j}\}_{j\in{\rm supp}y^{\prime}}. By Proposition 11,

By the above probability estimates and Lemma 15,

Using the definition of ε\varepsilon, mm, τ0\tau_{0} and the second inequality in (9), we can estimate the probability as

Hence, by Proposition 3 and the definition of ε\varepsilon, we get

Finally, applying the above argument to entire set E\mathcal{E}, we obtain the result. ∎

In view of the trivial identity Q(aij,α)=Q(aij/α,1){\mathcal{Q}}(a_{ij},\alpha)={\mathcal{Q}}(a_{ij}/\alpha,1), it is enough to prove the theorem for α=1\alpha=1. Fix any δ>0\delta>0 and β>0\beta>0, let γ=β/4\gamma=\beta/4 and let N0=N0(β,δ)N_{0}=N_{0}(\beta,\delta) be the smallest integer such that N0max(N\refcompressiblelemma,N\refincompressiblevectorslemma)N_{0}\geq\max(N_{\ref{compressible lemma}},N_{\ref{incompressible vectors lemma}}) and for all NN0N\geq N_{0}

By Propositions 13 and 17 for S=San1(N)Spn1(θ\refcompressiblelemma)S=S_{a}^{n-1}(\sqrt{N})\setminus S_{p}^{n-1}(\theta_{\ref{compressible lemma}}) and S=Sn1San1(N)S^{\prime}=S^{n-1}\setminus S_{a}^{n-1}(\sqrt{N}) we have

Combining the estimates, we get for h=\min\bigl{(}h_{\ref{peaky lemma}}\theta_{\ref{compressible lemma}},h_{\ref{compressible lemma}},h_{\ref{incompressible vectors lemma}}\bigr{)}:

Acknowledgement

I would like to thank my supervisor Prof. N. Tomczak-Jaegermann for valuable suggestions that helped improve structure of the proof.

References