The smallest singular value of random rectangular matrices with no moment assumptions on entries

Konstantin E. Tikhomirov

Introduction

In recent years, spectral properties of random matrices of fixed dimensions (the corresponding theory is often called non-asymptotic) have attracted considerable attention from researchers, whose efforts have mostly concentrated on the distributions of the largest and the smallest singular values. For detailed information on the development of the subject, we refer the reader to surveys , .

Let $N\geq n$ . Given an $N\times n$ random matrix $A$ , we employ the usual notation $s_{1}(A):=\max\limits_{y\in S^{n-1}}\|Ay\|$ and $s_{n}(A):=\inf\limits_{y\in S^{n-1}}\|Ay\|$ . A limiting result of Z.D. Bai and Y.Q. Yin suggests that for an $N\times n$ matrix with i.i.d. mean zero entries with unit variance and a finite fourth moment, its largest and smallest singular values should “concentrate” near $\sqrt{N}+\sqrt{n}$ and $\sqrt{N}-\sqrt{n}$ , respectively. In the non-asymptotic setting one is interested, in particular, in finding the weakest possible conditions on random matrices that would imply $s_{1}\lesssim\sqrt{N}+\sqrt{n}$ and $s_{n}\gtrsim\sqrt{N}-\sqrt{n}$ with large probability.
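The quantities $s_{1}(A)$ and $s_{n}(A)$ can be probed numerically. The following is a minimal sketch, not part of the paper's argument: it approximates both values by power iteration on the Gram matrix $A^{T}A$ (a technique chosen here only for self-containedness), and the function names, the Gaussian test matrix and the iteration count are illustrative assumptions.

```python
import math
import random


def gram(a):
    """Return the n x n Gram matrix A^T A of an N x n matrix given as a list of rows."""
    n = len(a[0])
    return [[sum(row[i] * row[j] for row in a) for j in range(n)] for i in range(n)]


def top_eigenvalue(m, iterations=2000):
    """Approximate the largest eigenvalue of a symmetric PSD matrix by power iteration."""
    n = len(m)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0  # the matrix annihilates v; for a PSD multiple of I - I this means 0
        v = [x / norm for x in w]
    # Rayleigh quotient of the (normalized) final iterate.
    return sum(v[i] * sum(m[i][j] * v[j] for j in range(n)) for i in range(n))


def extreme_singular_values(a, iterations=2000):
    """Return approximations of (s_1(A), s_n(A)) for an N x n matrix with N >= n."""
    m = gram(a)
    n = len(m)
    lam_max = top_eigenvalue(m, iterations)
    # Eigenvalues of lam_max * I - A^T A are lam_max - lam_i, so its top
    # eigenvalue reveals the smallest eigenvalue of A^T A.
    shifted = [[(lam_max if i == j else 0.0) - m[i][j] for j in range(n)] for i in range(n)]
    lam_min = lam_max - top_eigenvalue(shifted, iterations)
    return math.sqrt(lam_max), math.sqrt(max(lam_min, 0.0))


if __name__ == "__main__":
    rng = random.Random(0)
    N, n = 400, 10
    a = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(N)]
    s1, sn = extreme_singular_values(a)
    print("s_1:", s1, "vs sqrt(N)+sqrt(n):", math.sqrt(N) + math.sqrt(n))
    print("s_n:", sn, "vs sqrt(N)-sqrt(n):", math.sqrt(N) - math.sqrt(n))
```

Power iteration converges slowly when eigenvalues are close together or very spread out, so the printed values should be read as rough approximations only.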

Various estimates for the extremal singular values were obtained when studying the problem of approximating the covariance matrix of a random vector by the empirical covariance matrix. Answering a question of R. Kannan, L. Lovász and M. Simonovits, the authors of treated log-concave random vectors. Later, the log-concavity was replaced by weaker assumptions (see, in particular, , , , ).

The assumption of isotropicity of a random vector or, more generally, boundedness of the variance of its coordinates is quite natural and appears as part of the requirements on a matrix’s rows in all the aforementioned papers. However, for a deeper understanding of non-asymptotic characteristics of random matrices, an important question is whether any moment assumptions on entries are really necessary in order to get satisfactory lower estimates for the smallest singular value.

Unlike in and where the matrix entries within a given row are not necessarily independent, in our paper we consider the classical setting when a rectangular matrix has i.i.d. entries. However, in contrast with all the mentioned results, the lower estimate for the smallest singular value that we prove does not use any moment assumptions; the only requirement is that the distribution of entries satisfies a “spreading” condition given in terms of the Lévy concentration function. Moreover, compared to and , we significantly relax the assumptions on the aspect ratio of the matrix.

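Given a real random variable $\xi$ , the concentration function of $\xi$ is defined as ${\mathcal{Q}}(\xi,\alpha):=\sup\limits_{\lambda\in{\mathbb{R}}}{\mathbb{P}}\{|\xi-\lambda|\leq\alpha\}$ , $\alpha\geq 0$ .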
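In the standard formulation, the Lévy concentration function is the supremum over all centers $\lambda$ of the probability that $\xi$ falls into the closed window $[\lambda-\alpha,\lambda+\alpha]$ . A minimal sketch of a sample-based analogue is given below; the names `empirical_concentration` and `cauchy_samples` are hypothetical, and replacing the true distribution by an empirical one is an assumption made purely for illustration.

```python
import math
import random


def empirical_concentration(samples, alpha):
    """Largest fraction of samples that fits into a closed window of half-width alpha.

    This is the concentration function of the empirical distribution of `samples`.
    """
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    xs = sorted(samples)
    if not xs:
        raise ValueError("need at least one sample")
    best, left = 0, 0
    for right, x in enumerate(xs):
        # Shrink the window until xs[left..right] fits into length 2 * alpha.
        while x - xs[left] > 2 * alpha:
            left += 1
        best = max(best, right - left + 1)
    return best / len(xs)


def cauchy_samples(count, seed=0):
    """Standard Cauchy samples: a distribution with no finite moments."""
    rng = random.Random(seed)
    return [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(count)]


if __name__ == "__main__":
    print(empirical_concentration(cauchy_samples(20000), 1.0))
```

For the standard Cauchy law the exact value at $\alpha=1$ is $\frac{1}{2}$ (the density is symmetric and unimodal, so the supremum is attained at $\lambda=0$ ), which shows that a variable with no finite moments can still have a concentration function bounded away from $1$ .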

The main result of our paper is the following theorem:

Then for any non-random $N\times n$ matrix $B$ we have

Adding a non-random component $B$ in the theorem does not increase the complexity of the proof; on the other hand, it demonstrates the “shift-invariance” of the lower estimate. Note that the problem of estimating the smallest singular value of non-random shifts of square matrices is important in the analysis of algorithms , , , .

Our proof of Theorem 1 is based on two key elements: on a modification of a standard $\varepsilon$ -net argument for matrices (Proposition 3) and on estimates of the distance between a random vector and a fixed linear subspace that follow from a result of (Theorem 4 and Corollary 6 of our paper). Our method is similar in many aspects to the approach developed in and later in , . In particular, as in the mentioned papers, we decompose the unit sphere $S^{n-1}$ into several subsets which are studied separately from one another. On the other hand, our modification of the $\varepsilon$ -net argument and its technical realization in regard to splitting a random matrix into “regular” and “non-regular” parts are apparently new.

We will discuss the main idea of the proof more concretely and in more detail at the end of the next section, after we define notation and state the modified $\varepsilon$ -net argument.

Preliminaries

Choose any $y\in S^{n-1}$ and $y^{\prime}\in{\mathcal{N}}$ such that $\|y-y^{\prime}\|\leq\varepsilon$ . Then

By taking the infimum over all $y\in S^{n-1}$ , we obtain the result. ∎

Note that Lemma 2 cannot be used to handle matrices with the aspect ratio less than $2$ . Indeed, the lower estimate $s_{n}(D)\geq\inf\limits_{y\in S^{n-1}}{\rm d}\bigl{(}{D_{1}}y,{\rm span}{D_{2}}\bigr{)}$ is non-trivial only if ${\rm span}{D_{1}}\cap{\rm span}{D_{2}}=0$ , which is not true when $N<2n$ and both ${D_{1}}$ and ${D_{2}}$ have full rank. The following strengthening of Lemma 2 resolves the problem:
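The dimension count behind this remark can be spelled out as follows, assuming (as the notation suggests) that ${\rm span}{D_{i}}$ denotes the column span of the $N\times n$ matrix ${D_{i}}$ , so that full rank means dimension $n$ :

```latex
\dim\bigl({\rm span}\,D_{1}\cap{\rm span}\,D_{2}\bigr)
  = \dim{\rm span}\,D_{1}+\dim{\rm span}\,D_{2}
    -\dim\bigl({\rm span}\,D_{1}+{\rm span}\,D_{2}\bigr)
  \geq n+n-N>0
  \quad\text{whenever } N<2n .
```

In that case there is a unit vector $y$ with ${D_{1}}y\in{\rm span}{D_{2}}$ , and the infimum of the distances is zero.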

and $3)$ for any $y\in S$ there is $y^{\prime}\in{\mathcal{N}}$ such that

Take any $y\in S$ and let $y^{\prime}\in{\mathcal{N}}$ be such that $\|{\rm Proj}_{E_{y^{\prime}}}(y)-y^{\prime}\|\leq\varepsilon$ . Then

Taking the infimum over $S$ , we get the result. ∎

Note that for $N=1$ the above definition is consistent with that given in the introduction. The following result is proved by M. Rudelson and R. Vershynin in :

where $C_{\ref{RV conc lemma}}>0$ is a (sufficiently large) universal constant.

This theorem gives a nontrivial estimate for concentration only for $\eta$ sufficiently close to zero. Below, we provide an elementary extension of this result covering the case of “more concentrated” coordinates. First, let us recall a theorem of B. Rogozin:

where $C_{\ref{rogozin lemma}}>0$ is a universal constant.
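Results of this type quantify how the concentration function of a sum of independent variables decays as the number of summands grows. As a hedged illustration of the phenomenon in the simplest case (and not of the theorem's exact statement), consider a sum $S_{k}$ of $k$ independent uniform random signs. Its values lie on a lattice of spacing $2$ , so a closed window of half-width $\frac{1}{2}$ contains at most one atom and the concentration function equals the largest point mass. The function name below is hypothetical.

```python
import math


def concentration_of_sign_sum(k):
    """Exact value of Q(S_k, 1/2) for S_k a sum of k independent uniform +-1 signs.

    The largest atom of S_k is the central binomial probability C(k, k // 2) / 2^k.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    return math.comb(k, k // 2) / 2 ** k


if __name__ == "__main__":
    for k in (1, 4, 16, 64, 256):
        q = concentration_of_sign_sum(k)
        # The rescaled value q * sqrt(k) stays bounded, reflecting 1/sqrt(k) decay.
        print(k, q, q * math.sqrt(k))
```

The rescaled quantity stays below $\sqrt{2/\pi}$ for every $k$ , which is the $1/\sqrt{k}$ behaviour that anti-concentration inequalities for sums are designed to capture.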

Now, an easy application of Theorems 4 and 5 gives

Let $X=(X_{1},X_{2},\dots,X_{m})$ be a random vector with independent coordinates such that

and via the definition of $S$ we get the statement. ∎

where $h_{\ref{peaky lemma}},w_{\ref{peaky lemma}}>0$ depend only on $\gamma$ and $\delta$ .

for some $h,w>0$ depending only on $\gamma$ . Let

We adopt the following notation: For any subset $W\subset\{1,2,\dots,N\}\times\{1,2,\dots,n\}$ let

Note that for all $\omega\in\Omega_{W}$ we have ${(a_{ij})}_{\overline{H}}(\omega)=0$ for $(i,j)\in W$ and ${(a_{ij})}_{H}(\omega)=0$ for $(i,j)\notin W$ . Hence, to verify (4) it is sufficient to prove that $nN$ variables

so $a_{ij}$ ( $1\leq i\leq N$ , $1\leq j\leq n$ ) are conditionally independent given $\Omega_{W}$ . ∎

Define $M$ as the collection of all subsets $W\subset\{1,2,\dots,N\}\times\{1,2,\dots,n\}$ satisfying

with $\tau=\frac{1}{2}\bigl{(}\delta^{-1/4}-\delta^{-1/3}\bigr{)}$ . Then

where $w_{\ref{conc in matrix}}>0$ depends only on $\delta$ .

For each $i=1,2,\dots,N$ and $J\subset\{1,2,\dots,n\}$ let

It is not difficult to see that the events $\mathcal{E}_{i}\subset\Omega$ ( $i=1,2,\dots,N$ ) are independent in view of independence of the entries of $A$ .

Fix for a moment any $i\in\{1,2,\dots,N\}$ . One can verify that for any $j\in\{1,2,\dots,n\}$ and $J\subset\{1,2,\dots,n\}$ the variables ${(a_{ij})}_{H}$ and ${(a_{ij}^{\prime})}_{H}$ are i.i.d. given event $\Omega_{J}^{i}$ . It follows that

Take any subset $J\subset\{1,2,\dots,n\}$ satisfying

Thus, any $J$ satisfying (7) belongs to $L_{i}$ . Clearly,

hence $W\in M$ . The argument implies $\mathcal{E}\subset\bigcup_{W\in M}\Omega_{W}$ and the result follows. ∎

Next, we combine the result of Lemma 9 with Corollary 6:

Let $N,n,\delta$ , $H$ , $A,A^{\prime}$ , $y$ and $s$ be exactly as in Lemma 9 and $B$ be a non-random $N\times n$ matrix. Then

where $E={\rm span}\{e_{j}\}_{j\in{\rm supp}y}$ and $h_{\ref{wrap signum lemma}}>0$ , $w_{\ref{wrap signum lemma}}>0$ depend only on $\delta$ .

Let $M$ and $\tau$ be defined as in Lemma 9 and take any $W\in M$ . Let

By Lemma 8, the subspace $V_{A,B}(H,E)=(A+B)(E^{\perp})+({A}_{\overline{H}}+B)(E)$ and the vector ${A}_{H}y$ are conditionally independent given $\Omega_{W}$ , hence the above estimate immediately implies

Since the relation holds for all $W\in M$ , in view of Lemma 9 we obtain

Finally, we can prove the main result of the section:

Let $A^{\prime}=(a_{ij}^{\prime})$ be an $N\times n$ random matrix having the same distribution as $A$ such that $2$ -dimensional vectors $(a_{ij},a_{ij}^{\prime})$ ( $1\leq i\leq N$ , $1\leq j\leq n$ ) are i.i.d. and for any admissible $i$ and $j$ the variables $a_{ij}$ and $a_{ij}^{\prime}$ are conditionally i.i.d. given event $\{\omega\in\Omega:\,a_{ij}(\omega)\in H\}$ and identical on $\{\omega\in\Omega:\,a_{ij}(\omega)\in\overline{H}\}$ . For every $i=1,2,\dots,N$ and $j=1,2,\dots,n$ , by the formula for the joint distribution of $a_{ij}$ and $a_{ij}^{\prime}$ we get

hence, in view of the symmetric distribution of ${(a_{ij})}_{H}-{(a_{ij}^{\prime})}_{H}$ , we have ${\mathcal{Q}}\bigl{(}{(a_{ij})}_{H}-{(a_{ij}^{\prime})}_{H},\frac{d}{2}\bigr{)}\leq 1-\frac{r}{2}$ . Clearly, $h_{\ref{distance estimate}}\geq\frac{d|y_{j}|}{2}$ for every coordinate $y_{j}$ of the vector $y$ , hence by Theorem 5 for all $i=1,2,\dots,N$

Thus, vector $y$ satisfies condition (5) with $s:=h_{\ref{distance estimate}}$ . Then, by Lemma 10,

In our proof of Theorem 1, we represent $S^{n-1}$ as the union of three subsets:

where $\theta$ is a function of the parameters $\beta$ and $\delta$ of the theorem. Then the smallest singular value of $A+B$ can be estimated by bounding separately $\inf\limits_{y}\|Ay+By\|$ over each of the three subsets.

In our representation of $S^{n-1}$ , we follow an idea from , where the unit sphere was split into sets of “close to sparse” and “far from sparse” vectors. A similar splitting was also employed in , , where the terms “compressible” and “incompressible” were used instead. On the other hand, our “borderline” $\sqrt{N}$ is of a smaller order of magnitude than in the mentioned papers.

The next elementary lemma shall be used in conjunction with Proposition 3.

Then there is a finite set ${\mathcal{N}}\subset T$ of cardinality at most $\bigl{(}\frac{C_{\ref{net sparsification lemma}}n}{\varepsilon m}\bigr{)}^{m}$ such that for any $y\in S$ there is $y^{\prime}=y^{\prime}(y)\in{\mathcal{N}}$ with $\|y\chi_{{\rm supp}y^{\prime}}-y^{\prime}\|\leq\varepsilon$ .

For any $J\subset\{1,2,\dots,n\}$ with $|J|\leq m$ , let ${\mathcal{N}}_{J}$ be an $\varepsilon$ -net for $T\cap{\rm span}\{e_{i}\}_{i\in J}$ of cardinality at most $\bigl{(}\frac{3}{\varepsilon}\bigr{)}^{m}$ . Define ${\mathcal{N}}$ as the union of ${\mathcal{N}}_{J}$ for all admissible $J$ . Then, obviously,

Next, fix any $y\in S$ and let $x\in T$ be such that $y\chi_{{\rm supp}x}=x$ . Since $|{\rm supp}x|\leq m$ , there is $y^{\prime}\in{\mathcal{N}}_{{\rm supp}x}\subset{\mathcal{N}}$ with $\|x-y^{\prime}\|\leq\varepsilon$ . It remains to note that since ${\rm supp}y^{\prime}\subset{\rm supp}x$ , necessarily $\|y\chi_{{\rm supp}y^{\prime}}-y^{\prime}\|\leq\|y\chi_{{\rm supp}x}-y^{\prime}\|=\|x-y^{\prime}\|\leq\varepsilon$ . ∎
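The counting step in this proof can be checked numerically. The sketch below sums the per-support bound $\bigl(\frac{3}{\varepsilon}\bigr)^{m}$ over all supports $J$ with $|J|\leq m$ and compares the total with a closed-form bound of the shape $\bigl(\frac{Cn}{\varepsilon m}\bigr)^{m}$ . The value $C=3e$ is an assumption made here (it follows from the standard estimate $\sum_{k\leq m}\binom{n}{k}\leq(en/m)^{m}$ for $1\leq m\leq n$ ); the constant used in the lemma itself is not specified in this text, and both function names are hypothetical.

```python
import math


def union_net_size_bound(n, m, eps):
    """Number of supports J with |J| <= m, times the per-support net bound (3/eps)^m."""
    supports = sum(math.comb(n, k) for k in range(m + 1))
    return supports * (3.0 / eps) ** m


def closed_form_bound(n, m, eps, c=3.0 * math.e):
    """Closed-form bound (c * n / (eps * m))^m; c = 3e is an illustrative choice."""
    return (c * n / (eps * m)) ** m


if __name__ == "__main__":
    for n, m in ((100, 5), (1000, 10), (50, 50)):
        print(n, m, union_net_size_bound(n, m, 0.1), closed_form_bound(n, m, 0.1))
```

The point of the closed form is that the exponent is $m$ rather than $n$ , which is what makes the net small enough for a union bound when $m$ is much smaller than $n$ .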

Then for the set $S=S_{a}^{n-1}(\sqrt{N})\setminus S_{p}^{n-1}(\theta_{\ref{compressible lemma}})$ and any non-random $N\times n$ matrix $B$

Fix any $\gamma>0$ and $\delta>1$ and define $d:=2$ , $r:=\gamma$ , $t:=\frac{1}{2}$ ; let $h_{\ref{distance estimate}}$ be as in Proposition 11 and $N_{\ref{compressible lemma}}=N_{\ref{compressible lemma}}(\gamma,\delta)$ be the smallest integer greater than $\frac{2}{h_{\ref{wrap signum lemma}}h_{\ref{distance estimate}}}$ such that for all $N\geq N_{\ref{compressible lemma}}$

Let $E_{y^{\prime}}={\rm span}\{e_{j}\}_{j\in{\rm supp}y^{\prime}}$ ( $y^{\prime}\in{\mathcal{N}}$ ) and define an event

In view of Proposition 11, the upper estimate for $|{\mathcal{N}}|$ and the definition of $N_{\ref{compressible lemma}}$

Take any $\omega\in\mathcal{E}$ and define ${D_{1}}={A}_{H}(\omega)$ , ${D_{2}}={A}_{\overline{H}}(\omega)+B$ , $D={D_{1}}+{D_{2}}$ . Since all entries of ${D_{1}}$ are bounded by $\sqrt{N}$ in absolute value, the operator norm of ${D_{1}}$ is at most its Hilbert-Schmidt norm, which does not exceed $\sqrt{Nn}\cdot\sqrt{N}\leq N^{3/2}$ , so $\|{D_{1}}\|\leq N^{3/2}$ ; next, for every $y^{\prime}\in{\mathcal{N}}$

(note that $D(E_{y^{\prime}}^{\perp})+{D_{2}}(E_{y^{\prime}})=V_{A,B}(H,E_{y^{\prime}})(\omega)$ ). Hence, by Proposition 3, we get

Finally, applying the above argument to all $\omega\in\mathcal{E}$ , we get the result. ∎

As we noted before, the construction of the set $H$ corresponding to $S^{n-1}\setminus S^{n-1}_{a}(\sqrt{N})$ is not as simple as in the case of almost $\sqrt{N}$ -sparse vectors. The reason is that in general the set $S^{n-1}\setminus S^{n-1}_{a}(\sqrt{N})$ is much larger than $S^{n-1}_{a}(\sqrt{N})$ , and we have to apply more delicate arguments to get a satisfactory probabilistic estimate. The construction of $H$ for the set of “far from $\sqrt{N}$ -sparse” vectors is contained in the following lemma:

Let us recall a folklore estimate of the norm of a random matrix with bounded mean zero entries (see, for example, [11, Proposition 2.4]):

Let $W=(w_{ij})$ be an $N\times n$ ( $N\geq n$ ) random matrix with i.i.d. mean zero entries; $R>0$ and assume that $|w_{ij}|\leq R$ a.s. Then for a universal constant $C_{\ref{laplace transform lemma}}>0$

The next lemma highlights a useful property of the vectors from $S^{n-1}\setminus S_{a}^{n-1}(\sqrt{N})$ :

For any integer $N\geq n\geq m\geq 1$ and any $y\in S^{n-1}\setminus S_{a}^{n-1}(\sqrt{N})$ there is a set $J=J(y)\subset\{1,2,\dots,n\}$ such that $|J|\leq m$ , $\|y\chi_{J}\|\geq\frac{1}{2}\sqrt{\frac{m}{n}}$ and $\|y\chi_{J}\|_{\infty}\leq\frac{1}{\lfloor N^{1/4}\rfloor}$ .

Take any $N\geq n\geq m\geq 1$ and $y=(y_{1},y_{2},\dots,y_{n})\in S^{n-1}\setminus S_{a}^{n-1}(\sqrt{N})$ and let

Obviously, $|J^{\prime}|\geq n-\sqrt{N}>0$ and, since $y$ is not almost $\sqrt{N}$ -sparse, $\|y\chi_{J^{\prime}}\|\geq\sqrt{3/4}$ . Let $\{J^{\prime}_{1},J^{\prime}_{2},\dots,J^{\prime}_{p}\}$ be any partition of $J^{\prime}$ into pairwise disjoint subsets of cardinality at most $m$ with $p\leq\lceil n/m\rceil$ . Then, clearly, for some $q\in\{1,2,\dots,p\}$ , $\|y\chi_{J^{\prime}_{q}}\|\geq\|y\chi_{J^{\prime}}\|/\sqrt{p}>\frac{1}{2}\sqrt{\frac{m}{n}}$ . Setting $J(y)=J^{\prime}_{q}$ , we get the result. ∎
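The pigeonhole step of this proof (some block of the partition carries at least a $1/\sqrt{p}$ share of the Euclidean norm) can be sketched as follows. The function name `heaviest_block` and the choice of consecutive blocks are illustrative assumptions; the proof allows any partition into blocks of cardinality at most $m$ .

```python
import math


def heaviest_block(y, indices, m):
    """Split `indices` into consecutive blocks of size at most m and return
    (block, norm_of_y_on_block, number_of_blocks) for the block carrying the
    largest Euclidean norm of y.
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    idx = sorted(indices)
    if not idx:
        raise ValueError("indices must be non-empty")
    blocks = [idx[s:s + m] for s in range(0, len(idx), m)]

    def block_norm(block):
        return math.sqrt(sum(y[j] ** 2 for j in block))

    best = max(blocks, key=block_norm)
    return best, block_norm(best), len(blocks)


if __name__ == "__main__":
    y = [0.1, 0.7, 0.2, 0.1, 0.6, 0.3]
    block, norm, p = heaviest_block(y, range(len(y)), 2)
    total = math.sqrt(sum(v * v for v in y))
    # Pigeonhole: the squared norms of the p blocks sum to total**2,
    # so the largest block norm is at least total / sqrt(p).
    print(block, norm, total / math.sqrt(p))
```

Since the squared norms over the $p$ blocks add up to $\|y\chi_{J^{\prime}}\|^{2}$ , the maximum is at least the average, which gives the factor $1/\sqrt{p}$ .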

Fix any $\gamma>0$ and $\delta>1$ . To make the notation more compact, denote $f_{0}:=\frac{(1-\delta^{-1/4})\sqrt{c_{\ref{interval detection lemma}}\gamma}}{C_{\ref{rogozin lemma}}}$ and let $\tau_{0}=\tau_{0}(\gamma,\delta)$ be the largest number in $(0,1]$ such that for all $s\geq 0$

(it is not difficult to see that $\tau_{0}$ is well defined). Then, take $N_{\ref{incompressible vectors lemma}}=N_{\ref{incompressible vectors lemma}}(\gamma,\delta)$ to be the smallest positive integer such that for all $N\geq N_{\ref{incompressible vectors lemma}}$

Let $N\geq N_{\ref{incompressible vectors lemma}}$ , $N\geq\delta n$ and let $A$ be an $N\times n$ random matrix with entries satisfying conditions of the lemma and $B$ be any non-random $N\times n$ matrix.

where $h_{\ref{distance estimate}}$ is defined as in Proposition 11. Assume that $S$ is non-empty and let $T\subset B_{2}^{n}$ consist of all $m$ -sparse vectors $y\in B_{2}^{n}$ with $\|y\|\geq t$ and $\|y\|_{\infty}\leq\frac{2h_{\ref{distance estimate}}}{d}$ . The first inequality in (9) and a simple calculation show that $\frac{1}{\lfloor N^{1/4}\rfloor}\leq\frac{2h_{\ref{distance estimate}}}{d}$ . Hence, in view of Lemma 16, $T$ is non-empty and satisfies (8). By Lemma 12, there is a finite subset ${\mathcal{N}}\subset T$ of cardinality at most $\bigl{(}\frac{nC_{\ref{net sparsification lemma}}}{m\varepsilon}\bigr{)}^{m}$ such that for any $y\in S$ there is $y^{\prime}=y^{\prime}(y)\in{\mathcal{N}}$ with $\|y\chi_{{\rm supp}y^{\prime}}-y^{\prime}\|\leq\varepsilon$ .

For each $y^{\prime}\in{\mathcal{N}}$ denote $E_{y^{\prime}}={\rm span}\{e_{j}\}_{j\in{\rm supp}y^{\prime}}$ . By Proposition 11,

By the above probability estimates and Lemma 15,

Using the definition of $\varepsilon$ , $m$ , $\tau_{0}$ and the second inequality in (9), we can estimate the probability as

Hence, by Proposition 3 and the definition of $\varepsilon$ , we get

Finally, applying the above argument to the entire set $\mathcal{E}$ , we obtain the result. ∎

In view of the trivial identity ${\mathcal{Q}}(a_{ij},\alpha)={\mathcal{Q}}(a_{ij}/\alpha,1)$ , it is enough to prove the theorem for $\alpha=1$ . Fix any $\delta>0$ and $\beta>0$ , let $\gamma=\beta/4$ and let $N_{0}=N_{0}(\beta,\delta)$ be the smallest integer such that $N_{0}\geq\max(N_{\ref{compressible lemma}},N_{\ref{incompressible vectors lemma}})$ and for all $N\geq N_{0}$

By Propositions 13 and 17 for $S=S_{a}^{n-1}(\sqrt{N})\setminus S_{p}^{n-1}(\theta_{\ref{compressible lemma}})$ and $S^{\prime}=S^{n-1}\setminus S_{a}^{n-1}(\sqrt{N})$ we have

Combining the estimates, we get for $h=\min\bigl{(}h_{\ref{peaky lemma}}\theta_{\ref{compressible lemma}},h_{\ref{compressible lemma}},h_{\ref{incompressible vectors lemma}}\bigr{)}$ :

Acknowledgement

I would like to thank my supervisor Prof. N. Tomczak-Jaegermann for valuable suggestions that helped improve the structure of the proof.
