The limit of the smallest singular value of random matrices with i.i.d. entries

Konstantin Tikhomirov

Introduction

For $N\geq m$ and an $N\times m$ real-valued matrix $B$, its singular values $s_{1}(B)$, $s_{2}(B),\dots$, $s_{m}(B)$ are the eigenvalues of the matrix $\sqrt{B^{T}B}$ arranged in non-increasing order, where multiplicities are counted. In particular, the largest and the smallest singular values are given by

$$s_{1}(B)=\sup_{y\in S^{m-1}}\|By\|=\|B\|,\qquad s_{m}(B)=\inf_{y\in S^{m-1}}\|By\|.$$

In this paper, we establish convergence of the smallest singular values of a sequence of random matrices with i.i.d. entries under minimal moment assumptions.
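As a small numerical illustration of the definition of singular values, the following Python sketch computes them as square roots of the eigenvalues of $B^{T}B$ and compares the result with a library SVD. It assumes NumPy is available; the helper name `singular_values_via_gram` and the test matrix are our own choices and do not come from the paper.

```python
import numpy as np

def singular_values_via_gram(B):
    """Singular values of an N x m matrix B (N >= m), computed as square
    roots of the eigenvalues of B^T B, in non-increasing order."""
    gram = B.T @ B                          # m x m symmetric positive semidefinite
    eigvals = np.linalg.eigvalsh(gram)      # eigenvalues in ascending order
    eigvals = np.clip(eigvals, 0.0, None)   # guard against tiny negative round-off
    return np.sqrt(eigvals)[::-1]

rng = np.random.default_rng(0)
B = rng.standard_normal((6, 3))             # an arbitrary tall test matrix
s_gram = singular_values_via_gram(B)
s_svd = np.linalg.svd(B, compute_uv=False)  # library values, non-increasing

# The smallest singular value is a lower bound for ||By|| on the unit sphere.
y = rng.standard_normal(3)
y /= np.linalg.norm(y)
norm_By = np.linalg.norm(B @ y)
```

Forming $B^{T}B$ squares the condition number, so for ill-conditioned matrices the library SVD is the safer route; the Gram-matrix version is shown only because it mirrors the definition.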

The extreme singular values of random matrices attract considerable attention from researchers both in limiting and non-limiting settings. We refer the reader to surveys and monographs , , , for extensive information on the spectral theory of random matrices. Here, we shall focus on the following specific question: for matrices with i.i.d. entries, what are the weakest possible assumptions on the entries which are sufficient for the smallest singular value to “concentrate”?

We note that a corresponding problem for the largest singular value (i.e. the operator norm) was essentially resolved in the i.i.d. case, where finiteness of the fourth moment of the entries turns out to be crucial both in limiting and non-limiting settings. We refer the reader to and for results on a.s. convergence of the largest singular value, and for the non-limiting case (see also , for some negative results on concentration of the operator norm).

Further, it is proved in , that for square $m\times m$ matrices with i.i.d. centered entries with unit variance and a bounded fourth moment, one has $s_{m}(A)\approx m^{-1/2}$ with a large probability.

A natural question in connection with the mentioned results is whether the assumption on the fourth moment is necessary for the least singular value to “concentrate”; in particular, whether any assumptions on moments of $a_{ij}$’s higher than the second are required for the a.s. convergence in the Bai–Yin theorem. This question is discussed in on p. 6. Solving the problem was a motivation for our work.

Considerable progress has been made recently in weakening the moment assumptions on matrix entries. For square matrices, given a sufficiently large $m$ and an $m\times m$ matrix with i.i.d. entries with zero mean and unit variance, its smallest singular value is bounded from below by a constant (negative) power of $m$ with probability close to one [19, Theorem 2.1] (see also [5, Theorem 4.1] for sparse matrices).

The result of can be used to show that in the limiting setup of the Bai–Yin theorem but without the assumptions on moments higher than the second, the sequence $\bigl{(}{N_{m}}^{-1/2}s_{m}(A_{m})\bigr{)}_{m=1}^{\infty}$ satisfies

where $r$ is a certain function of $z=\lim m/N_{m}$ and the distribution of $a_{ij}$’s. The same conclusion can be derived from [6, Theorem 1.4], if we additionally assume that the limiting aspect ratio $z$ is bounded from above by a sufficiently small positive quantity (i.e. the matrices are tall). However, neither [20, Theorem 1] nor [6, Theorem 1.4] gives the precise asymptotics.

This problem is resolved in our paper. The main result is the following

Theorem 1 in a strong form establishes the asymmetry of the limiting behaviour of the extreme singular values: whereas the fourth moment is necessary for the operator norm, the second moment is sufficient for the convergence of the smallest singular value.
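The contrast between the two extreme singular values can be explored numerically. The sketch below is only an experiment under stated assumptions, not part of the argument: it assumes NumPy, uses a Student $t$ law with 3 degrees of freedom (rescaled to unit variance) as a hypothetical example of entries with a finite second but infinite fourth moment, and picks the sizes $N=800$, $m=200$ arbitrarily. It prints $s_{1}/\sqrt{N}$ and $s_{m}/\sqrt{N}$ next to the reference values $1+\sqrt{z}$ and $1-\sqrt{z}$.

```python
import numpy as np

def extreme_singular_ratios(A):
    """Return (s_1(A)/sqrt(N), s_m(A)/sqrt(N)) for an N x m matrix A."""
    N = A.shape[0]
    s = np.linalg.svd(A, compute_uv=False)   # non-increasing order
    return s[0] / np.sqrt(N), s[-1] / np.sqrt(N)

rng = np.random.default_rng(1)
N, m = 800, 200                              # aspect ratio z = m/N = 0.25
z = m / N

# Both samplers produce zero-mean, unit-variance entries.
samplers = {
    "gaussian": lambda shape: rng.standard_normal(shape),
    # Student t with 3 degrees of freedom has variance 3 and no fourth moment.
    "student_t3": lambda shape: rng.standard_t(3, size=shape) / np.sqrt(3.0),
}

results = {}
for name, sample in samplers.items():
    top, bottom = extreme_singular_ratios(sample((N, m)))
    results[name] = (top, bottom)
    print(f"{name:>10}: s_1/sqrt(N) = {top:.3f}, s_m/sqrt(N) = {bottom:.3f}")

print(f"reference: 1+sqrt(z) = {1 + np.sqrt(z):.3f}, 1-sqrt(z) = {1 - np.sqrt(z):.3f}")
```

For a single moderate-size sample the heavy-tailed ratios need not be close to their limits; the convergence in Theorem 1 is an almost sure statement as $m\to\infty$, so the experiment is best repeated for several growing sizes.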

which implies the result. Thus, the argument of the paper remains the crucial element of the proof, although we apply it only to the truncated variables, for which all positive moments are bounded. Let us emphasize that, whereas a truncation procedure for matrices also appears as a technical step in , in our approach the truncation level $M$ is not a function of $m$ .

Preliminaries

In this section, we introduce notation and present some classical or elementary facts, which we include for ease of reference.

The next statement, which is sometimes called the Bernstein (or Hoeffding’s) inequality, can be derived from the classical Khintchine inequality for weighted sums of independent signs by a symmetrization procedure:

where $c_{\ref{Khintchine mod}}>0$ is a universal constant.
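The exact form of the inequality is not reproduced in this text. As a hedged illustration of the kind of tail estimate involved, the following Python sketch checks the classical Hoeffding bound $\mathbb{P}\bigl\{|\sum_{j}r_{j}y_{j}|>t\bigr\}\leq 2\exp\bigl(-t^{2}/(2\|y\|^{2})\bigr)$ for independent signs $r_{j}$ by Monte Carlo. The explicit constant in this bound is the textbook one and is not claimed to coincide with the universal constant of the lemma; the function names and parameters are ours.

```python
import math
import random

def rademacher_tail(y, t, trials, rng):
    """Monte Carlo estimate of P(|sum_j r_j y_j| > t) for independent signs r_j."""
    hits = 0
    for _ in range(trials):
        s = sum(yj if rng.random() < 0.5 else -yj for yj in y)
        if abs(s) > t:
            hits += 1
    return hits / trials

def hoeffding_bound(y, t):
    """Classical bound 2 * exp(-t^2 / (2 * ||y||_2^2)) for weighted sign sums."""
    norm_sq = sum(yj * yj for yj in y)
    return 2.0 * math.exp(-t * t / (2.0 * norm_sq))

rng = random.Random(0)
n = 50
y = [1.0 / math.sqrt(n)] * n       # unit vector with equal weights
t = 2.0
empirical = rademacher_tail(y, t, trials=20000, rng=rng)
bound = hoeffding_bound(y, t)
print(f"empirical tail = {empirical:.4f}, Hoeffding bound = {bound:.4f}")
```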

The lemma below is a law of large numbers, where instead of the arithmetic mean of a collection of random variables we consider more general weighted sums. As in the case of the classical weak LLN, the statement can be proved by applying Lévy’s continuity theorem for characteristic functions.

Let $a_{1},a_{2},\dots$ be i.i.d. random variables with zero mean. Then for any $\varepsilon>0$ there is $\delta>0$ depending only on $\varepsilon$ and the distribution of $a_{j}$ ’s with the following property: whenever $(t_{j})_{j=1}^{\infty}$ is a sequence of non-negative real numbers such that $\sum_{j=1}^{\infty}t_{j}=1$ and $\max t_{j}\leq\delta$ , we have

where $r=(1-\sqrt{z})^{2}$ and $R=(1+\sqrt{z})^{2}$ .

Note that the above theorem does not require any assumptions on moments higher than the second, and so can be applied in our setting. For our proof, we will actually need a much weaker result than Theorem 6, namely, that $\limsup\limits_{m\to\infty}\frac{s_{m}(A_{m})}{\sqrt{N_{m}}}\leq 1-\sqrt{z}$ almost surely. The latter can be immediately verified with the help of Theorem 6: for every fixed $t>(1-\sqrt{z})^{2}$, we have $\lim\limits_{m\to\infty}F^{T_{m}}(t)=F_{MP}(t)>0$ with probability one, hence the smallest non-zero eigenvalues $\lambda_{\min}(T_{m})$ of matrices $T_{m}$ satisfy $\limsup\limits_{m\to\infty}\lambda_{\min}(T_{m})\leq t$ a.s. This implies $\limsup\limits_{m\to\infty}\frac{s_{m}(A_{m})}{\sqrt{N_{m}}}\leq\sqrt{t}$ a.s., which gives the required estimate by letting $t\to(1-\sqrt{z})^{2}$.
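The passage from the spectral distribution to the smallest singular value can be traced on a single sample. The sketch below is an illustration under assumptions: it assumes NumPy, takes $T=\frac{1}{N}A^{T}A$ as the normalized sample matrix (the definition of $T_{m}$ is not reproduced in this text, so this normalization is our reading), and uses Rademacher entries and the sizes $N=1000$, $m=250$ as arbitrary choices.

```python
import numpy as np

def mp_edges(z):
    """Edges r = (1 - sqrt(z))^2 and R = (1 + sqrt(z))^2 of the limiting support."""
    return (1.0 - np.sqrt(z)) ** 2, (1.0 + np.sqrt(z)) ** 2

def empirical_spectral_cdf(eigs, t):
    """Fraction of eigenvalues not exceeding t."""
    return float(np.mean(eigs <= t))

rng = np.random.default_rng(2)
N, m = 1000, 250
A = rng.choice([-1.0, 1.0], size=(N, m))   # zero mean, unit variance entries
T = (A.T @ A) / N                          # assumed normalization of T_m
eigs = np.linalg.eigvalsh(T)               # ascending order
r, R = mp_edges(m / N)

lam_min = eigs[0]
s_min_ratio = np.linalg.svd(A, compute_uv=False)[-1] / np.sqrt(N)

# For t slightly above r the empirical distribution function is positive,
# which forces lam_min <= t, that is, s_min / sqrt(N) <= sqrt(t).
t = r + 0.2
cdf_at_t = empirical_spectral_cdf(eigs, t)
print(f"r = {r:.3f}, lambda_min = {lam_min:.3f}, F(t) = {cdf_at_t:.3f}")
```

Note that this argument only yields the upper bound on $s_{m}(A_{m})/\sqrt{N_{m}}$; the matching lower bound is the substantive part of the paper.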

Norms of coordinate projections of random vectors

The goal of this section is to show that, given a sufficiently large random $N\times n$ matrix $A$ with i.i.d. entries with zero mean and unit variance, the quantity

is of order $\sqrt{N}$ with a very large probability (the probability shall depend on $\varepsilon>0$). It shall act as a “replacement” for the matrix norm $\|A\|$, which in our setting may be greater than $\sqrt{N}$ in order of magnitude with probability close to one. We remark here that a quantity

where $m\leq N$ and $D$ is an $N\times n$ random matrix with i.i.d. isotropic log-concave rows, played a crucial role in the paper by Adamczak, Litvak, Pajor and Tomczak-Jaegermann, dealing with the problem of approximating the covariance matrix of a log-concave random vector by the sample covariance matrix. In our case, however, the latter quantity is inapplicable as it may not concentrate near $\sqrt{N}$ (even for small $m$).

First, we prove the required estimate for (1) under the additional assumption that the entries of $A$ are symmetrically distributed (Lemma 12). Then we generalize the result to non-symmetric distributions in Proposition 13. Lemmas 7–11 given below build the framework of the proof.
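For a single fixed vector, the inner minimum $\min_{|I|\geq N-\varepsilon N}\|{\rm Proj}_{I}x\|$ that appears later in this section is easy to compute: one discards the coordinates of largest absolute value. The Python sketch below illustrates this; reading the quantity (1) as the supremum of such minima over the sphere is our inference from the later discussion, and the name `trimmed_norm` is ours.

```python
import math

def trimmed_norm(x, eps):
    """min over index sets I with |I| >= N - eps*N of ||Proj_I x||_2.

    The minimum is attained by keeping the ceil(N - eps*N) coordinates of
    smallest absolute value, i.e. by dropping the largest ones."""
    N = len(x)
    keep = math.ceil(N - eps * N)
    smallest = sorted(abs(v) for v in x)[:keep]
    return math.sqrt(sum(v * v for v in smallest))

x = [3.0, -4.0, 0.0, 1.0]
full = math.sqrt(sum(v * v for v in x))   # Euclidean norm, sqrt(26)
half = trimmed_norm(x, 0.5)               # keeps the coordinates 0 and 1
quarter = trimmed_norm(x, 0.25)           # keeps the coordinates 0, 1 and 3
```

This is why the trimmed quantity can stay of order $\sqrt{N}$ even when a few very large entries make the full norm much bigger.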

For each $\varepsilon\in(0,1]$ there is $N_{\ref{single vector est}}=N_{\ref{single vector est}}(\varepsilon)>0$ depending only on $\varepsilon$ with the following property: let $N\geq N_{\ref{single vector est}}$ and let $X=(X_{1},X_{2},\dots,X_{N})$ be a random vector of independent variables, each $X_{i}$ having zero mean and unit variance. Then

with probability at least $1-\exp(-c_{\ref{single vector est}}\varepsilon N)$ , where $C_{\ref{single vector est}},c_{\ref{single vector est}}>0$ are universal constants.

Fix any $\varepsilon\in(0,1]$ and define $N_{\ref{single vector est}}$ as the smallest positive integer such that

for all $N\geq N_{\ref{single vector est}}$ . Choose any $N\geq N_{\ref{single vector est}}$ and let $X$ be as stated above. Set $M=\frac{4}{\varepsilon}$ . In view of Markov’s inequality,

Finally, using the definition of $N_{\ref{single vector est}}$ , we get

Fix any $K>0$ and let $N,n$ and $A=(a_{ij})$ be as stated above. Let $r_{ij}$ $(1\leq i\leq N,\;1\leq j\leq n)$ be Rademacher variables jointly independent with $A$ , and let $\bar{A}$ denote the random $N\times n$ matrix $(r_{ij}a_{ij})$ . Then, since $a_{ij}$ ’s are symmetrically distributed, for any fixed vector $y=(y_{1},y_{2},\dots,y_{n})\in S^{n-1}$ the distribution of $\|{\rm Proj}_{I_{y}}Ay\|$ is the same as that of $\|{\rm Proj}_{I_{y}}\bar{A}y\|$ . Define a subset of (non-random) $N\times n$ matrices:

and for every $B=(b_{ij})\in{\mathcal{M}}_{y}$ denote by $\bar{B}$ the random matrix $(r_{ij}b_{ij})$ . Note that at every point $\omega$ of the probability space the matrix ${\rm Proj}_{I_{y}(\omega)}\bar{A}(\omega)$ belongs to ${\mathcal{M}}_{y}$ . Then, conditioning on $a_{ij}$ ’s, we get for every $\tau>0$ :

Note that for each $B\in{\mathcal{M}}_{y}$ and $i\leq N$ , the $i$ -th coordinate of the vector $\bar{B}y$ satisfies in view of Lemma 4:

A standard application of the Laplace transform then yields

for some $L_{\ref{vector bound sym}}>0$ depending only on $K$ . This, together with (2), proves the result. ∎

As an elementary consequence of Lemmas 8 and 9 we get

Fix any $K>0$ and $\varepsilon>0$ and define $n_{\ref{cube bound}}=\lceil\delta_{\ref{card of Iy}}(\varepsilon,K+1)^{-2}\rceil$ , where $\delta_{\ref{card of Iy}}>0$ is taken from Lemma 9. Now, choose any $N\geq n\geq n_{\ref{cube bound}}$ and let $A=(a_{ij})$ be an $N\times n$ random matrix with i.i.d. entries distributed as $\xi$ . Let $V$ be the set of vertices of the cube $\frac{1}{\sqrt{n}}B_{\infty}^{n}=[-\frac{1}{\sqrt{n}},\frac{1}{\sqrt{n}}]^{n}$ . In view of Lemma 9, any $v\in V$ satisfies

Next, by Lemma 8, for $L=L_{\ref{vector bound sym}}(K+2)>0$ we have

for all $v\in V$ . Note that for any $u,v\in V$ the random sets $I_{u}$ and $I_{v}$ coincide everywhere on $\Omega$ . Hence, together with the above estimates, we get

It remains to note that for any $I\subset\{1,2,\dots,N\}$ and $y\in B_{\infty}^{n}$ we have

In the following statement, we bound the quantity (1) assuming that the matrix entries are symmetrically distributed. The lemmas above provide estimates for $\min\limits_{|I|\geq N-\varepsilon N}\|{\rm Proj}_{I}Ay\|$ for individual vectors on the sphere as well as an upper bound on the cube $\frac{1}{\sqrt{n}}B_{\infty}^{n}$. To derive an estimate for the supremum over the sphere, we shall embed $S^{n-1}$ into the Minkowski sum of a multiple of $B_{\infty}^{n}$ and two specially chosen finite sets (see (3) in the proof below). This way each vector $y\in S^{n-1}$ can be “decomposed” as a sum of three vectors with particular characteristics. This approach is similar to splitting the unit sphere into sets of “close to sparse” and “far from sparse” vectors introduced in and subsequently used in , .
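The embedding (3) itself is not reproduced in this text, so the following Python sketch shows only a simplified two-part version of the idea: a unit vector is split into a part supported on its few large coordinates and a part with small $\ell_{\infty}$-norm. The threshold $1/\lfloor N^{1/4}\rfloor$ is suggested by the bound on $\|y^{2}\|_{\infty}$ used in the proof below; the function name and the test vector are hypothetical.

```python
import math

def split_by_threshold(y, tau):
    """Split y = y_large + y_small, where y_large keeps the coordinates with
    |y_i| > tau and y_small keeps the rest, so that ||y_small||_inf <= tau."""
    y_large = [v if abs(v) > tau else 0.0 for v in y]
    y_small = [0.0 if abs(v) > tau else v for v in y]
    return y_large, y_small

N = 256
tau = 1.0 / math.isqrt(math.isqrt(N))     # 1 / floor(N^(1/4)) = 1/4

# A unit vector with a few dominant coordinates (arbitrary test data).
raw = [1.0 / (i + 1) for i in range(N)]
norm = math.sqrt(sum(v * v for v in raw))
y = [v / norm for v in raw]

y_large, y_small = split_by_threshold(y, tau)
support = sum(1 for v in y_large if v != 0.0)
print(f"large coordinates: {support}, bound 1/tau^2 = {1.0 / tau ** 2:.0f}")
```

Since $\|y\|=1$, fewer than $1/\tau^{2}$ coordinates can exceed $\tau$ in absolute value, so the large part is sparse and can be handled by a net of small cardinality.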

where $C_{\ref{weak lsv sym}}>0$ is a universal constant.

Fix $\varepsilon\in(0,1]$ and let $N_{\ref{weak lsv sym}}$ be the smallest integer such that

$\lfloor N_{\ref{weak lsv sym}}^{1/4}\rfloor\delta_{\ref{single v bound}}(\varepsilon/3,2C_{\ref{cubic net}})\geq 1$ ;

$N_{\ref{weak lsv sym}}\geq\max\bigl{(}N_{\ref{single vector est}}(\varepsilon/3),n_{\ref{cube bound}}(\varepsilon/3,1)\bigr{)}$ ;

Choose $N\geq N_{\ref{weak lsv sym}}$ . Without loss of generality, we can assume that $n=N$ . Let $A$ be as stated above.

By Lemma 3, there is a finite subset ${\mathcal{N}}_{2}\subset T$ of cardinality at most $\exp(C_{\ref{cubic net}}N)$ such that for any $y\in T$ there is $y^{\prime}\in{\mathcal{N}}_{2}$ with $\|y-y^{\prime}\|_{\infty}\leq N^{-1/2}$ .

so $y^{3}\in\frac{2}{\sqrt{N}}B_{\infty}^{N}$ . This proves (3).

For each $y^{1}\in{\mathcal{N}}_{1}$ , in view of Lemma 7 and the condition $N\geq N_{\ref{single vector est}}(\varepsilon/3)$ , we have

Next, for every $y^{2}\in{\mathcal{N}}_{2}$ , Lemma 10 together with the inequality $\lfloor N^{1/4}\rfloor\delta_{\ref{single v bound}}(\varepsilon/3,2C_{\ref{cubic net}})\geq 1$ and $\|y^{2}\|_{\infty}\leq 1/\lfloor N^{1/4}\rfloor$ implies that

for some constant $L_{\ref{single v bound}}>0$ . Finally, by Lemma 11 and in view of the condition $N\geq n_{\ref{cube bound}}(\varepsilon/3,1)$ we have

where $L_{\ref{cube bound}}>0$ is a universal constant. Let $\mathcal{E}$ denote the event

Then from the above probability estimates and the definition of $N_{\ref{weak lsv sym}}$ we obtain

where $w_{\ref{weak lsv sym}}=\min\bigl{(}\frac{c_{\ref{single vector est}}\varepsilon}{6},\frac{C_{\ref{cubic net}}}{2},\frac{1}{2}\bigr{)}$ .

Note that the intersection $I=I_{1}\cap I_{2}\cap I_{3}$ necessarily satisfies $|I|\geq N-\varepsilon N$ , and from the last inequalities we get $\|{\rm Proj}_{I}A(\omega)y\|\leq(2C_{\ref{single vector est}}+L_{\ref{single v bound}}+2L_{\ref{cube bound}})\sqrt{N}$ . Since our choice of $y\in S^{N-1}$ and $\omega\in\mathcal{E}$ was arbitrary, we get

Finally, we can state the main result of the section.

where $C_{\ref{weak lsv nonsym}}>0$ is a universal constant.

Hence, taking into consideration that the entries of $A-A^{\prime}$ are distributed as $\xi-\xi^{\prime}$ and using Lemma 12, we get

Matrix truncation and proof of Theorem 1

In the next statement, we compare the $n$ -th largest singular value of a random $N\times n$ matrix $A$ with bounded entries to $s_{n}({\rm Proj}_{I}A)$ . Obviously,

We will need an inequality in the opposite direction when $|I|/N\approx 1$ . A theorem of Litvak, Pajor, Rudelson and Tomczak-Jaegermann from implies that for any $\delta>1$ and $M>0$ there are $h>0$ and $\varepsilon>0$ depending only on $\delta$ and $M$ with the following property: whenever $N\geq\delta n$ and $A$ is an $N\times n$ random matrix with i.i.d. entries with mean zero, variance one and a.s. bounded by $M$ , we have

This, together with an upper bound for $s_{n}(A)$ , gives an estimate

with a large probability, where $L>0$ depends only on $\delta$ and $M$ . However, such an estimate would be insufficient for our needs, and we shall apply a more direct argument to get a stronger relation.

Fix any $\eta>0$ , let $\varepsilon=\varepsilon_{\ref{ssv of submatr}}(\eta,M)$ be the largest number in $(0,1]$ satisfying

Let $N\geq N_{\ref{ssv of submatr}}$ , $n\leq N$ and $A$ be an $N\times n$ random matrix defined as above. We shall prove the statement by contradiction. Let us assume that

Cardinality of the set $T=\bigl{\{}I\subset\{1,2,\dots,N\}:\,|I|=\lceil N-\varepsilon N\rceil\bigr{\}}$ can be estimated as

Hence, our assumption implies that there is a set $I_{0}\in T$ such that

Now, for every $y=(y_{1},y_{2},\dots,y_{n})\in S^{n-1}$ , Lemma 4 and the standard procedure with the Laplace transform give for $\lambda=\frac{c_{\ref{Khintchine mod}}}{2M^{2}}$ :

Together with (4), the last estimate implies

However, this contradicts our choice of $\varepsilon$. Thus, the initial assumption was wrong, and the statement is proved. ∎
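The counting step in the proof above relies on a bound for the number of index sets of size $\lceil N-\varepsilon N\rceil$. The precise estimate used there is not reproduced in this text; the Python sketch below checks the standard bound $\binom{N}{k}\leq(eN/k)^{k}$, where $k$ is the number of removed indices, as one plausible form of it. The function names and the values of $N$ and $\varepsilon$ are ours.

```python
import math

def num_index_sets(N, eps):
    """Number of sets I in {1, ..., N} with |I| = ceil(N - eps*N)."""
    size = math.ceil(N - eps * N)
    return math.comb(N, size)

def removal_bound(N, eps):
    """Standard bound C(N, k) <= (e*N/k)^k, with k = N - ceil(N - eps*N)
    the number of removed indices (an assumed form of the estimate)."""
    k = N - math.ceil(N - eps * N)
    if k == 0:
        return 1.0
    return (math.e * N / k) ** k

N, eps = 200, 0.1
count = num_index_sets(N, eps)
bound = removal_bound(N, eps)
print(f"log(count) = {math.log(count):.2f}, log(bound) = {math.log(bound):.2f}")
```

The point of such a bound is that the number of sets grows like $\exp(c(\varepsilon)N)$ with $c(\varepsilon)\to 0$ as $\varepsilon\to 0$, so a union bound over all of them is affordable once the per-set probability is exponentially small in $N$.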

Let $\xi$ be a random variable with zero mean. Then for any $M>0$ we call the variable

$$\xi\chi_{\{|\xi|\leq M\}}-{\mathbb{E}}\bigl{(}\xi\chi_{\{|\xi|\leq M\}}\bigr{)}$$

the centered $M$-truncation of $\xi$. Here, $\chi_{\{|\xi|\leq M\}}$ is the indicator of the event $\bigl{\{}\omega\in\Omega:\,|\xi(\omega)|\leq M\bigr{\}}$.
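To make the definition concrete, the following Python sketch computes the centered $M$-truncation for a variable with finitely many values, reading the definition as $\xi\chi_{\{|\xi|\leq M\}}$ minus its expectation. The two-point law used here is a hypothetical example chosen so that all quantities can be verified by hand.

```python
def centered_truncation(values, probs, M):
    """Centered M-truncation of a finitely supported variable xi.

    Returns the values taken by xi*1{|xi| <= M} - E[xi*1{|xi| <= M}] on the
    atoms of xi, together with the variance of the truncated variable."""
    truncated = [v if abs(v) <= M else 0.0 for v in values]
    mean = sum(p * t for p, t in zip(probs, truncated))
    centered = [t - mean for t in truncated]
    variance = sum(p * c * c for p, c in zip(probs, centered))
    return centered, variance

# Hypothetical zero-mean law: xi = -1 with probability 0.8, xi = 4 with probability 0.2.
values = [-1.0, 4.0]
probs = [0.8, 0.2]

centered_2, var_2 = centered_truncation(values, probs, M=2.0)  # the atom 4 is cut off
centered_5, var_5 = centered_truncation(values, probs, M=5.0)  # nothing is cut off
mean_2 = sum(p * c for p, c in zip(probs, centered_2))
```

Without the centering term the truncated variable in this example would have mean $-0.8$, which is why the expectation is subtracted; as $M$ grows, the variance of the truncation returns to that of $\xi$.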

Thus, it suffices to prove the lower estimate

Now, choose arbitrary $\eta>0$ and let $M>0$ be such that

where the quantity on the right-hand side goes to $1$ as $k$ tends to infinity. Hence, we obtain

On the other hand, the theorem of Bai and Yin implies that

Since $\eta>0$ was arbitrary, this proves the result. ∎

Acknowledgement. I would like to thank my supervisor Dr. Nicole Tomczak-Jaegermann for support and for valuable suggestions on the text.
