The smallest singular value of random rectangular matrices with no moment assumptions on entries
Konstantin E. Tikhomirov
Introduction
In last years, spectral properties of random matrices with fixed dimensions (the corresponding theory is often called non-asymptotic) have attracted considerable attention of researchers, whose efforts have been mostly concentrated on studying distributions of the largest and the smallest singular values. For detailed information on the development of the subject, we refer the reader to surveys , .
Let . Given an random matrix , we employ a usual notation ; . A limiting result of Z.D. Bai and Y.Q. Yin suggests that for an matrix with i.i.d. mean zero entries with unit variance and a finite fourth moment, its largest and smallest singular values should “concentrate” near and , respectively. In the non-asymptotic setting one is interested, in particular, in finding the weakest possible conditions on random matrices that would imply and with a large probability.
Various estimates for the extremal singular values were obtained when studying the problem of approximating covariance matrix of a random vector by the empirical covariance matrix. Answering a question of R. Kannan, L. Lovász and M. Simonovits, the authors of treated log-concave random vectors. Later, the log-concavity was replaced by weaker assumptions (see, in particular, , , , ).
The assumption of isotropicity of a random vector or, more generally, boundedness of variance of its coordinates is quite natural and appears as part of requirements on a matrix’ rows in all the aforementioned papers. However, for a deeper understanding of non-asymptotic characteristics of random matrices, an important question is whether any moment assumptions on entries are really necessary in order to get satisfactory lower estimates for the smallest singular value.
Unlike in and where the matrix entries within a given row are not necessarily independent, in our paper we consider the classical setting when a rectangular matrix has i.i.d. entries. However, in contrast with all the mentioned results, the lower estimate for the smallest singular value that we prove does not use any moment assumptions; the only requirement is that the distribution of entries satisfies a “spreading” condition given in terms of the Levy concentration function. Moreover, compared to and , we significantly relax the assumptions on the aspect ratio of the matrix.
Given a real random variable , the concentration function of is defined as
The main result of our paper is the following theorem:
Then for any non-random matrix we have
Adding a non-random component in the theorem does not increase complexity of the proof; on the other hand, it demonstrates “shift-invariance” of the lower estimate. Note that the problem of estimating the smallest singular value of non-random shifts of square matrices is important in analysis of algorithms , , , .
Our proof of Theorem 1 is based on two key elements: on a modification of a standard -net argument for matrices (Proposition 3) and on estimates of the distance between a random vector and a fixed linear subspace that follow from a result of (Theorem 4 and Corollary 6 of our paper). Our method is similar in many aspects to the approach developed in and later in , . In particular, as in the mentioned papers, we decompose the unit sphere into several subsets which are studied separately from one another. On the other hand, our modification of the -net argument and its technical realization in regard to splitting a random matrix into “regular” and “non-regular” parts are apparently new.
We will discuss the main idea of the proof more concretely and in more detail at the end of the next section, after we define notation and state the modified -net argument.
Preliminaries
Choose any and such that . Then
By taking the infimum over all , we obtain the result. ∎
Note that Lemma 2 cannot be used to handle matrices with the aspect ratio less than . Indeed, the lower estimate s_{n}(D)\geq\inf\limits_{y\in S^{n-1}}{\rm d}\bigl{(}{D_{1}}y,{\rm span}{D_{2}}\bigr{)} is non-trivial only if , which is not true when and both and have full rank. The following strengthening of Lemma 2 resolves the problem:
and for any there is such that
Take any and let be such that . Then
Taking the infimum over , we get the result. ∎
Note that for the above definition is consistent with that given in the introduction. The following result is proved by M. Rudelson and R. Vershynin in :
where is a (sufficiently large) universal constant.
This theorem gives a nontrivial estimate for concentration only for sufficiently close to zero. Below, we provide an elementary estension of this result covering the case of “more concentrated” coordinates. First, let us recall a theorem of B. Rogozin:
where is a universal constant.
Now, an easy application of Theorems 4 and 5 gives
Let be a random vector with independent coordinates such that
and via the definition of we get the statement. ∎
where the depend only on and .
for some depending only on . Let
We adopt the following notation: For any subset let
Note that for all we have for and for . Hence, to verify (4) it is sufficient to prove that variables
so (, ) are conditionally independent given . ∎
Define as the collection of all subsets satisfying
with \tau=\frac{1}{2}\bigl{(}\delta^{-1/4}-\delta^{-1/3}\bigr{)}. Then
where depends only on .
For each and let
It is not difficult to see that the events () are independent in view of independence of the entries of .
Fix for a moment any . One can verify that for any and the variables and are i.i.d. given event . It follows that
Take any subset satisfying
Thus, any satisfying (7) belongs to . Clearly,
hence . The argument implies and the result follows. ∎
Next, we combine the result of Lemma 9 with Corollary 6:
Let , , , and be exactly as in Lemma 9 and be a non-random matrix. Then
where and , depend only on .
Let and be defined as in Lemma 9 and take any . Let
By Lemma 8, the subspace and the vector are conditionally independent given , hence the above estimate immediately implies
Since the relation holds for all , in view of Lemma 9 we obtain
Finally, we can prove the main result of the section:
Let be an random matrix having the same distribution as such that -dimensional vectors (, ) are i.i.d. and for any admissible and the variables and are conditionally i.i.d. given event and identical on . For every and , by the formula for the joint distribution of and we get
hence, in view of symmetric distribution of , we have {\mathcal{Q}}\bigl{(}{(a_{ij})}_{H}-{(a_{ij}^{\prime})}_{H},\frac{d}{2}\bigr{)}\leq 1-\frac{r}{2}. Clearly, for every coordinate of the vector , hence by Theorem 5 for all
Thus, vector satisfies condition (5) with . Then, by Lemma 10,
In our proof of Theorem 1, we represent as the union of three subsets:
where is a function of the parameters and of the theorem. Then the smallest singular value of can be estimated by bounding separately over each of the three subsets.
In our representation of , we follow an idea from , where the unit sphere was split into sets of “close to sparse” and “far from sparse” vectors. A similar splitting was also employed in , , where the terms “compressible” and “incompressible” were used instead. On the other hand, our “borderline” is smaller by the order of magnitude than in the mentioned papers.
The next elementary lemma shall be used in conjunction with Proposition 3.
Then there is a finite set of cardinality at most \bigl{(}\frac{C_{\ref{net sparsification lemma}}n}{\varepsilon m}\bigr{)}^{m} such that for any there is with .
For any with , let be an -net for of cardinality at most \bigl{(}\frac{3}{\varepsilon}\bigr{)}^{m}. Define as the union of for all admissible . Then, obviously,
Next, fix any and let be such that . Since , there is with . It remains to note that since , necessarily . ∎
Then for the set and any non-random matrix
Fix any and and define , , ; let be as in Proposition 11 and be the smallest integer greater than such that for all
Let () and define an event
In view of Proposition 11, the upper estimate for and the definition of
Take any and define , , . Since all entries of are bounded by by absolute value, we get ; next, for every
(note that ). Hence, by Proposition 3, we get
Finally, applying the above argument to all , we get the result. ∎
As we noted before, construction of the set corresponding to is not so trivial as in the case of almost -sparse vectors. The reason is that in general the set is much larger than , and we have to apply more delicate arguments to get a satisfactory probabilistic estimate. The construction of for the set of “far from -sparse” vectors is contained in the following lemma:
Let us recall a folklore estimate of the norm of a random matrix with bounded mean zero entries (see, for example, [11, Proposition 2.4]):
Let be an () random matrix with i.i.d. mean zero entries; and assume that a.s. Then for a universal constant
The next lemma highlights a useful property of the vectors from :
For any integer and any there is a set such that , and .
Take any and and let
Obviously, and, since is not almost -sparse, . Let be any partition of into pairwise disjoint subsets of cardinality at most with . Then, clearly, for some , . Setting, , we get the result. ∎
Fix any and . To make the notation more compact, denote and let be the largest number in such that for all
(it is not difficult to see that is well defined). Then, take to be the smallest positive integer such that for all
Let , and let be an random matrix with entries satisfying conditions of the lemma and be any non-random matrix.
where is defined as in Proposition 11. Assume that is non-empty and let consist of all -sparse vectors with and . The first inequality in (9) and a simple calculation show that . Hence, in view of Lemma 16, is non-empty and satisfies (8). By Lemma 12, there is a finite subset of cardinality at most \bigl{(}\frac{nC_{\ref{net sparsification lemma}}}{m\varepsilon}\bigr{)}^{m} such that for any there is with .
For each denote . By Proposition 11,
By the above probability estimates and Lemma 15,
Using the definition of , , and the second inequality in (9), we can estimate the probability as
Hence, by Proposition 3 and the definition of , we get
Finally, applying the above argument to entire set , we obtain the result. ∎
In view of the trivial identity , it is enough to prove the theorem for . Fix any and , let and let be the smallest integer such that and for all
By Propositions 13 and 17 for and we have
Combining the estimates, we get for h=\min\bigl{(}h_{\ref{peaky lemma}}\theta_{\ref{compressible lemma}},h_{\ref{compressible lemma}},h_{\ref{incompressible vectors lemma}}\bigr{)}:
Acknowledgement
I would like to thank my supervisor Prof. N. Tomczak-Jaegermann for valuable suggestions that helped improve structure of the proof.