Non-asymptotic theory of random matrices: extreme singular values

Mark Rudelson, Roman Vershynin

Asymptotic and non-asymptotic problems on random matrices

In a similar way, Marchenko-Pastur law governs the limiting spectrum of n×nn\times n Wishart matrices WN,n=AAW_{N,n}=A^{*}A, where A=AN,nA=A_{N,n} is an N×nN\times n random Gaussian matrix whose entries are independent standard normal random variables. As the dimensions N,nN,n grow to infinity while the aspect ratio n/Nn/N converges to a non-random number y(0,1]y\in(0,1], the spectrum of the normalized Wishart matrices 1NWN,n\frac{1}{N}W_{N,n} is distributed according to the Marchenko-Pastur law with density 12πxy(bx)(xa)\frac{1}{2\pi xy}\sqrt{(b-x)(x-a)} supported on [a,b][a,b] where a=(1y)2a=(1-\sqrt{y})^{2}, b=(1+y)2b=(1+\sqrt{y})^{2}. The meaning of the convergence is similar to the one in Wigner’s semicircle law.

It is widely believed that phenomena typically observed in asymptotic random matrix theory are universal, that is independent of the particular distribution of the entries of random matrices. By analogy with classical probability, when we work with independent standard normal random variables ZiZ_{i}, we know that their normalized sum Sn=1ni=1nZiS_{n}=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}Z_{i} is again a standard normal random variable. This simple but useful fact becomes significantly more useful when we learn that it is asymptotically universal. Indeed, The Central Limit Theorem states that if instead of normal distribution ZiZ_{i} have general identical distribution with zero mean and unit variance, the normalized sum SnS_{n} will still converge (in distribution) to the standard normal random variable as nn\to\infty. In random matrix theory, universality has been established for many results. In particular, Wigner’s semicircle law and Marchenko-Pastur law are known to be universal – like the Central Limit Theorem, they hold for arbitrary distribution of entries with zero mean and unit variance (see for semi-circle law and for Marchenko-Pastur law).

where cc is a positive absolute constant. Such exponential deviation inequalities, which are extremely useful in a number of applications, are non-asymptotic results whose asymptotic prototype is the Central Limit Theorem.

A similar non-asymptotic viewpoint can be adopted in random matrix theory. One would then study spectral properties of random matrices of fixed dimensions. Non-asymptotic results on random matrices are in demand in a number of today’s applications that operate in high but fixed dimensions. This usually happens in statistics where one analyzes data sets with a large but fixed number of parameters, in geometric functional analysis where one works with random operators on finite-dimensional spaces (whose dimensions are large but fixed), in signal processing where the signal is randomly sampled in many but fixed number of points, and in various other areas of science and engineering.

This survey is mainly focused on the non-asymptotic theory of the extreme singular values of random matrices (equivalently, the extreme eigenvalues of sample covariance matrices) where significant progress was made recently. In Section 2 we review estimates on the largest singular value (the soft edge). The more difficult problem of estimating the smallest singular value (the hard edge) is discussed in Section 3, and its connection with the Littlewood-Offord problem in additive combinatorics is the content of Section 4. In Section 5 we discuss several applications of non-asymptotic results to the circular law in asymptotic random matrix theory, to restricted isometries in compressed sensing, and to Kashin’s subspaces in geometric functional analysis.

Extreme singular values

The geometric meaning of the extreme singular values can be clear by considering the best possible factors mm and MM in the two-sided inequality

Asymptotic behavior of extreme singular values

We first turn to the asymptotic theory for the extreme singular values of random matrices with independent entries (and with zero mean and unit variance for normalization purposes). From Marchenko-Pastur law we know that most singular values of such random N×nN\times n matrix AA lie in the interval [Nn,N+n][\sqrt{N}-\sqrt{n},\sqrt{N}+\sqrt{n}]. Under mild additional assumptions, it is actually true that all singular values lie there, so that asymptotically we have

This fact is universal and it holds for general distributions. This was established for smax(A)s_{\max}(A) by Geman and Yin, Bai and Krishnaiah . For smin(A)s_{\min}(A), Silverstein proved this for Gaussian random matrices, and Bai and Yin gave a unified treatment of both extreme singular values for general distributions:

Let A=AN,nA=A_{N,n} be an N×nN\times n random matrix whose entries are independent copies of some random variable with zero mean, unit variance, and finite fourth moment. Suppose that the dimensions NN and nn grow to infinity while the aspect ratio n/Nn/N converges to some number y(0,1]y\in(0,1]. Then

Moreover, without the fourth moment assumption the sequence 1Nsmax(A)\frac{1}{\sqrt{N}}\,s_{\max}(A) is almost surely unbounded .

The limiting distribution of the extreme singular values is known and universal. It is given by the Tracy-Widom law whose cumulative distribution function is

where u(s)u(s) is the solution to the Painlevè II equation u=2u3+suu^{\prime\prime}=2u^{3}+su with the asymptotic u(s)12πs1/4exp(23s3/2)u(s)\sim\frac{1}{2\sqrt{\pi}s^{1/4}}\exp(-\frac{2}{3}s^{3/2}) as ss\to\infty. The occurrence of Tracy-Widom law in random matrix theory and several other areas was the subject of an ICM 2002 talk of Tracy and Widom . This law was initially discovered for the largest eigenvalue of a Gaussian symmetric matrix . For the largest singular values of random matrices with independent entries it was established by Johansson and Johnstone in the Gaussian case, and by Soshnihikov for more general distributions. For the smallest singular value, the corresponding result was recently obtained in a recent work Feldheim and Sodin who gave a unified treatment of both extreme singular values. These results are known under a somewhat stronger subgaussian moment assumption on the entries aija_{ij} of AA, which requires their distribution to decay as fast as the normal random variable:

A random variable XX is subgaussian if there exists K>0K>0 called the subgaussian moment of XX such that

Let A=AN,nA=A_{N,n} be an N×nN\times n random matrix whose entries are independent and identically distributed subgaussian random variables with zero mean and unit variance. Suppose that the dimensions NN and nn grow to infinity while the aspect ratio n/Nn/N stays uniformly bounded by some number y(0,1)y\in(0,1). Then the normalized extreme singular values

converge in distribution to the Tracy-Widom law (2.2).

Non-asymptotic behavior of extreme singular values

It is not entirely clear to what extent the limiting behavior of the extreme singular values such as asymptotics (2.1) manifests itself in fixed dimensions. Given the geometric meaning of the extreme singular values, our interest generally lies in establishing correct upper bounds on smax(A)s_{\max}(A) and lower bounds on smin(A)s_{\min}(A). We start with a folklore observation which yields the correct bound smax(A)N+ns_{\max}(A)\lesssim\sqrt{N}+\sqrt{n} up to an absolute constant factor. The proof is a basic instance of an ε\varepsilon-net argument, a technique proved to be very useful in geometric functional analysis.

Let AA be an N×nN\times n random matrix whose entries are independent mean zero subgaussian random variables whose subgaussian moments are bounded by 11. Then

Here and elsewhere in this paper, C,C1,c,c1C,C_{1},c,c_{1} denote positive absolute constants.

We will sketch the proof for N=nN=n; the general case is similar. The expression smax(A)=maxx,ySn1Ax,ys_{\max}(A)=\max_{x,y\in S^{n-1}}\langle Ax,y\rangle motivates us to first control the random variables Ax,y\langle Ax,y\rangle individually for each pair of vectors x,yx,y on the unit Euclidean sphere Sn1S^{n-1}, and afterwards take the union bound over all such pairs. For fixed x,ySn1x,y\in S^{n-1} the expression Ax,y=i,jaijxjyi\langle Ax,y\rangle=\sum_{i,j}a_{ij}x_{j}y_{i} is a sum of independent random variables, where aija_{ij} denote the independent entries of AA. If aija_{ij} were standard normal random variables, the rotation invariance of the Gaussian distribution would imply that Ax,y\langle Ax,y\rangle is again a standard normal random variable. This property generalizes to subgaussian random variables. Indeed, using moment generating functions one can show that a normalized sum of mean zero subgaussian random variables is again a subgaussian random variable, although the subgaussian moment may increase by an absolute constant factor. Thus

Obviously, we cannot finish the argument by taking the union bound over infinite (even uncountable) number of pairs x,yx,y on the sphere Sn1S^{n-1}. In order to reduce the number of such pairs, we discretize Sn1S^{n-1} by considering its ε\varepsilon-net Nε\mathcal{N}_{\varepsilon} in the Euclidean norm, which is a subset of the sphere that approximates every point of the sphere up to error ε\varepsilon. An approximation argument yields

To gain a control over the size of the net Nε\mathcal{N}_{\varepsilon}, we construct it as a maximal ε\varepsilon-separated subset of Sn1S^{n-1}; then the balls with centers in Nε\mathcal{N}_{\varepsilon} and radii ε/2\varepsilon/2 form a packing inside the centered ball of radius 1+ε/21+\varepsilon/2. A volume comparison gives the useful bound on the cardinality of the net: Nε(1+2/ε)n|\mathcal{N}_{\varepsilon}|\leq(1+2/\varepsilon)^{n}. Choosing for example ε=1/2\varepsilon=1/2, we are well prepared to take the union bound:

We complete the proof by choosing s=Cn+ts=C\sqrt{n}+t with appropriate constant CC. ∎

Let AA be a random matrix whose entries aija_{ij} are independent mean zero random variables with finite fourth moment. Then

For random Gaussian matrices, a much sharper result than in Proposition 2.4 is due to Gordon :

Let AA be an N×nN\times n matrix whose entries are independent standard normal random variables. Then

This result is a consequence of the sharp comparison inequalities for Gaussian processes due to Slepian and Gordon, see and [49, Section 3.3].

Tracy-Widom fluctuations

see . For general random matrices with independent bounded entries, one can use Talagrand’s concentration inequality for convex Lipschitz functions on the cube . Namely, suppose the entries of AA are independent, have mean zero, and are uniformly bounded by 11. Since smax(A)s_{\max}(A) is a convex function of AA, Talagrand’s concentration inequality implies

Inequality (2.3) is optimal for large tt because smax(A)s_{\max}(A) is bounded below by the magnitude of every entry of AA which has the Gaussian tail. But for small deviations, say for t<1t<1, inequality (2.3) is meaningless. Tracy-Widom law predicts a different tail behavior for small deviations tt. It must follow the tail decay of the Tracy–Widom function F1F_{1}, which is not subgaussian , :

The concentration of this type for Hermitian complex and real Gaussian matrices (Gaussian Unitary Ensemble and Gaussian Orthogonal Ensemble) was proved by Ledoux and Aubrun . Recently, Feldheim and Sodin introduced a general approach, which allows to prove the asymptotic Tracy–Widom law and its non-asymptotic counterpart at the same time. Moreover, their method is applicable to random matrices with independent subgaussian entries both in symmetric and non-symmetric case. In particular, for an N×nN\times n random matrix AA with independent subgaussian entries they proved that

The method of also addresses the more difficult smallest singular value. For an N×nN\times n random matrix AA whose dimensions are not too close to each other Feldheim and Sodin proved the Tracy–Widom law for the smallest singular value together with a non-asymptotic version of the bound smin(A)Nns_{\min}(A)\sim\sqrt{N}-\sqrt{n}:

The smallest singular value

In this section we focus on the behavior of the smallest singular value of random N×nN\times n matrices with independent entries. The smallest singular value – the hard edge of the spectrum – is generally more difficult and less amenable to analysis by classical methods of random matrix theory than the largest singular value, the “soft edge”. The difficulty especially manifests itself for square matrices (N=nN=n) or almost square matrices (Nn=o(n)N-n=o(n)). For example, we were guided so far by the asymptotic prediction smin(A)Nns_{\min}(A)\sim\sqrt{N}-\sqrt{n}, which obviously becomes useless for square matrices.

Quantitative invertibility problem

The previous problem is only concerned with whether the hard edge smin(A)s_{\min}(A) is zero or not. This says nothing about the quantitative invertibility problem of the typical size of smin(A)s_{\min}(A). The latter question has a long history. Von Neumann and his associates used random matrices as test inputs in algorithms for numerical solution of systems of linear equations. The accuracy of the matrix algorithms, and sometimes their running time as well, depends on the condition number κ(A)=smax(A)/smin(A)\kappa(A)=s_{\max}(A)/s_{\min}(A). Based on heuristic and experimental evidence, von Neumann and Goldstine predicted that

which together yield κ(A)n\kappa(A)\sim n, see [92, Section 7.8]. In Section 2 we saw several results establishing the second part of (3.1), for the largest singular value.

Estimating the smallest singular value turned out to be more difficult. A more precise form of the prediction smin(A)n1/2s_{\min}(A)\sim n^{-1/2} was repeated by Smale and proved by Edelman and Szarek for random Gaussian matrices AA, those with i.i.d. standard normal entries. For such matrices, the explicit formula for the joint density of the eigenvalues λi\lambda_{i} of 1nAA\frac{1}{n}A^{*}A is available:

Integrating out all the eigenvalues except the smallest one, one can in principle compute its distribution. This approach leads to the following asymptotic result:

Let A=AnA=A_{n} be an n×nn\times n random matrix whose entries are independent standard normal random variables. Then for every fixed ε0\varepsilon\geq 0 one has

The limiting probability behaves as 1exp(εε2/2)ε1-\exp(-\varepsilon-\varepsilon^{2}/2)\sim\varepsilon for small ε\varepsilon. In fact, the following non-asymptotic bound holds for all nn:

This follows from the analysis of Edelman ; Sankar, Spielman and Teng provided a different geometric proof of estimate (3.2) up to an absolute constant factor and extended it to non-centered Gaussian distributions.

Smallest singular values of general random matrices

These methods do not work for general random matrices, especially those with discrete distributions, where rotation invariance and the joint density of eigenvalues are not available. The prediction that smin(A)n1/2s_{\min}(A)\sim n^{-1/2} has been open even for random Bernoulli matrices. Spielman and Teng conjectured in their ICM 2002 talk that estimate (3.2) should hold for the random Bernoulli matrices up to an exponentially small term that accounts for their singularity probability:

where c(0,1)c\in(0,1) is an absolute constant. The first polynomial bound on smin(A)s_{\min}(A) for general random matrices was obtained in . Later Spielman-Teng’s conjecture was proved in up to a constant factor, and for general random matrices:

Let AA be an n×nn\times n random matrix whose entries are independent and identically distributed subgaussian random variables with zero mean and unit variance. Then

where C>0C>0 and c(0,1)c\in(0,1) depend only on the subgaussian moment of the entries.

This result addresses both qualitative and quantitative aspects of the invertibility problem. Setting ε=0\varepsilon=0 we see that AA is invertible with probability at least 1cn1-c^{n}. This generaizes the result of Kahn, Komlos and Szemeredi from Bernoulli to all subgaussian matrices. On the other hand, quantitatively, Theorem 3.2 states that smin(A)n1/2s_{\min}(A)\gtrsim n^{-1/2} with high probability for general random matrices. A corresponding non-asymptotic upper bound smin(A)n1/2s_{\min}(A)\lesssim n^{-1/2} also holds , so we have smin(A)n1/2s_{\min}(A)\sim n^{-1/2} as in von Neumann-Goldstine’s prediction. Both these bounds, upper and lower, hold with high probability under the weaker fourth moment assumption on the entries .

This theory was extended to rectangular random matrices of arbitrary dimensions N×nN\times n in . As we know from Section 2, one expects that smin(A)Nns_{\min}(A)\sim\sqrt{N}-\sqrt{n}. But this would be incorrect for square matrices. To reconcile rectangular and square matrices we make the following correction of our prediction:

For square matrices one would have the correct estimate smin(A)nn1n1/2s_{\min}(A)\sim\sqrt{n}-\sqrt{n-1}\sim n^{-1/2}. The following result extends Theorem 3.2 to rectangular matrices:

Let AA be an n×nn\times n random matrix whose entries are independent and identically distributed subgaussian random variables with zero mean and unit variance. Then

where C>0C>0 and c(0,1)c\in(0,1) depend only on the subgaussian moment of the entries.

This result has been known for a long time for tall matrices, whose the aspect ratio λ=n/N\lambda=n/N is bounded by a sufficiently small constant, see . The optimal bound smin(A)cNs_{\min}(A)\geq c\sqrt{N} can be proved in this case using an ε\varepsilon-net argument similar to Proposition 2.4. This was extended in to smin(A)cλNs_{\min}(A)\geq c_{\lambda}\sqrt{N} for all aspect ratios λ<1c/logn\lambda<1-c/\log n. The dependence of cλc_{\lambda} on the aspect ratio λ\lambda was improved in for Bernoulli matrices and in for general subgaussian matrices. Feldheim-Sodin’s Theorem 2.3 gives precise Tracy-Widom fluctuations of smin(A)s_{\min}(A) for tall matrices, but becomes useless for almost square matrices (say for N<n+n1/3N<n+n^{1/3}). Theorem 3.3 is an an optimal result (up to absolute constants) which covers matrices with all aspect ratios from tall to square. Non-asymptotic estimate (3.3) was extended to matrices whose entries have finite (4+ε)(4+\varepsilon)-th moment in .

Universality of the smallest singular values

The limiting distribution of smin(A)s_{\min}(A) turns out to be universal as dimension nn\to\infty. We already saw a similar universality phenomenon in Theorem 2.3 for genuinely rectangular matrices. For square matrices, the corresponding result was proved by Tao and Vu :

Let AA be an n×nn\times n random matrix whose entries are independent and identically distributed random variables with zero mean, unit variance, and finite KK-th moment where KK is a sufficiently large absolute constant. Let GG be an n×nn\times n random matrix whose entries are independent standard normal random variables. Then

where c>0c>0 depends only on the KK-th moment of the entries.

On a methodological level, this result may be compared in classical probability theory to Berry-Esseen theorem (1.1) which establishes polynomial deviations from the limiting distribution, while Theorems 3.2 and 3.3 bear a similarity with large deviation results like (1.2) which give exponentially small tail probabilities.

Sparsity and invertibility: a geometric proof of Theorem 3.2

We will now sketch the proof of Theorem 3.2 given in . This argument is mostly based on geometric ideas, and it may be useful beyond spectral analysis of random matrices.

Looking at smin(A)=minxSn1Ax2s_{\min}(A)=\min_{x\in S^{n-1}}\|Ax\|_{2} we see that our goal is to bound below Ax2\|Ax\|_{2} uniformly for all unit vectors xx. We will do this separately for sparse vectors and for spread vectors with two very different arguments. Choosing a small absolute constant c0>0c_{0}>0, we first consider the class of sparse vectors

Establishing invertibility of AA on this class is relatively easy. Indeed, when we look at Ax2\|Ax\|_{2} for sparse vectors xx of fixed support supp(x)=I\operatorname{supp}(x)=I of size I=c0n|I|=c_{0}n, we are effectively dealing with the n×c0nn\times c_{0}n submatrix AIA_{I} that consists of the columns of AA indexed by II. The matrix AIA_{I} is tall, so as we said below Theorem 3.3, its smallest singular value can be estimated using the standard ε\varepsilon-net argument. This gives smin(AI)cn1/2s_{\min}(A_{I})\geq cn^{1/2} with probability at least 12en1-2e^{-n}. This allows us to further take the union bound over (nc0n)en/2\binom{n}{c_{0}n}\leq e^{n/2} choices of support II, and conclude that with probability at least 12en/21-2e^{-n/2} we have invertibility on all sparse vectors:

We thus obtained a much stronger bound than we need, n1/2n^{1/2} instead of n1/2n^{-1/2}.

Establishing invertibility of AA on non-sparse vectors is more difficult because there are too many of them. For example, there are exponentially many vectors on Sn1S^{n-1} whose coordinates all equal ±n1/2\pm n^{-1/2} and which have at least a constant distance from each other. This gives us no hope to control such vectors using ε\varepsilon-nets, as any nontrivial net must have cardinality at least 2n2^{n}. So let us now focus on this most difficult class of extremely non-sparse vectors

Once we prove invertibility of AA on these spread vectors, the argument can be completed for all vectors in Sn1S^{n-1} by an approximation argument. Loosely speaking, if xx is close to Sparse{\mathit{Sparse}} we can treat xx as sparse, otherwise xx must have at least cncn coordinates of magnitude xi=O(n1/2)|x_{i}|=O(n^{-1/2}), which allows us to treat xx as spread.

Since the right hand side does not depend on xx, we have proved that

which is simple for the Gaussian distribution (by rotation invariance) and difficult to prove e.g. for the Bernoulli distribution. Together with (3.6) this means that we proved invertibility on all spread vectors:

This is exactly the type of probability bound claimed in Theorem 3.2. As we said, we can finish the proof by combining with the (much better) invertibility on sparse vectors in (3.4), and by an approximation argument.

Littlewood-Offord theory

We need to understand the distribution of sums of independent random variables

Sums of independent random variables is a classical theme in probability theory. The well-developed area of large deviation inequalities like (1.2) demonstrates that SS nicely concentrates around its mean. But our problem is opposite as we need to show that SS is not too concentrated around its mean , and perhaps more generally around any real number. Several results in probability theory starting from the works of Lévy , Kolmogorov and Esséen were concerned with the spread of sums of independent random variables, which is quantified as follows:

The Lévy concentration function of a random variable SS is

Lévy concentration function measures the small ball probability , the likelihood that SS enters a small interval. For continuous distributions one can show that L(S,ε)ε\mathcal{L}(S,\varepsilon)\lesssim\varepsilon for all ε0\varepsilon\geq 0. For discrete distributions this may be false. Instead, a new phenomenon arises for discrete distributions which is unseen in large deviation theory: Lévy concentration function depends on the additive structure of the coefficient vector aa. This is best illustrated on the example where ξi\xi_{i} are independent Bernoulli random variables (±1\pm 1 valued and symmetric). For sparse vectors like a=21/2(1,1,0,,0)a=2^{-1/2}(1,1,0,\ldots,0), Lévy concentration function can be large: L(S,0)=1/2\mathcal{L}(S,0)=1/2. For spread vectors, Berry-Esseen’s theorem (1.1) yields a better bound:

The threshold n1/2n^{-1/2} comes from many cancelations in the sums ±1\sum\pm 1 which occur because all coefficients aia_{i} are equal. For less structured aa, fewer cancelations occur:

Studying the influence of additive structure of the coefficient vector aa on the spread of S=aiξiS=\sum a_{i}\xi_{i} became known as the Littlewood-Offord problem. It was initially developed by Littlewood and Offord , Erdös and Moser , Sárkozy and Szemerédi , Halasz , Frankl and Füredi . For example, if all ai1|a_{i}|\geq 1 then L(S,1)Cn1/2\mathcal{L}(S,1)\leq Cn^{-1/2} , which agrees with (4.2). Similarly, a general fact behind (4.3) is that if aiaj1|a_{i}-a_{j}|\geq 1 for all iji\neq j then L(S,1)Cn3/2\mathcal{L}(S,1)\leq Cn^{-3/2} .

New results on Lévy concentration function

Let ξ1,,ξn\xi_{1},\ldots,\xi_{n} be independent identically distributed mean zero random variables, which are well spread: p:=L(ξk,1)<1p:=\mathcal{L}(\xi_{k},1)<1. Then, for every coefficient vector a=(a1,,an)Sn1a=(a_{1},\ldots,a_{n})\in S^{n-1} and every accuracy level α>0\alpha>0, the sum S=i=1naiξiS=\sum_{i=1}^{n}a_{i}\xi_{i} satisfies

where C,c>0C,c>0 depend only on the spread pp.

One can prove this inequality using Fourier inversion formula, see [80, Section 7.3].

We will show how to prove Theorem 4.2 for Bernoulli random variables ξi\xi_{i}; the general case requires an additional argument. Without loss of generality we can assume that lcdα(a)1πε\operatorname{lcd}_{\alpha}(a)\geq\frac{1}{\pi\varepsilon}. Applying (4.5) for Z=S/εZ=S/\varepsilon, we obtain by independence that

Substituting this into (4.6) yields L(S,ε)C1(ε+2exp(α2/2))\mathcal{L}(S,\varepsilon)\leq C_{1}(\varepsilon+2\exp(-\alpha^{2}/2)) as required. ∎

Theorem 4.2 justifies our empirical observation that Lévy concentration function is proportional to the amount of structure in the coefficient vector, which is measured by the (reciprocal of) its essential least common denominator. As we said, this result is typically used for accuracy level α=cn\alpha=c\sqrt{n} with some small positive constant cc. In this case, the term Cecα2Ce^{-c\alpha^{2}} in (4.4) is exponentially small in nn (thus negligible in applications), and the term CεC\varepsilon is optimal for continuous distributions.

Theorem 4.2 performs best for totally unstructured coefficient vectors aa, those with exponentially large lcdα(a)\operatorname{lcd}_{\alpha}(a). Heuristically, this should be the case for random vectors, as randomness should destroy any structure. While this is not true for general vectors with independent coordinates (e.g. for equal coordinates with random signs), it is true for normals of random hyperplanes:

where c>0c>0 depends only on the subgaussian moment.

Therefore for random normals aa, Theorem 4.2 yealds with high probability a very strong bound on Lévy concentration function:

This brings us back to the distance problem considered in the beginning of this section, which motivated our study of Lévy concentration function:

Let XiX_{i} be random vectors as in Theorem 4.3, and Hn=span(X1,,Xn1)H_{n}=\operatorname{span}(X_{1},\ldots,X_{n-1}). Then

where C,c>0C,c>0 depend only on the subgaussian moment.

As was noticed in (4.1), we can write dist(Xn,Hn)\operatorname{dist}(X_{n},H_{n}) as a sum of independent random variables, and then bound it using (4.7). ∎

Corollary 4.4 offers us exactly the missing piece (3.7) in our proof of the invertibility Theorem 3.2. This completes our analysis of invertibility of square matrices.

Applications

The applications of non-asymptotic theory of random matrices are numerous, and we cannot cover all of them in this note. Instead we concentrate on three different results pertaining to the classical random matrix theory (Circular Law), signal processing (compressed sensing), and geometric functional analysis and theoretical computer science (short Khinchin’s inequality and Kashin’s subspaces).

Asymptotic theory of random matrices provides an important source of heuristics for non-asymptotic results. We have seen an illustration of this in the analysis of the extreme singular values. This interaction between the asymptotic and non-asymptotic theories goes the other way as well, as good non-asymptotic bounds are sometimes crucial in proving the limit laws. One remarkable example of this is the circular law which we will discuss now.

Since det(BnzI)2=det(BnzI)(BnzI)|\text{det}(B_{n}-zI)|^{2}=\text{det}(B_{n}-zI)(B_{n}-zI)^{*}, this is the same as

where s1(n)(z)sn(n)(z)0s_{1}^{(n)}(z)\geq\ldots\geq s_{n}^{(n)}(z)\geq 0 are the eigenvalues of the Hermitian matrix (BnzI)(BnzI)(B_{n}-zI)(B_{n}-zI)^{*}, or in other words, the squares of the singular values of the matrix Vn=BnzIV_{n}=B_{n}-zI. Girko’s argument reduces the proof of the circular law to the convergence of real Stieltjes transforms, and thus to the behavior of the sum above. The logarithmic function is unbounded at and \infty. To control the behavior near \infty, one has to use the bound for the largest singular value of VnV_{n}, which is relatively easy. The analysis of the behavior near requires bounds on the smallest singular value of VnV_{n}, and is therefore more difficult.

Girko’s approach was implemented by Bai , who proved the circular law for random matrices whose entries have bounded sixth moment and bounded density. The bounded density condition was sufficient to take care of the smallest singular value problem. This result was the first manifestation of the universality of the circular law. Still, it did not cover some important classes of random matrices, in particular random Bernoulli matrices. The recent results on the smallest singular value led to a significant progress on establishing the universality of the circular law. A crucial step was done by Götze and Tikhomirov who extended the circular law to all subgaussian matrices using . In fact, the results of actually extended it to all random entries with bounded fourth moment. This was further extended to random variables having bounded moment 2+ε2+\varepsilon in . Finally, in Tao and Vu proved the Circular Law in full generality, with no assumptions besides the unit variance. Their approach was based on the smallest singular value bound from and a novel replacement principle which allowed them to treat the other singular values.

Compressed Sensing

Non-asymptotic random matrix theory provides a right context for the analysis of random measurements in the newly developed area of compressed sensing, see the ICM 2006 talk of Candes . Compressed sensing is an area of information theory and signal processing which studies efficient techniques to reconstruct a signal from a small number of measurements by utilizing the prior knowledge that the signal is sparse .

where x0=supp(x)\|x\|_{0}=|\operatorname{supp}(x)|. This optimization problem is highly non-convex and computationally intractable. So one considers the following convex relaxation of (5.1), which can be efficiently solved by convex programming methods:

One would then need to find conditions when problems (5.1) and (5.2) are equivalent. Candes and Tao showed that this occurs when the measurement matrix AA is a restricted isometry. For an integer sns\leq n, the restricted isometry constant δs(A)\delta_{s}(A) is the smallest number δ0\delta\geq 0 which satisfies

Geometrically, the restricted isometry property guarantees that the geometry of ss-sparse vectors xx is well preserved by the measurement matrix AA. In turns out that in this situation one can reconstruct xx from AxAx by the convex program (5.2):

Assume δ2sc\delta_{2s}\leq c. Then the solution of (5.2) equals xx whenever supp(x)s|\operatorname{supp}(x)|\leq s.

A proof with c=21c=\sqrt{2}-1 is given in ; the current record is c=0.472c=0.472 .

Restricted isometry property can be interpreted in terms of the extreme singular values of submatrices of AA. Indeed, (5.3) equivalently states that the inequality

holds for all m×sm\times s submatrices AIA_{I}, those formed by the columns of AA indexed by sets II of size ss. In light of Sections 2 and 3, it is not surprising that the best known restricted isometry matrices are random matrices. It is actually an open problem to construct deterministic restricted isometry matrices as in Theorem 5.2 below.

The following three types of random matrices are extensively used as measurement matrices in compressed sensing: Gaussian, Bernoulli, and Fourier. Here we summarize their restricted isometry properties, which have the common remarkable feature: the required number of measurements mm is roughly proportional to the sparsity level ss rather than the (possibly much larger) dimension nn.

Let m,n,sm,n,s be positive integers, ε,δ(0,1)\varepsilon,\delta\in(0,1), and let AA be an m×nm\times n measurement matrix.

1. Suppose the entries of AA are independent and identically distributed subgaussian random variables with zero mean and unit variance. Assume that

where CC depends only on ε\varepsilon, δ\delta, and the subgaussian moment. Then with probability at least 1ε1-\varepsilon, the matrix Aˉ=1mA\bar{A}=\frac{1}{\sqrt{m}}A is a restricted isometry with δs(Aˉ)δ\delta_{s}(\bar{A})\leq\delta.

2. Let AA be a random Fourier matrix obtained from the n×nn\times n discrete Fourier transform matrix by choosing mm rows independently and uniformly. Assume that

where CC depends only on ε\varepsilon and δ\delta. Then with probability at least 1ε1-\varepsilon, the matrix Aˉ=1nA\bar{A}=\frac{1}{\sqrt{n}}A is a restricted isometry with δs(Aˉ)δ\delta_{s}(\bar{A})\leq\delta.

For random subgaussian matrices this result was proved in by an ε\varepsilon-net argument, where one first checks the deviation inequality Ax221δ|\|Ax\|_{2}^{2}-1|\leq\delta with exponentially high probability for a fixed vector xx as in (5.3), and afterwards lets xx run over some fine net. For random Fourier matrices the problem is harder. It was first addressed in with a little higher exponent than in (5.4); the exponent 44 was obtained in , and it is conjectured that the optimal exponent is 11.

Short Khinchin’s inequality and Kashin’s subspaces

The average here is taken over all 2n2^{n} possible choices of signs ε\varepsilon (it is the same as the expectation with respect to independent Bernoulli random variables εj\varepsilon_{j}). Since the mid-seventies, the question was around whether Khinchin’s inequality holds for averages over some small sets of signs ε\varepsilon. A trivial lower bound follows by a dimension argument: such a set must contain at least nn points. Here we shall discuss only the case p=1p=1, which is of considerable interest for computer science. This problem can be stated more precisely as follows: as follows:

The case of small δ\delta is more delicate. For a random AA, the bound for β(δ)C\beta(\delta)\leq C can be obtained by the ε\varepsilon-net argument as before. However, an attempt to apply this argument for α(δ)\alpha(\delta) runs into to the same problems as for the smallest singular value. For any fixed δ>0\delta>0 the solution was first obtained first by Johnson and Schechtman who showed that there exists VV satisfying (5.5) with α(δ)=c1/δ\alpha(\delta)=c^{1/\delta}. In this was established for a random set VV (or a random matrix AA) with the same bound on α(δ)\alpha(\delta). Furthermore, the result remains valid even when δ\delta depends on nn, as long as δc/logn\delta\geq c/\log n. The proof uses the smallest singular value bound from in a crucial way. The bound on α(δ)\alpha(\delta) has been further improved in , also using the singular value approach. Finally, a theorem in asserts that for a random set VV the inequalities (5.5) hold with high probability for

Kashin’s subspaces turned out to be useful in theoretical computer science, in particular in the nearest neighbor search and in compressed sensing. At present no deterministic construction is known of such subspaces of dimension nn proportional to NN. The result of shows that a (1+δ)n×n\lfloor(1+\delta)n\rfloor\times n random Bernoulli matrix defines a Kashin’s subspace with α(δ)=cδ2\alpha(\delta)=c\delta^{2}. A random Bernoulli matrix is computationally easier to implement than a random Gaussian matrix, while the distance between the norms is not much worse than in the optimal case. At the same time, since the subspaces generated by a Bernoulli matrix are spanned by random vertices of the discrete cube, they have relatively simple structure, which is possible to analyze.

References