Introduction to the non-asymptotic analysis of random matrices

Roman Vershynin

1 Introduction

Random matrix theory studies properties of N×nN\times n matrices AA chosen from some distribution on the set of all matrices. As dimensions NN and nn grow to infinity, one observes that the spectrum of AA tends to stabilize. This is manifested in several limit laws, which may be regarded as random matrix versions of the central limit theorem. Among them is Wigner’s semicircle law for the eigenvalues of symmetric Gaussian matrices, the circular law for Gaussian matrices, the Marchenko-Pastur law for Wishart matrices W=AAW=A^{*}A where AA is a Gaussian matrix, the Bai-Yin and Tracy-Widom laws for the extreme eigenvalues of Wishart matrices WW. The books offer thorough introduction to the classical problems of random matrix theory and its fascinating connections.

The asymptotic regime where the dimensions N,nN,n\to\infty is well suited for the purposes of statistical physics, e.g. when random matrices serve as finite-dimensional models of infinite-dimensional operators. But in some other areas including statistics, geometric functional analysis, and compressed sensing, the limiting regime may not be very useful . Suppose, for example, that we ask about the largest singular value smax(A)s_{\max}(A) (i.e. the largest eigenvalue of (AA)1/2(A^{*}A)^{1/2}); to be specific assume that AA is an n×nn\times n matrix whose entries are independent standard normal random variables. The asymptotic random matrix theory answers this question as follows: the Bai-Yin law (see Theorem 5.31) states that

as the dimension nn\to\infty. Moreover, the limiting distribution of smax(A)s_{\max}(A) is known to be the Tracy-Widom law (see ). In contrast to this, a non-asymptotic answer to the same question is the following: in every dimension nn, one has

here CC is an absolute constant (see Theorems 5.32 and 5.39). The latter answer is less precise (because of an absolute constant CC) but more quantitative because for fixed dimensions nn it gives an exponential probability of success.For this specific model (Gaussian matrices),Theorems 5.32 and 5.35 even give a sharp absolute constant C2C\approx 2 here. But the result mentioned here is much more general as we will see later; it only requires independence of rows or columns of AA. This is the kind of answer we will seek in this text – guarantees up to absolute constants in all dimensions, and with large probability.

Tall matrices are approximate isometries

where KK is an appropriate normalization factor and δ1\delta\ll 1. Equivalently, this says that all the singular values of AA are close to each other:

where smin(A)s_{\min}(A) and smax(A)s_{\max}(A) denote the smallest and the largest singular values of AA. Yet equivalently, this means that tall matrices are well conditioned: the condition number of AA is κ(A)=smax(A)/smin(A)(1+δ)/(1δ)1\kappa(A)=s_{\max}(A)/s_{\min}(A)\leq(1+\delta)/(1-\delta)\approx 1.

In the asymptotic regime and for random matrices with independent entries, our heuristic is justified by Bai-Yin’s law, which is Theorem 5.31 below. Loosely speaking, it states that as the dimensions N,nN,n increase to infinity while the aspect ratio N/nN/n is fixed, we have

In these notes, we study N×nN\times n random matrices AA with independent rows or independent columns, but not necessarily independent entries. We develop non-asymptotic versions of (5.1) for such matrices, which should hold for all dimensions NN and nn. The desired results should have the form

with large probability, e.g. 1eN1-e^{-N}, where CC is an absolute constant.More accurately, we should expect C=O(1)C=O(1) to depend on easily computable quantities of the distribution, such as its moments. This will be clear from the context. For tall matrices, where NnN\gg n, both sides of this inequality would be close to each other, which would guarantee that AA is an approximate isometry.

Models and methods

We shall study quite general models of random matrices – those with independent rows or independent columns that are sampled from high-dimensional distributions. We will place either strong moment assumptions on the distribution (sub-gaussian growth of moments), or no moment assumptions at all (except finite variance). This leads us to four types of main results:

Matrices with independent sub-gaussian rows: Theorem 5.39

Matrices with independent heavy-tailed rows: Theorem 5.41

Matrices with independent sub-gaussian columns: Theorem 5.58

Matrices with independent heavy-tailed columns: Theorem 5.62

These four models cover many natural classes of random matrices that occur in applications, including random matrices with independent entries (Gaussian and Bernoulli in particular) and random sub-matrices of orthogonal matrices (random Fourier matrices in particular).

The analysis of these four models is based on a variety of tools of probability theory and geometric functional analysis, most of which have not been covered in the texts on the “classical” random matrix theory. The reader will learn basics on sub-gaussian and sub-exponential random variables, isotropic random vectors, large deviation inequalities for sums of independent random variables, extensions of these inequalities to random matrices, and several basic methods of high dimensional probability such as symmetrization, decoupling, and covering (ε\varepsilon-net) arguments.

Applications

In compressed sensing, the best known measurement matrices are random. A sufficient condition for a matrix to succeed for the purposes of compressed sensing is given by the restricted isometry property. Loosely speaking, this property demands that all sub-matrices of given size be well-conditioned. This fits well in the circle of problems of the non-asymptotic random matrix theory. Indeed, we will see in Section 5.6 that all basic models of random matrices are nice restricted isometries. These include Gaussian and Bernoulli matrices, more generally all matrices with sub-gaussian independent entries, and even more generally all matrices with sub-gaussian independent rows or columns. Also, the class of restricted isometries includes random Fourier matrices, more generally random sub-matrices of bounded orthogonal matrices, and even more generally matrices whose rows are independent samples from an isotropic distribution with uniformly bounded coordinates.

Related sources

This text is a tutorial rather than a survey, so we focus on explaining methods rather than results. This forces us to make some concessions in our choice of the subjects. Concentration of measure and its applications to random matrix theory are only briefly mentioned. For an introduction into concentration of measure suitable for a beginner, see and [49, Chapter 14]; for a thorough exposition see ; for connections with random matrices see . The monograph also offers an introduction into concentration of measure and related probabilistic methods in analysis and geometry, some of which we shall use in these notes.

We completely avoid the important (but more difficult) model of symmetric random matrices with independent entries on and above the diagonal. Starting from the work of Füredi and Komlos , the largest singular value (the spectral norm) of symmetric random matrices has been a subject of study in many works; see e.g. and the references therein.

We also did not even attempt to discuss sharp small deviation inequalities (of Tracy-Widom type) for the extreme eigenvalues. Both these topics and much more are discussed in the surveys , which serve as bridges between asymptotic and non-asymptotic problems in random matrix theory.

Because of the absolute constant CC in (5.2), our analysis of the smallest singular value (the “hard edge”) will only be useful for sufficiently tall matrices, where NC2nN\geq C^{2}n. For square and almost square matrices, the hard edge problem will be only briefly mentioned in Section 5.3. The surveys discuss this problem at length, and they offer a glimpse of connections to other problems of random matrix theory and additive combinatorics.

Many of the results and methods presented in these notes are known in one form or another. Some of them are published while some others belong to the folklore of probability in Banach spaces, geometric functional analysis, and related areas. When available, historic references are given in Section 5.7.

Acknowledgements

The author is grateful to the colleagues who made a number of improving suggestions for the earlier versions of the manuscript, in particular to Richard Chen, Subhroshekhar Ghosh, Alexander Litvak, Deanna Needell, Holger Rauhut, S V N Vishwanathan and the anonymous referees. Special thanks are due to Ulas Ayaz and Felix Krahmer who thoroughly read the entire text, and whose numerous comments led to significant improvements of this tutorial.

2 Preliminaries

The main object of our study will be an N×nN\times n matrix AA with real or complex entries. We shall state all results in the real case; the reader will be able to adjust them to the complex case as well. Usually but not always one should think of tall matrices AA, those for which Nn>1N\geq n>1. By passing to the adjoint matrix AA^{*}, many results can be carried over to “flat” matrices, those for which NnN\leq n.

It is often convenient to study AA through the n×nn\times n symmetric positive-semidefinite matrix the matrix AAA^{*}A. The eigenvalues of A:=AA|A|:=\sqrt{A^{*}A} are therefore non-negative real numbers. Arranged in a non-decreasing order, they are called the singular valuesIn the literature, singular values are also called s-numbers. of AA and denoted s1(A)sn(A)0s_{1}(A)\geq\cdots\geq s_{n}(A)\geq 0. Many applications require estimates on the extreme singular values

The smallest singular value is only of interest for tall matrices, since for N<nN<n one automatically has smin(A)=0s_{\min}(A)=0.

Equivalently, smax(A)s_{\max}(A) and smin(A)s_{\min}(A) are respectively the smallest number MM and the largest number mm such that

The extreme singular values can also be described in terms of the spectral norm of AA, which is by definition

(5.3) gives a link between the extreme singular values and the spectral norm:

where AA^{\dagger} denotes the pseudoinverse of AA; if AA is invertible then A=A1A^{\dagger}=A^{-1}.

2.2 Nets

Nets are convenient means to discretize compact sets. In our study we will mostly need to discretize the unit Euclidean sphere Sn1S^{n-1} in the definition of the spectral norm (5.4). Let us first recall a general definition of an ε\varepsilon-net.

Let (X,d)(X,d) be a metric space and let ε>0\varepsilon>0. A subset Nε\mathcal{N}_{\varepsilon} of XX is called an ε\varepsilon-net of XX if every point xXx\in X can be approximated to within ε\varepsilon by some point yNεy\in\mathcal{N}_{\varepsilon}, i.e. so that d(x,y)εd(x,y)\leq\varepsilon. The minimal cardinality of an ε\varepsilon-net of XX, if finite, is denoted N(X,ε)\mathcal{N}(X,\varepsilon) and is called the covering numberEquivalently, N(X,ε)\mathcal{N}(X,\varepsilon) is the minimal number of balls with radii ε\varepsilon and with centers in XX needed to cover XX. of XX (at scale ε\varepsilon).

From a characterization of compactness we remember that XX is compact if and only if N(X,ε)<\mathcal{N}(X,\varepsilon)<\infty for each ε>0\varepsilon>0. A quantitative estimate on N(X,ε)\mathcal{N}(X,\varepsilon) would give us a quantitative version of compactness of XX.In statistical learning theory and geometric functional analysis, logN(X,ε)\log\mathcal{N}(X,\varepsilon) is called the metric entropy of XX. In some sense it measures the “complexity” of metric space XX. Let us therefore take a simple example of a metric space, the unit Euclidean sphere Sn1S^{n-1} equipped with the Euclidean metricA similar result holds for the geodesic metric on the sphere, since for small ε\varepsilon these two distances are equivalent. d(x,y)=xy2d(x,y)=\|x-y\|_{2}, and estimate its covering numbers.

The unit Euclidean sphere Sn1S^{n-1} equipped with the Euclidean metric satisfies for every ε>0\varepsilon>0 that

This is a simple volume argument. Let us fix ε>0\varepsilon>0 and choose Nε\mathcal{N}_{\varepsilon} to be a maximal ε\varepsilon-separated subset of Sn1S^{n-1}. In other words, Nε\mathcal{N}_{\varepsilon} is such that d(x,y)εd(x,y)\geq\varepsilon for all x,yNεx,y\in\mathcal{N}_{\varepsilon}, xyx\neq y, and no subset of Sn1S^{n-1} containing Nε\mathcal{N}_{\varepsilon} has this property.One can in fact construct Nε\mathcal{N}_{\varepsilon} inductively by first selecting an arbitrary point on the sphere, and at each next step selecting a point that is at distance at least ε\varepsilon from those already selected. By compactness, this algorithm will terminate after finitely many steps and it will yield a set Nε\mathcal{N}_{\varepsilon} as we required.

The maximality property implies that Nε\mathcal{N}_{\varepsilon} is an ε\varepsilon-net of Sn1S^{n-1}. Indeed, otherwise there would exist xSn1x\in S^{n-1} that is at least ε\varepsilon-far from all points in Nε\mathcal{N}_{\varepsilon}. So Nε{x}\mathcal{N}_{\varepsilon}\cup\{x\} would still be an ε\varepsilon-separated set, contradicting the minimality property.

Moreover, the separation property implies via the triangle inequality that the balls of radii ε/2\varepsilon/2 centered at the points in Nε\mathcal{N}_{\varepsilon} are disjoint. On the other hand, all such balls lie in (1+ε/2)B2n(1+\varepsilon/2)B_{2}^{n} where B2nB_{2}^{n} denotes the unit Euclidean ball centered at the origin. Comparing the volume gives \operatorname{vol}\big{(}\frac{\varepsilon}{2}B_{2}^{n}\big{)}\cdot|\mathcal{N}_{\varepsilon}|\leq\operatorname{vol}\big{(}(1+\frac{\varepsilon}{2})B_{2}^{n}\big{)}. Since \operatorname{vol}\big{(}rB_{2}^{n}\big{)}=r^{n}\operatorname{vol}(B_{2}^{n}) for all r0r\geq 0, we conclude that Nε(1+ε2)n/(ε2)n=(1+2ε)n|\mathcal{N}_{\varepsilon}|\leq(1+\frac{\varepsilon}{2})^{n}/(\frac{\varepsilon}{2})^{n}=(1+\frac{2}{\varepsilon})^{n} as required. ∎

Nets allow us to reduce the complexity of computations with linear operators. One such example is the computation of the spectral norm. To evaluate the spectral norm by definition (5.4) one needs to take the supremum over the whole sphere Sn1S^{n-1}. However, one can essentially replace the sphere by its ε\varepsilon-net:

Let AA be an N×nN\times n matrix, and let Nε\mathcal{N}_{\varepsilon} be an ε\varepsilon-net of Sn1S^{n-1} for some ε[0,1)\varepsilon\in[0,1). Then

The lower bound in the conclusion follows from the definition. To prove the upper bound let us fix xSn1x\in S^{n-1} for which A=Ax2\|A\|=\|Ax\|_{2}, and choose yNεy\in\mathcal{N}_{\varepsilon} which approximates xx as xy2ε\|x-y\|_{2}\leq\varepsilon. By the triangle inequality we have AxAy2Axy2εA\|Ax-Ay\|_{2}\leq\|A\|\|x-y\|_{2}\leq\varepsilon\|A\|. It follows that

Taking maximum over all yNεy\in\mathcal{N}_{\varepsilon} in this inequality, we complete the proof. ∎

A similar result holds for symmetric n×nn\times n matrices AA, whose spectral norm can be computed via the associated quadratic form: A=supxSn1Ax,x\|A\|=\sup_{x\in S^{n-1}}|\langle Ax,x\rangle|. Again, one can essentially replace the sphere by its ε\varepsilon-net:

Let AA be a symmetric n×nn\times n matrix, and let Nε\mathcal{N}_{\varepsilon} be an ε\varepsilon-net of Sn1S^{n-1} for some ε[0,1)\varepsilon\in[0,1). Then

Let us choose xSn1x\in S^{n-1} for which A=Ax,x\|A\|=|\langle Ax,x\rangle|, and choose yNεy\in\mathcal{N}_{\varepsilon} which approximates xx as xy2ε\|x-y\|_{2}\leq\varepsilon. By the triangle inequality we have

It follows that Ay,yAx,x2εA=(12ε)A|\langle Ay,y\rangle|\geq|\langle Ax,x\rangle|-2\varepsilon\|A\|=(1-2\varepsilon)\|A\|. Taking the maximum over all yNεy\in\mathcal{N}_{\varepsilon} in this inequality completes the proof. ∎

2.3 Sub-gaussian random variables

In this section we introduce the class of sub-gaussian random variables,It would be more rigorous to say that we study sub-gaussian probability distributions. The same concerns some other properties of random variables and random vectors we study later in this text. However, it is convenient for us to focus on random variables and vectors because we will form random matrices out of them. those whose distributions are dominated by the distribution of a centered gaussian random variable. This is a convenient and quite wide class, which contains in particular the standard normal and all bounded random variables.

Let us briefly recall some of the well known properties of the standard normal random variable XX. The distribution of XX has density 12πex2/2\frac{1}{\sqrt{2\pi}}e^{-x^{2}/2} and is denoted N(0,1)N(0,1). Estimating the integral of this density between tt and \infty one checks that the tail of a standard normal random variable XX decays super-exponentially:

see e.g. [26, Theorem 1.4] for a more precise two-sided inequality. The absolute moments of XX can be computed as

The moment generating function of XX equals

Now let XX be a general random variable. We observe that these three properties are equivalent – a super-exponential tail decay like in (5.5), the moment growth (5.6), and the growth of the moment generating function like in (5.7). We will then focus on the class of random variables that satisfy these properties, which we shall call sub-gaussian random variables.

Let XX be a random variable. Then the following properties are equivalent with parameters Ki>0K_{i}>0 differing from each other by at most an absolute constant factor.The precise meaning of this equivalence is the following. There exists an absolute constant CC such that property ii implies property jj with parameter KjCKiK_{j}\leq CK_{i} for any two properties i,j=1,2,3i,j=1,2,3.

Taking the pp-th root yields property 2 with a suitable absolute constant K2K_{2}.

2. \Rightarrow 3. Assume property 2 holds. As before, by homogeneity we may assume that K2=1K_{2}=1. Let c>0c>0 be a sufficiently small absolute constant. Writing the Taylor series of the exponential function, we obtain

3. \Rightarrow 1. Assume property 3 holds. As before we may assume that K3=1K_{3}=1. Exponentiating and using Markov’s inequalityThis simple argument is sometimes called exponential Markov’s inequality. and then property 3, we have

4. \Rightarrow 1. Assume property 4 holds; we can also assume that K4=1K_{4}=1. Let λ>0\lambda>0 be a parameter to be chosen later. By exponential Markov inequality, and using the bound on the moment generating function given in property 4, we obtain

The constants 11 and ee in properties 1 and 3 respectively are chosen for convenience. Thus the value 11 can be replaced by any positive number and the value ee can be replaced by any number greater than 11.

A random variable XX that satisfies one of the equivalent properties 1 – 3 in Lemma 5.5 is called a sub-gaussian random variable. The sub-gaussian norm of XX, denoted Xψ2\|X\|_{\psi_{2}}, is defined to be the smallest K2K_{2} in property 2. In other words,The sub-gaussian norm is also called ψ2{\psi_{2}} norm in the literature.

The class of sub-gaussian random variables on a given probability space is thus a normed space. By Lemma 5.5, every sub-gaussian random variable XX satisfies:

where C,c>0C,c>0 are absolute constants. Moreover, up to absolute constant factors, Xψ2\|X\|_{\psi_{2}} is the smallest possible number in each of these inequalities.

Classical examples of sub-gaussian random variables are Gaussian, Bernoulli and all bounded random variables.

(Gaussian): A standard normal random variable XX is sub-gaussian with Xψ2C\|X\|_{\psi_{2}}\leq C where CC is an absolute constant. This follows from (5.6). More generally, if XX is a centered normal random variable with variance σ2\sigma^{2}, then XX is sub-gaussian with Xψ2Cσ\|X\|_{\psi_{2}}\leq C\sigma.

(Bounded): More generally, consider any bounded random variable XX, thus XM|X|\leq M almost surely for some MM. Then XX is a sub-gaussian random variable with Xψ2M\|X\|_{\psi_{2}}\leq M. We can write this more compactly as Xψ2X\|X\|_{\psi_{2}}\leq\|X\|_{\infty}.

A remarkable property of the normal distribution is rotation invariance. Given a finite number of independent centered normal random variables XiX_{i}, their sum iXi\sum_{i}X_{i} is also a centered normal random variable, obviously with Var(iXi)=iVar(Xi)\operatorname{Var}(\sum_{i}X_{i})=\sum_{i}\operatorname{Var}(X_{i}). Rotation invariance passes onto sub-gaussian random variables, although approximately:

Consider a finite number of independent centered sub-gaussian random variables XiX_{i}. Then iXi\sum_{i}X_{i} is also a centered sub-gaussian random variable. Moreover,

Using the equivalence of properties 2 and 4 in Lemma 5.5 we conclude that iXiψ2C1K\|\sum_{i}X_{i}\|_{\psi_{2}}\leq C_{1}K where C1C_{1} is an absolute constant. The proof is complete. ∎

The rotation invariance immediately yields a large deviation inequality for sums of independent sub-gaussian random variables:

The rotation invariance (Lemma 5.9) implies the bound iaiXiψ22Ciai2Xiψ22CK2a22\|\sum_{i}a_{i}X_{i}\|_{\psi_{2}}^{2}\leq C\sum_{i}a_{i}^{2}\|X_{i}\|_{\psi_{2}}^{2}\leq CK^{2}\|a\|_{2}^{2}. Property (5.10) yields the required tail decay. ∎

Using moment growth (5.11) instead of the tail decay (5.10), we immediately obtain from Lemma 5.9 a general form of the well known Khintchine inequality:

Let XiX_{i} be a finite number of independent sub-gaussian random variables with zero mean, unit variance, and Xiψ2K\|X_{i}\|_{{\psi_{2}}}\leq K. Then, for every sequence of coefficients aia_{i} and every exponent p2p\geq 2 we have

2.4 Sub-exponential random variables

Although the class of sub-gaussian random variables is natural and quite wide, it leaves out some useful random variables which have tails heavier than gaussian. One such example is a standard exponential random variable – a non-negative random variable with exponential tail decay

To cover such examples, we consider a class of sub-exponential random variables, those with at least an exponential tail decay. With appropriate modifications, the basic properties of sub-gaussian random variables hold for sub-exponentials. In particular, a version of Lemma 5.5 holds with a similar proof for sub-exponential properties, except for property 4 of the moment generating function. Thus for a random variable XX the following properties are equivalent with parameters Ki>0K_{i}>0 differing from each other by at most an absolute constant factor:

A random variable XX that satisfies one of the equivalent properties (5.14) – (5.16) is called a sub-exponential random variable. The sub-exponential norm of XX, denoted Xψ1\|X\|_{\psi_{1}}, is defined to be the smallest parameter K2K_{2}. In other words,

A random variable XX is sub-gaussian if and only if X2X^{2} is sub-exponential. Moreover,

This follows easily from the definition. ∎

The moment generating function of a sub-exponential random variable has a similar upper bound as in the sub-gaussian case (property 4 in Lemma 5.5). The only real difference is that the bound only holds in a neighborhood of zero rather than on the whole real line. This is inevitable, as the moment generating function of an exponential random variable (5.13) does not exist for t1t\geq 1.

Let XX be a centered sub-exponential random variable. Then, for tt such that tc/Xψ1|t|\leq c/\|X\|_{\psi_{1}}, one has

Sub-exponential random variables satisfy a large deviation inequality similar to the one for sub-gaussians (Proposition 5.10). The only significant difference is that two tails have to appear here – a gaussian tail responsible for the central limit theorem, and an exponential tail coming from the tails of each term.

Without loss of generality, we assume that K=1K=1 by replacing XiX_{i} with Xi/KX_{i}/K and tt with t/Kt/K. We use the exponential Markov inequality for the sum S=iaiXiS=\sum_{i}a_{i}X_{i} and with a parameter λ>0\lambda>0:

If λc/a|\lambda|\leq c/\|a\|_{\infty} then λaic|\lambda a_{i}|\leq c for all ii, so Lemma 5.15 yields

Choosing λ=min(t/2Ca22,c/a)\lambda=\min(t/2C\|a\|_{2}^{2},\,c/\|a\|_{\infty}), we obtain that

Let X1,,XNX_{1},\ldots,X_{N} be independent centered sub-exponential random variables, and let K=maxiXiψ1K=\max_{i}\|X_{i}\|_{\psi_{1}}. Then, for every ε0\varepsilon\geq 0, we have

This follows from Proposition 5.16 for ai=1a_{i}=1 and t=εNt=\varepsilon N. ∎

2.5 Isotropic random vectors

We generalize the concepts of sub-gaussian random variables to higher dimensions using one-dimensional marginals.

The definitions of isotropic and sub-gaussian distributions suggest that more generally, natural properties of high-dimensional distributions may be defined via one-dimensional marginals. This is a natural way to generalize properties of random variables to random vectors. For example, we shall call a random vector sub-exponential if all of its one-dimensional marginals are sub-exponential random variables, etc.

This is a direct consequence of the rotation invariance principle, Lemma 5.9. Indeed, for every x=(x1,,xn)Sn1x=(x_{1},\ldots,x_{n})\in S^{n-1} we have

where we used that i=1nxi2=1\sum_{i=1}^{n}x_{i}^{2}=1. This completes the proof. ∎

Let us analyze the basic examples of random vectors introduced earlier in Example 5.21.

(Gaussian, Bernoulli): Gaussian and Bernoulli random vectors are sub-gaussian; their sub-gaussian norms are bounded by an absolute constant. These are particular cases of Lemma 5.24.

(Spherical): A spherical random vector is also sub-gaussian; its sub-gaussian norm is bounded by an absolute constant. Unfortunately, this does not follow from Lemma 5.24 because the coordinates of the spherical vector are not independent. Instead, by rotation invariance, the claim clearly follows from the following geometric fact. For every ε0\varepsilon\geq 0, the spherical cap {xSn1:  x1>ε}\{x\in S^{n-1}:\;x_{1}>\varepsilon\} makes up at most exp(ε2n/2)\exp(-\varepsilon^{2}n/2) proportion of the total area on the sphere.This fact about spherical caps may seem counter-intuitive. For example, for ε=0.1\varepsilon=0.1 the cap looks similar to a hemisphere, but the proportion of its area goes to zero very fast as dimension nn increases. This is a starting point of the study of the concentration of measure phenomenon, see . This can be proved directly by integration, and also by elementary geometric considerations [9, Lemma 2.2].

(Coordinate): Although the coordinate random vector XX is formally sub-gaussian as its support is finite, its sub-gaussian norm is too big: Xψ2=n1\|X\|_{\psi_{2}}=\sqrt{n}\gg 1. So we would not think of XX as a sub-gaussian random vector.

2.6 Sums of independent random matrices

For p=p=\infty, the Schatten norm equals the spectral norm A=maxinsi(A)\|A\|=\max_{i\leq n}s_{i}(A). Using this one can quickly check that already for p=lognp=\log n the Schatten and spectral norms are equivalent: ACpnAeACpn\|A\|_{C_{p}^{n}}\leq\|A\|\leq e\|A\|_{C_{p}^{n}}.

Let A1,,ANA_{1},\ldots,A_{N} be self-adjoint n×nn\times n matrices and ε1,,εN\varepsilon_{1},\ldots,\varepsilon_{N} be independent symmetric Bernoulli random variables. Then, for every 2p<2\leq p<\infty, we have

The scalar case of this result, for n=1n=1, recovers the classical Khintchine inequality, Corollary 5.12, for Xi=εiX_{i}=\varepsilon_{i}.

By the equivalence of Schatten and spectral norms for p=lognp=\log n, a version of non-commutative Khintchine inequality holds for the spectral norm:

where C1C_{1} is an absolute constant. The logarithmic factor is unfortunately essential; it role will be clear when we discuss applications of this result to random matrices in the next sections.

Ahlswede and Winter pioneered a different approach to matrix-valued inequalities in probability theory, which was based on trace inequalities like Golden-Thompson inequality. A development of this idea leads to remarkably sharp results. We quote one such inequality from :

Consider a finite sequence XiX_{i} of independent centered self-adjoint random n×nn\times n matrices. Assume we have for some numbers KK and σ\sigma that

This is a direct matrix generalization of a classical Bernstein’s inequality for bounded random variables. To compare it with our version of Bernstein’s inequality for sub-exponentials, Proposition 5.16, note that the probability bound in (5.19) is equivalent to 2n\cdot\exp\big{[}-c\min\big{(}\frac{t^{2}}{\sigma^{2}},\frac{t}{K}\big{)}\big{]} where c>0c>0 is an absolute constant. In both results we see a mixture of gaussian and exponential tails.

3 Random matrices with independent entries

We are ready to study the extreme singular values of random matrices. In this section, we consider the classical model of random matrices whose entries are independent and centered random variables. Later we will study the more difficult models where only the rows or the columns are independent.

The reader may keep in mind some classical examples of N×nN\times n random matrices with independent entries. The most classical example is the Gaussian random matrix AA whose entries are independent standard normal random variables. In this case, the n×nn\times n symmetric matrix AAA^{*}A is called Wishart matrix; it is a higher-dimensional version of chi-square distributed random variables.

The simplest example of discrete random matrices is the Bernoulli random matrix AA whose entries are independent symmetric Bernoulli random variables. In other words, Bernoulli random matrices are distributed uniformly in the set of all N×nN\times n matrices with ±1\pm 1 entries.

Consider an N×nN\times n random matrix AA whose entries are independent centered identically distributed random variables. By now, the limiting behavior of the extreme singular values of AA, as the dimensions N,nN,n\to\infty, is well understood:

Let A=AN,nA=A_{N,n} be an N×nN\times n random matrix whose entries are independent copies of a random variable with zero mean, unit variance, and finite fourth moment. Suppose that the dimensions NN and nn grow to infinity while the aspect ratio n/Nn/N converges to a constant in $$. Then

As we pointed out in the introduction, our program is to find non-asymptotic versions of Bai-Yin’s law. There is precisely one model of random matrices, namely Gaussian, where an exact non-asymptotic result is known:

Let AA be an N×nN\times n matrix whose entries are independent standard normal random variables. Then

The proof of the upper bound, which we borrowed from , is based on Slepian’s comparison inequality for Gaussian processes.Recall that a Gaussian process (Xt)tT(X_{t})_{t\in T} is a collection of centered normal random variables XtX_{t} on the same probability space, indexed by points tt in an abstract set TT.

Therefore Lemma 5.33 applies, and it yields the required bound

While Theorem 5.32 is about the expectation of singular values, it also yields a large deviation inequality for them. It can be deduced formally by using the concentration of measure in the Gauss space.

Let AA be an N×nN\times n matrix whose entries are independent standard normal random variables. Then for every t0t\geq 0, with probability at least 12exp(t2/2)1-2\exp(-t^{2}/2) one has

Later in these notes, we find it more convenient to work with the n×nn\times n positive-definite symmetric matrix AAA^{*}A rather than with the original N×nN\times n matrix AA. Observe that the normalized matrix Aˉ=1NA\bar{A}=\frac{1}{\sqrt{N}}A is an approximate isometry (which is our goal) if and only if AˉAˉ\bar{A}^{*}\bar{A} is an approximate identity:

Conversely, if BB satisfies (5.21) for some δ>0\delta>0 then BBI3max(δ,δ2)\|B^{*}B-I\|\leq 3\max(\delta,\delta^{2}).

Inequality (5.20) holds if and only if \big{|}\|Bx\|_{2}^{2}-1\big{|}\leq\max(\delta,\delta^{2}) for all xSn1x\in S^{n-1}. Similarly, (5.21) holds if and only if \big{|}\|Bx\|_{2}-1\big{|}\leq\delta for all xSn1x\in S^{n-1}. The conclusion then follows from the elementary inequality

Lemma 5.36 reduces our task of proving inequalities (5.2) to showing an equivalent (but often more convenient) bound

3.2 General random matrices with independent entries

Now we pass to a more general model of random matrices whose entries are independent centered random variables with some general distribution (not necessarily normal). The largest singular value (the spectral norm) can be estimated by Latala’s theorem for general random matrices with non-identically distributed entries:

Let AA be a random matrix whose entries aija_{ij} are independent centered random variables with finite fourth moment. Then

If the variance and the fourth moments of the entries are uniformly bounded, then Latala’s result yields smax(A)=O(N+n)s_{\max}(A)=O(\sqrt{N}+\sqrt{n}). This is slightly weaker than our goal (5.2), which is smax(A)=N+O(n)s_{\max}(A)=\sqrt{N}+O(\sqrt{n}) but still satisfactory for most applications. Results of the latter type will appear later in the more general model of random matrices with independent rows or columns.

Similarly, our goal (5.2) for the smallest singular value is smin(A)NO(n)s_{\min}(A)\geq\sqrt{N}-O(\sqrt{n}). Since the singular values are non-negative anyway, such inequality would only be useful for sufficiently tall matrices, NnN\gg n. For almost square and square matrices, estimating the smallest singular value (known also as the hard edge of spectrum) is considerably more difficult. The progress on estimating the hard edge is summarized in . If AA has independent entries, then indeed smin(A)c(Nn)s_{\min}(A)\geq c(\sqrt{N}-\sqrt{n}), and the following is an optimal probability bound:

Let AA be an n×nn\times n random matrix whose entries are independent identically distributed subgaussian random variables with zero mean and unit variance. Then for ε0\varepsilon\geq 0,

where C>0C>0 and c(0,1)c\in(0,1) depend only on the subgaussian norm of the entries.

This result gives an optimal bound for square matrices as well (N=nN=n).

4 Random matrices with independent rows

The picture will vary slightly depending on whether the rows of AA are sub-gaussian or have arbitrary distribution. For heavy-tailed distributions, an extra logarithmic factor has to appear in our desired inequality (5.2). The analysis of sub-gaussian and heavy-tailed matrices will be completely different.

There is an abundance of examples where the results of this section may be useful. They include all matrices with independent entries, whether sub-gaussian such as Gaussian and Bernoulli, or completely general distributions with mean zero and unit variance. In the latter case one is able to surpass the fourth moment assumption which is necessary in Bai-Yin’s law, Theorem 5.31.

Other examples of interest come from non-product distributions, some of which we saw in Example 5.21. Sampling from discrete objects (matrices and frames) fits well in this framework, too. Given a deterministic matrix BB, one puts a uniform distribution on the set of the rows of BB and creates a random matrix AA as before – by sampling some NN random rows from BB. Applications to sampling will be discussed in Section 5.4.4.

The following result goes in the direction of our goal (5.2) for random matrices with independent sub-gaussian rows.

Here C=CKC=C_{K}, c=cK>0c=c_{K}>0 depend only on the subgaussian norm K=maxiAiψ2K=\max_{i}\|A_{i}\|_{\psi_{2}} of the rows.

This result is a general version of Corollary 5.35 (up to absolute constants); instead of independent Gaussian entries we allow independent sub-gaussian rows. This of course covers all matrices with independent sub-gaussian entries such as Gaussian and Bernoulli. It also applies to some natural matrices whose entries are not independent. One such example is a matrix whose rows are independent spherical random vectors (Example 5.25).

The proof is a basic version of a covering argument, and it has three steps. We need to control Ax2\|Ax\|_{2} for all vectors xx on the unit sphere Sn1S^{n-1}. To this end, we discretize the sphere using a net N\mathcal{N} (the approximation step), establish a tight control of Ax2\|Ax\|_{2} for every fixed vector xNx\in\mathcal{N} with high probability (the concentration step), and finish off by taking a union bound over all xx in the net. The concentration step will be based on the deviation inequality for sub-exponential random variables, Corollary 5.17.

Step 1: Approximation. Recalling Lemma 5.36 for the matrix B=A/NB=A/\sqrt{N} we see that the conclusion of the theorem is equivalent to

Using Lemma 5.4, we can evaluate the operator norm in (5.23) on a 14\frac{1}{4}-net N\mathcal{N} of the unit sphere Sn1S^{n-1}:

So to complete the proof it suffices to show that, with the required probability,

By Lemma 5.2, we can choose the net N\mathcal{N} so that it has cardinality N9n|\mathcal{N}|\leq 9^{n}.

Step 2: Concentration. Let us fix any vector xSn1x\in S^{n-1}. We can express Ax22\|Ax\|_{2}^{2} as a sum of independent random variables

where the last inequality follows by the definition of δ\delta and using the inequality (a+b)2a2+b2(a+b)^{2}\geq a^{2}+b^{2} for a,b0a,b\geq 0.

Step 3: Union bound. Taking the union bound over all vectors xx in the net N\mathcal{N} of cardinality N9n|\mathcal{N}|\leq 9^{n}, we obtain

where the second inequality follows for C=CKC=C_{K} sufficiently large, e.g. C=K2ln9/c1C=K^{2}\sqrt{\ln 9/c_{1}}. As we noted in Step 1, this completes the proof of the theorem. ∎

Here as before C=CKC=C_{K}, c=cK>0c=c_{K}>0 depend only on the subgaussian norm K=maxiAiψ2K=\max_{i}\|A_{i}\|_{\psi_{2}} of the rows. This result is a general version of (5.23). It follows by a straighforward modification of the argument of Theorem 5.39.

A more natural, multiplicative form of (5.25) is the following. Assume that Σ1/2Ai\Sigma^{-1/2}A_{i} are isotropic sub-gaussian random vectors, and let KK be the maximum of their sub-gaussian norms. Then for every t0t\geq 0, the following inequality holds with probability at least 12exp(ct2)1-2\exp(-ct^{2}):

Here again C=CKC=C_{K}, c=cK>0c=c_{K}>0. This result follows from Theorem 5.39 applied to the isotropic random vectors Σ1/2Ai\Sigma^{-1/2}A_{i}.

4.2 Heavy-tailed rows

The class of sub-gaussian random variables in Theorem 5.39 may sometimes be too restrictive in applications. For example, if the rows of AA are independent coordinate or frame random vectors (Examples 5.21 and 5.25), they are poorly sub-gaussian and Theorem 5.39 is too weak. In such cases, one would use the following result instead, which operates in remarkable generality.

with probability at least 12nexp(ct2)1-2n\cdot\exp(-ct^{2}), where c>0c>0 is an absolute constant.

with probability at least 12nexp(ct2)1-2n\cdot\exp(-c^{\prime}t^{2}). This is a form of our desired inequality (5.2) for heavy-tailed matrices. We shall discuss this more after the proof.

We shall use the non-commutative Bernstein’s inequality, Theorem 5.29.

We express this random matrix as a sum of independent random matrices:

note that XiX_{i} are independent centered n×nn\times n random matrices.

We estimate the range of XiX_{i} using that Ai2m\|A_{i}\|_{2}\leq\sqrt{m} and m1m\geq 1:

where we again used that m1m\geq 1. This yieldsHere the seemingly crude application of triangle inequality is actually not so loose. If the rows AiA_{i} are identically distributed, then so are Xi2X_{i}^{2}, which makes the triangle inequality above into an equality.

Step 3: Application of the non-commutative Bernstein’s inequality. Applying Theorem 5.29 (see Remark 5.30) and recalling the definitions of ε\varepsilon and δ\delta in (5.29), we we bound the probability in question as

Inequality (5.28) fits our goal (5.2), but not quite. The reason is that the probability bound is only non-trivial if tClognt\geq C\sqrt{\log n}. Therefore, in reality Theorem 5.41 asserts that

with probability, say 0.90.9. This achieves our goal (5.2) up to a logarithmic factor.

The logarithmic factor can not be removed from (5.31) for some heavy-tailed distributions. Consider for instance the coordinate distribution introduced in Example 5.21. In order that smin(A)>0s_{\min}(A)>0 there must be no zero columns in AA. Equivalently, each coordinate vector e1,,ene_{1},\ldots,e_{n} must be picked at least once in NN independent trials (each row of AA picks an independent coordinate vector). Recalling the classical coupon collector’s problem, one must make at least NCnlognN\geq Cn\log n trials to make this occur with high probability. Thus the logarithm is necessary in the left hand side of (5.31).This argument moreover shows the optimality of the probability bound in Theorem 5.41. For example, for t=N/2nt=\sqrt{N}/2\sqrt{n} the conclusion (5.28) implies that AA is well conditioned (i.e. N/2smin(A)smax(A)2N\sqrt{N}/2\leq s_{\min}(A)\leq s_{\max}(A)\leq 2\sqrt{N}) with probability 1nexp(cN/n)1-n\cdot\exp(-cN/n). On the other hand, by the coupon collector’s problem we estimate the probability that smin(A)>0s_{\min}(A)>0 as 1n(11n)N1nexp(N/n)1-n\cdot(1-\frac{1}{n})^{N}\approx 1-n\cdot\exp(-N/n).

A version of Theorem 5.41 holds for general, non-isotropic distributions. It is convenient to state it in terms of the equivalent estimate (5.29):

Here c>0c>0 is an absolute constant. In particular, this inequality yields

Taking square roots and multiplying both sides by N\sqrt{N}, we obtain (5.33). ∎

The almost sure boundedness requirement in Theorem 5.41 may sometimes be too restrictive in applications, and it can be relaxed to a bound in expectation:

The proof of this result is similar to that of Theorem 5.41, except that this time we will use Rudelson’s Corollary 5.28 instead of matrix Bernstein’s inequality. To this end, we need a link to symmetric Bernoulli random variables. This is provided by a general symmetrization argument:

Let (Xi)(X_{i}) be a finite sequence of independent random vectors valued in some Banach space, and (εi)(\varepsilon_{i}) be independent symmetric Bernoulli random variables. Then

We will also need a probabilistic version of Lemma 5.36 on approximate isometries. The proof of that lemma was based on the elementary inequality z21max(z1,z12)|z^{2}-1|\geq\max(|z-1|,|z-1|^{2}) for z0z\geq 0. Here is a probabilistic version:

Step 1: Application of Rudelson’s inequality. As in the proof of Theorem 5.41, we are going to control

where we used Symmetrization Lemma 5.46 with independent symmetric Bernoulli random variables εi\varepsilon_{i} (which are independent of AA as well). The expectation in the right hand side is taken both with respect to the random matrix AA and the signs (εi)(\varepsilon_{i}). Taking first the expectation with respect to (εi)(\varepsilon_{i}) (conditionally on AA) and afterwards the expectation with respect to AA, we obtain by Rudelson’s inequality (Corollary 5.28) that

This inequality is easy to solve in EE. Indeed, considering the cases E1E\leq 1 and E>1E>1 separately, we conclude that

Step 2: Diagonalization. Diagonalizing the matrix AAA^{*}A one checks that

(we replaced the expectation of maximum by the maximum of expectations). Using Lemma 5.47 separately for the two terms on the left hand side, we obtain

Multiplying both sides by N\sqrt{N} completes the proof. ∎

In a way similar to Theorem 5.44 we note that a version of Theorem 5.45 holds for general, non-isotropic distributions.

Here CC is an absolute constant. In particular, this inequality yields

The first part follows by a simple modification of the proof of Theorem 5.45. The second part follows from the first like in Theorem 5.44. ∎

4.3 Applications to estimating covariance matrices

The simplest way to estimate Σ\Sigma is to take some NN independent samples XiX_{i} from the distribution and form the sample covariance matrix ΣN=1Ni=1NXiXi\Sigma_{N}=\frac{1}{N}\sum_{i=1}^{N}X_{i}\otimes X_{i}. By the law of large numbers, ΣNΣ\Sigma_{N}\to\Sigma almost surely as NN\to\infty. So, taking sufficiently many samples we are guaranteed to estimate the covariance matrix as well as we want. This, however, does not address the quantitative aspect: what is the minimal sample size NN that guarantees approximation with a given accuracy?

The relation of this question to random matrix theory becomes clear when we arrange the samples Xi=:AiX_{i}=:A_{i} as rows of the N×nN\times n random matrix AA. Then the sample covariance matrix is expressed as ΣN=1NAA\Sigma_{N}=\frac{1}{N}A^{*}A. Note that AA is a matrix with independent rows but usually not independent entries (unless we sample from a product distribution). We worked out the analysis of such matrices in Section 5.4, separately for sub-gaussian and general distributions. As an immediate consequence of Theorem 5.39, we obtain:

Here C=CKC=C_{K} depends only on the sub-gaussian norm K=Xψ2K=\|X\|_{\psi_{2}} of a random vector taken from this distribution.

It follows from (5.25) that for every s0s\geq 0, with probability at least 12exp(cs2)1-2\exp(-cs^{2}) we have ΣNΣmax(δ,δ2)\|\Sigma_{N}-\Sigma\|\leq\max(\delta,\delta^{2}) where δ=Cn/N+s/N\delta=C\sqrt{n/N}+s/\sqrt{N}. The conclusion follows for s=Ctns=C^{\prime}t\sqrt{n} where C=CKC^{\prime}=C^{\prime}_{K} is sufficiently large. ∎

Summarizing, Corollary 5.50 shows that the sample size

A weak point of Corollary 5.50 is that the sub-gaussian norm KK may in turn depend on Σ\|\Sigma\|.

Finally, Theorem 5.44 yields a similar estimation result for arbitrary distributions, possibly heavy-tailed:

It follows from Theorem 5.44 that for every s0s\geq 0, with probability at least 1nexp(cs2)1-n\cdot\exp(-cs^{2}) we have ΣNΣmax(Σ1/2δ,δ2)\|\Sigma_{N}-\Sigma\|\leq\max(\|\Sigma\|^{1/2}\delta,\delta^{2}) where δ=sm/N\delta=s\sqrt{m/N}. Therefore, if N(s/ε)2Σ1mN\geq(s/\varepsilon)^{2}\|\Sigma\|^{-1}m then ΣNΣεΣ\|\Sigma_{N}-\Sigma\|\leq\varepsilon\|\Sigma\|. The conclusion follows with s=Ctlogns=C^{\prime}t\sqrt{\log n} where CC^{\prime} is a sufficiently large absolute constant. ∎

We can summarize this discussion in the following way: the sample size

Without the boundedness assumption on the distribution, Corollary 5.52 may fail. The reasoning is the same as in Remark 5.42: for an isotropic distribution which is highly concentrated at the origin, the sample covariance matrix will likely equal .

A different way to enforce the boundedness assumption is to reject any sample points XiX_{i} that fall outside the centered ball of radius m\sqrt{m}. This is equivalent to sampling from the conditional distribution inside the ball. The conditional distribution satisfies the boundedness requirement, so the results discussed above provide a good covariance estimation for it. In many cases, this estimate works even for the original distribution – namely, if only a small part of the distribution lies outside the ball of radius m\sqrt{m}. We leave the details for the interested reader; see e.g. .

4.4 Applications to random sub-matrices and sub-frames

The absence of any moment hypotheses on the distribution in Section 5.4.2 (except finite variance) makes these results especially relevant for discrete distributions. One such situation arises when one wishes to sample entries or rows from a given matrix BB, thereby creating a random sub-matrix AA. It is a big program to understand what we can learn about BB by seeing AA, see . In other words, we ask – what properties of BB pass onto AA? Here we shall only scratch the surface of this problem: we notice that random sub-matrices of certain size preserve the property of being an approximate isometry.

Consider an M×nM\times n matrix BB such thatThe first hypothesis says BB=MIB^{*}B=MI. Equivalently, Bˉ:=1MB\bar{B}:=\frac{1}{\sqrt{M}}B is an isometry, i.e. Bˉx2=x2\|\bar{B}x\|_{2}=\|x\|_{2} for all xx. Equivalently, the columns of Bˉ\bar{B} are orthonormal. smin(B)=smax(B)=Ms_{\min}(B)=s_{\max}(B)=\sqrt{M}. Let mm be such that all rows BiB_{i} of BB satisfy Bi2m\|B_{i}\|_{2}\leq\sqrt{m}. Let AA be an N×nN\times n matrix obtained by sampling NN random rows from BB uniformly and independently. Then for every t0t\geq 0, with probability at least 12nexp(ct2)1-2n\cdot\exp(-ct^{2}) one has

Note that the conclusion of Corollary 5.55 does not depend on the dimension MM of the ambient matrix BB. This happens because this result is a specific version of sampling from a discrete isotropic distribution (uniform on the rows of BB), where size MM of the support of the distribution is irrelevant.

The hypothesis of Corollary 5.55 impliesTo recall why this is true, take trace of both sides in the identity I=1Mi=1MBiBiI=\frac{1}{M}\sum_{i=1}^{M}B_{i}\otimes B_{i}. that 1Mi=1MBi22=n\frac{1}{M}\sum_{i=1}^{M}\|B_{i}\|_{2}^{2}=n. Hence by Markov’s inequality, most of the rows BiB_{i} satisfy Bi2=O(n)\|B_{i}\|_{2}=O(\sqrt{n}). This indicates that Corollary 5.55 would be often used with m=O(n)m=O(n). Also, to ensure a positive probability of success, the useful magnitude of tt would be tlognt\sim\sqrt{\log n}. With this in mind, the extremal singular values of AA will be close to each other (and to N\sqrt{N}) if Nt2mnlognN\gg t^{2}m\sim n\log n.

Summarizing, Corollary 5.55 states that a random O(nlogn)×nO(n\log n)\times n sub-matrix of an M×nM\times n isometry is an approximate isometry.For the purposes of compressed sensing, we shall study the more difficult uniform problem for random sub-matrices in Section 5.6. There BB itself will be chosen as a column sub-matrix of a given M×MM\times M matrix (such as DFT), and one will need to control all such BB simultaneously, see Example 5.73.

with bounds A=(1ε)NA=(1-\varepsilon)N, B=(1+ε)NB=(1+\varepsilon)N. Here CC is an absolute constant.

5 Random matrices with independent columns

In this section we study the extreme singular values of N×nN\times n random matrices AA with independent columns AjA_{j}. We are guided by our ideal bounds (5.2) as before. The same phenomenon occurs in the column independent model as in the row independent model – sufficiently tall random matrices AA are approximate isometries. As before, being tall will mean NnN\gg n for sub-gaussian distributions and NnlognN\gg n\log n for arbitrary distributions.

i.e. 1NG\frac{1}{N}G is an approximate identity for NnN\gg n. Equivalently, by Lemma 5.36, (5.35) would yield the ideal bounds (5.2) on the extreme singular values of AA.

Unfortunately, the entries of the Gram matrix GG are obviously not independent. To overcome this obstacle we shall use the decoupling technique of probability theory . We observe that there is still enough independence encoded in GG. Consider a principal sub-matrix (AS)(AT)(A_{S})^{*}(A_{T}) of G=AAG=A^{*}A with disjoint index sets SS and TT. If we condition on (Ak)kT(A_{k})_{k\in T} then this sub-matrix has independent rows. Using an elementary decoupling technique, we will indeed seek to replace the full Gram matrix GG by one such decoupled S×TS\times T matrix with independent rows, and finish off by applying results of Section 5.4.

By transposition one can try to reduce our problem to studying the n×Nn\times N matrix AA^{*}. It has independent rows and the same singular values as AA, so one can apply results of Section 5.4. The conclusion would be that, with high probability,

Such estimate is only good for flat matrices (NnN\leq n). For tall matrices (NnN\geq n) the lower bound would be trivial because of the (possibly large) constant CC. So, from now on we can focus on tall matrices (NnN\geq n) with independent columns.

Here we prove a version of Theorem 5.39 for matrices with independent columns.

with probability at least 12exp(ct2)1-2\exp(-ct^{2}), where C=CKC=C^{\prime}_{K}, c=cK>0c=c^{\prime}_{K}>0 depend only on the subgaussian norm K=maxjAjψ2K=\max_{j}\|A_{j}\|_{\psi_{2}} of the columns.

By Lemma 5.36, the conclusion of Theorem 5.58 is equivalent to

Consider a double array of real numbers (aij)i,j=1n(a_{ij})_{i,j=1}^{n} such that aii=0a_{ii}=0 for all ii. Then

where TT is a random subset of [n][n] with average size n/2n/2. In particular,

where the minimum and maximum are over all subsets TT of [n][n].

Step 1: Reductions. Without loss of generality we can assume that the columns AiA_{i} have zero mean. Indeed, multiplying each column AiA_{i} by ±1\pm 1 arbitrarily preserves the extreme singular values of AA, the isotropy of AiA_{i} and the sub-gaussian norms of AiA_{i}. Therefore, by multiplying AiA_{i} by independent symmetric Bernoulli random variables we achieve that AiA_{i} have zero mean.

For t=O(N)t=O(\sqrt{N}) the conclusion of Theorem 5.58 follows from Theorem 5.39 by transposition. Indeed, the n×Nn\times N random matrix AA^{*} has independent rows, so for t0t\geq 0 we have

with probability at least 12exp(cKt2)1-2\exp(-c_{K}t^{2}). Here cK>0c_{K}>0 and we can obviously assume that CK1C_{K}\geq 1. For tCKNt\geq C_{K}\sqrt{N} it follows that smax(A)N+n+2ts_{\max}(A)\leq\sqrt{N}+\sqrt{n}+2t, which yields the conclusion of Theorem 5.58 (the left hand side of (5.36) being trivial). So, it suffices to prove the conclusion for tCKNt\leq C_{K}\sqrt{N}. Let us fix such tt.

It would be useful to have some a priori control of smax(A)=As_{\max}(A)=\|A\|. We thus consider the desired event

Since 3CKNn+CKN+t3C_{K}\sqrt{N}\geq\sqrt{n}+C_{K}\sqrt{N}+t, by (5.37) we see that E\mathcal{E} is likely to occur:

Step 2: Approximation. This step is parallel to Step 1 in the proof of Theorem 5.39, except now we shall choose ε:=δ\varepsilon:=\delta. This way we reduce our task to the following. Let N\mathcal{N} be a 14\frac{1}{4}-net of the unit sphere Sn1S^{n-1} such that N9n|\mathcal{N}|\leq 9^{n}. It suffices to show that with probability at least 12exp(cKt2)1-2\exp(-c^{\prime}_{K}t^{2}) one has

By (5.38), it is enough to show that the probability

satisfies p2exp(cKt2)p\leq 2\exp(-c_{K}^{\prime\prime}t^{2}), where cK>0c_{K}^{\prime\prime}>0 may depend only on KK.

Step 3: Decoupling. As in the proof of Theorem 5.39, we will obtain the required bound for a fixed xNx\in\mathcal{N} with high probability, and then take a union bound over xx. So let us fix any x=(x1,,xn)Sn1x=(x_{1},\ldots,x_{n})\in S^{n-1}. We expand

Since Aj22=N\|A_{j}\|_{2}^{2}=N by assumption and x2=1\|x\|_{2}=1, the first sum equals NN. Therefore, subtracting NN from both sides and dividing by NN, we obtain the bound

The sum in the right hand side is G0x,x\langle G_{0}x,x\rangle where G0G_{0} is the off-diagonal part of the Gram matrix G=AAG=A^{*}A. As we indicated in the beginning of Section 5.5, we are going to replace G0G_{0} by its decoupled version whose rows and columns are indexed by disjoint sets. This is achieved by Decoupling Lemma 5.60: we obtain

We substitute this into (5.39) and take union bound over all choices of xNx\in\mathcal{N} and T[n]T\subseteq[n]. As we know, N9n|\mathcal{N}|\leq 9^{n}, and there are 2n2^{n} subsets TT in [n][n]. This gives

Step 4: Conditioning and concentration. To estimate the probability in (5.5.1), we fix a vector xNx\in\mathcal{N} and a subset T[n]T\subseteq[n] and we condition on a realization of random vectors (Ak)kTc(A_{k})_{k\in T^{c}}. We express

Under our conditioning zz is a fixed vector, so RT(x)R_{T}(x) is a sum of independent random variables. Moreover, if event E\mathcal{E} holds then zz is nicely bounded:

If in turn (5.43) holds then the terms Aj,z\langle A_{j},z\rangle in (5.42) are independent centered sub-gaussian random variables with Aj,zψ23KCKN\|\langle A_{j},z\rangle\|_{\psi_{2}}\leq 3KC_{K}\sqrt{N}. By Lemma 5.9, their linear combination RT(x)R_{T}(x) is also a sub-gaussian random variable with

where C^K\widehat{C}_{K} depends only on KK.

The second inequality follows because RT(x)R_{T}(x) is a sub-gaussian random variable (5.44) whose tail decay is given by (5.10). Here c1,c2>0c_{1},c_{2}>0 are absolute constants. The last inequality follows from the definition of δ\delta. Substituting this into (5.5.1) and choosing CC sufficiently large (so that ln36c2C2/C^K2\ln 36\leq c_{2}C^{2}/\widehat{C}_{K}^{2}), we conclude that

This proves an estimate that we desired in Step 2. The proof is complete. ∎

Some a priori control of the norms of the columns Aj2\|A_{j}\|_{2} is necessary for estimating the extreme singular values, since

With this in mind, it is easy to construct an example showing that a normalization assumption Ai2=N\|A_{i}\|_{2}=\sqrt{N} is essential in Theorem 5.58; it can not even be replaced by a boundedness assumption Ai2=O(N)\|A_{i}\|_{2}=O(\sqrt{N}).

5.2 Heavy-tailed columns

Here we prove a version of Theorem 5.45 for independent heavy-tailed columns.

We thus consider N×nN\times n random matrices AA with independent columns AjA_{j}. In addition to the normalization assumption Aj2=N\|A_{j}\|_{2}=\sqrt{N} already present in Theorem 5.58 for sub-gaussian columns, our new result must also require an a priori control of the off-diagonal part of the Gram matrix G=AA=(Aj,Ak)j,k=1nG=A^{*}A=(\langle A_{j},A_{k}\rangle)_{j,k=1}^{n}.

Our proof of Theorem 5.62 will be based on decoupling, symmetrization and an application of Theorem 5.48 for a decoupled Gram matrix with independent rows. The decoupling is done similarly to Theorem 5.58. However, this time we will benefit from formalizing the decoupling inequality for Gram matrices:

Let BB be a N×nN\times n random matrix whose columns BjB_{j} satisfy Bj2=1\|B_{j}\|_{2}=1. Then

We first note that \|B^{*}B-I\|=\sup_{x\in S^{n-1}}\big{|}\|Bx\|_{2}^{2}-1\big{|}. We fix x=(x1,,xn)Sn1x=(x_{1},\ldots,x_{n})\in S^{n-1} and, expanding as in (5.40), observe that

The first sum equals 11 since Bj2=x2=1\|B_{j}\|_{2}=\|x\|_{2}=1. So by Decoupling Lemma 5.60, a random subset TT of [n][n] with average cardinality n/2n/2 satisfies

The conclusion follows by replacing the expectation by the maximum over TT. ∎

Step 1: Reductions and decoupling. It would be useful to have an a priori bound on smax(A)=As_{\max}(A)=\|A\|. We can obtain this by transposing AA and applying one of the results of Section 5.4. Indeed, the random n×Nn\times N matrix AA^{*} has independent rows AiA_{i}^{*} which by our assumption are normalized as Ai2=Ai2=N\|A_{i}^{*}\|_{2}=\|A_{i}\|_{2}=\sqrt{N}. Applying Theorem 5.45 with the roles of nn and NN switched, we obtain by the triangle inequality that

We use Matrix Decoupling Lemma 5.63 for B=1NAB=\frac{1}{\sqrt{N}}A and obtain

where Γ=Γ(T)\Gamma=\Gamma(T) denotes the decoupled Gram matrix

Let us fix TT; our problem then reduces to bounding the expected norm of Γ\Gamma.

Let us condition on ATcA_{T^{c}}. Treating (Ak)kTc(A_{k})_{k\in T^{c}} as fixed vectors we see that, conditionally, the random matrix Γ\Gamma has independent rows

So we are going to use Theorem 5.48 to bound the norm of Γ\Gamma. To do this we need estimates on (a) the norms and (b) the second moment matrices of the rows Γj\Gamma_{j}.

(b) To evaluate the norms of Γj\Gamma_{j}, jTj\in T, note that Γj22=kTcAj,Ak2\|\Gamma_{j}\|_{2}^{2}=\sum_{k\in T^{c}}\langle A_{j},A_{k}\rangle^{2}. This is easy to bound, because the assumption says that the random variable

Step 3: The norm of the decoupled Gram matrix. We bound the norm of the random T×TcT\times T^{c} Gram matrix Γ\Gamma with (conditionally) independent rows using Theorem 5.48 and Remark 5.49. Since by (5.48) we have \big{\|}\frac{1}{|T|}\sum_{j\in T}\Sigma(\Gamma_{j})\big{\|}\leq\frac{1}{|T|}\sum_{j\in T}\|\Sigma(\Gamma_{j})\|\leq\|A_{T^{c}}\|^{2}, we obtain using (5.49) that

The other term that will appear in the expectation of (5.5.2) is

So, taking the expectation in (5.5.2) and using these bounds, we obtain

where we used that nmn\leq m. Finally, using this estimate in (5.47) we conclude

This establishes the first part of Theorem 5.62. The second part follow by the diagonalization argument as in Step 2 of the proof of Theorem 5.45. ∎

6 Restricted isometries

In this section we consider an application of the non-asymptotic random matrix theory in compressed sensing. For a thorough introduction to compressed sensing, see the introductory chapter of this book and .

The interesting regime for compressed sensing is where we take very few measurements, mnm\ll n. Such matrices AA are not one-to-one, so recovery of xx from yy is not possible for all signals xx. But in practical applications, the amount of “information” contained in the signal is often small. Mathematically this is expressed as sparsity of xx. In the simplest case, one assumes that xx has few non-zero coordinates, say supp(x)kn|\operatorname{supp}(x)|\leq k\ll n. In this case, using any non-degenerate matrix AA one can check that xx can be recovered whenever m>2km>2k using the optimization problem min{supp(x):  Ax=y}\min\{|\operatorname{supp}(x)|:\;Ax=y\}.

This optimization problem is highly non-convex and generally NP-complete. So instead one considers a convex relaxation of this problem, min{x1:  Ax=y}\min\{\|x\|_{1}:\;Ax=y\}. A basic result in compressed sensing, due to Candès and Tao , is that for sparse signals supp(x)k|\operatorname{supp}(x)|\leq k, the convex problem recovers the signal xx from its measurement yy exactly, provided that the measurement matrix AA is quantitatively non-degenerate. Precisely, the non-degeneracy of AA means that it satisfies the following restricted isometry property with δ2k(A)0.1\delta_{2k}(A)\leq 0.1.

An m×nm\times n matrix AA satisfies the restricted isometry property of order k1k\geq 1 if there exists δk0\delta_{k}\geq 0 such that the inequality

In words, AA has a restricted isometry property if AA acts as an approximate isometry on all sparse vectors. Clearly,

where the maximum is over all subsets T[n]T\subseteq[n] with Tk|T|\leq k or T=k|T|=\lfloor k\rfloor.

The concept of restricted isometry can also be expressed via extreme singular values, which brings us to the topic we studied in the previous sections. AA is a restricted isometry if and only if all m×km\times k sub-matrices ATA_{T} of AA (obtained by selecting arbitrary kk columns from AA) are approximate isometries. Indeed, for every δ0\delta\geq 0, Lemma 5.36 shows that the following two inequalities are equivalent up to an absolute constant:

More precisely, (5.53) implies (5.54) and (5.54) implies δk(A)3max(δ,δ2)\delta_{k}(A)\leq 3\max(\delta,\delta^{2}).

Our goal is thus to find matrices that are good restricted isometries. What good means is clear from the goals of compressed sensing described above. First, we need to keep the restricted isometry constant δk(A)\delta_{k}(A) below some small absolute constant, say 0.10.1. Most importantly, we would like the number of measurements mm to be small, ideally proportional to the sparsity knk\ll n.

This is where non-asymptotic random matrix theory enters. We shall indeed show that, with high probability, m×nm\times n random matrices AA are good restricted isometries of order kk with m=O(k)m=O^{*}(k). Here the OO^{*} notation hides some logarithmic factors of nn. Specifically, in Theorem 5.65 we will show that

for sub-gaussian random matrices AA (with independent rows or columns). This is due to the strong concentration properties of such matrices. A general observation of this kind is Proposition 5.66. It says that if for a given xx, a random matrix AA (taken from any distribution) satisfies inequality (5.51) with high probability, then AA is a good restricted isometry.

In Theorem 5.71 we will extend these results to random matrices without concentration properties. Using a uniform extension of Rudelson’s inequality, Corollary 5.28, we shall show that

for heavy-tailed random matrices AA (with independent rows). This includes the important example of random Fourier matrices.

In this section we show that m×nm\times n sub-gaussian random matrices AA are good restricted isometries. We have in mind either of the following two models, which we analyzed in Sections 5.4.1 and 5.5.1 respectively:

Let AA be an m×nm\times n sub-gaussian random matrix with independent rows or columns, which follows either of the two models above. Then the normalized matrix Aˉ=1mA\bar{A}=\frac{1}{\sqrt{m}}A satisfies the following for every sparsity level 1kn1\leq k\leq n and every number δ(0,1)\delta\in(0,1):

with probability at least 12exp(cδ2m)1-2\exp(-c\delta^{2}m). Here C=CKC=C_{K}, c=cK>0c=c_{K}>0 depend only on the subgaussian norm K=maxiAiψ2K=\max_{i}\|A_{i}\|_{\psi_{2}} of the rows or columns of AA.

Let us check that the conclusion follows from Theorem 5.39 for the row-independent model, and from Theorem 5.58 for the column-independent model. We shall control the restricted isometry constant using its equivalent description (5.52). We can clearly assume that kk is a positive integer.

Using Lemma 5.36 for AˉT=1mAT\bar{A}_{T}=\frac{1}{\sqrt{m}}{A_{T}}, we see that (5.56) implies that

Now we take a union bound over all subsets T[n]T\subset[n], T=k|T|=k. Since there are (nk)(en/k)k\binom{n}{k}\leq(en/k)^{k} ways to choose TT, we conclude that

with probability at least 1-\binom{n}{k}\cdot 2\exp(-cs^{2})\geq 1-2\exp\big{(}k\log(en/k)-cs^{2}). Then, once we choose ε>0\varepsilon>0 arbitrarily and let s=C1klog(en/k)+εms=C_{1}\sqrt{k\log(en/k)}+\varepsilon\sqrt{m}, we conclude with probability at least 12exp(cε2m)1-2\exp(-c\varepsilon^{2}m) that

Finally, we apply this statement for ε:=δ/6\varepsilon:=\delta/6. By choosing constant CC in the statement of the theorem sufficiently large, we make mm large enough so that δ0δ/3\delta_{0}\leq\delta/3, which yields 3max(δ0,δ02)δ3\max(\delta_{0},\delta_{0}^{2})\leq\delta. The proof is complete. ∎

The main reason Theorem 5.65 holds is that the random matrix AA has a strong concentration property, i.e. that Aˉx2x2\|\bar{A}x\|_{2}\approx\|x\|_{2} with high probability for every fixed sparse vector xx. This concentration property alone implies the restricted isometry property, regardless of the specific random matrix model:

holds with probability at least 1exp(εm)1-\exp(-\varepsilon m). Then we have the following:

with probability at least 1exp(εm/2)1-\exp(-\varepsilon m/2). Here CC is an absolute constant.

In words, the restricted isometry property can be checked on each individual vector xx with high probability.

Taking maximum over all subsets T[n]T\subseteq[n], T=k|T|=k, we conclude that

On the other hand, by assumption we have for every xNTx\in\mathcal{N}_{T} that

Therefore, taking a union bound over (nk)(en/k)k\binom{n}{k}\leq(en/k)^{k} choices of the set TT and over 9k9^{k} elements xNTx\in\mathcal{N}_{T}, we obtain that

where the last line follows by the assumption on mm. The proof is complete. ∎

6.2 Heavy-tailed restricted isometries

In this section we show that m×nm\times n random matrices AA with independent heavy-tailed rows (and uniformly bounded coefficients) are good restricted isometries. This result will be established in Theorem 5.71. As before, we will prove this by controlling the extreme singular values of all m×km\times k sub-matrices ATA_{T}. For each individual subset TT, this can be achieved using Theorem 5.41: one has

with probability at least 12kexp(ct2)1-2k\cdot\exp(-ct^{2}). Although this optimal probability estimate has optimal order, it is too weak to allow for a union bound over all (nk)=(O(1)n/k)k\binom{n}{k}=(O(1)n/k)^{k} choices of the subset TT. Indeed, in order that 1(nk)2kexp(ct2)>01-\binom{n}{k}2k\cdot\exp(-ct^{2})>0 one would need to take t>klog(n/k)t>\sqrt{k\log(n/k)}. So in order to achieve a nontrivial lower bound in (5.57), one would be forced to take mk2m\geq k^{2}. This is too many measurements; recall that our hope is m=O(k)m=O^{*}(k).

This observation suggests that instead of controlling each sub-matrix ATA_{T} separately, we should learn how to control all ATA_{T} at once. This is indeed possible with the following uniform version of Theorem 5.45:

where l=log(n)logdlogNl=\log(n)\sqrt{\log d}\sqrt{\log N} and where C=CKC=C_{K} may depend on KK only. The maximum is, as usual, over all subsets T[d]T\subseteq[d], Tn|T|\leq n.

The non-uniform prototype of this result, Theorem 5.45, was based on Rudelson’s inequality, Corollary 5.28. In a very similar way, Theorem 5.67 is based on the following uniform version of Rudelon’s inequality.

where l=log(n)logdlogNl=\log(n)\sqrt{\log d}\sqrt{\log N} and where C=CKC=C_{K} may depend on KK only.

The non-uniform Rudelson’s inequality (Corollary 5.28) was a consequence of a non-commutative Khintchine inequality. Unfortunately, there does not seem to exist a way to deduce Proposition 5.68 from any known result. Instead, this proposition is proved using Dudley’s integral inequality for Gaussian processes and estimates of covering numbers going back to Carl, see . It is known however that such usage of Dudley’s inequality is not optimal (see e.g. ). As a result, the logarithmic factors in Proposition 5.68 are probably not optimal.

In contrast to these difficulties with Rudelson’s inequality, proving uniform versions of the other two ingredients of Theorem 5.45 – the deviation Lemma 5.47 and Symmetrization Lemma 5.46 – is straightforward.

The argument is entirely parallel to that of Lemma 5.47. ∎

Let XitX_{it}, 1iN1\leq i\leq N, tTt\in\mathcal{T}, be random vectors valued in some Banach space BB, where T\mathcal{T} is a finite index set. Assume that the random vectors Xi=(Xti)tTX_{i}=(X_{ti})_{t\in\mathcal{T}} (valued in the product space BTB^{\mathcal{T}}) are independent. Let ε1,,εN\varepsilon_{1},\ldots,\varepsilon_{N} be independent symmetric Bernoulli random variables. Then

The conclusion follows from Lemma 5.46 applied to random vectors XiX_{i} valued in the product Banach space BTB^{\mathcal{T}} equipped with the norm (Zt)tT=suptTZt|||(Z_{t})_{t\in\mathcal{T}}|||=\sup_{t\in\mathcal{T}}\|Z_{t}\|. The reader should also be able to prove the result directly, following the proof of Lemma 5.46. ∎

where we used Symmetrization Lemma 5.70 with independent symmetric Bernoulli random variables ε1,,εN\varepsilon_{1},\ldots,\varepsilon_{N}. The expectation in the right hand side is taken both with respect to the random matrix AA and the signs (εi)(\varepsilon_{i}). First taking the expectation with respect to (εi)(\varepsilon_{i}) (conditionally on AA) and afterwards the expectation with respect to AA, we obtain by Proposition 5.68 that

by Hölder’s inequality. Solving this inequality in EE we conclude that

The proof is completed by a diagonalization argument similar to Step 2 in the proof of Theorem 5.45. One uses there a uniform version of deviation inequality given in Lemma 5.69 for stochastic processes indexed by the sets Tn|T|\leq n. We leave the details to the reader. ∎

The result follows from Theorem 5.67, more precisely from its equivalent statement (5.58). In our notation, it says that

The conclusion of the theorem easily follows. ∎

In the interesting sparsity range klognk\geq\log n and kδ2k\geq\delta^{-2}, the condition in Theorem 5.71 clearly reduces to

(Random Fourier measurements): An important example for Theorem 5.41 is where AA realizes random Fourier measurements. Consider the n×nn\times n Discrete Fourier Transform (DFT) matrix WW with entries

(Random sub-matrices of orthogonal matrices): In a similar way, Theorem 5.71 applies to a random row sub-matrix AA of an arbitrary bounded orthogonal matrix WW. Precisely, AA may consist of mm randomly chosen rows, uniformly and without replacement,Since in the interesting regime very few rows are selected, mnm\ll n, sampling with or without replacement are formally equivalent. For example, see which deals with the model of sampling without replacement. from an arbitrary n×nn\times n matrix W=(wij)W=(w_{ij}) such that WW=nIW^{*}W=nI and with uniformly bounded coefficients, maxijwij=O(1)\max_{ij}|w_{ij}|=O(1). The examples of such WW include the class of Hadamard matrices – orthogonal matrices in which all entries equal ±1\pm 1.

7 Notes

We work with two kinds of moment assumptions for random matrices: sub-gaussian and heavy-tailed. These are the two extremes. By the central limit theorem, the sub-gaussian tail decay is the strongest condition one can demand from an isotropic distribution. In contrast, our heavy-tailed model is completely general – no moment assumptions (except the variance) are required. It would be interesting to analyze random matrices with independent rows or columns in the intermediate regime, between sub-gaussian and heavy-tailed moment assumptions. We hope that for distributions with an appropriate finite moment (say, (2+ε)(2+\varepsilon)th or 44th), the results should be the same as for sub-gaussian distributions, i.e. no logn\log n factors should occur. In particular, tall random matrices (Nn)N\gg n) should still be approximate isometries. This indeed holds for sub-exponential distributions ; see for an attempt to go down to finite moment assumptions.

For Section 5.2

The material presented here is well known. The volume argument presented in Lemma 5.2 is quite flexible. It easily generalizes to covering numbers of more general metric spaces, including convex bodies in Banach spaces. See [60, Lemma 4.16] and other parts of for various methods to control covering numbers.

For Section 5.2.3

The concept of sub-gaussian random variables is due to Kahane . His definition was based on the moment generating function (Property 4 in Lemma 5.5), which automatically required sub-gaussian random variables to be centered. We found it more convenient to use the equivalent Property 3 instead. The characterization of sub-gaussian random variables in terms of tail decay and moment growth in Lemma 5.5 also goes back to .

The rotation invariance of sub-gaussian random variables (Lemma 5.9) is an old observation . Its consequence, Proposition 5.10, is a general form of Hoeffding’s inequality, which is usually stated for bounded random variables. For more on large deviation inequalities, see also notes for Section 5.2.4.

Khintchine inequality is usually stated for the particular case of symmetric Bernoulli random variables. It can be extended for 0<p<20<p<2 using a simple extrapolation argument based on Hölder’s inequality, see [45, Lemma 4.1].

For Section 5.2.4

Proposition 5.16 is a form of Bernstein’s inequality, which is usually stated for bounded random variables in the literature. These forms of Hoeffding’s and Bernstein’s inequalities (Propositions 5.10 and 5.16) are partial cases of a large deviation inequality for general ψα\psi_{\alpha} norms, which can be found in [72, Corollary 2.10] with a similar proof. For a thorough introduction to large deviation inequalities for sums of independent random variables (and more), see the books and the tutorial .

For Section 5.2.5

Isotropic distributions on convex bodies, and more generally isotropic log-concave distributions, are central to asymptotic convex geometry (see ) and computational geometry . A completely different way in which isotropic distributions appear in convex geometry is from John’s decompositions for contact points of convex bodies, see . Such distributions are finitely supported and therefore are usually heavy-tailed.

For an introduction to the concept of frames (Example 5.21), see .

For Section 5.2.6

The non-commutative Khintchine inequality, Theorem 5.26, was first proved by Lust-Piquard with an unspecified constant BpB_{p} in place of CpC\sqrt{p}. The optimal value of BpB_{p} was computed by Buchholz ; see [62, Section 6.5] for an thorough introduction to Buchholz’s argument. For the complementary range 1p21\leq p\leq 2, a corresponding version of non-commutative Khintchine inequality was obtained by Lust-Piquard and Pisier . By a duality argument implicitly contained in and independently observed by Marius Junge, this latter inequality also implies the optimal order Bp=O(p)B_{p}=O(\sqrt{p}), see and [61, Section 9.8].

Rudelson’s Corollary 5.28 was initially proved using a majorizing measure technique; our proof follows Pisier’s argument from based on the non-commutative Khintchine inequality.

For Section 5.3

The “Bai-Yin law” (Theorem 5.31) was established for smax(A)s_{\max}(A) by Geman and Yin, Bai and Krishnaiah . The part for smin(A)s_{\min}(A) is due to Silverstein for Gaussian random matrices. Bai and Yin gave a unified treatment of both extreme singular values for general distributions. The fourth moment assumption in Bai-Yin’s law is known to be necessary .

Theorem 5.32 and its argument is due to Gordon . Our exposition of this result and of Corollary 5.35 follows .

For the recent developments related to the hard edge problem for almost square and square matrices (including Theorem 5.38) see the survey .

For Section 5.4

Theorem 5.39 on random matrices with sub-gaussian rows, as well as its proof by a covering argument, is a folklore in geometric functional analysis. The use of covering arguments in a similar context goes back to Milman’s proof of Dvoretzky’s theorem ; see e.g. and [60, Chapter 4] for an introduction. In the more narrow context of extreme singular values of random matrices, this type of argument appears recently e.g. in .

The breakthrough work on heavy-tailed isotropic distributions is due to Rudelson . He used Corollary 5.28 in the way we described in the proof of Theorem 5.45 to show that 1NAA\frac{1}{N}A^{*}A is an approximate isometry. Probably Theorem 5.41 can also be deduced by a modification of this argument; however it is simpler to use the non-commutative Bernstein’s inequality.

The symmetrization technique is well known. For a slightly more general two-sided inequality than Lemma 5.46, see [45, Lemma 6.3].

The problem of estimating covariance matrices described in Section 5.4.3 is a basic problem in statistics, see e.g. . However, most work in the statistical literature is focused on the normal distribution or general product distributions (up to linear transformations), which corresponds to studying random matrices with independent entries. For non-product distributions, an interesting example is for uniform distributions on convex sets . As we mentioned in Example 5.25, such distributions are sub-exponential but not necessarily sub-gaussian, so Corollary 5.50 does not apply. Still, the sample size N=O(n)N=O(n) suffices to estimate the covariance matrix in this case . It is conjectured that the same should hold for general distributions with finite (e. g. 44th) moment assumption .

Corollary 5.55 on random sub-matrices is a variant of the Rudelson’s result from . The study of random sub-matrices was continued in . Random sub-frames were studied in where a variant of Corollary 5.56 was proved.

For Section 5.5

Theorem 5.58 for sub-gaussian columns seems to be new. However, historically the efforts of geometric functional analysts were immediately focused on the more difficult case of sub-exponential tail decay (given by uniform distributions on convex bodies). An indication to prove results like Theorem 5.58 by decoupling and covering is present in and is followed in .

The normalization condition Aj2=N\|A_{j}\|_{2}=\sqrt{N} in Theorem 5.58 can not be dropped but can be relaxed. Namely, consider the random variable \delta:=\max_{i\leq n}\big{|}\frac{\|A_{j}\|_{2}^{2}}{N}-1\big{|}. Then the conclusion of Theorem 5.58 holds with (5.36) replaced by

Theorem 5.62 for heavy-tailed columns also seems to be new. The incoherence parameter mm is meant to prevent collisions of the columns of AA in a quantitative way. It is not clear whether the logarithmic factor is needed in the conclusion of Theorem 5.62, or whether the incoherence parameter alone takes care of the logarithmic factors whenever they appear. The same question can be raised for all other results for heavy-tailed matrices in Section 5.4.2 and their applications – can we replace the logarithmic factors by more sensitive quantities (e.g. the logarithm of the incoherence parameter)?

For Section 5.6

For a mathematical introduction to compressed sensing, see the introductory chapter of this book and .

A version of Theorem 5.65 was proved in for the row-independent model; an extension from sub-gaussian to sub-exponential distributions is given in . A general framework of stochastic processes with sub-exponential tails is discussed in . For the column-independent model, Theorem 5.65 seems to be new.

Proposition 5.66 that formalizes a simple approach to restricted isometry property based on concentration is taken from . Like Theorem 5.65, it can also be used to show that Gaussian and Bernoulli random matrices are restricted isometries. Indeed, it is not difficult to check that these matrices satisfy a concentration inequality as required in Proposition 5.66 .

Section 5.6.2 on heavy-tailed restricted isometries is an exposition of the results from . Using concentration of measure techniques, one can prove a version of Theorem 5.71 with high probability 1nclog3k1-n^{-c\log^{3}k} rather than in expectation . Earlier, Candes and Tao proved a similar result for random Fourier matrices, although with a slightly higher exponent in the logarithm for the number of measurements in (5.55), m=O(klog6n)m=O(k\log^{6}n). The survey offers a thorough exposition of the material presented in Section 5.6.2 and more.

Bibliography

Index