How close is the sample covariance matrix to the actual covariance matrix?

Roman Vershynin

Introduction

Estimation of covariance matrices of high dimensional distributions is a basic problem in multivariate statistics. It arises in diverse applications such as signal processing , genomics , financial mathematics , pattern recognition , geometric functional analysis and computational geometry . The classical and simplest estimator of a covariance matrix is the sample covariance matrix. Unfortunately, the spectral theory of sample covariance matrices has not been well developed except for product distributions (or affine transformations thereof) where one can rely on random matrix theory for matrices with independent entries. This paper addresses the following basic question: how well does the sample covariance matrix approximate the actual covariance matrix in the operator norm?

The use of the operator norm in this problem allows one a good grasp of the spectrum of Σ\Sigma, as each eigenvalue of Σ\Sigma would lie within ε\varepsilon from the corresponding eigenvalue of ΣN\Sigma_{N}.

It is common for today’s applications to operate with increasingly large number of parameters nn, and to require that sample sizes NN be moderate compared with nn. As we impose no a priori structure on the covariance matrix, we must have NnN\geq n for dimension reasons. Note that for some structured covariance matrices, such as sparse or having an off diagonal decay, one can sometimes achieve NN smaller than nn and even comparable to logn\log n, by transforming the sample covariance matrix in order to adhere to the same structure (e.g. by shrinkage of eigenvalues or thresholding of entries). We will not consider structured covariance matrices in this paper; see e.g. and .

2. Two examples

The most extensively studied model in random matrix theory is where XX is a random vector with independent coordinates. However, independence of coordinates can not be justified in some important applications, and in this paper we shall consider general random vectors. Let us illustrate this point with two well studied examples.

Suppose further that the covariance matrix of XX can be approximated by the sample covariance matrix ΣN\Sigma_{N} for some moderate sample size N=N(n,ε)N=N(n,\varepsilon). Such an approximation ΣNIε\|\Sigma_{N}-I\|\leq\varepsilon means simply that a random subset of NN vectors {xj1,,xjN}\{x_{j_{1}},\ldots,x_{j_{N}}\} taken from the tight frame {x1,,xM}\{x_{1},\ldots,x_{M}\} independently and with replacement is still an approximate tight frame:

In other words, a small random subset of a tight frame is still an approximate tight frame; the size of this subset NN does not even depend on the frame size MM. For applications of this type of results in communications see .

3. Sub-gaussian and sub-exponential distributions

then the optimal sample size is N=OK,L,ε(n)N=O_{K,L,\varepsilon}(n).

4. Distributions with finite moments

then the minimal sample size that guarantees approximation (1.1) is N=OK,L,ε(nlogn)N=O_{K,L,\varepsilon}(n\log n). The second moment assumption in (1.5) is very weak; it is equivalent to the boundedness of the covariance matrix, ΣL\|\Sigma\|\leq L. The logarithmic oversampling factor is necessary in this extremely general result, as can be seen from the example of the uniform distribution on the set of nn vectors of Euclidean length n\sqrt{n}. The coupon collector’s problem calls for the size NnlognN\gtrsim n\log n in order for the sample {X1,,XN}\{X_{1},\ldots,X_{N}\} to contain all these vectors, which is obviously required for a nontrivial covariance approximation.

In this paper we prove the Conjecture up to an iterated logarithmic factor.

1. The notation aq,K,L,δba\lesssim_{q,K,L,\delta}b means that aC(q,K,L,δ)ba\leq C(q,K,L,\delta)b where C(q,K,L,δ)C(q,K,L,\delta) depends only on the parameters q,K,L,δq,K,L,\delta; see Section 2.3 for more notation. The logarithms are to the base 22. We put the restriction n4n\geq 4 only to ensure that loglogn1\log\log n\geq 1; Theorem 1.2 and other results below clearly hold for dimensions n=1,2,3n=1,2,3 even without the iterated logarithmic factors.

2. It follows that for every ε>0\varepsilon>0, the desired approximation ΣΣNε\|\Sigma-\Sigma_{N}\|\leq\varepsilon is guaranteed if the sample has size

3. A similar result holds for independent random vectors X1,,XNX_{1},\ldots,X_{N} that are not necessarily identically distributed; we will prove this general result in Theorem 6.1.

4. The boundedness assumption X2Kn\|X\|_{2}\leq K\sqrt{n} in (1.6) can often be weakened or even dropped by a simple modification of Theorem 1.2. This happens, for example, if maxiNXi=O(n)\max_{i\leq N}\|X_{i}\|=O(\sqrt{n}) holds with high probability, as one can apply Theorem 1.2 conditionally on this event. We refer the reader to a thorough discussion of the boundedness assumption in Section 1.3 of .

5. Extreme eigenvalues of sample covariance matrices

Theorem 1.2 can be used to analyze the spectrum of sample covariance matrices ΣN\Sigma_{N}. The case when the random vector XX has i.i.d. coordinates is most studied in random matrix theory. Suppose that both N,nN,n\to\infty while the aspect ratio n/Nβ(0,1]n/N\to\beta\in(0,1]. If the coordinates of XX have unit variance and finite fourth moment, then clearly Σ=I\Sigma=I. The largest eigenvalue λ1(ΣN)\lambda_{1}(\Sigma_{N}) then converges a.s. to (1+β)2(1+\sqrt{\beta})^{2}, and the smallest eigenvalue λn(ΣN)\lambda_{n}(\Sigma_{N}) converges a.s. to (1β)2(1-\sqrt{\beta})^{2}, see . For more on the extreme eigenvalues in both asymptotic regime (N,nN,n\to\infty) and non-asymptotic regime (N,nN,n fixed), see .

Without independence of the coordinates, analyzing the spectrum of sample covariance matrices ΣN\Sigma_{N} becomes significantly harder. Suppose that Σ=I\Sigma=I. For sub-exponential distributions, i.e. those satisfying (1.4), it was proved in that

(A weaker version with extra log(1/β)\log(1/\beta) factors was proved earlier by the same authors in .) Under only finite moment assumption (1.6), Theorem 1.2 clearly yields

Note that for large exponents qq, the factor β122q\beta^{\frac{1}{2}-\frac{2}{q}} becomes close to β\sqrt{\beta}.

6. Norms of random matrices with independent columns

This follows from a result leading to Theorem 1.2, see Corollary 5.2. The bound is optimal up to the loglogN\log\log N factor for matrices with i.i.d. entries, because the operator norm is bounded below by the Euclidean norm of any column and any row. For random matrices with independent entries, estimate (1.7) follows (under the fourth moment assumption) from more general bounds by Seginer and Latala , and even without the loglogN\log\log N factor. Without independence of entries, this bound was proved by the author for products of random matrices with independent entries and deterministic matrices, and also without the loglogN\log\log N factor.

7. Organization of the rest of the paper

The heart of the paper are Sections 3 and 4. In Section 3 we study the structure of series that diverge faster than the iterated logarithm. This structure is used in Section 4 to deduce a decoupling principle. In Section 5 we apply the decoupling principle to norms of random matrices. Specifically, in Theorem 5.1 we estimate the norm of iEXiXi\sum_{i\in E}X_{i}\otimes X_{i} uniformly over subsets EE. We interpret this in Corollary 5.2 as a norm estimate for random matrices with independent columns. In Section 6, we deduce the general form of our main result on approximation of covariance matrices, Theorem 6.1.

Acknowledgement

The author is grateful to the referee for useful suggestions.

Outline of the method and preliminaries

In this expression we may recongize a stochastic process indexed by vectors xx on the sphere. For each fixed xx, we have to control the sum of independent random variables iXi,x2\sum_{i}\langle X_{i},x\rangle^{2} with finite moments. Suppose the bad event occurs – for some xx, this sum is significantly larger than nn. Unfortunately, because of the heavy tails of these random variables, the bad event may occur with polynomial rather than exponential probability nO(1)n^{-O(1)}. This is too weak to control these sums for all xx simultaneously on the nn-dimensional sphere, where ε\varepsilon-nets have exponential sizes in nn. So, instead of working with sums of independent random variables, we try to locate some structure in the summands responsible for the largeness of the sum.

More generally, we shall study the structure of divergent series ibi=\sum_{i}b_{i}=\infty, where bi0b_{i}\geq 0. Let us first suppose that the series diverges faster than logarithmic function, thus

Comparing with the harmonic series we see that the non-increasing rearrangement bib^{*}_{i} of the coefficients at some point must be large:

In other words, one can find n1n_{1} large terms of the sum: there exists an index set I[n]I\subset[n] of size I=n1|I|=n_{1} and such that bi1/n1b_{i}\gg 1/n_{1} for iIi\in I. This collection of large terms (bi)iI(b_{i})_{i\in I} forms a desired structure responsible for the largeness of the series ibi\sum_{i}b_{i}. Such a structure is well suited to our applications where bib_{i} are independent random variables, bi=Xi,x2/nb_{i}=\langle X_{i},x\rangle^{2}/n. Indeed, the events {bi1/n1}\{b_{i}\gg 1/n_{1}\} are independent, and the probability of each such event is easily controlled by finite moment assumptions (2.2) through Markov’s inequality. This line was developed in , but it clearly leads to a loss of logarithmic factor which we are trying to avoid in the present paper.

We will work on the next level of precision, thus studying the structure of series that diverge slower than the logarithmic function but faster than the iterated logarithm. So let us assume that

In Proposition 3.1 we will locate almost the same structure as we had for logarithmically divergent series, except up to some factor loglognllogn\log\log n\ll l\lesssim\log n, as follows. For some n1nn_{1}\leq n there exists an index set I[n]I\subset[n] of size I=n1|I|=n_{1}, such that

2. Decoupling

The structure that we found is well suited to our application where bib_{i} are independent random variables bi=Xi,x2/nb_{i}=\langle X_{i},x\rangle^{2}/n. In this case we have

The probability that this happens is again easy to control using independence of Xi,x\langle X_{i},x\rangle for fixed xx, finite moment assumptions (2.2) and Markov’s inequality. Since there are (nn1)\binom{n}{n_{1}} number of ways to choose the subset II, the probability of the event in (2.1) is bounded by

where the last inequality follows because (nn1)(en/n1)n1\binom{n}{n_{1}}\leq(en/n_{1})^{n_{1}} and since q>2q>2.

Our next task is to unfix xSn1x\in S^{n-1}. The exponential probability estimate we obtained allows us to take the union bound over all xx in the unit sphere of any fixed n1n_{1}-dimensional subspace, since this sphere has an ε\varepsilon-net of size exponential in n1n_{1}. We can indeed assume without loss of generality that the vector xx in our structural event (2.1) lies in the span of (Xi)iI(X_{i})_{i\in I} which is n1n_{1}-dimensional; this can be done by projecting xx onto this span if necessary. Unfortunately, this obviously makes xx depend on the random vectors (Xi)iI(X_{i})_{i\in I} and destroys the independence of random variables Xi,x\langle X_{i},x\rangle. This hurdle calls for a decoupling mechanism, which would make xx in the structural event (2.1) depend on some small fraction of the vectors (Xi)iI(X_{i})_{i\in I}. One would then condition on this fraction of random vectors and use the structural event (2.1) for the other half, which would quickly lead to completion of the argument.

Our decoupling principle, Proposition 4.1, is a deterministic statement that works for fixed vectors XiX_{i}. Loosely speaking, we assume that the structural event (2.1) holds for some xx in the span of (Xi)iI(X_{i})_{i\in I}, and we would like to force xx to lie in the span of a small fraction of these XiX_{i}. We write xx as a linear combination x=iIciXix=\sum_{i\in I}c_{i}X_{i}. The first step of decoupling is to remove the “diagonal” term ciXic_{i}X_{i} from this sum, while retaining the largeness of Xi,x\langle X_{i},x\rangle. This task turns out to be somewhat difficult, and it will force us to refine our structural result for divergent series by adding a domination ingredient into it. This will be done at the cost of another loglogn\log\log n factor. After the diagonal term is removed, the number of terms in the sum for xx will be reduced by a probabilistic selection using Maurey’s empirical method.

3. Notation and preliminaries

We will use the following notation throughout this paper. CC and cc will stand for positive absolute constants; CpC_{p} will denote a quantity which only depends on the parameter pp, and similar notation will be used with more than one parameter. For positive numbers aa and bb, the asymptotic inequality aba\lesssim b means that aCba\leq Cb. Similarly, inequalities of the form ap,qba\lesssim_{p,q}b mean that aCp,qba\lesssim C_{p,q}b. Intervals of integers will be denoted by [n]:={1,,n}[n]:=\{1,\ldots,\lceil n\rceil\} for n0n\geq 0. The cardinality of a finite set II is denoted by I|I|. All logarithms will be to the base 22.

We can view our goal as establishing a law of large numbers in the operator norm, and with quantitative estimates on convergence. Thus we would like to show that the approximation error

4. Sub-gaussian distributions

A solution to the approximation problem is well known and easy for sub-gaussian random vectors, those satisfying (1.3). The optimal sample size here is proportional to the dimension, thus N=OL,ε(n)N=O_{L,\varepsilon}(n). For the reader’s convenience, we recall and prove a general form of this result.

One should compare this with our main result, Theorem 1.2, which yields almost the same conclusion under only finite moment assumptions on the distribution, except for an iterated logarithmic factor and a slight loss of the exponent 1/21/2 (the latter may be inevitable when dealing with finite moments).

The well known proof of Proposition 2.1 is based on Bernstein’s deviation inequality for independent random variables and an ε\varepsilon-net argument. The latter allows to replace the sphere Sn1S^{n-1} in the computation of the norm in (2.3) by a finite ε\varepsilon-net as follows.

Let AA be a Hermitian n×nn\times n matrix, and let Nε\mathcal{N}_{\varepsilon} be an ε\varepsilon-net of the unit Euclidean sphere Sn1S^{n-1} for some ε[0,1)\varepsilon\in[0,1). Then

Let us choose xSn1x\in S^{n-1} for which A=Ax,x\|A\|=|\langle Ax,x\rangle|, and choose yNεy\in\mathcal{N}_{\varepsilon} which approximates xx as xy2ε\|x-y\|_{2}\leq\varepsilon. It follows by the triangle inequality that

Without loss of generality, we can assume that in the sub-gaussian assumption (1.3) we have L=1L=1 by replacing XiX_{i} by Xi/LX_{i}/L. Identity (2.3) expresses the norm in question as a supremum over the unit sphere Sn1S^{n-1}. Next, Lemma 2.2 allows to replace the sphere in (2.3) by its 1/21/2-net N\mathcal{N} at the cost of an absolute constant factor. Moreover, we can arrange so that the net has size N6n|\mathcal{N}|\leq 6^{n}; this follows by a standard volumetric argument (see [17, Lemma 9.5]).

Now we unfix xx. Using (2.4) for each xx in the net N\mathcal{N}, we conclude by the union bound that the event

Now if we choose ε2=(4/c)log(2/δ)n/N\varepsilon^{2}=(4/c)\log(2/\delta)\,n/N, this probability is further bounded below by 1δ1-\delta as required. By the reduction from the sphere to the net mentioned in the beginning of the argument, this completes the proof. ∎

A truncation argument of J. Bourgain reduces the approximation problem to finding an upper bound on

Consider random vectors X1,,XNX_{1},\ldots,X_{N} which satisfy moment assumptions (2.2) for some q>4q>4 and some KK, LL. Then, for every t1t\geq 1, with probability at least 1Ct0.9q1-Ct^{-0.9q} one has

For most part of our argument (through decoupling), we treat XiX_{i} as fixed non-random vectors. The only property we require from XiX_{i} is that they are almost pairwise orthogonal. For random vectors, an almost pairwise orthogonality easily follows from the moment assumptions (2.2):

Consider random vectors X1,,XNX_{1},\ldots,X_{N} which satisfy moment assumptions (2.2) for some q>4q>4 and some KK, LL. Then, for every t1t\geq 1, with probability at least 1Ctq1-Ct^{-q} one has

Structure of divergent series

In this section we study the structure of series which diverge slower than the logarithmic function but faster than an iterated logarithm. This is summarized in the following result.

Then there exist a positive integer llogml\leq\log m and a subset of indices I1[m]I_{1}\subseteq[m] such that the following holds. Given a vector λ=(λi)iI1\lambda=(\lambda_{i})_{i\in I_{1}} such that λ11\|\lambda\|_{1}\leq 1, one can find a further subset I2I1I_{2}\subseteq I_{1} with the following two properties.

(i) (Regularity): the sizes n1:=I1n_{1}:=|I_{1}| and n2:=I2n_{2}:=|I_{2}| satisfy

Furthermore, we can make lCαloglogml\geq C_{\alpha}\log\log m with arbitrarily large CαC_{\alpha} by making the dependence on α\alpha implicit in the assumption (3.1) sufficiently large.

which will be crucial in our application to decoupling.

2. The freedom to choose λ\lambda in Proposition 3.1 ensures that the structure located in the set I1I_{1} is in a sense hereditary; it can pass to subsets I2I_{2}. The domination of λ\lambda by bb on I2I_{2} will be crucial in the removal of the diagonal terms in our application to decoupling.

We now turn to the proof of Proposition 3.1. Heuristically, we will first find many (namely, ll) sets I1I_{1} on which the coefficients are large as in (ii), then choose one that satisfies the regularity condition (i). This regularization step will rely on the following elementary lemma.

Let NN be a positive integer. Consider a nonempty subset J[L]J\subset[L] with size l:=Jl:=|J|. Then, for every α(0,1)\alpha\in(0,1), there exist elements j1,j2Jj_{1},j_{2}\in J that satisfy the following two properties.

We will find j1j_{1}, j2j_{2} as some consecutive terms of the following geometric progression. Define j(0)Jj^{(0)}\in J to be the (unique) element such that

We will only need to consider KK terms of this progression, where K:=min{k:  j(k)L}K:=\min\{k:\;j^{(k)}\geq L\}. Since j(0)l/2l/2j^{(0)}\geq\lceil l/2\rceil\geq l/2, we have j(k)(1+α)kj(0)(1+α)kl/2j^{(k)}\geq(1+\alpha)^{k}j^{(0)}\geq(1+\alpha)^{k}l/2. On the other hand, j(K1)Lj^{(K-1)}\leq L. It follows that Kαlog(2L/l)K\lesssim_{\alpha}\log(2L/l).

We claim that there exists a term 1kK1\leq k\leq K such that

The terms j1:=j(k1)j_{1}:=j^{(k-1)} and j2:=j(k)j_{2}:=j^{(k)} for which (3.2) holds clearly satisfy (i) and (ii) of the conclusion. By increasing j1j_{1} and decreasing j2j_{2} if necessary we can assume that j1,j2Jj_{1},j_{2}\in J. This completes the proof. ∎

We shall prove the following slightly stronger statement. Consider a sufficiently large number

We shall prove that the conclusion of the Proposition holds with (ii) replaced by:

We will construct I1I_{1} and I2I_{2} in the following way. First we decompose the index set [m][m] into blocks Ω1,,ΩL\Omega_{1},\ldots,\Omega_{L} on which the coefficients bib_{i} have similar magnitude; this is possible with LlogmL\sim\log m blocks. Using the assumption b1(loglogm)2\|b\|_{1}\gtrsim(\log\log m)^{2}, one easily checks that many (at least lloglogml\sim\log\log m) of the blocks Ωj\Omega_{j} have large contribution (at least 1/j1/j) to the sum b1=ibi\|b\|_{1}=\sum_{i}|b_{i}|. We will only focus on such large blocks in the rest of the argument. At this point, the union of these blocks could be declared I1I_{1}. We indeed proceed this way, except we first use Regularization Lemma 3.2 on these blocks in order to obtain the required regularity property (ii). Finally, assume we are given coefficients (λi)iI1(\lambda_{i})_{i\in I_{1}} with small sum λi1\sum|\lambda_{i}|\leq 1 as in the assumption. Since the coefficients bib_{i} are large on I1I_{1} by construction, the pigeonhole principle will yield (loosely speaking) a whole block of coefficients Ωj\Omega_{j} where bib_{i} will dominate as required, bi2λi|b_{i}|\geq 2|\lambda_{i}|. We declare this block I2I_{2} and complete the proof. Now we pass to the details of the argument.

Step 1: decomposition of [m][m] into blocks. Without loss of generality,

Indeed, we can clearly assume that bi0b_{i}\geq 0 and λi0\lambda_{i}\geq 0. The estimate bi1b_{i}\leq 1 follows from the assumption: bb1,1\|b\|_{\infty}\leq\|b\|_{1,\infty}\leq 1. Furthermore, the contribution of the small coefficients bi1/mb_{i}\leq 1/m to the norm b1\|b\|_{1} is at most 11, while by the assumption b1αKloglogm2\|b\|_{1}\gtrsim_{\alpha}K\log\log m\geq 2. Hence we can ignore these small coefficients by replacing [m][m] with the subset corresponding to the coefficients bi1/mb_{i}\geq 1/m.

We decompose [m][m] into disjoint subsets (which we call blocks) according to the magnitude of bib_{i}, and we consider the contribution of each block Ωj\Omega_{j} to the norm b1\|b\|_{1}:

By our assumptions on bb, there are at most logm\log m nonempty blocks Ωj\Omega_{j}. As b1,1\|b\|_{1,\infty}\leq 1, Markov’s inequality yields for all jj that

Only the blocks with large contributions BjB_{j} will be of interest to us. Their number is

and we let l=0l=0 if it happens that all Bj<K/jB_{j}<K/j. We claim that there are many such blocks:

Indeed, by the assumption and using (3.6) we can bound

Step 2: construction of the set I1I_{1}. As we said before, we are only interested in blocks Ωj\Omega_{j} with large contributions BjB_{j}. We collect the indices of such blocks into the set

Since the definition of ll implies that BlK/lB^{*}_{l}\geq K/l, we have Jˉl|\bar{J}|\geq l. Then we can apply Regularization Lemma 3.2 to the set {logmj:  jJˉ}[logm]\{\log m-j:\;j\in\bar{J}\}\subseteq[\log m]. Thus we find two elements j,jJˉj^{\prime},j^{\prime\prime}\in\bar{J} satisfying

has size Jαl/loglogm|J|\gtrsim_{\alpha}l/\log\log m. Since by our choice of KK we can assume that K8loglogmK\geq 8\log\log m, we obtain

satisfies the conclusion of the Proposition.

Step 3: sizes of the coefficients bib_{i} for iI1i\in I_{1}. Let us fix jJJˉj\in J\subseteq\bar{J}. From the definition of Jˉ\bar{J} we know that the contribution BjB_{j} is large: BjK/lB_{j}\geq K/l. One consequence of this is a good estimate of the size mjm_{j} of the block Ωj\Omega_{j}. Indeed, the above bound together with (3.6) this implies

Another consequence of the lower bound on BjB_{j} is the required lower bound on the individual coefficients bib_{i}. Indeed, by construction of Ωj\Omega_{j} the coefficients bib_{i}, iΩji\in\Omega_{j} are within the factor 22 from each other. It follows that

In particuar, since by construciton ΩjI1\Omega_{j}\subseteq I_{1}, we have mjI1m_{j}\leq|I_{1}|, which implies

We have thus proved the required lower bound (3.3).

Step 4: Construction of the set I2I_{2}, and sizes of the coefficients bib_{i} for iI2i\in I_{2}. Now suppose we are given a vector λ=(λi)iI1\lambda=(\lambda_{i})_{i\in I_{1}} with λ11\|\lambda\|_{1}\leq 1. We will have to construct a subset I2I1I_{2}\subset I_{1} as in the conclusion, and we will do this as follows. Consider the contribution of the block Ωj\Omega_{j} to the norm λ1\|\lambda\|_{1}:

On the one hand, the sum of all contributions is bounded as jJLj=λ11\sum_{j\in J}L_{j}=\|\lambda\|_{1}\leq 1. On the other hand, there are many terms in this sum: J8l/K|J|\geq 8l/K as we know from (3.9). Therefore, by the pigeonhole principle some of the contributions must be small: there exists j0Jj_{0}\in J such that

This in turn implies via Markov’s inequality that most of the coefficients λi\lambda_{i} for iΩj0i\in\Omega_{j_{0}} are small, and we shall declare these set of indices I2I_{2}. Specifically, since Lj0=iΩj0λjK/8lL_{j_{0}}=\sum_{i\in\Omega_{j_{0}}}\lambda_{j}\leq K/8l and Ωj0=mj0|\Omega_{j_{0}}|=m_{j_{0}}, using Markov’s inequality we see that the set

We have thus proved the required lower bound (3.4).

Step 5: the sizes of the sets I1I_{1} and I2I_{2}. It remains to check the regularity property (i) of the conclusion of the Proposition. We bound

We have thus proved the first inequality in (i) of the conclusion of the Proposition. Similarly, we bound

This completes the proof of (i) of the conclusion, and of the whole Proposition 3.1. ∎

Decoupling

Then the Decoupling Proposition 2.1 of implies that there exist disjoint sets of indices I,J[m]I,J\subseteq[m] such that JδI|J|\leq\delta|I|, and there exists a vector ySn1span(Xj)jJy\in S^{n-1}\cap\operatorname{span}(X_{j})_{j\in J}, such that

Results of this type are best suited for applications to random independent vectors XiX_{i}. Indeed, the events that Xi,y2\langle X_{i},y\rangle^{2} is large are independent for iIi\in I because yy does not depend on (Xi)iI(X_{i})_{i\in I}. The probability of each such event is easy to bound using the moment assumptions (2.2).

Then there exist nonempty disjoint sets of indices I,J[m]I,J\subseteq[m] such that JδI|J|\leq\delta|I|, and there exists a vector ySn1span(Xj)jJy\in S^{n-1}\cap\operatorname{span}(X_{j})_{j\in J}, such that

The proof of the Decoupling Proposition 4.1 will use Proposition 3.1 in order to locate the structure of the large coefficients Xi,x\langle X_{i},x\rangle. The following elementary lemma will be used in the argument.

It is easy to check that each extreme point of the convex set

has exactly KK nonzero coefficients which are equal to ±1/K\pm 1/K. Evaluating the linear form λiai\sum\lambda_{i}a_{i} on these extreme points, we obtain

By replacing XiX_{i} with Xi/K3X_{i}/K_{3} we can assume without loss of generality that K1=K2=K3=1K_{1}=K_{2}=K_{3}=1. By perturbing the vectors XiX_{i} slightly we may also assume that XiX_{i} are all different.

Step 1: separation and the structure of coefficients. Suppose the assumptions of the Proposition hold, and let us choose a vector xSn1x\in S^{n-1} which attains the supremum in (4.3). We denote

and without loss of generality we may assume that ai0a_{i}\neq 0. We also denote

We choose parameter α=α(r,r,r,δ)(0,1)\alpha=\alpha(r,r^{\prime},r^{\prime\prime},\delta)\in(0,1) sufficiently small; its choice will become clear later on in the argument. At this point, we may assume that

We can use Structure Proposition 3.1 to locate the structure in the coefficients aia_{i}. To this end, we apply this result for bi=ai2/nˉb_{i}=a_{i}^{2}/\bar{n} and obtain a number llogml\leq\log m and a subset of indices I1[m]I_{1}\subseteq[m]. We can also assume that ll is sufficiently large – larger than an arbitrary quantity which depends on α\alpha.

Since a vector xSn1x\in S^{n-1} satisfies Xi/ai,x=1\langle X_{i}/a_{i},x\rangle=1 for all iI1i\in I_{1} (in fact for all i[m]i\in[m]), a separation argument for the convex hull K:=conv(Xi/ai)iI1K:=\operatorname{conv}(X_{i}/a_{i})_{i\in I_{1}} yields the existence of a vector xˉconv(K0)\bar{x}\in\operatorname{conv}(K\cup 0) that satisfies

We express xˉ\bar{x} as a convex combination

We then read the conclusion of Structure Proposition 3.1 as follows. There exists a futher subset of indices I2I1I_{2}\subseteq I_{1} such that the sizes n1:=I1n_{1}:=|I_{1}| and n2:=I2n_{2}:=|I_{2}| are regular in the sense that

and the coefficients on I1I_{1} and I2I_{2} are large:

Furthermore, we can make ll sufficiently large depending on α\alpha, say l100/α2l\geq 100/\alpha^{2}.

Step 2: random selection. We will reduce the number of terms n1n_{1} in the sum (4.5) defining xˉ\bar{x} using random selection, trying to bring this number down to about n2n_{2}. As is usual in dealing with sums of independent random variables, we will need to ensure that all summands λiXi/ai\lambda_{i}X_{i}/a_{i} have controlled magnitudes. To this end, we have Xi2n\|X_{i}\|_{2}\leq\sqrt{n} by the assumption, and we can bound 1/ai1/a_{i} through (4.7). Finally, we have an a priori bound λi1\lambda_{i}\leq 1 on the coefficients of the convex combination. However, the latter bound will turn out to be too weak, and we will need λiα1/n2\lambda_{i}\lesssim_{\alpha}1/n_{2} instead. To make this happen, instead of the sets I1I_{1} and I2I_{2} we will be working on their large subsets I1I_{1}^{\prime} and I2I_{2}^{\prime} defines as

where CαC_{\alpha} is a sufficiently large quantity whose value we will choose later. By Markov’s inequality, this incurs almost no loss of coefficients:

We will perform a random selection on I1I_{1} using B. Maurey’s empirical method . Guided by the representation (4.5) of xˉ\bar{x} as a convex combination, we will treat λi\lambda_{i} as probabilities, thus introducing a random vector VV with distribution

Finally, for Cα:=Cα/αC_{\alpha}^{\prime}:=C_{\alpha}/\alpha, we consider the average of about n2n_{2} such vectors:

We would like to think of yˉ\bar{y} as a random version of the vector xˉ\bar{x}. This is certainly true in expectation:

Also, like xˉ\bar{x}, the random vector yˉ\bar{y} is a convex combination of terms Xi/aiX_{i}/a_{i} (now even with equal weights). The advantage of yˉ\bar{y} over xˉ\bar{x} is that it is a convex combination of much fewer terms, as n2/Cαn2n1n_{2}/C_{\alpha}^{\prime}\ll n_{2}\leq n_{1}. In the next two steps, we will check that yˉ\bar{y} is similar to xˉ\bar{x} in the sense that its norm is also well bounded above, and at least n2\sim n_{2} of the inner products Xi/ai,yˉ\langle X_{i}/a_{i},\bar{y}\rangle are still nicely bounded below.

Step 3: control of the norm. By independence, we have

where the last inequality follows because I1I1I_{1}^{\prime}\subseteq I_{1} and iI1λi1\sum_{i\in I_{1}}\lambda_{i}\leq 1.

Since nˉn\bar{n}\geq n, (4.7) gives us the lower bound

Together with the assumption Xi22n\|X_{i}\|_{2}^{2}\leq n, this implies that

Since xˉ22=1ln1/n2\|\bar{x}\|_{2}^{2}=1\leq ln_{1}/n_{2}, we conclude that with probability at least 0.90.9, one has

Step 4: removal of the diagonal term. We know from (4.4) that Xi/ai,xˉ1\langle X_{i}/a_{i},\bar{x}\rangle\geq 1 for many terms XiX_{i}. We would like to replace xˉ\bar{x} by its random version yˉ\bar{y}, establishing a lower bound Xk/ak,yˉ1\langle X_{k}/a_{k},\bar{y}\rangle\geq 1 for many terms XkX_{k}. But at the same time, our main goal is decoupling, in which we would need to make the random vector yˉ\bar{y} independent of those terms XkX_{k}. To make this possible, we will first remove from the sum (4.10) defining yˉ\bar{y} the “diagonal” term containing XkX_{k}, and we call the resulting vector yˉ(k)\bar{y}^{(k)}.

To make this precise, let us fix kI2I1I1k\in I_{2}^{\prime}\subseteq I_{1}^{\prime}\subseteq I_{1}. We consider independent random vectors

Similarly to the definition (4.10) of yˉ\bar{y}, we define

As we said before, we would like to show that the random variable

is bounded below by a constant with high probability. First, we will estimate its mean

To estimate the terms in the right hand side, note that Xi/ai,xˉ1\langle X_{i}/a_{i},\bar{x}\rangle\geq 1 by (4.4) and Xk22n\|X_{k}\|_{2}^{2}\leq n by the assumption. Now is the crucial point when we use that ai2a_{i}^{2} dominate λi\lambda_{i} as in the second inequality in (4.8). This allows us to bound the “diagonal” term as

Step 5: control of the inner products. We would need a stronger statement than (4.14) – that ZkZ_{k} is bounded below not only in expectation but also with high probability. We will get this immediately by Chebyshev’s inequality if we can upper bound the variance of ZkZ_{k}. In a way similar to Step 3, we estimate

Now we need to estimate the various terms in the right hand side of (4).

We start with the estimate on the inner products, collecting them into

Recall that, by the construction of λi\lambda_{i} and of I1II_{1}^{\prime}\subseteq I, we have iI1λi1\sum_{i\in I_{1}^{\prime}}\lambda_{i}\leq 1 and λiCα/n2\lambda_{i}\leq C_{\alpha}/n_{2} for iI1i\in I_{1}^{\prime}. We use Lemma 4.2 on order statistics to obtain the bound

Finally, we use our weak orthonormality assumption (4.1) to conclude that

To complete the bound on the variance of ZkZ_{k} in (4) it remains to obtain some good lower bounds on aka_{k} and aia_{i}. Since kI2I2k\in I_{2}^{\prime}\subseteq I_{2}, (4.8) yields

Similarly we can bound the coefficients aia_{i} in (4): using (4.7) we have ai2nˉ/ln1a_{i}^{2}\geq\bar{n}/{ln_{1}} since since iI1Ii\in I_{1}^{\prime}\subseteq I. But here we will not simply replace nˉ\bar{n} by nn, as we shall try to use ai2a_{i}^{2} to offset the term (N/n2)1/r(N/n_{2})^{1/r^{\prime}} in the estimate on SS. To this end, we note that nˉ(N/m)1/rm(N/n1)1/rn1\bar{n}\geq(N/m)^{1/r}m\geq(N/n_{1})^{1/r}n_{1} because mn1m\geq n_{1}. Therefore, using the last inequality in (4.6) and that NmN\geq m, we have

Combining the estimates on SS, aka_{k} and aia_{i}, we conclude our lower bound (4) on the variance of ZkZ_{k} as follows:

Combining this with the lower bound (4.14) on the expectation, we conclude by Chebyshev’s inequality the desired estimate

Step 6: decoupling. We are nearing the completion of the proof. Let us consider the good events

To show that each Ek\mathcal{E}_{k} occurs with high probability, we note that by definition of yˉ\bar{y} and yˉ(k)\bar{y}^{(k)} one has

From this and using (4.17) we conclude that

An application of Fubini theorem yields that with probability at least 0.90.9, at least (120α)I2(1-20\alpha)|I_{2}^{\prime}| of the events Ek\mathcal{E}_{k} hold simultaneously. More accurately, with probability at least 0.90.9 the following event occurs, which we denote by E\mathcal{E}. There exists a subset II2I\subseteq I_{2}^{\prime} of size I(120α)I2|I|\geq(1-20\alpha)|I_{2}^{\prime}| such that Ek\mathcal{E}_{k} holds for all kIk\in I. Note that using (4.9) and choosing CαC_{\alpha} sufficiently large we have

Recall that the norm bound (4.11) also holds with high probability 0.90.9. Hence with probability at least 0.80.8, both E\mathcal{E} and this norm bound holds. Let us fix a realization of our random variables for which this happens. Then, first of all, by definition of Ek\mathcal{E}_{k} we have

Next, we are going to observe that yˉ\bar{y} lies in the span of few vectors XiX_{i}. Indeed, by construction yˉ(k)\bar{y}^{(k)} lies in the span of the vectors Yj(k)Y_{j}^{(k)} for j[n2/Cα]j\in[n_{2}/C_{\alpha}^{\prime}]. Each such Yj(k)Y_{j}^{(k)} by construction lies in the span of the vectors XiX_{i}, iI1I1i\in I_{1}\setminus I_{1}^{\prime} and of one vector Vj(k)V_{j}^{(k)}. Finally, each such vector Vj(k)V_{j}^{(k)}, again by construction, is either equal zero or VjV_{j}, which in turn equals Xi0X_{i_{0}} for some i0ki_{0}\neq k. Since E\mathcal{E} holds, we have yˉ=yˉ(k)\bar{y}=\bar{y}^{(k)} for all kIk\in I. This implies that there exists a subset I0[m]I_{0}\subseteq[m] (consisting of the indices i0i_{0} as above) with the following properties. Firstly, I0I_{0} does not contain any of indices kIk\in I; in other words I0I_{0} is disjoint from II. Secondly, this set is small: I0n2/Cα|I_{0}|\leq n_{2}/C_{\alpha}^{\prime}. Thirdly, yˉ\bar{y} lies in the span of XiX_{i}, iI0(I1I1)i\in I_{0}\cup(I_{1}\setminus I_{1}^{\prime}). We claim that this set of indices,

satisfies the conclusion of the Proposition.

Since II and I0I_{0} are disjoint and II2I1I\subseteq I_{2}^{\prime}\subseteq I_{1}^{\prime}, it follows that II and JJ are disjoint as required. Moreover, by (4.9) and by choosing CαC_{\alpha}, CαC_{\alpha}^{\prime} sufficiently large we have

When we combine this with (4.18) and choose α\alpha sufficiently small depending on δ\delta, we achieve

as required. Finally, we claim that the normalized vector

satisfies the conclusion of the Proposition. Indeed, we already noted that yˉspan(Xj)jJ\bar{y}\in\operatorname{span}(X_{j})_{j\in J}, as required. Next, for each kII2I2k\in I\subseteq I_{2}^{\prime}\subseteq I_{2} we have

We can get rid of l2l^{2} in this estimate using the bound

where the last inequality follows by choosing α\alpha sufficiently small depending on rr, rr^{\prime\prime}. This completes the proof of Decoupling Proposition 4.1. ∎

Norms of random matrices with independent columns

In this section we apply our decoupling principle, Proposition 4.1, to estimate norms of random matrices with independent columns. As we said, a simple truncation argument of J. Bourgain reduces the approximation problem for covariance matrices to bounding the norm of the random matrix iEXiXi\sum_{i\in E}X_{i}\otimes X_{i} uniformly over index sets EE. The following result gives such an estimate for random vectors XiX_{i} with finite moment assumptions.

We can state Theorem 5.1 in terms of random matrices with independent columns.

Moreover, with the same probability all n×mn\times m submatrices BB of AA simultaneously satisfy the following for all 4mN4\leq m\leq N:

Consider the event E\mathcal{E} that both required bounds (2.5) and (5.1) hold.

Recalling (2.5) and (5.1) we see that the assumptions of Decoupling Proposition 4.1 hold for 1/r=4/p1/r=4/p, 1/r=4/q1/r^{\prime}=4/q, r=rr^{\prime\prime}=r^{\prime}, K1=K=1K_{1}=K=1, K2=CqtK_{2}=C_{q}\sqrt{t} for suitably large CqC_{q}, K3=max(K1,K2,100t2)K_{3}=\max(K_{1},K_{2},100t^{2}), and for δ=δ(p,q)>0\delta=\delta(p,q)>0 sufficiently small (to be chosen later). Applying Decoupling Proposition 4.1 we obtain disjoint index sets I,JE[N]I,J\subseteq E\subseteq[N] with sizes

and a vector ySn1span(Xj)jJy\in S^{n-1}\cap\operatorname{span}(X_{j})_{j\in J} such that

We will need to discretize the set of possible vectors yy. Let

and consider an ε\varepsilon-net NJ\mathcal{N}_{J} of the sphere Sn1span(Xj)jJS^{n-1}\cap\operatorname{span}(X_{j})_{j\in J}. As in known by a volumetric argument (see e.g. Lemma 2.6), one can choose such a net with cardinality

We can assume that the random set NJ\mathcal{N}_{J} depends only on the number ε\varepsilon, the set JJ and the random variables (Xj)jJ(X_{j})_{j\in J}. Given a vector yy as we have found above, we can approximate it with some vector y0NJy_{0}\in\mathcal{N}_{J} so that yy02ε\|y-y_{0}\|_{2}\leq\varepsilon. By (5.1) we have

This implies that all but at most δs\delta s indices ii in II satisfy the inequality

Let us denote the set of these indices by I0II_{0}\subseteq I. The bound in (5.3) can be simplified as

Indeed, this estimate follows from the two bounds

In particular, by choosing δ=δ(q)>0\delta=\delta(q)>0 sufficiently small, (5.3) implies

Together with (5.2) this yields by triangle inequality that

Summarizing, we have shown that the event {E0c and E}\{\mathcal{E}_{0}^{c}\text{ and }\mathcal{E}\} implies the following event: there exists a number sNs\leq N, disjoint index subsets I0,J[N]I_{0},J\subseteq[N] with sizes I0(1δ)s|I_{0}|\geq(1-\delta)s, Jδs|J|\leq\delta s, and a vector y0NJy_{0}\in N_{J} such that

It will now be easy to estimate the probability of this event. First of all, for each fixed vector y0Sn1y_{0}\in S^{n-1} and each index ii, the moment assumptions (2.2) imply via Markov’s inequality that

where the last line follows from our choice of qq and rr^{\prime\prime}. By independence, for each fixed vector y0Sn1y_{0}\in S^{n-1} and a fixed index set I0[N]I_{0}\subseteq[N] of size I0(1δ)s|I_{0}|\geq(1-\delta)s we have

Then we bound the probability of event {E0c and E}\{\mathcal{E}_{0}^{c}\text{ and }\mathcal{E}\} by taking the union bound over all ss, I0I_{0}, JJ as above, conditioning on the random variables (Xj)jJ(X_{j})_{j\in J} (which fixes the ε\varepsilon-net NJ\mathcal{N}_{J}), taking the union bound over the choice of y0NJy_{0}\in\mathcal{N}_{J}, and finally evaluating the probability for using (5.4). This way we obtain via Stirling’s approximation of the binomial coefficients that

This completes the proof of Theorem 5.1. ∎

Approximating covariance matrices

In this final section, we deduce our main result on the approximation of covariance matrices for random vectors with finite moments.

In our proof of Theorem 6.1, we can clearly assume that K=L=1K=L=1 in the moment assumptions (2.2) by rescaling the vectors XiX_{i}. So in the rest of this section we suppose XiX_{i} are such random vectors.

For a level B>0B>0 and a vector xSn1x\in S^{n-1}, we consider the (random) index set of large coefficients

Let t1t\geq 1. With probability at least 1Ct0.9q1-Ct^{0.9q}, one has

Solving for EB|E_{B}| we obtain the bound as in the conclusion. ∎

The truncation argument described in in the beginning of proof of Proposition 4.3 reduces the problem to estimating the contribution to the sum of large coefficients. Denote

The truncation argument yields that for every B1B\geq 1, one has with probability at least 1δ/31-\delta/3 that

so that, using Lemma 6.2, with probability at least 1δ/31-\delta/3 we have

It remains to estimate the right hand side of (6) using (6.3).

An estimate of I2I_{2} follows from Theorem 5.1 for some p=p(q)(4,q)p=p(q)\in(4,q) to be determined later. Note that enlarging EBE_{B} can only make I2I_{2} and I3I_{3} larger. So without loss of generality we can assume that EB4|E_{B}|\geq 4 as required in Theorem 5.1. This way, we obtain with probability at least 1δ/31-\delta/3 that

Since we are free to choose p=p(q)p=p(q) in the interval (4,q)(4,q), we choose the middle of the interval, p=(q+4)/2p=(q+4)/2. Returning to (6) we conclude that

This completes the proof of Theorem 6.1. ∎

References