How close is the sample covariance matrix to the actual covariance matrix?
Roman Vershynin
Introduction
Estimation of covariance matrices of high dimensional distributions is a basic problem in multivariate statistics. It arises in diverse applications such as signal processing , genomics , financial mathematics , pattern recognition , geometric functional analysis and computational geometry . The classical and simplest estimator of a covariance matrix is the sample covariance matrix. Unfortunately, the spectral theory of sample covariance matrices has not been well developed except for product distributions (or affine transformations thereof) where one can rely on random matrix theory for matrices with independent entries. This paper addresses the following basic question: how well does the sample covariance matrix approximate the actual covariance matrix in the operator norm?
The use of the operator norm in this problem allows one a good grasp of the spectrum of , as each eigenvalue of would lie within from the corresponding eigenvalue of .
It is common for today’s applications to operate with increasingly large number of parameters , and to require that sample sizes be moderate compared with . As we impose no a priori structure on the covariance matrix, we must have for dimension reasons. Note that for some structured covariance matrices, such as sparse or having an off diagonal decay, one can sometimes achieve smaller than and even comparable to , by transforming the sample covariance matrix in order to adhere to the same structure (e.g. by shrinkage of eigenvalues or thresholding of entries). We will not consider structured covariance matrices in this paper; see e.g. and .
2. Two examples
The most extensively studied model in random matrix theory is where is a random vector with independent coordinates. However, independence of coordinates can not be justified in some important applications, and in this paper we shall consider general random vectors. Let us illustrate this point with two well studied examples.
Suppose further that the covariance matrix of can be approximated by the sample covariance matrix for some moderate sample size . Such an approximation means simply that a random subset of vectors taken from the tight frame independently and with replacement is still an approximate tight frame:
In other words, a small random subset of a tight frame is still an approximate tight frame; the size of this subset does not even depend on the frame size . For applications of this type of results in communications see .
3. Sub-gaussian and sub-exponential distributions
then the optimal sample size is .
4. Distributions with finite moments
then the minimal sample size that guarantees approximation (1.1) is . The second moment assumption in (1.5) is very weak; it is equivalent to the boundedness of the covariance matrix, . The logarithmic oversampling factor is necessary in this extremely general result, as can be seen from the example of the uniform distribution on the set of vectors of Euclidean length . The coupon collector’s problem calls for the size in order for the sample to contain all these vectors, which is obviously required for a nontrivial covariance approximation.
In this paper we prove the Conjecture up to an iterated logarithmic factor.
1. The notation means that where depends only on the parameters ; see Section 2.3 for more notation. The logarithms are to the base . We put the restriction only to ensure that ; Theorem 1.2 and other results below clearly hold for dimensions even without the iterated logarithmic factors.
2. It follows that for every , the desired approximation is guaranteed if the sample has size
3. A similar result holds for independent random vectors that are not necessarily identically distributed; we will prove this general result in Theorem 6.1.
4. The boundedness assumption in (1.6) can often be weakened or even dropped by a simple modification of Theorem 1.2. This happens, for example, if holds with high probability, as one can apply Theorem 1.2 conditionally on this event. We refer the reader to a thorough discussion of the boundedness assumption in Section 1.3 of .
5. Extreme eigenvalues of sample covariance matrices
Theorem 1.2 can be used to analyze the spectrum of sample covariance matrices . The case when the random vector has i.i.d. coordinates is most studied in random matrix theory. Suppose that both while the aspect ratio . If the coordinates of have unit variance and finite fourth moment, then clearly . The largest eigenvalue then converges a.s. to , and the smallest eigenvalue converges a.s. to , see . For more on the extreme eigenvalues in both asymptotic regime () and non-asymptotic regime ( fixed), see .
Without independence of the coordinates, analyzing the spectrum of sample covariance matrices becomes significantly harder. Suppose that . For sub-exponential distributions, i.e. those satisfying (1.4), it was proved in that
(A weaker version with extra factors was proved earlier by the same authors in .) Under only finite moment assumption (1.6), Theorem 1.2 clearly yields
Note that for large exponents , the factor becomes close to .
6. Norms of random matrices with independent columns
This follows from a result leading to Theorem 1.2, see Corollary 5.2. The bound is optimal up to the factor for matrices with i.i.d. entries, because the operator norm is bounded below by the Euclidean norm of any column and any row. For random matrices with independent entries, estimate (1.7) follows (under the fourth moment assumption) from more general bounds by Seginer and Latala , and even without the factor. Without independence of entries, this bound was proved by the author for products of random matrices with independent entries and deterministic matrices, and also without the factor.
7. Organization of the rest of the paper
The heart of the paper are Sections 3 and 4. In Section 3 we study the structure of series that diverge faster than the iterated logarithm. This structure is used in Section 4 to deduce a decoupling principle. In Section 5 we apply the decoupling principle to norms of random matrices. Specifically, in Theorem 5.1 we estimate the norm of uniformly over subsets . We interpret this in Corollary 5.2 as a norm estimate for random matrices with independent columns. In Section 6, we deduce the general form of our main result on approximation of covariance matrices, Theorem 6.1.
Acknowledgement
The author is grateful to the referee for useful suggestions.
Outline of the method and preliminaries
In this expression we may recongize a stochastic process indexed by vectors on the sphere. For each fixed , we have to control the sum of independent random variables with finite moments. Suppose the bad event occurs – for some , this sum is significantly larger than . Unfortunately, because of the heavy tails of these random variables, the bad event may occur with polynomial rather than exponential probability . This is too weak to control these sums for all simultaneously on the -dimensional sphere, where -nets have exponential sizes in . So, instead of working with sums of independent random variables, we try to locate some structure in the summands responsible for the largeness of the sum.
More generally, we shall study the structure of divergent series , where . Let us first suppose that the series diverges faster than logarithmic function, thus
Comparing with the harmonic series we see that the non-increasing rearrangement of the coefficients at some point must be large:
In other words, one can find large terms of the sum: there exists an index set of size and such that for . This collection of large terms forms a desired structure responsible for the largeness of the series . Such a structure is well suited to our applications where are independent random variables, . Indeed, the events are independent, and the probability of each such event is easily controlled by finite moment assumptions (2.2) through Markov’s inequality. This line was developed in , but it clearly leads to a loss of logarithmic factor which we are trying to avoid in the present paper.
We will work on the next level of precision, thus studying the structure of series that diverge slower than the logarithmic function but faster than the iterated logarithm. So let us assume that
In Proposition 3.1 we will locate almost the same structure as we had for logarithmically divergent series, except up to some factor , as follows. For some there exists an index set of size , such that
2. Decoupling
The structure that we found is well suited to our application where are independent random variables . In this case we have
The probability that this happens is again easy to control using independence of for fixed , finite moment assumptions (2.2) and Markov’s inequality. Since there are number of ways to choose the subset , the probability of the event in (2.1) is bounded by
where the last inequality follows because and since .
Our next task is to unfix . The exponential probability estimate we obtained allows us to take the union bound over all in the unit sphere of any fixed -dimensional subspace, since this sphere has an -net of size exponential in . We can indeed assume without loss of generality that the vector in our structural event (2.1) lies in the span of which is -dimensional; this can be done by projecting onto this span if necessary. Unfortunately, this obviously makes depend on the random vectors and destroys the independence of random variables . This hurdle calls for a decoupling mechanism, which would make in the structural event (2.1) depend on some small fraction of the vectors . One would then condition on this fraction of random vectors and use the structural event (2.1) for the other half, which would quickly lead to completion of the argument.
Our decoupling principle, Proposition 4.1, is a deterministic statement that works for fixed vectors . Loosely speaking, we assume that the structural event (2.1) holds for some in the span of , and we would like to force to lie in the span of a small fraction of these . We write as a linear combination . The first step of decoupling is to remove the “diagonal” term from this sum, while retaining the largeness of . This task turns out to be somewhat difficult, and it will force us to refine our structural result for divergent series by adding a domination ingredient into it. This will be done at the cost of another factor. After the diagonal term is removed, the number of terms in the sum for will be reduced by a probabilistic selection using Maurey’s empirical method.
3. Notation and preliminaries
We will use the following notation throughout this paper. and will stand for positive absolute constants; will denote a quantity which only depends on the parameter , and similar notation will be used with more than one parameter. For positive numbers and , the asymptotic inequality means that . Similarly, inequalities of the form mean that . Intervals of integers will be denoted by for . The cardinality of a finite set is denoted by . All logarithms will be to the base .
We can view our goal as establishing a law of large numbers in the operator norm, and with quantitative estimates on convergence. Thus we would like to show that the approximation error
4. Sub-gaussian distributions
A solution to the approximation problem is well known and easy for sub-gaussian random vectors, those satisfying (1.3). The optimal sample size here is proportional to the dimension, thus . For the reader’s convenience, we recall and prove a general form of this result.
One should compare this with our main result, Theorem 1.2, which yields almost the same conclusion under only finite moment assumptions on the distribution, except for an iterated logarithmic factor and a slight loss of the exponent (the latter may be inevitable when dealing with finite moments).
The well known proof of Proposition 2.1 is based on Bernstein’s deviation inequality for independent random variables and an -net argument. The latter allows to replace the sphere in the computation of the norm in (2.3) by a finite -net as follows.
Let be a Hermitian matrix, and let be an -net of the unit Euclidean sphere for some . Then
Let us choose for which , and choose which approximates as . It follows by the triangle inequality that
Without loss of generality, we can assume that in the sub-gaussian assumption (1.3) we have by replacing by . Identity (2.3) expresses the norm in question as a supremum over the unit sphere . Next, Lemma 2.2 allows to replace the sphere in (2.3) by its -net at the cost of an absolute constant factor. Moreover, we can arrange so that the net has size ; this follows by a standard volumetric argument (see [17, Lemma 9.5]).
Now we unfix . Using (2.4) for each in the net , we conclude by the union bound that the event
Now if we choose , this probability is further bounded below by as required. By the reduction from the sphere to the net mentioned in the beginning of the argument, this completes the proof. ∎
A truncation argument of J. Bourgain reduces the approximation problem to finding an upper bound on
Consider random vectors which satisfy moment assumptions (2.2) for some and some , . Then, for every , with probability at least one has
For most part of our argument (through decoupling), we treat as fixed non-random vectors. The only property we require from is that they are almost pairwise orthogonal. For random vectors, an almost pairwise orthogonality easily follows from the moment assumptions (2.2):
Consider random vectors which satisfy moment assumptions (2.2) for some and some , . Then, for every , with probability at least one has
Structure of divergent series
In this section we study the structure of series which diverge slower than the logarithmic function but faster than an iterated logarithm. This is summarized in the following result.
Then there exist a positive integer and a subset of indices such that the following holds. Given a vector such that , one can find a further subset with the following two properties.
(i) (Regularity): the sizes and satisfy
Furthermore, we can make with arbitrarily large by making the dependence on implicit in the assumption (3.1) sufficiently large.
which will be crucial in our application to decoupling.
2. The freedom to choose in Proposition 3.1 ensures that the structure located in the set is in a sense hereditary; it can pass to subsets . The domination of by on will be crucial in the removal of the diagonal terms in our application to decoupling.
We now turn to the proof of Proposition 3.1. Heuristically, we will first find many (namely, ) sets on which the coefficients are large as in (ii), then choose one that satisfies the regularity condition (i). This regularization step will rely on the following elementary lemma.
Let be a positive integer. Consider a nonempty subset with size . Then, for every , there exist elements that satisfy the following two properties.
We will find , as some consecutive terms of the following geometric progression. Define to be the (unique) element such that
We will only need to consider terms of this progression, where . Since , we have . On the other hand, . It follows that .
We claim that there exists a term such that
The terms and for which (3.2) holds clearly satisfy (i) and (ii) of the conclusion. By increasing and decreasing if necessary we can assume that . This completes the proof. ∎
We shall prove the following slightly stronger statement. Consider a sufficiently large number
We shall prove that the conclusion of the Proposition holds with (ii) replaced by:
We will construct and in the following way. First we decompose the index set into blocks on which the coefficients have similar magnitude; this is possible with blocks. Using the assumption , one easily checks that many (at least ) of the blocks have large contribution (at least ) to the sum . We will only focus on such large blocks in the rest of the argument. At this point, the union of these blocks could be declared . We indeed proceed this way, except we first use Regularization Lemma 3.2 on these blocks in order to obtain the required regularity property (ii). Finally, assume we are given coefficients with small sum as in the assumption. Since the coefficients are large on by construction, the pigeonhole principle will yield (loosely speaking) a whole block of coefficients where will dominate as required, . We declare this block and complete the proof. Now we pass to the details of the argument.
Step 1: decomposition of into blocks. Without loss of generality,
Indeed, we can clearly assume that and . The estimate follows from the assumption: . Furthermore, the contribution of the small coefficients to the norm is at most , while by the assumption . Hence we can ignore these small coefficients by replacing with the subset corresponding to the coefficients .
We decompose into disjoint subsets (which we call blocks) according to the magnitude of , and we consider the contribution of each block to the norm :
By our assumptions on , there are at most nonempty blocks . As , Markov’s inequality yields for all that
Only the blocks with large contributions will be of interest to us. Their number is
and we let if it happens that all . We claim that there are many such blocks:
Indeed, by the assumption and using (3.6) we can bound
Step 2: construction of the set . As we said before, we are only interested in blocks with large contributions . We collect the indices of such blocks into the set
Since the definition of implies that , we have . Then we can apply Regularization Lemma 3.2 to the set . Thus we find two elements satisfying
has size . Since by our choice of we can assume that , we obtain
satisfies the conclusion of the Proposition.
Step 3: sizes of the coefficients for . Let us fix . From the definition of we know that the contribution is large: . One consequence of this is a good estimate of the size of the block . Indeed, the above bound together with (3.6) this implies
Another consequence of the lower bound on is the required lower bound on the individual coefficients . Indeed, by construction of the coefficients , are within the factor from each other. It follows that
In particuar, since by construciton , we have , which implies
We have thus proved the required lower bound (3.3).
Step 4: Construction of the set , and sizes of the coefficients for . Now suppose we are given a vector with . We will have to construct a subset as in the conclusion, and we will do this as follows. Consider the contribution of the block to the norm :
On the one hand, the sum of all contributions is bounded as . On the other hand, there are many terms in this sum: as we know from (3.9). Therefore, by the pigeonhole principle some of the contributions must be small: there exists such that
This in turn implies via Markov’s inequality that most of the coefficients for are small, and we shall declare these set of indices . Specifically, since and , using Markov’s inequality we see that the set
We have thus proved the required lower bound (3.4).
Step 5: the sizes of the sets and . It remains to check the regularity property (i) of the conclusion of the Proposition. We bound
We have thus proved the first inequality in (i) of the conclusion of the Proposition. Similarly, we bound
This completes the proof of (i) of the conclusion, and of the whole Proposition 3.1. ∎
Decoupling
Then the Decoupling Proposition 2.1 of implies that there exist disjoint sets of indices such that , and there exists a vector , such that
Results of this type are best suited for applications to random independent vectors . Indeed, the events that is large are independent for because does not depend on . The probability of each such event is easy to bound using the moment assumptions (2.2).
Then there exist nonempty disjoint sets of indices such that , and there exists a vector , such that
The proof of the Decoupling Proposition 4.1 will use Proposition 3.1 in order to locate the structure of the large coefficients . The following elementary lemma will be used in the argument.
It is easy to check that each extreme point of the convex set
has exactly nonzero coefficients which are equal to . Evaluating the linear form on these extreme points, we obtain
By replacing with we can assume without loss of generality that . By perturbing the vectors slightly we may also assume that are all different.
Step 1: separation and the structure of coefficients. Suppose the assumptions of the Proposition hold, and let us choose a vector which attains the supremum in (4.3). We denote
and without loss of generality we may assume that . We also denote
We choose parameter sufficiently small; its choice will become clear later on in the argument. At this point, we may assume that
We can use Structure Proposition 3.1 to locate the structure in the coefficients . To this end, we apply this result for and obtain a number and a subset of indices . We can also assume that is sufficiently large – larger than an arbitrary quantity which depends on .
Since a vector satisfies for all (in fact for all ), a separation argument for the convex hull yields the existence of a vector that satisfies
We express as a convex combination
We then read the conclusion of Structure Proposition 3.1 as follows. There exists a futher subset of indices such that the sizes and are regular in the sense that
and the coefficients on and are large:
Furthermore, we can make sufficiently large depending on , say .
Step 2: random selection. We will reduce the number of terms in the sum (4.5) defining using random selection, trying to bring this number down to about . As is usual in dealing with sums of independent random variables, we will need to ensure that all summands have controlled magnitudes. To this end, we have by the assumption, and we can bound through (4.7). Finally, we have an a priori bound on the coefficients of the convex combination. However, the latter bound will turn out to be too weak, and we will need instead. To make this happen, instead of the sets and we will be working on their large subsets and defines as
where is a sufficiently large quantity whose value we will choose later. By Markov’s inequality, this incurs almost no loss of coefficients:
We will perform a random selection on using B. Maurey’s empirical method . Guided by the representation (4.5) of as a convex combination, we will treat as probabilities, thus introducing a random vector with distribution
Finally, for , we consider the average of about such vectors:
We would like to think of as a random version of the vector . This is certainly true in expectation:
Also, like , the random vector is a convex combination of terms (now even with equal weights). The advantage of over is that it is a convex combination of much fewer terms, as . In the next two steps, we will check that is similar to in the sense that its norm is also well bounded above, and at least of the inner products are still nicely bounded below.
Step 3: control of the norm. By independence, we have
where the last inequality follows because and .
Since , (4.7) gives us the lower bound
Together with the assumption , this implies that
Since , we conclude that with probability at least , one has
Step 4: removal of the diagonal term. We know from (4.4) that for many terms . We would like to replace by its random version , establishing a lower bound for many terms . But at the same time, our main goal is decoupling, in which we would need to make the random vector independent of those terms . To make this possible, we will first remove from the sum (4.10) defining the “diagonal” term containing , and we call the resulting vector .
To make this precise, let us fix . We consider independent random vectors
Similarly to the definition (4.10) of , we define
As we said before, we would like to show that the random variable
is bounded below by a constant with high probability. First, we will estimate its mean
To estimate the terms in the right hand side, note that by (4.4) and by the assumption. Now is the crucial point when we use that dominate as in the second inequality in (4.8). This allows us to bound the “diagonal” term as
Step 5: control of the inner products. We would need a stronger statement than (4.14) – that is bounded below not only in expectation but also with high probability. We will get this immediately by Chebyshev’s inequality if we can upper bound the variance of . In a way similar to Step 3, we estimate
Now we need to estimate the various terms in the right hand side of (4).
We start with the estimate on the inner products, collecting them into
Recall that, by the construction of and of , we have and for . We use Lemma 4.2 on order statistics to obtain the bound
Finally, we use our weak orthonormality assumption (4.1) to conclude that
To complete the bound on the variance of in (4) it remains to obtain some good lower bounds on and . Since , (4.8) yields
Similarly we can bound the coefficients in (4): using (4.7) we have since since . But here we will not simply replace by , as we shall try to use to offset the term in the estimate on . To this end, we note that because . Therefore, using the last inequality in (4.6) and that , we have
Combining the estimates on , and , we conclude our lower bound (4) on the variance of as follows:
Combining this with the lower bound (4.14) on the expectation, we conclude by Chebyshev’s inequality the desired estimate
Step 6: decoupling. We are nearing the completion of the proof. Let us consider the good events
To show that each occurs with high probability, we note that by definition of and one has
From this and using (4.17) we conclude that
An application of Fubini theorem yields that with probability at least , at least of the events hold simultaneously. More accurately, with probability at least the following event occurs, which we denote by . There exists a subset of size such that holds for all . Note that using (4.9) and choosing sufficiently large we have
Recall that the norm bound (4.11) also holds with high probability . Hence with probability at least , both and this norm bound holds. Let us fix a realization of our random variables for which this happens. Then, first of all, by definition of we have
Next, we are going to observe that lies in the span of few vectors . Indeed, by construction lies in the span of the vectors for . Each such by construction lies in the span of the vectors , and of one vector . Finally, each such vector , again by construction, is either equal zero or , which in turn equals for some . Since holds, we have for all . This implies that there exists a subset (consisting of the indices as above) with the following properties. Firstly, does not contain any of indices ; in other words is disjoint from . Secondly, this set is small: . Thirdly, lies in the span of , . We claim that this set of indices,
satisfies the conclusion of the Proposition.
Since and are disjoint and , it follows that and are disjoint as required. Moreover, by (4.9) and by choosing , sufficiently large we have
When we combine this with (4.18) and choose sufficiently small depending on , we achieve
as required. Finally, we claim that the normalized vector
satisfies the conclusion of the Proposition. Indeed, we already noted that , as required. Next, for each we have
We can get rid of in this estimate using the bound
where the last inequality follows by choosing sufficiently small depending on , . This completes the proof of Decoupling Proposition 4.1. ∎
Norms of random matrices with independent columns
In this section we apply our decoupling principle, Proposition 4.1, to estimate norms of random matrices with independent columns. As we said, a simple truncation argument of J. Bourgain reduces the approximation problem for covariance matrices to bounding the norm of the random matrix uniformly over index sets . The following result gives such an estimate for random vectors with finite moment assumptions.
We can state Theorem 5.1 in terms of random matrices with independent columns.
Moreover, with the same probability all submatrices of simultaneously satisfy the following for all :
Consider the event that both required bounds (2.5) and (5.1) hold.
Recalling (2.5) and (5.1) we see that the assumptions of Decoupling Proposition 4.1 hold for , , , , for suitably large , , and for sufficiently small (to be chosen later). Applying Decoupling Proposition 4.1 we obtain disjoint index sets with sizes
and a vector such that
We will need to discretize the set of possible vectors . Let
and consider an -net of the sphere . As in known by a volumetric argument (see e.g. Lemma 2.6), one can choose such a net with cardinality
We can assume that the random set depends only on the number , the set and the random variables . Given a vector as we have found above, we can approximate it with some vector so that . By (5.1) we have
This implies that all but at most indices in satisfy the inequality
Let us denote the set of these indices by . The bound in (5.3) can be simplified as
Indeed, this estimate follows from the two bounds
In particular, by choosing sufficiently small, (5.3) implies
Together with (5.2) this yields by triangle inequality that
Summarizing, we have shown that the event implies the following event: there exists a number , disjoint index subsets with sizes , , and a vector such that
It will now be easy to estimate the probability of this event. First of all, for each fixed vector and each index , the moment assumptions (2.2) imply via Markov’s inequality that
where the last line follows from our choice of and . By independence, for each fixed vector and a fixed index set of size we have
Then we bound the probability of event by taking the union bound over all , , as above, conditioning on the random variables (which fixes the -net ), taking the union bound over the choice of , and finally evaluating the probability for using (5.4). This way we obtain via Stirling’s approximation of the binomial coefficients that
This completes the proof of Theorem 5.1. ∎
Approximating covariance matrices
In this final section, we deduce our main result on the approximation of covariance matrices for random vectors with finite moments.
In our proof of Theorem 6.1, we can clearly assume that in the moment assumptions (2.2) by rescaling the vectors . So in the rest of this section we suppose are such random vectors.
For a level and a vector , we consider the (random) index set of large coefficients
Let . With probability at least , one has
Solving for we obtain the bound as in the conclusion. ∎
The truncation argument described in in the beginning of proof of Proposition 4.3 reduces the problem to estimating the contribution to the sum of large coefficients. Denote
The truncation argument yields that for every , one has with probability at least that
so that, using Lemma 6.2, with probability at least we have
It remains to estimate the right hand side of (6) using (6.3).
An estimate of follows from Theorem 5.1 for some to be determined later. Note that enlarging can only make and larger. So without loss of generality we can assume that as required in Theorem 5.1. This way, we obtain with probability at least that
Since we are free to choose in the interval , we choose the middle of the interval, . Returning to (6) we conclude that
This completes the proof of Theorem 6.1. ∎