Covariance estimation for distributions with $2+\varepsilon$ moments
Nikhil Srivastava, Roman Vershynin
Introduction
Our goal is to estimate from a sample taken from the same distribution as . A classical unbiased estimator for is the sample covariance matrix
A basic question is to determine the minimal sample size which guarantees that is accurately estimated by . More precisely, for a given accuracy , we are interested in the minimal so that
where denotes the spectral (operator) norm. Replacing by and by , we reduce the problem to the distributions for which , that is, to isotropic distributions.
2 Sampling from isotropic distributions
For obvious-dimensional reasons, one must have . Rudelson’s remarkably general result (R , see Vtutorial , Section 4.3) yields that if almost surely, then
where the notation hides the dependence on here and thereafter. It is well known that the logarithmic oversampling factor cannot be removed from (1) in general, for example, if the distribution is supported on points; see Section 1.8.
holds for every distribution that satisfies
This result can be obtained by a standard covering argument; see Vtutorial , Section 4.3.
It is an open problem to describe the distributions for which the logarithmic oversampling is not needed, that is, for which . The gap between sub-Gaussian distributions where this bound holds and discrete distributions on points where it fails is quite large.
It is already a difficult problem to relax the sub-Gaussian moment assumption (2) to anything weaker while keeping . A major step was made by Adamczak et al. ALPT , who showed that still holds (in fact, with high probability) under the sub-exponential moment assumptions
The second author of the present paper speculated in Vcovariance that should hold for a much wider class of distributions than sub-exponential, perhaps for all distributions with moments. (The second moment—the variance—is assumed to be finite by the nature of the problem, as otherwise the covariance matrix is not defined.) The goal of the the current paper is to provide a result of this type.
3 Covariance estimation
Returning to the covariance estimation problem, we deduce the following.
the sample covariance matrix obtained from independent copies of satisfies
This result follows by applying Theorem 1.1 for the independent copies of the random vectors instead of , and by multiplying the matrix in (4) by on the left and on the right. Thus, for distributions satisfying (SR) we conclude that the minimal sample size for the covariance estimation is .
Let us illustrate these results with two important examples.
4 Sampling from log-concave distributions and convex sets
where are absolute constants. This is obviously stronger than assumption (SR), so Corollary 1.2 applies.
We conclude that the minimal sample size for estimating the covariance matrix of a log-concave distribution is . This matches the bound obtained by Adamczak et al. ALPT , though it should be noted that the guarantee of ALPT holds with probability that converges to exponentially fast as , whereas ours holds only in expectation. We have not tried to obtain probability bounds of this type; note, however, that under our general assumption (SR), the probability cannot converge to faster than at a polynomial rate in .
5 Sampling from product distributions
A distribution does not have to be log-concave in order to satisfy the regularity assumptions in Theorem 1.1 and Corollary 1.2. For example, all product distributions with finite moments have the required regularity property. We can deduce this from the following thin shell estimate:
The factor implicit in (5) depends only on and on the bound on the th moments.
The proof of Proposition 1.3 is given in the Appendix.
Applying Chebyshev’s inequality together with (5), we obtain for that
Thus for we get a sub-linear tail, as required in the regularity assumption (SR).
6 Extreme eigenvalues
Theorem 1.1 states that, for sufficiently large , all eigenvalues of the sample covariance matrix are concentrated near . It is easy to extend this to a result that holds for all , as follows.
Here , and , denote the smallest and the largest eigenvalues of , respectively.
We deduce this result in Section 3. One can view (6) as a nonasymptotic form of the Bai–Yin law for the extreme eigenvalues of sample covariance matrices BY . This law, associated with the works of Geman, Bai, Yin, Krishnaiah and Silverstein applies for product distributions, specifically for random vectors with i.i.d. components with zero mean, unit variance and finite fourth moment. For such distributions one has asymptotically almost surely that
as and ; see the rigorous statement in BY . This limit law is sharp. On the other hand, inequalities (6) hold in any fixed dimensions and for general distributions (as in Theorem 1.1), without any independence requirements for the coordinates.
Comparing (6) with (7) one can ask about the optimal value of the exponent , in particular whether . In a recent paper ALPTsharp , Adamczak et al. obtained the optimal exponent for log-concave distributions, and more generally for sub-exponential distributions in the sense of (1.2). As (1.2) implies (SR) with and , Theorem 1.1 recovers a bound of as .
[(Random matrices with independent rows)] Corollary 1.4 can be interpreted as a result about the spectrum of random matrices with independent rows. Indeed, if is the matrix with rows , then . So the singular values of the matrix are the same as the eigenvalues of the matrix , and they are controlled as in (6). In particular, under the regularity assumption (SR) on we obtain that
where , and is as in Corollary 1.4.
7 Smallest eigenvalue
Our proof of Theorem 1.1 consists of two separate arguments for upper and lower bounds for the spectrum of the sample covariance matrix. It turns out that the full power of the strong regularity assumption (SR) is not needed for the lower bound. It suffices to assume moments for one-dimensional marginals rather than for marginals in all dimensions. This is only slightly stronger than the isotropy assumption, which fixes the second moments of one-dimensional marginals, and it broadens the class of distributions for which the result applies. We state this as a separate theorem.
the minimum eigenvalue of the sample covariance matrix satisfies
[(Moments vs. tails)] We have chosen to write (WR) in terms of moments rather than in terms of tail bounds as in (SR). By integration of the tails one can check that, for any given , (SR) with parameter implies (WR) with parameter .
In the remainder of the paper we will use (WR) for theorems regarding only the smallest eigenvalue and (SR) for theorems which involve the largest one.
[(Product distributions with moments)] Many distributions of interest satisfy (WR). For example, let have i.i.d. components with zero mean, unit variance and finite moment. Then a standard application of symmetrization and Khintchine’s inequality (or a direct application of Rosenthal’s inequality Ros , see FHJSZ ) shows that one-dimensional marginals of also have bounded moments; that is, (WR) holds.
In the context of the Bai–Yin law discussed in Section 1.6, this indicates that the smallest eigenvalue of a random matrix can be approximately controlled [as in (6)] even if the fourth moment is infinite. However, as we already recalled, four moments are necessary to control the largest eigenvalue in the classical Bai–Yin law BSY .
[(Covariance estimation)] Theorem 1.5 can be used to obtain a lower estimate for the covariance matrix under the weak regularity assumption (WR).
8 Optimality of the regularity assumptions
Let us briefly mention two simple and known examples that illustrate the role of regularity assumptions (SR) and (WR) in the control of the largest and smallest eigenvalues, respectively.
For the largest eigenvalue as in Theorem 1.1, it is not sufficient to put a regularity assumption of the type (SR) only on one-dimensional marginals, as it is done in Theorem 1.5 for the smallest eigenvalue. Even the following very strong (exponential) moment assumption is insufficient:
which contradicts the conclusion of Theorem 1.1. This example is essentially due to Aubrun; see ALPT , Remark 4.9.
It is not clear whether Theorem 1.1 would hold if, in addition to moments on one-dimensional marginals, one puts a total boundedness assumption
A conjecture of this type is discussed in Vcovariance where a version of the theorem is proved under this assumption, with but with an additional oversampling factor.
9 The argument: Randomizing the spectral sparsifier
Our proof of Theorem 1.1 consists of randomizing the spectral sparsifier invented by Batson, Spielman and Srivastava BSS ; see SPhD . The randomization makes the spectral sparsifier appear naturally in the context of random matrix theory. The method is based on evaluating the Stieltjes transform of while making rank one updates. However, in contrast to typical methods of random matrix theory (and to the spectral sparsifier itself), we shall evaluate the Stieltjes transform at random real points.
Let us illustrate the method by working out a crude upper bound for the largest eigenvalue of . Equivalently, we want to show that a general Wishart matrix has all eigenvalues bounded by . We evaluate the Stieltjes transform
where denote the eigenvalues of . This function has singularities at the points , and it vanishes at infinity. So the largest eigenvalue of is the largest where . However, such is difficult to compute. So we soften this quantity by considering the largest number that satisfies
where is a fixed sensitivity parameter, for example, .
This is the same problem as in BSS , except the eigenvalues and hence the soft spectral edge are now random points. The randomized problem is more difficult as we note below.
As opposed to the largest eigenvalue of , the soft spectral edge can be computed inductively using rank-one updates to the matrix; will move to the right by a random amount at each step as we replace by . Initially, = 0 so . It suffices to prove that the moves by on average at each step:
This reduces proving (12) to a probabilistic problem, which is essentially governed by the distribution of the random vector .
The difficulty is that we are facing a nonlinear inverse problem. Indeed, for a fixed it is not difficult to compute the expectation of from (13), and in particular to bound the expectation by ; this is done in BSS . However, we require the identity to hold deterministically, because the largest that satisfies it defines the soft spectral edge of as in (11). The task of computing the expectation of a random number for which is a highly nonlinear inverse problem BN , Section 4.1. This is where some regularity of with respect to the eigenstructure of becomes essential. A technical part of our argument developed in most of the remaining sections is to realize and prove that a small amount or regularity encoded by (SR) or (WR) is already sufficient to control the solution to the inverse problem, and ultimately to control the spectral edges of .
10 Organization of the paper
The rest of the paper is organized as follows. We start with the somewhat simpler Theorem 1.5 for the smallest eigenvalue in Section 2. A corresponding result for the largest eigenvalue, Theorem 3.1, is proved in Section 3. Corollary 1.4 is also deduced in Section 3. Combining Theorems 1.5 and 3.1 in Section 4, we obtain the main Theorem 1.1 on the spectral norm. In the Appendix, we prove Proposition 1.3 on the regularity of product distributions.
The lower edge
We begin by proving Theorem 1.5 about the the lower edge of the spectrum, which is slightly simpler and requires fewer assumptions than the upper edge. As in BSS , the tool that we use to do this is the lower Stieltjes transform
where . Then for every symmetric matrix , one has
Iterating Theorem 2.1 easily yields a proof of Theorem 1.5 as follows. {pf*}Proof of Theorem 1.5 Let and for . Setting , we find that
Applying Theorem 2.1 inductively to , we find that
where we take the conditional expectation with respect to the random vector , given the random vectors , that is, given . Summing up these bounds yields
For , the bound becomes . Substituting the value of and replacing by gives the promised result.
We begin by reducing the feasibility for a shift to an inequality involving two quadratic forms. The following lemma appeared in BSS , and we include it with a proof for completeness.
isTo ease the notation, we sometimes write instead of .
Combining these estimates, we see that (2) holds as long as
which we can rearrange into (3) observing that all quadratic forms involved are positive.
The proof is based on regularity properties of the quadratic forms and , which we state in the following two lemmas.
;
.
i(i) Let denote the eigenvectors of ; then
Using these for every term in (4), we complete the proof of (i).
(ii) Similar to (i), noting that the numerator and denominator of are increasing in .
(i) As in the proof of the previous lemma, let denote the eigenvectors of . By isotropy we have
For the moment bound we use Minkowski’s inequality to obtain
We can now finish the proof of Lemma 2.3.
Proof of Lemma 2.3 First observe that by construction,
If either of the indicators in the definition of the shift is zero, then , which is trivially feasible, and we are done. So assume both indicators are nonzero, that is, and . By Lemma 2.2, it suffices to prove inequality (3), which is equivalent to
We can show this by replacing with zero using Lemma 2.4:
We now complete the proof of Theorem 2.1 by using the regularity properties of to show that the expectation of , as defined in Lemma 2.3, is large. Roughly speaking, this happens because (1) is defined to be slightly less than whenever both and are not too large; (2) that event occurs with very high probability when is sufficiently small; (3) the expectation of equals .
The upper edge
In this section we establish the following estimate for the expected largest eigenvalue, analogous to Theorem 1.5 for the smallest one.
the maximum eigenvalue of the sample covariance matrix satisfies
We shall control the largest eigenvalue of a symmetric matrix using the (upper) Stieltjes transform
Similarly to our argument for the lower edge, for a sensitivity value , we define the upper soft spectral edge to be the largest for which
Suppose is an isotropic random vector satisfying the strong regularity assumption (SR) for some . Assume and
where . Then for every symmetric matrix , one has
Iterating Theorem 3.2 yields a proof of Theorem 3.1. {pf*}Proof of Theorem 3.1 The argument is similar to the proof of Theorem 1.5 given in Section 2. We set . Then we start with where , and we inductively apply Theorem 3.2 for to obtain
For , the bound becomes . Substituting the value of and replacing by gives the promised result.
The above proof works for and thus for , but it may be extended to smaller as follows.
Proof of Corollary 1.4 In the proof of Theorem 3.1, we have shown that for every and every positive integer , we have
where . Optimizing in , we apply this estimate with when and with when to obtain
Combining these, for every and we conclude that
The definition of the soft spectral edge along with monotonicity of the Stieltjes transform implies that
As in our argument for the lower edge, we begin by reducing the feasibility for a shift to an inequality involving two quadratic forms.
Note that so that all quadratic forms are positive, and assume since otherwise the claim is trivial. As in the proof of Lemma 2.2, we use the Sherman–Morisson formula to write
Rearranging reveals that exactly when (3.3) holds.
for all positive matrices (this can be seen, e.g., using the Courant–Fischer theorem). Applying this fact to (12), we see that it suffices to have
which follows from (3.3) and .
We will reason about the two quantities and separately, producing two separate shifts and for them and eventually combining these into a single , as required by Lemma 3.3.
For some fixed parameter , let us define and to be the smallest nonnegative numbers such which satisfy
For and for a random vector , Lemmas 3.4 and 3.6 will allow us to control the expected value of each of these shifts, so
whenever the sensitivity parameter is sufficiently small. From this we will obtain Theorem 3.2 quickly as follows.
Proof of Theorem 3.2 Let , so the condition of Lemma 3.3 holds. Consider the shifts and defined above. By (13), we have
Moreover, a quick inspection of the quadratic forms in Lemma 3.3 shows that and are decreasing in , and hence
Then Lemma 3.3 guarantees that is a feasible upper shift, which implies by (10) that
Furthermore, (14) yields a bound on the expected shift
which gives conclusion (8) of Theorem 3.2.
It remains to note that Lemmas 3.4 and 3.6 only guarantee that the bounds (14) hold when the sensitivity is sufficiently small, namely . With , we can simplify this inequality into the assumption of Theorem 3.2.
The rest of this section is devoted to controlling the shifts and .
It is easy to check that the proofs of Lemmas 3.4 and 3.6 which follow, and consequently Theorem 3.2, only require
then the shift satisfies
Let and denote the eigenvectors and eigenvalues of , and let . We know that , and is the smallest nonnegative number satisfying
Rescaling everything by and setting so that
the problem becomes equivalent to bounding the least for which
Applying the following, somewhat more general, probabilistic lemma to , we conclude that
Substituting gives the promised bound.
for all subsets and some constants . Consider positive numbers such that
Let be the minimal positive number such that
For simplicity of calculations, assume for the moment that the values of all are dyadic, that is,
and is the smallest positive number such that
We estimate by replacing it with a bigger but easier quantity . Define to be the smallest positive number such that, for every dyadic , one has
the definition of given in (17) yields
Let . For every , one has
Since by definition, we have
The promised bound for general (nondyadic) follows by rounding each down to the nearest power of and replacing by .
[[Necessity of the strong regularity assumption (SR)]] The preceding lemma is the only place in the proof where the full power of (SR) is used. To see that it is necessary, consider the following situation. Fix any , and let so that . Then the smallest for which is just
then the shift satisfies
It will be more convenient to work with the quadratic form
The reason for working with rather than directly with in Lemma 3.3 is that is decreasing in ; this monotonicity is required when arguing that the maximum of the two shifts is feasible in the proof of Theorem 3.2.
We begin by recording some regularity properties of .
;
(i) is analogous to Lemma 2.4. In a similar way, we show that all eigenvalues of satisfy , which implies the comparison inequality
Denoting the eigenvectors of , we express
i(ii) We note that (20) can be rearranged as a convex combination of .
(iii) We apply Minkowski’s inequality to obtain
Now a simple integration of tails implies that each
which concludes the proof. Next, we see how the regularity properties of translate into the corresponding properties of :
(i) By definition of and using (19), we have for all ,
This probability can be controlled using Lemma 3.7(iii) and Markov’s inequality, so we obtain
as . Integration of tails yields
(ii) Let denote the smaller solution of the quadratic equation
whenever a solution exists. In this case and Lemma 3.7(i) yields that
By (19), this yields . By definition of , this in turn implies that
An elementary calculation shows that if , then the solution exists and satisfies
where we used Lemma 3.7(i) in the last step.
We can now complete the proof of Lemma 3.6.
By Lemma 3.8(ii), we have . Next, we estimate using Hölder’s inequality,
The two terms here can be estimated using Lemma 3.8(i) and Lemma 3.7 along with Markov’s inequality,
Finally, we set and use the assumptions and to conclude that . Together with this implies
Although for convenience of application Lemma 3.6 is stated under the strong regularity assumption (SR), the latter is not used in the proof. The argument above uses only the weak regularity assumption (WR).
The spectral norm
In this section we prove Theorem 1.1 by showing that whenever are independent and satisfy (SR), the spectral norm estimate
obtained in Theorems 1.5 and 3.1. The basic idea is to show using independence that
is concentrated near its expectation of . Combining this with
which follows immediately from (22), yields (21).
We rely on the following elementary proposition regarding sums of independent random variables.
Postponing the proof of Proposition 4.1, we use this fact to control
Proof of Theorem 1.1 Assume the random vectors are isotropic and satisfy (SR) with parameters . This implies that the random variables
satisfy the requirements of Proposition 4.1 with parameters . It follows that
Replacing by and taking
always satisfies (26). This completes the proof of the theorem.
Proof of Proposition 4.1 Fix a parameter , and decompose
By Jensen’s inequality, independence and the bound on , we have
Moreover, by triangle and Jensen’s inequalities,
Choosing and using the assumption on , one easily checks that
Appendix: Proof of Proposition 1.3
In this section we prove Proposition 1.3, which states that product distributions satisfy the regularity assumption in Theorem 1.1. Note that this result and its proof are not needed in the proof of Theorem 1.1.
The contribution of the diagonal of to this sum is
Denote by the matrix with diagonal removed; then
This inequality can be obtained from general decoupling results; see dlPG , Theorem 3.1.1; a simple and well-known proof of (2) is given in Vdecoupling .
Therefore, by conditioning on we obtain from (2) that
Since equals without the diagonal, the triangle inequality yields
Since , we can replace by , so
Putting (1), (3) and (4) together, we arrive at the inequality
Put in different words, the random variable satisfies the inequality
Solving this quadratic inequality we obtain that
In order to bound we consider
Finally, by definition of and using the triangle inequality and bounds (7), (6), we conclude that
Acknowledgments
The authors are grateful to the referees whose comments improved the presentation of the paper.