Covariance estimation for distributions with $2+\varepsilon$ moments

Nikhil Srivastava, Roman Vershynin

Introduction

Our goal is to estimate Σ\Sigma from a sample X1,,XNX_{1},\ldots,X_{N} taken from the same distribution as XX. A classical unbiased estimator for Σ\Sigma is the sample covariance matrix

A basic question is to determine the minimal sample size NN which guarantees that Σ\Sigma is accurately estimated by ΣN\Sigma_{N}. More precisely, for a given accuracy ε>0\varepsilon>0, we are interested in the minimal N=N(n,ε)N=N(n,\varepsilon) so that

where \|\cdot\| denotes the spectral (operator) norm. Replacing XX by Σ1/2X\Sigma^{-1/2}X and XiX_{i} by Σ1/2Xi\Sigma^{-1/2}X_{i}, we reduce the problem to the distributions for which Σ=I\Sigma=I, that is, to isotropic distributions.

2 Sampling from isotropic distributions

For obvious-dimensional reasons, one must have NnN\geq n. Rudelson’s remarkably general result (R , see Vtutorial , Section 4.3) yields that if X2=O(n)\|X\|_{2}=O(\sqrt{n}) almost surely, then

where the O()O(\cdot) notation hides the dependence on ε\varepsilon here and thereafter. It is well known that the logarithmic oversampling factor cannot be removed from (1) in general, for example, if the distribution is supported on O(n)O(n) points; see Section 1.8.

holds for every distribution that satisfies

This result can be obtained by a standard covering argument; see Vtutorial , Section 4.3.

It is an open problem to describe the distributions for which the logarithmic oversampling is not needed, that is, for which N=O(n)N=O(n). The gap between sub-Gaussian distributions where this bound holds and discrete distributions on O(n)O(n) points where it fails is quite large.

It is already a difficult problem to relax the sub-Gaussian moment assumption (2) to anything weaker while keeping N=O(n)N=O(n). A major step was made by Adamczak et al. ALPT , who showed that N=O(n)N=O(n) still holds (in fact, with high probability) under the sub-exponential moment assumptions

The second author of the present paper speculated in Vcovariance that N=O(n)N=O(n) should hold for a much wider class of distributions than sub-exponential, perhaps for all distributions with 2+ε2+\varepsilon moments. (The second moment—the variance—is assumed to be finite by the nature of the problem, as otherwise the covariance matrix is not defined.) The goal of the the current paper is to provide a result of this type.

3 Covariance estimation

Returning to the covariance estimation problem, we deduce the following.

the sample covariance matrix ΣN\Sigma_{N} obtained from NN independent copies of XX satisfies

This result follows by applying Theorem 1.1 for the independent copies of the random vectors Zi=Σ1/2XiZ_{i}=\Sigma^{-1/2}X_{i} instead of XiX_{i}, and by multiplying the matrix 1Ni=1NXiXiTI\frac{1}{N}\sum_{i=1}^{N}X_{i}X_{i}^{T}-I in (4) by Σ1/2\Sigma^{1/2} on the left and on the right. Thus, for distributions satisfying (SR) we conclude that the minimal sample size for the covariance estimation is N=O(n)N=O(n).

Let us illustrate these results with two important examples.

4 Sampling from log-concave distributions and convex sets

where C,c>0C,c>0 are absolute constants. This is obviously stronger than assumption (SR), so Corollary 1.2 applies.

We conclude that the minimal sample size for estimating the covariance matrix of a log-concave distribution is N=O(n)N=O(n). This matches the bound obtained by Adamczak et al. ALPT , though it should be noted that the guarantee of ALPT holds with probability that converges to 11 exponentially fast as nn\to\infty, whereas ours holds only in expectation. We have not tried to obtain probability bounds of this type; note, however, that under our general assumption (SR), the probability cannot converge to 11 faster than at a polynomial rate in nn.

5 Sampling from product distributions

A distribution does not have to be log-concave in order to satisfy the regularity assumptions in Theorem 1.1 and Corollary 1.2. For example, all product distributions with finite 4+ε4+\varepsilon moments have the required regularity property. We can deduce this from the following thin shell estimate:

The factor implicit in (5) depends only on pp and on the bound on the (2p)(2p)th moments.

The proof of Proposition 1.3 is given in the Appendix.

Applying Chebyshev’s inequality together with (5), we obtain for tkt\geq k that

Thus for p>2p>2 we get a sub-linear tail, as required in the regularity assumption (SR).

6 Extreme eigenvalues

Theorem 1.1 states that, for sufficiently large NN, all eigenvalues of the sample covariance matrix ΣN=1Ni=1NXiXiT\Sigma_{N}=\frac{1}{N}\sum_{i=1}^{N}X_{i}X_{i}^{T} are concentrated near 11. It is easy to extend this to a result that holds for all NN, as follows.

Here c=η2η+2c=\frac{\eta}{2\eta+2}, C1=512(16C)1+2/η(6+6/η)1+4/ηC_{1}=512(16C)^{1+2/\eta}(6+6/\eta)^{1+4/\eta} and λmin(ΣN)\lambda_{\min}(\Sigma_{N}), λmax(ΣN)\lambda_{\max}(\Sigma_{N}) denote the smallest and the largest eigenvalues of ΣN\Sigma_{N}, respectively.

We deduce this result in Section 3. One can view (6) as a nonasymptotic form of the Bai–Yin law for the extreme eigenvalues of sample covariance matrices BY . This law, associated with the works of Geman, Bai, Yin, Krishnaiah and Silverstein applies for product distributions, specifically for random vectors X=(ξ1,,ξn)X=(\xi_{1},\ldots,\xi_{n}) with i.i.d. components ξi\xi_{i} with zero mean, unit variance and finite fourth moment. For such distributions one has asymptotically almost surely that

as nn\to\infty and n/Ny[0,1)n/N\to y\in[0,1); see the rigorous statement in BY . This limit law is sharp. On the other hand, inequalities (6) hold in any fixed dimensions N,nN,n and for general distributions (as in Theorem 1.1), without any independence requirements for the coordinates.

Comparing (6) with (7) one can ask about the optimal value of the exponent cc, in particular whether c=1/2c=1/2. In a recent paper ALPTsharp , Adamczak et al. obtained the optimal exponent c=1/2c=1/2 for log-concave distributions, and more generally for sub-exponential distributions in the sense of (1.2). As (1.2) implies (SR) with η=(p1)/2\eta=(p-1)/2 and C(O(p))pC\leq(O(p))^{p}, Theorem 1.1 recovers a bound of c=1/21/(p+1)=1/2o(1)c=1/2-1/(p+1)=1/2-o(1) as pp\to\infty.

[(Random matrices with independent rows)] Corollary 1.4 can be interpreted as a result about the spectrum of random matrices with independent rows. Indeed, if AA is the matrix with rows XiX_{i}, then ΣN=1Ni=1NXiXiT=1NATA\Sigma_{N}=\frac{1}{N}\sum_{i=1}^{N}X_{i}X_{i}^{T}=\frac{1}{N}A^{T}A. So the singular values of the matrix 1NA\frac{1}{\sqrt{N}}A are the same as the eigenvalues of the matrix ΣN\Sigma_{N}, and they are controlled as in (6). In particular, under the regularity assumption (SR) on XiX_{i} we obtain that

where C2=2C1C_{2}=\sqrt{2C_{1}}, and C1C_{1} is as in Corollary 1.4.

7 Smallest eigenvalue

Our proof of Theorem 1.1 consists of two separate arguments for upper and lower bounds for the spectrum of the sample covariance matrix. It turns out that the full power of the strong regularity assumption (SR) is not needed for the lower bound. It suffices to assume 2+η2+\eta moments for one-dimensional marginals rather than for marginals in all dimensions. This is only slightly stronger than the isotropy assumption, which fixes the second moments of one-dimensional marginals, and it broadens the class of distributions for which the result applies. We state this as a separate theorem.

the minimum eigenvalue of the sample covariance matrix ΣN=1Ni=1NXiXiT\Sigma_{N}=\frac{1}{N}\sum_{i=1}^{N}X_{i}X_{i}^{T} satisfies

[(Moments vs. tails)] We have chosen to write (WR) in terms of moments rather than in terms of tail bounds as in (SR). By integration of the tails one can check that, for any given η>0\eta>0, (SR) with parameter CC implies (WR) with parameter C=C(2+2/η)C^{\prime}=C(2+2/\eta).

In the remainder of the paper we will use (WR) for theorems regarding only the smallest eigenvalue and (SR) for theorems which involve the largest one.

[(Product distributions with 2+η2+\eta moments)] Many distributions of interest satisfy (WR). For example, let X=(ξ1,,ξn)X=(\xi_{1},\ldots,\xi_{n}) have i.i.d. components ξi\xi_{i} with zero mean, unit variance and finite (2+η)(2+\eta) moment. Then a standard application of symmetrization and Khintchine’s inequality (or a direct application of Rosenthal’s inequality Ros , see FHJSZ ) shows that one-dimensional marginals of XX also have bounded (2+η)(2+\eta) moments; that is, (WR) holds.

In the context of the Bai–Yin law discussed in Section 1.6, this indicates that the smallest eigenvalue of a random matrix can be approximately controlled [as in (6)] even if the fourth moment is infinite. However, as we already recalled, four moments are necessary to control the largest eigenvalue in the classical Bai–Yin law BSY .

[(Covariance estimation)] Theorem 1.5 can be used to obtain a lower estimate for the covariance matrix under the weak regularity assumption (WR).

8 Optimality of the regularity assumptions

Let us briefly mention two simple and known examples that illustrate the role of regularity assumptions (SR) and (WR) in the control of the largest and smallest eigenvalues, respectively.

For the largest eigenvalue as in Theorem 1.1, it is not sufficient to put a regularity assumption of the type (SR) only on one-dimensional marginals, as it is done in Theorem 1.5 for the smallest eigenvalue. Even the following very strong (exponential) moment assumption is insufficient:

which contradicts the conclusion of Theorem 1.1. This example is essentially due to Aubrun; see ALPT , Remark 4.9.

It is not clear whether Theorem 1.1 would hold if, in addition to (2+η)(2+\eta) moments on one-dimensional marginals, one puts a total boundedness assumption

A conjecture of this type is discussed in Vcovariance where a version of the theorem is proved under this assumption, with η=2\eta=2 but with an additional (loglogn)O(1)(\log\log n)^{O(1)} oversampling factor.

9 The argument: Randomizing the spectral sparsifier

Our proof of Theorem 1.1 consists of randomizing the spectral sparsifier invented by Batson, Spielman and Srivastava BSS ; see SPhD . The randomization makes the spectral sparsifier appear naturally in the context of random matrix theory. The method is based on evaluating the Stieltjes transform of ΣN\Sigma_{N} while making rank one updates. However, in contrast to typical methods of random matrix theory (and to the spectral sparsifier itself), we shall evaluate the Stieltjes transform at random real points.

Let us illustrate the method by working out a crude upper bound O(1)O(1) for the largest eigenvalue of ΣN\Sigma_{N}. Equivalently, we want to show that a general Wishart matrix AN:=NΣN=i=1NXiXiTA_{N}:=N\Sigma_{N}=\sum_{i=1}^{N}X_{i}X_{i}^{T} has all eigenvalues bounded by O(N)O(N). We evaluate the Stieltjes transform

where λi(AN)\lambda_{i}(A_{N}) denote the eigenvalues of ANA_{N}. This function has singularities at the points λi(AN)\lambda_{i}(A_{N}), and it vanishes at infinity. So the largest eigenvalue of ANA_{N} is the largest uu where mAN(u)=m_{A_{N}}(u)=\infty. However, such uu is difficult to compute. So we soften this quantity by considering the largest number uNu_{N} that satisfies

where ϕ\phi is a fixed sensitivity parameter, for example, ϕ=1\phi=1.

This is the same problem as in BSS , except the eigenvalues and hence the soft spectral edge uNu_{N} are now random points. The randomized problem is more difficult as we note below.

As opposed to the largest eigenvalue of AA, the soft spectral edge uNu_{N} can be computed inductively using rank-one updates to the matrix; uNu_{N} will move to the right by a random amount at each step as we replace Ak1A_{k-1} by Ak=Ak1+XkXkTA_{k}=A_{k-1}+X_{k}X_{k}^{T}. Initially, A0A_{0} = 0 so u0=nu_{0}=n. It suffices to prove that the uku_{k} moves by O(1)O(1) on average at each step:

This reduces proving (12) to a probabilistic problem, which is essentially governed by the distribution of the random vector XkX_{k}.

The difficulty is that we are facing a nonlinear inverse problem. Indeed, for a fixed uu it is not difficult to compute the expectation of mAk(u)m_{A_{k}}(u) from (13), and in particular to bound the expectation by ϕ\phi; this is done in BSS . However, we require the identity mAk(u)=ϕm_{A_{k}}(u)=\phi to hold deterministically, because the largest uu that satisfies it defines the soft spectral edge of AkA_{k} as in (11). The task of computing the expectation of a random number uu for which mAk(u)=ϕm_{A_{k}}(u)=\phi is a highly nonlinear inverse problem BN , Section 4.1. This is where some regularity of XkX_{k} with respect to the eigenstructure of Ak1A_{k-1} becomes essential. A technical part of our argument developed in most of the remaining sections is to realize and prove that a small amount or regularity encoded by (SR) or (WR) is already sufficient to control the solution to the inverse problem, and ultimately to control the spectral edges of AA.

10 Organization of the paper

The rest of the paper is organized as follows. We start with the somewhat simpler Theorem 1.5 for the smallest eigenvalue in Section 2. A corresponding result for the largest eigenvalue, Theorem 3.1, is proved in Section 3. Corollary 1.4 is also deduced in Section 3. Combining Theorems 1.5 and 3.1 in Section 4, we obtain the main Theorem 1.1 on the spectral norm. In the Appendix, we prove Proposition 1.3 on the regularity of product distributions.

The lower edge

We begin by proving Theorem 1.5 about the the lower edge of the spectrum, which is slightly simpler and requires fewer assumptions than the upper edge. As in BSS , the tool that we use to do this is the lower Stieltjes transform

where c\mbox\refthmlowerrankone1=10(5C)2/ηc_{\mbox{{\ref{thmlowerrankone}}}}^{-1}=10(5C)^{2/\eta}. Then for every symmetric n×nn\times n matrix AA, one has

Iterating Theorem 2.1 easily yields a proof of Theorem 1.5 as follows. {pf*}Proof of Theorem 1.5 Let A0=0A_{0}=0 and Ak=Ak1+XkXkTA_{k}=A_{k-1}+X_{k}X_{k}^{T} for kNk\leq N. Setting ϕ=c\mbox\refthmlowerrankoneε1+2/η\phi=c_{\mbox{{\ref{thmlowerrankone}}}}\varepsilon^{1+2/\eta}, we find that

Applying Theorem 2.1 inductively to A0,A1,,ANA_{0},A_{1},\ldots,A_{N}, we find that

where we take the conditional expectation with respect to the random vector XkX_{k}, given the random vectors X1,,Xk1X_{1},\ldots,X_{k-1}, that is, given Ak1A_{k-1}. Summing up these bounds yields

For Nn/εϕN\geq n/\varepsilon\phi, the bound becomes 12ε1-2\varepsilon. Substituting the value of ϕ\phi and replacing ε\varepsilon by ε/2\varepsilon/2 gives the promised result.

We begin by reducing the feasibility for a shift δ\delta to an inequality involving two quadratic forms. The following lemma appeared in BSS , and we include it with a proof for completeness.

isTo ease the notation, we sometimes write AuA-u instead of AuIA-uI.

Combining these estimates, we see that (2) holds as long as

which we can rearrange into (3) observing that all quadratic forms involved are positive.

The proof is based on regularity properties of the quadratic forms q1q_{1} and q2q_{2}, which we state in the following two lemmas.

q1(0,x)q1(δ,x)(1δϕ)1q1(0,x)q_{1}(0,x)\leq q_{1}(\delta,x)\leq(1-\delta\phi)^{-1}q_{1}(0,x);

(1δϕ)2q2(0,x)q2(δ,x)(1δϕ)2q(0,x)(1-\delta\phi)^{2}q_{2}(0,x)\leq q_{2}(\delta,x)\leq(1-\delta\phi)^{-2}q(0,x).

i(i) Let (ψi)in(\psi_{i})_{i\leq n} denote the eigenvectors of AA; then

Using these for every term in (4), we complete the proof of (i).

(ii) Similar to (i), noting that the numerator and denominator of q2q_{2} are increasing in δ\delta.

(i) As in the proof of the previous lemma, let (ψi)in(\psi_{i})_{i\leq n} denote the eigenvectors of AA. By isotropy we have

For the moment bound we use Minkowski’s inequality to obtain

We can now finish the proof of Lemma 2.3.

Proof of Lemma 2.3 First observe that by construction,

If either of the indicators in the definition of the shift δ\delta is zero, then δ=0\delta=0, which is trivially feasible, and we are done. So assume both indicators are nonzero, that is, q1(0,x)tq_{1}(0,x)\leq t and q2(0,x)t/ϕq_{2}(0,x)\leq t/\phi. By Lemma 2.2, it suffices to prove inequality (3), which is equivalent to

We can show this by replacing δ\delta with zero using Lemma 2.4:

We now complete the proof of Theorem 2.1 by using the regularity properties of XX to show that the expectation of δ\delta, as defined in Lemma 2.3, is large. Roughly speaking, this happens because (1) δ\delta is defined to be slightly less than q2(0,X)q_{2}(0,X) whenever both q1(0,X)q_{1}(0,X) and q2(0,X)q_{2}(0,X) are not too large; (2) that event occurs with very high probability when ϕ\phi is sufficiently small; (3) the expectation of q2(0,X)q_{2}(0,X) equals 11.

The upper edge

In this section we establish the following estimate for the expected largest eigenvalue, analogous to Theorem 1.5 for the smallest one.

the maximum eigenvalue of the sample covariance matrix ΣN=1Ni=1NXiXiT\Sigma_{N}=\frac{1}{N}\sum_{i=1}^{N}X_{i}X_{i}^{T} satisfies

We shall control the largest eigenvalue of a symmetric matrix AA using the (upper) Stieltjes transform

Similarly to our argument for the lower edge, for a sensitivity value ϕ>0\phi>0, we define the upper soft spectral edge uϕ(A)u_{\phi}(A) to be the largest uu for which

Suppose XX is an isotropic random vector satisfying the strong regularity assumption (SR) for some C,η>0C,\eta>0. Assume ε(0,1)\varepsilon\in(0,1) and

where c\mbox\refupperrankone1=256(8C)1+2/η(6+6/η)1+4/ηc_{\mbox{{\ref{upperrankone}}}}^{-1}=256(8C)^{1+2/\eta}(6+6/\eta)^{1+4/\eta}. Then for every symmetric matrix AA, one has

Iterating Theorem 3.2 yields a proof of Theorem 3.1. {pf*}Proof of Theorem 3.1 The argument is similar to the proof of Theorem 1.5 given in Section 2. We set ϕ=ϕ(ε)=c\mbox\refupperrankoneε1+2/η\phi=\phi(\varepsilon)=c_{\mbox{{\ref{upperrankone}}}}\varepsilon^{1+2/\eta}. Then we start with A0=0A_{0}=0 where uϕ(A0)=n/ϕu_{\phi}(A_{0})=n/\phi, and we inductively apply Theorem 3.2 for Ak=Ak1+XkXkTA_{k}=A_{k-1}+X_{k}X_{k}^{T} to obtain

For Nn/εϕN\geq n/\varepsilon\phi, the bound becomes 1+2ε1+2\varepsilon. Substituting the value of ϕ\phi and replacing ε\varepsilon by ε/2\varepsilon/2 gives the promised result.

The above proof works for ε,ϕ(ε)<1\varepsilon,\phi(\varepsilon)<1 and thus for N=Ω(n)N=\Omega(n), but it may be extended to smaller NN as follows.

Proof of Corollary 1.4 In the proof of Theorem 3.1, we have shown that for every ε(0,1)\varepsilon\in(0,1) and every positive integer NN, we have

where ϕ(ε)=c\mbox\refupperrankoneε1+2/η\phi(\varepsilon)=c_{\mbox{{\ref{upperrankone}}}}\varepsilon^{1+2/\eta}. Optimizing in ε\varepsilon, we apply this estimate with ε=(n/N)1/(2+2/η)\varepsilon=(n/N)^{{1}/({2+2/\eta})} when n<Nn<N and with ε=1/2\varepsilon=1/2 when nNn\geq N to obtain

Combining these, for every nn and NN we conclude that

The definition of the soft spectral edge u=uϕ(A)u=u_{\phi}(A) along with monotonicity of the Stieltjes transform implies that

As in our argument for the lower edge, we begin by reducing the feasibility for a shift δ\delta to an inequality involving two quadratic forms.

Note that AuI(u+Δ)IA\prec uI\prec(u+\Delta)I so that all quadratic forms are positive, and assume x0x\neq 0 since otherwise the claim is trivial. As in the proof of Lemma 2.2, we use the Sherman–Morisson formula to write

Rearranging reveals that mA+xxT(u+Δ)mA(u)\overline{m}_{A+xx^{T}}(u+\Delta)\leq\overline{m}_{A}(u) exactly when (3.3) holds.

for all positive matrices R,SR,S (this can be seen, e.g., using the Courant–Fischer theorem). Applying this fact to (12), we see that it suffices to have

which follows from (3.3) and Q2(Δ,x)>0Q_{2}(\Delta,x)>0.

We will reason about the two quantities Q1Q_{1} and Q2Q_{2} separately, producing two separate shifts Δ1\Delta_{1} and Δ2\Delta_{2} for them and eventually combining these into a single Δ:=Δ1Δ2\Delta:=\Delta_{1}\lor\Delta_{2}, as required by Lemma 3.3.

For some fixed parameter τ(0,1)\tau\in(0,1), let us define Δ1=Δ1(A,x,u)\Delta_{1}=\Delta_{1}(A,x,u) and Δ2=Δ2(A,x,u)\Delta_{2}=\Delta_{2}(A,x,u) to be the smallest nonnegative numbers such which satisfy

For u=uϕ(A)u=u_{\phi}(A) and for a random vector x=Xx=X, Lemmas 3.4 and 3.6 will allow us to control the expected value of each of these shifts, so

whenever the sensitivity parameter ϕ=ϕ(τ,ε)\phi=\phi(\tau,\varepsilon) is sufficiently small. From this we will obtain Theorem 3.2 quickly as follows.

Proof of Theorem 3.2 Let uϕ(A)=uu_{\phi}(A)=u, so the condition AuIA\prec uI of Lemma 3.3 holds. Consider the shifts Δ1=Δ1(A,X,u)\Delta_{1}=\Delta_{1}(A,X,u) and Δ2=Δ2(A,X,u)\Delta_{2}=\Delta_{2}(A,X,u) defined above. By (13), we have

Moreover, a quick inspection of the quadratic forms in Lemma 3.3 shows that Q1(Δ,X)Q_{1}(\Delta,X) and Q2(Δ,X)Q_{2}(\Delta,X) are decreasing in Δ\Delta, and hence

Then Lemma 3.3 guarantees that Δ1Δ2\Delta_{1}\vee\Delta_{2} is a feasible upper shift, which implies by (10) that

Furthermore, (14) yields a bound on the expected shift

which gives conclusion (8) of Theorem 3.2.

It remains to note that Lemmas 3.4 and 3.6 only guarantee that the bounds (14) hold when the sensitivity ϕ\phi is sufficiently small, namely ϕϕ1(τ,ε/2)ϕ2(τ,ε/2)\phi\leq\phi_{1}(\tau,\varepsilon/2)\wedge\phi_{2}(\tau,\varepsilon/2). With τ=ε/16\tau=\varepsilon/16, we can simplify this inequality into the assumption of Theorem 3.2.

The rest of this section is devoted to controlling the shifts Δ1\Delta_{1} and Δ2\Delta_{2}.

It is easy to check that the proofs of Lemmas 3.4 and 3.6 which follow, and consequently Theorem 3.2, only require

then the shift Δ1=Δ1(A,X,u)\Delta_{1}=\Delta_{1}(A,X,u) satisfies

Let (ψi)in(\psi_{i})_{i\leq n} and (λi)in(\lambda_{i})_{i\leq n} denote the eigenvectors and eigenvalues of AA, and let ξi=X,ψi2\xi_{i}=\langle X,\psi_{i}\rangle^{2}. We know that mA(u)=i=1n(uλi)1ϕ\overline{m}_{A}(u)=\sum_{i=1}^{n}(u-\lambda_{i})^{-1}\leq\phi, and Δ1\Delta_{1} is the smallest nonnegative number satisfying

Rescaling everything by ϕ\phi and setting μi:=ϕ(uλi)\mu_{i}:=\phi(u-\lambda_{i}) so that

the problem becomes equivalent to bounding the least μ:=ϕΔ1\mu:=\phi\Delta_{1} for which

Applying the following, somewhat more general, probabilistic lemma to (ξi)in(\xi_{i})_{i\leq n}, we conclude that

Substituting ϕ=ϕ1(τ,ε)\phi=\phi_{1}(\tau,\varepsilon) gives the promised bound.

for all subsets S[n]S\subset[n] and some constants C,η>0C,\eta>0. Consider positive numbers μi\mu_{i} such that

Let μ\mu be the minimal positive number such that

For simplicity of calculations, assume for the moment that the values of all μi\mu_{i} are dyadic, that is,

and μ\mu is the smallest positive number such that

We estimate μ\mu by replacing it with a bigger but easier quantity μ\mu^{\prime}. Define μ\mu^{\prime} to be the smallest positive number such that, for every dyadic kk, one has

the definition of μ\mu given in (17) yields

Let θk=1εkiIkξik\theta_{k}=\frac{1}{\varepsilon_{k}}\sum_{i\in I_{k}}\xi_{i}-k. For every t0t\geq 0, one has

Since εkKnk2k\varepsilon_{k}\geq\frac{Kn_{k}}{2k} by definition, we have

The promised bound for general (nondyadic) μi\mu_{i} follows by rounding each μi\mu_{i} down to the nearest power of 22 and replacing KK by K/2K/2.

[[Necessity of the strong regularity assumption (SR)]] The preceding lemma is the only place in the proof where the full power of (SR) is used. To see that it is necessary, consider the following situation. Fix any S[n]S\subset[n], and let 1μi=1{iS}S\frac{1}{\mu_{i}}={\mathbf{1}}_{\{i\in S\}}|S| so that i1μi=1\sum_{i}\frac{1}{\mu_{i}}=1. Then the smallest μ0\mu\geq 0 for which i1μi+μK\sum_{i}\frac{1}{\mu_{i}+\mu}\leq K is just

then the shift Δ2=Δ2(A,X,u)\Delta_{2}=\Delta_{2}(A,X,u) satisfies

It will be more convenient to work with the quadratic form

The reason for working with Q2Q_{2} rather than directly with Q2Q_{2}^{\prime} in Lemma 3.3 is that Q2(Δ,x)Q_{2}(\Delta,x) is decreasing in Δ\Delta; this monotonicity is required when arguing that the maximum of the two shifts Δ=Δ1Δ2\Delta=\Delta_{1}\lor\Delta_{2} is feasible in the proof of Theorem 3.2.

We begin by recording some regularity properties of Q2(Δ,X)Q_{2}^{\prime}(\Delta,X).

Q2(Δ,X)(1+ϕΔ)2Q2(0,X)Q_{2}^{\prime}(\Delta,X)\leq(1+\phi\Delta)^{2}Q_{2}^{\prime}(0,X);

(i) is analogous to Lemma 2.4. In a similar way, we show that all eigenvalues λi\lambda_{i} of AA satisfy uλi1/ϕu-\lambda_{i}\geq 1/\phi, which implies the comparison inequality

Denoting (ψi)in(\psi_{i})_{i\leq n} the eigenvectors of AA, we express

i(ii) We note that (20) can be rearranged as a convex combination of X,ψi2\langle X,\psi_{i}\rangle^{2}.

(iii) We apply Minkowski’s inequality to obtain

Now a simple integration of tails implies that each

which concludes the proof. Next, we see how the regularity properties of Q2(Δ,X)Q_{2}^{\prime}(\Delta,X) translate into the corresponding properties of Δ2\Delta_{2}:

(i) By definition of Δ2\Delta_{2} and using (19), we have for all t>0t>0,

This probability can be controlled using Lemma 3.7(iii) and Markov’s inequality, so we obtain

as τ<1/2\tau<1/2. Integration of tails yields

(ii) Let s0s_{0} denote the smaller solution of the quadratic equation

whenever a solution exists. In this case s0>0s_{0}>0 and Lemma 3.7(i) yields that

By (19), this yields Q2(s0,X)s0(1τ)Q_{2}(s_{0},X)\leq s_{0}(1-\tau). By definition of Δ2\Delta_{2}, this in turn implies that

An elementary calculation shows that if Q2(0,X)(t2τ)/8ϕQ_{2}^{\prime}(0,X)\leq(t-2\tau)/8\phi, then the solution s0s_{0} exists and satisfies

where we used Lemma 3.7(i) in the last step.

We can now complete the proof of Lemma 3.6.

By Lemma 3.8(ii), we have E11+tE_{1}\leq 1+t. Next, we estimate E2E_{2} using Hölder’s inequality,

The two terms here can be estimated using Lemma 3.8(i) and Lemma 3.7 along with Markov’s inequality,

Finally, we set t=ε/2t=\varepsilon/2 and use the assumptions ϕϕ2(τ,ε)\phi\leq\phi_{2}(\tau,\varepsilon) and τ<ε/2\tau<\varepsilon/2 to conclude that E2ε/2E_{2}\leq\varepsilon/2. Together with E11+t=1+ε/2E_{1}\leq 1+t=1+\varepsilon/2 this implies

Although for convenience of application Lemma 3.6 is stated under the strong regularity assumption (SR), the latter is not used in the proof. The argument above uses only the weak regularity assumption (WR).

The spectral norm

In this section we prove Theorem 1.1 by showing that whenever X1,,XNX_{1},\ldots,X_{N} are independent and satisfy (SR), the spectral norm estimate

obtained in Theorems 1.5 and 3.1. The basic idea is to show using independence that

is concentrated near its expectation of 11. Combining this with

which follows immediately from (22), yields (21).

We rely on the following elementary proposition regarding sums of independent random variables.

Postponing the proof of Proposition 4.1, we use this fact to control

Proof of Theorem 1.1 Assume the random vectors XiX_{i} are isotropic and satisfy (SR) with parameters C,ηC,\eta. This implies that the random variables

satisfy the requirements of Proposition 4.1 with parameters C1+η,ηC^{1+\eta},\eta. It follows that

Replacing ε\varepsilon by ε/3\varepsilon/3 and taking

always satisfies (26). This completes the proof of the theorem.

Proof of Proposition 4.1 Fix a parameter K>0K>0, and decompose

By Jensen’s inequality, independence and the bound on ZiZ_{i}^{\prime}, we have

Moreover, by triangle and Jensen’s inequalities,

Choosing K=(ε/2)NK=(\varepsilon/2)\sqrt{N} and using the assumption on NN, one easily checks that

Appendix: Proof of Proposition 1.3

In this section we prove Proposition 1.3, which states that product distributions satisfy the regularity assumption in Theorem 1.1. Note that this result and its proof are not needed in the proof of Theorem 1.1.

The contribution of the diagonal of PP to this sum is

Denote by P0P_{0} the matrix PP with diagonal removed; then

This inequality can be obtained from general decoupling results; see dlPG , Theorem 3.1.1; a simple and well-known proof of (2) is given in Vdecoupling .

Therefore, by conditioning on XX^{\prime} we obtain from (2) that

Since P0P_{0} equals PP without the diagonal, the triangle inequality yields

Since 0<PiiP10<P_{ii}\leq\|P\|\leq 1, we can replace Pii2P_{ii}^{2} by PiiP_{ii}, so

Putting (1), (3) and (4) together, we arrive at the inequality

Put in different words, the random variable Z:=PX22DZ:=\|PX\|_{2}^{2}-D satisfies the inequality

Solving this quadratic inequality we obtain that

In order to bound DLp\|D\|_{L_{p}} we consider

Finally, by definition of ZZ and using the triangle inequality and bounds (7), (6), we conclude that

Acknowledgments

The authors are grateful to the referees whose comments improved the presentation of the paper.

References