Covariance estimation for distributions with $2+\varepsilon$ moments

Nikhil Srivastava, Roman Vershynin

Introduction

Our goal is to estimate $\Sigma$ from a sample $X_{1},\ldots,X_{N}$ taken from the same distribution as $X$. A classical unbiased estimator for $\Sigma$ is the sample covariance matrix

$$\Sigma_{N}=\frac{1}{N}\sum_{i=1}^{N}X_{i}X_{i}^{T}.$$

A basic question is to determine the minimal sample size $N$ which guarantees that $\Sigma$ is accurately estimated by $\Sigma_{N}$ . More precisely, for a given accuracy $\varepsilon>0$ , we are interested in the minimal $N=N(n,\varepsilon)$ so that

where $\|\cdot\|$ denotes the spectral (operator) norm. Replacing $X$ by $\Sigma^{-1/2}X$ and $X_{i}$ by $\Sigma^{-1/2}X_{i}$ , we reduce the problem to the distributions for which $\Sigma=I$ , that is, to isotropic distributions.
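To spell out why this substitution yields an isotropic vector, here is a short check. It assumes, as the setup suggests, that $X$ is centered with invertible covariance $\Sigma=\mathbb{E}XX^{T}$ (the symbol $\mathbb{E}$ for expectation and the name $Z$ are used here for illustration). For $Z=\Sigma^{-1/2}X$ one has

$$\mathbb{E}ZZ^{T}=\Sigma^{-1/2}\bigl(\mathbb{E}XX^{T}\bigr)\Sigma^{-1/2}=\Sigma^{-1/2}\,\Sigma\,\Sigma^{-1/2}=I,$$

so the covariance matrix of $Z$ is the identity.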

1.2 Sampling from isotropic distributions

For obvious dimensional reasons, one must have $N\geq n$. Rudelson’s remarkably general result (R , see Vtutorial , Section 4.3) yields that if $\|X\|_{2}=O(\sqrt{n})$ almost surely, then

$$N=O(n\log n),\tag{1}$$

where the $O(\cdot)$ notation hides the dependence on $\varepsilon$ here and thereafter. It is well known that the logarithmic oversampling factor cannot be removed from (1) in general, for example, if the distribution is supported on $O(n)$ points; see Section 1.8.

holds for every distribution that satisfies

This result can be obtained by a standard covering argument; see Vtutorial , Section 4.3.

It is an open problem to describe the distributions for which the logarithmic oversampling is not needed, that is, for which $N=O(n)$ . The gap between sub-Gaussian distributions where this bound holds and discrete distributions on $O(n)$ points where it fails is quite large.

It is already a difficult problem to relax the sub-Gaussian moment assumption (2) to anything weaker while keeping $N=O(n)$ . A major step was made by Adamczak et al. ALPT , who showed that $N=O(n)$ still holds (in fact, with high probability) under the sub-exponential moment assumptions

The second author of the present paper speculated in Vcovariance that $N=O(n)$ should hold for a much wider class of distributions than sub-exponential, perhaps for all distributions with $2+\varepsilon$ moments. (The second moment, that is, the variance, is assumed to be finite by the nature of the problem, as otherwise the covariance matrix is not defined.) The goal of the current paper is to provide a result of this type.

1.3 Covariance estimation

Returning to the covariance estimation problem, we deduce the following.

the sample covariance matrix $\Sigma_{N}$ obtained from $N$ independent copies of $X$ satisfies

This result follows by applying Theorem 1.1 for the independent copies of the random vectors $Z_{i}=\Sigma^{-1/2}X_{i}$ instead of $X_{i}$ , and by multiplying the matrix $\frac{1}{N}\sum_{i=1}^{N}X_{i}X_{i}^{T}-I$ in (4) by $\Sigma^{1/2}$ on the left and on the right. Thus, for distributions satisfying (SR) we conclude that the minimal sample size for the covariance estimation is $N=O(n)$ .
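The multiplication step can be written out as follows. This is a sketch in the notation above, with $Z_{i}=\Sigma^{-1/2}X_{i}$, and it uses only the submultiplicativity of the spectral norm together with $\|\Sigma^{1/2}\|^{2}=\|\Sigma\|$:

$$\Sigma_{N}-\Sigma=\Sigma^{1/2}\Bigl(\frac{1}{N}\sum_{i=1}^{N}Z_{i}Z_{i}^{T}-I\Bigr)\Sigma^{1/2},\qquad\|\Sigma_{N}-\Sigma\|\leq\|\Sigma\|\,\Bigl\|\frac{1}{N}\sum_{i=1}^{N}Z_{i}Z_{i}^{T}-I\Bigr\|.$$

Taking expectations, a bound of $\varepsilon$ for the isotropic vectors $Z_{i}$ therefore turns into a bound of $\varepsilon\|\Sigma\|$ for the original sample.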

Let us illustrate these results with two important examples.

1.4 Sampling from log-concave distributions and convex sets

where $C,c>0$ are absolute constants. This is obviously stronger than assumption (SR), so Corollary 1.2 applies.

We conclude that the minimal sample size for estimating the covariance matrix of a log-concave distribution is $N=O(n)$ . This matches the bound obtained by Adamczak et al. ALPT , though it should be noted that the guarantee of ALPT holds with probability that converges to $1$ exponentially fast as $n\to\infty$ , whereas ours holds only in expectation. We have not tried to obtain probability bounds of this type; note, however, that under our general assumption (SR), the probability cannot converge to $1$ faster than at a polynomial rate in $n$ .

1.5 Sampling from product distributions

A distribution does not have to be log-concave in order to satisfy the regularity assumptions in Theorem 1.1 and Corollary 1.2. For example, all product distributions with finite $4+\varepsilon$ moments have the required regularity property. We can deduce this from the following thin shell estimate:

The factor implicit in (5) depends only on $p$ and on the bound on the $(2p)$ th moments.

The proof of Proposition 1.3 is given in the Appendix.

Applying Chebyshev’s inequality together with (5), we obtain for $t\geq k$ that

Thus for $p>2$ we get a sub-linear tail, as required in the regularity assumption (SR).

1.6 Extreme eigenvalues

Theorem 1.1 states that, for sufficiently large $N$ , all eigenvalues of the sample covariance matrix $\Sigma_{N}=\frac{1}{N}\sum_{i=1}^{N}X_{i}X_{i}^{T}$ are concentrated near $1$ . It is easy to extend this to a result that holds for all $N$ , as follows.

Here $c=\frac{\eta}{2\eta+2}$ , $C_{1}=512(16C)^{1+2/\eta}(6+6/\eta)^{1+4/\eta}$ and $\lambda_{\min}(\Sigma_{N})$ , $\lambda_{\max}(\Sigma_{N})$ denote the smallest and the largest eigenvalues of $\Sigma_{N}$ , respectively.

We deduce this result in Section 3. One can view (6) as a nonasymptotic form of the Bai–Yin law for the extreme eigenvalues of sample covariance matrices BY . This law, associated with the works of Geman, Bai, Yin, Krishnaiah and Silverstein, applies to product distributions, specifically to random vectors $X=(\xi_{1},\ldots,\xi_{n})$ with i.i.d. components $\xi_{i}$ with zero mean, unit variance and finite fourth moment. For such distributions one has asymptotically almost surely that

as $n\to\infty$ and $n/N\to y\in[0,1)$ ; see the rigorous statement in BY . This limit law is sharp. On the other hand, inequalities (6) hold in any fixed dimensions $N,n$ and for general distributions (as in Theorem 1.1), without any independence requirements for the coordinates.

Comparing (6) with (7) one can ask about the optimal value of the exponent $c$ , in particular whether $c=1/2$ . In a recent paper ALPTsharp , Adamczak et al. obtained the optimal exponent $c=1/2$ for log-concave distributions, and more generally for sub-exponential distributions in the sense of (1.2). As (1.2) implies (SR) with $\eta=(p-1)/2$ and $C\leq(O(p))^{p}$ , Theorem 1.1 recovers a bound of $c=1/2-1/(p+1)=1/2-o(1)$ as $p\to\infty$ .

Remark (Random matrices with independent rows). Corollary 1.4 can be interpreted as a result about the spectrum of random matrices with independent rows. Indeed, if $A$ is the matrix with rows $X_{i}$, then $\Sigma_{N}=\frac{1}{N}\sum_{i=1}^{N}X_{i}X_{i}^{T}=\frac{1}{N}A^{T}A$. So the squared singular values of the matrix $\frac{1}{\sqrt{N}}A$ are the same as the eigenvalues of the matrix $\Sigma_{N}$, and they are controlled as in (6). In particular, under the regularity assumption (SR) on $X_{i}$ we obtain that

where $C_{2}=\sqrt{2C_{1}}$ , and $C_{1}$ is as in Corollary 1.4.
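The identity $\Sigma_{N}=\frac{1}{N}A^{T}A$ used in this remark is easy to confirm on a small example. The following is a minimal pure-Python sketch; the function names `sample_covariance` and `gram_over_n` are hypothetical and chosen only for illustration.

```python
def sample_covariance(rows):
    """Sigma_N = (1/N) * sum_i X_i X_i^T, built from rank-one outer products."""
    N, n = len(rows), len(rows[0])
    S = [[0.0] * n for _ in range(n)]
    for x in rows:
        for i in range(n):
            for j in range(n):
                S[i][j] += x[i] * x[j] / N
    return S


def gram_over_n(rows):
    """(1/N) * A^T A, where A is the N x n matrix whose rows are the X_i."""
    N, n = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / N for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # three sample points in R^2
    print(sample_covariance(A))
    print(gram_over_n(A))
```

Both functions return the same matrix up to rounding, which is why spectral statements about $\Sigma_{N}$ translate directly into statements about the singular values of $A$.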

1.7 Smallest eigenvalue

Our proof of Theorem 1.1 consists of two separate arguments for upper and lower bounds for the spectrum of the sample covariance matrix. It turns out that the full power of the strong regularity assumption (SR) is not needed for the lower bound. It suffices to assume $2+\eta$ moments for one-dimensional marginals rather than for marginals in all dimensions. This is only slightly stronger than the isotropy assumption, which fixes the second moments of one-dimensional marginals, and it broadens the class of distributions for which the result applies. We state this as a separate theorem.

the minimum eigenvalue of the sample covariance matrix $\Sigma_{N}=\frac{1}{N}\sum_{i=1}^{N}X_{i}X_{i}^{T}$ satisfies

Remark (Moments vs. tails). We have chosen to write (WR) in terms of moments rather than in terms of tail bounds as in (SR). By integration of the tails one can check that, for any given $\eta>0$, (SR) with parameter $C$ implies (WR) with parameter $C^{\prime}=C(2+2/\eta)$.

In the remainder of the paper we will use (WR) for theorems regarding only the smallest eigenvalue and (SR) for theorems which involve the largest one.

Remark (Product distributions with $2+\eta$ moments). Many distributions of interest satisfy (WR). For example, let $X=(\xi_{1},\ldots,\xi_{n})$ have i.i.d. components $\xi_{i}$ with zero mean, unit variance and finite $(2+\eta)$th moment. Then a standard application of symmetrization and Khintchine’s inequality (or a direct application of Rosenthal’s inequality Ros , see FHJSZ ) shows that one-dimensional marginals of $X$ also have bounded $(2+\eta)$th moments; that is, (WR) holds.

In the context of the Bai–Yin law discussed in Section 1.6, this indicates that the smallest eigenvalue of a random matrix can be approximately controlled [as in (6)] even if the fourth moment is infinite. However, as we already recalled, four moments are necessary to control the largest eigenvalue in the classical Bai–Yin law BSY .

Remark (Covariance estimation). Theorem 1.5 can be used to obtain a lower estimate for the covariance matrix under the weak regularity assumption (WR).

1.8 Optimality of the regularity assumptions

Let us briefly mention two simple and known examples that illustrate the role of regularity assumptions (SR) and (WR) in the control of the largest and smallest eigenvalues, respectively.

For the largest eigenvalue as in Theorem 1.1, it is not sufficient to put a regularity assumption of the type (SR) only on one-dimensional marginals, as it is done in Theorem 1.5 for the smallest eigenvalue. Even the following very strong (exponential) moment assumption is insufficient:

which contradicts the conclusion of Theorem 1.1. This example is essentially due to Aubrun; see ALPT , Remark 4.9.

It is not clear whether Theorem 1.1 would hold if, in addition to $(2+\eta)$ moments on one-dimensional marginals, one puts a total boundedness assumption

A conjecture of this type is discussed in Vcovariance where a version of the theorem is proved under this assumption, with $\eta=2$ but with an additional $(\log\log n)^{O(1)}$ oversampling factor.

1.9 The argument: Randomizing the spectral sparsifier

Our proof of Theorem 1.1 consists of randomizing the spectral sparsifier invented by Batson, Spielman and Srivastava BSS ; see SPhD . The randomization makes the spectral sparsifier appear naturally in the context of random matrix theory. The method is based on evaluating the Stieltjes transform of $\Sigma_{N}$ while making rank one updates. However, in contrast to typical methods of random matrix theory (and to the spectral sparsifier itself), we shall evaluate the Stieltjes transform at random real points.

Let us illustrate the method by working out a crude upper bound $O(1)$ for the largest eigenvalue of $\Sigma_{N}$. Equivalently, we want to show that a general Wishart matrix $A_{N}:=N\Sigma_{N}=\sum_{i=1}^{N}X_{i}X_{i}^{T}$ has all eigenvalues bounded by $O(N)$. We evaluate the Stieltjes transform

$$m_{A_{N}}(u)=\operatorname{tr}(uI-A_{N})^{-1}=\sum_{i=1}^{n}\frac{1}{u-\lambda_{i}(A_{N})},$$

where $\lambda_{i}(A_{N})$ denote the eigenvalues of $A_{N}$. This function has singularities at the points $\lambda_{i}(A_{N})$, and it vanishes at infinity. So the largest eigenvalue of $A_{N}$ is the largest $u$ where $m_{A_{N}}(u)=\infty$. However, such $u$ is difficult to compute. So we soften this quantity by considering the largest number $u_{N}$ that satisfies

$$m_{A_{N}}(u_{N})=\phi,\tag{11}$$

where $\phi$ is a fixed sensitivity parameter, for example, $\phi=1$ .
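To make the soft spectral edge concrete, the sketch below computes it by bisection from a list of eigenvalues. It assumes the Stieltjes transform has the form $m_{A}(u)=\sum_{i}(u-\lambda_{i})^{-1}$ for $u$ above the spectrum (the form that appears later for the upper edge); the function names and the bisection approach are illustrative choices, not part of the argument.

```python
def stieltjes(eigs, u):
    """m_A(u) = sum_i 1 / (u - lambda_i), valid for u above every eigenvalue."""
    return sum(1.0 / (u - lam) for lam in eigs)


def soft_spectral_edge(eigs, phi=1.0, iters=200):
    """Largest u with m_A(u) = phi, found by bisection.

    On (lambda_max, infinity) the transform decreases from +infinity to 0,
    so the solution is unique there. Each term is at most 1/(u - lambda_max),
    hence m_A(lambda_max + n/phi) <= phi, which gives a valid upper bracket.
    """
    lo = max(eigs)                 # m_A blows up here
    hi = lo + len(eigs) / phi      # m_A(hi) <= phi here
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid <= lo or mid >= hi:  # floating-point resolution reached
            break
        if stieltjes(eigs, mid) > phi:
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    # Zero matrix in dimension 5: m(u) = 5/u, so the edge sits at 5/phi.
    print(soft_spectral_edge([0.0] * 5, phi=1.0))
    # The soft edge always lies strictly above the true largest eigenvalue.
    print(soft_spectral_edge([1.0, 2.0, 3.0], phi=1.0))
```

A smaller $\phi$ pushes the soft edge further from the spectrum, which is the sense in which $\phi$ acts as a sensitivity parameter.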

This is the same problem as in BSS , except the eigenvalues and hence the soft spectral edge $u_{N}$ are now random points. The randomized problem is more difficult as we note below.

As opposed to the largest eigenvalue of $A$, the soft spectral edge $u_{N}$ can be computed inductively using rank-one updates to the matrix; $u_{N}$ will move to the right by a random amount at each step as we replace $A_{k-1}$ by $A_{k}=A_{k-1}+X_{k}X_{k}^{T}$. Initially, $A_{0}=0$ so $u_{0}=n/\phi$ (that is, $u_{0}=n$ for $\phi=1$). It suffices to prove that $u_{k}$ moves by $O(1)$ on average at each step:

This reduces proving (12) to a probabilistic problem, which is essentially governed by the distribution of the random vector $X_{k}$ .

The difficulty is that we are facing a nonlinear inverse problem. Indeed, for a fixed $u$ it is not difficult to compute the expectation of $m_{A_{k}}(u)$ from (13), and in particular to bound the expectation by $\phi$; this is done in BSS . However, we require the identity $m_{A_{k}}(u)=\phi$ to hold deterministically, because the largest $u$ that satisfies it defines the soft spectral edge of $A_{k}$ as in (11). The task of computing the expectation of a random number $u$ for which $m_{A_{k}}(u)=\phi$ is a highly nonlinear inverse problem BN , Section 4.1. This is where some regularity of $X_{k}$ with respect to the eigenstructure of $A_{k-1}$ becomes essential. A technical part of our argument, developed in most of the remaining sections, is to realize and prove that a small amount of regularity encoded by (SR) or (WR) is already sufficient to control the solution to the inverse problem, and ultimately to control the spectral edges of $A$.

1.10 Organization of the paper

The rest of the paper is organized as follows. We start with the somewhat simpler Theorem 1.5 for the smallest eigenvalue in Section 2. A corresponding result for the largest eigenvalue, Theorem 3.1, is proved in Section 3. Corollary 1.4 is also deduced in Section 3. Combining Theorems 1.5 and 3.1 in Section 4, we obtain the main Theorem 1.1 on the spectral norm. In the Appendix, we prove Proposition 1.3 on the regularity of product distributions.

The lower edge

We begin by proving Theorem 1.5 about the lower edge of the spectrum, which is slightly simpler and requires fewer assumptions than the upper edge. As in BSS , the tool that we use to do this is the lower Stieltjes transform

where $c_{2.1}^{-1}=10(5C)^{2/\eta}$. Then for every symmetric $n\times n$ matrix $A$, one has

Iterating Theorem 2.1 easily yields a proof of Theorem 1.5 as follows.

Proof of Theorem 1.5 Let $A_{0}=0$ and $A_{k}=A_{k-1}+X_{k}X_{k}^{T}$ for $k\leq N$. Setting $\phi=c_{2.1}\varepsilon^{1+2/\eta}$, we find that

Applying Theorem 2.1 inductively to $A_{0},A_{1},\ldots,A_{N}$ , we find that

where we take the conditional expectation with respect to the random vector $X_{k}$ , given the random vectors $X_{1},\ldots,X_{k-1}$ , that is, given $A_{k-1}$ . Summing up these bounds yields

For $N\geq n/(\varepsilon\phi)$, the bound becomes $1-2\varepsilon$. Substituting the value of $\phi$ and replacing $\varepsilon$ by $\varepsilon/2$ gives the promised result.
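The sample size appearing in this proof can be evaluated numerically. The sketch below uses only the quantities stated above, namely $\phi=c\,\varepsilon^{1+2/\eta}$ with $c^{-1}=10(5C)^{2/\eta}$ and the threshold $N\geq n/(\varepsilon\phi)$; the function name is hypothetical, and the returned value is the threshold before the final substitution of $\varepsilon/2$ for $\varepsilon$.

```python
import math


def lower_edge_sample_size(n, eps, C, eta):
    """Smallest integer N with N >= n / (eps * phi).

    Here phi = eps**(1 + 2/eta) / c_inv and c_inv = 10 * (5C)**(2/eta),
    so the threshold equals n * c_inv * eps**(-(2 + 2/eta)).
    """
    if not (0.0 < eps < 1.0) or C <= 0 or eta <= 0:
        raise ValueError("need 0 < eps < 1, C > 0 and eta > 0")
    c_inv = 10.0 * (5.0 * C) ** (2.0 / eta)
    return math.ceil(n * c_inv / eps ** (2.0 + 2.0 / eta))


if __name__ == "__main__":
    # The threshold is linear in n and grows like eps^(-2 - 2/eta).
    for eps in (0.5, 0.25, 0.125):
        print(eps, lower_edge_sample_size(100, eps, C=1.0, eta=2.0))
```

The output makes the two features of the bound visible: proportionality to the dimension $n$, and a polynomial dependence on the accuracy that worsens as $\eta$ decreases.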

We begin by reducing the feasibility for a shift $\delta$ to an inequality involving two quadratic forms. The following lemma appeared in BSS , and we include it with a proof for completeness.

To ease the notation, we sometimes write $A-u$ instead of $A-uI$.

Combining these estimates, we see that (2) holds as long as

which we can rearrange into (3) observing that all quadratic forms involved are positive.

The proof is based on regularity properties of the quadratic forms $q_{1}$ and $q_{2}$ , which we state in the following two lemmas.

$q_{1}(0,x)\leq q_{1}(\delta,x)\leq(1-\delta\phi)^{-1}q_{1}(0,x)$ ;

$(1-\delta\phi)^{2}q_{2}(0,x)\leq q_{2}(\delta,x)\leq(1-\delta\phi)^{-2}q_{2}(0,x)$ .

(i) Let $(\psi_{i})_{i\leq n}$ denote the eigenvectors of $A$ ; then

Using these for every term in (4), we complete the proof of (i).

(ii) Similar to (i), noting that the numerator and denominator of $q_{2}$ are increasing in $\delta$ .

(i) As in the proof of the previous lemma, let $(\psi_{i})_{i\leq n}$ denote the eigenvectors of $A$ . By isotropy we have

For the moment bound we use Minkowski’s inequality to obtain

We can now finish the proof of Lemma 2.3.

Proof of Lemma 2.3 First observe that by construction,

If either of the indicators in the definition of the shift $\delta$ is zero, then $\delta=0$ , which is trivially feasible, and we are done. So assume both indicators are nonzero, that is, $q_{1}(0,x)\leq t$ and $q_{2}(0,x)\leq t/\phi$ . By Lemma 2.2, it suffices to prove inequality (3), which is equivalent to

We can show this by replacing $\delta$ with zero using Lemma 2.4:

We now complete the proof of Theorem 2.1 by using the regularity properties of $X$ to show that the expectation of $\delta$ , as defined in Lemma 2.3, is large. Roughly speaking, this happens because (1) $\delta$ is defined to be slightly less than $q_{2}(0,X)$ whenever both $q_{1}(0,X)$ and $q_{2}(0,X)$ are not too large; (2) that event occurs with very high probability when $\phi$ is sufficiently small; (3) the expectation of $q_{2}(0,X)$ equals $1$ .

The upper edge

In this section we establish the following estimate for the expected largest eigenvalue, analogous to Theorem 1.5 for the smallest one.

the maximum eigenvalue of the sample covariance matrix $\Sigma_{N}=\frac{1}{N}\sum_{i=1}^{N}X_{i}X_{i}^{T}$ satisfies

We shall control the largest eigenvalue of a symmetric matrix $A$ using the (upper) Stieltjes transform

$$\overline{m}_{A}(u)=\operatorname{tr}(uI-A)^{-1}=\sum_{i=1}^{n}\frac{1}{u-\lambda_{i}(A)}.$$

Similarly to our argument for the lower edge, for a sensitivity value $\phi>0$ , we define the upper soft spectral edge $u_{\phi}(A)$ to be the largest $u$ for which

Suppose $X$ is an isotropic random vector satisfying the strong regularity assumption (SR) for some $C,\eta>0$ . Assume $\varepsilon\in(0,1)$ and

where $c_{3.2}^{-1}=256(8C)^{1+2/\eta}(6+6/\eta)^{1+4/\eta}$. Then for every symmetric matrix $A$, one has

Iterating Theorem 3.2 yields a proof of Theorem 3.1.

Proof of Theorem 3.1 The argument is similar to the proof of Theorem 1.5 given in Section 2. We set $\phi=\phi(\varepsilon)=c_{3.2}\varepsilon^{1+2/\eta}$. Then we start with $A_{0}=0$ where $u_{\phi}(A_{0})=n/\phi$, and we inductively apply Theorem 3.2 for $A_{k}=A_{k-1}+X_{k}X_{k}^{T}$ to obtain

For $N\geq n/(\varepsilon\phi)$, the bound becomes $1+2\varepsilon$. Substituting the value of $\phi$ and replacing $\varepsilon$ by $\varepsilon/2$ gives the promised result.
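The arithmetic behind this iteration can be laid out as follows. This is a sketch which assumes that Theorem 3.2 bounds the conditional expected shift of the soft edge by $1+\varepsilon$ at each step (an assumption consistent with the final bound $1+2\varepsilon$), and which writes $\mathbb{E}$ for expectation. Starting from $u_{\phi}(A_{0})=n/\phi$ and using that the soft edge dominates the largest eigenvalue,

$$\mathbb{E}\lambda_{\max}(A_{N})\leq\mathbb{E}u_{\phi}(A_{N})\leq\frac{n}{\phi}+N(1+\varepsilon),\qquad\text{hence}\qquad\mathbb{E}\lambda_{\max}(\Sigma_{N})\leq1+\varepsilon+\frac{n}{\phi N}.$$

The last term is at most $\varepsilon$ exactly when $N\geq n/(\varepsilon\phi)$.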

The above proof works for $\varepsilon,\phi(\varepsilon)<1$ and thus for $N=\Omega(n)$ , but it may be extended to smaller $N$ as follows.

Proof of Corollary 1.4 In the proof of Theorem 3.1, we have shown that for every $\varepsilon\in(0,1)$ and every positive integer $N$ , we have

where $\phi(\varepsilon)=c_{3.2}\varepsilon^{1+2/\eta}$. Optimizing in $\varepsilon$, we apply this estimate with $\varepsilon=(n/N)^{1/(2+2/\eta)}$ when $n<N$ and with $\varepsilon=1/2$ when $n\geq N$ to obtain

Combining these, for every $n$ and $N$ we conclude that

The definition of the soft spectral edge $u=u_{\phi}(A)$ along with monotonicity of the Stieltjes transform implies that

As in our argument for the lower edge, we begin by reducing the feasibility for a shift $\delta$ to an inequality involving two quadratic forms.

Note that $A\prec uI\prec(u+\Delta)I$ so that all quadratic forms are positive, and assume $x\neq 0$ since otherwise the claim is trivial. As in the proof of Lemma 2.2, we use the Sherman–Morrison formula to write

Rearranging reveals that $\overline{m}_{A+xx^{T}}(u+\Delta)\leq\overline{m}_{A}(u)$ exactly when (3.3) holds.
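The Sherman–Morrison step can be checked numerically. The sketch below assumes the upper Stieltjes transform has the form $\overline{m}_{A}(u)=\sum_{i}(u-\lambda_{i})^{-1}$ and takes $A$ diagonal for simplicity. The helper names and the unnormalized quadratic forms `q1`, `q2` are illustrative; they need not coincide with $Q_{1}$, $Q_{2}$ as normalized in Lemma 3.3.

```python
def upper_stieltjes(eigs, u):
    """tr((uI - A)^{-1}) for A = diag(eigs), with u above all eigenvalues."""
    return sum(1.0 / (u - lam) for lam in eigs)


def updated_stieltjes(eigs, x, u):
    """tr((uI - A - x x^T)^{-1}) for A = diag(eigs), via Sherman-Morrison.

    (M - x x^T)^{-1} = M^{-1} + M^{-1} x x^T M^{-1} / (1 - x^T M^{-1} x)
    with M = uI - A; taking traces gives m_A(u) + q2 / (1 - q1).
    """
    q1 = sum(xi * xi / (u - lam) for lam, xi in zip(eigs, x))       # x^T M^{-1} x
    q2 = sum(xi * xi / (u - lam) ** 2 for lam, xi in zip(eigs, x))  # x^T M^{-2} x
    if q1 >= 1.0:
        raise ValueError("u is not above the spectrum of A + x x^T")
    return upper_stieltjes(eigs, u) + q2 / (1.0 - q1)


def is_feasible_shift(eigs, x, u, delta):
    """True if the transform at u + delta after the update is at most m_A(u)."""
    return updated_stieltjes(eigs, x, u + delta) <= upper_stieltjes(eigs, u)


if __name__ == "__main__":
    eigs, x, u = [1.0, 2.0], [0.5, -1.0], 6.0
    print(upper_stieltjes(eigs, u), updated_stieltjes(eigs, x, u))
    print(is_feasible_shift(eigs, x, u, 0.0), is_feasible_shift(eigs, x, u, 2.0))
```

In this toy instance the rank-one update raises the transform at the old barrier, so a zero shift is not feasible, while a sufficiently large shift restores the inequality.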

for all positive matrices $R,S$ (this can be seen, e.g., using the Courant–Fischer theorem). Applying this fact to (12), we see that it suffices to have

which follows from (3.3) and $Q_{2}(\Delta,x)>0$ .

We will reason about the two quantities $Q_{1}$ and $Q_{2}$ separately, producing two separate shifts $\Delta_{1}$ and $\Delta_{2}$ for them and eventually combining these into a single $\Delta:=\Delta_{1}\lor\Delta_{2}$ , as required by Lemma 3.3.

For some fixed parameter $\tau\in(0,1)$, let us define $\Delta_{1}=\Delta_{1}(A,x,u)$ and $\Delta_{2}=\Delta_{2}(A,x,u)$ to be the smallest nonnegative numbers that satisfy

For $u=u_{\phi}(A)$ and for a random vector $x=X$ , Lemmas 3.4 and 3.6 will allow us to control the expected value of each of these shifts, so

whenever the sensitivity parameter $\phi=\phi(\tau,\varepsilon)$ is sufficiently small. From this we will obtain Theorem 3.2 quickly as follows.

Proof of Theorem 3.2 Let $u_{\phi}(A)=u$ , so the condition $A\prec uI$ of Lemma 3.3 holds. Consider the shifts $\Delta_{1}=\Delta_{1}(A,X,u)$ and $\Delta_{2}=\Delta_{2}(A,X,u)$ defined above. By (13), we have

Moreover, a quick inspection of the quadratic forms in Lemma 3.3 shows that $Q_{1}(\Delta,X)$ and $Q_{2}(\Delta,X)$ are decreasing in $\Delta$ , and hence

Then Lemma 3.3 guarantees that $\Delta_{1}\vee\Delta_{2}$ is a feasible upper shift, which implies by (10) that

Furthermore, (14) yields a bound on the expected shift

which gives conclusion (8) of Theorem 3.2.

It remains to note that Lemmas 3.4 and 3.6 only guarantee that the bounds (14) hold when the sensitivity $\phi$ is sufficiently small, namely $\phi\leq\phi_{1}(\tau,\varepsilon/2)\wedge\phi_{2}(\tau,\varepsilon/2)$ . With $\tau=\varepsilon/16$ , we can simplify this inequality into the assumption of Theorem 3.2.

The rest of this section is devoted to controlling the shifts $\Delta_{1}$ and $\Delta_{2}$ .

It is easy to check that the proofs of Lemmas 3.4 and 3.6 which follow, and consequently Theorem 3.2, only require

then the shift $\Delta_{1}=\Delta_{1}(A,X,u)$ satisfies

Let $(\psi_{i})_{i\leq n}$ and $(\lambda_{i})_{i\leq n}$ denote the eigenvectors and eigenvalues of $A$ , and let $\xi_{i}=\langle X,\psi_{i}\rangle^{2}$ . We know that $\overline{m}_{A}(u)=\sum_{i=1}^{n}(u-\lambda_{i})^{-1}\leq\phi$ , and $\Delta_{1}$ is the smallest nonnegative number satisfying

Rescaling everything by $\phi$ and setting $\mu_{i}:=\phi(u-\lambda_{i})$ so that

the problem becomes equivalent to bounding the least $\mu:=\phi\Delta_{1}$ for which

Applying the following, somewhat more general, probabilistic lemma to $(\xi_{i})_{i\leq n}$ , we conclude that

Substituting $\phi=\phi_{1}(\tau,\varepsilon)$ gives the promised bound.

for all subsets $S\subset[n]$ and some constants $C,\eta>0$ . Consider positive numbers $\mu_{i}$ such that

Let $\mu$ be the minimal positive number such that

For simplicity of calculations, assume for the moment that the values of all $\mu_{i}$ are dyadic, that is,

and $\mu$ is the smallest positive number such that

We estimate $\mu$ by replacing it with a bigger but easier quantity $\mu^{\prime}$ . Define $\mu^{\prime}$ to be the smallest positive number such that, for every dyadic $k$ , one has

the definition of $\mu$ given in (17) yields

Let $\theta_{k}=\frac{1}{\varepsilon_{k}}\sum_{i\in I_{k}}\xi_{i}-k$ . For every $t\geq 0$ , one has

Since $\varepsilon_{k}\geq\frac{Kn_{k}}{2k}$ by definition, we have

The promised bound for general (nondyadic) $\mu_{i}$ follows by rounding each $\mu_{i}$ down to the nearest power of $2$ and replacing $K$ by $K/2$ .
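The rounding step is elementary, and a short sketch shows why it costs only a factor of $2$. The function name below is hypothetical. Rounding $\mu_{i}$ down to a power of $2$ gives a value $\mu_{i}^{\prime}$ with $\mu_{i}/2<\mu_{i}^{\prime}\leq\mu_{i}$, so each reciprocal $1/\mu_{i}$ grows by a factor less than $2$.

```python
import math


def round_down_to_power_of_two(mu):
    """Largest power of 2 (with integer exponent, possibly negative) that is <= mu."""
    if mu <= 0:
        raise ValueError("mu must be positive")
    # frexp writes mu = m * 2**e with 0.5 <= m < 1, so 2**(e-1) <= mu < 2**e.
    _, e = math.frexp(mu)
    return math.ldexp(1.0, e - 1)


if __name__ == "__main__":
    for mu in (0.3, 1.0, 5.0, 8.0, 100.0):
        r = round_down_to_power_of_two(mu)
        # r is dyadic and mu / 2 < r <= mu.
        print(mu, r, mu / r)
```

Using `math.frexp` avoids the rounding issues that can arise from computing `2 ** floor(log2(mu))` in floating point near exact powers of two.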

Remark (Necessity of the strong regularity assumption (SR)). The preceding lemma is the only place in the proof where the full power of (SR) is used. To see that it is necessary, consider the following situation. Fix any $S\subset[n]$, and let $\frac{1}{\mu_{i}}={\mathbf{1}}_{\{i\in S\}}/|S|$ so that $\sum_{i}\frac{1}{\mu_{i}}=1$. Then the smallest $\mu\geq 0$ for which $\sum_{i}\frac{1}{\mu_{i}+\mu}\leq K$ is just

then the shift $\Delta_{2}=\Delta_{2}(A,X,u)$ satisfies

It will be more convenient to work with the quadratic form

The reason for working with $Q_{2}$ rather than directly with $Q_{2}^{\prime}$ in Lemma 3.3 is that $Q_{2}(\Delta,x)$ is decreasing in $\Delta$ ; this monotonicity is required when arguing that the maximum of the two shifts $\Delta=\Delta_{1}\lor\Delta_{2}$ is feasible in the proof of Theorem 3.2.

We begin by recording some regularity properties of $Q_{2}^{\prime}(\Delta,X)$ .

$Q_{2}^{\prime}(\Delta,X)\leq(1+\phi\Delta)^{2}Q_{2}^{\prime}(0,X)$ ;

(i) is analogous to Lemma 2.4. In a similar way, we show that all eigenvalues $\lambda_{i}$ of $A$ satisfy $u-\lambda_{i}\geq 1/\phi$ , which implies the comparison inequality

Denoting $(\psi_{i})_{i\leq n}$ the eigenvectors of $A$ , we express

(ii) We note that (20) can be rearranged as a convex combination of $\langle X,\psi_{i}\rangle^{2}$ .

(iii) We apply Minkowski’s inequality to obtain

Now a simple integration of tails implies that each

which concludes the proof. Next, we see how the regularity properties of $Q_{2}^{\prime}(\Delta,X)$ translate into the corresponding properties of $\Delta_{2}$ :

(i) By definition of $\Delta_{2}$ and using (19), we have for all $t>0$ ,

This probability can be controlled using Lemma 3.7(iii) and Markov’s inequality, so we obtain

as $\tau<1/2$ . Integration of tails yields

(ii) Let $s_{0}$ denote the smaller solution of the quadratic equation

whenever a solution exists. In this case $s_{0}>0$ and Lemma 3.7(i) yields that

By (19), this yields $Q_{2}(s_{0},X)\leq s_{0}(1-\tau)$ . By definition of $\Delta_{2}$ , this in turn implies that

An elementary calculation shows that if $Q_{2}^{\prime}(0,X)\leq(t-2\tau)/8\phi$ , then the solution $s_{0}$ exists and satisfies

where we used Lemma 3.7(i) in the last step.

We can now complete the proof of Lemma 3.6.

By Lemma 3.8(ii), we have $E_{1}\leq 1+t$ . Next, we estimate $E_{2}$ using Hölder’s inequality,

The two terms here can be estimated using Lemma 3.8(i) and Lemma 3.7 along with Markov’s inequality,

Finally, we set $t=\varepsilon/2$ and use the assumptions $\phi\leq\phi_{2}(\tau,\varepsilon)$ and $\tau<\varepsilon/2$ to conclude that $E_{2}\leq\varepsilon/2$ . Together with $E_{1}\leq 1+t=1+\varepsilon/2$ this implies

Although for convenience of application Lemma 3.6 is stated under the strong regularity assumption (SR), the latter is not used in the proof. The argument above uses only the weak regularity assumption (WR).

The spectral norm

In this section we prove Theorem 1.1 by showing that whenever $X_{1},\ldots,X_{N}$ are independent and satisfy (SR), the spectral norm estimate

obtained in Theorems 1.5 and 3.1. The basic idea is to show using independence that

is concentrated near its expectation of $1$ . Combining this with

which follows immediately from (22), yields (21).

We rely on the following elementary proposition regarding sums of independent random variables.

Postponing the proof of Proposition 4.1, we use this fact to control

Proof of Theorem 1.1 Assume the random vectors $X_{i}$ are isotropic and satisfy (SR) with parameters $C,\eta$ . This implies that the random variables

satisfy the requirements of Proposition 4.1 with parameters $C^{1+\eta},\eta$ . It follows that

Replacing $\varepsilon$ by $\varepsilon/3$ and taking

always satisfies (26). This completes the proof of the theorem.

Proof of Proposition 4.1 Fix a parameter $K>0$ , and decompose

By Jensen’s inequality, independence and the bound on $Z_{i}^{\prime}$ , we have

Moreover, by triangle and Jensen’s inequalities,

Choosing $K=(\varepsilon/2)\sqrt{N}$ and using the assumption on $N$ , one easily checks that

Appendix: Proof of Proposition 1.3

In this section we prove Proposition 1.3, which states that product distributions satisfy the regularity assumption in Theorem 1.1. Note that this result and its proof are not needed in the proof of Theorem 1.1.

The contribution of the diagonal of $P$ to this sum is

Denote by $P_{0}$ the matrix $P$ with diagonal removed; then

This inequality can be obtained from general decoupling results; see dlPG , Theorem 3.1.1; a simple and well-known proof of (2) is given in Vdecoupling .

Therefore, by conditioning on $X^{\prime}$ we obtain from (2) that

Since $P_{0}$ equals $P$ without the diagonal, the triangle inequality yields

Since $0<P_{ii}\leq\|P\|\leq 1$ , we can replace $P_{ii}^{2}$ by $P_{ii}$ , so

Putting (1), (3) and (4) together, we arrive at the inequality

Put in different words, the random variable $Z:=\|PX\|_{2}^{2}-D$ satisfies the inequality

Solving this quadratic inequality we obtain that

In order to bound $\|D\|_{L_{p}}$ we consider

Finally, by definition of $Z$ and using the triangle inequality and bounds (7), (6), we conclude that

Acknowledgments

The authors are grateful to the referees whose comments improved the presentation of the paper.