The lower tail of random quadratic forms, with applications to ordinary least squares and restricted eigenvalue properties

Roberto Imbuzeiro Oliveira

Introduction

The most basic problem is computing how many samples are needed to bring Σ^n\widehat{\Sigma}_{n} close to Σ\Sigma. One needs at least npn\geq p to bring Σ^n\widehat{\Sigma}_{n} close to Σ\Sigma, so that the ranks of the two matrices can match. A basic problem is to find conditons under which nC(ε)pn\geq C(\varepsilon)\,p samples are enough for guaranteeing

where C(ε)C(\varepsilon) depends only on ε>0\varepsilon>0 and on moment assumptions on the XiX_{i}’s.

A well known bound by Rudelson implies C(ε)plogpC(\varepsilon)\,p\log p samples are necessary and sufficient if the vectors Σ1/2Xi/p\Sigma^{-1/2}X_{i}/\sqrt{p} have uniformly bounded norms. Removing the logp\log p factor is relatively easy for subgaussian vectors XiX_{i}, but even the seemingly nice case of logconcave random vectors (which have subexponential moments) had to wait for the breakthrough papers by Adamczak et al . The current best results hold when the XiX_{i} and all of their projections have q>2q>2 moments , and when their one-dimensional marginals have q>8q>8 moments ; in the latter case one also needs (necessarily) a high probability bound on maxinXi\max_{i\leq n}\,|X_{i}|. None of those finite-moment results gives strong concentration bounds.

It turns out that, for many important applications, only the lower tail of Σ^n\widehat{\Sigma}_{n} matters. That is, we only need that vTΣ^nvv^{T}\widehat{\Sigma}_{n}v is not much smaller than vTΣvv^{T}\Sigma v for all vectors vv in a suitable set. Our main result in this paper is that this lower tail is subgaussian under extremely weak conditions. More precisely, we will prove that if there exists a h>0{\sf h}>0 such that

then n=O(h2p/ε2)n=O\left({\sf h}^{2}\,p/\varepsilon^{2}\right) samples are enough to guarantee an asymmetric version of (2), to wit:

This follows from a more precise result – Theorem 3.1 in Section 3 below – about the more general case of sums of independent and identically distributed positive semidefinite random matrices. We note that the dependence on ε2\varepsilon^{-2} in our bound is optimal for vectors with independent coordinates, as can be shown via the Bai-Yin theorem .

Let us briefly comment on some proof ideas we think might be useful elsewhere. Theorem 3.1, our main result, is proven via so called PAC Bayesian methods and is inspired by the recent paper by Audibert and Catoni . We will see that this method allows one to translate properties of moment generating functions of individual random variables into uniform control of certain empirical processes. This is discussed in more detail in Section 3.2.

Organization: The next section covers some preliminaries and defines the notation we use. Section 3 contains the statement and proof of the main result, Theorem 3.1, along with a discussion of the assumptions and a proof overview. Section 4 presents our result on ordinary least squares, giving some background for the problem. Section 5 follows a similar format for restricted eigenvalues. The final section presents some remarks and open problems. Two Appendices contain a discussion of our improvement over , and some estimates used in the main text.

Notation and preliminaries

The restriction of vv to a subset S{1,,p}S\subset\{1,\dots,p\} is the vector vSv_{S} with vS[j]=v[j]v_{S}[j]=v[j] for jSj\in S and vS[k]=0v_{S}[k]=0 for k∉Sk\not\in S.

We use asymptotic notation somewhat informally, in order to illustrate our results with clean statements. We write a=o(b)a=o\left(b\right) or aba\ll b to indicate that a/b|a/b| is very small, and a=O(b)a=O\left(b\right) to say that a/b|a/b| is bounded by a universal constant.

Finally, we state for later use the Burkholder-Davis-Gundy inequality. Let (Mi,Fi)i=1n(M_{i},\mathcal{F}_{i})_{i=1}^{n} denote a martingale with finite qq-th moments (q2q\geq 2) and M0=0M_{0}=0 . Then:

Note that the first inequality above is the BDG inequality with optimal constant, and the second inequality follows from Minkowski’s inequality for the Lq/2L^{q/2} norm. We also observe that (7) implies a result for W1,,WnW_{1},\dots,W_{n} which are i.i.d. random variables:

Better inequalities are known in this case, but we will use (8) for simplicity.

The subgaussian lower tail

The goal of this section is to discuss and prove our main result.

Let us recall that in the vector case Ai=XiXiTA_{i}=X_{i}X_{i}^{T} the main assumption we need is that

An obvious case where (9) holds is when X,,X[p]X,\dots,X[p] are independent, have finite fourth moments and mean . A short calculation shows that we may take

Significantly, the same calculations also work when X,,X[p]X,\dots,X[p] are four-wise independent; this will be interesting when considering compressed sensing-type applications (cf. Example 1 below). Changing to 22h2\sqrt{2}\,{\sf h} allows us to consider translations and linear transformations of XX.

These particular cases include many important examples, such as gaussian, subgaussian, logconcave vectors and their affine transformations. There are also many examples with unbounded 4+ε4+\varepsilon moments. If we multiply XX by an independent scalar ξ\xi with

we just need to replace h{\sf h} with hh{\sf h}\,{\sf h}_{*}. Interestingly, the upper tail of Σ^n\widehat{\Sigma}_{n} is quite sensitive to this kind of transformation. Even multiplying by a Gaussian random variable may result in an ensemble that does not obey the analogue of the main theorem (cf. the discussion in [30, Section 1.8]).

2 Proof overview and a preliminary PAC Bayesian result

is a sum of random variables which are independent, identically distributed and non negative. Such sums are well known to have subgaussian lower tails under weak assumptions; see eg. Lemma B.2 below.

To make these ideas more definite we present a technical result that encapsulates the main ideas in our PAC Bayesian approach. This requires some conditions.

are well defined and depend continuously on vv. We will use the notation Γv,Cfθ\Gamma_{v,C}f_{\theta} to denote the integral of fθf_{\theta} (which may also depend on other parameteres) over the variable θ\theta with the measure Γv,C\Gamma_{v,C}.

In the next subsection we will apply this to prove Theorem 3.1. Here is a brief overview: we will performe a change of cordinates under which Σ=Ip×p\Sigma=I_{p\times p}. We will then define ZθZ_{\theta} as

is a new term introduced by the“smoothing operator” Γv,γC\Gamma_{v,\gamma C}. The choice γ=1/p\gamma=1/p will ensure that this term is small, and the “other terms” will also turn out to be manageable. The actual proof will be slightly complicated by the fact that we need to truncate the operator Σ^n\widehat{\Sigma}_{n} to ensure that SvS_{v} is highly concentrated.

Proof: [of Proposition 3.1] As a preliminary step, we note that under our assumptions the map:

is measurable. This implies that the event in the statement of the proposition is indeed a measurable set.

To continue, recall the definition of Kullback Leiber divergence (or relative entropy) for probability measures over a measurable space (Θ,G)(\Theta,\mathcal{G}):

But this follows from Markov’s inequality and Fubini’s Theorem:

3 Proof of the main result

The goal of our proof is to show that, for any t0t\geq 0:

Replacing vv with Σ1/2v\Sigma^{-1/2}v above and using homogeneity reduces this goal to showing:

This is what we will show in the remainder of the proof.

Fix some R>0R>0 and define (with hindsight) truncated operators

with the convention that this is simply if tr(Bi)=0{\rm tr}(B_{i})=0. We collect some estimates for later use.

Fix ξ>0\xi>0. We will apply Proposition 3.1 with C=Ip×p/pC=I_{p\times p}/p and

plus the fact that, for any non-negative, square-integrable random variable WW,

(this is shown in the proof of Lemma B.2 in the Appendix). We deduce from Proposition 3.1 that, with probability 1et\geq 1-e^{-t},

The first two terms inside the brackets are non-negative and, by Cauchy Schwartz, the absolute value of the rightmost term is at most the sum of the other two. We deduce:

Taking expectations, applying Lemma 3.1 and recalling v2=1|v|_{2}=1 gives:

This holds for any R>0R>0. Optimizing over RR gives:

The overestimates 2/312/3\leq 1, 245224\leq 5^{2} and 0p0\leq p finish the proof of (15). This in turn finishes the proof of Theorem 3.1 except for Lemma 3.1, which is provn below. \Box

Proof: [of Lemma 3.1] The first item is immediate. The third item follows from tr(BiR)tr(Bi){\rm tr}(B_{i}^{R})\leq{\rm tr}(B_{i}) and Lemma B.1 in Section B.1.

Combining the last three inequalities finishes the proof. \Box

It is instructive to compare this proof with what one would obtain without truncation. In that case everything would go through except for the step where we apply Bennett’s inequality.

Ordinary least squares under random design

as small as possible. In other words, one is trying to find a linear combination of the coordinates of XX that is as close as possible to YY in terms of mean-square error. The random design setting should be contrasted with the technically simpler case of fixed design, where the XiX_{i}’s are assumed fixed and all randomness is in the YiY_{i}’s. Results about this setting are not indicative about out-of-sample prediction, a crucial property in many tasks where least squares is routinely used, as well as in theoretical problems such as linear aggregration; see for further discussion.

This estimator is not hard to study when nn is large, pp is much smaller than nn and a linear model is assumed:

Here we want to consider a completely model-free, non-parametric setting where no specific relationship between XX and YY is assumed. Moreover, we want to allow for large pp, with the only condition is that p/np/n should be small. This rules out using classical asymptotic theory (which is not quantitative) as well as Barry-Esséen-type bounds (which do not work for pn2/3p\gg n^{2/3}; see for the best known bounds).

2 Our result, and previous work

Choose δ,η,ε(0,1)\delta,\eta,\varepsilon\in(0,1) and assume:

The proof of Theorem 4.1 consists of three steps. One is to use an explicit expression for OLS in order to express β^nβmin\widehat{\beta}_{n}-{\beta}_{\min}. Theorem 3.1 is used to prove that a matrix that appears in the expression for this difference has bounded norm. The third step is to control the remaining expression, which is a sum of i.i.d. random vectors that we analyze via Lemma 4.1 below.

3 The proof

Proof: [of Theorem 4.1] We will assume that Σ\Sigma has full rank; the general case follows from a simple perturbation argument. We also define Σ^n\widehat{\Sigma}_{n} as in (1), that is,

The assumptions on XX of Theorem 4.1 imply those of Theorem 3.1 (with Ai=XiXiTA_{i}=X_{i}X_{i}^{T}). Tis implies that the event

The ZiZ_{i} are independent vectors whose law is the same as that of ZZ in Theorem 4.1. This implies that the following Lemma may be applied.

The Lemma implies that the event Vector defined below,

The two rightmost terms in the previous display are bounded via Lower and Vector. To see this we begin by applying (6) above:

Plugging these bounds into (27) results in

and this inequality holds whenever LowerVector{\sf Lower}\cap{\sf Vector} occurs. In particular, the probability of the last display satisfies the bound claimed in the Theorem. \Box

4 Proof of the auxiliary result on sums of random vectors

Proof: [of Lemma 4.1] Write S0=0S_{0}=0 and Si=Si1+ZiS_{i}=S_{i-1}+Z_{i}, 1in1\leq i\leq n. We note that:

is a martingale with respect to the filtration F0={,Ω}\mathcal{F}_{0}=\{\emptyset,\Omega\},

Now define hiSi/Sih_{i}\equiv S_{i}/|S_{i}| if Si0|S_{i}|\neq 0, and hi=0h_{i}=0 otherwise. The following random variable will be important later on.

We will use the following estimates (proven subsequently).

We will also use the following simple fact about martingales, which we prove in the appendix

Suppose {Ni}i=0n\{N_{i}\}_{i=0}^{n} is a square-integrable martingale with respect to a filtration {Gi}i=0n\{\mathcal{G}_{i}\}_{i=0}^{n}. Define W0=0W_{0}=0 and

Combining this with Claim 1 and the definition of VnV_{n} shows that, for any choice of ξ>0\xi>0, 0<α<10<\alpha<1, we have:

Now fix some α\alpha, make the choice of

and apply (28) to the value i{1,,n}i_{*}\in\{1,\dots,n\} achieving the maximum of Si|S_{i}|. We have that, with probability 1δ\geq 1-\delta,

with probability 1δ.\geq 1-\delta. This is precisely the desired result once we choose:

Proof: [of Claim 1] We prove the second (harder) assertion first. Note that

is a martingale with respect to the filtration {Fi}i=1n\{\mathcal{F}_{i}\}_{i=1}^{n}. The Burkholder-Davis-Gundy inequality (7) implies, for any q2q\geq 2,

Using again that hj11|h_{j-1}|\leq 1 always, j=1nΛ1/2hj1nλmax(Λ)\sum_{j=1}^{n}|\Lambda^{1/2}h_{j-1}|\leq n\lambda_{\max}(\Lambda). We deduce:

Restricted eigenvalues in high dimensions

where ϵ1,,ϵn\epsilon_{1},\dots,\epsilon_{n} represent some kind of noise and – most importantly – the dimension pp may greatly exceed the number nn of measurements. The aforementioned fields tend to interpret this setup in different ways. Whereas in Compressed Sensing one tends to think of the xix_{i}’s as measurement vectors as controlled by the “experimenter”, for a statistician the xix_{i} and YiY_{i} are generated by a random process that is not under control (and the whole problem corresponds to linear regression pnp\gg n; Section 5.3 below).

It should be clear that, given pnp\gg n, the above problem is severely underdetermined. However, sparsity may be used as a key enabling assumption. It is known that if the vector βmin{\beta}_{\min} has sn/logps\ll n/\log p non-zero coordinates, then it may be recovered up to error of the order σ2slogp/n\sigma^{2}s\log p/n. This is only O(logp)O\left(\log p\right) times larger than the error of OLS which “knows” the support of βmin{\beta}_{\min}. Most importantly, there are computationally efficient estimators achieving this rate. These developments and their extensions comprise a vast literature which we will not try to survey; we refer instead to a recent book and a handful of important papers for more information on these topics.

Computationally efficient estimators achieving this rate require certain conditions besides sparsity. Denote by X^n\widehat{\bf X}_{n} the design matrix:

Several sufficient conditions on X^n\widehat{\bf X}_{n} are known to ensure the fast rates we have described, including uniform uncertainty principles, restricted isometry, sparse eigenvalues and incoherence; see eg. and especially the paper where these conditions are compared. In this paper we focus on so-called restricted eigenvalue conditions, which are amongst the least restrictive in this class.

(Here vSv_{S} denotes the restriction of vv to SS, cf. Section 2.) The restricted eigenvalue constant for (A,S,α)(A,S,\alpha), denoted by re(A,S,α){\sf re}(A,S,\alpha), is the largest value of R>0R>0 such that:

Moreover, re(A,s,α){\sf re}(A,s,\alpha) is the minimum of re(A,S,α){\sf re}(A,S,\alpha) over S{1,,p}S\subset\{1,\dots,p\} with S=s|S|=s.

In the setting of (29) one may take SS as the support of βmin{\beta}_{\min}. Assuming re(X^n,S,α){\sf re}(\widehat{\bf X}_{n},S,\alpha) is bounded for some specific α>0\alpha>0 ensures that estimators such as the Dantzig selector and the LASSO will achieve the near-OLS error rate defined above. Here is one example by Buhlmann and Van der Geer which may be applied to a fixed-design linear regression model

Then there exists a choice of λ=λ(σ2,n,p)\lambda=\lambda(\sigma^{2},n,p) such that, with probability 1p2\geq 1-p^{-2}:

where c>0c>0 depends only on re(X^n,S,3){\sf re}(\widehat{\bf X}_{n},S,3).

We emphasize that this estimator has performance which nearly matches that of OLS when the support of βmin{\beta}_{\min} is known. Similar results could be achieved by trying all potential supports: the merit of the LASSO and related methods is computational efficientcy.

We note in passing that there is also a fairly sizable literature on how well the LASSO and other methods do when βmin{\beta}_{\min} is only approximately sparse and a linear model is not necessarily valid. We will mostly refrain from discussing this in what follows, and refer to for further discussion of this topic.

2 Our result, and related work

Define diagonal matrices D^2,n\widehat{D}_{2,n} and D2D_{2} corresponding to the diagonals of Σ^n\widehat{\Sigma}_{n} and Σ\Sigma (respectively). Set

and XD21/2ΣD21/2{\bf X}\equiv D_{2}^{-1/2}\Sigma D_{2}^{-1/2} with the convention that the (j,j)(j,j)th entry of D^2,n1/2\widehat{D}_{2,n}^{-1/2} (resp. D21/2D_{2}^{-1/2}) is zero whenever the corresponding entry of D^2,n\widehat{D}_{2,n} (resp. D2D_{2}) is zero. Assume that δ,ε(0,1/2)\delta,\varepsilon\in(0,1/2) and S{1,,p}S\subset\{1,\dots,p\} with cardinality S=s|S|=s, and set

Then the following three properties hold simultaneously with probability 1δ\geq 1-\delta.

For any xx as above, xTΣ^nx(1ε)2xTΣx.x^{T}\widehat{\Sigma}_{n}x\geq(1-\varepsilon)^{2}\,x^{T}\Sigma\,x.

Let us note the main differences between this theorem and the results in : our theorem holds for a specific choice of S{1,,p}S\subset\{1,\dots,p\} – ie. it is not uniform over SS with S=s|S|=s – and uses the “normalized” matrix X^n\widehat{\bf X}_{n} instead of Σ^n\widehat{\Sigma}_{n}. Both differences are related to our moment assumptions, and both turn out not to be problematic in certain scenarios, such as “randomized, RIPless compressed sensing” and statistical regression problems, where one wants to solve one problem instance and uniform guarantees are unnecessary (cf. Section 5.3 below). We note that the normalization on Σ^n\widehat{\Sigma}_{n} is farly natural, at it ensures the “unit diagonal” condition in Theorem 5.1. We also note that stronger moment assumptions allow for stronger conclusions via the same proof methods; we illustrate this with a simple example.

3 A digression on linear regression with random design

Theorem 5.2 is quite obviously applicable to fixed design regression as in Theorem 5.1. As it turns out, it may also be applied to the random design setting discussed in Section 4.1 when the dimension pp is much greater than the number of samples nn. For simplicity we will focus on how this is done in the linear model setting (22), in which case we may apply Theorem 5.1 directly; a general model analysis would require ideas from .

with probability 1O(p2)1-O\left(p^{-2}\right), as long as re(X,S,c)>0{\sf re}({\bf X},S,c)>0 for some c>3c>3 and the other parameters are chosen in their proper ranges.

4 Proof ideas, and the transfer principle

Suppose Σ^n\widehat{\Sigma}_{n} and Σ\Sigma are matrices with non-negative diagonal entries, and assume η(0,1)\eta\in(0,1), d{1,,p}d\in\{1,\dots,p\} are such that

Assume DD is a diagonal matrix whose elements D[j,j]D[j,j] are non-negative and satisfy D[j,j]Σ^n[j,j](1η)Σ[j,j]D[j,j]\geq\widehat{\Sigma}_{n}[j,j]-(1-\eta)\,\Sigma[j,j]. Then

Raskutti et al. prove such a bound directly for Gaussian ensembles, and note that it implies the restricted eigenvalue property when the population design matrix has this property. In our case we use Theorem 3.1 to control of Σ^n\widehat{\Sigma}_{n} over sparse vectors, and combine it with this Lemma to obtain the appropriate control over the cone C(S,α)\mathcal{C}(S,\alpha). As noted in the introduction, this Transfer Principle implies a version of the main result of Rudelson and Zhou ; see Appendix A for details.

Proof: [of Lemma 5.1] We assume DD is invertible; the general case follows via a simple continuity argument. We also set

Notice that vTAv0v^{T}Av\geq 0 for all dd-sparse vectors, and also that 0A[j,j]10\leq A[j,j]\leq 1 for each 1jp1\leq j\leq p. We will prove that:

which implies the Lemma once we set y=D1/2xy=D^{1/2}x.

To prove ()(\star) we use a probabilistic argument related to Maurey’s empirical method. We may write

where sj{1,+1}s_{j}\in\{-1,+1\} is the sign of y[j]y[j] and pjy[j]/y1p_{j}\equiv|y[j]|/|y|_{1}.

The pjp_{j}’s are non-negative and sum to 11. Therefore we may define v1,,vdv_{1},\dots,v_{d} to be independent and identically distributed random vectors with the following distribution:

has at most dd nonzero coordinates, so vTAv0v^{T}Av\geq 0. Taking expectations, we see that:

It remains to compute this expectation. When iri\neq r, viv_{i} and vrv_{r} are independent and:

When i=ri=r and vi=y1sjejv_{i}=|y|_{1}\,s_{j}\,e_{j} we see that viTAvi=y12A[j,j]y12v_{i}^{T}Av_{i}=|y|_{1}^{2}A[j,j]\leq|y|_{1}^{2} because A[j,j]1A[j,j]\leq 1 for each jj. Thus

Combining the two previous displays with (32) gives

5 Proof

In this section we prove Theorem 5.2. The first step is where we use Theorem 3.1 and Lemma 5.1.

Under the assumptions of Theorem 5.2, the following event holds with probability 1δ/3\geq 1-\delta/3:

by the assumptions of Theorem 5.2. We begin by proving that:

More specifically, this follows from Theorem 3.1 with dd replacing pp and 8δ/pd8\delta/p^{d} replacing δ\delta. Notice that with these choices

Plugging this into (35) and applying a union bound over UU gives:

This gives our first goal (34). To obtain the Lemma from this, we apply the transfer principle (Lemma 5.1) whenever Lower holds, using (33) to deduce

Proof: [of Theorem 5.2] We define Lower as in Lemma 5.2, and consider two other events:

This proves C1 in the Theorem, and also allows us to obtain:

Combining the final bound with Lower and using our assumption on nn we obtain

This is C2. Finally, note that Diag{\sf Diag}_{-} implies

Since this holds for any D^2,n1/2x\widehat{D}_{2,n}^{1/2}x as above we may use the substitution y=D^2,nxy=\widehat{D}_{2,n}x to conclude:

We have proved that C1, C2 and C3 hold in the intersection Lower\capDiag+{}_{+}\capDiag-. We now estimate the probability of this intersection by showing that each event has probability 1δ/3\geq 1-\delta/3. We already have this lower bound for Lower from Lemma 5.2. For Diag- we will use Lemma B.2 in the appendix. Note that for each 1jp1\leq j\leq p

Applying Lemma B.2 for each jj, with the choice t=ε2n/2h2t=\varepsilon^{2}n/2{\sf h}^{2}, gives

by our assumptions on the parameters. \Box

Final remarks

LASSO-type estimators like the one described here have been analyzed without restricted eigenvalue assumptions. Bartlett, Mendelson and Neeman prove that the LASSO acts as a penalized least squares regressor satisfying a sharp oracle inequality. While we do not pursue this here, one could prove such a result under weak moment assumptions similar to those of Theorem 4.1, but allowing for nlogpn\gg\log p. The recipe would be to start from Lemma 5.2 above and combine the “self-normalization” ideas in the proof of Theorem 5.2 with standard methods for obtaining sharp oracle inequalities . However, in this case the penalty of the LASSO would scale as O(lnp/nD^4,n1/41)O\left(\sqrt{\ln p/n}\,|\widehat{D}_{4,n}^{1/4}|_{1}\right), where D^4,n\widehat{D}_{4,n} contains n1iXi[j]4n^{-1}\sum_{i}X_{i}[j]^{4} on the diagonal (1jp1\leq j\leq p) and zeros elsewhere.

An interesting queston is whether one can use PAC Bayesian methods to impove upon the upper tail in random covariance matrix estimation. It seems that at least some of the constants in references could be improved by this approach. The main idea would be to use self-normalized concentration inequalities to compensate for the lack of infinitely many moments.

Another intersting problem (in the context of Theorem 5.2) is to investigate whether the PAC Bayesian method can improve on the known recovery guarantees for vectors with bounded entries . One may try to achieve this via a different choice of smoothig distribution and it seems likely that one of those choices will improve e.g. the best known bounds for sampling rows of an orthogonal matrix. This would also have some bearing on the performance of the LASSO.

Appendix A An improvement over the result of Rudelson and Zhou

In this appendix we discuss how our Transfer Principle (Lemma 5.1) can be applied to obtain an improvement of a recent result by Rudelson and Zhou . The notation and definitions from Section 5.1 are taken for granted.

The goal of was to show that if Σ^n\widehat{\Sigma}_{n} “acts like” Σ\Sigma over sparse vectors, it necessarily inherits restricted eigenvalue properties from Σ\Sigma. This is important because dealing directly with the restricted eigenvalues might be complicated, whereas controlling Σ^n\widehat{\Sigma}_{n} over sparse vectors is typically much easier. Their precise result reads as followsThe reader should beware that our notation and our definition of restricted eigenvalues do not coincide with that of . What follows is a “translation” to our language.:

Then xC(S,α):xTΣ^nx(13ε/2)xTΣx.\forall x\in\mathcal{C}(S,\alpha)\,:\,x^{T}\widehat{\Sigma}_{n}\,x\geq(1-3\varepsilon/2)\,\,x^{T}\Sigma\,x. In particular,

The main conceptual difference between these two Theorems is that the condition (38) implies (39) and (40) with γε\gamma\approx\varepsilon (and a slightly different ε\varepsilon). Moreover, we do not require bounds on re(Σ,s,3α){\sf re}(\Sigma,s,3\alpha), and the numerical constants in our result are better. We also note that the full proof of Theorem A.2 (which includes Lemma 5.1 above) is quite simple and about two pages long.

Proof: [of Theorem A.2] By our assumptions, we have

We also have the condition vTΣ^nv(1ε)vTΣvv^{T}\widehat{\Sigma}_{n}v\geq(1-\varepsilon)\,v^{T}\Sigma v for all dd-sparse vv. We may apply Lemma 5.1 with D=MIp×pD=M\,I_{p\times p} to conclude:

We now restrict attention to xC(S,α)x\in\mathcal{C}(S,\alpha), noting that

by the definitions of MM and dd. Plugging this back into (41) gives:

By the definition of re(Σ,S,α){\sf re}(\Sigma,S,\alpha), we also have

and apply this to ξ=3ε/2\xi=3\varepsilon/2 (which is 3/4\leq 3/4 because ε1/2\varepsilon\leq 1/2). \Box

Appendix B Technical estimates

note that we also used (implicitly) the fact that ejTAej0{e_{j}}^{T}{Ae_{j}}\geq 0 for all jj. \Box

B.2 Lower tail concentration for non-negative random variables

Let W1,,Wn[0,+)W_{1},\dots,W_{n}\in[0,+\infty) be independent non-negative random variables with finite second moments. Then:

Apply this to x=ξWix=\xi W_{i} (with ξ>0\xi>0) and integrate to deduce:

Now use the independent of the WiW_{i}’s to obtain:

The usual Bernstein’s trick finishes the proof. More specifically, we note that, for any λ>0\lambda>0

then bound the RHS via (42) and optimize in ξ\xi. \Box

B.3 Proof of Proposition 4.1

Proof: [of Proposition 4.1] The key step in the proof is to show that the sequence of random variables U01U_{0}\equiv 1,

To prove that UiU_{i} is indeed a supermartingale, note that:

and the conditional Jensen inequality implies:

Note that DDD-D^{\prime} is a symmetric random variable. Therefore:

References