The lower tail of random quadratic forms, with applications to ordinary least squares and restricted eigenvalue properties

Roberto Imbuzeiro Oliveira

Introduction

The most basic problem is computing how many samples are needed to bring $\widehat{\Sigma}_{n}$ close to $\Sigma$. One needs at least $n\geq p$ samples for this, so that the ranks of the two matrices can match. A natural question is then to find conditions under which $n\geq C(\varepsilon)\,p$ samples are enough for guaranteeing

where $C(\varepsilon)$ depends only on $\varepsilon>0$ and on moment assumptions on the $X_{i}$ ’s.

A well known bound by Rudelson implies that $C(\varepsilon)\,p\log p$ samples are necessary and sufficient if the vectors $\Sigma^{-1/2}X_{i}/\sqrt{p}$ have uniformly bounded norms. Removing the $\log p$ factor is relatively easy for subgaussian vectors $X_{i}$, but even the seemingly nice case of logconcave random vectors (which have subexponential moments) had to wait for the breakthrough papers by Adamczak et al. The current best results hold when the $X_{i}$ and all of their projections have $q>2$ moments, or when their one-dimensional marginals have $q>8$ moments; in the latter case one also needs (necessarily) a high probability bound on $\max_{i\leq n}\,|X_{i}|$. None of those finite-moment results gives strong concentration bounds.

It turns out that, for many important applications, only the lower tail of $\widehat{\Sigma}_{n}$ matters. That is, we only need that $v^{T}\widehat{\Sigma}_{n}v$ is not much smaller than $v^{T}\Sigma v$ for all vectors $v$ in a suitable set. Our main result in this paper is that this lower tail is subgaussian under extremely weak conditions. More precisely, we will prove that if there exists a ${\sf h}>0$ such that

then $n=O\left({\sf h}^{2}\,p/\varepsilon^{2}\right)$ samples are enough to guarantee an asymmetric version of (2), to wit:

This follows from a more precise result – Theorem 3.1 in Section 3 below – about the more general case of sums of independent and identically distributed positive semidefinite random matrices. We note that the dependence on $\varepsilon^{-2}$ in our bound is optimal for vectors with independent coordinates, as can be shown via the Bai-Yin theorem .
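To make the asymmetric guarantee concrete, the following Python sketch computes the smallest value of $v^{T}\widehat{\Sigma}_{n}v/v^{T}\Sigma v$ over nonzero $v$, which equals the smallest eigenvalue of $\Sigma^{-1/2}\widehat{\Sigma}_{n}\Sigma^{-1/2}$ when $\Sigma$ is invertible. The function name `lower_tail_ratio`, the use of NumPy, the Student-t example and the convention $\widehat{\Sigma}_{n}=n^{-1}\sum_{i}X_{i}X_{i}^{T}$ are assumptions made for illustration, not taken from the paper.

```python
import numpy as np

def lower_tail_ratio(X, Sigma):
    """Return the minimum over v != 0 of (v^T Sigma_hat v) / (v^T Sigma v).

    X is an (n, p) array whose rows are the samples X_i, and Sigma is the
    (p, p) population second-moment matrix, assumed positive definite.
    Sigma_hat is taken to be (1/n) * sum_i X_i X_i^T (an assumption here).
    """
    n = X.shape[0]
    Sigma_hat = X.T @ X / n
    # Whiten: the ratio is the smallest eigenvalue of
    # Sigma^{-1/2} Sigma_hat Sigma^{-1/2}.
    evals, evecs = np.linalg.eigh(Sigma)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    whitened = inv_sqrt @ Sigma_hat @ inv_sqrt
    # eigvalsh returns eigenvalues in ascending order.
    return float(np.linalg.eigvalsh(whitened)[0])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p, n = 20, 4000
    # Heavy-tailed independent coordinates: Student t with 5 degrees of
    # freedom has finite fourth moments and variance 5/3.
    X = rng.standard_t(df=5, size=(n, p))
    Sigma = (5.0 / 3.0) * np.eye(p)
    print("lower ratio:", lower_tail_ratio(X, Sigma))
```

A returned value of at least $1-\varepsilon$ corresponds to the lower-tail event discussed above; the sketch says nothing about the upper tail.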

Let us briefly comment on some proof ideas we think might be useful elsewhere. Theorem 3.1, our main result, is proven via so-called PAC Bayesian methods and is inspired by the recent paper by Audibert and Catoni. We will see that this method allows one to translate properties of moment generating functions of individual random variables into uniform control of certain empirical processes. This is discussed in more detail in Section 3.2.

Organization: The next section covers some preliminaries and defines the notation we use. Section 3 contains the statement and proof of the main result, Theorem 3.1, along with a discussion of the assumptions and a proof overview. Section 4 presents our result on ordinary least squares, giving some background for the problem. Section 5 follows a similar format for restricted eigenvalues. The final section presents some remarks and open problems. Two Appendices contain a discussion of our improvement over the result of Rudelson and Zhou, and some estimates used in the main text.

Notation and preliminaries

The restriction of $v$ to a subset $S\subset\{1,\dots,p\}$ is the vector $v_{S}$ with $v_{S}[j]=v[j]$ for $j\in S$ and $v_{S}[k]=0$ for $k\not\in S$ .
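As a small illustration of this notation, here is a Python sketch of the restriction operation. The function name `restrict` is hypothetical, and indices are 0-based in the code whereas the text uses $\{1,\dots,p\}$.

```python
import numpy as np

def restrict(v, S):
    """Return v_S: equal to v on the index set S and zero elsewhere.

    Indices are 0-based here, whereas the text uses {1, ..., p}.
    """
    v = np.asarray(v, dtype=float)
    out = np.zeros_like(v)
    idx = np.fromiter(S, dtype=int)
    out[idx] = v[idx]
    return out
```

For example, `restrict([1.0, 2.0, 3.0], {0, 2})` keeps the first and third coordinates and zeroes the second, and $v=v_{S}+v_{S^{c}}$ holds for any $S$.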

We use asymptotic notation somewhat informally, in order to illustrate our results with clean statements. We write $a=o\left(b\right)$ or $a\ll b$ to indicate that $|a/b|$ is very small, and $a=O\left(b\right)$ to say that $|a/b|$ is bounded by a universal constant.

Finally, we state for later use the Burkholder-Davis-Gundy inequality. Let $(M_{i},\mathcal{F}_{i})_{i=1}^{n}$ denote a martingale with finite $q$ -th moments ( $q\geq 2$ ) and $M_{0}=0$ . Then:

Note that the first inequality above is the BDG inequality with optimal constant, and the second inequality follows from Minkowski’s inequality for the $L^{q/2}$ norm. We also observe that (7) implies a result for $W_{1},\dots,W_{n}$ which are i.i.d. random variables:

Better inequalities are known in this case, but we will use (8) for simplicity.

The subgaussian lower tail

The goal of this section is to discuss and prove our main result.

Let us recall that in the vector case $A_{i}=X_{i}X_{i}^{T}$ the main assumption we need is that

An obvious case where (9) holds is when $X[1],\dots,X[p]$ are independent, have finite fourth moments and mean zero. A short calculation shows that we may take

Significantly, the same calculations also work when $X[1],\dots,X[p]$ are four-wise independent; this will be interesting when considering compressed sensing-type applications (cf. Example 1 below). Replacing ${\sf h}$ with $2\sqrt{2}\,{\sf h}$ allows us to consider translations and linear transformations of $X$.

These particular cases include many important examples, such as gaussian, subgaussian, logconcave vectors and their affine transformations. There are also many examples with unbounded $4+\varepsilon$ moments. If we multiply $X$ by an independent scalar $\xi$ with

we just need to replace ${\sf h}$ with ${\sf h}\,{\sf h}_{*}$ . Interestingly, the upper tail of $\widehat{\Sigma}_{n}$ is quite sensitive to this kind of transformation. Even multiplying by a Gaussian random variable may result in an ensemble that does not obey the analogue of the main theorem (cf. the discussion in [30, Section 1.8]).

3.2 Proof overview and a preliminary PAC Bayesian result

is a sum of random variables which are independent, identically distributed and non-negative. Such sums are well known to have subgaussian lower tails under weak assumptions; see e.g. Lemma B.2 below.

To make these ideas more definite we present a technical result that encapsulates the main ideas in our PAC Bayesian approach. This requires some conditions.

are well defined and depend continuously on $v$. We will use the notation $\Gamma_{v,C}f_{\theta}$ to denote the integral of $f_{\theta}$ (which may also depend on other parameters) over the variable $\theta$ with the measure $\Gamma_{v,C}$.

In the next subsection we will apply this to prove Theorem 3.1. Here is a brief overview: we will perform a change of coordinates under which $\Sigma=I_{p\times p}$. We will then define $Z_{\theta}$ as

is a new term introduced by the “smoothing operator” $\Gamma_{v,\gamma C}$. The choice $\gamma=1/p$ will ensure that this term is small, and the “other terms” will also turn out to be manageable. The actual proof will be slightly complicated by the fact that we need to truncate the operator $\widehat{\Sigma}_{n}$ to ensure that $S_{v}$ is highly concentrated.
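The effect of the smoothing step can be illustrated with a short Python sketch. It relies only on a general fact: if $\theta$ has mean $v$ and covariance $C$, then the expectation of $\theta^{T}A\theta$ equals $v^{T}Av+{\rm tr}(AC)$, so averaging a quadratic form over such a $\theta$ adds a trace term. Treating $\Gamma_{v,C}$ as a Gaussian with mean $v$ and covariance $C$ is an assumption made here for the Monte Carlo check, and the function names are hypothetical.

```python
import numpy as np

def smoothed_quadratic_form(A, v, C):
    """Closed form of E[theta^T A theta] when theta has mean v, covariance C."""
    return float(v @ A @ v + np.trace(A @ C))

def monte_carlo_smoothed(A, v, C, num_samples, rng):
    """Estimate the same expectation by sampling theta ~ N(v, C).

    The Gaussian choice is an assumption made for this illustration.
    """
    theta = rng.multivariate_normal(mean=v, cov=C, size=num_samples)
    # Row-wise quadratic forms theta_k^T A theta_k.
    values = np.einsum("ki,ij,kj->k", theta, A, theta)
    return float(values.mean())
```

With $C=I_{p\times p}/p$ the extra term is ${\rm tr}(A)/p$, the average eigenvalue of $A$, which is one plausible reading of why a scaling of order $1/p$ keeps the smoothing cost under control.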

Proof: [of Proposition 3.1] As a preliminary step, we note that under our assumptions the map:

is measurable. This implies that the event in the statement of the proposition is indeed a measurable set.

To continue, recall the definition of Kullback-Leibler divergence (or relative entropy) for probability measures over a measurable space $(\Theta,\mathcal{G})$ :

But this follows from Markov’s inequality and Fubini’s Theorem:

3.3 Proof of the main result

The goal of our proof is to show that, for any $t\geq 0$ :

Replacing $v$ with $\Sigma^{-1/2}v$ above and using homogeneity reduces this goal to showing:

This is what we will show in the remainder of the proof.

Fix some $R>0$ and define (with hindsight) truncated operators

with the convention that this is simply if ${\rm tr}(B_{i})=0$ . We collect some estimates for later use.

Fix $\xi>0$ . We will apply Proposition 3.1 with $C=I_{p\times p}/p$ and

plus the fact that, for any non-negative, square-integrable random variable $W$ ,

(this is shown in the proof of Lemma B.2 in the Appendix). We deduce from Proposition 3.1 that, with probability $\geq 1-e^{-t}$ ,

The first two terms inside the brackets are non-negative and, by Cauchy-Schwarz, the absolute value of the rightmost term is at most the sum of the other two. We deduce:

Taking expectations, applying Lemma 3.1 and recalling $|v|_{2}=1$ gives:

This holds for any $R>0$ . Optimizing over $R$ gives:

The overestimates $2/3\leq 1$ , $24\leq 5^{2}$ and $0\leq p$ finish the proof of (15). This in turn finishes the proof of Theorem 3.1 except for Lemma 3.1, which is proven below. $\Box$

Proof: [of Lemma 3.1] The first item is immediate. The third item follows from ${\rm tr}(B_{i}^{R})\leq{\rm tr}(B_{i})$ and Lemma B.1 in Section B.1.

Combining the last three inequalities finishes the proof. $\Box$

It is instructive to compare this proof with what one would obtain without truncation. In that case everything would go through except for the step where we apply Bennett’s inequality.

Ordinary least squares under random design

as small as possible. In other words, one is trying to find a linear combination of the coordinates of $X$ that is as close as possible to $Y$ in terms of mean-square error. The random design setting should be contrasted with the technically simpler case of fixed design, where the $X_{i}$ ’s are assumed fixed and all randomness is in the $Y_{i}$ ’s. Results about the fixed design setting are not indicative of out-of-sample prediction, a crucial property in many tasks where least squares is routinely used, as well as in theoretical problems such as linear aggregation; see for further discussion.

This estimator is not hard to study when $n$ is large, $p$ is much smaller than $n$ and a linear model is assumed:

Here we want to consider a completely model-free, non-parametric setting where no specific relationship between $X$ and $Y$ is assumed. Moreover, we want to allow for large $p$, with the only condition being that $p/n$ should be small. This rules out using classical asymptotic theory (which is not quantitative) as well as Berry-Esseen-type bounds (which do not work for $p\gg n^{2/3}$ ; see for the best known bounds).

4.2 Our result, and previous work

Choose $\delta,\eta,\varepsilon\in(0,1)$ and assume:

The proof of Theorem 4.1 consists of three steps. The first is to use an explicit expression for OLS in order to express $\widehat{\beta}_{n}-{\beta}_{\min}$. The second is to use Theorem 3.1 to prove that a matrix appearing in this expression has bounded norm. The third is to control the remaining term, which is a sum of i.i.d. random vectors that we analyze via Lemma 4.1 below.
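The first step can be checked numerically. The sketch below uses the standard algebraic identity $\widehat{\beta}_{n}-\beta=\widehat{\Sigma}_{n}^{-1}\,n^{-1}\sum_{i}(Y_{i}-\beta^{T}X_{i})X_{i}$, which holds for any reference vector $\beta$ whenever $\widehat{\Sigma}_{n}=n^{-1}\sum_{i}X_{i}X_{i}^{T}$ is invertible; the paper's own display may be written differently, and the function names are hypothetical.

```python
import numpy as np

def ols(X, Y):
    """Ordinary least squares: minimize (1/n) * sum_i (Y_i - b^T X_i)^2."""
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return beta_hat

def ols_error_via_identity(X, Y, beta):
    """Compute beta_hat - beta as Sigma_hat^{-1} (1/n) sum_i (Y_i - beta^T X_i) X_i.

    Requires Sigma_hat = X^T X / n to be invertible.
    """
    n = X.shape[0]
    Sigma_hat = X.T @ X / n
    # Average of the residual-weighted design vectors (a sum of i.i.d.
    # vectors when the rows (X_i, Y_i) are i.i.d.).
    residual_vector = X.T @ (Y - X @ beta) / n
    return np.linalg.solve(Sigma_hat, residual_vector)
```

The identity separates the two objects handled in the second and third steps: the inverse of $\widehat{\Sigma}_{n}$, whose norm is controlled by a lower bound on $\widehat{\Sigma}_{n}$, and the averaged vector.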

4.3 The proof

Proof: [of Theorem 4.1] We will assume that $\Sigma$ has full rank; the general case follows from a simple perturbation argument. We also define $\widehat{\Sigma}_{n}$ as in (1), that is,

The assumptions on $X$ of Theorem 4.1 imply those of Theorem 3.1 (with $A_{i}=X_{i}X_{i}^{T}$ ). This implies that the event

The $Z_{i}$ are independent vectors whose law is the same as that of $Z$ in Theorem 4.1. This implies that the following Lemma may be applied.

The Lemma implies that the event Vector defined below,

The two rightmost terms in the previous display are bounded via Lower and Vector. To see this we begin by applying (6) above:

Plugging these bounds into (27) results in

and this inequality holds whenever ${\sf Lower}\cap{\sf Vector}$ occurs. In particular, the probability of the last display satisfies the bound claimed in the Theorem. $\Box$

4.4 Proof of the auxiliary result on sums of random vectors

Proof: [of Lemma 4.1] Write $S_{0}=0$ and $S_{i}=S_{i-1}+Z_{i}$ , $1\leq i\leq n$ . We note that:

is a martingale with respect to the filtration $\mathcal{F}_{0}=\{\emptyset,\Omega\}$ ,

Now define $h_{i}\equiv S_{i}/|S_{i}|$ if $|S_{i}|\neq 0$ , and $h_{i}=0$ otherwise. The following random variable will be important later on.

We will use the following estimates (proven subsequently).

We will also use the following simple fact about martingales, which we prove in the appendix.

Suppose $\{N_{i}\}_{i=0}^{n}$ is a square-integrable martingale with respect to a filtration $\{\mathcal{G}_{i}\}_{i=0}^{n}$ . Define $W_{0}=0$ and

Combining this with Claim 1 and the definition of $V_{n}$ shows that, for any choice of $\xi>0$ , $0<\alpha<1$ , we have:

Now fix some $\alpha$ , make the choice of

and apply (28) to the value $i_{*}\in\{1,\dots,n\}$ achieving the maximum of $|S_{i}|$ . We have that, with probability $\geq 1-\delta$ ,

with probability $\geq 1-\delta.$ This is precisely the desired result once we choose:

Proof: [of Claim 1] We prove the second (harder) assertion first. Note that

is a martingale with respect to the filtration $\{\mathcal{F}_{i}\}_{i=1}^{n}$ . The Burkholder-Davis-Gundy inequality (7) implies, for any $q\geq 2$ ,

Using again that $|h_{j-1}|\leq 1$ always, $\sum_{j=1}^{n}|\Lambda^{1/2}h_{j-1}|\leq n\lambda_{\max}(\Lambda)$ . We deduce:

Restricted eigenvalues in high dimensions

where $\epsilon_{1},\dots,\epsilon_{n}$ represent some kind of noise and, most importantly, the dimension $p$ may greatly exceed the number $n$ of measurements. The aforementioned fields tend to interpret this setup in different ways. Whereas in Compressed Sensing one tends to think of the $x_{i}$ ’s as measurement vectors controlled by the “experimenter”, for a statistician the $x_{i}$ and $Y_{i}$ are generated by a random process that is not under control (and the whole problem corresponds to linear regression with $p\gg n$ ; cf. Section 5.3 below).

It should be clear that, given $p\gg n$ , the above problem is severely underdetermined. However, sparsity may be used as a key enabling assumption. It is known that if the vector ${\beta}_{\min}$ has $s\ll n/\log p$ non-zero coordinates, then it may be recovered up to error of the order $\sigma^{2}s\log p/n$ . This is only $O\left(\log p\right)$ times larger than the error of OLS which “knows” the support of ${\beta}_{\min}$ . Most importantly, there are computationally efficient estimators achieving this rate. These developments and their extensions comprise a vast literature which we will not try to survey; we refer instead to a recent book and a handful of important papers for more information on these topics.

Computationally efficient estimators achieving this rate require certain conditions besides sparsity. Denote by $\widehat{\bf X}_{n}$ the design matrix:

Several sufficient conditions on $\widehat{\bf X}_{n}$ are known to ensure the fast rates we have described, including uniform uncertainty principles, restricted isometry, sparse eigenvalues and incoherence; see e.g. and especially the paper where these conditions are compared. In this paper we focus on so-called restricted eigenvalue conditions, which are amongst the least restrictive in this class.

(Here $v_{S}$ denotes the restriction of $v$ to $S$ , cf. Section 2.) The restricted eigenvalue constant for $(A,S,\alpha)$ , denoted by ${\sf re}(A,S,\alpha)$ , is the largest value of $R>0$ such that:

Moreover, ${\sf re}(A,s,\alpha)$ is the minimum of ${\sf re}(A,S,\alpha)$ over $S\subset\{1,\dots,p\}$ with $|S|=s$ .

In the setting of (29) one may take $S$ as the support of ${\beta}_{\min}$. Assuming ${\sf re}(\widehat{\bf X}_{n},S,\alpha)$ is bounded for some specific $\alpha>0$ ensures that estimators such as the Dantzig selector and the LASSO will achieve the near-OLS error rate defined above. Here is one example by Bühlmann and van de Geer, which may be applied to a fixed-design linear regression model

Then there exists a choice of $\lambda=\lambda(\sigma^{2},n,p)$ such that, with probability $\geq 1-p^{-2}$ :

where $c>0$ depends only on ${\sf re}(\widehat{\bf X}_{n},S,3)$ .

We emphasize that this estimator has performance which nearly matches that of OLS when the support of ${\beta}_{\min}$ is known. Similar results could be achieved by trying all potential supports: the merit of the LASSO and related methods is computational efficiency.

We note in passing that there is also a fairly sizable literature on how well the LASSO and other methods do when ${\beta}_{\min}$ is only approximately sparse and a linear model is not necessarily valid. We will mostly refrain from discussing this in what follows, and refer to for further discussion of this topic.

5.2 Our result, and related work

Define diagonal matrices $\widehat{D}_{2,n}$ and $D_{2}$ corresponding to the diagonals of $\widehat{\Sigma}_{n}$ and $\Sigma$ (respectively). Set

and ${\bf X}\equiv D_{2}^{-1/2}\Sigma D_{2}^{-1/2}$ with the convention that the $(j,j)$ th entry of $\widehat{D}_{2,n}^{-1/2}$ (resp. $D_{2}^{-1/2}$ ) is zero whenever the corresponding entry of $\widehat{D}_{2,n}$ (resp. $D_{2}$ ) is zero. Assume that $\delta,\varepsilon\in(0,1/2)$ and $S\subset\{1,\dots,p\}$ with cardinality $|S|=s$ , and set

Then the following three properties hold simultaneously with probability $\geq 1-\delta$ .

For any $x$ as above, $x^{T}\widehat{\Sigma}_{n}x\geq(1-\varepsilon)^{2}\,x^{T}\Sigma\,x.$
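The normalization used in the theorem can be sketched in Python as follows. The function name `normalize_to_unit_diagonal` is hypothetical; the code applies the stated convention that an entry of $D^{-1/2}$ is zero whenever the corresponding diagonal entry is zero, and it can be applied to either $\widehat{\Sigma}_{n}$ or $\Sigma$.

```python
import numpy as np

def normalize_to_unit_diagonal(M):
    """Return D^{-1/2} M D^{-1/2}, where D is the diagonal of M.

    Follows the convention in the text: an entry of D^{-1/2} is set to
    zero whenever the corresponding diagonal entry of M is zero.
    """
    M = np.asarray(M, dtype=float)
    d = np.diag(M)
    inv_sqrt = np.zeros_like(d)
    positive = d > 0
    inv_sqrt[positive] = 1.0 / np.sqrt(d[positive])
    # Scale row j and column j by inv_sqrt[j].
    return inv_sqrt[:, None] * M * inv_sqrt[None, :]
```

After this rescaling every nonzero diagonal entry equals one, which is the sense in which the normalized matrix has unit diagonal.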

Let us note the main differences between this theorem and the results in : our theorem holds for a specific choice of $S\subset\{1,\dots,p\}$ (i.e. it is not uniform over $S$ with $|S|=s$ ) and uses the “normalized” matrix $\widehat{\bf X}_{n}$ instead of $\widehat{\Sigma}_{n}$. Both differences are related to our moment assumptions, and both turn out not to be problematic in certain scenarios, such as “randomized, RIPless compressed sensing” and statistical regression problems, where one wants to solve one problem instance and uniform guarantees are unnecessary (cf. Section 5.3 below). We note that the normalization on $\widehat{\Sigma}_{n}$ is fairly natural, as it ensures the “unit diagonal” condition in Theorem 5.1. We also note that stronger moment assumptions allow for stronger conclusions via the same proof methods; we illustrate this with a simple example.

5.3 A digression on linear regression with random design

Theorem 5.2 is quite obviously applicable to fixed design regression as in Theorem 5.1. As it turns out, it may also be applied to the random design setting discussed in Section 4.1 when the dimension $p$ is much greater than the number of samples $n$ . For simplicity we will focus on how this is done in the linear model setting (22), in which case we may apply Theorem 5.1 directly; a general model analysis would require ideas from .

with probability $1-O\left(p^{-2}\right)$ , as long as ${\sf re}({\bf X},S,c)>0$ for some $c>3$ and the other parameters are chosen in their proper ranges.

5.4 Proof ideas, and the transfer principle

Suppose $\widehat{\Sigma}_{n}$ and $\Sigma$ are matrices with non-negative diagonal entries, and assume $\eta\in(0,1)$ , $d\in\{1,\dots,p\}$ are such that

Assume $D$ is a diagonal matrix whose elements $D[j,j]$ are non-negative and satisfy $D[j,j]\geq\widehat{\Sigma}_{n}[j,j]-(1-\eta)\,\Sigma[j,j]$ . Then

Raskutti et al. prove such a bound directly for Gaussian ensembles, and note that it implies the restricted eigenvalue property when the population design matrix has this property. In our case we use Theorem 3.1 to control $\widehat{\Sigma}_{n}$ over sparse vectors, and combine it with this Lemma to obtain the appropriate control over the cone $\mathcal{C}(S,\alpha)$. As noted in the introduction, this Transfer Principle implies a version of the main result of Rudelson and Zhou; see Appendix A for details.

Proof: [of Lemma 5.1] We assume $D$ is invertible; the general case follows via a simple continuity argument. We also set

Notice that $v^{T}Av\geq 0$ for all $d$ -sparse vectors, and also that $0\leq A[j,j]\leq 1$ for each $1\leq j\leq p$ . We will prove that:

which implies the Lemma once we set $y=D^{1/2}x$ .

To prove $(\star)$ we use a probabilistic argument related to Maurey’s empirical method. We may write

where $s_{j}\in\{-1,+1\}$ is the sign of $y[j]$ and $p_{j}\equiv|y[j]|/|y|_{1}$ .
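The decomposition just described, together with the sampling step typical of Maurey's empirical method, can be sketched in Python. The sketch assumes that each random vector takes the value $|y|_{1}\,s_{j}\,e_{j}$ with probability $p_{j}$ and that the $d$ draws are averaged; the function names are hypothetical.

```python
import numpy as np

def maurey_atoms(y):
    """Return (atoms, probs) with y = sum_j probs[j] * atoms[j].

    atoms[j] = |y|_1 * s_j * e_j and probs[j] = |y[j]| / |y|_1, where s_j
    is the sign of y[j]. Requires y != 0.
    """
    y = np.asarray(y, dtype=float)
    l1 = np.abs(y).sum()
    probs = np.abs(y) / l1
    # The sign chosen for zero coordinates is irrelevant: their probability is 0.
    signs = np.where(y >= 0, 1.0, -1.0)
    atoms = l1 * np.diag(signs)  # row j is |y|_1 * s_j * e_j
    return atoms, probs

def sample_sparse_average(y, d, rng):
    """Average d i.i.d. draws from the atoms; the result is d-sparse."""
    atoms, probs = maurey_atoms(y)
    picks = rng.choice(len(probs), size=d, p=probs)
    return atoms[picks].mean(axis=0)
```

Each draw has expectation $y$, so the average also has expectation $y$, while any realization of the average has at most $d$ nonzero coordinates.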

The $p_{j}$ ’s are non-negative and sum to $1$ . Therefore we may define $v_{1},\dots,v_{d}$ to be independent and identically distributed random vectors with the following distribution:

has at most $d$ nonzero coordinates, so $v^{T}Av\geq 0$ . Taking expectations, we see that:

It remains to compute this expectation. When $i\neq r$ , $v_{i}$ and $v_{r}$ are independent and:

When $i=r$ and $v_{i}=|y|_{1}\,s_{j}\,e_{j}$ we see that $v_{i}^{T}Av_{i}=|y|_{1}^{2}A[j,j]\leq|y|_{1}^{2}$ because $A[j,j]\leq 1$ for each $j$ . Thus

Combining the two previous displays with (32) gives

5.5 Proof

In this section we prove Theorem 5.2. The first step is where we use Theorem 3.1 and Lemma 5.1.

Under the assumptions of Theorem 5.2, the following event holds with probability $\geq 1-\delta/3$ :

by the assumptions of Theorem 5.2. We begin by proving that:

More specifically, this follows from Theorem 3.1 with $d$ replacing $p$ and $8\delta/p^{d}$ replacing $\delta$ . Notice that with these choices

Plugging this into (35) and applying a union bound over $U$ gives:

This gives our first goal (34). To obtain the Lemma from this, we apply the transfer principle (Lemma 5.1) whenever Lower holds, using (33) to deduce

Proof: [of Theorem 5.2] We define Lower as in Lemma 5.2, and consider two other events:

This proves C1 in the Theorem, and also allows us to obtain:

Combining the final bound with Lower and using our assumption on $n$ we obtain

This is C2. Finally, note that ${\sf Diag}_{-}$ implies

Since this holds for any $\widehat{D}_{2,n}^{1/2}x$ as above we may use the substitution $y=\widehat{D}_{2,n}x$ to conclude:

We have proved that C1, C2 and C3 hold in the intersection ${\sf Lower}\cap{\sf Diag}_{+}\cap{\sf Diag}_{-}$. We now estimate the probability of this intersection by showing that each event has probability $\geq 1-\delta/3$. We already have this lower bound for ${\sf Lower}$ from Lemma 5.2. For ${\sf Diag}_{-}$ we will use Lemma B.2 in the appendix. Note that for each $1\leq j\leq p$

Applying Lemma B.2 for each $j$ , with the choice $t=\varepsilon^{2}n/2{\sf h}^{2}$ , gives

by our assumptions on the parameters. $\Box$

Final remarks

LASSO-type estimators like the one described here have been analyzed without restricted eigenvalue assumptions. Bartlett, Mendelson and Neeman prove that the LASSO acts as a penalized least squares regressor satisfying a sharp oracle inequality. While we do not pursue this here, one could prove such a result under weak moment assumptions similar to those of Theorem 4.1, but allowing for $n\gg\log p$ . The recipe would be to start from Lemma 5.2 above and combine the “self-normalization” ideas in the proof of Theorem 5.2 with standard methods for obtaining sharp oracle inequalities . However, in this case the penalty of the LASSO would scale as $O\left(\sqrt{\ln p/n}\,|\widehat{D}_{4,n}^{1/4}|_{1}\right)$ , where $\widehat{D}_{4,n}$ contains $n^{-1}\sum_{i}X_{i}[j]^{4}$ on the diagonal ( $1\leq j\leq p$ ) and zeros elsewhere.

An interesting question is whether one can use PAC Bayesian methods to improve upon the upper tail in random covariance matrix estimation. It seems that at least some of the constants in references could be improved by this approach. The main idea would be to use self-normalized concentration inequalities to compensate for the lack of infinitely many moments.

Another interesting problem (in the context of Theorem 5.2) is to investigate whether the PAC Bayesian method can improve on the known recovery guarantees for vectors with bounded entries. One may try to achieve this via a different choice of smoothing distribution and it seems likely that one of those choices will improve e.g. the best known bounds for sampling rows of an orthogonal matrix. This would also have some bearing on the performance of the LASSO.

Appendix A An improvement over the result of Rudelson and Zhou

In this appendix we discuss how our Transfer Principle (Lemma 5.1) can be applied to obtain an improvement of a recent result by Rudelson and Zhou . The notation and definitions from Section 5.1 are taken for granted.

The goal of Rudelson and Zhou was to show that if $\widehat{\Sigma}_{n}$ “acts like” $\Sigma$ over sparse vectors, it necessarily inherits restricted eigenvalue properties from $\Sigma$. This is important because dealing directly with the restricted eigenvalues might be complicated, whereas controlling $\widehat{\Sigma}_{n}$ over sparse vectors is typically much easier. Their precise result reads as follows. (The reader should beware that our notation and our definition of restricted eigenvalues do not coincide with theirs; what follows is a “translation” to our language.)

Then $\forall x\in\mathcal{C}(S,\alpha)\,:\,x^{T}\widehat{\Sigma}_{n}\,x\geq(1-3\varepsilon/2)\,\,x^{T}\Sigma\,x.$ In particular,

The main conceptual difference between these two Theorems is that the condition (38) implies (39) and (40) with $\gamma\approx\varepsilon$ (and a slightly different $\varepsilon$ ). Moreover, we do not require bounds on ${\sf re}(\Sigma,s,3\alpha)$ , and the numerical constants in our result are better. We also note that the full proof of Theorem A.2 (which includes Lemma 5.1 above) is quite simple and about two pages long.

Proof: [of Theorem A.2] By our assumptions, we have

We also have the condition $v^{T}\widehat{\Sigma}_{n}v\geq(1-\varepsilon)\,v^{T}\Sigma v$ for all $d$ -sparse $v$ . We may apply Lemma 5.1 with $D=M\,I_{p\times p}$ to conclude:

We now restrict attention to $x\in\mathcal{C}(S,\alpha)$ , noting that

by the definitions of $M$ and $d$ . Plugging this back into (41) gives:

By the definition of ${\sf re}(\Sigma,S,\alpha)$ , we also have

and apply this to $\xi=3\varepsilon/2$ (which is $\leq 3/4$ because $\varepsilon\leq 1/2$ ). $\Box$

Appendix B Technical estimates

note that we also used (implicitly) the fact that ${e_{j}}^{T}{Ae_{j}}\geq 0$ for all $j$ . $\Box$

B.2 Lower tail concentration for non-negative random variables

Let $W_{1},\dots,W_{n}\in[0,+\infty)$ be independent non-negative random variables with finite second moments. Then:

Apply this to $x=\xi W_{i}$ (with $\xi>0$ ) and integrate to deduce:

Now use the independence of the $W_{i}$ ’s to obtain:

The usual Bernstein’s trick finishes the proof. More specifically, we note that, for any $\lambda>0$

then bound the RHS via (42) and optimize in $\xi$ . $\Box$
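A quick simulation can illustrate this kind of lower tail bound. The sketch below assumes the standard form of the conclusion, namely that $\sum_{i}W_{i}\leq\sum_{i}\mathbb{E}W_{i}-\sqrt{2t\sum_{i}\mathbb{E}W_{i}^{2}}$ has probability at most $e^{-t}$; the exact statement of the lemma may be phrased differently, and the function names and the exponential example are hypothetical choices.

```python
import numpy as np

def lower_tail_threshold(second_moments, t):
    """Deviation s = sqrt(2 * t * sum_i E[W_i^2]) below the mean."""
    return float(np.sqrt(2.0 * t * np.sum(second_moments)))

def empirical_lower_tail(sample_sums, mean_sum, threshold):
    """Fraction of simulated sums falling at or below mean_sum - threshold."""
    return float(np.mean(np.asarray(sample_sums) <= mean_sum - threshold))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, trials, t = 50, 100_000, 2.0
    # Exponential(1) variables: E[W] = 1 and E[W^2] = 2.
    W = rng.exponential(scale=1.0, size=(trials, n))
    s = lower_tail_threshold(np.full(n, 2.0), t)
    freq = empirical_lower_tail(W.sum(axis=1), mean_sum=float(n), threshold=s)
    print(f"empirical {freq:.5f} <= bound {np.exp(-t):.5f}")
```

Only second moments enter the threshold, which is why bounds of this type apply to heavy-tailed non-negative variables.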

B.3 Proof of Proposition 4.1

Proof: [of Proposition 4.1] The key step in the proof is to show that the sequence of random variables $U_{0}\equiv 1$ ,

To prove that $U_{i}$ is indeed a supermartingale, note that:

and the conditional Jensen inequality implies:

Note that $D-D^{\prime}$ is a symmetric random variable. Therefore: