The lower tail of random quadratic forms, with applications to ordinary least squares and restricted eigenvalue properties
Roberto Imbuzeiro Oliveira
Introduction
The most basic problem is computing how many samples are needed to bring close to . One needs at least to bring close to , so that the ranks of the two matrices can match. A basic problem is to find conditons under which samples are enough for guaranteeing
where depends only on and on moment assumptions on the ’s.
A well known bound by Rudelson implies samples are necessary and sufficient if the vectors have uniformly bounded norms. Removing the factor is relatively easy for subgaussian vectors , but even the seemingly nice case of logconcave random vectors (which have subexponential moments) had to wait for the breakthrough papers by Adamczak et al . The current best results hold when the and all of their projections have moments , and when their one-dimensional marginals have moments ; in the latter case one also needs (necessarily) a high probability bound on . None of those finite-moment results gives strong concentration bounds.
It turns out that, for many important applications, only the lower tail of matters. That is, we only need that is not much smaller than for all vectors in a suitable set. Our main result in this paper is that this lower tail is subgaussian under extremely weak conditions. More precisely, we will prove that if there exists a such that
then samples are enough to guarantee an asymmetric version of (2), to wit:
This follows from a more precise result – Theorem 3.1 in Section 3 below – about the more general case of sums of independent and identically distributed positive semidefinite random matrices. We note that the dependence on in our bound is optimal for vectors with independent coordinates, as can be shown via the Bai-Yin theorem .
Let us briefly comment on some proof ideas we think might be useful elsewhere. Theorem 3.1, our main result, is proven via so called PAC Bayesian methods and is inspired by the recent paper by Audibert and Catoni . We will see that this method allows one to translate properties of moment generating functions of individual random variables into uniform control of certain empirical processes. This is discussed in more detail in Section 3.2.
Organization: The next section covers some preliminaries and defines the notation we use. Section 3 contains the statement and proof of the main result, Theorem 3.1, along with a discussion of the assumptions and a proof overview. Section 4 presents our result on ordinary least squares, giving some background for the problem. Section 5 follows a similar format for restricted eigenvalues. The final section presents some remarks and open problems. Two Appendices contain a discussion of our improvement over , and some estimates used in the main text.
Notation and preliminaries
The restriction of to a subset is the vector with for and for .
We use asymptotic notation somewhat informally, in order to illustrate our results with clean statements. We write or to indicate that is very small, and to say that is bounded by a universal constant.
Finally, we state for later use the Burkholder-Davis-Gundy inequality. Let denote a martingale with finite -th moments () and . Then:
Note that the first inequality above is the BDG inequality with optimal constant, and the second inequality follows from Minkowski’s inequality for the norm. We also observe that (7) implies a result for which are i.i.d. random variables:
Better inequalities are known in this case, but we will use (8) for simplicity.
The subgaussian lower tail
The goal of this section is to discuss and prove our main result.
Let us recall that in the vector case the main assumption we need is that
An obvious case where (9) holds is when are independent, have finite fourth moments and mean . A short calculation shows that we may take
Significantly, the same calculations also work when are four-wise independent; this will be interesting when considering compressed sensing-type applications (cf. Example 1 below). Changing to allows us to consider translations and linear transformations of .
These particular cases include many important examples, such as gaussian, subgaussian, logconcave vectors and their affine transformations. There are also many examples with unbounded moments. If we multiply by an independent scalar with
we just need to replace with . Interestingly, the upper tail of is quite sensitive to this kind of transformation. Even multiplying by a Gaussian random variable may result in an ensemble that does not obey the analogue of the main theorem (cf. the discussion in [30, Section 1.8]).
2 Proof overview and a preliminary PAC Bayesian result
is a sum of random variables which are independent, identically distributed and non negative. Such sums are well known to have subgaussian lower tails under weak assumptions; see eg. Lemma B.2 below.
To make these ideas more definite we present a technical result that encapsulates the main ideas in our PAC Bayesian approach. This requires some conditions.
are well defined and depend continuously on . We will use the notation to denote the integral of (which may also depend on other parameteres) over the variable with the measure .
In the next subsection we will apply this to prove Theorem 3.1. Here is a brief overview: we will performe a change of cordinates under which . We will then define as
is a new term introduced by the“smoothing operator” . The choice will ensure that this term is small, and the “other terms” will also turn out to be manageable. The actual proof will be slightly complicated by the fact that we need to truncate the operator to ensure that is highly concentrated.
Proof: [of Proposition 3.1] As a preliminary step, we note that under our assumptions the map:
is measurable. This implies that the event in the statement of the proposition is indeed a measurable set.
To continue, recall the definition of Kullback Leiber divergence (or relative entropy) for probability measures over a measurable space :
But this follows from Markov’s inequality and Fubini’s Theorem:
3 Proof of the main result
The goal of our proof is to show that, for any :
Replacing with above and using homogeneity reduces this goal to showing:
This is what we will show in the remainder of the proof.
Fix some and define (with hindsight) truncated operators
with the convention that this is simply if . We collect some estimates for later use.
Fix . We will apply Proposition 3.1 with and
plus the fact that, for any non-negative, square-integrable random variable ,
(this is shown in the proof of Lemma B.2 in the Appendix). We deduce from Proposition 3.1 that, with probability ,
The first two terms inside the brackets are non-negative and, by Cauchy Schwartz, the absolute value of the rightmost term is at most the sum of the other two. We deduce:
Taking expectations, applying Lemma 3.1 and recalling gives:
This holds for any . Optimizing over gives:
The overestimates , and finish the proof of (15). This in turn finishes the proof of Theorem 3.1 except for Lemma 3.1, which is provn below.
Proof: [of Lemma 3.1] The first item is immediate. The third item follows from and Lemma B.1 in Section B.1.
Combining the last three inequalities finishes the proof.
It is instructive to compare this proof with what one would obtain without truncation. In that case everything would go through except for the step where we apply Bennett’s inequality.
Ordinary least squares under random design
as small as possible. In other words, one is trying to find a linear combination of the coordinates of that is as close as possible to in terms of mean-square error. The random design setting should be contrasted with the technically simpler case of fixed design, where the ’s are assumed fixed and all randomness is in the ’s. Results about this setting are not indicative about out-of-sample prediction, a crucial property in many tasks where least squares is routinely used, as well as in theoretical problems such as linear aggregration; see for further discussion.
This estimator is not hard to study when is large, is much smaller than and a linear model is assumed:
Here we want to consider a completely model-free, non-parametric setting where no specific relationship between and is assumed. Moreover, we want to allow for large , with the only condition is that should be small. This rules out using classical asymptotic theory (which is not quantitative) as well as Barry-Esséen-type bounds (which do not work for ; see for the best known bounds).
2 Our result, and previous work
Choose and assume:
The proof of Theorem 4.1 consists of three steps. One is to use an explicit expression for OLS in order to express . Theorem 3.1 is used to prove that a matrix that appears in the expression for this difference has bounded norm. The third step is to control the remaining expression, which is a sum of i.i.d. random vectors that we analyze via Lemma 4.1 below.
3 The proof
Proof: [of Theorem 4.1] We will assume that has full rank; the general case follows from a simple perturbation argument. We also define as in (1), that is,
The assumptions on of Theorem 4.1 imply those of Theorem 3.1 (with ). Tis implies that the event
The are independent vectors whose law is the same as that of in Theorem 4.1. This implies that the following Lemma may be applied.
The Lemma implies that the event Vector defined below,
The two rightmost terms in the previous display are bounded via Lower and Vector. To see this we begin by applying (6) above:
Plugging these bounds into (27) results in
and this inequality holds whenever occurs. In particular, the probability of the last display satisfies the bound claimed in the Theorem.
4 Proof of the auxiliary result on sums of random vectors
Proof: [of Lemma 4.1] Write and , . We note that:
is a martingale with respect to the filtration ,
Now define if , and otherwise. The following random variable will be important later on.
We will use the following estimates (proven subsequently).
We will also use the following simple fact about martingales, which we prove in the appendix
Suppose is a square-integrable martingale with respect to a filtration . Define and
Combining this with Claim 1 and the definition of shows that, for any choice of , , we have:
Now fix some , make the choice of
and apply (28) to the value achieving the maximum of . We have that, with probability ,
with probability This is precisely the desired result once we choose:
Proof: [of Claim 1] We prove the second (harder) assertion first. Note that
is a martingale with respect to the filtration . The Burkholder-Davis-Gundy inequality (7) implies, for any ,
Using again that always, . We deduce:
Restricted eigenvalues in high dimensions
where represent some kind of noise and – most importantly – the dimension may greatly exceed the number of measurements. The aforementioned fields tend to interpret this setup in different ways. Whereas in Compressed Sensing one tends to think of the ’s as measurement vectors as controlled by the “experimenter”, for a statistician the and are generated by a random process that is not under control (and the whole problem corresponds to linear regression ; Section 5.3 below).
It should be clear that, given , the above problem is severely underdetermined. However, sparsity may be used as a key enabling assumption. It is known that if the vector has non-zero coordinates, then it may be recovered up to error of the order . This is only times larger than the error of OLS which “knows” the support of . Most importantly, there are computationally efficient estimators achieving this rate. These developments and their extensions comprise a vast literature which we will not try to survey; we refer instead to a recent book and a handful of important papers for more information on these topics.
Computationally efficient estimators achieving this rate require certain conditions besides sparsity. Denote by the design matrix:
Several sufficient conditions on are known to ensure the fast rates we have described, including uniform uncertainty principles, restricted isometry, sparse eigenvalues and incoherence; see eg. and especially the paper where these conditions are compared. In this paper we focus on so-called restricted eigenvalue conditions, which are amongst the least restrictive in this class.
(Here denotes the restriction of to , cf. Section 2.) The restricted eigenvalue constant for , denoted by , is the largest value of such that:
Moreover, is the minimum of over with .
In the setting of (29) one may take as the support of . Assuming is bounded for some specific ensures that estimators such as the Dantzig selector and the LASSO will achieve the near-OLS error rate defined above. Here is one example by Buhlmann and Van der Geer which may be applied to a fixed-design linear regression model
Then there exists a choice of such that, with probability :
where depends only on .
We emphasize that this estimator has performance which nearly matches that of OLS when the support of is known. Similar results could be achieved by trying all potential supports: the merit of the LASSO and related methods is computational efficientcy.
We note in passing that there is also a fairly sizable literature on how well the LASSO and other methods do when is only approximately sparse and a linear model is not necessarily valid. We will mostly refrain from discussing this in what follows, and refer to for further discussion of this topic.
2 Our result, and related work
Define diagonal matrices and corresponding to the diagonals of and (respectively). Set
and with the convention that the th entry of (resp. ) is zero whenever the corresponding entry of (resp. ) is zero. Assume that and with cardinality , and set
Then the following three properties hold simultaneously with probability .
For any as above,
Let us note the main differences between this theorem and the results in : our theorem holds for a specific choice of – ie. it is not uniform over with – and uses the “normalized” matrix instead of . Both differences are related to our moment assumptions, and both turn out not to be problematic in certain scenarios, such as “randomized, RIPless compressed sensing” and statistical regression problems, where one wants to solve one problem instance and uniform guarantees are unnecessary (cf. Section 5.3 below). We note that the normalization on is farly natural, at it ensures the “unit diagonal” condition in Theorem 5.1. We also note that stronger moment assumptions allow for stronger conclusions via the same proof methods; we illustrate this with a simple example.
3 A digression on linear regression with random design
Theorem 5.2 is quite obviously applicable to fixed design regression as in Theorem 5.1. As it turns out, it may also be applied to the random design setting discussed in Section 4.1 when the dimension is much greater than the number of samples . For simplicity we will focus on how this is done in the linear model setting (22), in which case we may apply Theorem 5.1 directly; a general model analysis would require ideas from .
with probability , as long as for some and the other parameters are chosen in their proper ranges.
4 Proof ideas, and the transfer principle
Suppose and are matrices with non-negative diagonal entries, and assume , are such that
Assume is a diagonal matrix whose elements are non-negative and satisfy . Then
Raskutti et al. prove such a bound directly for Gaussian ensembles, and note that it implies the restricted eigenvalue property when the population design matrix has this property. In our case we use Theorem 3.1 to control of over sparse vectors, and combine it with this Lemma to obtain the appropriate control over the cone . As noted in the introduction, this Transfer Principle implies a version of the main result of Rudelson and Zhou ; see Appendix A for details.
Proof: [of Lemma 5.1] We assume is invertible; the general case follows via a simple continuity argument. We also set
Notice that for all -sparse vectors, and also that for each . We will prove that:
which implies the Lemma once we set .
To prove we use a probabilistic argument related to Maurey’s empirical method. We may write
where is the sign of and .
The ’s are non-negative and sum to . Therefore we may define to be independent and identically distributed random vectors with the following distribution:
has at most nonzero coordinates, so . Taking expectations, we see that:
It remains to compute this expectation. When , and are independent and:
When and we see that because for each . Thus
Combining the two previous displays with (32) gives
5 Proof
In this section we prove Theorem 5.2. The first step is where we use Theorem 3.1 and Lemma 5.1.
Under the assumptions of Theorem 5.2, the following event holds with probability :
by the assumptions of Theorem 5.2. We begin by proving that:
More specifically, this follows from Theorem 3.1 with replacing and replacing . Notice that with these choices
Plugging this into (35) and applying a union bound over gives:
This gives our first goal (34). To obtain the Lemma from this, we apply the transfer principle (Lemma 5.1) whenever Lower holds, using (33) to deduce
Proof: [of Theorem 5.2] We define Lower as in Lemma 5.2, and consider two other events:
This proves C1 in the Theorem, and also allows us to obtain:
Combining the final bound with Lower and using our assumption on we obtain
This is C2. Finally, note that implies
Since this holds for any as above we may use the substitution to conclude:
We have proved that C1, C2 and C3 hold in the intersection LowerDiagDiag-. We now estimate the probability of this intersection by showing that each event has probability . We already have this lower bound for Lower from Lemma 5.2. For Diag- we will use Lemma B.2 in the appendix. Note that for each
Applying Lemma B.2 for each , with the choice , gives
by our assumptions on the parameters.
Final remarks
LASSO-type estimators like the one described here have been analyzed without restricted eigenvalue assumptions. Bartlett, Mendelson and Neeman prove that the LASSO acts as a penalized least squares regressor satisfying a sharp oracle inequality. While we do not pursue this here, one could prove such a result under weak moment assumptions similar to those of Theorem 4.1, but allowing for . The recipe would be to start from Lemma 5.2 above and combine the “self-normalization” ideas in the proof of Theorem 5.2 with standard methods for obtaining sharp oracle inequalities . However, in this case the penalty of the LASSO would scale as , where contains on the diagonal () and zeros elsewhere.
An interesting queston is whether one can use PAC Bayesian methods to impove upon the upper tail in random covariance matrix estimation. It seems that at least some of the constants in references could be improved by this approach. The main idea would be to use self-normalized concentration inequalities to compensate for the lack of infinitely many moments.
Another intersting problem (in the context of Theorem 5.2) is to investigate whether the PAC Bayesian method can improve on the known recovery guarantees for vectors with bounded entries . One may try to achieve this via a different choice of smoothig distribution and it seems likely that one of those choices will improve e.g. the best known bounds for sampling rows of an orthogonal matrix. This would also have some bearing on the performance of the LASSO.
Appendix A An improvement over the result of Rudelson and Zhou
In this appendix we discuss how our Transfer Principle (Lemma 5.1) can be applied to obtain an improvement of a recent result by Rudelson and Zhou . The notation and definitions from Section 5.1 are taken for granted.
The goal of was to show that if “acts like” over sparse vectors, it necessarily inherits restricted eigenvalue properties from . This is important because dealing directly with the restricted eigenvalues might be complicated, whereas controlling over sparse vectors is typically much easier. Their precise result reads as followsThe reader should beware that our notation and our definition of restricted eigenvalues do not coincide with that of . What follows is a “translation” to our language.:
Then In particular,
The main conceptual difference between these two Theorems is that the condition (38) implies (39) and (40) with (and a slightly different ). Moreover, we do not require bounds on , and the numerical constants in our result are better. We also note that the full proof of Theorem A.2 (which includes Lemma 5.1 above) is quite simple and about two pages long.
Proof: [of Theorem A.2] By our assumptions, we have
We also have the condition for all -sparse . We may apply Lemma 5.1 with to conclude:
We now restrict attention to , noting that
by the definitions of and . Plugging this back into (41) gives:
By the definition of , we also have
and apply this to (which is because ).
Appendix B Technical estimates
note that we also used (implicitly) the fact that for all .
B.2 Lower tail concentration for non-negative random variables
Let be independent non-negative random variables with finite second moments. Then:
Apply this to (with ) and integrate to deduce:
Now use the independent of the ’s to obtain:
The usual Bernstein’s trick finishes the proof. More specifically, we note that, for any
then bound the RHS via (42) and optimize in .
B.3 Proof of Proposition 4.1
Proof: [of Proposition 4.1] The key step in the proof is to show that the sequence of random variables ,
To prove that is indeed a supermartingale, note that:
and the conditional Jensen inequality implies:
Note that is a symmetric random variable. Therefore: