Concentration Inequalities and Moment Bounds for Sample Covariance Operators
Vladimir Koltchinskii, Karim Lounici
Introduction
The nuclear norm is defined as the infimum of the sums (1.2) over all the sequences such that representation (1.1) holds.
Let be i.i.d. copies of The sample (empirical) covariance operator based on the observations is defined as the operator such that
Clearly, this is an operator of rank at most and it is an unbiased estimator of the covariance operator
If are i.i.d. centered random variables with then the sum satisfies the following version of Bernstein’s inequality: for all with probability at least
Our goal is to obtain moment bounds and concentration inequalities for the operator norm It turns out that both the size of the expectation of random variable and its concentration around its mean can be characterized in terms of the operator norm and another parameter defined below.
Assuming that is a centered Gaussian random variable in with covariance operator define
The main results of the paper include the following:
under an assumption that are i.i.d. centered Gaussian random variables in with covariance operator it will be shown that
Moreover, under an additional assumption that the following concentration inequality holds for some constant and for all with probability at least
Under an assumption that the concentration inequality becomes
Main results
The problem of bounding the operator norm has been intensively studied, especially, in the finite-dimensional case (see and references therein). The focus has been on understanding of dependence of this norm on the dimension of the space and on the sample size (that could be simultaneously large) as well as on the tails of linear forms and of the norm of random variable Many results that hold for Gaussian random variables are also true in a slightly more general subgaussian case.
A centered random variable in will be called subgaussian iff, for all
We will also need the following definition.
A weakly square integrable centered random variable in with covariance operator is called pregaussian iff there exists a centered Gaussian random variable in with the same covariance operator
There exists an absolute constant such that, for all with probability at least
The proof of this theorem is based on a simple -net argument that allows one to reduce bounding the operator norm to bounding the finite maximum
where is a -net of the unit sphere of cardinality The bounding of the finite maximum is based on a version of Bernstein inequality for the sum of independent random variables combined with the union bound (see the proof of Theorem 5.39 in and the comments after this theorem).
Another approach to bounding the operator norm was developed by Rudelson and it is based on a noncommutative Khintchine inequality due to Lust-Picard and Pisier . This method can be used not only in subgaussian, but also in “heavy tailed” cases and it leads, for instance, to the following expectation bound (see Vershynin , Theorem 5.48):
In each of the above approaches, the bounds are not dimension free (at least, with a straightforward application of noncommutative Bernstein or Khintchine inequalities) and they could not be directly used in the infinite-dimensional case. We will use below a different approach based on recent deep results on generic chaining bounds for empirical processes. The following facts about generic chaining complexities will be needed. Let and Given a metric space an increasing sequence of partitions of is called admissible if For denotes the unique set of the partition that contains For denotes the diameter of set Define
where the infimum is taken over all admissible sequences.
The following fundamental result is due to Talagrand (see ; it was initially stated it terms of majorizing measures rather than generic complexities).
Let be a centered Gaussian process and suppose that
Then, there exists an absolute constant such that
In what follows, generic chaining complexities are used in the case when is a function class on a probability space and is the metric generated by either -norm, or by the -norm with respect to We will use the following result due to Mendelson (although an earlier, simpler and weaker version, with instead of that goes back to Klartag and Mendelson would suffice for our purposes).
Let be i.i.d. weakly square integrable centered random vectors in with covariance operator If is subgaussian and pregaussian, then
proof. The proof of the upper bound relies on the generic chaining bound of Theorem 3, while the proof of the lower bound is rather elementary.
where {\mathcal{F}}:=\Bigl{\{}\langle\cdot,u\rangle:u\in U_{E^{\ast}}\Bigr{\}}, and is the distribution of random variable
Since is subgaussian, the - and -norms of linear functionals are both equivalent to the -norm. This implies that
Also, since is pregaussian, there exists a centered Gaussian random variable in with the same covariance This means that
Using Talagrand’s Theorem 2, we easily get that
Lower Bound. To prove the lower bound, note that
For a fixed with and denote
By a straightforward computation, for all the random variables and are uncorrelated. Since they are jointly Gaussian, it follows that and are independent. Define
Then and are also independent. We easily get
Note that, conditionally on the distribution of random variable
is Gaussian and it coincides with the distribution of the random variable
are i.i.d. standard normal random variables. It is easy to check that
for a positive numerical constant implying that
We now combine this bound with (2) and (2) to get
for some numerical constant Thus, we get
provided is chosen to be small enough to satisfy
This completes the proof in the case when since in this case
On the other hand, under the assumption that
which completes the proof in the case when
Our next goal is to prove a concentration inequality for around its median or around its expectation. In what follows, denotes a median of a random variable
Let be i.i.d. centered Gaussian random vectors in with covariance and let be either the median, or the expectation of Then, there exists a constant such that the folllowing holds. If then for all with probability at least
On the other hand, if then with the same probability
In the case when is the median, this result is an immediate consequence of Theorem 4 and Theorem 6 that is given below and that provides an equivalent concentration inequality written in a somewhat implicit form. The bounds of Theorem 5 in the case when is the median imply that
when This, in turn, implies the concentration bound in the case when is the expectation.
Let be i.i.d. centered Gaussian random vectors in with covariance and let be the median of Then, there exists a constant such that for all with probability at least
The proof of Theorem 6 is somewhat long and will be given in the next section. Here we will state a couple corollaries of this theorem.
Under the assumptions and notations of Theorem 6, there exists a constant such that, for all with probability at least
proof. The proof easily follows from the next simple bound:
The following corollary can be viewed as an infinite-dimensional generalization of Theorem 1.
Under the assumptions and notations of Theorem 6, there exists a constant such that, for all with probability at least
proof. Bound (2.8) follows immediately from Corollary 1 and Theorem 4. Bound (2.9) follows from (2.8) by integrating the tail probabilities.
Proof of the concentration inequality
In this section, we provide a proof of Theorem 6. We will use the following well known fact (see, e.g., ).
Let be a centered Gaussian random variable in a separable Banach space Then there exists a sequence of vectors in and a sequence of i.i.d. standard normal random variables such that
where the series in the right hand side converges in a.s. and
Note that under the assumptions and notations of Theorem 7,
It easily follows from Theorem 7 that, for we have
Let now be the covariance operator of and be the sample covariance operator based on observations (with the notation having an obvious meaning and the sample size being fixed). Then,
Thus, it is enough to prove the theorem only in the case when
The general case would then follow by a straightforward limiting argument.
The main ingredient of the proof is the classical Gaussian concentration inequality (see, e.g., Ledoux and Talagrand , p. 21).
where is the distribution function of a standard normal random variable.
This result easily follows from the Gaussian isoperimetric inequality. We will also need another consequence of this inequality:
Under the assumptions of Lemma 1, suppose that for some and for some
Then, there exists a constant (possibly depending on ) such that, for all with probability at least
Lemma 1 will be applied to the function We have to check the Lipschitz condition for this function. To this end, we will prove the following lemma.
proof. Obviously, implying that
It is enough to consider the case when or (otherwise, the claim of the lemma is obvious). To be specific, assume that Then, using the assumption that is Lipschitz with constant we get
We will now control Note that
Substituting the last bound in (3.3), we get
In view of (3.2), the left hand side is also bounded from above by which allows one to get from (3.5) that
It is also easy to check that the same bound holds in the opposite case, too. As a consequence, (3.6) implies that with some numerical constant
Combining this with bound (3.7) yields (3.1).
It follows from lemmas 1 and 3 that, for all with probability at least
where is a numerical constant. We will use this bound to get that, on the event where and, at the same time, concentration bound (3.8) holds, we have
There exists a constant such that for all with probability at least
In particular, this implies that, for some constant
and, by Bernstein’s inequality for -random variables, with probability at least
The proof of (3.11) immediately follows by taking
We will define for as follows:
It is easy to see that (provided that constant is chosen to be sufficiently large). Note also that
Thus, by induction, is a nonincreasing sequence. In view of definition of it follows from (3.9) that for all
Define as follows:
where we also used (3.14). Taking into account (3.12) and (3.13), we get that for some constant
and also that, for some constant and for
Using now (3.15) with instead of it is easy to get that with probability at least
where we used the notation In the case when we have
Hence, doubling the value of the constant allows us to drop the two terms involving On the other hand, assume that with a sufficiently large constant (to be determined later). Observe that and we can use a bound for the median similar to (2.3):
for some constants and for We also used the fact that for Gaussian
Since also this implies that with some constant and with the same probability
Take now to be equal to the expression in the right hand side of bound (3) and use this value of to do another iteration of bound (3.9). This easily yields that with some constant and with probability at least
To complete the proof of concentration inequality (2.6), note that, for an arbitrary on the event where (3.8) holds and also
(provided that ). Note also that
Then, it follows from Lemma 2 that, for a sufficiently large constant and for all with probability at least the following bound holds:
Recall also that on the event where of probability at least Therefore, with probability at least
The result now follows by substituting given by (3.19) into bound (3.20), doing simple algebra and adjusting the value of constant to get the probability bound
Very recent exponential generic chaining bounds for empirical processes by Dirksen (see Corollary 5.7) and by Bednorz (see Theorem 1) imply the following (earlier, Mendelson , Theorem 3.1 obtained another version of exponential generic chaining bounds for the same class of processes).
Let be i.i.d. random variables in a measurable space with common distribution and let be a class of measurable functions on There exists a constant such that for all with probability at least
This result together with the argument used in the proof of the upper bound of Theorem 4 easily implies the following generalization of Corollary 2.
Let be i.i.d. weakly square integrable centered random vectors in with covariance operator If is subgaussian and pregaussian, then there exists a constant such that, for all with probability at least
Note that the proof of concentration inequality of Theorem 6 does not rely on generic chaining bounds, it relies only on the Gaussian isoperimetric inequality. The bound of Theorem 9 (based on the generic chaining method) could be used to provide a shortcut in the proof of the concentration inequality. To this end, instead of using very rough initial bound based on Lemma 4 one should use much more precise bound of Theorem 9. In this case, there is no need to implement an iterative argument improving the bound, the concentration inequality in its explicit form (Theorem 5) follows just by an application of the Gaussian isoperimetric inequality. Adamczak suggested an alternative approach to the proof of Theorem 5. It is based on a version of a concentration inequality for Gaussian chaos and on some other tools (such as Gordon-Chevet inequality), but it does not rely on the generic chaining bounds.
Acknowledgments. The authors are very thankful to Sjoerd Dirksen for attracting their attention to paper . Radek Adamczak pointed out that a similar result was proved in .
The authors are especially thankful to Radek Adamczak for providing an alternative proof of the concentration inequality and for very helpful discussions. The initial version of Theorem 5 was under an extra assumption that We improved our argument after Adamczak had provided his alternative proof.