Concentration Inequalities and Moment Bounds for Sample Covariance Operators

Vladimir Koltchinskii, Karim Lounici

Introduction

The nuclear norm $\|A\|_{1}$ is defined as the infimum of the sums (1.2) over all the sequences $\{x_{n}:n\geq 1\}\subset E_{1}^{\ast},\ \{y_{n}:n\geq 1\}\subset E_{2}$ such that representation (1.1) holds.

Let $X_{1},\dots,X_{n}$ be i.i.d. copies of $X.$ The sample (empirical) covariance operator based on the observations $(X_{1},\dots,X_{n})$ is defined as the operator $\hat{\Sigma}:E^{\ast}\mapsto E$ such that

$$\hat{\Sigma}u:=n^{-1}\sum_{j=1}^{n}\langle X_{j},u\rangle X_{j},\quad u\in E^{\ast}.$$

Clearly, this is an operator of rank at most $n$ and it is an unbiased estimator of the covariance operator $\Sigma.$
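
Both claims can be checked in one line. The following is a short sketch, assuming the standard definitions $\hat{\Sigma}u=n^{-1}\sum_{j=1}^{n}\langle X_{j},u\rangle X_{j}$ and $\Sigma u=\mathbb{E}\langle X,u\rangle X$ for a centered $X$ (the symbols are those of the text; the exact form of the definitions is an assumption of this sketch):

$$\mathbb{E}\hat{\Sigma}u=n^{-1}\sum_{j=1}^{n}\mathbb{E}\langle X_{j},u\rangle X_{j}=\mathbb{E}\langle X,u\rangle X=\Sigma u,\qquad \hat{\Sigma}u\in{\rm span}\{X_{1},\dots,X_{n}\},\quad u\in E^{\ast}.$$

The first identity uses only that each $X_{j}$ has the same distribution as $X;$ the second shows that the range of $\hat{\Sigma}$ has dimension at most $n.$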

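If $\xi,\xi_{1},\dots,\xi_{n}$ are i.i.d. centered random variables with $\|\xi\|_{\psi_{1}}<+\infty,$ then the sum $\xi_{1}+\dots+\xi_{n}$ satisfies the following version of Bernstein’s inequality: for all $t\geq 0$ with probability at least $1-e^{-t}$

$$\bigl|\xi_{1}+\dots+\xi_{n}\bigr|\leq C\|\xi\|_{\psi_{1}}\Bigl(\sqrt{nt}\vee t\Bigr),$$

where $C>0$ is an absolute constant.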

Our goal is to obtain moment bounds and concentration inequalities for the operator norm $\|\hat{\Sigma}-\Sigma\|.$ It turns out that both the size of the expectation of random variable $\|\hat{\Sigma}-\Sigma\|$ and its concentration around its mean can be characterized in terms of the operator norm $\|\Sigma\|$ and another parameter defined below.

Assuming that $X$ is a centered Gaussian random variable in $E$ with covariance operator $\Sigma,$ define

$${\bf r}(\Sigma):=\frac{(\mathbb{E}\|X\|)^{2}}{\|\Sigma\|}.$$

The main results of the paper include the following:

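under an assumption that $X,X_{1},\dots,X_{n}$ are i.i.d. centered Gaussian random variables in $E$ with covariance operator $\Sigma,$ it will be shown that

$$\mathbb{E}\|\hat{\Sigma}-\Sigma\|\asymp\|\Sigma\|\max\Bigl\{\sqrt{\frac{{\bf r}(\Sigma)}{n}},\frac{{\bf r}(\Sigma)}{n}\Bigr\}.$$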

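Moreover, under an additional assumption that ${\bf r}(\Sigma)\lesssim n,$ the following concentration inequality holds for some constant $C>0$ and for all $t\geq 1$ with probability at least $1-e^{-t}:$

$$\Bigl|\|\hat{\Sigma}-\Sigma\|-\mathbb{E}\|\hat{\Sigma}-\Sigma\|\Bigr|\leq C\|\Sigma\|\Bigl(\sqrt{\frac{t}{n}}\vee\frac{t}{n}\Bigr).$$

Under an assumption that ${\bf r}(\Sigma)\gtrsim n,$ the concentration inequality becomes

$$\Bigl|\|\hat{\Sigma}-\Sigma\|-\mathbb{E}\|\hat{\Sigma}-\Sigma\|\Bigr|\leq C\|\Sigma\|\Bigl(\sqrt{\frac{{\bf r}(\Sigma)}{n}}\sqrt{\frac{t}{n}}\vee\frac{t}{n}\Bigr).$$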

Main results

The problem of bounding the operator norm $\|\hat{\Sigma}-\Sigma\|$ has been intensively studied, especially in the finite-dimensional case (see and references therein). The focus has been on understanding the dependence of this norm on the dimension of the space and on the sample size $n$ (both of which could be simultaneously large), as well as on the tails of linear forms $\langle X,u\rangle,u\in E^{\ast}$ and of the norm $\|X\|$ of the random variable $X.$ Many results that hold for Gaussian random variables are also true in a slightly more general subgaussian case.
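
For a concrete feel for how $\|\hat{\Sigma}-\Sigma\|$ depends on the sample size in the finite-dimensional case, here is a minimal simulation sketch in plain Python. It assumes a centered Gaussian vector in $\mathbb{R}^{d}$ with a diagonal covariance; the helper names `mat_vec`, `op_norm` and `sample_cov_error`, the spectrum and the sample sizes are illustrative choices that do not come from the text. The operator norm is approximated by power iteration, which is adequate for a sketch but is not a substitute for a proper eigenvalue routine.

```python
import math
import random


def mat_vec(a, v):
    """Multiply a square matrix (given as a list of rows) by a vector."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in a]


def op_norm(a, iters=200):
    """Approximate the operator norm of a symmetric matrix.

    Runs power iteration on a^2, so the result approximates max |eigenvalue|.
    The fixed start vector is an illustrative choice; it must not be
    orthogonal to the top eigenspace for the iteration to converge.
    """
    d = len(a)
    v = [1.0 + i / d for i in range(d)]
    s0 = math.sqrt(sum(x * x for x in v))
    v = [x / s0 for x in v]
    for _ in range(iters):
        w = mat_vec(a, mat_vec(a, v))
        s = math.sqrt(sum(x * x for x in w))
        if s == 0.0:  # a^2 v = 0, e.g. for the zero matrix
            return 0.0
        v = [x / s for x in w]
    av = mat_vec(a, v)
    return math.sqrt(sum(x * x for x in av))


def sample_cov_error(spectrum, n, rng):
    """Operator norm of (sample covariance - covariance) for N(0, diag(spectrum))."""
    d = len(spectrum)
    acc = [[0.0] * d for _ in range(d)]
    for _ in range(n):
        # One centered Gaussian observation with independent coordinates.
        x = [math.sqrt(s) * rng.gauss(0.0, 1.0) for s in spectrum]
        for i in range(d):
            for k in range(d):
                acc[i][k] += x[i] * x[k]
    diff = [[acc[i][k] / n - (spectrum[i] if i == k else 0.0) for k in range(d)]
            for i in range(d)]
    return op_norm(diff)


if __name__ == "__main__":
    rng = random.Random(0)
    spectrum = [4.0, 1.0, 0.5, 0.25]  # hypothetical eigenvalues of Sigma
    for n in (50, 500, 5000):
        print(n, sample_cov_error(spectrum, n, rng))
```

Running the script prints one error value per sample size; no mean is subtracted in `sample_cov_error`, in line with the centered setting of the text.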

A centered random variable $X$ in $E$ will be called subgaussian iff, for all $u\in E^{\ast},$

$$\|\langle X,u\rangle\|_{\psi_{2}}\lesssim\|\langle X,u\rangle\|_{L_{2}(\mathbb{P})}.$$

We will also need the following definition.

A weakly square integrable centered random variable $X$ in $E$ with covariance operator $\Sigma$ is called pregaussian iff there exists a centered Gaussian random variable $Y$ in $E$ with the same covariance operator $\Sigma.$

There exists an absolute constant $C>0$ such that, for all $t\geq 1,$ with probability at least $1-e^{-t}$

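The proof of this theorem is based on a simple $\varepsilon$-net argument that allows one to reduce bounding the operator norm $\|\hat{\Sigma}-\Sigma\|$ to bounding the finite maximum

$$\max_{u\in M}\Bigl|n^{-1}\sum_{j=1}^{n}\langle X_{j},u\rangle^{2}-\mathbb{E}\langle X,u\rangle^{2}\Bigr|,$$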

where $M\subset S^{d-1}$ is a $1/4$-net of the unit sphere of cardinality ${\rm card}(M)\leq 9^{d}.$ The bounding of the finite maximum is based on a version of Bernstein’s inequality for the sum of independent $\psi_{1}$ random variables $\langle X_{j},u\rangle^{2}$ combined with the union bound (see the proof of Theorem 5.39 in Vershynin and the comments after this theorem).
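
The reduction to the net can be sketched as follows for a symmetric $d\times d$ matrix $A$ (here $A$ plays the role of $\hat{\Sigma}-\Sigma;$ this identification, and the notation $|\cdot|$ for the Euclidean norm, are assumptions of the sketch). Given $u\in S^{d-1},$ choose $v\in M$ with $|u-v|\leq 1/4.$ Then

$$\bigl|\langle Au,u\rangle-\langle Av,v\rangle\bigr|=\bigl|\langle A(u-v),u\rangle+\langle Av,u-v\rangle\bigr|\leq 2\|A\||u-v|\leq\frac{1}{2}\|A\|,$$

so that

$$\|A\|=\sup_{u\in S^{d-1}}|\langle Au,u\rangle|\leq\max_{v\in M}|\langle Av,v\rangle|+\frac{1}{2}\|A\|,\quad\text{that is,}\quad\|A\|\leq 2\max_{v\in M}|\langle Av,v\rangle|.$$

The cardinality bound $9^{d}$ agrees with the standard volumetric estimate $(1+2/\varepsilon)^{d}$ for an $\varepsilon$-net of the sphere at $\varepsilon=1/4.$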

Another approach to bounding the operator norm $\|\hat{\Sigma}-\Sigma\|$ was developed by Rudelson and it is based on a noncommutative Khintchine inequality due to Lust-Piquard and Pisier. This method can be used not only in the subgaussian case, but also in “heavy tailed” cases, and it leads, for instance, to the following expectation bound (see Vershynin, Theorem 5.48):

In each of the above approaches, the bounds are not dimension free (at least, with a straightforward application of noncommutative Bernstein or Khintchine inequalities) and they cannot be directly used in the infinite-dimensional case. We will use below a different approach based on recent deep results on generic chaining bounds for empirical processes. The following facts about generic chaining complexities will be needed. Let $N_{n}:=2^{2^{n}},n\geq 1$ and $N_{0}:=1.$ Given a metric space $(T,d),$ an increasing sequence $\Delta_{n}$ of partitions of $T$ is called admissible if ${\rm card}(\Delta_{n})\leq N_{n}.$ For $t\in T,$ $\Delta_{n}(t)$ denotes the unique set of the partition $\Delta_{n}$ that contains $t.$ For $A\subset T,$ $D(A)$ denotes the diameter of set $A.$ Define

$$\gamma_{2}(T;d):=\inf\sup_{t\in T}\sum_{n=0}^{\infty}2^{n/2}D(\Delta_{n}(t)),$$

where the infimum is taken over all admissible sequences.

The following fundamental result is due to Talagrand (it was initially stated in terms of majorizing measures rather than generic chaining complexities).

Let $X(t),t\in T$ be a centered Gaussian process and suppose that

$$d(t,s)=\mathbb{E}^{1/2}(X(t)-X(s))^{2},\quad t,s\in T.$$

Then, there exists an absolute constant $K>0$ such that

$$K^{-1}\gamma_{2}(T;d)\leq\mathbb{E}\sup_{t\in T}X(t)\leq K\gamma_{2}(T;d).$$

In what follows, generic chaining complexities are used in the case when $T={\mathcal{F}}$ is a function class on a probability space $(S,{\mathcal{A}},P)$ and $d$ is the metric generated by either the $L_{2}(P)$-norm, or by the $\psi_{2}$-norm with respect to $P.$ We will use the following result due to Mendelson (although an earlier, simpler and weaker version, with $\sup_{f\in{\cal F}}\|f\|_{\psi_{2}}$ instead of $\sup_{f\in{\cal F}}\|f\|_{\psi_{1}},$ that goes back to Klartag and Mendelson would suffice for our purposes).

Let $X,X_{1},\ldots,X_{n}$ be i.i.d. weakly square integrable centered random vectors in $E$ with covariance operator $\Sigma.$ If $X$ is subgaussian and pregaussian, then

Proof. The proof of the upper bound relies on the generic chaining bound of Theorem 3, while the proof of the lower bound is rather elementary.

where ${\mathcal{F}}:=\Bigl{\{}\langle\cdot,u\rangle:u\in U_{E^{\ast}}\Bigr{\}},$ $U_{E^{\ast}}:=\{u\in E^{\ast}:\|u\|\leq 1\}$ and $P$ is the distribution of random variable $X.$

Since $X$ is subgaussian, the $\psi_{1}$- and $\psi_{2}$-norms of linear functionals $\langle X,u\rangle$ are both equivalent to the $L_{2}$-norm. This implies that

Also, since $X$ is pregaussian, there exists a centered Gaussian random variable $Y$ in $E$ with the same covariance $\Sigma.$ This means that

Using Talagrand’s Theorem 2, we easily get that

Lower Bound. To prove the lower bound, note that

For a fixed $u\in E^{\ast}$ with $\|u\|\leq 1$ and $\langle\Sigma u,u\rangle>0,$ denote

By a straightforward computation, for all $v\in E^{\ast},$ the random variables $\langle X,u\rangle$ and $\langle X^{\prime},v\rangle$ are uncorrelated. Since they are jointly Gaussian, it follows that $\langle X,u\rangle$ and $X^{\prime}$ are independent. Define
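
The computation can be sketched as follows, assuming that $X^{\prime}$ denotes the residual $X^{\prime}:=X-\langle X,u\rangle\frac{\Sigma u}{\langle\Sigma u,u\rangle}$ (this specific form is an assumption of the sketch). Since $\mathbb{E}\langle X,u\rangle\langle X,v\rangle=\langle\Sigma u,v\rangle,$ for all $v\in E^{\ast}$

$$\mathbb{E}\langle X,u\rangle\langle X^{\prime},v\rangle=\langle\Sigma u,v\rangle-\mathbb{E}\langle X,u\rangle^{2}\frac{\langle\Sigma u,v\rangle}{\langle\Sigma u,u\rangle}=\langle\Sigma u,v\rangle-\langle\Sigma u,u\rangle\frac{\langle\Sigma u,v\rangle}{\langle\Sigma u,u\rangle}=0.$$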

Then $\{X_{j}^{\prime}:j=1,\dots,n\}$ and $\{\langle X_{j},u\rangle:j=1,\dots,n\}$ are also independent. We easily get

Note that, conditionally on $\langle X_{j},u\rangle,j=1,\dots,n,$ the distribution of random variable

is Gaussian and it coincides with the distribution of the random variable

are i.i.d. standard normal random variables. It is easy to check that

for a positive numerical constant $c_{2},$ implying that

We now combine this bound with (2) and (2) to get

for some numerical constant $c_{3}>0.$ Thus, we get

provided $c_{2}$ is chosen to be small enough to satisfy $c_{2}\sqrt{\frac{2}{\pi}}\leq c_{3}.$

This completes the proof in the case when ${\bf r}(\Sigma)\leq 2n$ since in this case

On the other hand, under the assumption that ${\bf r}(\Sigma)\geq 2n,$

which completes the proof in the case when ${\bf r}(\Sigma)\geq 2n.$

Our next goal is to prove a concentration inequality for $\|\hat{\Sigma}-\Sigma\|$ around its median or around its expectation. In what follows, ${\rm Med}(\xi)$ denotes a median of a random variable $\xi.$

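Let $X,X_{1},\ldots,X_{n}$ be i.i.d. centered Gaussian random vectors in $E$ with covariance $\Sigma$ and let $M$ be either the median, or the expectation of $\|\hat{\Sigma}-\Sigma\|.$ Then, there exists a constant $C>0$ such that the following holds. If ${\bf r}(\Sigma)\leq n,$ then for all $t\geq 1,$ with probability at least $1-e^{-t},$

$$\Bigl|\|\hat{\Sigma}-\Sigma\|-M\Bigr|\leq C\|\Sigma\|\Bigl(\sqrt{\frac{t}{n}}\vee\frac{t}{n}\Bigr).$$

On the other hand, if ${\bf r}(\Sigma)\geq n,$ then with the same probability

$$\Bigl|\|\hat{\Sigma}-\Sigma\|-M\Bigr|\leq C\|\Sigma\|\Bigl(\sqrt{\frac{{\bf r}(\Sigma)}{n}}\sqrt{\frac{t}{n}}\vee\frac{t}{n}\Bigr).$$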

In the case when $M$ is the median, this result is an immediate consequence of Theorem 4 and Theorem 6 that is given below and that provides an equivalent concentration inequality written in a somewhat implicit form. The bounds of Theorem 5 in the case when $M$ is the median imply that

when ${\bf r}(\Sigma)\geq n.$ This, in turn, implies the concentration bound in the case when $M$ is the expectation.

Let $X,X_{1},\ldots,X_{n}$ be i.i.d. centered Gaussian random vectors in $E$ with covariance $\Sigma$ and let $M$ be the median of $\|\hat{\Sigma}-\Sigma\|.$ Then, there exists a constant $C>0$ such that for all $t\geq 1$ with probability at least $1-e^{-t},$

The proof of Theorem 6 is somewhat long and will be given in the next section. Here we will state a couple of corollaries of this theorem.

Under the assumptions and notations of Theorem 6, there exists a constant $C>0$ such that, for all $t\geq 1,$ with probability at least $1-e^{-t},$

Proof. The proof easily follows from the next simple bound: $2\|\Sigma\|^{1/2}M^{1/2}\sqrt{\frac{t}{n}}\leq M+\|\Sigma\|\frac{t}{n}.$
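
This simple bound is the inequality $2ab\leq a^{2}+b^{2}$ with $a=M^{1/2}$ and $b=\|\Sigma\|^{1/2}\sqrt{\frac{t}{n}}:$

$$0\leq(a-b)^{2}=M-2\|\Sigma\|^{1/2}M^{1/2}\sqrt{\frac{t}{n}}+\|\Sigma\|\frac{t}{n}.$$

Assuming that the bound of Theorem 6 contains a cross term of order $\|\Sigma\|^{1/2}M^{1/2}\sqrt{\frac{t}{n}}$ (an assumption suggested by the form of this inequality), the cross term can thus be absorbed into $M$ and $\|\Sigma\|\frac{t}{n}$ at the price of a larger constant.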

The following corollary can be viewed as an infinite-dimensional generalization of Theorem 1.

Under the assumptions and notations of Theorem 6, there exists a constant $C>0$ such that, for all $t\geq 1,$ with probability at least $1-e^{-t},$

Proof. Bound (2.8) follows immediately from Corollary 1 and Theorem 4. Bound (2.9) follows from (2.8) by integrating the tail probabilities.
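
The integration of tail probabilities can be illustrated in the simplest case. This is a generic sketch with placeholder constants $a\geq 0,$ $b>0$ (they are not the constants of (2.8) or (2.9)): suppose that a nonnegative random variable $\xi$ satisfies $\mathbb{P}\{\xi>a+bt\}\leq e^{-t}$ for all $t\geq 1.$ Then, with the change of variable $s=a+bt,$

$$\mathbb{E}\xi=\int_{0}^{\infty}\mathbb{P}\{\xi>s\}ds\leq(a+b)+b\int_{1}^{\infty}\mathbb{P}\{\xi>a+bt\}dt\leq a+b+be^{-1}.$$

Higher moments can be handled in the same way, starting from $\mathbb{E}\xi^{p}=\int_{0}^{\infty}ps^{p-1}\mathbb{P}\{\xi>s\}ds.$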

Proof of the concentration inequality

In this section, we provide a proof of Theorem 6. We will use the following well known fact (see, e.g., ).

Let $X$ be a centered Gaussian random variable in a separable Banach space $E.$ Then there exists a sequence $\{x_{k}:k\geq 1\}$ of vectors in $E$ and a sequence $\{Z_{k}:k\geq 1\}$ of i.i.d. standard normal random variables such that

where the series in the right hand side converges in $E$ a.s. and

Note that under the assumptions and notations of Theorem 7,

It easily follows from Theorem 7 that, for $X^{(m)}:=\sum_{k=1}^{m}Z_{k}x_{k},$ we have

Let now $\Sigma^{(m)}$ be the covariance operator of $X^{(m)}$ and $\hat{\Sigma}^{(m)}$ be the sample covariance operator based on observations $(X_{1}^{(m)},\dots,X_{n}^{(m)})$ (with the notation $X_{j}^{(m)}$ having an obvious meaning and the sample size $n$ being fixed). Then,

Thus, it is enough to prove the theorem only in the case when

The general case would then follow by a straightforward limiting argument.

The main ingredient of the proof is the classical Gaussian concentration inequality (see, e.g., Ledoux and Talagrand, p. 21).

where $\Phi$ is the distribution function of a standard normal random variable.
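
In a commonly used form, stated here under assumptions on the setting ($Z$ a standard normal vector in ${\mathbb{R}}^{N}$ and $f:{\mathbb{R}}^{N}\mapsto{\mathbb{R}}$ a Lipschitz function with constant $L$ with respect to the Euclidean norm; the notation $N,$ $L,$ $s$ is introduced only for this sketch), the inequality reads

$$\mathbb{P}\bigl\{f(Z)\geq{\rm Med}(f(Z))+Ls\bigr\}\leq 1-\Phi(s),\quad s\geq 0.$$

Since $1-\Phi(s)\leq e^{-s^{2}/2}$ for $s\geq 0,$ taking $s=\sqrt{2t}$ gives that, with probability at least $1-e^{-t},$ $f(Z)\leq{\rm Med}(f(Z))+L\sqrt{2t}.$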

This result easily follows from the Gaussian isoperimetric inequality. We will also need another consequence of this inequality:

Under the assumptions of Lemma 1, suppose that for some $M$ and for some $\alpha>0$

Then, there exists a constant $D>0$ (possibly depending on $\alpha$ ) such that, for all $t\geq 1,$ with probability at least $1-e^{-t},$

Lemma 1 will be applied to the function $f(Z)=g(X_{1},\dots,X_{n}).$ We have to check the Lipschitz condition for this function. To this end, we will prove the following lemma.

Proof. Obviously, $0\leq g(X_{1},\dots,X_{n})\leq 2\delta,$ $0\leq g(X_{1}^{\prime},\dots,X_{n}^{\prime})\leq 2\delta,$ implying that

It is enough to consider the case when $\|W\|\leq 2\delta$ or $\|W^{\prime}\|\leq 2\delta$ (otherwise, the claim of the lemma is obvious). To be specific, assume that $\|W\|\leq 2\delta.$ Then, using the assumption that $\varphi$ is Lipschitz with constant $1,$ we get

We will now control $\|W-W^{\prime}\|.$ Note that

Substituting the last bound in (3.3), we get

In view of (3.2), the left hand side is also bounded from above by $2\delta,$ which allows one to get from (3.5) that

It is also easy to check that the same bound holds in the opposite case, too. As a consequence, (3.6) implies that with some numerical constant $D>0,$

Combining this with bound (3.7) yields (3.1).

It follows from Lemmas 1 and 3 that, for all $t\geq 1$ with probability at least $1-e^{-t},$

where $D_{1}$ is a numerical constant. We will use this bound to get that, on the event where $\|W\|\leq\delta$ and, at the same time, concentration bound (3.8) holds, we have

There exists a constant $D_{2}>0$ such that for all $t>0,$ with probability at least $1-e^{-t}$

In particular, this implies that, for some constant $D_{2}>0,$

and, by Bernstein’s inequality for $\psi_{1}$-random variables, with probability at least $1-e^{-t}$

The proof of (3.11) immediately follows by taking $t=\log 2.$

We will define $\delta_{k}$ for $k\geq 1$ as follows:

It is easy to see that $\delta_{1}\leq\delta_{0}$ (provided that constant $D_{2}$ is chosen to be sufficiently large). Note also that

Thus, by induction, $\delta_{k},k\geq 0$ is a nonincreasing sequence. In view of definition of $\delta_{k},$ it follows from (3.9) that for all $k\geq 1$

Define $u_{k},k\geq 0$ as follows: $u_{0}=\delta_{0},$

where we also used (3.14). Taking into account (3.12) and (3.13), we get that for some constant $D_{3}>0$

and also that, for some constant $c_{1}>0$ and for $t\geq 1$

Using now (3.15) with $t+\log(\bar{k}+1)$ instead of $t,$ it is easy to get that with probability at least $1-e^{-t}$

where we used the notation $\log^{}x:=\log\log\log x.$ In the case when ${\bf r}(\Sigma)\lesssim n,$ we have

Hence, doubling the value of the constant $c_{1}$ allows us to drop the two terms involving $\frac{\log^{}(c_{1}{\bf r}(\Sigma))}{n}.$ On the other hand, assume that ${\bf r}(\Sigma)\geq C^{\prime}n$ with a sufficiently large constant $C^{\prime}$ (to be determined later). Observe that $\log^{}(c_{1}{\bf r}(\Sigma))\lesssim{\bf r}(\Sigma)$ and we can use a bound for the median $M$ similar to (2.3):

for some constants $c^{\prime}>0$ and for $C^{\prime}\geq 2/c^{\prime}.$ We also used the fact that for Gaussian $X$

Since also $\frac{\log^{}(c_{1}n)}{n}\lesssim 1,$ this implies that with some constant $C_{1}$ and with the same probability

Take now $\delta$ to be equal to the expression in the right hand side of bound (3) and use this value of $\delta$ to do another iteration of bound (3.9). This easily yields that with some constant $C>0$ and with probability at least $1-2e^{-t}$

To complete the proof of concentration inequality (2.6), note that, for an arbitrary $\delta>0,$ on the event where (3.8) holds and also $\|W\|\leq\delta,$

(provided that $t\geq 1$ ). Note also that

Then, it follows from Lemma 2 that, for a sufficiently large constant $D_{1}$ and for all $t\geq 1,$ with probability at least $1-e^{-t},$ the following bound holds:

Recall also that $g(X_{1},\dots,X_{n})=\|W\|$ on the event where $\|W\|\leq\delta$ of probability at least $1-2e^{-t-2n}\geq 1-e^{-t}.$ Therefore, with probability at least $1-3e^{-t},$

The result now follows by substituting $\delta$ given by (3.19) into bound (3.20), doing simple algebra and adjusting the value of constant $D_{1}$ to get the probability bound $1-e^{-t}.$

Very recent exponential generic chaining bounds for empirical processes by Dirksen (see Corollary 5.7) and by Bednorz (see Theorem 1) imply the following (earlier, Mendelson, Theorem 3.1, obtained another version of exponential generic chaining bounds for the same class of processes).

Let $X,X_{1},\dots,X_{n}$ be i.i.d. random variables in a measurable space $(S,{\mathcal{A}})$ with common distribution $P$ and let ${\cal F}$ be a class of measurable functions on $(S,{\mathcal{A}}).$ There exists a constant $C>0$ such that for all $t\geq 1$ with probability at least $1-e^{-t}$

This result together with the argument used in the proof of the upper bound of Theorem 4 easily implies the following generalization of Corollary 2.

Let $X,X_{1},\ldots,X_{n}$ be i.i.d. weakly square integrable centered random vectors in $E$ with covariance operator $\Sigma.$ If $X$ is subgaussian and pregaussian, then there exists a constant $C>0$ such that, for all $t\geq 1,$ with probability at least $1-e^{-t},$

Note that the proof of the concentration inequality of Theorem 6 does not rely on generic chaining bounds; it relies only on the Gaussian isoperimetric inequality. The bound of Theorem 9 (based on the generic chaining method) could be used to provide a shortcut in the proof of the concentration inequality. To this end, instead of using the very rough initial bound $\delta_{0}$ based on Lemma 4, one should use the much more precise bound of Theorem 9. In this case, there is no need to implement an iterative argument improving the bound: the concentration inequality in its explicit form (Theorem 5) follows just by an application of the Gaussian isoperimetric inequality. Adamczak suggested an alternative approach to the proof of Theorem 5. It is based on a version of a concentration inequality for Gaussian chaos and on some other tools (such as the Gordon-Chevet inequality), but it does not rely on the generic chaining bounds.

Acknowledgments. The authors are very thankful to Sjoerd Dirksen for attracting their attention to paper . Radek Adamczak pointed out that a similar result was proved in .

The authors are especially thankful to Radek Adamczak for providing an alternative proof of the concentration inequality and for very helpful discussions. The initial version of Theorem 5 was under an extra assumption that ${\bf r}(\Sigma)\lesssim e^{2n}.$ We improved our argument after Adamczak had provided his alternative proof.
