Concentration Inequalities and Moment Bounds for Sample Covariance Operators

Vladimir Koltchinskii, Karim Lounici

Introduction

The nuclear norm A1\|A\|_{1} is defined as the infimum of the sums (1.2) over all the sequences {xn:n1}E1, {yn:n1}E2\{x_{n}:n\geq 1\}\subset E_{1}^{\ast},\ \{y_{n}:n\geq 1\}\subset E_{2} such that representation (1.1) holds.

Let X1,,XnX_{1},\dots,X_{n} be i.i.d. copies of X.X. The sample (empirical) covariance operator based on the observations (X1,,Xn)(X_{1},\dots,X_{n}) is defined as the operator Σ^:EE\hat{\Sigma}:E^{\ast}\mapsto E such that

Clearly, this is an operator of rank at most nn and it is an unbiased estimator of the covariance operator Σ.\Sigma.

If ξ,ξ1,,ξn\xi,\xi_{1},\dots,\xi_{n} are i.i.d. centered random variables with ξψ1<+,\|\xi\|_{\psi_{1}}<+\infty, then the sum ξ1++ξn\xi_{1}+\dots+\xi_{n} satisfies the following version of Bernstein’s inequality: for all t0t\geq 0 with probability at least 1et1-e^{-t}

Our goal is to obtain moment bounds and concentration inequalities for the operator norm Σ^Σ.\|\hat{\Sigma}-\Sigma\|. It turns out that both the size of the expectation of random variable Σ^Σ\|\hat{\Sigma}-\Sigma\| and its concentration around its mean can be characterized in terms of the operator norm Σ\|\Sigma\| and another parameter defined below.

Assuming that XX is a centered Gaussian random variable in EE with covariance operator Σ,\Sigma, define

The main results of the paper include the following:

under an assumption that X,X1,,XnX,X_{1},\dots,X_{n} are i.i.d. centered Gaussian random variables in EE with covariance operator Σ,\Sigma, it will be shown that

Moreover, under an additional assumption that r(Σ)n,{\bf r}(\Sigma)\lesssim n, the following concentration inequality holds for some constant C>0C>0 and for all t1t\geq 1 with probability at least 1et:1-e^{-t}:

Under an assumption that r(Σ)n,{\bf r}(\Sigma)\gtrsim n, the concentration inequality becomes

Main results

The problem of bounding the operator norm Σ^Σ\|\hat{\Sigma}-\Sigma\| has been intensively studied, especially, in the finite-dimensional case (see and references therein). The focus has been on understanding of dependence of this norm on the dimension of the space and on the sample size nn (that could be simultaneously large) as well as on the tails of linear forms X,u,uE\langle X,u\rangle,u\in E and of the norm X\|X\| of random variable X.X. Many results that hold for Gaussian random variables are also true in a slightly more general subgaussian case.

A centered random variable XX in EE will be called subgaussian iff, for all uE,u\in E^{\ast},

We will also need the following definition.

A weakly square integrable centered random variable XX in EE with covariance operator Σ\Sigma is called pregaussian iff there exists a centered Gaussian random variable YY in EE with the same covariance operator Σ.\Sigma.

There exists an absolute constant C>0C>0 such that, for all t1,t\geq 1, with probability at least 1et1-e^{-t}

The proof of this theorem is based on a simple ε\varepsilon-net argument that allows one to reduce bounding the operator norm Σ^Σ\|\hat{\Sigma}-\Sigma\| to bounding the finite maximum

where MSd1M\subset S^{d-1} is a 1/41/4-net of the unit sphere of cardinality card(M)9d.{\rm card}(M)\leq 9^{d}. The bounding of the finite maximum is based on a version of Bernstein inequality for the sum of independent ψ1\psi_{1} random variables Xj,u2\langle X_{j},u\rangle^{2} combined with the union bound (see the proof of Theorem 5.39 in and the comments after this theorem).

Another approach to bounding the operator norm Σ^Σ\|\hat{\Sigma}-\Sigma\| was developed by Rudelson and it is based on a noncommutative Khintchine inequality due to Lust-Picard and Pisier . This method can be used not only in subgaussian, but also in “heavy tailed” cases and it leads, for instance, to the following expectation bound (see Vershynin , Theorem 5.48):

In each of the above approaches, the bounds are not dimension free (at least, with a straightforward application of noncommutative Bernstein or Khintchine inequalities) and they could not be directly used in the infinite-dimensional case. We will use below a different approach based on recent deep results on generic chaining bounds for empirical processes. The following facts about generic chaining complexities will be needed. Let Nn:=22n,n1N_{n}:=2^{2^{n}},n\geq 1 and N0:=1.N_{0}:=1. Given a metric space (T,d),(T,d), an increasing sequence Δn\Delta_{n} of partitions of TT is called admissible if card(Δn)Nn.{\rm card}(\Delta_{n})\leq N_{n}. For tT,t\in T, Δn(t)\Delta_{n}(t) denotes the unique set of the partition Δn\Delta_{n} that contains t.t. For AT,A\subset T, D(A)D(A) denotes the diameter of set A.A. Define

where the infimum is taken over all admissible sequences.

The following fundamental result is due to Talagrand (see ; it was initially stated it terms of majorizing measures rather than generic complexities).

Let X(t),tTX(t),t\in T be a centered Gaussian process and suppose that

Then, there exists an absolute constant K>0K>0 such that

In what follows, generic chaining complexities are used in the case when T=FT={\mathcal{F}} is a function class on a probability space (S,A,P)(S,{\mathcal{A}},P) and dd is the metric generated by either L2(P)L_{2}(P)-norm, or by the ψ2\psi_{2}-norm with respect to P.P. We will use the following result due to Mendelson (although an earlier, simpler and weaker version, with supfFfψ2\sup_{f\in{\cal F}}\|f\|_{\psi_{2}} instead of supfFfψ1,\sup_{f\in{\cal F}}\|f\|_{\psi_{1}}, that goes back to Klartag and Mendelson would suffice for our purposes).

Let X,X1,,XnX,X_{1},\ldots,X_{n} be i.i.d. weakly square integrable centered random vectors in EE with covariance operator Σ.\Sigma. If XX is subgaussian and pregaussian, then

proof. The proof of the upper bound relies on the generic chaining bound of Theorem 3, while the proof of the lower bound is rather elementary.

where {\mathcal{F}}:=\Bigl{\{}\langle\cdot,u\rangle:u\in U_{E^{\ast}}\Bigr{\}}, UE:={uE:u1}U_{E^{\ast}}:=\{u\in E^{\ast}:\|u\|\leq 1\} and PP is the distribution of random variable X.X.

Since XX is subgaussian, the ψ1\psi_{1}- and ψ2\psi_{2}-norms of linear functionals X,u\langle X,u\rangle are both equivalent to the L2L_{2}-norm. This implies that

Also, since XX is pregaussian, there exists a centered Gaussian random variable YY in EE with the same covariance Σ.\Sigma. This means that

Using Talagrand’s Theorem 2, we easily get that

Lower Bound. To prove the lower bound, note that

For a fixed uEu\in E^{\ast} with u1\|u\|\leq 1 and Σu,u>0,\langle\Sigma u,u\rangle>0, denote

By a straightforward computation, for all vE,v\in E^{\ast}, the random variables X,u\langle X,u\rangle and X,v\langle X^{\prime},v\rangle are uncorrelated. Since they are jointly Gaussian, it follows that X,u\langle X,u\rangle and XX^{\prime} are independent. Define

Then {Xj:j=1,,n}\{X_{j}^{\prime}:j=1,\dots,n\} and {Xj,u:j=1,,n}\{\langle X_{j},u\rangle:j=1,\dots,n\} are also independent. We easily get

Note that, conditionally on Xj,u,j=1,,n,\langle X_{j},u\rangle,j=1,\dots,n, the distribution of random variable

is Gaussian and it coincides with the distribution of the random variable

are i.i.d. standard normal random variables. It is easy to check that

for a positive numerical constant c2,c_{2}, implying that

We now combine this bound with (2) and (2) to get

for some numerical constant c3>0.c_{3}>0. Thus, we get

provided c2c_{2} is chosen to be small enough to satisfy c22πc3.c_{2}\sqrt{\frac{2}{\pi}}\leq c_{3}.

This completes the proof in the case when r(Σ)2n{\bf r}(\Sigma)\leq 2n since in this case

On the other hand, under the assumption that r(Σ)2n,{\bf r}(\Sigma)\geq 2n,

which completes the proof in the case when r(Σ)2n.{\bf r}(\Sigma)\geq 2n.

Our next goal is to prove a concentration inequality for Σ^Σ\|\hat{\Sigma}-\Sigma\| around its median or around its expectation. In what follows, Med(ξ){\rm Med}(\xi) denotes a median of a random variable ξ.\xi.

Let X,X1,,XnX,X_{1},\ldots,X_{n} be i.i.d. centered Gaussian random vectors in EE with covariance Σ\Sigma and let MM be either the median, or the expectation of Σ^Σ.\|\hat{\Sigma}-\Sigma\|. Then, there exists a constant C>0C>0 such that the folllowing holds. If r(Σ)n,{\bf r}(\Sigma)\leq n, then for all t1,t\geq 1, with probability at least 1et,1-e^{-t},

On the other hand, if r(Σ)n,{\bf r}(\Sigma)\geq n, then with the same probability

In the case when MM is the median, this result is an immediate consequence of Theorem 4 and Theorem 6 that is given below and that provides an equivalent concentration inequality written in a somewhat implicit form. The bounds of Theorem 5 in the case when MM is the median imply that

when r(Σ)n.{\bf r}(\Sigma)\geq n. This, in turn, implies the concentration bound in the case when MM is the expectation.

Let X,X1,,XnX,X_{1},\ldots,X_{n} be i.i.d. centered Gaussian random vectors in EE with covariance Σ\Sigma and let MM be the median of Σ^Σ.\|\hat{\Sigma}-\Sigma\|. Then, there exists a constant C>0C>0 such that for all t1t\geq 1 with probability at least 1et,1-e^{-t},

The proof of Theorem 6 is somewhat long and will be given in the next section. Here we will state a couple corollaries of this theorem.

Under the assumptions and notations of Theorem 6, there exists a constant C>0C>0 such that, for all t1,t\geq 1, with probability at least 1et,1-e^{-t},

proof. The proof easily follows from the next simple bound: 2Σ1/2M1/2tnM+Σtn.2\|\Sigma\|^{1/2}M^{1/2}\sqrt{\frac{t}{n}}\leq M+\|\Sigma\|\frac{t}{n}.

The following corollary can be viewed as an infinite-dimensional generalization of Theorem 1.

Under the assumptions and notations of Theorem 6, there exists a constant C>0C>0 such that, for all t1,t\geq 1, with probability at least 1et,1-e^{-t},

proof. Bound (2.8) follows immediately from Corollary 1 and Theorem 4. Bound (2.9) follows from (2.8) by integrating the tail probabilities.

Proof of the concentration inequality

In this section, we provide a proof of Theorem 6. We will use the following well known fact (see, e.g., ).

Let XX be a centered Gaussian random variable in a separable Banach space E.E. Then there exists a sequence {xk:k1}\{x_{k}:k\geq 1\} of vectors in EE and a sequence {Zk:k1}\{Z_{k}:k\geq 1\} of i.i.d. standard normal random variables such that

where the series in the right hand side converges in EE a.s. and

Note that under the assumptions and notations of Theorem 7,

It easily follows from Theorem 7 that, for X(m):=k=1mZkxk,X^{(m)}:=\sum_{k=1}^{m}Z_{k}x_{k}, we have

Let now Σ(m)\Sigma^{(m)} be the covariance operator of X(m)X^{(m)} and Σ^(m)\hat{\Sigma}^{(m)} be the sample covariance operator based on observations (X1(m),,Xn(m))(X_{1}^{(m)},\dots,X_{n}^{(m)}) (with the notation Xj(m)X_{j}^{(m)} having an obvious meaning and the sample size nn being fixed). Then,

Thus, it is enough to prove the theorem only in the case when

The general case would then follow by a straightforward limiting argument.

The main ingredient of the proof is the classical Gaussian concentration inequality (see, e.g., Ledoux and Talagrand , p. 21).

where Φ\Phi is the distribution function of a standard normal random variable.

This result easily follows from the Gaussian isoperimetric inequality. We will also need another consequence of this inequality:

Under the assumptions of Lemma 1, suppose that for some MM and for some α>0\alpha>0

Then, there exists a constant D>0D>0 (possibly depending on α\alpha) such that, for all t1,t\geq 1, with probability at least 1et,1-e^{-t},

Lemma 1 will be applied to the function f(Z)=g(X1,,Xn).f(Z)=g(X_{1},\dots,X_{n}). We have to check the Lipschitz condition for this function. To this end, we will prove the following lemma.

proof. Obviously, 0g(X1,,Xn)2δ,0\leq g(X_{1},\dots,X_{n})\leq 2\delta, 0g(X1,,Xn)2δ,0\leq g(X_{1}^{\prime},\dots,X_{n}^{\prime})\leq 2\delta, implying that

It is enough to consider the case when W2δ\|W\|\leq 2\delta or W2δ\|W^{\prime}\|\leq 2\delta (otherwise, the claim of the lemma is obvious). To be specific, assume that W2δ.\|W\|\leq 2\delta. Then, using the assumption that φ\varphi is Lipschitz with constant 1,1, we get

We will now control WW.\|W-W^{\prime}\|. Note that

Substituting the last bound in (3.3), we get

In view of (3.2), the left hand side is also bounded from above by 2δ,2\delta, which allows one to get from (3.5) that

It is also easy to check that the same bound holds in the opposite case, too. As a consequence, (3.6) implies that with some numerical constant D>0,D>0,

Combining this with bound (3.7) yields (3.1).

It follows from lemmas 1 and 3 that, for all t1t\geq 1 with probability at least 1et,1-e^{-t},

where D1D_{1} is a numerical constant. We will use this bound to get that, on the event where Wδ\|W\|\leq\delta and, at the same time, concentration bound (3.8) holds, we have

There exists a constant D2>0D_{2}>0 such that for all t>0,t>0, with probability at least 1et1-e^{-t}

In particular, this implies that, for some constant D2>0,D_{2}>0,

and, by Bernstein’s inequality for ψ1\psi_{1}-random variables, with probability at least 1et1-e^{-t}

The proof of (3.11) immediately follows by taking t=log2.t=\log 2.

We will define δk\delta_{k} for k1k\geq 1 as follows:

It is easy to see that δ1δ0\delta_{1}\leq\delta_{0} (provided that constant D2D_{2} is chosen to be sufficiently large). Note also that

Thus, by induction, δk,k0\delta_{k},k\geq 0 is a nonincreasing sequence. In view of definition of δk,\delta_{k}, it follows from (3.9) that for all k1k\geq 1

Define uk,k0u_{k},k\geq 0 as follows: u0=δ0,u_{0}=\delta_{0},

where we also used (3.14). Taking into account (3.12) and (3.13), we get that for some constant D3>0D_{3}>0

and also that, for some constant c1>0c_{1}>0 and for t1t\geq 1

Using now (3.15) with t+log(kˉ+1)t+\log(\bar{k}+1) instead of t,t, it is easy to get that with probability at least 1et1-e^{-t}

where we used the notation logx:=logloglogx.\log^{}x:=\log\log\log x. In the case when r(Σ)n,{\bf r}(\Sigma)\lesssim n, we have

Hence, doubling the value of the constant c1c_{1} allows us to drop the two terms involving log(c1r(Σ))n.\frac{\log^{}(c_{1}{\bf r}(\Sigma))}{n}. On the other hand, assume that r(Σ)Cn{\bf r}(\Sigma)\geq C^{\prime}n with a sufficiently large constant CC^{\prime} (to be determined later). Observe that log(c1r(Σ))r(Σ)\log^{}(c_{1}{\bf r}(\Sigma))\lesssim{\bf r}(\Sigma) and we can use a bound for the median MM similar to (2.3):

for some constants c>0c^{\prime}>0 and for C2/c.C^{\prime}\geq 2/c^{\prime}. We also used the fact that for Gaussian XX

Since also log(c1n)n1,\frac{\log^{}(c_{1}n)}{n}\lesssim 1, this implies that with some constant C1C_{1} and with the same probability

Take now δ\delta to be equal to the expression in the right hand side of bound (3) and use this value of δ\delta to do another iteration of bound (3.9). This easily yields that with some constant C>0C>0 and with probability at least 12et1-2e^{-t}

To complete the proof of concentration inequality (2.6), note that, for an arbitrary δ>0,\delta>0, on the event where (3.8) holds and also Wδ,\|W\|\leq\delta,

(provided that t1t\geq 1). Note also that

Then, it follows from Lemma 2 that, for a sufficiently large constant D1D_{1} and for all t1,t\geq 1, with probability at least 1et,1-e^{-t}, the following bound holds:

Recall also that g(X1,,Xn)=Wg(X_{1},\dots,X_{n})=\|W\| on the event where Wδ\|W\|\leq\delta of probability at least 12et2n1et.1-2e^{-t-2n}\geq 1-e^{-t}. Therefore, with probability at least 13et,1-3e^{-t},

The result now follows by substituting δ\delta given by (3.19) into bound (3.20), doing simple algebra and adjusting the value of constant D1D_{1} to get the probability bound 1et.1-e^{-t}.

Very recent exponential generic chaining bounds for empirical processes by Dirksen (see Corollary 5.7) and by Bednorz (see Theorem 1) imply the following (earlier, Mendelson , Theorem 3.1 obtained another version of exponential generic chaining bounds for the same class of processes).

Let X,X1,,XnX,X_{1},\dots,X_{n} be i.i.d. random variables in a measurable space (S,A)(S,{\mathcal{A}}) with common distribution PP and let F{\cal F} be a class of measurable functions on (S,A).(S,{\mathcal{A}}). There exists a constant C>0C>0 such that for all t1t\geq 1 with probability at least 1et1-e^{-t}

This result together with the argument used in the proof of the upper bound of Theorem 4 easily implies the following generalization of Corollary 2.

Let X,X1,,XnX,X_{1},\ldots,X_{n} be i.i.d. weakly square integrable centered random vectors in EE with covariance operator Σ.\Sigma. If XX is subgaussian and pregaussian, then there exists a constant C>0C>0 such that, for all t1,t\geq 1, with probability at least 1et,1-e^{-t},

Note that the proof of concentration inequality of Theorem 6 does not rely on generic chaining bounds, it relies only on the Gaussian isoperimetric inequality. The bound of Theorem 9 (based on the generic chaining method) could be used to provide a shortcut in the proof of the concentration inequality. To this end, instead of using very rough initial bound δ0\delta_{0} based on Lemma 4 one should use much more precise bound of Theorem 9. In this case, there is no need to implement an iterative argument improving the bound, the concentration inequality in its explicit form (Theorem 5) follows just by an application of the Gaussian isoperimetric inequality. Adamczak suggested an alternative approach to the proof of Theorem 5. It is based on a version of a concentration inequality for Gaussian chaos and on some other tools (such as Gordon-Chevet inequality), but it does not rely on the generic chaining bounds.

Acknowledgments. The authors are very thankful to Sjoerd Dirksen for attracting their attention to paper . Radek Adamczak pointed out that a similar result was proved in .

The authors are especially thankful to Radek Adamczak for providing an alternative proof of the concentration inequality and for very helpful discussions. The initial version of Theorem 5 was under an extra assumption that r(Σ)e2n.{\bf r}(\Sigma)\lesssim e^{2n}. We improved our argument after Adamczak had provided his alternative proof.

References