High-dimensional covariance matrix estimation with missing observations

Karim Lounici

Introduction

We can think of the δi,j\delta_{i,j} as masked variables. If δi,j=0\delta_{i,j}=0, then we cannot observe the jj-th component of XiX_{i} and the default value is assigned to Yi(j)Y_{i}^{(j)}. Our goal is then to estimate Σ\Sigma given the partial observations Y1,,YnY_{1},\ldots,Y_{n}.

The statistical problem of covariance estimation with missing observations is fundamental in multivariate statistics since it is often used as the first step to retrieve information in numerous applications where datasets with missing observations are common:

Climate studies: nn is the number of time points and pp the number of observations stations, which may sometimes fail to produce an observation due to break down of measure instruments. As a consequence, the generated datasets usually contain missing values.

Gene expression micro-arrays: nn is the number of measurements and pp the number of tested genes. Despite the improvement of genes expression techniques, the generated datasets frequently contain missing values with up to 90%90\% of genes affected.

Cosmology: nn is the number of images produced by a telescope and pp is the number of pixels per image. With the development of very large telescopes and wide sky surveys, the generated datasets are huge but usually contain missing observations due to partial sky coverage or defective pixel.

One simple strategy to deal with missing data is to exclude from the analysis any variable for which observations are missing, thus restricting the analysis to a subset of fully observed variables. In gene expression data where 90%90\% of the genes are affected by missing values, we would be left with too few variables so that the legitimacy of the statistical analysis becomes questionable. Also, discarding variables with very few missing observations is a waste of available information. Existing procedures involve complex imputation techniques to fill in the missing values through computationally intensive implementation of the EM algorithm, see and the references cited therein for more details. In this paper, we propose a simple procedure computationally tractable in high-dimension that does not require to imput missing observations or to discard any available observation to recover the covariance matrix Σ\Sigma.

Several alternatives have been studied in the literature to provide better estimates of the covariance matrix in the high-dimensional setting. A popular approach in Gaussian graphical models consists in estimating the inverse of the covariance matrix (called concentration matrix) since it admits a naturally sparse (or approximately sparse) structure if the dependence graph is itself sparse. See and the references cited therein for more details. A limitation of this approach is that it does not apply to low rank matrices Σ\Sigma since the concentration matrix does not exist in this case. An other popular approach assumes that the unknown covariance matrix is sparse,that is most of the entries are exactly or approximately zero and then proposes to perform either entrywise thresholding or tapering of the sample covariance matrix . Note that the sparsity notion adopted in this approach is not adapted to strongly correlated datasets with dense covariance matrix.

In random matrix theory, an important line of work, and the references cited therein studied the asymptotic distribution of the sample covariance matrix eigenvalues for different settings of nn and pp. See also for a very nice survey of existing non-asymptotic results on the spectral norm deviation of the sample covariance matrix from its population counterpart. In this paper, we adopt this approach and we will provide further details as we present our results.

Note that the results derived in the works cited above do not cover datasets with missing observations. For instance, when the data contains no missing observation (δ=1\delta=1), established a non-asymptotic control on the stochastic deviation ΣnΣ\|\Sigma_{n}-\Sigma\|_{\infty} of the empirical covariance matrix Σn=1ni=1nXiXi\Sigma_{n}=\frac{1}{n}\sum_{i=1}^{n}X_{i}\otimes X_{i} provided some tails conditions are satisfied by the common distribution of X1,,XnX_{1},\ldots,X_{n}. Exploiting these results, it is possible to establish oracle inequalities for the covariance version of the matrix Lasso estimator

where Sp\mathcal{S}_{p} is the set of p×pp\times p positive-semidefinite symmetric matrices, S2\|S\|_{2} and S1\|S\|_{1} are respectively the Frobenius and nuclear norm of SS and λ>0\lambda>0 is a regularization parameter that should be chosen of the order of magnitude of ΣnΣ\|\Sigma_{n}-\Sigma\|_{\infty}. This estimator is the covariance version of the matrix Lasso estimator initially introduced in the matrix regression framework, see and the references cited therein. To the best of our knowledge, the procedure (1.2) has not been studied in the covariance estimation problem.

When the data contains missing observations (δ<1\delta<1), we no longer have access to Σn\Sigma_{n}. Given the observations Y1,,YnY_{1},\ldots,Y_{n}, we can build the following empirical covariance matrix

We present now our reconstruction procedure based on the following simple observation

Therefore, we can define the following unbiased estimator of Σ\Sigma when the data set contains missing observations

Our estimator is then solution of the following penalized empirical risk minimization problem:

The rest of the paper is organized as follows. In Section 2, we recall some tools and definitions. In Section 3, we establish oracle inequalities for the Frobenius and spectral norms for our procedure (1.5) and also propose a data-driven choice of the regularization parameter. In section 4, we establish minimax lower bounds for data with missing observations δ(0,1]\delta\in(0,1], thus showing that our procedures are minimax optimal up to a logarithmic factor. Finally, Section 5 contains all the proofs of the paper.

We emphasize that the results of this paper are non-asymptotic in nature, hold true for any setting of n,pn,p, are minimax optimal (up to a logarithmic factor) and do not require the unknown covariance matrix Σ\Sigma to be low-rank. We note also that to the best of our knowledge, there exists in the literature no minimax lower bound result for statistical problem with missing observations.

Tools and definitions

Denote by Sp\mathcal{S}_{p} the set of p×pp\times p symmetric positive-semidefinite matrices. Any matrix ASpA\in\mathcal{S}_{p} admits the following spectral representation

The Schatten qq-norm of ASpA\in\mathcal{S}_{p} is defined by

We will also use the fact that the subdifferential of the convex function AA1A\mapsto\|A\|_{1} is the following set of matrices :

We recall now the definition and some basic properties of sub-exponential random vectors.

The ψα\psi_{\alpha}-norms of a real-valued random variable VV are defined by

We recall some well-known properties of sub-exponential random variables:

For any real-valued random variable VV such that Vα<\|V\|_{\alpha}<\infty for some α1\alpha\geq 1, we have

If a real-valued random variable VV is sub-gaussian, then V2V^{2} is sub-exponential. Indeed, we have

We recall the Bernstein inequality for sub-exponential real-valued random variables (see for instance Corollary 5.17 in )

Let Y1,,YnY_{1},\ldots,Y_{n} be independent centered sub-exponential random variables, and K=maxiYiψ1K=\max_{i}\|Y_{i}\|_{\psi_{1}}. Then for every t0t\geq 0, we have with probability at least 1et1-e^{-t}

The following proposition is the matrix version of Bernstein’s inequality for bounded random matrices (see also Corollary 9.1 in ).

Then, for all t>0,t>0, with probability at least 1et1-e^{-t} we have

Oracle inequalities

We can now state the main result for the procedure (1.5).

The intrinsic dimension of the matrix Σ\Sigma can be measured by the effective rank

We have the following result, which requires no condition on the covariance matrix Σ\Sigma.

The natural choice for tt is of the order of magnitude log(2p)\log(2p). Then the conclusions of Proposition 3 hold true with probability at least 112p1-\frac{1}{2p}. In addition, if the number of measurements nn is sufficiently large

where c>0c>0 is a sufficiently large numerical constant, then an acceptable choice for the regularization parameter λ\lambda is

where the absolute constant C>0C>0 is sufficiently large.

As we claimed in the introduction, Proposition 3 requires no condition on Σ\Sigma whatsoever. However, for the result to be of any practical interest, we need the bound in (3.5) to be small, which is the case if the condition (3.7) is satisfied. This condition is interesting since it shows that the number of measurements sufficient to guarantee a precise enough estimation of the spectrum of Σ\Sigma grows with the effective rank r(Σ)\mathbf{r}(\Sigma). In particular, when no observation is missing (δ=1\delta=1), if Σ\Sigma is approximately low-rank so that r(Σ)p\mathbf{r}(\Sigma)\ll p, then only n=O(r(Σ)log2(2p))n=\mathcal{O}\left(\mathbf{r}(\Sigma)\log^{2}(2p)\right) measurements are sufficient to estimate precisely the spectrum of the p×pp\times p covariance matrix Σ\Sigma.

Note that if we assume that YY=Y22U\|Y\otimes Y\|_{\infty}=|Y|_{2}^{2}\leq U a.s. for some constant U>0U>0, then we can eliminate the (c1δ+t+logn)(c_{1}\delta+t+\log n) factor in (3.5). Consequently, we can replace the condition (3.7) on the number of measurements by the following less restrictive one

for some absolute constant c>0c>0 sufficiently large. When there is no missing observation (δ=1\delta=1), we obtain the standard condition on the number of measurements (see Remark 5.53 in ). When some observations are missing (δ<1\delta<1), we have the additional quantity δ2\delta^{2} in the denominators of (3.5) and (3.7). The bound (3.5) is degraded in the case δ<1\delta<1 since we observe less entries per measurement. Consequently, as we can see it in (3.7), if we denote by N(ϵ)N(\epsilon) the number of necessary measurements to estimate Σ\Sigma with a precision ϵ\epsilon when no observation is missing (δ=1\delta=1), then we will need at least O(N(ϵ)/δ2)\mathcal{O}\left(N(\epsilon)/\delta^{2}\right) measurements in order to estimate Σ\Sigma with the same precision ϵ\epsilon when some observations are missing (δ<1\delta<1). In Theorem 2, we prove in particular that the dependence of the bound (3.5) on δ\delta is sharp by establishing a minimax lower bound.

Proposition 3 and Equation (3.8) give some insight on the tuning of the regularization parameter:

Let the assumptions of Proposition 3 be satisfied. Assume in addition that (3.7) holds true. Take λ\lambda as in (3.9) with C>0C>0 a large enough constant that can depend only on c1c_{1}. Then, we have with probability at least 112p1-\frac{1}{2p} that

where C>0C^{\prime}>0 can depend only on c1c_{1}.

We obtain the following corollary of Theorem 1.

Let Assumption 1 be satisfied. Assume that (3.7) is satisfied. Consider the estimator (1.5) with the regularization parameter λ\lambda satisfying (3.9). Then we have, with probability at least 112p1-\frac{1}{2p} that

where C1,C2>0C_{1},C_{2}>0 can depend only on c1c_{1}.

The proof of this corollary is immediate by combining Theorem 1 with Proposition 3 and Lemma 1.

Lower bounds

We now establish a minimax lower bound that guarantees the rates we obtained in Corollary 1 are optimal up to a logarithmic factor on the probability distribution class Pr\mathcal{P}_{r}. In particular, the dependence of our rates on δ\delta, Σ\|\Sigma\|_{\infty} and r(Σ)\mathbf{r}(\Sigma) is sharp.

where (δij)1in,1jp(\delta_{ij})_{1\leq i\leq n,\,1\leq j\leq p} is an i.i.d. sequence of Bernoulli B(δ)B(\delta) random variables independent of X1,,XnX_{1},\ldots,X_{n}.

Then, there exist absolute constants β(0,1)\beta\in(0,1) and c>0c>0 such that

where infΣ^\inf_{\hat{\Sigma}} denotes the infimum over all possible estimators Σ^\hat{\Sigma} of Σ\Sigma based on Y1,,YnY_{1},\ldots,Y_{n}.

Proofs

The proof of the first inequality adapts to covariance matrix estimation the arguments used in the trace regression problem to prove Theorems 1 and 11 in .

proof. By definition of Σ^λ\hat{\Sigma}^{\lambda}, we have for any SSpS\in\mathcal{S}_{p}

Next, a necessary and sufficient condition of minimum for problem (1.5) implies that there exists V^Σ^λ1\hat{V}\in\partial\|\hat{\Sigma}^{\lambda}\|_{1} such that for all SSpS\in\mathcal{S}_{p}

For any SSpS\in\mathcal{S}_{p} of rank rr with spectral representation S=j=1rσjujujS=\sum_{j=1}^{r}\sigma_{j}u_{j}\otimes u_{j} and support LL, It follows from (5.1) that

for an arbitrary VS1V\in\partial\|S\|_{1}. Note that V^V,Σ^λS0\langle\hat{V}-V,\hat{\Sigma}^{\lambda}-S\rangle\geq 0 by monotonicity of subdifferentials of convex functions and that the following representation holds

where WW is an arbitrary matrix with W1\|W\|_{\infty}\leq 1. In particular, there exists WW with W1\|W\|_{\infty}\leq 1 such that

For this choice of WW, we get from (5.2) that

Using Cauchy-Schwarz’s inequality and trace duality, we get

The above display combined with (5.1) give

Finally, we get on the event λ2Δ1\lambda\geq 2\|\Delta_{1}\|_{\infty} that

We now prove the spectral norm bound. Note first that the solution of (1.5) is given by

Next, we have on the event λ2Δ1\lambda\geq 2\|\Delta_{1}\|_{\infty}

2 Proof of Proposition 3

The delicate part of this proof is to obtain the sharp dependence on δ\delta. As a consequence, the proof is significantly more technical as compared to the case of full observations δ=1\delta=1. To simplify the understanding of this proof, we decomposed it into three lemmas that we prove below.

Now combining a simple union bound argument with Lemmas 2, 3 and 4, we get with probability at least 14et1-4e^{-t} that

Under the assumptions of Proposition 3, we have with probability at least 1et1-e^{-t} that

Next, since the random variables δi,j\delta_{i,j} and Xi(j)X_{i}^{(j)} are sub-gaussian for any i,ji,j, we have

where we have used Assumption 1 in the last inequality. We can apply Bernstein’s inequality (see Proposition 1 in the appendix below) to get for any 1jp1\leq j\leq p with probability at least 1et1-e^{-t^{\prime}} that

where C>0C>0 is an absolute constant. Next, taking t=t+logpt^{\prime}=t+\log p combined with a union bound argument we get the result.

Under the assumptions of Proposition 3, we have with probability at least 12et1-2e^{-t} that

where C>0C>0 is a large enough absolute constant.

where we have applied Cauchy-Schwarz’s inequality.

We have again by Cauchy-Schwarz’s inequality and Assumption 1 that

for some absolute constant C>0C>0. We have also, in view of (3.3), with the same absolute constant CC as above

Combining the three above displays with (5.2), we get

Combining the two above displays with (5.7), we get

where we have used that ZYY=Y22\|Z\|_{\infty}\leq\|Y\otimes Y\|_{\infty}=|Y|_{2}^{2} and

Then, combining Proposition 1 with a union bound argument gives for any t>0t>0

where C>0C^{\prime}>0 is a large enough absolute constant.

where we have used Proposition 2 to get that

Under the assumptions of Proposition 3, we have with probability at least 1et1-e^{-t} that

proof. In view of Assumption 1, we have for any 1jp1\leq j\leq p that (Y(j))2ψ1(X(j))2ψ12X(j)ψ222c11Σjj\|(Y^{(j)})^{2}\|_{\psi_{1}}\leq\|(X^{(j)})^{2}\|_{\psi_{1}}\leq 2\|X^{(j)}\|_{\psi_{2}}^{2}\leq 2c_{1}^{-1}\Sigma_{jj} and

Then we can apply Proposition 1 to get the result.

3 Proof of Lemma 1

In view of Proposition 3, we have on an event A\mathcal{A} of probability at least 112p1-\frac{1}{2p} that

We assume further that (3.7) is satisfied with a sufficiently large constant cc so that we have, in view of (3.5) and (3.6), on the same event A\mathcal{A} that

We immediately get on the event A\mathcal{A} that

Combining these simple facts with (5.11), we get the result.

4 Proof of Theorem 2

This proof uses standard tools of the minimax theory (cf. for instance ). However, as for Proposition 3, the proof with missing observations (δ<1\delta<1) is significantly more technical as compared to case of full observations (δ=1\delta=1). In particular, the control of the Kullback-Leibler divergence requires a precise description of the conditional distributions of the random variables Y1,,YnY_{1},\ldots,Y_{n} given the masked variables δ1,,δn\delta_{1},\ldots,\delta_{n}. To our knowledge, there exists no minimax lower bound result for statistical problem with missing observations in the literature.

Set γ=a/δ2n\gamma=a/\sqrt{\delta^{2}n} where a>0a>0 is a sufficiently small absolute constant.

We consider first the case r2r\geq 2. Define

Set Bk,l=Ek,l+El,kB_{k,l}=E_{k,l}+E_{l,k} for any 1kr1,k+1lr1\leq k\leq r-1,\,k+1\leq l\leq r. Consider the associated set of symmetric matrices

Note that any matrix ΣϵB(N)\Sigma_{\epsilon}\in\mathcal{B}(\mathcal{N}) is positive-semidefinite if 0<a<10<a<1 since we have by assumption

in view of the condition nδ2r2n\geq\delta^{-2}r^{2}. A similar reasoning gives the lower bound.

Using that YiδiN(0,Σ(δi))Y_{i}\mid\delta_{i}\sim N(0,\Sigma^{(\delta_{i})}) with Σ(δi)\Sigma^{(\delta_{i})} defined in (5.14), we get for any 1in1\leq i\leq n, any ΣA0\Sigma\in{\cal A}^{0} and any realization δi(ω){0,1}p\delta_{i}(\omega)\in\{0,1\}^{p} that

We note that PiA0(δi)Pi=IdiP_{i}A_{0}^{(\delta_{i})}P_{i}^{*}=I_{d_{i}} and

Denote by λ1,,λdi\lambda_{1},\ldots,\lambda_{d_{i}} the eigenvalues of WiW_{i}. Note that λj<1/2|\lambda_{j}|<1/2 for any j=1,,dij=1,\ldots,d_{i} in view of (5.12) if a<1/2a<1/2. We get, using the inequality xlog(1+x)x2x-\log(1+x)\leq x^{2} for any x>1/2x>-1/2, that

Taking the expectation w.r.t. to δi\delta_{i} in the above display, we get for any 1in1\leq i\leq n that

Thus, we deduce from the above display that the condition

is satisfied for any α>0\alpha>0 if a>0a>0 is chosen as a sufficiently small numerical constant depending on α\alpha. In view of (5.4) and (5.17), (4.1) now follows by application of Theorem 2.5 in .

The lower bound (4.2) follows from (4.1) by the following simple argument. Consider the set of matrices A0\mathcal{A}^{0}. For any two distinct matrices Σ1,Σ2\Sigma_{1},\Sigma_{2} of A0\mathcal{A}^{0}, we have

Next, (5.17) is satisfied for any α>0\alpha>0 if a>0a>0 is chosen as a sufficiently small numerical constant depending on α\alpha.

Combining (5.18) with (5.17) and Theorem 2.5 in gives the result.

for some absolute constant c>0c>0. The rest of the proof is identical to the case r2r\geq 2.

Acknowledgment:

I wish to thank Professor Vladimir Koltchinskii for suggesting this problem and the observation (1.3).

References