High-dimensional covariance matrix estimation with missing observations

Karim Lounici

Introduction

We can think of the $\delta_{i,j}$ as masked variables. If $\delta_{i,j}=0$ , then we cannot observe the $j$ -th component of $X_{i}$ and the default value is assigned to $Y_{i}^{(j)}$ . Our goal is then to estimate $\Sigma$ given the partial observations $Y_{1},\ldots,Y_{n}$ .

The statistical problem of covariance estimation with missing observations is fundamental in multivariate statistics since it is often used as the first step to retrieve information in numerous applications where datasets with missing observations are common:

Climate studies: $n$ is the number of time points and $p$ the number of observations stations, which may sometimes fail to produce an observation due to break down of measure instruments. As a consequence, the generated datasets usually contain missing values.

Gene expression micro-arrays: $n$ is the number of measurements and $p$ the number of tested genes. Despite the improvement of genes expression techniques, the generated datasets frequently contain missing values with up to $90\%$ of genes affected.

Cosmology: $n$ is the number of images produced by a telescope and $p$ is the number of pixels per image. With the development of very large telescopes and wide sky surveys, the generated datasets are huge but usually contain missing observations due to partial sky coverage or defective pixel.

One simple strategy to deal with missing data is to exclude from the analysis any variable for which observations are missing, thus restricting the analysis to a subset of fully observed variables. In gene expression data where $90\%$ of the genes are affected by missing values, we would be left with too few variables so that the legitimacy of the statistical analysis becomes questionable. Also, discarding variables with very few missing observations is a waste of available information. Existing procedures involve complex imputation techniques to fill in the missing values through computationally intensive implementation of the EM algorithm, see and the references cited therein for more details. In this paper, we propose a simple procedure computationally tractable in high-dimension that does not require to imput missing observations or to discard any available observation to recover the covariance matrix $\Sigma$ .

Several alternatives have been studied in the literature to provide better estimates of the covariance matrix in the high-dimensional setting. A popular approach in Gaussian graphical models consists in estimating the inverse of the covariance matrix (called concentration matrix) since it admits a naturally sparse (or approximately sparse) structure if the dependence graph is itself sparse. See and the references cited therein for more details. A limitation of this approach is that it does not apply to low rank matrices $\Sigma$ since the concentration matrix does not exist in this case. An other popular approach assumes that the unknown covariance matrix is sparse,that is most of the entries are exactly or approximately zero and then proposes to perform either entrywise thresholding or tapering of the sample covariance matrix . Note that the sparsity notion adopted in this approach is not adapted to strongly correlated datasets with dense covariance matrix.

In random matrix theory, an important line of work, and the references cited therein studied the asymptotic distribution of the sample covariance matrix eigenvalues for different settings of $n$ and $p$ . See also for a very nice survey of existing non-asymptotic results on the spectral norm deviation of the sample covariance matrix from its population counterpart. In this paper, we adopt this approach and we will provide further details as we present our results.

Note that the results derived in the works cited above do not cover datasets with missing observations. For instance, when the data contains no missing observation ( $\delta=1$ ), established a non-asymptotic control on the stochastic deviation $\|\Sigma_{n}-\Sigma\|_{\infty}$ of the empirical covariance matrix $\Sigma_{n}=\frac{1}{n}\sum_{i=1}^{n}X_{i}\otimes X_{i}$ provided some tails conditions are satisfied by the common distribution of $X_{1},\ldots,X_{n}$ . Exploiting these results, it is possible to establish oracle inequalities for the covariance version of the matrix Lasso estimator

where $\mathcal{S}_{p}$ is the set of $p\times p$ positive-semidefinite symmetric matrices, $\|S\|_{2}$ and $\|S\|_{1}$ are respectively the Frobenius and nuclear norm of $S$ and $\lambda>0$ is a regularization parameter that should be chosen of the order of magnitude of $\|\Sigma_{n}-\Sigma\|_{\infty}$ . This estimator is the covariance version of the matrix Lasso estimator initially introduced in the matrix regression framework, see and the references cited therein. To the best of our knowledge, the procedure (1.2) has not been studied in the covariance estimation problem.

When the data contains missing observations ( $\delta<1$ ), we no longer have access to $\Sigma_{n}$ . Given the observations $Y_{1},\ldots,Y_{n}$ , we can build the following empirical covariance matrix

We present now our reconstruction procedure based on the following simple observation

Therefore, we can define the following unbiased estimator of $\Sigma$ when the data set contains missing observations

Our estimator is then solution of the following penalized empirical risk minimization problem:

The rest of the paper is organized as follows. In Section 2, we recall some tools and definitions. In Section 3, we establish oracle inequalities for the Frobenius and spectral norms for our procedure (1.5) and also propose a data-driven choice of the regularization parameter. In section 4, we establish minimax lower bounds for data with missing observations $\delta\in(0,1]$ , thus showing that our procedures are minimax optimal up to a logarithmic factor. Finally, Section 5 contains all the proofs of the paper.

We emphasize that the results of this paper are non-asymptotic in nature, hold true for any setting of $n,p$ , are minimax optimal (up to a logarithmic factor) and do not require the unknown covariance matrix $\Sigma$ to be low-rank. We note also that to the best of our knowledge, there exists in the literature no minimax lower bound result for statistical problem with missing observations.

Tools and definitions

Denote by $\mathcal{S}_{p}$ the set of $p\times p$ symmetric positive-semidefinite matrices. Any matrix $A\in\mathcal{S}_{p}$ admits the following spectral representation

The Schatten $q$ -norm of $A\in\mathcal{S}_{p}$ is defined by

We will also use the fact that the subdifferential of the convex function $A\mapsto\|A\|_{1}$ is the following set of matrices :

We recall now the definition and some basic properties of sub-exponential random vectors.

The $\psi_{\alpha}$ -norms of a real-valued random variable $V$ are defined by

We recall some well-known properties of sub-exponential random variables:

For any real-valued random variable $V$ such that $\|V\|_{\alpha}<\infty$ for some $\alpha\geq 1$ , we have

If a real-valued random variable $V$ is sub-gaussian, then $V^{2}$ is sub-exponential. Indeed, we have

We recall the Bernstein inequality for sub-exponential real-valued random variables (see for instance Corollary 5.17 in )

Let $Y_{1},\ldots,Y_{n}$ be independent centered sub-exponential random variables, and $K=\max_{i}\|Y_{i}\|_{\psi_{1}}$ . Then for every $t\geq 0$ , we have with probability at least $1-e^{-t}$

The following proposition is the matrix version of Bernstein’s inequality for bounded random matrices (see also Corollary 9.1 in ).

Then, for all $t>0,$ with probability at least $1-e^{-t}$ we have

Oracle inequalities

We can now state the main result for the procedure (1.5).

The intrinsic dimension of the matrix $\Sigma$ can be measured by the effective rank

We have the following result, which requires no condition on the covariance matrix $\Sigma$ .

The natural choice for $t$ is of the order of magnitude $\log(2p)$ . Then the conclusions of Proposition 3 hold true with probability at least $1-\frac{1}{2p}$ . In addition, if the number of measurements $n$ is sufficiently large

where $c>0$ is a sufficiently large numerical constant, then an acceptable choice for the regularization parameter $\lambda$ is

where the absolute constant $C>0$ is sufficiently large.

As we claimed in the introduction, Proposition 3 requires no condition on $\Sigma$ whatsoever. However, for the result to be of any practical interest, we need the bound in (3.5) to be small, which is the case if the condition (3.7) is satisfied. This condition is interesting since it shows that the number of measurements sufficient to guarantee a precise enough estimation of the spectrum of $\Sigma$ grows with the effective rank $\mathbf{r}(\Sigma)$ . In particular, when no observation is missing ( $\delta=1$ ), if $\Sigma$ is approximately low-rank so that $\mathbf{r}(\Sigma)\ll p$ , then only $n=\mathcal{O}\left(\mathbf{r}(\Sigma)\log^{2}(2p)\right)$ measurements are sufficient to estimate precisely the spectrum of the $p\times p$ covariance matrix $\Sigma$ .

Note that if we assume that $\|Y\otimes Y\|_{\infty}=|Y|_{2}^{2}\leq U$ a.s. for some constant $U>0$ , then we can eliminate the $(c_{1}\delta+t+\log n)$ factor in (3.5). Consequently, we can replace the condition (3.7) on the number of measurements by the following less restrictive one

for some absolute constant $c>0$ sufficiently large. When there is no missing observation ( $\delta=1$ ), we obtain the standard condition on the number of measurements (see Remark 5.53 in ). When some observations are missing ( $\delta<1$ ), we have the additional quantity $\delta^{2}$ in the denominators of (3.5) and (3.7). The bound (3.5) is degraded in the case $\delta<1$ since we observe less entries per measurement. Consequently, as we can see it in (3.7), if we denote by $N(\epsilon)$ the number of necessary measurements to estimate $\Sigma$ with a precision $\epsilon$ when no observation is missing ( $\delta=1$ ), then we will need at least $\mathcal{O}\left(N(\epsilon)/\delta^{2}\right)$ measurements in order to estimate $\Sigma$ with the same precision $\epsilon$ when some observations are missing ( $\delta<1$ ). In Theorem 2, we prove in particular that the dependence of the bound (3.5) on $\delta$ is sharp by establishing a minimax lower bound.

Proposition 3 and Equation (3.8) give some insight on the tuning of the regularization parameter:

Let the assumptions of Proposition 3 be satisfied. Assume in addition that (3.7) holds true. Take $\lambda$ as in (3.9) with $C>0$ a large enough constant that can depend only on $c_{1}$ . Then, we have with probability at least $1-\frac{1}{2p}$ that

where $C^{\prime}>0$ can depend only on $c_{1}$ .

We obtain the following corollary of Theorem 1.

Let Assumption 1 be satisfied. Assume that (3.7) is satisfied. Consider the estimator (1.5) with the regularization parameter $\lambda$ satisfying (3.9). Then we have, with probability at least $1-\frac{1}{2p}$ that

where $C_{1},C_{2}>0$ can depend only on $c_{1}$ .

The proof of this corollary is immediate by combining Theorem 1 with Proposition 3 and Lemma 1.

Lower bounds

We now establish a minimax lower bound that guarantees the rates we obtained in Corollary 1 are optimal up to a logarithmic factor on the probability distribution class $\mathcal{P}_{r}$ . In particular, the dependence of our rates on $\delta$ , $\|\Sigma\|_{\infty}$ and $\mathbf{r}(\Sigma)$ is sharp.

where $(\delta_{ij})_{1\leq i\leq n,\,1\leq j\leq p}$ is an i.i.d. sequence of Bernoulli $B(\delta)$ random variables independent of $X_{1},\ldots,X_{n}$ .

Then, there exist absolute constants $\beta\in(0,1)$ and $c>0$ such that

where $\inf_{\hat{\Sigma}}$ denotes the infimum over all possible estimators $\hat{\Sigma}$ of $\Sigma$ based on $Y_{1},\ldots,Y_{n}$ .

Proofs

The proof of the first inequality adapts to covariance matrix estimation the arguments used in the trace regression problem to prove Theorems 1 and 11 in .

proof. By definition of $\hat{\Sigma}^{\lambda}$ , we have for any $S\in\mathcal{S}_{p}$

Next, a necessary and sufficient condition of minimum for problem (1.5) implies that there exists $\hat{V}\in\partial\|\hat{\Sigma}^{\lambda}\|_{1}$ such that for all $S\in\mathcal{S}_{p}$

For any $S\in\mathcal{S}_{p}$ of rank $r$ with spectral representation $S=\sum_{j=1}^{r}\sigma_{j}u_{j}\otimes u_{j}$ and support $L$ , It follows from (5.1) that

for an arbitrary $V\in\partial\|S\|_{1}$ . Note that $\langle\hat{V}-V,\hat{\Sigma}^{\lambda}-S\rangle\geq 0$ by monotonicity of subdifferentials of convex functions and that the following representation holds

where $W$ is an arbitrary matrix with $\|W\|_{\infty}\leq 1$ . In particular, there exists $W$ with $\|W\|_{\infty}\leq 1$ such that

For this choice of $W$ , we get from (5.2) that

Using Cauchy-Schwarz’s inequality and trace duality, we get

The above display combined with (5.1) give

Finally, we get on the event $\lambda\geq 2\|\Delta_{1}\|_{\infty}$ that

We now prove the spectral norm bound. Note first that the solution of (1.5) is given by

Next, we have on the event $\lambda\geq 2\|\Delta_{1}\|_{\infty}$

2 Proof of Proposition 3

The delicate part of this proof is to obtain the sharp dependence on $\delta$ . As a consequence, the proof is significantly more technical as compared to the case of full observations $\delta=1$ . To simplify the understanding of this proof, we decomposed it into three lemmas that we prove below.

Now combining a simple union bound argument with Lemmas 2, 3 and 4, we get with probability at least $1-4e^{-t}$ that

Under the assumptions of Proposition 3, we have with probability at least $1-e^{-t}$ that

Next, since the random variables $\delta_{i,j}$ and $X_{i}^{(j)}$ are sub-gaussian for any $i,j$ , we have

where we have used Assumption 1 in the last inequality. We can apply Bernstein’s inequality (see Proposition 1 in the appendix below) to get for any $1\leq j\leq p$ with probability at least $1-e^{-t^{\prime}}$ that

where $C>0$ is an absolute constant. Next, taking $t^{\prime}=t+\log p$ combined with a union bound argument we get the result.

Under the assumptions of Proposition 3, we have with probability at least $1-2e^{-t}$ that

where $C>0$ is a large enough absolute constant.

where we have applied Cauchy-Schwarz’s inequality.

We have again by Cauchy-Schwarz’s inequality and Assumption 1 that

for some absolute constant $C>0$ . We have also, in view of (3.3), with the same absolute constant $C$ as above

Combining the three above displays with (5.2), we get

Combining the two above displays with (5.7), we get

where we have used that $\|Z\|_{\infty}\leq\|Y\otimes Y\|_{\infty}=|Y|_{2}^{2}$ and

Then, combining Proposition 1 with a union bound argument gives for any $t>0$

where $C^{\prime}>0$ is a large enough absolute constant.

where we have used Proposition 2 to get that

Under the assumptions of Proposition 3, we have with probability at least $1-e^{-t}$ that

proof. In view of Assumption 1, we have for any $1\leq j\leq p$ that $\|(Y^{(j)})^{2}\|_{\psi_{1}}\leq\|(X^{(j)})^{2}\|_{\psi_{1}}\leq 2\|X^{(j)}\|_{\psi_{2}}^{2}\leq 2c_{1}^{-1}\Sigma_{jj}$ and

Then we can apply Proposition 1 to get the result.

3 Proof of Lemma 1

In view of Proposition 3, we have on an event $\mathcal{A}$ of probability at least $1-\frac{1}{2p}$ that

We assume further that (3.7) is satisfied with a sufficiently large constant $c$ so that we have, in view of (3.5) and (3.6), on the same event $\mathcal{A}$ that

We immediately get on the event $\mathcal{A}$ that

Combining these simple facts with (5.11), we get the result.

4 Proof of Theorem 2

This proof uses standard tools of the minimax theory (cf. for instance ). However, as for Proposition 3, the proof with missing observations ( $\delta<1$ ) is significantly more technical as compared to case of full observations ( $\delta=1$ ). In particular, the control of the Kullback-Leibler divergence requires a precise description of the conditional distributions of the random variables $Y_{1},\ldots,Y_{n}$ given the masked variables $\delta_{1},\ldots,\delta_{n}$ . To our knowledge, there exists no minimax lower bound result for statistical problem with missing observations in the literature.

Set $\gamma=a/\sqrt{\delta^{2}n}$ where $a>0$ is a sufficiently small absolute constant.

We consider first the case $r\geq 2$ . Define

Set $B_{k,l}=E_{k,l}+E_{l,k}$ for any $1\leq k\leq r-1,\,k+1\leq l\leq r$ . Consider the associated set of symmetric matrices

Note that any matrix $\Sigma_{\epsilon}\in\mathcal{B}(\mathcal{N})$ is positive-semidefinite if $0<a<1$ since we have by assumption

in view of the condition $n\geq\delta^{-2}r^{2}$ . A similar reasoning gives the lower bound.

Using that $Y_{i}\mid\delta_{i}\sim N(0,\Sigma^{(\delta_{i})})$ with $\Sigma^{(\delta_{i})}$ defined in (5.14), we get for any $1\leq i\leq n$ , any $\Sigma\in{\cal A}^{0}$ and any realization $\delta_{i}(\omega)\in\{0,1\}^{p}$ that

We note that $P_{i}A_{0}^{(\delta_{i})}P_{i}^{*}=I_{d_{i}}$ and

Denote by $\lambda_{1},\ldots,\lambda_{d_{i}}$ the eigenvalues of $W_{i}$ . Note that $|\lambda_{j}|<1/2$ for any $j=1,\ldots,d_{i}$ in view of (5.12) if $a<1/2$ . We get, using the inequality $x-\log(1+x)\leq x^{2}$ for any $x>-1/2$ , that

Taking the expectation w.r.t. to $\delta_{i}$ in the above display, we get for any $1\leq i\leq n$ that

Thus, we deduce from the above display that the condition

is satisfied for any $\alpha>0$ if $a>0$ is chosen as a sufficiently small numerical constant depending on $\alpha$ . In view of (5.4) and (5.17), (4.1) now follows by application of Theorem 2.5 in .

The lower bound (4.2) follows from (4.1) by the following simple argument. Consider the set of matrices $\mathcal{A}^{0}$ . For any two distinct matrices $\Sigma_{1},\Sigma_{2}$ of $\mathcal{A}^{0}$ , we have

Next, (5.17) is satisfied for any $\alpha>0$ if $a>0$ is chosen as a sufficiently small numerical constant depending on $\alpha$ .

Combining (5.18) with (5.17) and Theorem 2.5 in gives the result.

for some absolute constant $c>0$ . The rest of the proof is identical to the case $r\geq 2$ .

Acknowledgment:

I wish to thank Professor Vladimir Koltchinskii for suggesting this problem and the observation (1.3).