High-dimensional covariance matrix estimation with missing observations
Karim Lounici
Introduction
We can think of the as masked variables. If , then we cannot observe the -th component of and the default value is assigned to . Our goal is then to estimate given the partial observations .
The statistical problem of covariance estimation with missing observations is fundamental in multivariate statistics since it is often used as the first step to retrieve information in numerous applications where datasets with missing observations are common:
Climate studies: is the number of time points and the number of observations stations, which may sometimes fail to produce an observation due to break down of measure instruments. As a consequence, the generated datasets usually contain missing values.
Gene expression micro-arrays: is the number of measurements and the number of tested genes. Despite the improvement of genes expression techniques, the generated datasets frequently contain missing values with up to of genes affected.
Cosmology: is the number of images produced by a telescope and is the number of pixels per image. With the development of very large telescopes and wide sky surveys, the generated datasets are huge but usually contain missing observations due to partial sky coverage or defective pixel.
One simple strategy to deal with missing data is to exclude from the analysis any variable for which observations are missing, thus restricting the analysis to a subset of fully observed variables. In gene expression data where of the genes are affected by missing values, we would be left with too few variables so that the legitimacy of the statistical analysis becomes questionable. Also, discarding variables with very few missing observations is a waste of available information. Existing procedures involve complex imputation techniques to fill in the missing values through computationally intensive implementation of the EM algorithm, see and the references cited therein for more details. In this paper, we propose a simple procedure computationally tractable in high-dimension that does not require to imput missing observations or to discard any available observation to recover the covariance matrix .
Several alternatives have been studied in the literature to provide better estimates of the covariance matrix in the high-dimensional setting. A popular approach in Gaussian graphical models consists in estimating the inverse of the covariance matrix (called concentration matrix) since it admits a naturally sparse (or approximately sparse) structure if the dependence graph is itself sparse. See and the references cited therein for more details. A limitation of this approach is that it does not apply to low rank matrices since the concentration matrix does not exist in this case. An other popular approach assumes that the unknown covariance matrix is sparse,that is most of the entries are exactly or approximately zero and then proposes to perform either entrywise thresholding or tapering of the sample covariance matrix . Note that the sparsity notion adopted in this approach is not adapted to strongly correlated datasets with dense covariance matrix.
In random matrix theory, an important line of work, and the references cited therein studied the asymptotic distribution of the sample covariance matrix eigenvalues for different settings of and . See also for a very nice survey of existing non-asymptotic results on the spectral norm deviation of the sample covariance matrix from its population counterpart. In this paper, we adopt this approach and we will provide further details as we present our results.
Note that the results derived in the works cited above do not cover datasets with missing observations. For instance, when the data contains no missing observation (), established a non-asymptotic control on the stochastic deviation of the empirical covariance matrix provided some tails conditions are satisfied by the common distribution of . Exploiting these results, it is possible to establish oracle inequalities for the covariance version of the matrix Lasso estimator
where is the set of positive-semidefinite symmetric matrices, and are respectively the Frobenius and nuclear norm of and is a regularization parameter that should be chosen of the order of magnitude of . This estimator is the covariance version of the matrix Lasso estimator initially introduced in the matrix regression framework, see and the references cited therein. To the best of our knowledge, the procedure (1.2) has not been studied in the covariance estimation problem.
When the data contains missing observations (), we no longer have access to . Given the observations , we can build the following empirical covariance matrix
We present now our reconstruction procedure based on the following simple observation
Therefore, we can define the following unbiased estimator of when the data set contains missing observations
Our estimator is then solution of the following penalized empirical risk minimization problem:
The rest of the paper is organized as follows. In Section 2, we recall some tools and definitions. In Section 3, we establish oracle inequalities for the Frobenius and spectral norms for our procedure (1.5) and also propose a data-driven choice of the regularization parameter. In section 4, we establish minimax lower bounds for data with missing observations , thus showing that our procedures are minimax optimal up to a logarithmic factor. Finally, Section 5 contains all the proofs of the paper.
We emphasize that the results of this paper are non-asymptotic in nature, hold true for any setting of , are minimax optimal (up to a logarithmic factor) and do not require the unknown covariance matrix to be low-rank. We note also that to the best of our knowledge, there exists in the literature no minimax lower bound result for statistical problem with missing observations.
Tools and definitions
Denote by the set of symmetric positive-semidefinite matrices. Any matrix admits the following spectral representation
The Schatten -norm of is defined by
We will also use the fact that the subdifferential of the convex function is the following set of matrices :
We recall now the definition and some basic properties of sub-exponential random vectors.
The -norms of a real-valued random variable are defined by
We recall some well-known properties of sub-exponential random variables:
For any real-valued random variable such that for some , we have
If a real-valued random variable is sub-gaussian, then is sub-exponential. Indeed, we have
We recall the Bernstein inequality for sub-exponential real-valued random variables (see for instance Corollary 5.17 in )
Let be independent centered sub-exponential random variables, and . Then for every , we have with probability at least
The following proposition is the matrix version of Bernstein’s inequality for bounded random matrices (see also Corollary 9.1 in ).
Then, for all with probability at least we have
Oracle inequalities
We can now state the main result for the procedure (1.5).
The intrinsic dimension of the matrix can be measured by the effective rank
We have the following result, which requires no condition on the covariance matrix .
The natural choice for is of the order of magnitude . Then the conclusions of Proposition 3 hold true with probability at least . In addition, if the number of measurements is sufficiently large
where is a sufficiently large numerical constant, then an acceptable choice for the regularization parameter is
where the absolute constant is sufficiently large.
As we claimed in the introduction, Proposition 3 requires no condition on whatsoever. However, for the result to be of any practical interest, we need the bound in (3.5) to be small, which is the case if the condition (3.7) is satisfied. This condition is interesting since it shows that the number of measurements sufficient to guarantee a precise enough estimation of the spectrum of grows with the effective rank . In particular, when no observation is missing (), if is approximately low-rank so that , then only measurements are sufficient to estimate precisely the spectrum of the covariance matrix .
Note that if we assume that a.s. for some constant , then we can eliminate the factor in (3.5). Consequently, we can replace the condition (3.7) on the number of measurements by the following less restrictive one
for some absolute constant sufficiently large. When there is no missing observation (), we obtain the standard condition on the number of measurements (see Remark 5.53 in ). When some observations are missing (), we have the additional quantity in the denominators of (3.5) and (3.7). The bound (3.5) is degraded in the case since we observe less entries per measurement. Consequently, as we can see it in (3.7), if we denote by the number of necessary measurements to estimate with a precision when no observation is missing (), then we will need at least measurements in order to estimate with the same precision when some observations are missing (). In Theorem 2, we prove in particular that the dependence of the bound (3.5) on is sharp by establishing a minimax lower bound.
Proposition 3 and Equation (3.8) give some insight on the tuning of the regularization parameter:
Let the assumptions of Proposition 3 be satisfied. Assume in addition that (3.7) holds true. Take as in (3.9) with a large enough constant that can depend only on . Then, we have with probability at least that
where can depend only on .
We obtain the following corollary of Theorem 1.
Let Assumption 1 be satisfied. Assume that (3.7) is satisfied. Consider the estimator (1.5) with the regularization parameter satisfying (3.9). Then we have, with probability at least that
where can depend only on .
The proof of this corollary is immediate by combining Theorem 1 with Proposition 3 and Lemma 1.
Lower bounds
We now establish a minimax lower bound that guarantees the rates we obtained in Corollary 1 are optimal up to a logarithmic factor on the probability distribution class . In particular, the dependence of our rates on , and is sharp.
where is an i.i.d. sequence of Bernoulli random variables independent of .
Then, there exist absolute constants and such that
where denotes the infimum over all possible estimators of based on .
Proofs
The proof of the first inequality adapts to covariance matrix estimation the arguments used in the trace regression problem to prove Theorems 1 and 11 in .
proof. By definition of , we have for any
Next, a necessary and sufficient condition of minimum for problem (1.5) implies that there exists such that for all
For any of rank with spectral representation and support , It follows from (5.1) that
for an arbitrary . Note that by monotonicity of subdifferentials of convex functions and that the following representation holds
where is an arbitrary matrix with . In particular, there exists with such that
For this choice of , we get from (5.2) that
Using Cauchy-Schwarz’s inequality and trace duality, we get
The above display combined with (5.1) give
Finally, we get on the event that
We now prove the spectral norm bound. Note first that the solution of (1.5) is given by
Next, we have on the event
2 Proof of Proposition 3
The delicate part of this proof is to obtain the sharp dependence on . As a consequence, the proof is significantly more technical as compared to the case of full observations . To simplify the understanding of this proof, we decomposed it into three lemmas that we prove below.
Now combining a simple union bound argument with Lemmas 2, 3 and 4, we get with probability at least that
Under the assumptions of Proposition 3, we have with probability at least that
Next, since the random variables and are sub-gaussian for any , we have
where we have used Assumption 1 in the last inequality. We can apply Bernstein’s inequality (see Proposition 1 in the appendix below) to get for any with probability at least that
where is an absolute constant. Next, taking combined with a union bound argument we get the result.
Under the assumptions of Proposition 3, we have with probability at least that
where is a large enough absolute constant.
where we have applied Cauchy-Schwarz’s inequality.
We have again by Cauchy-Schwarz’s inequality and Assumption 1 that
for some absolute constant . We have also, in view of (3.3), with the same absolute constant as above
Combining the three above displays with (5.2), we get
Combining the two above displays with (5.7), we get
where we have used that and
Then, combining Proposition 1 with a union bound argument gives for any
where is a large enough absolute constant.
where we have used Proposition 2 to get that
Under the assumptions of Proposition 3, we have with probability at least that
proof. In view of Assumption 1, we have for any that and
Then we can apply Proposition 1 to get the result.
3 Proof of Lemma 1
In view of Proposition 3, we have on an event of probability at least that
We assume further that (3.7) is satisfied with a sufficiently large constant so that we have, in view of (3.5) and (3.6), on the same event that
We immediately get on the event that
Combining these simple facts with (5.11), we get the result.
4 Proof of Theorem 2
This proof uses standard tools of the minimax theory (cf. for instance ). However, as for Proposition 3, the proof with missing observations () is significantly more technical as compared to case of full observations (). In particular, the control of the Kullback-Leibler divergence requires a precise description of the conditional distributions of the random variables given the masked variables . To our knowledge, there exists no minimax lower bound result for statistical problem with missing observations in the literature.
Set where is a sufficiently small absolute constant.
We consider first the case . Define
Set for any . Consider the associated set of symmetric matrices
Note that any matrix is positive-semidefinite if since we have by assumption
in view of the condition . A similar reasoning gives the lower bound.
Using that with defined in (5.14), we get for any , any and any realization that
We note that and
Denote by the eigenvalues of . Note that for any in view of (5.12) if . We get, using the inequality for any , that
Taking the expectation w.r.t. to in the above display, we get for any that
Thus, we deduce from the above display that the condition
is satisfied for any if is chosen as a sufficiently small numerical constant depending on . In view of (5.4) and (5.17), (4.1) now follows by application of Theorem 2.5 in .
The lower bound (4.2) follows from (4.1) by the following simple argument. Consider the set of matrices . For any two distinct matrices of , we have
Next, (5.17) is satisfied for any if is chosen as a sufficiently small numerical constant depending on .
Combining (5.18) with (5.17) and Theorem 2.5 in gives the result.
for some absolute constant . The rest of the proof is identical to the case .
Acknowledgment:
I wish to thank Professor Vladimir Koltchinskii for suggesting this problem and the observation (1.3).