Inference and Uncertainty Quantification for Noisy Matrix Completion
Yuxin Chen, Jianqing Fan, Cong Ma, Yuling Yan
Introduction
Low-rank matrix completion is concerned with recovering a low-rank matrix, when only a small fraction of its entries are revealed to us [Sre04, CR09, KMO10a]. Tackling this problem in large-scale applications is computationally challenging, due to the intrinsic nonconvexity incurred by the low-rank structure. To further complicate matters, another inevitable challenge stems from the imperfectness of data acquisition mechanisms, wherein the acquired samples are contaminated by a certain amount of noise.
Fortunately, if the entries of the unknown matrix are sufficiently de-localized and randomly revealed, this problem may not be as hard as it seems. Substantial progress has been made over the past several years in designing computationally tractable algorithms — including both convex and nonconvex approaches — that allow to fill in unseen entries faithfully given only partial noisy samples [CP10, NW12, KLT11, KMO10b, CW15, MWCC17, CCF+19]. Nevertheless, modern decision making would often require one step further. It not merely anticipates a faithful estimate, but also seeks to quantify the uncertainty or “confidence” of the provided estimate, ideally in a reasonably accurate fashion. For instance, given an estimate returned by the convex approach, how to use it to compute a short interval that is likely to contain a missing entry?
Conducting effective uncertainty quantification for noisy matrix completion is, however, far from straightforward. For the most part, the state-of-the-art matrix completion algorithms require solving highly complex optimization problems, which often do not admit closed-form solutions. Of necessity, it is generally very challenging to pin down the distributions of the estimates returned by these algorithms. The lack of distributional characterizations presents a major roadblock to performing valid, yet efficient, statistical inference on the unknown matrix of interest.
It is worth noting that a number of recent papers have been dedicated to inference and uncertainty quantification for various high-dimensional problems in high-dimensional statistics, including Lasso [ZZ14, vdGBRD14, JM14a], generalized linear models [vdGBRD14, NL17, BFL+18], graphical models [JVDG15, RSZZ15, MLL17]), amongst others. Very little work, however, has looked into noisy matrix completion along this direction. While non-asymptotic statistical guarantees for noisy matrix completion have been derived in prior theory, most, if not all, of the estimation error bounds are supplied only at an order-wise level. Such order-wise error bounds either lose a significant factor relative to the optimal guarantees, or come with an unspecified (but often enormous) pre-constant. Viewed in this light, a confidence region constructed directly based on such results is bound to be overly conservative, resulting in substantial over-coverage.
2 A glimpse of our contributions
This paper takes a substantial step towards efficient inference and uncertainty quantification for noisy matrix completion. Specifically, we develop a simple procedure to compensate for the bias of the commonly used convex and nonconvex estimators. The resulting de-biased estimators admit nearly accurate non-asymptotic distributional guarantees. Such distributional characterizations in turn allow us to reason about the uncertainty of the obtained estimates vis-à-vis the unknown matrix. While details of our main findings are postponed to Section 3, we would like to immediately single out a few important merits of the proposed inferential procedures and theory:
Our results enable two types of uncertainty assessment, namely, we can construct (i) confidence intervals for each entry — either observed or missing — of the unknown matrix; (ii) confidence regions for the low-rank factors of interest (modulo some unavoidable global ambiguity).
Despite the complicated statistical dependency, our procedure and theory do not rely on sample splitting, thus avoiding the unnecessary widening of confidence intervals / regions due to insufficient data usage.
The confidence intervals / regions constructed based on the proposed procedures are, in some sense, optimal.
We present a unified approach that accommodates both convex and nonconvex estimators seamlessly.
As a byproduct, we characterize the Euclidean estimation errors of the proposed de-biased estimators. Such error bounds are sharp and match an oracle lower bound precisely (including the pre-constant). To the best of our knowledge, this is the first theory that demonstrates that a computationally feasible algorithm can achieve the statistical limit including the pre-constant.
All of this is built upon the intimate link between convex and nonconvex estimators [CCF+19], as well as the recent advances in analyzing the stability of nonconvex optimization against random noise [MWCC17].
Models and notation
To cast the noisy matrix completion problem in concrete statistical settings, we adopt a model commonly studied in the literature [CR09]. We also introduce some useful notation.
Observation models.
What we observe is a random subset of noisy entries of ; more specifically, we observe
Incoherence conditions.
Clearly, not all matrices can be reliably estimated from a highly incomplete set of measurements. To address this issue, we impose a standard incoherence condition [CR09, Che15] on the singular subspaces of (i.e. and ):
Asymptotic notation.
Here, (or ) means for some constant , means for some constant , means for some constants , and means . We write to indicate that for some small constant (much smaller than 1), and use to indicate that for some large constant (much larger than 1).
Inferential procedures and main results
The proposed inferential procedure lays its basis on two of the most popular estimation paradigms — convex relaxation and nonconvex optimization — designed for noisy matrix completion. Recognizing the complicated bias of these two highly nonlinear estimators, we shall first illustrate how to perform bias correction, followed by a distributional theory that establishes the near-Gaussianity and optimality of the proposed de-biased estimators.
We first review in passing two tractable estimation algorithms that are arguably the most widely used in practice. They serve as the starting point for us to design inferential procedures for noisy low-rank matrix completion. The readers familiar with this literature can proceed directly to Section 3.2.
Recall that the rank function is highly nonconvex, which often prevents us from computing a rank-constrained estimator in polynomial time. For the sake of computational feasibility, prior works suggest relaxing the rank function into its convex surrogate [Faz02, RFP10]; for example, one can consider the following penalized least-squares convex program
or using our notation ,
Here, is the nuclear norm (the sum of singular values, which is a convex surrogate of the rank function), and is some regularization parameter. Under mild conditions, the solution to the convex program (3.1) provably attains near-optimal estimation accuracy (in an order-wise sense), provided that a proper regularization parameter is adopted [CCF+19].
Nonconvex optimization.
Intimate connections between convex and nonconvex estimates.
Denote by any minimizer of the convex program (3.1), and denote by the estimate returned by Algorithm 1 aimed at solving (3.4). As was recently shown in [CCF+19], when the regularization parameter is properly chosen, these two estimates obey (see (A.12) in Appendix A.2 for a precise statement)
2 Constructing de-biased estimators
We are now well equipped to describe how to construct new estimators based on the convex estimate and the nonconvex estimate , so as to enable statistical inference. Motivated by the proximity of the convex and nonconvex estimates and for the sake of conciseness, we shall abuse notation by using the shorthand for both convex and nonconvex estimates; see Table 1 and Appendix B for precise definitions. This allows us to unify the presentation for both convex and nonconvex estimators.
Given that both (3.1) and (3.4) are regularized least-squares problems, they behave effectively like shrinkage estimators, indicating that the provided estimates necessarily suffer from non-negligible bias. In order to enable desired statistical inference, it is natural to first correct the estimation bias.
A natural de-biasing strategy that immediately comes to mind is the following simple linear transformation (recall the notation in Table 1):
To remedy this issue, we propose to further project onto the set of rank- matrices, leading to the following de-biased estimator
The estimator (3.7) can be viewed as performing one iteration of singular value projection (SVP) [MJD09, DC18] on the current estimate .
The estimator (3.7) also bears a similarity to the de-biased estimator proposed by [Xia18] for low-rank trace regression; the disparity between them shall be discussed in Section 4.
An equivalent form: a de-shrunken estimator for the low-rank factors.
It turns out that the de-biased estimator (3.7) admits another almost equivalent representation that offers further insights. Specifically, we consider the following de-shrunken estimator for the low-rank factors
where we recall the definition of and in Table 1. To develop some intuition regarding why this is called a de-shrunken estimator, let us look at a simple scenario where is the SVD of and , . It is then self-evident that
In words, and are obtained by de-shrinking the spectrum of and properly.
As we shall formalize in Section 5.1, the de-shrunken estimator (3.8) for the low-rank factors is nearly equivalent to the de-biased estimator (3.7) for the whole matrix, in the sense that
3 Main results: distributional guarantees
The proposed estimators admit tractable distributional characterizations in the large- regime, which facilitates the construction of confidence regions for many quantities of interest. In particular, this paper centers around two types of inferential problems:
Each entry of the matrix : the entry can be either missing (i.e. predicting an unseen entry) or observed (i.e. de-noising an observed entry). For example, in the Netflix challenge, one would like to infer a user’s preference about any movie, given partially revealed ratings [CR09]. Mathematically, this seeks to determine the distribution of
Suppose that the sample size and the noise obey
A more complete version can be found in Theorem 5.
Another interesting feature — which we shall make precise in the proof of this theorem — is that: for any given , the two random vectors and are nearly statistically independent. This is crucial for deriving inferential guarantees for the entries of the matrix.
Theorem 1 is a non-asymptotic result. In words, Theorem 1 decomposes the estimation error (resp. ) into a Gaussian component (resp. ) and a residual term (resp. ). If the sample size is sufficiently large and the noise size is sufficiently small, then the residual terms are much smaller in size compared to and . To see this, it is helpful to leverage the Gaussianity (3.15a) and compute that: for each , the th row of obeys
in other words, the typical size of the th row of is no smaller than the order of . In comparison, the size of each row of (see (3.16)) is much smaller than (and hence smaller than the size of the corresponding row of ) with high probability, provided that (3.13) is satisfied.
Equipped with the above master decompositions of the low-rank factors and Remark 4, we are ready to present a similar decomposition for the entry .
For each , define the variance as
where (resp. ) denotes the th (resp. th) row of (resp. ). Suppose that
where and the residual obeys with probability exceeding .
In the symmetric case where the noise , the truth , and the sampling pattern are all symmetric (i.e. \mathcal{P}_{\Omega}(\bm{E})=\big{(}\mathcal{P}_{\Omega}(\bm{E})\big{)}^{\top} and ), the variance (cf. (3.17)) for the diagonal entries has a different formula; more specifically, it is straightforward to extend our theory to show that
This additional multiplicative factor of 2 arises since and are identical (and hence not independent) in this symmetric case. The variance formula for any () remains unchanged.
Several remarks are in order. To begin with, we develop some intuition regarding where the variance comes from. By virtue of Theorem 1, one has the following Gaussian approximation
Assuming that the first-order expansion is reasonably tight, one has
According to Remark 4, and are nearly independent. It is thus straightforward to compute the variance of (3.20) as
Here, (i) relies on (3.20) and the near independence between and ; (ii) uses the variance formula in Theorem 1; (iii) arises from the definitions of and (cf. (2.2)). This computation explains (heuristically) the variance formula .
Given that Theorem 2 reveals the tightness of Gaussian approximation under conditions (3.18), it in turn allows us to construct nearly accurate confidence intervals for each matrix entry . This is formally summarized in the following corollary, the proof of which is deferred to Appendix F. Here and throughout, we use to denote the interval .
Denote by the CDF of a standard Gaussian random variable and by its inverse function. Let
be the empirical estimate of the theoretical variance . Then one has
In words, Corollary 1 tells us that for any fixed significance level , the interval
is a nearly accurate two-sided confidence interval of .
In addition, we remark that when (and hence ), the above Gaussian approximation is completely off. In this case, one can still leverage Theorem 1 to show that
Last but not least, the careful readers might wonder how to interpret our conditions on the sample complexity and the signal-to-noise ratio. Take the case with for example: our conditions read
4 Lower bounds and optimality for inference
It is natural to ask how well our inferential procedures perform compared to other algorithms. Encouragingly, the de-biased estimator is optimal in some sense; for instance, it nearly attains the minimum covariance among all unbiased estimators. To formalize this claim, we shall
Quantify the performance of two ideal estimators with the assistance of an oracle;
Demonstrate that the performance of our de-biased estimators is arbitrarily close to that of the ideal estimators.
In what follows, we denote by (resp. ) the th row of (resp. ).
Suppose that there is an oracle informing us of , and that we observe the same set of data as in (2.3). Under such an idealistic setting and for any given , the following least-squares estimator achieves the minimum covariance among all unbiased estimators for the th row of (see e.g. [Sha03, Theorem 3.7])
In other words, for any unbiased estimator of (conditional on ), one has
where \mathsf{Cov}\big{(}\bm{X}_{i,\cdot}^{\mathsf{ideal}}\,\big{|}\,\Omega\big{)} is precisely the Cramér-Rao lower bound (conditional on ) under this ideal setting. As it turns out, with high probability, this lower bound concentrates around , as stated in the following lemma. The proof is postponed to Appendix H.1.
Fix an arbitrarily small constant . Suppose that for some sufficiently large constant independent of . Then with probability at least , one has
Given that can be an arbitrarily small constant, Lemma 1 uncovers that the covariance of the de-shrunken estimator (cf. Theorem 1) matches that of the ideal estimator , thus achieving the Cramér-Rao lower bound with high probability. The same conclusion applies to as well.
Suppose that there is another oracle informing us of and ; that is, everything about except and everything about except . In addition, we observe the same set of data as in (2.3), except that we do not get to see .The exclusion of is merely for ease of presentation. One can consider the model where all with are observed with a slightly more complicated argument. Under this idealistic model, the Cramér-Rao lower bound [Sha03, Theorem 3.3] for estimating can be computed as
This means that any unbiased estimator of must have variance no smaller than . This quantity admits a much simpler lower bound as follows, whose proof can be found in Appendix H.2.
Fix an arbitrarily small constant . Suppose that for some sufficiently large constant independent of . Then with probability at least ,
where is defined in Theorem 2.
Similar to Lemma 1, Lemma 2 reveals that the variance of our de-biased estimator (cf. Theorem 2) — which certainly does not have access to the side information provided by the oracle — is arbitrarily close to the Cramér-Rao lower bound aided by an oracle.
All in all, the above lower bounds demonstrate that the degrees of uncertainty underlying our de-shrunken low-rank factors and de-biased matrix are, in some sense, statistically minimal.
5 Back to estimation: the de-biased estimator is optimal
While the emphasis of the current paper is on inference, we would nevertheless like to single out an important consequence that informs the estimation step. To be specific, the decompositions and distributional guarantees derived in Theorem 1 and Theorem 2 allow us to track the estimation accuracy of , as stated in the following theorem. The proof of this result is postponed to Appendix G.
In stark contrast to prior statistical estimation guarantees (e.g. [CP10, NW12, KLT11, CCF+19]), Theorem 3 pins down the estimation error of the proposed de-biased estimator in a sharp manner (namely, even the pre-constant is fully determined). Encouragingly, there is a sense in which the proposed de-biased estimator achieves the best possible statistical estimation accuracy, as revealed by the following result.
Fix an arbitrarily small constant . Suppose that , and that . Then with probability exceeding , any unbiased estimator of obeys
Intuitively, the term reflects approximately the underlying degrees of freedom in the true subspace of interest (i.e. the tangent space of the rank- matrices at the truth ), whereas the factor captures the effect due to sub-sampling. This result has already been established in [CP10, Section III.B] (together with [CR09, Theorem 4.1]). We thus omit the proof for conciseness. The key idea is to consider an oracle informing us of the true tangent space . ∎
The implication of the above two theorems is remarkable: the de-biasing step not merely facilitates uncertainty assessment, but also proves crucial in minimizing the estimation errors. It achieves optimal statistical efficiency in terms of both the rate and the pre-constant. As far as we know, this is the first theory about a polynomial time algorithm that matches the statistical limit in terms of the pre-constant. This intriguing finding is further corroborated by numerical experiments; see Section 3.6 for details (in particular, Figure 3).
6 Numerical experiments
We conduct numerical experiments on synthetic data to verify the distributional characterizations provided in Theorem 1 and Theorem 2. Note that our main results hold for the de-biased estimators built upon and . As we will formalize shortly in Section 5.1, these two de-biased estimators are extremely close to each other; see also Figure 4 for experimental evidence. Therefore, in order to save space, we use the de-biased estimator built upon the convex estimate throughout the experiments.
We begin by checking the validity of Theorem 1. Suppose that one is interested in estimating the inner product between and (). In the Netflix challenge, this might correspond to the similarity between the th user and the th one. As a straightforward consequence of Theorem 1, the normalized estimation error
is extremely close to a standard Gaussian random variable. Here, similar to (3.22), we let
be the empirical estimate of the theoretically predicted variance . As a result, a confidence interval of would be . For each , we define to be the empirical coverage rate of over Monte Carlo simulations. Correspondingly, denote by (resp. ) the average (resp. the standard deviation) of over indices . Table 2 collects the simulation results for different values of . As can be seen, the reported empirical coverage rates are reasonably close to the nominal level . In addition, Figure 1 depicts the Q-Q (quantile-quantile) plots of and vs. the standard Gaussian random variable over 200 Monte Carlo simulations for , and . It is clearly seen that all of these are well approximated by a standard Gaussian random variable.
Next, we turn to Theorem 2, namely the distributional guarantee for the entries of the matrix. Denote
where is the empirical variance defined in (3.22). In view of the confidence interval predicted by Corollary 1, and similar to what have done for the low-rank components, for each , we define to be the empirical coverage rate of over Monte Carlo simulations. Correspondingly, denote by (resp. ) the average (resp. the standard deviation) of over indices . As before, Table 3 gathers the empirical coverage rates for and Figure 2 displays the Q-Q (quantile-quantile) plots of and vs. the standard Gaussian random variable over 200 Monte Carlo trials for , and . It is evident that the distribution of matches that of reasonably well.
We conclude this section with experiments on real data. Similar to [CP10], we use the daily temperature data [NCD19] for 1400 stations across the world in 2018, which results in a data matrix. Inspection on the singular values reveals that the data matrix is nearly low-rank. We vary the observation probability from to and randomly subsample the data accordingly. Based on the observed temperatures, we then apply the proposed methodology to obtain confidence intervals for all the entries. Table 4 reports the empirical coverage probabilities, the average length of the confidence intervals as well as the estimation error of both and over 20 independent experiments. It can be seen that the average coverage probabilities are reasonably close to and the confidence intervals are also quite short. In addition, the estimation error of is smaller than that of , which corroborates our theoretical prediction. The discrepancy between the nominal coverage probability and the actual one might arise from the facts that (1) the underlying true temperature matrix is only approximately low-rank, and (2) the noise in the temperature might not be independent.
7 A bit of intuition
We pause to develop some intuition behind the distributional guarantees for the proposed estimators. Bearing in mind the intimate link between convex and nonconvex optimization (cf. (3.5)), it suffices to concentrate on the nonconvex problem (3.4). For the sake of clarity, we further restrict attention to the rank-1 positive semidefinite case where and set , where one can focus on
Any optimizer of (3.34) would necessarily satisfy the first-order optimality condition
We shall also assume that is a reasonably reliable estimate obeying .
We begin with the no-missing-data case (i.e. ), which already conveys the key insight. The condition (3.35) simplifies to
which, through a little manipulation, leads to an equivalent decomposition:
Then: (1) the third term of (3.37), which can be viewed as a second-order term (in the sense that it is a quadratic term of ), becomes vanishingly small when ; (2) while the second term of (3.37) looks like a first-order term, it is natural to conjecture that is sufficiently random and hence (i.e. the estimation error is not aligned with ), meaning that this term is also expected to be negligible compared to a typical first-order term (e.g. the term on the left-hand side of 3.37). In summary, these non-rigorous arguments suggest that
If one can be convinced that and are only weakly dependent, then this means
Returning to the missing data scenario with , everything is based on the following approximation
this is certainly expected — using standard concentration arguments — if we “pretend” that and are statistically independent. With this approximation in mind, one can translate (3.35) into
Repeating the above argument then immediately yields
The case with can be intuitively understood in a very similar way by first de-shrinking the estimate; we omit it here for brevity. We note that these hand-waving arguments can all be made rigorous, which is the main content of the proof.
8 Inference based on spectral estimates?
One would naturally be curious about whether there are other estimation procedures that also enable reasonable statistical inference. While this is beyond the scope of the current paper, we take a moment to discuss one alternative: the spectral method, as pioneered by [KMO10a, KMO10b] in the matrix completion problem. In a nutshell, this approach consists in computing a rank- approximation to , which is precisely the spectral initialization widely used in a two-stage nonconvex algorithm (cf. Algorithm 1) [KMO10b, SL16, CW15, CCF18, MWCC17]. While inference has not been, as far as we know, the focus of prior work on spectral methods,We note that inference from spectral estimates has been investigated in other context beyond matrix completion (e.g. the model without missing data [Xia19, FFHL19]). the recent papers [AFWZ17, MWCC17] hinted at the possibility of characterizing the distribution of the spectral estimate. Take a simple symmetric rank-1 case for example (i.e. with ): the leading eigenvector of often admits the following approximation (up to a global sign)
Expanding , we arrive at
In words, two major factors dictate the uncertainty of the spectral estimate: (1) the additive Gaussian noise (cf. the 1st term on the right-hand side of (3.42)), and (2) random sub-sampling (in particular, the randomness incurred by employing the sub-sampled to approximate the truth ). Given that the random sub-sampling effect cannot be ignored at all, the spectral estimates often suffer from a much larger estimation error (and hence a higher degree of uncertainty) compared to either the convex or the nonconvex estimates. In truth, this random sub-sampling effect does not go away even when the noise vanishes. Consequently, uncertainty quantification based on the spectral estimates may not be the most desirable option.
Prior art
Low-rank matrix completion, or more broadly, low-rank matrix recovery, is a fundamental task that permeates through a wide spectrum of applications in science, engineering, and finance (e.g. [RS05, SY07, CC14, FSZZ18, CCG15, ZPL15, BN06, CC18b, FWZ19, KS11, CZ16, KX15, FLM13, DR17, CDDD19, DPVW14, SZ12, FS11]). A paper of this length is unable to review all papers motivating and contributing to this enormous subject; interested readers are referred to [DR16, CC18a] for extensive discussions of motivating applications as well as the exciting recent development.
Numerous algorithms have been proposed to solve this problem efficiently, with two paradigms being arguably the most widely used: convex relaxation and nonconvex optimization. We briefly review the literature contributing to these two paradigms.
Nonconvex optimization algorithms, as pioneered by [KMO10a, Sre04], become increasingly more popular for solving various low-rank factorization problems, due to their appealing computational complexities [JNS13, CLS15, CC17, TBS+16, SL16, ZL16, CCFM19, WZG16, CLL19]. For instance, the gradient-based nonconvex methods have been analyzed for noisy matrix completion [KMO10b, CW15, MWCC17, CCF+19], which are shown to achieve near-optimal statistical accuracy and linear-time convergence guarantees all at once. Going beyond gradient methods, we note that other nonconvex methods (e.g. [RS05, JMD10, WYZ12, JNS13, FRW11, Van13, LXY13, Har14, JKN16, RT11, WCCL16, DC18, ZWL15, ZWYG18, MSL19, CCD+19]) and landscape properties [GLM16, CL17, GJZ17, ZSL19, ZJSL18, SXZ19] have been largely explored as well. The interested readers are referred to [CLC19] for an in-depth discussion. One limitation, however, is that the theoretical guarantees provided for nonconvex algorithms often exhibit sub-optimal dependency in the rank of the unknown matrix; for instance, most theory requires a sample complexity of at least (in fact, often much larger than ). This is outperformed by the convex relaxation approach.
Despite these recent developments, very little work has investigated statistical inference for noisy matrix completion. While [CKLN18, CKL16, CN15, CEGN15] discussed the construction of “honest” confidence regions, the volume of these regions is dependent on some (possibly huge) hidden constants, thus resulting in over-coverage. Perhaps the closest to our paper is the recent work [Xia18], which investigated inference for low-rank trace regression. Employing a closely related de-biased estimator with sample splitting, the paper [Xia18] established asymptotic normality of a certain projected distance between the estimate and the truth. The result therein, however, requires a sampling mechanism obeying the restricted isometry property (e.g. i.i.d. Gaussian designs), which fails to hold for matrix completion. Also, our approach does not require sample splitting — a technique that is convenient for analysis but conservative in constructing confidence regions. Another work by Cai et al. [CLR16] developed a unified approach to provide inference guarantees for linear inverse problems including low-rank matrix estimation. Their results, however, require the sample size to exceed the total dimension even under the Gaussian design. Finally, a recent line of work [MX17] explored uncertainty quantification under the Bayesian setting, hypothesizing on a special prior regarding the true matrix. This departs drastically from the scenario considered herein.
Inference in high-dimensional problems.
Inference in high-dimensional sparse regression has received much attention in the last few years [WR09, ZZ14, BCH11, vdGBRD14, JM14b, DBMM15, CG17, NL17, NNLL18, LSST16, LTTT14, MMB09, DBZ17, ZC17, BFL+18]. Our inferential approach is partly inspired by the recent developments on this topic, particularly with regard to the de-biased / de-sparsified estimators proposed for Lasso. More specifically, recognizing the non-negligible bias of the Lasso estimate
A line of work [ZZ14, vdGBRD14, JM14a] came up with a linear transformation of of the form
Interestingly, our de-biased estimator (3.7) for matrix completion admits a very similar form as (4.2). To see this, recall that our de-biased estimator is given by
where can be either or (see Table 1). Let be the tangent space of the set of rank- matrices at (resp. ) in the convex (resp. nonconvex) case, and be the projection operator onto . Somewhat surprisingly, replacing by does not affect the de-biased estimator by much, in the sense that
In addition, recognizing that almost lies within the tangent space ,More precisely, if , then ; if , one has .one can rewrite
Finally, de-biased estimators have been put forward to tackle other high-dimensional problems, including but not limited to generalized linear models [vdGBRD14, NL17], graphical models [JVDG15, RSZZ15, MLL17, JvdG17], sparse PCA [JvdG18], treatment effects estimation [CCD+18, AIW18]. These are beyond the scope of the current paper.
Architecture of the proof
This section outlines the main steps for establishing Theorem 1 and Theorem 2. Before starting, we introduce some useful notation. For convenience of presentation, we insert the factor into (3.4) and redefine the nonconvex loss function as
In addition, for each , we define the indicator , which is a Bernoulli random variable with mean .
We also note that Theorem 1 (resp. Theorem 2) is subsumed by Theorem 5 (resp. Theorem 6). As a result, we shall focus on establishing Theorem 5 (resp. Theorem 6) when it comes to estimating low-rank factors (resp. the entries of the matrix).
Instate the assumptions of Theorem 5. Recall the definition of in (3.17). Then one has the following decomposition
where and the residual obeys — with probability exceeding — that
Note that Theorem 5 and Theorem 6 are concerned with the de-biased estimators built upon both convex and nonconvex estimates. At first glance, one needs to establish theoretical guarantees for each of them separately. Fortunately, as alluded to previously (cf. (3.5)), the convex and nonconvex estimates are extremely close — a fact that has been established in [CCF+19]. The proximity of these two estimates naturally extends to the de-biased estimators constructed based on them. As a result, it suffices to concentrate on proving the theorems for any of these estimators; the claims for the other one follow immediately.
The following key lemma formalizes this argument, which will be established in Appendix C (see also Figure 4 for numerical evidence). Before continuing, we remind the readers of the key notation (see Appendix B for precise definitions):
: an approximate solution to the nonconvex problem (3.4) (see Appendix A.1);
: the de-biased estimators built upon the convex optimizer ;
: the de-biased estimators built upon the nonconvex estimate .
Suppose that the sample size obeys for some sufficiently large constant and the noise satisfies for some sufficiently small constant . Set with some large enough constant .
With probability at least , one has
where is the set of rotation matrices.
With probability exceeding , one has
In short, the first part of Lemma 3 tells us that
whereas the second part of Lemma 3 justifies that the proposed de-biased estimator is closely approximated by a linearized version (cf. (4.4)). Note that this linearized form bears a resemblance to the de-biased estimators developed for sparse linear regression [ZZ14, vdGBRD14, JM14a].
With Lemma 3 in place, we shall, from now on, focus on proving the main theorems for the nonconvex estimators, viz.
establishing Theorem 5 for the de-shrunken low-rank factors ;
establishing Theorem 6 for the de-biased matrix estimator .
To simplify the presentation hereafter, we shall use the following notation throughout the rest of this section:
: the nonconvex estimate ;
: the de-shrunken estimate defined in (3.8) based on ;
.
2 A precise characterization of the de-shrunken low-rank factors
We start with a precise characterization of the de-shrunken low-rank factors and , which paves the way for demonstrating both Theorem 5 and Theorem 6.
One has the following decompositions for and
which measures the imbalance between the low-rank factors and .
The claims follow from straightforward algebraic manipulations; see Appendix D.1. ∎
We make a few observations regarding Lemma 4. Take the decomposition of (5.8a) as an example:
First, the term vanishes when we have full observations, i.e. . Second, the terms involving and are both zero if is an exact stationary point of ; to see this, it is not hard to verify that any stationary point of necessarily satisfies , which in turn implies . Consequently, the last three terms in (5.8a) are expected to be small when is sufficiently large and is near a stationary point.
We shall make precise these arguments in subsequent subsections.
3 Taking global rotation into account
In order to invoke the decompositions of and (cf. Lemma 4) to characterize the estimation errors, we still need to incorporate the (unrecoverable) rotation matrix. From now on, we shall focus primarily on the factor . The claims on the other factor can be easily obtained via symmetry.
where we recall that is the rotation matrix that best aligns and (see (3.12)). Substituting the identity
into the decomposition (5.8a), we arrive at
4 Key lemmas for establishing Theorem 5
We now state five key lemmas. Taking these collectively and substituting them into (5.11) immediately establish Theorem 5.
We shall start by controlling the term as defined in (5.12).
Suppose that the sample complexity obeys for some sufficiently large constant and the noise satisfies for some sufficiently small constant . Then with probability at least , we have
Fix any . If the de-shrunken estimate were independent of the randomness in the th row of the matrix, i.e. , then would be well controlled. This hypothesis is certainly false, as clearly depends on . Nevertheless, by exploiting the leave-one-out technique recently used in [EKBB+13, EK15, AFWZ17, MWCC17, CFMW19, CCF+19, CLL19, DC18], one can properly decouple the dependency and establish the desired bound. See Appendix D.2.∎
The next lemma controls the size of . In essence, the term measures the difference between the estimate and the true signal ; the closer these two are, the smaller should be. See Appendix D.3 for the proof of the following result.
Suppose that the sample complexity obeys for some sufficiently large constant and the noise satisfies for some sufficiently small constant . Then with probability exceeding , one has
Moving on to and , one has the following lemmas.
Suppose that the sample complexity obeys for some sufficiently large constant and the noise satisfies for some sufficiently small constant . Then with probability exceeding , we have
It is straightforward to check that when , one has , where we recall the definition of in (5.7). Therefore, one expects to be small when is sufficiently large. See Appendix D.4.∎
Suppose that the sample complexity obeys for some sufficiently large constant and the noise satisfies for some sufficiently small constant . Then with probability at least , one has
It is easily seen that the size of depends on how close is to a stationary point of . For instance, in the extreme case where is an exact stationary point, then one would have . See Appendix D.5.∎
The last lemma asserts that is, in some sense, close to a zero-mean Gaussian random matrix with the desired covariance.
Suppose that the sample size obeys for some sufficiently large constant . Then one has the decomposition
Fix any . The th row, namely is conditionally Gaussian in the sense that
where we recall that . Recognize that the conditional covariance matrix concentrates sharply around its expectation, i.e. , which is the covariance matrix of that we are after. Hence, one can expect that is, marginally, not too far from a Gaussian random matrix. This argument can be carried out formally; see Appendix D.6. ∎
5 From low-rank factors to matrix entries (Proof of Theorem 6)
We now turn attention to inference on the matrix entries, by establishing Theorem 6. Towards this, we first make the following observation: for any ,
One can readily apply the decompositions in Theorem 5 to obtain
Take the preceding three identities collectively to reach
Following the same route as in Section 5.4, one can verify that is approximately Gaussian, whereas the residual term is small in magnitude. These claims are formally stated in the next two lemmas, with the proofs deferred to Appendix E.
Suppose that the sample complexity obeys for some sufficiently large constant and the noise satisfies for some sufficiently small constant . Then with probability exceeding , one has
Suppose that for some sufficiently large constant . Then we have the decomposition
where and the remaining term satisfies — with probability exceeding — that
Putting the above two lemmas together immediately establishes Theorem 6 and hence Theorem 2.
Discussion
The present paper makes progress towards inference and uncertainty quantification for noisy matrix completion, by developing simple de-biased estimators that admit tractable and accurate distributional characterizations. While we have achieved some early success in accomplishing this, our results are likely sub-optimal in the following aspects:
Dependency on the rank and the condition number. To enable valid inference, our sample complexity (cf. (3.13) and (3.18a)) scales sub-optimally with the rank and the condition number . The sub-optimality can be understood through comparisons with the sample size requirement in the noise-free settings, which is independent of and matches the information limit (up to some factor). Improving such dependency calls for more refined analysis techniques.
Detection of the size of the entries. On one hand, when the size of the entry is moderately large (cf. (3.18b)), Corollary 1 allows us to construct a valid confidence interval for it. On the other hand, when vanishes, Theorem 5 tells us that the estimation error is better approximated by the inner product of two independent Gaussian random vectors. It remains to be seen how to determine whether is too small.
Low signal-to-noise (SNR) regime. Our theory operates under the moderate-to-high SNR regime, where (which is proportional to the SNR) is required to exceed the order of ; see the conditions in Theorem 5. It is unclear whether the connection between the convex and the nonconvex estimators hold in the low SNR regime. How to conduct inference in such a scenario is an important future direction.
In addition, our investigation has been dedicated to a natural random model, which by no means covers the most general settings of practical interest. There are numerous possible extensions that merit future investigation:
Approximate low-rank structure. Our current theory is built upon the exact low-rank structure of . Realistically, the matrix of interest is often only approximately low-rank. It is of great interest to study how to carry out statistical inference under such imperfect structural assumptions.
More general sampling patterns. This paper operates under the uniform random sampling assumption, which might sometimes be off in practical situations. It would be interesting to investigate whether our results in this paper can extend to more general non-uniform sampling patterns (e.g. [NW12]).
Extensions to robust PCA, sparse PCA, and 1-bit matrix completion. A variety of important extensions of matrix completion have been explored in prior literature, including but not limited to the case with sparse outliers (i.e. robust PCA [CLMW11, CSPW11]), the case where the matrix of interest is simultaneously sparse and low-rank (i.e. sparse PCA [ZHT06, CMW13]), and the case where only finite-bit observations are available (i.e. 1-bit matrix completion [DPVW14, CZ13]). Performing valid uncertainty assessment for these scenarios requires non-trivial extensions of the link between convex and nonconvex optimization.
Other loss functions. In the estimation stage, one might sometimes prefer other loss functions beyond the penalized squared loss. This might arise from either statistical considerations (e.g. employing a penalized Poisson log-likelihood to accommodate Poisson noise [CX16]), or computational concerns (e.g. adopting a non-smooth loss to improve convergence [CCD+19]). It would be of fundamental importance to develop a unified inferential framework that covers a broader family of loss functions.
Acknowledgements
Y. Chen is supported in part by the AFOSR YIP award FA9550-19-1-0030, by the ONR grant N00014-19-1-2120, by the ARO grant W911NF-18-1-0303, by the NSF grants CCF-1907661 and IIS-1900140, and by the Princeton SEAS innovation award. J. Fan is supported in part by NSF grants DMS-1662139 and DMS-1712591, ONR grant N00014-19-1-2120, and NIH grant 2R01-GM072611-13. C. Ma is supported in part by Hudson River Trading AI Labs (HAIL) Fellowship. This work was done in part while Y. Chen was visiting the Kavli Institute for Theoretical Physics (supported in part by the NSF grant PHY-1748958). We thank Weijie Su for helpful discussions.
Appendix A Preliminaries
In this section, we gather several notation and preliminary facts that are useful throughout the analysis. All the proofs, if needed, are deferred to Appendix I.
To begin with, we make precise the algorithm used to minimize the nonconvex loss function (5.1). Specifically, we describe the following details that are crucial for us to implement Algorithm 1:
Set the initial point to be or the spectral initialization as in [MWCC17, CLL19];
Set the stepsize ;
Set the maximum number of iterations to be ;
The returned estimate is , where
In words, we run gradient descent in Algorithm 1 until we reach a point whose gradient is exceedingly small.
Many of the preliminary facts below were established for the case [CCF+19], which is certainly not implementable in practice, however, it serves as a good proxy for studying the convex estimator. Fortunately, the same theoretical guarantees stated in Appendix A.2 can be readily established for spectral initialization using almost the same arguments adopted in [MWCC17, CLL19, CCF+19]. We omit this part mainly for the sake of brevity.
To facilitate analysis, we introduce a set of auxiliary nonconvex loss functions. For any , define
be the nonconvex estimate returned by this auxiliary algorithm (i.e. Algorithm 2), which serves as an approximate solution to (A.2).
A.2 Properties of approximate nonconvex solutions
This subsection gathers the properties of the (approximate) nonconvex solutions. Throughout this subsection, we use the shorthand
and recall the definition of in (A.3). The regularization parameter is chosen to satisfy
The claims stated below hold under the sample complexity and the noise condition presumed in [CCF+19, Theorem 1] (see also Theorem 5 in the current manuscript)
The first set of facts is related to . In view of [CCF+19], is a faithful estimateTechnically, the statements in [CCF+19, Lemma 5] are for and . Nevertheless, inspecting their proofs reveals that the claims continue to hold for our choices and . of , in the sense that
hold with probability exceeding . In addition, on the same high-probability event, one has
In words, the first claim ensures that is an approximate stationary point of ; the second bound tells us that is nearly balanced, in the sense that ; the last one formalizes the proximity between the convex solution and the nonconvex one; see also (3.5).
hold with probability at least .
As has been shown in [CCF+19], the leave-one-out auxiliary point satisfies
with probability exceeding .
Parallel to the transition from to , we set
to be the de-shrunken estimators of and , respectively. We shall demonstrate in Appendix I that, with probability at least ,
In addition to these four sets of claims, we have the following immediate consequence of the incoherence condition (2.4)
Moreover, recall that (cf. (5.7)). We obtain from the proof of [CCF+19, Lemma 8] that
Last but not least, we list a few simple but useful results: the nonconvex solution satisfies
The same holds true if we replace by either , or their corresponding low-rank factors. Here can vary from 1 to .
Appendix B Summary of the proposed estimators
Let be the minimizer of the convex program (3.1), and let be the solution returned by the Algorithm 1 (with algorithmic details specified in Appendix A.1). Recall that is the best rank- approximation of , viz.
In addition, we let the matrix estimate obtained by the nonconvex algorithm be . We further denote by the estimate of low-rank factors obtained by convex relaxation; more specifically, we set to be the balanced rank- factorization of obeying and . With these notations in place, our de-biased and de-shrunken estimators can be summarized as follows.
De-shrunken estimators for low-rank factors:
Appendix C Proof of Lemma 3
Throughout this section, let be the rank- SVD of the nonconvex estimate and the tangent space of the set of rank- matrices at . Correspondingly, we denote by the projection operator onto the tangent space , and let , where is the identity operator.
In essence, we intend to justify that , and are all very close to .
Recall from the definition of the de-biased estimator (cf. (B.1a)) that
Replacing by results in
Apply the proximity bound (A.12) to obtain (recall that in (A.12), one has )
as long as . In addition, in view of [CCF+19, Claim 2], one has the decomposition
where the middle line follows since is defined to be the SVD of .
where the last inequality results from (C.3) and (C.5). Combining the above two bounds with the fact that and are orthogonal to each other, we arrive at the conclusion that is the top- SVD of and
Second, let be the top- SVD of . By definition, one has . We are left with checking the two conditions in Lemma 14. To begin with, the perturbation term obeys
where (i) comes from (C.3) and (C.5) and the last inequality (ii) arises since Clearly, the size of the perturbation is much smaller than and hence (cf. (C.8a)). In addition,
where the equality depends on (C.8b) and the last line results from (C.7) and (C.9). Consequently,
Here the first relation arises from (C.8a) and the second holds since , a simple consequence of (A.19). We are now ready to apply Lemma 14 to obtain
where we have used the fact that , which also can be derived from (A.19). The above bound combined with (C.9) yields
as long as . We remark that by setting , one also obtains the bound on , i.e.
We move on to investigating , for which we have the following claim.
Taking the above three bounds collectively and recognizing that yield the advertised bound (5.4a).
The last inequality is the small-gradient condition (see (A.10), in which . Employ the definitions for and (cf. (B.2a) and (B.2b)) to see that
It then boils down to showing that (i) is very close to , and (ii) is small in size.
First, recall that , which combined with (C.12) gives
Here, the last inequality comes from (C.13) and its immediate consequence that .
Second, apply the perturbation bound for matrix square roots (cf. Lemma 13) to obtain
Here, the inequality (i) depends on the facts that
whereas the inequality (ii) holds because of the facts that , and the balancedness condition (A.11)
As a result, the operator norm of is bounded by
Take (C.14), (C.15) and (C.17) collectively to arrive at
C.2 Proof of the inequality (5.4b)
Next, we switch attention to the low-rank factors. Our goal is to demonstrate that and are both extremely close to modulo some global rotation, which will be established in (C.19) and (C.20) shortly.
We start by justifying the proximity between and . In view of (C.12), we know that
The latter bound follows from similar derivations as in (C.16). A similar bound holds for . Recognizing that
Next, we establish the connection between and . To accomplish this, we first study the relationship between and . Recall that and constitute a balanced factorization of , while is a balanced one of . Hence one can view as a perturbation of and investigate the perturbation bounds on the balanced factorizations. Going through the same derivations as in [MWCC17, Appendix B.7] and [CLL19, Appendix B.2.1], one reaches
Here the last relation follows from the proximity of the convex estimator and the nonconvex estimator; see (A.12). Repeating the same argument as above to translate the bound between and to that of and , we conclude that
This together with (C.19) and the assumption concludes the proof.
C.3 Proof of the inequality (5.5)
We shall focus on proving the claim for the nonconvex estimator and ; the claim for the convex estimator can be treated similarly.
To see this, it has been established in Appendix C.1 that
Appendix D Analysis of the low-rank factors
We concentrate on the factor ; the other factor can be treated similarly. By definition of the gradient, one has
with defined in (5.7), we can rearrange (D.1) as follows
By construction, the de-shrunken estimator satisfies the following property
where the last identity follows since and commute. Combining (D.3) with the identity (D.4) gives
Multiplying both sides of (D.5) by and recalling the definition of in (3.8), we have
Since and also commute, we have
where the last relation uses the definition of (see (3.8)).
Substituting the identity (D.7) back into (D.6) and making a few elementary algebraic manipulations yield the desired decomposition (5.8a).
D.2 Proof of Lemma 5
Recall that and similarly define
The triangle inequality tells us that for any fixed ,
In what follows, we shall control and separately.
To begin with, denoting results in
Before proceeding, we gather a few useful facts regarding , as summarized in the following claim.
With probability at least , we have
With the bounds on and in place, we are ready to control . By construction, is independent of . Therefore, the vector on the right-hand side of (D.8), , is a sum of conditionally independent random vectors. In particular, conditional on and , one has
Invoke the concentration inequality for Gaussian random vectors [HKZ12, Proposition 1.1] to see that
with probability at least . It remains to control , which we state in the following claim.
Suppose that . Then with probability exceeding ,
Combine the upper bound on with (D.10) and choose to arrive at
with probability exceeding .
We move on to bounding , for which we have
Here (i) uses the fact that (see [CCF+19, Lemma 3]), the perturbation bounds for pseudo-inverses (see Lemma 12) and (A.19); the penultimate inequality (ii) comes from the fact that (see (A.16c)) and last one uses the incoherence condition (cf. (A.17)).
Combine the bounds on and to reach
Taking the maximum over establishes our bound on .
Apply the perturbation bound for pseudo-inverses (see Lemma 12) to obtain
Here we have utilized the facts that (see (A.16a)) and a simple consequence of (A.19), viz.
Moreover, the triangle inequality tells us that
where the penultimate inequality follows from the facts that , (see (A.16b)) and that
Here the penultimate inequality follows from (A.19). The proof of the claim is then complete. ∎
Conditional on , using Bernstein’s inequality and the fact that and are independent, we arrive at that with probability exceeding ,
As a result, with probability at least , we have
as long as . Here the middle inequality uses Claim 2. In view of the triangle inequality,
with the proviso that . Again, the last line makes use of Claim 2. This concludes the proof of the claim. ∎
D.3 Proof of Lemma 6
where the second inequality follows from the incoherence assumption that (cf. (A.17)) and the fact that , a simple consequence of (A.19).
It remains to control . To simplify notation hereafter, define and . First, observe that
Second, in view of the decomposition of given in (5.8b), we have
The following claim connects with .
This relation together with (D.18) yields
A little algebraic manipulation then gives
It is easy to check from (A.19) that . Hence one can invoke Lemma 15 with , , and to obtain
where we have plugged in the definition of (see (D.19)) and used the identity . Combine the above inequality with (D.16) to obtain
It then boils down to controlling the above terms and .
First, the term can be upper bounded by
Here, the second line utilizes the facts that , and the results in (A.10), (A.13e) and (C.16).
Moving on to , we recall from (A.13b) that
Finally, for the term , by the triangle inequality one has
Observe that conditional on one has
As a result, we obtain that with probability at least ,
Here, the second relation uses the fact that
with probability at least as long as , which follows from [MWCC17, Lemma 38] or [CR09, Section 4.2] by observing that lies in the tangent space of . Take (D.21) and (D.22) collectively to reach
Substituting the bounds on , , and back to (D.20) results in
Taking the maximum over leads to the desired result.
Finally, we are left with proving Claim 4.
First, by , one can obtain
Second, since is the best rotation matrix to align and , we know from [MWCC17, Lemma 35] that
D.4 Proof of Lemma 7
Here (i) and (iv) rely on the unitary invariance of the operator norm, (ii) uses the definition of (see (3.8)) and (iii) follows from the choice and immediate consequences of (A.19)
Therefore, it suffices to control . To this end, we have the following decomposition
we can rewrite the first term of (D.25) as
Since is independent of the right hand side of the above equation can be viewed as a sum of independent random vectors, conditional on . Invoke Bernstein’s inequality to see that
holds with probability at least . Here, we denote
Here the penultimate inequality uses (A.14d) and the bound . We arrive at the conclusion that: with probability exceeding ,
Next, we move on to the second term of (D.25), which can be further decomposed as follows
In what follows, we bound and sequentially.
Regarding , using the definition of we obtain
where the second relation holds due to (A.14b) and the fact that (cf. (A.18)).
Moving on to , we can utilize the identity
With regards to , we have by Bernstein’s inequality and (A.14b) that
holds with probability exceeding . Here, we define
provided that . Here we apply the bounds and (see (A.19) and the following remarks). In the end, we turn to the term , which obeys
where the second line arises from the Cauchy-Schwarz inequality.
Take the previous bounds collectively to arrive at
as long as . Finally, we conclude that
D.5 Proof of Lemma 8
First, it is straightforward to verify that
where the last line arises from (A.10), the choice (cf. (A.6)), and the bounds
Here the latter two are immediate consequences of (A.19). Second, with regards to the term involving , we have
where the middle line uses (A.19) and the last one follows from (C.16).
Combine (D.27), (D.28) and the triangle inequality to establish the advertised result, with the proviso that .
D.6 Proof of Lemma 9
We invoke the identity (since ) to see that for any ,
consists of a sum of independent random vectors, where we recall that . In addition, the right-hand side of the above formula is conditionally Gaussian, namely,
Note that depends on the index through . Denote by the expectation of , that is,
Clearly, when , one has on the event and hence is well-defined. As a result, on the event , we have
As can be easily seen from (D.29) and (D.30), each row of follows the Gaussian distribution
It remains to show that, with high probability,
In what follows, we shall bound the two terms on the right-hand side of the above display sequentially.
First, observe that involves a sum of independent random vectors with
where denotes the sub-exponential norm [Ver17]. One can then apply the matrix Bernstein inequality [Kol11, Theorem 2.7] to conclude that with probability at least ,
Next, we move on to . Recall that on the event , one has
This together with the fact that gives
with the proviso that . Therefore, straightforward calculations yield
Here the second relation is the perturbation bound for the matrix square roots (see Lemma 13).
Combine the above two bounds to conclude that
with probability at least . Here the last line utilizes the matrix Bernstein inequality, where
Consequently with probability exceeding one has
Appendix E Analysis of the entries of the matrix
The term can be naturally split into two terms, namely
In what follows, we shall bound each term individually.
Regarding the first term, one sees from Theorem 5 that with probability exceeding
where the last line follows since and .
Turning to the second term, we have by the Cauchy-Schwarz inequality that
where the penultimate inequality uses (A.13d) and the last one depends on the incoherence assumption that (see (A.17)).
Take collectively the above two bounds to complete the proof.
E.2 Proof of Lemma 11
If and were independent, then clearly one would have
As such, the main ingredient of the proof boils down to demonstrating that and are nearly independent.
Fortunately, this weak dependency can be easily decoupled, for which we have the following claim.
Suppose that . One has the decomposition
where and is independent of and hence of . In addition, with probability at least one has
The desired result follows immediately from Claim 5, since
Similarly, repeating the same argument above, we can also show that can be decomposed as a Gaussian random variable as well as a residual term bounded above by with high probability. These together finish the proof.
Instate the notation used in Appendix D.6. Recall that
To remove the effect of on , we construct an auxiliary random matrix as follows
where ,
It is easily seen that ; more importantly is independent of and hence of .
which together with the triangle inequality and the fact yields
Here we have used the results in (D.32) and (D.33). We are left with bounding , for which we have
Take the above bound collectively with (D.33) to yield
as long as . As a result, we have
where the middle line relies on the perturbation of matrix square roots; see Lemma 13. Combining all, we arrive at
with the proviso that . This finishes the proof. ∎
Appendix F Proof of Corollary 1
This section is dedicated to establishing the following result, which subsumes Corollary 1 as a special case.
Suppose that the conditions (3.18) hold, and recall the notation in Corollary 1. Then one has
Before entering the main proof of Corollary 2, we make a simple observation that
where we recall that and . Here, the first inequality arises from (A.13d) and the second one uses the assumption on (i.e. (3.18b)). A simple consequence of (F.1) is that
which in conjunction with Theorem 6 yields the following decomposition
With this decomposition at hand, we have that for any ,
With probability exceeding , the term obeys
With Claim 6 at hand, we are ready to take
for any . This immediately establishes Corollary 2.
Suppose for the moment that for some . Then it follows immediately that
Therefore if suffices to control and (i.e. obtaining the quantity ).
First, expand and to see
where the middle line depends on (F.1) and (F.2), and the last inequality arises since .
Now we move on to . By the definition of , one has
Focusing on the factor, we have — with probability at least — that
Here, the first relation comes from the identity , and the inequality arises from the triangle inequality. Notice that and that
where the last inequality follows from (A.13b). Using the bounds (F.1) and (F.2), we continue the upper bound in (F.4) as follows
A similar bound holds for the factor . Therefore, with high probability we have
where the last relation results from the condition on (cf. (3.18b)).
Combine the bounds on and to see that with probability exceeding ,
This establishes the desired upper bound on . ∎
Appendix G Proof of Theorem 3
As we have argued in Section 5.1, it suffices to prove the claim for . For simplicity of notation, we define
Apply the decompositions in Theorem 5 to obtain
where we use the fact that . Theorem 5 tells us that , which further implies
Now we turn to the term , for which we have the following two claims.
With probability at least , one has
With probability exceeding , we have
Combine all of the above bounds to yield the desired result.
Plug in the definition of (cf. (G.1)) and invoke the triangle inequality again to see that
Here the last relation depends on the assumption (3.18a). Second, we have already established in this section that
with probability exceeding . Substitute the above two facts into (G.2) to arrive at
where \max\big{\{}\|\bm{\Delta}_{\bm{X}}\|_{2,\infty},\|\bm{\Delta}_{\bm{Y}}\|_{2,\infty}\big{\}}\lesssim\frac{\sigma}{\sqrt{p\sigma_{\min}}}\sqrt{\frac{\kappa^{2}\mu r^{2}\log^{2}n}{np}} and hence
Consequently, use the triangle inequality and Cauchy-Schwarz to verify that
where (i) follows since and , and (ii) makes use of (G.3) as well as the facts . In addition, invoke Lemma 9 to see that and are both Gaussian matrices with i.i.d. entries, which together with standard concentration results implies that
with the proviso that . This means that, with high probability,
Everything then boils down to controlling \mathsf{Tr}\big{(}\bm{Z}_{\bm{X},\bm{E}}\bm{Y}^{\star\top}\bm{Z}_{\bm{Y},\bm{E}}\bm{X}^{\star\top}\big{)}. Towards this end, we first note that
Similarly, . These identities allow us to derive
Apply the same arguments in controlling (D.21) to obtain that with probability at least ,
as long as . This combined with (G.6) yields the desired claim. ∎
Appendix H Proof of lower bounds
Fix any . It suffices to prove that the matrix defined in (3.27) satisfies
with probability at least , provided that . Towards this end, we first compute
where we recall that . Next, define the following event
where is some large absolute constant. On the event , in view of the fact , one has
with the proviso that . This further implies that
on the event . Clearly, the requirement (H.1) holds true if with .
To finish up, we are left with proving that occurs with probability at least . Invoke the matrix Bernstein inequality to show that
holds with probability at least , where we define
Here we have used the incoherence condition (A.17). Consequently, one reaches the conclusion that with probability exceeding ,
as long as , thus concluding the proof.
H.2 Proof of Lemma 2
The proof strategy is similar to the one used in proving Lemma 1 (cf. Appendix H.1). Fix any . It is sufficient to establish the following inequality
where the scalar is defined in (3.28) and is defined in Theorem 2. Expand the left-hand side to reach
where the last line follows from the observations that and .
where is some large universal constant. Two observations are sufficient to derive the desired the result (H.2). First, the event happens with probability at least — an easy consequence of the proof of Lemma 1 (cf. Appendix H.1). Second, on the event , repeating the same proof of Lemma 1 (cf. Appendix H.1), one can deduce that
Comparing (H.2) and (H.3), one arrives at the desired result as long as .
Appendix I Proofs in Section A
We start with (A.13a). Invoke the triangle inequality to get
where the last relation depends on the unitary invariance of the operator norm and (A.9b). It then boils down to controlling . Notice that
where the last inequality uses (cf. (A.19)), the fact that (see (A.6)), the bound (C.16) and the condition . Apply the perturbation bound for matrix square roots (see Lemma 13) to obtain that
Here, (i) uses the facts that and that , and (ii) follows from the condition that (see (A.6)). Combine the above two bounds with (cf. (A.19)) to reach
Moving on to (A.13b), we apply the triangle inequality and (I.3) to see that
In order to control , we leverage [MWCC17, Lemma 36] to get
where the last relation uses (I.2) and . Taking these bounds collectively yields
Now we turn attention to (A.13d). Observe that
where the last bound arises from (A.9c). Going through the same calculation as in bounding , we arrive at
as long as . We can thus continue the upper bound in (I.5) to derive
Here, the second line results from (I.4).
Finally, we deal with (A.13e). From the definition of the de-shrunken estimator (3.8), we have
This combined with the triangle inequality reveals that
Making use of (A.11) and (C.16) allows us to establish the claim.
I.2 Proof of the inequalities (A.16)
The proofs of (A.16a) and (A.16b) are the same as those of (A.13b) and (A.13d), and are hence omitted for conciseness. We are left with (A.16c). Denoting
as long as . Here the first inequality follows from (A.13a). In addition, we have
Regarding , one can apply the bound (C.16) for and a similar bound for to obtain
Returning to (I.6), one has by the triangle inequality that
we can apply the perturbation bound for matrix square roots (see Lemma 13) to obtain
where the penultimate relation uses (A.14a) as well as the fact that .
With the above bound in place, we are ready to invoke [CCF+19, Lemma 22] to obtain
where the last line comes from (A.14a). This concludes the proof.
Appendix J Technical lemmas
This section collects a few useful matrix perturbation bounds. The first one is concerned with the perturbation of pseudo-inverses.
Let (resp. ) be the pseudo-inverse (i.e. Moore–Penrose inverse) of (resp. ). Then we have
The next lemma focuses on the perturbation bound for matrix square roots.
Consider two symmetric matrices obeying and for some . Let (resp. ) be the (principal) matrix square root of (resp. ). Then one has
The following lemma concerns the perturbation of top- components of matrices.
From Wedin’s theorem [Wed72], there exist orthonormal matrices such that
In addition, Weyl’s inequality tells us that
Here, the second inequality follows from the triangle inequality and the assumption that . Expand and apply the triangle inequality to obtain
Once again, employ (J.1) and (J.2) to arrive at
The last bound centers around the well-known Sylvester equation .
as long as and for some .
To begin with, we intend to show that under the condition for some , there is a unique solution to the matrix equation . Use the notation of Kronecker product to obtain an equivalent form of as follows
where denotes the Kronecker product and stands for the vectorization of the matrix . Given that and , it is straightforward to see that is invertible, thus justifying the uniqueness of .
The next step is to characterize explicitly. The argument herein is adapted from [Smi68] and [Sch92]. Specifically, it has been shown in [Smi68] that the equation is equivalent to
where , and , for any . In particular, when , the matrix
is the unique solution to and hence to . To show this, it suffices to verify that the matrix series is convergent. Note that when , one has
and similarly . These two bounds taken together immediately establish the convergence of the matrix series (J.5).
In the end, the explicit representation (J.5) allows us to upper bound . A little algebra reveals that
where we make use of the fact . In addition, from the definition of we know that
provided that . Combine this with the bounds on and to reach