Inference and Uncertainty Quantification for Noisy Matrix Completion

Yuxin Chen, Jianqing Fan, Cong Ma, Yuling Yan

Introduction

Low-rank matrix completion is concerned with recovering a low-rank matrix, when only a small fraction of its entries are revealed to us [Sre04, CR09, KMO10a]. Tackling this problem in large-scale applications is computationally challenging, due to the intrinsic nonconvexity incurred by the low-rank structure. To further complicate matters, another inevitable challenge stems from the imperfectness of data acquisition mechanisms, wherein the acquired samples are contaminated by a certain amount of noise.

Fortunately, if the entries of the unknown matrix are sufficiently de-localized and randomly revealed, this problem may not be as hard as it seems. Substantial progress has been made over the past several years in designing computationally tractable algorithms — including both convex and nonconvex approaches — that allow to fill in unseen entries faithfully given only partial noisy samples [CP10, NW12, KLT11, KMO10b, CW15, MWCC17, CCF+19]. Nevertheless, modern decision making would often require one step further. It not merely anticipates a faithful estimate, but also seeks to quantify the uncertainty or “confidence” of the provided estimate, ideally in a reasonably accurate fashion. For instance, given an estimate returned by the convex approach, how to use it to compute a short interval that is likely to contain a missing entry?

Conducting effective uncertainty quantification for noisy matrix completion is, however, far from straightforward. For the most part, the state-of-the-art matrix completion algorithms require solving highly complex optimization problems, which often do not admit closed-form solutions. Of necessity, it is generally very challenging to pin down the distributions of the estimates returned by these algorithms. The lack of distributional characterizations presents a major roadblock to performing valid, yet efficient, statistical inference on the unknown matrix of interest.

It is worth noting that a number of recent papers have been dedicated to inference and uncertainty quantification for various high-dimensional problems in high-dimensional statistics, including Lasso [ZZ14, vdGBRD14, JM14a], generalized linear models [vdGBRD14, NL17, BFL+18], graphical models [JVDG15, RSZZ15, MLL17]), amongst others. Very little work, however, has looked into noisy matrix completion along this direction. While non-asymptotic statistical guarantees for noisy matrix completion have been derived in prior theory, most, if not all, of the estimation error bounds are supplied only at an order-wise level. Such order-wise error bounds either lose a significant factor relative to the optimal guarantees, or come with an unspecified (but often enormous) pre-constant. Viewed in this light, a confidence region constructed directly based on such results is bound to be overly conservative, resulting in substantial over-coverage.

2 A glimpse of our contributions

This paper takes a substantial step towards efficient inference and uncertainty quantification for noisy matrix completion. Specifically, we develop a simple procedure to compensate for the bias of the commonly used convex and nonconvex estimators. The resulting de-biased estimators admit nearly accurate non-asymptotic distributional guarantees. Such distributional characterizations in turn allow us to reason about the uncertainty of the obtained estimates vis-à-vis the unknown matrix. While details of our main findings are postponed to Section 3, we would like to immediately single out a few important merits of the proposed inferential procedures and theory:

Our results enable two types of uncertainty assessment, namely, we can construct (i) confidence intervals for each entry — either observed or missing — of the unknown matrix; (ii) confidence regions for the low-rank factors of interest (modulo some unavoidable global ambiguity).

Despite the complicated statistical dependency, our procedure and theory do not rely on sample splitting, thus avoiding the unnecessary widening of confidence intervals / regions due to insufficient data usage.

The confidence intervals / regions constructed based on the proposed procedures are, in some sense, optimal.

We present a unified approach that accommodates both convex and nonconvex estimators seamlessly.

As a byproduct, we characterize the Euclidean estimation errors of the proposed de-biased estimators. Such error bounds are sharp and match an oracle lower bound precisely (including the pre-constant). To the best of our knowledge, this is the first theory that demonstrates that a computationally feasible algorithm can achieve the statistical limit including the pre-constant.

All of this is built upon the intimate link between convex and nonconvex estimators [CCF+19], as well as the recent advances in analyzing the stability of nonconvex optimization against random noise [MWCC17].

Models and notation

To cast the noisy matrix completion problem in concrete statistical settings, we adopt a model commonly studied in the literature [CR09]. We also introduce some useful notation.

Observation models.

What we observe is a random subset of noisy entries of M\bm{M}^{\star}; more specifically, we observe

Incoherence conditions.

Clearly, not all matrices can be reliably estimated from a highly incomplete set of measurements. To address this issue, we impose a standard incoherence condition [CR09, Che15] on the singular subspaces of M\bm{M}^{\star} (i.e. U\bm{U}^{\star} and V\bm{V}^{\star}):

Asymptotic notation.

Here, f(n)h(n)f(n)\lesssim h(n) (or f(n)=O(h(n))f(n)=O(h(n))) means f(n)c1h(n)|f(n)|\leq c_{1}|h(n)| for some constant c1>0c_{1}>0, f(n)h(n)f(n)\gtrsim h(n) means f(n)c2h(n)|f(n)|\geq c_{2}|h(n)| for some constant c2>0c_{2}>0, f(n)h(n)f(n)\asymp h(n) means c2h(n)f(n)c1h(n)c_{2}|h(n)|\leq|f(n)|\leq c_{1}|h(n)| for some constants c1,c2>0c_{1},c_{2}>0, and f(n)=o(h(n))f(n)=o(h(n)) means limnf(n)/h(n)=0\lim_{n\rightarrow\infty}f(n)/h(n)=0. We write f(n)h(n)f(n)\ll h(n) to indicate that f(n)c1h(n)|f(n)|\leq c_{1}|h(n)| for some small constant c1>0c_{1}>0 (much smaller than 1), and use f(n)h(n)f(n)\gg h(n) to indicate that f(n)c2h(n)|f(n)|\geq c_{2}|h(n)| for some large constant c2>0c_{2}>0 (much larger than 1).

Inferential procedures and main results

The proposed inferential procedure lays its basis on two of the most popular estimation paradigms — convex relaxation and nonconvex optimization — designed for noisy matrix completion. Recognizing the complicated bias of these two highly nonlinear estimators, we shall first illustrate how to perform bias correction, followed by a distributional theory that establishes the near-Gaussianity and optimality of the proposed de-biased estimators.

We first review in passing two tractable estimation algorithms that are arguably the most widely used in practice. They serve as the starting point for us to design inferential procedures for noisy low-rank matrix completion. The readers familiar with this literature can proceed directly to Section 3.2.

Recall that the rank function rank()\mathsf{rank}(\cdot) is highly nonconvex, which often prevents us from computing a rank-constrained estimator in polynomial time. For the sake of computational feasibility, prior works suggest relaxing the rank function into its convex surrogate [Faz02, RFP10]; for example, one can consider the following penalized least-squares convex program

or using our notation PΩ\mathcal{P}_{\Omega},

Here, \|\cdot\|_{*} is the nuclear norm (the sum of singular values, which is a convex surrogate of the rank function), and λ>0\lambda>0 is some regularization parameter. Under mild conditions, the solution to the convex program (3.1) provably attains near-optimal estimation accuracy (in an order-wise sense), provided that a proper regularization parameter λ\lambda is adopted [CCF+19].

Nonconvex optimization.

Intimate connections between convex and nonconvex estimates.

Denote by Zcvx\bm{Z}^{\mathsf{cvx}} any minimizer of the convex program (3.1), and denote by (Xncvx,Yncvx)(\bm{X}^{\mathsf{ncvx}},\bm{Y}^{\mathsf{ncvx}}) the estimate returned by Algorithm 1 aimed at solving (3.4). As was recently shown in [CCF+19], when the regularization parameter λ\lambda is properly chosen, these two estimates obey (see (A.12) in Appendix A.2 for a precise statement)

2 Constructing de-biased estimators

We are now well equipped to describe how to construct new estimators based on the convex estimate Zcvx\bm{Z}^{\mathsf{cvx}} and the nonconvex estimate (Xncvx,Yncvx)(\bm{X}^{\mathsf{ncvx}},\bm{Y}^{\mathsf{ncvx}}), so as to enable statistical inference. Motivated by the proximity of the convex and nonconvex estimates and for the sake of conciseness, we shall abuse notation by using the shorthand Z,X,Y\bm{Z},\bm{X},\bm{Y} for both convex and nonconvex estimates; see Table 1 and Appendix B for precise definitions. This allows us to unify the presentation for both convex and nonconvex estimators.

Given that both (3.1) and (3.4) are regularized least-squares problems, they behave effectively like shrinkage estimators, indicating that the provided estimates necessarily suffer from non-negligible bias. In order to enable desired statistical inference, it is natural to first correct the estimation bias.

A natural de-biasing strategy that immediately comes to mind is the following simple linear transformation (recall the notation in Table 1):

To remedy this issue, we propose to further project Z0\bm{Z}^{0} onto the set of rank-rr matrices, leading to the following de-biased estimator

The estimator (3.7) can be viewed as performing one iteration of singular value projection (SVP) [MJD09, DC18] on the current estimate Z\bm{Z}.

The estimator (3.7) also bears a similarity to the de-biased estimator proposed by [Xia18] for low-rank trace regression; the disparity between them shall be discussed in Section 4.

An equivalent form: a de-shrunken estimator for the low-rank factors.

It turns out that the de-biased estimator (3.7) admits another almost equivalent representation that offers further insights. Specifically, we consider the following de-shrunken estimator for the low-rank factors

where we recall the definition of X\bm{X} and Y\bm{Y} in Table 1. To develop some intuition regarding why this is called a de-shrunken estimator, let us look at a simple scenario where UΣV\bm{U}\bm{\Sigma}\bm{V}^{\top} is the SVD of XY\bm{X}\bm{Y}^{\top} and X=UΣ1/2\bm{X}=\bm{U}\bm{\Sigma}^{1/2}, Y=VΣ1/2\bm{Y}=\bm{V}\bm{\Sigma}^{1/2}. It is then self-evident that

In words, Xd\bm{X}^{\mathsf{d}} and Yd\bm{Y}^{\mathsf{d}} are obtained by de-shrinking the spectrum of X\bm{X} and Y\bm{Y} properly.

As we shall formalize in Section 5.1, the de-shrunken estimator (3.8) for the low-rank factors is nearly equivalent to the de-biased estimator (3.7) for the whole matrix, in the sense that

3 Main results: distributional guarantees

The proposed estimators admit tractable distributional characterizations in the large-nn regime, which facilitates the construction of confidence regions for many quantities of interest. In particular, this paper centers around two types of inferential problems:

Each entry of the matrix M\bm{M}^{\star}: the entry can be either missing (i.e. predicting an unseen entry) or observed (i.e. de-noising an observed entry). For example, in the Netflix challenge, one would like to infer a user’s preference about any movie, given partially revealed ratings [CR09]. Mathematically, this seeks to determine the distribution of

Suppose that the sample size and the noise obey

A more complete version can be found in Theorem 5.

Another interesting feature — which we shall make precise in the proof of this theorem — is that: for any given 1i,jn1\leq i,j\leq n, the two random vectors ZXei\bm{Z}_{\bm{X}}^{\top}\bm{e}_{i} and ZYej\bm{Z}_{\bm{Y}}^{\top}\bm{e}_{j} are nearly statistically independent. This is crucial for deriving inferential guarantees for the entries of the matrix.

Theorem 1 is a non-asymptotic result. In words, Theorem 1 decomposes the estimation error XdHdX\bm{X}^{\mathsf{d}}\bm{H}^{\mathsf{d}}-\bm{X}^{\star} (resp. YdHdY\bm{Y}^{\mathsf{d}}\bm{H}^{\mathsf{d}}-\bm{Y}^{\star}) into a Gaussian component ZX\bm{Z}_{\bm{X}} (resp. ZY\bm{Z}_{\bm{Y}}) and a residual term ΨX\bm{\Psi}_{\bm{X}} (resp. ΨY\bm{\Psi}_{\bm{Y}}). If the sample size is sufficiently large and the noise size is sufficiently small, then the residual terms are much smaller in size compared to ZX\bm{Z}_{\bm{X}} and ZY\bm{Z}_{\bm{Y}}. To see this, it is helpful to leverage the Gaussianity (3.15a) and compute that: for each 1jn1\leq j\leq n, the jjth row of ZX\bm{Z}_{\bm{X}} obeys

in other words, the typical size of the jjth row of ZX\bm{Z}_{\bm{X}} is no smaller than the order of σr/(pσmax)\sigma\sqrt{r/(p\sigma_{\max})}. In comparison, the size of each row of ΨX\bm{\Psi}_{\bm{X}} (see (3.16)) is much smaller than σr/(pσmax)\sigma\sqrt{r/(p\sigma_{\max})} (and hence smaller than the size of the corresponding row of ZX\bm{Z}_{\bm{X}}) with high probability, provided that (3.13) is satisfied.

Equipped with the above master decompositions of the low-rank factors and Remark 4, we are ready to present a similar decomposition for the entry MijdMijM_{ij}^{\mathsf{d}}-M_{ij}^{\star}.

For each 1i,jn1\leq i,j\leq n, define the variance vijv_{ij}^{\star} as

where Ui,\bm{U}_{i,\cdot}^{\star} (resp. Vj,\bm{V}_{j,\cdot}^{\star}) denotes the iith (resp. jjth) row of U\bm{U}^{\star} (resp. V\bm{V}^{\star}). Suppose that

where gijN(0,vij)g_{ij}\sim\mathcal{N}(0,v_{ij}^{\star}) and the residual obeys Δij=o(vij)|\Delta_{ij}|=o(\sqrt{v_{ij}^{\star}}) with probability exceeding 1O(n3)1-O(n^{-3}).

In the symmetric case where the noise E\bm{E}, the truth M\bm{M}^{\star}, and the sampling pattern are all symmetric (i.e. \mathcal{P}_{\Omega}(\bm{E})=\big{(}\mathcal{P}_{\Omega}(\bm{E})\big{)}^{\top} and M=M\bm{M}^{\star}=\bm{M}^{\star\top}), the variance viiv_{ii}^{\star} (cf. (3.17)) for the diagonal entries has a different formula; more specifically, it is straightforward to extend our theory to show that

This additional multiplicative factor of 2 arises since ZXei\bm{Z}_{\bm{X}}^{\top}\bm{e}_{i} and ZYei\bm{Z}_{\bm{Y}}^{\top}\bm{e}_{i} are identical (and hence not independent) in this symmetric case. The variance formula for any vijv_{ij}^{\star} (iji\neq j) remains unchanged.

Several remarks are in order. To begin with, we develop some intuition regarding where the variance vijv_{ij}^{\star} comes from. By virtue of Theorem 1, one has the following Gaussian approximation

Assuming that the first-order expansion is reasonably tight, one has

According to Remark 4, ZXei\bm{Z}_{\bm{X}}^{\top}\bm{e}_{i} and ZYej\bm{Z}_{\bm{Y}}^{\top}\bm{e}_{j} are nearly independent. It is thus straightforward to compute the variance of (3.20) as

Here, (i) relies on (3.20) and the near independence between ZXei\bm{Z}_{\bm{X}}^{\top}\bm{e}_{i} and ZYej\bm{Z}_{\bm{Y}}^{\top}\bm{e}_{j}; (ii) uses the variance formula in Theorem 1; (iii) arises from the definitions of X\bm{X}^{\star} and Y\bm{Y}^{\star} (cf. (2.2)). This computation explains (heuristically) the variance formula vijv_{ij}^{\star}.

Given that Theorem 2 reveals the tightness of Gaussian approximation under conditions (3.18), it in turn allows us to construct nearly accurate confidence intervals for each matrix entry MijM_{ij}^{\star}. This is formally summarized in the following corollary, the proof of which is deferred to Appendix F. Here and throughout, we use [a±b][a\pm b] to denote the interval [ab,a+b][a-b,a+b].

Denote by Φ(t)\Phi(t) the CDF of a standard Gaussian random variable and by Φ1()\Phi^{-1}(\cdot) its inverse function. Let

be the empirical estimate of the theoretical variance vijv_{ij}^{\star}. Then one has

In words, Corollary 1 tells us that for any fixed significance level 0<α<10<\alpha<1, the interval

is a nearly accurate two-sided (1α)(1-\alpha) confidence interval of MijM_{ij}^{\star}.

In addition, we remark that when Ui,2=Vj,2=0\|\bm{U}_{i,\cdot}^{\star}\|_{2}=\|\bm{V}_{j,\cdot}^{\star}\|_{2}=0 (and hence Vij=0V_{ij}^{\star}=0), the above Gaussian approximation is completely off. In this case, one can still leverage Theorem 1 to show that

Last but not least, the careful readers might wonder how to interpret our conditions on the sample complexity and the signal-to-noise ratio. Take the case with r,μ,κ=O(1)r,\mu,\kappa=O(1) for example: our conditions read

4 Lower bounds and optimality for inference

It is natural to ask how well our inferential procedures perform compared to other algorithms. Encouragingly, the de-biased estimator is optimal in some sense; for instance, it nearly attains the minimum covariance among all unbiased estimators. To formalize this claim, we shall

Quantify the performance of two ideal estimators with the assistance of an oracle;

Demonstrate that the performance of our de-biased estimators is arbitrarily close to that of the ideal estimators.

In what follows, we denote by Xi,\bm{X}_{i,\cdot}^{\star} (resp. Yi,\bm{Y}_{i,\cdot}^{\star}) the iith row of X\bm{X}^{\star} (resp. Y\bm{Y}^{\star}).

Suppose that there is an oracle informing us of Y\bm{Y}^{\star}, and that we observe the same set of data as in (2.3). Under such an idealistic setting and for any given 1in1\leq i\leq n, the following least-squares estimator achieves the minimum covariance among all unbiased estimators for the iith row Xi,\bm{X}_{i,\cdot}^{\star} of X\bm{X}^{\star} (see e.g. [Sha03, Theorem 3.7])

In other words, for any unbiased estimator u\bm{u} of Xi,\bm{X}_{i,\cdot}^{\star} (conditional on Ω\Omega), one has

where \mathsf{Cov}\big{(}\bm{X}_{i,\cdot}^{\mathsf{ideal}}\,\big{|}\,\Omega\big{)} is precisely the Cramér-Rao lower bound (conditional on Ω\Omega) under this ideal setting. As it turns out, with high probability, this lower bound concentrates around σ2(Σ)1/p\sigma^{2}(\bm{\Sigma}^{\star})^{-1}/p, as stated in the following lemma. The proof is postponed to Appendix H.1.

Fix an arbitrarily small constant ε>0\varepsilon>0. Suppose that n2pC0ε2κ4μrnn^{2}p\geq C_{0}\varepsilon^{-2}\kappa^{4}\mu rn for some sufficiently large constant C0>0C_{0}>0 independent of nn. Then with probability at least 1O(n10)1-O(n^{-10}), one has

Given that ε\varepsilon can be an arbitrarily small constant, Lemma 1 uncovers that the covariance of the de-shrunken estimator Xi,d\bm{X}_{i,\cdot}^{\mathsf{d}} (cf. Theorem 1) matches that of the ideal estimator Xi,ideal\bm{X}_{i,\cdot}^{\mathsf{ideal}}, thus achieving the Cramér-Rao lower bound with high probability. The same conclusion applies to Yj,d\bm{Y}_{j,\cdot}^{\mathsf{d}} as well.

Suppose that there is another oracle informing us of {Xk,}k:ki\{\bm{X}_{k,\cdot}^{\star}\}_{k:k\neq i} and {Yk,}k:kj\{\bm{Y}_{k,\cdot}^{\star}\}_{k:k\neq j}; that is, everything about X\bm{X}^{\star} except Xi,\bm{X}_{i,\cdot}^{\star} and everything about Y\bm{Y}^{\star} except Yj,\bm{Y}_{j,\cdot}^{\star}. In addition, we observe the same set of data as in (2.3), except that we do not get to see MijM_{ij}.The exclusion of MijM_{ij} is merely for ease of presentation. One can consider the model where all MijM_{ij} with (i,j)Ω(i,j)\in\Omega are observed with a slightly more complicated argument. Under this idealistic model, the Cramér-Rao lower bound [Sha03, Theorem 3.3] for estimating Mij=Xi,(Yj,)M_{ij}^{\star}=\bm{X}_{i,\cdot}^{\star}(\bm{Y}_{j,\cdot}^{\star})^{\top} can be computed as

This means that any unbiased estimator of MijM_{ij}^{\star} must have variance no smaller than CRLB(MijΩ)\mathsf{CRLB}(M_{ij}^{\star}\mid\Omega). This quantity admits a much simpler lower bound as follows, whose proof can be found in Appendix H.2.

Fix an arbitrarily small constant ε>0\varepsilon>0. Suppose that n2pC0ε2κ4μrnlognn^{2}p\geq C_{0}\varepsilon^{-2}\kappa^{4}\mu rn\log n for some sufficiently large constant C0>0C_{0}>0 independent of nn. Then with probability at least 1O(n10)1-O(n^{-10}),

where vijv_{ij}^{\star} is defined in Theorem 2.

Similar to Lemma 1, Lemma 2 reveals that the variance of our de-biased estimator MijdM_{ij}^{\mathsf{d}} (cf. Theorem 2) — which certainly does not have access to the side information provided by the oracle — is arbitrarily close to the Cramér-Rao lower bound aided by an oracle.

All in all, the above lower bounds demonstrate that the degrees of uncertainty underlying our de-shrunken low-rank factors and de-biased matrix are, in some sense, statistically minimal.

5 Back to estimation: the de-biased estimator is optimal

While the emphasis of the current paper is on inference, we would nevertheless like to single out an important consequence that informs the estimation step. To be specific, the decompositions and distributional guarantees derived in Theorem 1 and Theorem 2 allow us to track the estimation accuracy of Md\bm{M}^{\mathsf{d}}, as stated in the following theorem. The proof of this result is postponed to Appendix G.

In stark contrast to prior statistical estimation guarantees (e.g. [CP10, NW12, KLT11, CCF+19]), Theorem 3 pins down the estimation error of the proposed de-biased estimator in a sharp manner (namely, even the pre-constant is fully determined). Encouragingly, there is a sense in which the proposed de-biased estimator achieves the best possible statistical estimation accuracy, as revealed by the following result.

Fix an arbitrarily small constant ε>0\varepsilon>0. Suppose that n2pμrnlog2nn^{2}p\gtrsim\mu rn\log^{2}n, and that r=o(n)r=o(n). Then with probability exceeding 1O(n10)1-O(n^{-10}), any unbiased estimator M^\widehat{\bm{M}} of M\bm{M}^{\star} obeys

Intuitively, the term 2nr2nr reflects approximately the underlying degrees of freedom in the true subspace TT^{\star} of interest (i.e. the tangent space of the rank-rr matrices at the truth M\bm{M}^{\star}), whereas the factor 1/p1/p captures the effect due to sub-sampling. This result has already been established in [CP10, Section III.B] (together with [CR09, Theorem 4.1]). We thus omit the proof for conciseness. The key idea is to consider an oracle informing us of the true tangent space TT^{\star}. ∎

The implication of the above two theorems is remarkable: the de-biasing step not merely facilitates uncertainty assessment, but also proves crucial in minimizing the estimation errors. It achieves optimal statistical efficiency in terms of both the rate and the pre-constant. As far as we know, this is the first theory about a polynomial time algorithm that matches the statistical limit in terms of the pre-constant. This intriguing finding is further corroborated by numerical experiments; see Section 3.6 for details (in particular, Figure 3).

6 Numerical experiments

We conduct numerical experiments on synthetic data to verify the distributional characterizations provided in Theorem 1 and Theorem 2. Note that our main results hold for the de-biased estimators built upon Zcvx\bm{Z}^{\mathsf{cvx}} and XncvxYncvx\bm{X}^{\mathsf{ncvx}}\bm{Y}^{\mathsf{ncvx}\top}. As we will formalize shortly in Section 5.1, these two de-biased estimators are extremely close to each other; see also Figure 4 for experimental evidence. Therefore, in order to save space, we use the de-biased estimator built upon the convex estimate Zcvx\bm{Z}^{\mathsf{cvx}} throughout the experiments.

We begin by checking the validity of Theorem 1. Suppose that one is interested in estimating the inner product eiXXej\bm{e}_{i}^{\top}\bm{X}^{\star}\bm{X}^{\star\top}\bm{e}_{j} between Xei\bm{X}^{\star\top}\bm{e}_{i} and Xej\bm{X}^{\star\top}\bm{e}_{j} (iji\neq j). In the Netflix challenge, this might correspond to the similarity between the iith user and the jjth one. As a straightforward consequence of Theorem 1, the normalized estimation error

is extremely close to a standard Gaussian random variable. Here, similar to (3.22), we let

be the empirical estimate of the theoretically predicted variance σ2(Ui,22+Uj,22)/p\sigma^{2}(\|\bm{U}_{i,\cdot}^{\star}\|_{2}^{2}+\|\bm{U}_{j,\cdot}^{\star}\|_{2}^{2})/p. As a result, a 95%95\% confidence interval of eiXXej\bm{e}_{i}^{\top}\bm{X}^{\star}\bm{X}^{\star\top}\bm{e}_{j} would be [eiXdXdej±1.96ρij][\bm{e}_{i}^{\top}\bm{X}^{\mathsf{d}}\bm{X}^{\mathsf{d}\top}\bm{e}_{j}\pm 1.96\sqrt{\rho_{ij}}]. For each (i,j)(i,j), we define Cov^L,(i,j)\widehat{\mathsf{Cov}}_{\mathsf{L},(i,j)} to be the empirical coverage rate of eiXXej\bm{e}_{i}^{\top}\bm{X}^{\star}\bm{X}^{\star\top}\bm{e}_{j} over 200200 Monte Carlo simulations. Correspondingly, denote by Mean(Cov^L)\mathsf{Mean}(\widehat{\mathsf{Cov}}_{\mathsf{L}}) (resp. Std(Cov^L)\mathsf{Std}(\widehat{\mathsf{Cov}}_{\mathsf{L}})) the average (resp. the standard deviation) of Cov^L,(i,j)\widehat{\mathsf{Cov}}_{\mathsf{L},(i,j)} over indices 1i<jn1\leq i<j\leq n. Table 2 collects the simulation results for different values of (r,p,σ)(r,p,\sigma). As can be seen, the reported empirical coverage rates are reasonably close to the nominal level 95%95\%. In addition, Figure 1 depicts the Q-Q (quantile-quantile) plots of T12,T13T_{12},T_{13} and T14T_{14} vs. the standard Gaussian random variable over 200 Monte Carlo simulations for r=5r=5, p=0.4p=0.4 and σ=103\sigma=10^{-3}. It is clearly seen that all of these are well approximated by a standard Gaussian random variable.

Next, we turn to Theorem 2, namely the distributional guarantee for the entries of the matrix. Denote

where vijv_{ij} is the empirical variance defined in (3.22). In view of the 95%95\% confidence interval predicted by Corollary 1, and similar to what have done for the low-rank components, for each (i,j)(i,j), we define Cov^E,(i,j)\widehat{\mathsf{Cov}}_{\mathsf{E},(i,j)} to be the empirical coverage rate of MijM^{\star}_{ij} over 200200 Monte Carlo simulations. Correspondingly, denote by Mean(Cov^E)\mathsf{Mean}(\widehat{\mathsf{Cov}}_{\mathsf{E}}) (resp. Std(Cov^E)\mathsf{Std}(\widehat{\mathsf{Cov}}_{\mathsf{E}})) the average (resp. the standard deviation) of Cov^E,(i,j)\widehat{\mathsf{Cov}}_{\mathsf{E},(i,j)} over indices 1i,jn1\leq i,j\leq n. As before, Table 3 gathers the empirical coverage rates for MijM^{\star}_{ij} and Figure 2 displays the Q-Q (quantile-quantile) plots of S11,S12S_{11},S_{12} and S13S_{13} vs. the standard Gaussian random variable over 200 Monte Carlo trials for r=5r=5, p=0.4p=0.4 and σ=103\sigma=10^{-3}. It is evident that the distribution of SijS_{ij} matches that of N(0,1)\mathcal{N}(0,1) reasonably well.

We conclude this section with experiments on real data. Similar to [CP10], we use the daily temperature data [NCD19] for 1400 stations across the world in 2018, which results in a 1400×3651400\times 365 data matrix. Inspection on the singular values reveals that the data matrix is nearly low-rank. We vary the observation probability pp from 0.50.5 to 0.90.9 and randomly subsample the data accordingly. Based on the observed temperatures, we then apply the proposed methodology to obtain 95%95\% confidence intervals for all the entries. Table 4 reports the empirical coverage probabilities, the average length of the confidence intervals as well as the estimation error of both Zcvx\bm{Z}^{\mathsf{cvx}} and Md\bm{M}^{\mathsf{d}} over 20 independent experiments. It can be seen that the average coverage probabilities are reasonably close to 95%95\% and the confidence intervals are also quite short. In addition, the estimation error of Md\bm{M}^{\mathsf{d}} is smaller than that of Zcvx\bm{Z}^{\mathsf{cvx}}, which corroborates our theoretical prediction. The discrepancy between the nominal coverage probability and the actual one might arise from the facts that (1) the underlying true temperature matrix is only approximately low-rank, and (2) the noise in the temperature might not be independent.

7 A bit of intuition

We pause to develop some intuition behind the distributional guarantees for the proposed estimators. Bearing in mind the intimate link between convex and nonconvex optimization (cf. (3.5)), it suffices to concentrate on the nonconvex problem (3.4). For the sake of clarity, we further restrict attention to the rank-1 positive semidefinite case where M=xx\bm{M}^{\star}=\bm{x}^{\star}\bm{x}^{\star\top} and set λ=0\lambda=0, where one can focus on

Any optimizer x^\widehat{\bm{x}} of (3.34) would necessarily satisfy the first-order optimality condition

We shall also assume that x^\widehat{\bm{x}} is a reasonably reliable estimate obeying x^x\widehat{\bm{x}}\approx\bm{x}^{\star}.

We begin with the no-missing-data case (i.e. p=1p=1), which already conveys the key insight. The condition (3.35) simplifies to

which, through a little manipulation, leads to an equivalent decomposition:

Then: (1) the third term of (3.37), which can be viewed as a second-order term (in the sense that it is a quadratic term of x^x\widehat{\bm{x}}-\bm{x}^{\star}), becomes vanishingly small when x^x\widehat{\bm{x}}\approx\bm{x}^{\star}; (2) while the second term of (3.37) looks like a first-order term, it is natural to conjecture that x^x\widehat{\bm{x}}-\bm{x}^{\star} is sufficiently random and hence (x^x)xx^x2x2(\widehat{\bm{x}}-\bm{x}^{\star})^{\top}\bm{x}^{\star}\ll\|\widehat{\bm{x}}-\bm{x}^{\star}\|_{2}\|\bm{x}^{\star}\|_{2} (i.e. the estimation error is not aligned with x\bm{x}^{\star}), meaning that this term is also expected to be negligible compared to a typical first-order term (e.g. the term on the left-hand side of 3.37). In summary, these non-rigorous arguments suggest that

If one can be convinced that E\bm{E} and x^\widehat{\bm{x}} are only weakly dependent, then this means

Returning to the missing data scenario with p<1p<1, everything is based on the following approximation

this is certainly expected — using standard concentration arguments — if we “pretend” that PΩ\mathcal{P}_{\Omega} and x^\widehat{\bm{x}} are statistically independent. With this approximation in mind, one can translate (3.35) into

Repeating the above argument then immediately yields

The case with λ>0\lambda>0 can be intuitively understood in a very similar way by first de-shrinking the estimate; we omit it here for brevity. We note that these hand-waving arguments can all be made rigorous, which is the main content of the proof.

8 Inference based on spectral estimates?

One would naturally be curious about whether there are other estimation procedures that also enable reasonable statistical inference. While this is beyond the scope of the current paper, we take a moment to discuss one alternative: the spectral method, as pioneered by [KMO10a, KMO10b] in the matrix completion problem. In a nutshell, this approach consists in computing a rank-rr approximation to PΩ(M)/p\mathcal{P}_{\Omega}(\bm{M})/p, which is precisely the spectral initialization widely used in a two-stage nonconvex algorithm (cf. Algorithm 1) [KMO10b, SL16, CW15, CCF18, MWCC17]. While inference has not been, as far as we know, the focus of prior work on spectral methods,We note that inference from spectral estimates has been investigated in other context beyond matrix completion (e.g. the model without missing data [Xia19, FFHL19]). the recent papers [AFWZ17, MWCC17] hinted at the possibility of characterizing the distribution of the spectral estimate. Take a simple symmetric rank-1 case for example (i.e. M=xx\bm{M}^{\star}=\bm{x}^{\star}\bm{x}^{\star\top} with x2=1\|\bm{x}^{\star}\|_{2}=1): the leading eigenvector uspectral\bm{u}^{\mathsf{spectral}} of PΩ(M)/p\mathcal{P}_{\Omega}(\bm{M})/p often admits the following approximation (up to a global sign)

Expanding PΩ(M)=pxx+PΩ(xx)pxx+PΩ(E)\mathcal{P}_{\Omega}(\bm{M})=p\bm{x}^{\star}\bm{x}^{\star\top}+\mathcal{P}_{\Omega}(\bm{x}^{\star}\bm{x}^{\star\top})-p\bm{x}^{\star}\bm{x}^{\star\top}+\mathcal{P}_{\Omega}(\bm{E}), we arrive at

In words, two major factors dictate the uncertainty of the spectral estimate: (1) the additive Gaussian noise (cf. the 1st term on the right-hand side of (3.42)), and (2) random sub-sampling (in particular, the randomness incurred by employing the sub-sampled PΩ(xx)/p\mathcal{P}_{\Omega}(\bm{x}^{\star}\bm{x}^{\star\top})/p to approximate the truth xx\bm{x}^{\star}\bm{x}^{\star\top}). Given that the random sub-sampling effect cannot be ignored at all, the spectral estimates often suffer from a much larger estimation error (and hence a higher degree of uncertainty) compared to either the convex or the nonconvex estimates. In truth, this random sub-sampling effect does not go away even when the noise vanishes. Consequently, uncertainty quantification based on the spectral estimates may not be the most desirable option.

Prior art

Low-rank matrix completion, or more broadly, low-rank matrix recovery, is a fundamental task that permeates through a wide spectrum of applications in science, engineering, and finance (e.g. [RS05, SY07, CC14, FSZZ18, CCG15, ZPL15, BN06, CC18b, FWZ19, KS11, CZ16, KX15, FLM13, DR17, CDDD19, DPVW14, SZ12, FS11]). A paper of this length is unable to review all papers motivating and contributing to this enormous subject; interested readers are referred to [DR16, CC18a] for extensive discussions of motivating applications as well as the exciting recent development.

Numerous algorithms have been proposed to solve this problem efficiently, with two paradigms being arguably the most widely used: convex relaxation and nonconvex optimization. We briefly review the literature contributing to these two paradigms.

Nonconvex optimization algorithms, as pioneered by [KMO10a, Sre04], become increasingly more popular for solving various low-rank factorization problems, due to their appealing computational complexities [JNS13, CLS15, CC17, TBS+16, SL16, ZL16, CCFM19, WZG16, CLL19]. For instance, the gradient-based nonconvex methods have been analyzed for noisy matrix completion [KMO10b, CW15, MWCC17, CCF+19], which are shown to achieve near-optimal statistical accuracy and linear-time convergence guarantees all at once. Going beyond gradient methods, we note that other nonconvex methods (e.g. [RS05, JMD10, WYZ12, JNS13, FRW11, Van13, LXY13, Har14, JKN16, RT11, WCCL16, DC18, ZWL15, ZWYG18, MSL19, CCD+19]) and landscape properties [GLM16, CL17, GJZ17, ZSL19, ZJSL18, SXZ19] have been largely explored as well. The interested readers are referred to [CLC19] for an in-depth discussion. One limitation, however, is that the theoretical guarantees provided for nonconvex algorithms often exhibit sub-optimal dependency in the rank rr of the unknown matrix; for instance, most theory requires a sample complexity of at least nr2nr^{2} (in fact, often much larger than nr2nr^{2}). This is outperformed by the convex relaxation approach.

Despite these recent developments, very little work has investigated statistical inference for noisy matrix completion. While [CKLN18, CKL16, CN15, CEGN15] discussed the construction of “honest” confidence regions, the volume of these regions is dependent on some (possibly huge) hidden constants, thus resulting in over-coverage. Perhaps the closest to our paper is the recent work [Xia18], which investigated inference for low-rank trace regression. Employing a closely related de-biased estimator with sample splitting, the paper [Xia18] established asymptotic normality of a certain projected distance between the estimate and the truth. The result therein, however, requires a sampling mechanism obeying the restricted isometry property (e.g. i.i.d. Gaussian designs), which fails to hold for matrix completion. Also, our approach does not require sample splitting — a technique that is convenient for analysis but conservative in constructing confidence regions. Another work by Cai et al. [CLR16] developed a unified approach to provide inference guarantees for linear inverse problems including low-rank matrix estimation. Their results, however, require the sample size to exceed the total dimension n2n^{2} even under the Gaussian design. Finally, a recent line of work [MX17] explored uncertainty quantification under the Bayesian setting, hypothesizing on a special prior regarding the true matrix. This departs drastically from the scenario considered herein.

Inference in high-dimensional problems.

Inference in high-dimensional sparse regression has received much attention in the last few years [WR09, ZZ14, BCH11, vdGBRD14, JM14b, DBMM15, CG17, NL17, NNLL18, LSST16, LTTT14, MMB09, DBZ17, ZC17, BFL+18]. Our inferential approach is partly inspired by the recent developments on this topic, particularly with regard to the de-biased / de-sparsified estimators proposed for Lasso. More specifically, recognizing the non-negligible bias of the Lasso estimate

A line of work [ZZ14, vdGBRD14, JM14a] came up with a linear transformation of β^\widehat{\bm{\beta}} of the form

Interestingly, our de-biased estimator (3.7) for matrix completion admits a very similar form as (4.2). To see this, recall that our de-biased estimator is given by

where Z\bm{Z} can be either Zcvx\bm{Z}^{\mathsf{cvx}} or XncvxYncvx\bm{X}^{\mathsf{ncvx}}\bm{Y}^{\mathsf{ncvx}\top} (see Table 1). Let TT be the tangent space of the set of rank-rr matrices at Zcvx,r\bm{Z}^{\mathsf{cvx},r} (resp. XncvxYncvx\bm{X}^{\mathsf{ncvx}}\bm{Y}^{\mathsf{ncvx}\top}) in the convex (resp. nonconvex) case, and PT\mathcal{P}_{T} be the projection operator onto TT. Somewhat surprisingly, replacing Prank-r\mathcal{P}_{\text{rank-}r} by PT\mathcal{P}_{T} does not affect the de-biased estimator by much, in the sense that

In addition, recognizing that Z\bm{Z} almost lies within the tangent space TT,More precisely, if Z=XncvxYncvx\bm{Z}=\bm{X}^{\mathsf{ncvx}}\bm{Y}^{\mathsf{ncvx}\top}, then ZT\bm{Z}\in T; if Z=Zcvx\bm{Z}=\bm{Z}^{\mathsf{cvx}}, one has PT(Z)Z\mathcal{P}_{T}(\bm{Z})\approx\bm{Z}.one can rewrite

Finally, de-biased estimators have been put forward to tackle other high-dimensional problems, including but not limited to generalized linear models [vdGBRD14, NL17], graphical models [JVDG15, RSZZ15, MLL17, JvdG17], sparse PCA [JvdG18], treatment effects estimation [CCD+18, AIW18]. These are beyond the scope of the current paper.

Architecture of the proof

This section outlines the main steps for establishing Theorem 1 and Theorem 2. Before starting, we introduce some useful notation. For convenience of presentation, we insert the factor 1/p1/p into (3.4) and redefine the nonconvex loss function as

In addition, for each 1j,kn1\leq j,k\leq n, we define the indicator δjk\mathds1{(j,k)Ω}\delta_{jk}\triangleq\operatorname{\mathds{1}}\{(j,k)\in\Omega\}, which is a Bernoulli random variable with mean pp.

We also note that Theorem 1 (resp. Theorem 2) is subsumed by Theorem 5 (resp. Theorem 6). As a result, we shall focus on establishing Theorem 5 (resp. Theorem 6) when it comes to estimating low-rank factors (resp. the entries of the matrix).

Instate the assumptions of Theorem 5. Recall the definition of vijv_{ij}^{\star} in (3.17). Then one has the following decomposition

where gijN(0,vij)g_{ij}\sim\mathcal{N}(0,v_{ij}^{\star}) and the residual obeys — with probability exceeding 1O(n10)1-O(n^{-10}) — that

Note that Theorem 5 and Theorem 6 are concerned with the de-biased estimators built upon both convex and nonconvex estimates. At first glance, one needs to establish theoretical guarantees for each of them separately. Fortunately, as alluded to previously (cf. (3.5)), the convex and nonconvex estimates are extremely close — a fact that has been established in [CCF+19]. The proximity of these two estimates naturally extends to the de-biased estimators constructed based on them. As a result, it suffices to concentrate on proving the theorems for any of these estimators; the claims for the other one follow immediately.

The following key lemma formalizes this argument, which will be established in Appendix C (see also Figure 4 for numerical evidence). Before continuing, we remind the readers of the key notation (see Appendix B for precise definitions):

(Xncvx,Yncvx)(\bm{X}^{\mathsf{ncvx}},\bm{Y}^{\mathsf{ncvx}}): an approximate solution to the nonconvex problem (3.4) (see Appendix A.1);

Mcvx,d,Xcvx,d,Ycvx,d\bm{M}^{\mathsf{cvx,d}},\bm{X}^{\mathsf{cvx,d}},\bm{Y}^{\mathsf{cvx,d}}: the de-biased estimators built upon the convex optimizer Zcvx\bm{Z}^{\mathsf{cvx}};

Mncvx,d,Xncvx,d,Yncvx,d\bm{M}^{\mathsf{ncvx,d}},\bm{X}^{\mathsf{ncvx,d}},\bm{Y}^{\mathsf{ncvx,d}}: the de-biased estimators built upon the nonconvex estimate (Xncvx,Yncvx)(\bm{X}^{\mathsf{ncvx}},\bm{Y}^{\mathsf{ncvx}}).

Suppose that the sample size obeys n2pCκ4μ2r2nlog3nn^{2}p\geq C\kappa^{4}\mu^{2}r^{2}n\log^{3}n for some sufficiently large constant C>0C>0 and the noise satisfies σ(κ4μnrlogn)/pcσmin\sigma\sqrt{(\kappa^{4}\mu nr\log n)/p}\leq c\sigma_{\min} for some sufficiently small constant c>0c>0. Set λ=Cλσnp\lambda=C_{\lambda}\sigma\sqrt{np} with some large enough constant Cλ>0C_{\lambda}>0.

With probability at least 1O(n10)1-O(n^{-10}), one has

where Or×r\mathcal{O}^{r\times r} is the set of r×rr\times r rotation matrices.

With probability exceeding 1O(n10)1-O(n^{-10}), one has

In short, the first part of Lemma 3 tells us that

whereas the second part of Lemma 3 justifies that the proposed de-biased estimator is closely approximated by a linearized version (cf. (4.4)). Note that this linearized form bears a resemblance to the de-biased estimators developed for sparse linear regression [ZZ14, vdGBRD14, JM14a].

With Lemma 3 in place, we shall, from now on, focus on proving the main theorems for the nonconvex estimators, viz.

establishing Theorem 5 for the de-shrunken low-rank factors (Xncvx,d,Yncvx,d)(\bm{X}^{\mathsf{ncvx,d}},\bm{Y}^{\mathsf{ncvx,d}});

establishing Theorem 6 for the de-biased matrix estimator Xncvx,dYncvx,d\bm{X}^{\mathsf{ncvx,d}}\bm{Y}^{\mathsf{ncvx,d}\top}.

To simplify the presentation hereafter, we shall use the following notation throughout the rest of this section:

(X,Y)(\bm{X},\bm{Y}): the nonconvex estimate (Xncvx,Yncvx)(\bm{X}^{\mathsf{ncvx}},\bm{Y}^{\mathsf{ncvx}});

(Xd,Yd)(\bm{X}^{\mathsf{d}},\bm{Y}^{\mathsf{d}}): the de-shrunken estimate defined in (3.8) based on (X,Y)=(Xncvx,Yncvx)(\bm{X},\bm{Y})=(\bm{X}^{\mathsf{ncvx}},\bm{Y}^{\mathsf{ncvx}});

MdXdYd\bm{M}^{\mathsf{d}}\triangleq\bm{X}^{\mathsf{d}}\bm{Y}^{\mathsf{d}\top}.

2 A precise characterization of the de-shrunken low-rank factors

We start with a precise characterization of the de-shrunken low-rank factors Xd\bm{X}^{\mathsf{d}} and Yd\bm{Y}^{\mathsf{d}}, which paves the way for demonstrating both Theorem 5 and Theorem 6.

One has the following decompositions for Xd\bm{X}^{\mathsf{d}} and Yd\bm{Y}^{\mathsf{d}}

which measures the imbalance between the low-rank factors X\bm{X} and Y\bm{Y}.

The claims follow from straightforward algebraic manipulations; see Appendix D.1. ∎

We make a few observations regarding Lemma 4. Take the decomposition of Xd\bm{X}^{\mathsf{d}} (5.8a) as an example:

First, the term AYd(YdYd)1\bm{A}\bm{Y}^{\mathsf{d}}(\bm{Y}^{\mathsf{d}\top}\bm{Y}^{\mathsf{d}})^{-1} vanishes when we have full observations, i.e. p=1p=1. Second, the terms involving Xf(X,Y)\nabla_{\bm{X}}f(\bm{X},\bm{Y}) and Δbalancing\bm{\Delta}_{\mathsf{balancing}} are both zero if (X,Y)(\bm{X},\bm{Y}) is an exact stationary point of f(,)f(\cdot,\cdot); to see this, it is not hard to verify that any stationary point of f(,)f(\cdot,\cdot) necessarily satisfies XX=YY\bm{X}^{\top}\bm{X}=\bm{Y}^{\top}\bm{Y}, which in turn implies Δbalancing=0\bm{\Delta}_{\mathsf{balancing}}=\bm{0}. Consequently, the last three terms in (5.8a) are expected to be small when pp is sufficiently large and (X,Y)(\bm{X},\bm{Y}) is near a stationary point.

We shall make precise these arguments in subsequent subsections.

3 Taking global rotation into account

In order to invoke the decompositions of Xd\bm{X}^{\mathsf{d}} and Yd\bm{Y}^{\mathsf{d}} (cf. Lemma 4) to characterize the estimation errors, we still need to incorporate the (unrecoverable) rotation matrix. From now on, we shall focus primarily on the factor Xd\bm{X}^{\mathsf{d}}. The claims on the other factor Yd\bm{Y}^{\mathsf{d}} can be easily obtained via symmetry.

where we recall that Hd\bm{H}^{\mathsf{d}} is the rotation matrix that best aligns (Xd,Yd)(\bm{X}^{\mathsf{d}},\bm{Y}^{\mathsf{d}}) and (X,Y)(\bm{X}^{\star},\bm{Y}^{\star}) (see (3.12)). Substituting the identity

into the decomposition (5.8a), we arrive at

4 Key lemmas for establishing Theorem 5

We now state five key lemmas. Taking these collectively and substituting them into (5.11) immediately establish Theorem 5.

We shall start by controlling the term Φ1\bm{\Phi}_{1} as defined in (5.12).

Suppose that the sample complexity obeys n2pCκ4μ2r2nlog3nn^{2}p\geq C\kappa^{4}\mu^{2}r^{2}n\log^{3}n for some sufficiently large constant C>0C>0 and the noise satisfies σ(κ4μrnlogn)/pcσmin\sigma\sqrt{(\kappa^{4}\mu rn\log n)/p}\leq c\sigma_{\min} for some sufficiently small constant c<0c<0. Then with probability at least 1O(n10)1-O(n^{-10}), we have

Fix any 1jn1\leq j\leq n. If the de-shrunken estimate Yd\overline{\bm{Y}}^{\mathsf{d}} were independent of the randomness in the jjth row of the matrix, i.e. ejPΩ(E)\bm{e}_{j}^{\top}\mathcal{P}_{\Omega}(\bm{E}), then ejΦ12\|\bm{e}_{j}^{\top}\bm{\Phi}_{1}\|_{2} would be well controlled. This hypothesis is certainly false, as Yd\overline{\bm{Y}}^{\mathsf{d}} clearly depends on ejPΩ(E)\bm{e}_{j}^{\top}\mathcal{P}_{\Omega}(\bm{E}). Nevertheless, by exploiting the leave-one-out technique recently used in [EKBB+13, EK15, AFWZ17, MWCC17, CFMW19, CCF+19, CLL19, DC18], one can properly decouple the dependency and establish the desired bound. See Appendix D.2.∎

The next lemma controls the size of Φ22,\|\bm{\Phi}_{2}\|_{2,\infty}. In essence, the term Φ2\bm{\Phi}_{2} measures the difference between the estimate Yd\overline{\bm{Y}}^{\mathsf{d}} and the true signal Y\bm{Y}^{\star}; the closer these two are, the smaller Φ22,\|\bm{\Phi}_{2}\|_{2,\infty} should be. See Appendix D.3 for the proof of the following result.

Suppose that the sample complexity obeys n2pCκ4μ2r2nlog3nn^{2}p\geq C\kappa^{4}\mu^{2}r^{2}n\log^{3}n for some sufficiently large constant C>0C>0 and the noise satisfies σ(κ4μrnlogn)/pcσmin\sigma\sqrt{(\kappa^{4}\mu rn\log n)/p}\leq c\sigma_{\min} for some sufficiently small constant c<0c<0. Then with probability exceeding 1O(n10)1-O(n^{-10}), one has

Moving on to Φ3\bm{\Phi}_{3} and Φ4\bm{\Phi}_{4}, one has the following lemmas.

Suppose that the sample complexity obeys n2pCκ4μ2r2nlog3nn^{2}p\geq C\kappa^{4}\mu^{2}r^{2}n\log^{3}n for some sufficiently large constant C>0C>0 and the noise satisfies σ(κ4μrnlogn)/pcσmin\sigma\sqrt{(\kappa^{4}\mu rn\log n)/p}\leq c\sigma_{\min} for some sufficiently small constant c<0c<0. Then with probability exceeding 1O(n10)1-O(n^{-10}), we have

It is straightforward to check that when p=1p=1, one has Φ32,=A=0\|\bm{\Phi}_{3}\|_{2,\infty}=\|\bm{A}\|=0, where we recall the definition of A\bm{A} in (5.7). Therefore, one expects Φ32,\|\bm{\Phi}_{3}\|_{2,\infty} to be small when pp is sufficiently large. See Appendix D.4.∎

Suppose that the sample complexity obeys n2pCκ4μ2r2nlog3nn^{2}p\geq C\kappa^{4}\mu^{2}r^{2}n\log^{3}n for some sufficiently large constant C>0C>0 and the noise satisfies σ(κ4μrnlogn)/pcσmin\sigma\sqrt{(\kappa^{4}\mu rn\log n)/p}\leq c\sigma_{\min} for some sufficiently small constant c<0c<0. Then with probability at least 1O(n10)1-O(n^{-10}), one has

It is easily seen that the size of Φ4\bm{\Phi}_{4} depends on how close (X,Y)(\bm{X},\bm{Y}) is to a stationary point of f(,)f(\cdot,\cdot). For instance, in the extreme case where (X,Y)(\bm{X},\bm{Y}) is an exact stationary point, then one would have Φ4=0\bm{\Phi}_{4}=\bm{0}. See Appendix D.5.∎

The last lemma asserts that PΩ(E)Y(YY)1/p\mathcal{P}_{\Omega}(\bm{E})\bm{Y}^{\star}(\bm{Y}^{\star\top}\bm{Y}^{\star})^{-1}/p is, in some sense, close to a zero-mean Gaussian random matrix with the desired covariance.

Suppose that the sample size obeys n2pCκ2μrnlog3nn^{2}p\geq C\kappa^{2}\mu rn\log^{3}n for some sufficiently large constant C>0C>0. Then one has the decomposition

Fix any 1jn1\leq j\leq n. The jjth row, namely ej[PΩ(E)Y(YY)1/p]\bm{e}_{j}^{\top}[\mathcal{P}_{\Omega}(\bm{E})\bm{Y}^{\star}(\bm{Y}^{\star\top}\bm{Y}^{\star})^{-1}/p] is conditionally Gaussian in the sense that

where we recall that δjk=\mathds1{(j,k)Ω}\delta_{jk}=\operatorname{\mathds{1}}\{(j,k)\in\Omega\}. Recognize that the conditional covariance matrix concentrates sharply around its expectation, i.e. σ2(Σ)1/p\sigma^{2}(\bm{\Sigma}^{\star})^{-1}/p, which is the covariance matrix of ZXej\bm{Z}_{\bm{X}}^{\top}\bm{e}_{j} that we are after. Hence, one can expect that PΩ(E)Y(YY)1/p\mathcal{P}_{\Omega}(\bm{E})\bm{Y}^{\star}(\bm{Y}^{\star\top}\bm{Y}^{\star})^{-1}/p is, marginally, not too far from a Gaussian random matrix. This argument can be carried out formally; see Appendix D.6. ∎

5 From low-rank factors to matrix entries (Proof of Theorem 6)

We now turn attention to inference on the matrix entries, by establishing Theorem 6. Towards this, we first make the following observation: for any 1i,jn1\leq i,j\leq n,

One can readily apply the decompositions in Theorem 5 to obtain

Take the preceding three identities collectively to reach

Following the same route as in Section 5.4, one can verify that Θij=eiZXYej+eiXZYej\Theta_{ij}=\bm{e}_{i}^{\top}\bm{Z}_{\bm{X}}\bm{Y}^{\star\top}\bm{e}_{j}+\bm{e}_{i}^{\top}\bm{X}^{\star}\bm{Z}_{\bm{Y}}^{\top}\bm{e}_{j} is approximately Gaussian, whereas the residual term Λij\Lambda_{ij} is small in magnitude. These claims are formally stated in the next two lemmas, with the proofs deferred to Appendix E.

Suppose that the sample complexity obeys n2pCκ4μ2r2nlog3nn^{2}p\geq C\kappa^{4}\mu^{2}r^{2}n\log^{3}n for some sufficiently large constant C>0C>0 and the noise satisfies σ(κ4μrnlogn)/pcσmin\sigma\sqrt{(\kappa^{4}\mu rn\log n)/p}\leq c\sigma_{\min} for some sufficiently small constant c<0c<0. Then with probability exceeding 1O(n10)1-O(n^{-10}), one has

Suppose that npCκ2μr2log2nnp\geq C\kappa^{2}\mu r^{2}\log^{2}n for some sufficiently large constant C>0C>0. Then we have the decomposition

where gijN(0,vij)g_{ij}\sim\mathcal{N}(0,v_{ij}^{\star}) and the remaining term θij\theta_{ij} satisfies — with probability exceeding 1O(n10)1-O(n^{-10}) — that

Putting the above two lemmas together immediately establishes Theorem 6 and hence Theorem 2.

Discussion

The present paper makes progress towards inference and uncertainty quantification for noisy matrix completion, by developing simple de-biased estimators that admit tractable and accurate distributional characterizations. While we have achieved some early success in accomplishing this, our results are likely sub-optimal in the following aspects:

Dependency on the rank and the condition number. To enable valid inference, our sample complexity (cf. (3.13) and (3.18a)) scales sub-optimally with the rank rr and the condition number κ\kappa. The sub-optimality can be understood through comparisons with the sample size requirement O(nrlog2n)O(nr\log^{2}n) in the noise-free settings, which is independent of κ\kappa and matches the information limit (up to some log\log factor). Improving such dependency calls for more refined analysis techniques.

Detection of the size of the entries. On one hand, when the size of the entry MijM_{ij}^{\star} is moderately large (cf. (3.18b)), Corollary 1 allows us to construct a valid confidence interval for it. On the other hand, when Ui,2+Vj,2\|\bm{U}_{i,\cdot}^{\star}\|_{2}+\|\bm{V}_{j,\cdot}^{\star}\|_{2} vanishes, Theorem 5 tells us that the estimation error MijdMijM_{ij}^{\mathsf{d}}-M_{ij}^{\star} is better approximated by the inner product of two independent Gaussian random vectors. It remains to be seen how to determine whether Ui,2+Vj,2\|\bm{U}_{i,\cdot}^{\star}\|_{2}+\|\bm{V}_{j,\cdot}^{\star}\|_{2} is too small.

Low signal-to-noise (SNR) regime. Our theory operates under the moderate-to-high SNR regime, where σmin2/σ2\sigma^{2}_{\min}/\sigma^{2} (which is proportional to the SNR) is required to exceed the order of n/pn/p; see the conditions in Theorem 5. It is unclear whether the connection between the convex and the nonconvex estimators hold in the low SNR regime. How to conduct inference in such a scenario is an important future direction.

In addition, our investigation has been dedicated to a natural random model, which by no means covers the most general settings of practical interest. There are numerous possible extensions that merit future investigation:

Approximate low-rank structure. Our current theory is built upon the exact low-rank structure of M\bm{{M}}^{\star}. Realistically, the matrix of interest is often only approximately low-rank. It is of great interest to study how to carry out statistical inference under such imperfect structural assumptions.

More general sampling patterns. This paper operates under the uniform random sampling assumption, which might sometimes be off in practical situations. It would be interesting to investigate whether our results in this paper can extend to more general non-uniform sampling patterns (e.g. [NW12]).

Extensions to robust PCA, sparse PCA, and 1-bit matrix completion. A variety of important extensions of matrix completion have been explored in prior literature, including but not limited to the case with sparse outliers (i.e. robust PCA [CLMW11, CSPW11]), the case where the matrix of interest is simultaneously sparse and low-rank (i.e. sparse PCA [ZHT06, CMW13]), and the case where only finite-bit observations are available (i.e. 1-bit matrix completion [DPVW14, CZ13]). Performing valid uncertainty assessment for these scenarios requires non-trivial extensions of the link between convex and nonconvex optimization.

Other loss functions. In the estimation stage, one might sometimes prefer other loss functions beyond the penalized squared loss. This might arise from either statistical considerations (e.g. employing a penalized Poisson log-likelihood to accommodate Poisson noise [CX16]), or computational concerns (e.g. adopting a non-smooth loss to improve convergence [CCD+19]). It would be of fundamental importance to develop a unified inferential framework that covers a broader family of loss functions.

Acknowledgements

Y. Chen is supported in part by the AFOSR YIP award FA9550-19-1-0030, by the ONR grant N00014-19-1-2120, by the ARO grant W911NF-18-1-0303, by the NSF grants CCF-1907661 and IIS-1900140, and by the Princeton SEAS innovation award. J. Fan is supported in part by NSF grants DMS-1662139 and DMS-1712591, ONR grant N00014-19-1-2120, and NIH grant 2R01-GM072611-13. C. Ma is supported in part by Hudson River Trading AI Labs (HAIL) Fellowship. This work was done in part while Y. Chen was visiting the Kavli Institute for Theoretical Physics (supported in part by the NSF grant PHY-1748958). We thank Weijie Su for helpful discussions.

Appendix A Preliminaries

In this section, we gather several notation and preliminary facts that are useful throughout the analysis. All the proofs, if needed, are deferred to Appendix I.

To begin with, we make precise the algorithm used to minimize the nonconvex loss function (5.1). Specifically, we describe the following details that are crucial for us to implement Algorithm 1:

Set the initial point to be (X0,Y0)=(X,Y)(\bm{X}^{0},\bm{Y}^{0})=(\bm{X}^{\star},\bm{Y}^{\star}) or the spectral initialization as in [MWCC17, CLL19];

Set the stepsize η1/(n6κ3σmax)\eta\asymp 1/(n^{6}\kappa^{3}\sigma_{\max});

Set the maximum number of iterations to be t0n23t_{0}\asymp n^{23};

The returned estimate is (Xncvx,Yncvx)(Xt,Yt)(\bm{X}^{\mathsf{ncvx}},\bm{Y}^{\mathsf{ncvx}})\triangleq(\bm{X}^{t_{\star}},\bm{Y}^{t_{\star}}), where

In words, we run gradient descent in Algorithm 1 until we reach a point whose gradient is exceedingly small.

Many of the preliminary facts below were established for the case (X0,Y0)=(X,Y)(\bm{X}^{0},\bm{Y}^{0})=(\bm{X}^{\star},\bm{Y}^{\star}) [CCF+19], which is certainly not implementable in practice, however, it serves as a good proxy for studying the convex estimator. Fortunately, the same theoretical guarantees stated in Appendix A.2 can be readily established for spectral initialization using almost the same arguments adopted in [MWCC17, CLL19, CCF+19]. We omit this part mainly for the sake of brevity.

To facilitate analysis, we introduce a set of auxiliary nonconvex loss functions. For any 1jn1\leq j\leq n, define

be the nonconvex estimate returned by this auxiliary algorithm (i.e. Algorithm 2), which serves as an approximate solution to (A.2).

A.2 Properties of approximate nonconvex solutions

This subsection gathers the properties of the (approximate) nonconvex solutions. Throughout this subsection, we use the shorthand

and recall the definition of (X(j),Y(j))(\bm{X}^{(j)},\bm{Y}^{(j)}) in (A.3). The regularization parameter is chosen to satisfy

The claims stated below hold under the sample complexity and the noise condition presumed in [CCF+19, Theorem 1] (see also Theorem 5 in the current manuscript)

The first set of facts is related to (X,Y)(\bm{X},\bm{Y}). In view of [CCF+19], F\bm{F} is a faithful estimateTechnically, the statements in [CCF+19, Lemma 5] are for η1/(nκ3σmax)\eta\asymp 1/(n\kappa^{3}\sigma_{\max}) and t0n18t_{0}\asymp n^{18}. Nevertheless, inspecting their proofs reveals that the claims continue to hold for our choices η1/(n6κ3σmax)\eta\asymp 1/(n^{6}\kappa^{3}\sigma_{\max}) and t0n23t_{0}\asymp n^{23}. of F\bm{F}^{\star}, in the sense that

hold with probability exceeding 1O(n10)1-O(n^{-10}). In addition, on the same high-probability event, one has

In words, the first claim ensures that (X,Y)(\bm{X},\bm{Y}) is an approximate stationary point of f(,)f(\cdot,\cdot); the second bound tells us that (X,Y)(\bm{X},\bm{Y}) is nearly balanced, in the sense that XXYY\bm{X}^{\top}\bm{X}\approx\bm{Y}^{\top}\bm{Y}; the last one formalizes the proximity between the convex solution and the nonconvex one; see also (3.5).

hold with probability at least 1O(n10)1-O(n^{-10}).

As has been shown in [CCF+19], the leave-one-out auxiliary point (X(j),Y(j))(\bm{X}^{(j)},\bm{Y}^{(j)}) satisfies

with probability exceeding 1O(n10)1-O(n^{-10}).

Parallel to the transition from (X,Y)(\bm{X},\bm{Y}) to (Xd,Yd)(\bm{X}^{\mathsf{d}},\bm{Y}^{\mathsf{d}}), we set

to be the de-shrunken estimators of X(j)\bm{X}^{(j)} and Y(j)\bm{Y}^{(j)}, respectively. We shall demonstrate in Appendix I that, with probability at least 1O(n10)1-O(n^{-10}),

In addition to these four sets of claims, we have the following immediate consequence of the incoherence condition (2.4)

Moreover, recall that A=(1/p)PΩ(XYXY)(XYXY)\bm{A}=(1/p)\cdot\mathcal{P}_{\Omega}\left(\bm{X}\bm{Y}^{\top}-\bm{X}^{\star}\bm{Y}^{\star\top}\right)-\left(\bm{X}\bm{Y}^{\top}-\bm{X}^{\star}\bm{Y}^{\star\top}\right) (cf. (5.7)). We obtain from the proof of [CCF+19, Lemma 8] that

Last but not least, we list a few simple but useful results: the nonconvex solution F\bm{F} satisfies

The same holds true if we replace F\bm{F} by either Fd\bm{F}^{\mathsf{d}}, F(j)\bm{F}^{(j)} Fd,(j)\bm{F}^{\mathsf{d},(j)} or their corresponding low-rank factors. Here jj can vary from 1 to nn.

Appendix B Summary of the proposed estimators

Let Zcvx\bm{Z}^{\mathsf{cvx}} be the minimizer of the convex program (3.1), and let (Xncvx,Yncvx)(\bm{X}^{\mathsf{ncvx}},\bm{Y}^{\mathsf{ncvx}}) be the solution returned by the Algorithm 1 (with algorithmic details specified in Appendix A.1). Recall that Zcvx,r\bm{Z}^{\mathsf{cvx},r} is the best rank-rr approximation of Zcvx\bm{Z}^{\mathsf{cvx}}, viz.

In addition, we let the matrix estimate obtained by the nonconvex algorithm be ZncvxXncvxYncvx\bm{Z}^{\mathsf{ncvx}}\triangleq\bm{X}^{\mathsf{ncvx}}\bm{Y}^{\mathsf{ncvx}\top}. We further denote by (Xcvx,Ycvx)(\bm{X}^{\mathsf{cvx}},\bm{Y}^{\mathsf{cvx}}) the estimate of low-rank factors obtained by convex relaxation; more specifically, we set (Xcvx,Ycvx)(\bm{X}^{\mathsf{cvx}},\bm{Y}^{\mathsf{cvx}}) to be the balanced rank-rr factorization of Zcvx,r\bm{Z}^{\mathsf{cvx},r} obeying XcvxYcvx=Zcvx,r\bm{X}^{\mathsf{cvx}}\bm{Y}^{\mathsf{cvx}\top}=\bm{Z}^{\mathsf{cvx},r} and XcvxXcvx=YcvxYcvx\bm{X}^{\mathsf{cvx}\top}\bm{X}^{\mathsf{cvx}}=\bm{Y}^{\mathsf{cvx}\top}\bm{Y}^{\mathsf{cvx}}. With these notations in place, our de-biased and de-shrunken estimators can be summarized as follows.

De-shrunken estimators for low-rank factors:

Appendix C Proof of Lemma 3

Throughout this section, let UΣV\bm{U}\bm{\Sigma}\bm{V}^{\top} be the rank-rr SVD of the nonconvex estimate XncvxYncvx\bm{X}^{\mathsf{ncvx}}\bm{Y}^{\mathsf{ncvx}\top} and TT the tangent space of the set of rank-rr matrices at XncvxYncvx\bm{X}^{\mathsf{ncvx}}\bm{Y}^{\mathsf{ncvx}\top}. Correspondingly, we denote by PT\mathcal{P}_{T} the projection operator onto the tangent space TT, and let PT=IPT\mathcal{P}_{T^{\perp}}=\mathcal{I}-\mathcal{P}_{T}, where I\mathcal{I} is the identity operator.

In essence, we intend to justify that Mcvx,d\bm{M}^{\mathsf{cvx,d}}, Mncvx,d\bm{M}^{\mathsf{ncvx,d}} and Xncvx,dYncvx,d\bm{X}^{\mathsf{ncvx,d}}\bm{Y}^{\mathsf{ncvx,d}\top} are all very close to U(Σ+λpIr)V\bm{U}(\bm{\Sigma}+\frac{\lambda}{p}\bm{I}_{r})\bm{V}^{\top}.

Recall from the definition of the de-biased estimator Mcvx,d\bm{M}^{\mathsf{cvx,d}} (cf. (B.1a)) that

Replacing Zcvx\bm{Z}^{\mathsf{cvx}} by XncvxYncvx\bm{X}^{\mathsf{ncvx}}\bm{Y}^{\mathsf{ncvx}\top} results in

Apply the proximity bound (A.12) to obtain (recall that in (A.12), one has (X,Y)=(Xncvx,Yncvx)(\bm{X},\bm{Y})=(\bm{X}^{\mathsf{ncvx}},\bm{Y}^{\mathsf{ncvx}}))

as long as n5pκ2n^{5}p\gg\kappa^{2}. In addition, in view of [CCF+19, Claim 2], one has the decomposition

where the middle line follows since UΣV\bm{U}\bm{\Sigma}\bm{V}^{\top} is defined to be the SVD of XncvxYncvx\bm{X}^{\mathsf{ncvx}}\bm{Y}^{\mathsf{ncvx}\top}.

where the last inequality results from (C.3) and (C.5). Combining the above two bounds with the fact that U(Σ+λpIr)V\bm{U}(\bm{\Sigma}+\frac{\lambda}{p}\bm{I}_{r})\bm{V}^{\top} and PT(ΔZ1pR)\mathcal{P}_{T}(\bm{\Delta}_{\bm{Z}}-\frac{1}{p}\bm{R}) are orthogonal to each other, we arrive at the conclusion that U(Σ+λpIr)V\bm{U}(\bm{\Sigma}+\frac{\lambda}{p}\bm{I}_{r})\bm{V}^{\top} is the top-rr SVD of C\bm{C} and

Second, let U^Σ^V^\hat{\bm{U}}\hat{\bm{\Sigma}}\hat{\bm{V}}^{\top} be the top-rr SVD of C+Δ\bm{C}+\bm{\Delta}. By definition, one has U^Σ^V^=Mcvx,d\hat{\bm{U}}\hat{\bm{\Sigma}}\hat{\bm{V}}^{\top}=\bm{M}^{\mathsf{cvx,d}}. We are left with checking the two conditions in Lemma 14. To begin with, the perturbation term Δ\bm{\Delta} obeys

where (i) comes from (C.3) and (C.5) and the last inequality (ii) arises since npκ2.np\gg\kappa^{2}. Clearly, the size of the perturbation is much smaller than λ/p\lambda/p and hence C\|\bm{C}\| (cf. (C.8a)). In addition,

where the equality depends on (C.8b) and the last line results from (C.7) and (C.9). Consequently,

Here the first relation arises from (C.8a) and the second holds since σr(Σ)σmin/2\sigma_{r}(\bm{\Sigma})\geq\sigma_{\min}/2, a simple consequence of (A.19). We are now ready to apply Lemma 14 to obtain

where we have used the fact that Σ+(λ/p)Irσmax\|\bm{\Sigma}+(\lambda/p)\bm{I}_{r}\|\lesssim\sigma_{\max}, which also can be derived from (A.19). The above bound combined with (C.9) yields

as long as npκ3np\gg\kappa^{3}. We remark that by setting ΔZ=0\bm{\Delta}_{\bm{Z}}=\bm{0}, one also obtains the bound on Mncvx,d\bm{M}^{\mathsf{ncvx,d}}, i.e.

We move on to investigating Xncvx,dYncvx,dU(Σ+λpIr)V\|\bm{X}^{\mathsf{ncvx,d}}\bm{Y}^{\mathsf{ncvx,d}\top}-\bm{U}(\bm{\Sigma}+\frac{\lambda}{p}\bm{I}_{r})\bm{V}^{\top}\|, for which we have the following claim.

Taking the above three bounds collectively and recognizing that λσnp\lambda\lesssim\sigma\sqrt{np} yield the advertised bound (5.4a).

The last inequality is the small-gradient condition (see (A.10), in which (X,Y)=(Xncvx,Yncvx))(\bm{X},\bm{Y})=(\bm{X}^{\mathsf{ncvx}},\bm{Y}^{\mathsf{ncvx}})). Employ the definitions for Xncvx,d\bm{X}^{\mathsf{ncvx,d}} and Yncvx,d\bm{Y}^{\mathsf{ncvx,d}} (cf. (B.2a) and (B.2b)) to see that

It then boils down to showing that (i) A1\bm{{A}}_{1} is very close to U(Σ+λpIr)V\bm{U}(\bm{\Sigma}+\frac{\lambda}{p}\bm{I}_{r})\bm{V}^{\top}, and (ii) A2\bm{{A}}_{2} is small in size.

First, recall that XncvxYncvx=UΣV\bm{X}^{\mathsf{ncvx}}\bm{Y}^{\mathsf{ncvx}\top}=\bm{{U}}\bm{{\Sigma}}\bm{{V}}^{\top}, which combined with (C.12) gives

Here, the last inequality comes from (C.13) and its immediate consequence that ΣQΣQ11\|\bm{{\Sigma}}_{\bm{{Q}}}\|\asymp\|\bm{{\Sigma}}_{\bm{{Q}}}^{-1}\|\asymp 1.

Second, apply the perturbation bound for matrix square roots (cf. Lemma 13) to obtain

Here, the inequality (i) depends on the facts that

whereas the inequality (ii) holds because of the facts that (XncvxXncvx)11/σmin\|(\bm{X}^{\mathsf{ncvx}\top}\bm{X}^{\mathsf{ncvx}})^{-1}\|\lesssim 1/\sigma_{\min}, (YncvxYncvx)11/σmin\|(\bm{Y}^{\mathsf{ncvx}\top}\bm{Y}^{\mathsf{ncvx}})^{-1}\|\lesssim 1/\sigma_{\min} and the balancedness condition (A.11)

As a result, the operator norm of A2\bm{{A}}_{2} is bounded by

Take (C.14), (C.15) and (C.17) collectively to arrive at

C.2 Proof of the inequality (5.4b)

Next, we switch attention to the low-rank factors. Our goal is to demonstrate that (Xcvx,d,Ycvx,d)(\bm{X}^{\mathsf{cvx,d}},\bm{Y}^{\mathsf{cvx,d}}) and (Xncvx,d,Yncvx,d)(\bm{X}^{\mathsf{ncvx,d}},\bm{Y}^{\mathsf{ncvx,d}}) are both extremely close to (U(Σ+λpIr)1/2,V(Σ+λpIr)1/2)(\bm{U}(\bm{\Sigma}+\frac{\lambda}{p}\bm{I}_{r})^{1/2},\bm{V}(\bm{\Sigma}+\frac{\lambda}{p}\bm{I}_{r})^{1/2}) modulo some global rotation, which will be established in (C.19) and (C.20) shortly.

We start by justifying the proximity between (Xncvx,d,Yncvx,d)(\bm{X}^{\mathsf{ncvx,d}},\bm{Y}^{\mathsf{ncvx,d}}) and (U(Σ+λpIr)1/2,V(Σ+λpIr)1/2)(\bm{U}(\bm{\Sigma}+\frac{\lambda}{p}\bm{I}_{r})^{1/2},\bm{V}(\bm{\Sigma}+\frac{\lambda}{p}\bm{I}_{r})^{1/2}). In view of (C.12), we know that

The latter bound follows from similar derivations as in (C.16). A similar bound holds for Yncvx,d\bm{Y}^{\mathsf{ncvx,d}}. Recognizing that

Next, we establish the connection between (Xcvx,d,Ycvx,d)(\bm{X}^{\mathsf{cvx,d}},\bm{Y}^{\mathsf{cvx,d}}) and (U(Σ+λpIr)1/2,V(Σ+λpIr)1/2)(\bm{U}(\bm{\Sigma}+\frac{\lambda}{p}\bm{I}_{r})^{1/2},\bm{V}(\bm{\Sigma}+\frac{\lambda}{p}\bm{I}_{r})^{1/2}). To accomplish this, we first study the relationship between (Xcvx,Ycvx)(\bm{X}^{\mathsf{cvx}},\bm{Y}^{\mathsf{cvx}}) and (UΣ1/2,VΣ1/2)(\bm{U}\bm{\Sigma}^{1/2},\bm{V}\bm{\Sigma}^{1/2}). Recall that Xcvx\bm{X}^{\mathsf{cvx}} and Ycvx\bm{Y}^{\mathsf{cvx}} constitute a balanced factorization of Zcvx,r\bm{Z}^{\mathsf{cvx},r}, while (UΣ1/2,VΣ1/2)(\bm{U}\bm{\Sigma}^{1/2},\bm{V}\bm{\Sigma}^{1/2}) is a balanced one of XncvxYncvx=UΣV\bm{X}^{\mathsf{ncvx}}\bm{Y}^{\mathsf{ncvx}\top}=\bm{U}\bm{\Sigma}\bm{V}^{\top}. Hence one can view Zcvx,r\bm{Z}^{\mathsf{cvx},r} as a perturbation of XncvxYncvx=UΣV\bm{X}^{\mathsf{ncvx}}\bm{Y}^{\mathsf{ncvx}\top}=\bm{U}\bm{\Sigma}\bm{V}^{\top} and investigate the perturbation bounds on the balanced factorizations. Going through the same derivations as in [MWCC17, Appendix B.7] and [CLL19, Appendix B.2.1], one reaches

Here the last relation follows from the proximity of the convex estimator and the nonconvex estimator; see (A.12). Repeating the same argument as above to translate the bound between (Xcvx,Ycvx)(\bm{X}^{\mathsf{cvx}},\bm{Y}^{\mathsf{cvx}}) and (UΣ1/2,VΣ1/2)(\bm{U}\bm{\Sigma}^{1/2},\bm{V}\bm{\Sigma}^{1/2}) to that of (Xcvx,d,Ycvx,d)(\bm{X}^{\mathsf{cvx,d}},\bm{Y}^{\mathsf{cvx,d}}) and (U(Σ+λpIr)1/2,V(Σ+λpIr)1/2)(\bm{U}(\bm{\Sigma}+\frac{\lambda}{p}\bm{I}_{r})^{1/2},\bm{V}(\bm{\Sigma}+\frac{\lambda}{p}\bm{I}_{r})^{1/2}), we conclude that

This together with (C.19) and the assumption nκ4n\gg\kappa^{4} concludes the proof.

C.3 Proof of the inequality (5.5)

We shall focus on proving the claim for the nonconvex estimator Mncvx,d\bm{M}^{\mathsf{ncvx,d}} and XncvxYncvx\bm{X}^{\mathsf{ncvx}}\bm{Y}^{\mathsf{ncvx}\top}; the claim for the convex estimator Zcvx\bm{Z}^{\mathsf{cvx}} can be treated similarly.

To see this, it has been established in Appendix C.1 that

Appendix D Analysis of the low-rank factors

We concentrate on the factor Xd\bm{X}^{\mathsf{d}}; the other factor Yd\bm{Y}^{\mathsf{d}} can be treated similarly. By definition of the gradient, one has

with A\bm{A} defined in (5.7), we can rearrange (D.1) as follows

By construction, the de-shrunken estimator Yd\bm{Y}^{\mathsf{d}} satisfies the following property

where the last identity follows since (YY)1/2(\bm{Y}^{\top}\bm{Y})^{1/2} and (Ir+λp(YY)1)1/2(\bm{I}_{r}+\frac{\lambda}{p}(\bm{Y}^{\top}\bm{Y})^{-1})^{1/2} commute. Combining (D.3) with the identity (D.4) gives

Multiplying both sides of (D.5) by (Ir+λp(YY)1)1/2(\bm{I}_{r}+\frac{\lambda}{p}(\bm{Y}^{\top}\bm{Y})^{-1})^{1/2} and recalling the definition of Yd\bm{Y}^{\mathsf{d}} in (3.8), we have

Since YdYd\bm{Y}^{\mathsf{d}\top}\bm{Y}^{\mathsf{d}} and (Ir+λp(YY)1)1/2(\bm{I}_{r}+\frac{\lambda}{p}(\bm{Y}^{\top}\bm{Y})^{-1})^{1/2} also commute, we have

where the last relation uses the definition of Xd\bm{X}^{\mathsf{d}} (see (3.8)).

Substituting the identity (D.7) back into (D.6) and making a few elementary algebraic manipulations yield the desired decomposition (5.8a).

D.2 Proof of Lemma 5

Recall that Yd=YdHd\overline{\bm{Y}}^{\mathsf{d}}=\bm{Y}^{\mathsf{d}}\bm{H}^{\mathsf{d}} and similarly define

The triangle inequality tells us that for any fixed 1jn1\leq j\leq n,

In what follows, we shall control α1\alpha_{1} and α2\alpha_{2} separately.

To begin with, denoting Δ(j)Yd,(j)(Yd,(j)Yd,(j))1Y(YY)1\bm{\Delta}^{(j)}\triangleq\overline{\bm{Y}}^{\mathsf{d},(j)}(\overline{\bm{Y}}^{\mathsf{d},(j)\top}\overline{\bm{Y}}^{\mathsf{d},(j)})^{-1}-\bm{Y}^{\star}(\bm{Y}^{\star\top}\bm{Y}^{\star})^{-1} results in

Before proceeding, we gather a few useful facts regarding Δ(j)\bm{\Delta}^{(j)}, as summarized in the following claim.

With probability at least 1O(n11)1-O(n^{-11}), we have

With the bounds on Δ(j)\|\bm{\Delta}^{(j)}\| and Δ(j)2,\|\bm{\Delta}^{(j)}\|_{2,\infty} in place, we are ready to control α1\alpha_{1}. By construction, Δ(j)\bm{\Delta}^{(j)} is independent of ejPΩ(E)\bm{e}_{j}^{\top}\mathcal{P}_{\Omega}\left(\bm{E}\right). Therefore, the vector on the right-hand side of (D.8), 1pk=1nEjkδjkΔk,(j)\frac{1}{p}\sum_{k=1}^{n}E_{jk}\delta_{jk}\bm{\Delta}_{k,\cdot}^{(j)}, is a sum of conditionally independent random vectors. In particular, conditional on Δ(j)\bm{\Delta}^{(j)} and {δjk}k:1kn\{\delta_{jk}\}_{k:1\leq k\leq n}, one has

Invoke the concentration inequality for Gaussian random vectors [HKZ12, Proposition 1.1] to see that

with probability at least 1et1-e^{-t}. It remains to control Σ^\|\hat{\bm{\Sigma}}\|, which we state in the following claim.

Suppose that n2pκ2μrnlog2nn^{2}p\gg\kappa^{2}\mu rn\log^{2}n. Then with probability exceeding 1O(n11)1-O(n^{-11}),

Combine the upper bound on Σ^\|\hat{\bm{\Sigma}}\| with (D.10) and choose tlognt\asymp\log n to arrive at

with probability exceeding 1O(n11)1-O(n^{-11}).

We move on to bounding α2\alpha_{2}, for which we have

Here (i) uses the fact that PΩ(E)σnp\|\mathcal{P}_{\Omega}(\bm{E})\|\lesssim\sigma\sqrt{np} (see [CCF+19, Lemma 3]), the perturbation bounds for pseudo-inverses (see Lemma 12) and (A.19); the penultimate inequality (ii) comes from the fact that YdYd,(j)κσσminnlognpY2,\|\overline{\bm{Y}}^{\mathsf{d}}-\overline{\bm{Y}}^{\mathsf{d},(j)}\|\lesssim\kappa\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n\log n}{p}}\|\bm{Y}^{\star}\|_{2,\infty} (see (A.16c)) and last one uses the incoherence condition Y2,μrσmax/n\|\bm{Y}^{\star}\|_{2,\infty}\leq\sqrt{\mu r\sigma_{\max}/n} (cf. (A.17)).

Combine the bounds on α1\alpha_{1} and α2\alpha_{2} to reach

Taking the maximum over 1jn1\leq j\leq n establishes our bound on Φ12,\|\bm{\Phi}_{1}\|_{2,\infty}.

Apply the perturbation bound for pseudo-inverses (see Lemma 12) to obtain

Here we have utilized the facts that Yd,(j)YκσσminnpX\|\overline{\bm{Y}}^{\mathsf{d},(j)}-\bm{Y}^{\star}\|\lesssim\kappa\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}}\|\bm{X}^{\star}\| (see (A.16a)) and a simple consequence of (A.19), viz.

Moreover, the triangle inequality tells us that

where the penultimate inequality follows from the facts that Yd,(j)2,2F2,\|\overline{\bm{Y}}^{\mathsf{d},(j)}\|_{2,\infty}\leq 2\|\bm{F}^{\star}\|_{2,\infty}, Yd,(j)Y2,κσσminnlognpF2,\|\overline{\bm{Y}}^{\mathsf{d},(j)}-\bm{Y}^{\star}\|_{2,\infty}\lesssim\kappa\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n\log n}{p}}\|\bm{F}^{\star}\|_{2,\infty} (see (A.16b)) and that

Here the penultimate inequality follows from (A.19). The proof of the claim is then complete. ∎

Conditional on Δ(j)\bm{{\Delta}}^{(j)}, using Bernstein’s inequality and the fact that Δ(j)\bm{\Delta}^{(j)} and {δjk}k:1kn\{\delta_{jk}\}_{k:1\leq k\leq n} are independent, we arrive at that with probability exceeding 1O(n11)1-O(n^{-11}),

As a result, with probability at least 1O(n11)1-O(n^{-11}), we have

as long as npκ2μrlog2nnp\gg\kappa^{2}\mu r\log^{2}n. Here the middle inequality uses Claim 2. In view of the triangle inequality,

with the proviso that n2pκ2μrnlog2nn^{2}p\gg\kappa^{2}\mu rn\log^{2}n. Again, the last line makes use of Claim 2. This concludes the proof of the claim. ∎

D.3 Proof of Lemma 6

where the second inequality follows from the incoherence assumption that ejX2X2,μrσmax/n\|\bm{e}_{j}^{\top}\bm{X}^{\star}\|_{2}\leq\|\bm{X}^{\star}\|_{2,\infty}\leq\sqrt{\mu r\sigma_{\max}/n} (cf. (A.17)) and the fact that (YdYd)11/σmin\|(\overline{\bm{Y}}^{\mathsf{d}\top}\overline{\bm{Y}}^{\mathsf{d}})^{-1}\|\lesssim 1/\sigma_{\min}, a simple consequence of (A.19).

It remains to control (YdY)Yd\|(\overline{\bm{Y}}^{\mathsf{d}}-\bm{Y}^{\star})^{\top}\overline{\bm{Y}}^{\mathsf{d}}\|. To simplify notation hereafter, define ΔXXdX\bm{\Delta}_{\bm{X}}\triangleq\overline{\bm{X}}^{\mathsf{d}}-\bm{X}^{\star} and ΔYYdY\bm{\Delta}_{\bm{Y}}\triangleq\overline{\bm{Y}}^{\mathsf{d}}-\bm{Y}^{\star}. First, observe that

Second, in view of the decomposition of Yd\bm{Y}^{\mathsf{d}} given in (5.8b), we have

The following claim connects ΔYY\bm{\Delta}_{\bm{Y}}^{\top}\bm{Y}^{\star} with XΔX\bm{X}^{\star\top}\bm{\Delta}_{\bm{X}}.

This relation together with (D.18) yields

A little algebraic manipulation then gives

It is easy to check from (A.19) that 0.25σminIrXdXd,Σ4σmaxIr0.25\sigma_{\min}\bm{I}_{r}\preceq\overline{\bm{X}}^{\mathsf{d}\top}\overline{\bm{X}}^{\mathsf{d}},\bm{\Sigma}^{\star}\preceq 4\sigma_{\max}\bm{I}_{r}. Hence one can invoke Lemma 15 with X=ΔYY\bm{X}=\bm{\Delta}_{\bm{Y}}^{\top}\bm{Y}^{\star}, A=Σ\bm{A}=\bm{\Sigma}^{\star}, B=XdXd\bm{B}=\overline{\bm{X}}^{\mathsf{d}\top}\overline{\bm{X}}^{\mathsf{d}} and C=XdXdS+0.5(ΔXΔXΔYΔY)Σ+ΔXYdΣ\bm{C}=\overline{\bm{X}}^{\mathsf{d}\top}\overline{\bm{X}}^{\mathsf{d}}\bm{S}+0.5(\bm{\Delta}_{\bm{X}}^{\top}\bm{\Delta}_{\bm{X}}-\bm{\Delta}_{\bm{Y}}^{\top}\bm{\Delta}_{\bm{Y}})\bm{\Sigma}^{\star}+\bm{{\Delta}}_{\bm{{XY}}}^{\mathsf{{d}}}\bm{\Sigma}^{\star} to obtain

where we have plugged in the definition of S\bm{S} (see (D.19)) and used the identity XdXdHd(XdXd)1=Hd\overline{\bm{X}}^{\mathsf{d}\top}\overline{\bm{X}}^{\mathsf{d}}\bm{H}^{\mathsf{d}\top}(\bm{X}^{\mathsf{d}\top}\bm{X}^{\mathsf{d}})^{-1}=\bm{H}^{\mathsf{d}\top}. Combine the above inequality with (D.16) to obtain

It then boils down to controlling the above terms α1,α2,α3\alpha_{1},\alpha_{2},\alpha_{3} and α4\alpha_{4}.

First, the term α4\alpha_{4} can be upper bounded by

Here, the second line utilizes the facts that (Ir+λ(XX)1/p)1/21\|(\bm{I}_{r}+\lambda(\bm{X}^{\top}\bm{X})^{-1}/p)^{1/2}\|\lesssim 1, XdXdYYσmax\|\overline{\bm{X}}^{\mathsf{d}\top}\overline{\bm{X}}^{\mathsf{d}}\|\asymp\|\bm{Y}^{\top}\bm{Y}^{\star}\|\lesssim\sigma_{\max} and the results in (A.10), (A.13e) and (C.16).

Moving on to α3\alpha_{3}, we recall from (A.13b) that

Finally, for the term α1\alpha_{1}, by the triangle inequality one has

Observe that conditional on {δjk}1j,kn\{\delta_{jk}\}_{1\leq j,k\leq n} one has

As a result, we obtain that with probability at least 1O(n10)1-O(n^{-10}),

Here, the second relation uses the fact that

with probability at least 1O(n10)1-O(n^{-10}) as long as n2pμrnlognn^{2}p\gg\mu rn\log n, which follows from [MWCC17, Lemma 38] or [CR09, Section 4.2] by observing that X,i(Y,j)\bm{{X}}_{\cdot,i}^{\star}(\bm{{Y}}_{\cdot,j}^{\star})^{\top} lies in the tangent space of M\bm{{M}}^{\star}. Take (D.21) and (D.22) collectively to reach

Substituting the bounds on α1\alpha_{1}, α2\alpha_{2}, α3\alpha_{3} and α4\alpha_{4} back to (D.20) results in

Taking the maximum over 1jn1\leq j\leq n leads to the desired result.

Finally, we are left with proving Claim 4.

First, by XX=YY\bm{X}^{\star\top}\bm{X}^{\star}=\bm{Y}^{\star\top}\bm{Y}^{\star}, one can obtain

Second, since Hd\bm{{H}}^{\mathsf{{d}}} is the best rotation matrix to align (Xd,Yd)(\bm{X}^{\mathsf{{d}}},\bm{Y}^{\mathsf{{d}}}) and (X,Y)(\bm{X}^{\star},\bm{Y}^{\star}), we know from [MWCC17, Lemma 35] that

D.4 Proof of Lemma 7

Here (i) and (iv) rely on the unitary invariance of the operator norm, (ii) uses the definition of Yd\bm{Y}^{\mathsf{d}} (see (3.8)) and (iii) follows from the choice λσnp\lambda\lesssim\sigma\sqrt{np} and immediate consequences of (A.19)

Therefore, it suffices to control ejAYH2\|\bm{e}_{j}^{\top}\bm{A}\bm{Y}\bm{H}\|_{2}. To this end, we have the following decomposition

we can rewrite the first term of (D.25) as

Since (X(j),Y(j))(\bm{X}^{(j)},\bm{Y}^{(j)}) is independent of {δjk}1kn,\{\delta_{jk}\}_{1\leq k\leq n}, the right hand side of the above equation can be viewed as a sum of independent random vectors, conditional on (X(j),Y(j))(\bm{X}^{(j)},\bm{Y}^{(j)}). Invoke Bernstein’s inequality to see that

holds with probability at least 1O(n10)1-O(n^{-10}). Here, we denote

Here the penultimate inequality uses (A.14d) and the bound Y(j)2,μrσmax/n\|\bm{Y}^{(j)}\|_{2,\infty}\lesssim\sqrt{\mu r\sigma_{\max}/n}. We arrive at the conclusion that: with probability exceeding 1O(n10)1-O(n^{-10}),

Next, we move on to the second term Δ2\bm{\Delta}_{2} of (D.25), which can be further decomposed as follows

In what follows, we bound θ1\bm{\theta}_{1} and θ2\bm{\theta}_{2} sequentially.

Regarding θ1\bm{\theta}_{1}, using the definition of A\bm{A} we obtain

where the second relation holds due to (A.14b) and the fact that Aσnpκ4μ2r2lognnp\|\bm{A}\|\lesssim\sigma\sqrt{\frac{n}{p}}\sqrt{\frac{\kappa^{4}\mu^{2}r^{2}\log n}{np}} (cf. (A.18)).

Moving on to θ2\bm{\theta}_{2}, we can utilize the identity

With regards to α1\alpha_{1}, we have by Bernstein’s inequality and (A.14b) that

holds with probability exceeding 1O(n10)1-O(n^{-10}). Here, we define

provided that npμrlognnp\gg\mu r\log n. Here we apply the bounds Y(j)σmax\|\bm{Y}^{(j)}\|\lesssim\sqrt{\sigma_{\max}} and Y(j)2,μrσmax/n\|\bm{Y}^{(j)}\|_{2,\infty}\lesssim\sqrt{\mu r\sigma_{\max}/n} (see (A.19) and the following remarks). In the end, we turn to the term α2\alpha_{2}, which obeys

where the second line arises from the Cauchy-Schwarz inequality.

Take the previous bounds collectively to arrive at

as long as σσminκ2nlognp1\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{\kappa^{2}n\log n}{p}}\ll 1. Finally, we conclude that

D.5 Proof of Lemma 8

First, it is straightforward to verify that

where the last line arises from (A.10), the choice λσnp\lambda\lesssim\sigma\sqrt{np} (cf. (A.6)), and the bounds

Here the latter two are immediate consequences of (A.19). Second, with regards to the term involving Δbalancing\bm{\Delta}_{\mathsf{balancing}}, we have

where the middle line uses (A.19) and the last one follows from (C.16).

Combine (D.27), (D.28) and the triangle inequality to establish the advertised result, with the proviso that n2κ3μrn^{2}\gg\kappa^{3}\mu r.

D.6 Proof of Lemma 9

We invoke the identity Y(YY)1=V(Σ)1/2\bm{Y}^{\star}(\bm{Y}^{\star\top}\bm{Y}^{\star})^{-1}=\bm{V}^{\star}(\bm{\Sigma}^{\star})^{-1/2} (since Y=V(Σ)1/2\bm{Y}^{\star}=\bm{V}^{\star}(\bm{\Sigma}^{\star})^{1/2}) to see that for any 1in1\leq i\leq n,

consists of a sum of independent random vectors, where we recall that δik=\mathds1{(i,k)Ω}\delta_{ik}=\operatorname{\mathds{1}}\{(i,k)\in\Omega\}. In addition, the right-hand side of the above formula is conditionally Gaussian, namely,

Note that S\bm{S} depends on the index ii through {δik}k:1kn\{\delta_{ik}\}_{k:1\leq k\leq n}. Denote by S\bm{S}^{\star} the expectation of S\bm{S}, that is,

Clearly, when npκ2μrlognnp\gg\kappa^{2}\mu r\log n, one has S0\bm{S}\succ\bm{0} on the event E\mathcal{E} and hence S1/2\bm{S}^{-1/2} is well-defined. As a result, on the event E\mathcal{E}, we have

As can be easily seen from (D.29) and (D.30), each row of ZX\bm{Z}_{\bm{X}} follows the Gaussian distribution

It remains to show that, with high probability,

In what follows, we shall bound the two terms on the right-hand side of the above display sequentially.

First, observe that k=1nEikδikVk,\sum_{k=1}^{n}E_{ik}\delta_{ik}\bm{V}_{k,\cdot}^{\star} involves a sum of independent random vectors with

where ψ1\|\cdot\|_{\psi_{1}} denotes the sub-exponential norm [Ver17]. One can then apply the matrix Bernstein inequality [Kol11, Theorem 2.7] to conclude that with probability at least 1O(n20)1-O(n^{-20}),

Next, we move on to IrS1/2(S)1/2\|\bm{I}_{r}-\bm{S}^{-1/2}(\bm{S}^{\star})^{1/2}\|. Recall that on the event E\mathcal{E}, one has

This together with the fact that σ2/(pσmax)λmin(S)λmax(S)σ2/(pσmin)\sigma^{2}/(p\sigma_{\max})\leq\lambda_{\min}(\bm{S}^{\star})\leq\lambda_{\max}(\bm{S}^{\star})\leq\sigma^{2}/(p\sigma_{\min}) gives

with the proviso that npκ2μrlognnp\gg\kappa^{2}\mu r\log n. Therefore, straightforward calculations yield

Here the second relation is the perturbation bound for the matrix square roots (see Lemma 13).

Combine the above two bounds to conclude that

with probability at least 1O(n10)1-O(n^{-10}). Here the last line utilizes the matrix Bernstein inequality, where

Consequently with probability exceeding 1O(n10)1-O(n^{-10}) one has

Appendix E Analysis of the entries of the matrix

The term Λij\Lambda_{ij} can be naturally split into two terms, namely

In what follows, we shall bound each term individually.

Regarding the first term, one sees from Theorem 5 that with probability exceeding 1O(n10)1-O(n^{-10})

where the last line follows since Xi,2σmaxUi,2\|\bm{X}_{i,\cdot}^{\star}\|_{2}\leq\sqrt{\sigma_{\max}}\|\bm{U}_{i,\cdot}^{\star}\|_{2} and Yj,2σmaxVj,2\|\bm{Y}_{j,\cdot}^{\star}\|_{2}\leq\sqrt{\sigma_{\max}}\|\bm{V}_{j,\cdot}^{\star}\|_{2}.

Turning to the second term, we have by the Cauchy-Schwarz inequality that

where the penultimate inequality uses (A.13d) and the last one depends on the incoherence assumption that F2,μrσmax/n\|\bm{F}^{\star}\|_{2,\infty}\leq\sqrt{\mu r\sigma_{\max}/n} (see (A.17)).

Take collectively the above two bounds to complete the proof.

E.2 Proof of Lemma 11

If ZXei\bm{Z}_{\bm{X}}^{\top}\bm{e}_{i} and ZYej\bm{Z}_{\bm{Y}}^{\top}\bm{e}_{j} were independent, then clearly one would have

As such, the main ingredient of the proof boils down to demonstrating that ZXei\bm{Z}_{\bm{X}}^{\top}\bm{e}_{i} and ZYej\bm{Z}_{\bm{Y}}^{\top}\bm{e}_{j} are nearly independent.

Fortunately, this weak dependency can be easily decoupled, for which we have the following claim.

Suppose that npκ2μr2log2nnp\gg\kappa^{2}\mu r^{2}\log^{2}n. One has the decomposition

where Z~XeiN(0,σ2(Σ)1/p)\widetilde{\bm{Z}}_{\bm{X}}^{\top}\bm{e}_{i}\sim\mathcal{N}(\bm{0},\sigma^{2}(\bm{\Sigma}^{\star})^{-1}/p) and is independent of {δkj,Ekj}k:1kn\{\delta_{kj},E_{k}j\}_{k:1\leq k\leq n} and hence of ZYej\bm{Z}_{\bm{Y}}^{\top}\bm{e}_{j}. In addition, with probability at least 1O(n10)1-O(n^{-10}) one has

The desired result follows immediately from Claim 5, since

Similarly, repeating the same argument above, we can also show that eiZXYej+eiXZYej\bm{e}_{i}^{\top}\bm{Z}_{\bm{X}}\bm{Y}^{\star\top}\bm{e}_{j}+\bm{e}_{i}^{\top}\bm{X}^{\star}\bm{Z}_{\bm{Y}}^{\top}\bm{e}_{j} can be decomposed as a Gaussian random variable N(0,vij)\mathcal{N}\left(0,v_{ij}^{\star}\right) as well as a residual term bounded above by (σ/p)(κ2μrlogn)/(np)Ui,2,({\sigma}/{\sqrt{p}})\sqrt{({\kappa^{2}\mu r\log n})/({np})}\|\bm{U}_{i,\cdot}^{\star}\|_{2,\infty} with high probability. These together finish the proof.

Instate the notation used in Appendix D.6. Recall that

To remove the effect of δij,Eij{\delta_{ij},E_{ij}} on ZXei\bm{Z}_{\bm{X}}^{\top}\bm{e}_{i}, we construct an auxiliary random matrix Z~X\widetilde{\bm{Z}}_{\bm{X}} as follows

where S=p1σ2(Σ)1\bm{S}^{\star}=p^{-1}\sigma^{2}\left(\bm{\Sigma}^{\star}\right)^{-1},

It is easily seen that Z~XeiN(0,σ2(Σ)1/p)\widetilde{\bm{Z}}_{\bm{X}}^{\top}\bm{e}_{i}\sim\mathcal{N}(\bm{0},\sigma^{2}(\bm{\Sigma}^{\star})^{-1}/p); more importantly Z~Xei\widetilde{\bm{Z}}_{\bm{X}}^{\top}\bm{e}_{i} is independent of {δkj,Ekj}1kn\{\delta_{kj},E_{kj}\}_{1\leq k\leq n} and hence of ZYej\bm{Z}_{\bm{Y}}^{\top}\bm{e}_{j}.

which together with the triangle inequality and the fact S=σ2/(pσmin)\|\bm{S}^{\star}\|=\sigma^{2}/(p\sigma_{\min}) yields

Here we have used the results in (D.32) and (D.33). We are left with bounding S1/2Sj1/2\|\bm{S}^{-1/2}-\bm{S}_{-j}^{-1/2}\|, for which we have

Take the above bound collectively with (D.33) to yield

as long as npκμrnp\gg\kappa\mu r. As a result, we have

where the middle line relies on the perturbation of matrix square roots; see Lemma 13. Combining all, we arrive at

with the proviso that npκ2μr2log2nnp\gg\kappa^{2}\mu r^{2}\log^{2}n. This finishes the proof. ∎

Appendix F Proof of Corollary 1

This section is dedicated to establishing the following result, which subsumes Corollary 1 as a special case.

Suppose that the conditions (3.18) hold, and recall the notation in Corollary 1. Then one has

Before entering the main proof of Corollary 2, we make a simple observation that

where we recall that Xd=XdHd\overline{\bm{X}}^{\mathsf{d}}=\bm{X}^{\mathsf{d}}\bm{H}^{\mathsf{d}} and Yd=YdHd\overline{\bm{Y}}^{\mathsf{d}}=\bm{Y}^{\mathsf{d}}\bm{H}^{\mathsf{d}}. Here, the first inequality arises from (A.13d) and the second one uses the assumption on Ui,2+Vj,2\|\bm{U}_{i,\cdot}^{\star}\|_{2}+\|\bm{V}_{j,\cdot}^{\star}\|_{2} (i.e. (3.18b)). A simple consequence of (F.1) is that

which in conjunction with Theorem 6 yields the following decomposition

With this decomposition at hand, we have that for any ε>0\varepsilon>0,

With probability exceeding 1O(n10)1-O(n^{-10}), the term ΔV\Delta_{V} obeys

With Claim 6 at hand, we are ready to take

for any tt. This immediately establishes Corollary 2.

Suppose for the moment that vijvijcvij|v_{ij}-v_{ij}^{\star}|\leq cv_{ij}^{\star} for some c1/2c\leq 1/2. Then it follows immediately that

Therefore if suffices to control MijdMij|M_{ij}^{\mathsf{d}}-M_{ij}^{\star}| and vijvij|v_{ij}^{\star}-v_{ij}| (i.e. obtaining the quantity cc).

First, expand MijdM_{ij}^{\mathsf{d}} and MijM_{ij}^{\star} to see

where the middle line depends on (F.1) and (F.2), and the last inequality arises since σ(Ui,2+Vj,2)/pvij\sigma(\|\bm{U}_{i,\cdot}^{\star}\|_{2}+\|\bm{V}_{j,\cdot}^{\star}\|_{2})/\sqrt{p}\lesssim\sqrt{v_{ij}^{\star}}.

Now we move on to vijvij|v_{ij}^{\star}-v_{ij}|. By the definition of vijv_{ij}, one has

Focusing on the X\bm{X} factor, we have — with probability at least 1O(n10)1-O(n^{-10}) — that

Here, the first relation comes from the identity Xi,d(XdXd)1(Xi,d)=Xi,d(XdXd)1(Xi,d)\bm{X}_{i,\cdot}^{\mathsf{d}}(\bm{X}^{\mathsf{d}\top}\bm{X}^{\mathsf{d}})^{-1}(\bm{X}_{i,\cdot}^{\mathsf{d}})^{\top}=\overline{\bm{X}}_{i,\cdot}^{\mathsf{d}}(\overline{\bm{X}}^{\mathsf{d}\top}\overline{\bm{X}}^{\mathsf{d}})^{-1}(\overline{\bm{X}}_{i,\cdot}^{\mathsf{d}})^{\top}, and the inequality arises from the triangle inequality. Notice that (XdXd)11/σmin\|(\overline{\bm{X}}^{\mathsf{d}\top}\overline{\bm{X}}^{\mathsf{d}})^{-1}\|\lesssim 1/\sigma_{\min} and that

where the last inequality follows from (A.13b). Using the bounds (F.1) and (F.2), we continue the upper bound in (F.4) as follows

A similar bound holds for the factor Y\bm{Y}. Therefore, with high probability we have

where the last relation results from the condition on Ui,2+Vj,2\|\bm{U}_{i,\cdot}^{\star}\|_{2}+\|\bm{V}_{j,\cdot}^{\star}\|_{2} (cf. (3.18b)).

Combine the bounds on MijdMij|M_{ij}^{\mathsf{d}}-M_{ij}^{\star}| and vijvij|v_{ij}^{\star}-v_{ij}| to see that with probability exceeding 1O(n10)1-O(n^{-10}),

This establishes the desired upper bound on ΔV|\Delta_{V}|. ∎

Appendix G Proof of Theorem 3

As we have argued in Section 5.1, it suffices to prove the claim for Md=XdXd=XdYd\bm{M}^{\mathsf{d}}=\bm{X}^{\mathsf{d}}\bm{X}^{\mathsf{d}\top}=\overline{\bm{X}}^{\mathsf{d}}\overline{\bm{Y}}^{\mathsf{d}\top}. For simplicity of notation, we define

Apply the decompositions in Theorem 5 to obtain

where we use the fact that YY=Σ\bm{Y}^{\star\top}\bm{Y}^{\star}=\bm{\Sigma}^{\star}. Theorem 5 tells us that ZXeii.i.d.N(0,σ2(Σ)1/p)\bm{Z}_{\bm{X}}^{\top}\bm{e}_{i}\overset{\text{i.i.d.}}{\sim}\mathcal{N}(\bm{0},\sigma^{2}(\bm{\Sigma}^{\star})^{-1}/p), which further implies

Now we turn to the term rem\mathsf{rem}, for which we have the following two claims.

With probability at least 1O(n10)1-O(n^{-10}), one has

With probability exceeding 1O(n10)1-O(n^{-10}), we have

Combine all of the above bounds to yield the desired result.

Plug in the definition of Θ\bm{\Theta} (cf. (G.1)) and invoke the triangle inequality again to see that

Here the last relation depends on the assumption (3.18a). Second, we have already established in this section that

with probability exceeding 1O(n10)1-O(n^{-10}). Substitute the above two facts into (G.2) to arrive at

where \max\big{\{}\|\bm{\Delta}_{\bm{X}}\|_{2,\infty},\|\bm{\Delta}_{\bm{Y}}\|_{2,\infty}\big{\}}\lesssim\frac{\sigma}{\sqrt{p\sigma_{\min}}}\sqrt{\frac{\kappa^{2}\mu r^{2}\log^{2}n}{np}} and hence

Consequently, use the triangle inequality and Cauchy-Schwarz to verify that

where (i) follows since X=UΣ1/2\bm{X}^{\star}=\bm{U}^{\star}\bm{\Sigma}^{\star 1/2} and U=1\|\bm{U}^{\star}\|=1, and (ii) makes use of (G.3) as well as the facts Y,X=σmax\|\bm{Y}^{\star}\|,\|\bm{X}^{\star}\|=\sqrt{\sigma_{\max}} . In addition, invoke Lemma 9 to see that ZXΣ1/2\bm{Z}_{\bm{X}}\bm{\Sigma}^{\star 1/2} and ZYΣ1/2\bm{Z}_{\bm{Y}}\bm{\Sigma}^{\star 1/2} are both Gaussian matrices with i.i.d. N(0,σ2/p)\mathcal{N}(0,\sigma^{2}/p) entries, which together with standard concentration results implies that

with the proviso that npκ3μrlog3nnp\gtrsim\kappa^{3}\mu r\log^{3}n. This means that, with high probability,

Everything then boils down to controlling \mathsf{Tr}\big{(}\bm{Z}_{\bm{X},\bm{E}}\bm{Y}^{\star\top}\bm{Z}_{\bm{Y},\bm{E}}\bm{X}^{\star\top}\big{)}. Towards this end, we first note that

Similarly, ZY,EX=[PΩ(E)]UU/p\bm{Z}_{\bm{Y},\bm{E}}\bm{X}^{\star\top}=[\mathcal{P}_{\Omega}(\bm{E})]^{\top}\bm{U}^{\star}\bm{U}^{\star\top}/p. These identities allow us to derive

Apply the same arguments in controlling (D.21) to obtain that with probability at least 1O(n10)1-O(n^{-10}),

as long as nrlog2nn\gtrsim r\log^{2}n. This combined with (G.6) yields the desired claim. ∎

Appendix H Proof of lower bounds

Fix any ε>0\varepsilon>0. It suffices to prove that the matrix CRLB(Xi,Ω)\mathsf{CRLB}(\bm{X}_{i,\cdot}^{\star}\mid\Omega) defined in (3.27) satisfies

with probability at least 1O(n10)1-O(n^{-10}), provided that npC0ε2κ4μrnp\geq C_{0}\varepsilon^{-2}\kappa^{4}\mu r. Towards this end, we first compute

where we recall that δik=\mathds1{(i,k)Ω}\delta_{ik}=\operatorname{\mathds{1}}\{(i,k)\in\Omega\}. Next, define the following event

where C>0C>0 is some large absolute constant. On the event E\mathcal{E}, in view of the fact σminIrΣσmaxIr\sigma_{\min}\bm{I}_{r}\preceq\bm{\Sigma}^{\star}\preceq\sigma_{\max}\bm{I}_{r}, one has

with the proviso that np4C2κ2μrlognnp\geq 4C^{2}\kappa^{2}\mu r\log n. This further implies that

on the event E\mathcal{E}. Clearly, the requirement (H.1) holds true if npC0ε2κ4μrlognnp\geq C_{0}\varepsilon^{-2}\kappa^{4}\mu r\log n with C0=4C2C_{0}=4C^{2}.

To finish up, we are left with proving that E\mathcal{E} occurs with probability at least 1O(n10)1-O(n^{-10}). Invoke the matrix Bernstein inequality to show that

holds with probability at least 1O(n10)1-O(n^{-10}), where we define

Here we have used the incoherence condition (A.17). Consequently, one reaches the conclusion that with probability exceeding 1O(n10)1-O(n^{-10}),

as long as npμrlognnp\gg\mu r\log n, thus concluding the proof.

H.2 Proof of Lemma 2

The proof strategy is similar to the one used in proving Lemma 1 (cf. Appendix H.1). Fix any ε>0\varepsilon>0. It is sufficient to establish the following inequality

where the scalar CRLB(MijΩ)\mathsf{CRLB}(M_{ij}^{\star}\mid\Omega) is defined in (3.28) and vijv_{ij}^{\star} is defined in Theorem 2. Expand the left-hand side to reach

where the last line follows from the observations that Yj,2σmaxVj,2\|\bm{Y}_{j,\cdot}^{\star}\|_{2}\leq\sqrt{\sigma_{\max}}\|\bm{V}_{j,\cdot}^{\star}\|_{2} and Xi,2σmaxUi,2\|\bm{X}_{i,\cdot}^{\star}\|_{2}\leq\sqrt{\sigma_{\max}}\|\bm{U}_{i,\cdot}^{\star}\|_{2}.

where C>0C>0 is some large universal constant. Two observations are sufficient to derive the desired the result (H.2). First, the event E2\mathcal{E}_{2} happens with probability at least 1O(n10)1-O(n^{-10}) — an easy consequence of the proof of Lemma 1 (cf. Appendix H.1). Second, on the event E2\mathcal{E}_{2}, repeating the same proof of Lemma 1 (cf. Appendix H.1), one can deduce that

Comparing (H.2) and (H.3), one arrives at the desired result as long as np4C2ε2κ4μrlognnp\geq 4C^{2}\varepsilon^{-2}\kappa^{4}\mu r\log n.

Appendix I Proofs in Section A

We start with (A.13a). Invoke the triangle inequality to get

where the last relation depends on the unitary invariance of the operator norm and (A.9b). It then boils down to controlling FdF\|\bm{F}^{\mathsf{d}}-\bm{F}\|. Notice that

where the last inequality uses YF2X\|\bm{Y}\|\leq\|\bm{F}\|\leq 2\|\bm{X}^{\star}\| (cf. (A.19)), the fact that λσnp\lambda\lesssim\sigma\sqrt{np} (see (A.6)), the bound (C.16) and the condition n5κn^{5}\gg\kappa. Apply the perturbation bound for matrix square roots (see Lemma 13) to obtain that

Here, (i) uses the facts that (XX)11/σmin\|(\bm{X}^{\top}\bm{X})^{-1}\|\lesssim 1/\sigma_{\min} and that λmin[(Ir+λ/p(XX)1)1/2]1\lambda_{\min}[(\bm{I}_{r}+\lambda/p(\bm{X}^{\top}\bm{X})^{-1})^{1/2}]\geq 1, and (ii) follows from the condition that λσnp\lambda\lesssim\sigma\sqrt{np} (see (A.6)). Combine the above two bounds with F2X\|\bm{F}\|\leq 2\|\bm{X}^{\star}\| (cf. (A.19)) to reach

Moving on to (A.13b), we apply the triangle inequality and (I.3) to see that

In order to control HdH\|\bm{H}^{\mathsf{d}}-\bm{H}\|, we leverage [MWCC17, Lemma 36] to get

where the last relation uses (I.2) and FXσmax\left\|\bm{F}^{\star}\right\|\asymp\left\|\bm{X}^{\star}\right\|\asymp\sqrt{\sigma_{\max}}. Taking these bounds collectively yields

Now we turn attention to (A.13d). Observe that

where the last bound arises from (A.9c). Going through the same calculation as in bounding FdF\|\bm{F}^{\mathsf{d}}-\bm{F}\|, we arrive at

as long as σn/pσmin\sigma\sqrt{n/p}\ll\sigma_{\min}. We can thus continue the upper bound in (I.5) to derive

Here, the second line results from (I.4).

Finally, we deal with (A.13e). From the definition of the de-shrunken estimator (3.8), we have

This combined with the triangle inequality reveals that

Making use of (A.11) and (C.16) allows us to establish the claim.

I.2 Proof of the inequalities (A.16)

The proofs of (A.16a) and (A.16b) are the same as those of (A.13b) and (A.13d), and are hence omitted for conciseness. We are left with (A.16c). Denoting

as long as σn/pσmin/κ\sigma\sqrt{n/p}\ll\sigma_{\min}/\kappa. Here the first inequality follows from (A.13a). In addition, we have

Regarding θ\theta, one can apply the bound (C.16) for (X,Y)(\bm{X},\bm{Y}) and a similar bound for (X(j),Y(j))(\bm{X}^{(j)},\bm{Y}^{(j)}) to obtain

Returning to (I.6), one has by the triangle inequality that

we can apply the perturbation bound for matrix square roots (see Lemma 13) to obtain

where the penultimate relation uses (A.14a) as well as the fact that F2,σminr/n\|\bm{F}^{\star}\|_{2,\infty}\geq\sqrt{\sigma_{\min}r/n}.

With the above bound in place, we are ready to invoke [CCF+19, Lemma 22] to obtain

where the last line comes from (A.14a). This concludes the proof.

Appendix J Technical lemmas

This section collects a few useful matrix perturbation bounds. The first one is concerned with the perturbation of pseudo-inverses.

Let A\bm{A}^{\dagger} (resp. B\bm{B}^{\dagger}) be the pseudo-inverse (i.e. Moore–Penrose inverse) of A\bm{A} (resp. B\bm{B}). Then we have

The next lemma focuses on the perturbation bound for matrix square roots.

Consider two symmetric matrices obeying A1μ1I\bm{A}_{1}\succeq\mu_{1}\bm{I} and A2μ2I\bm{A}_{2}\succeq\mu_{2}\bm{I} for some μ1,μ2>0\mu_{1},\mu_{2}>0. Let R10\bm{R}_{1}\succeq\bm{0} (resp. R20\bm{R}_{2}\succeq\bm{0}) be the (principal) matrix square root of A1\bm{A}_{1} (resp. A2\bm{A}_{2}). Then one has

The following lemma concerns the perturbation of top-rr components of matrices.

From Wedin’s sinΘ\sin\bm{\Theta} theorem [Wed72], there exist orthonormal matrices R1,R2Or×r\bm{R}_{1},\bm{R}_{2}\in\mathcal{O}^{r\times r} such that

In addition, Weyl’s inequality tells us that

Here, the second inequality follows from the triangle inequality and the assumption that EM=Σ\|\bm{E}\|\leq\|\bm{M}\|=\|\bm{\Sigma}\|. Expand UΣVU^Σ^V^\bm{U}\bm{\Sigma}\bm{V}^{\top}-\hat{\bm{U}}\hat{\bm{\Sigma}}\hat{\bm{V}}^{\top} and apply the triangle inequality to obtain

Once again, employ (J.1) and (J.2) to arrive at

The last bound centers around the well-known Sylvester equation XA+BX=C\bm{X}\bm{A}+\bm{B}\bm{X}=\bm{C}.

as long as λminIrAλmaxIr\lambda_{\min}\bm{I}_{r}\preceq\bm{A}\preceq\lambda_{\max}\bm{I}_{r} and λminIrBλmaxIr\lambda_{\min}\bm{I}_{r}\preceq\bm{B}\preceq\lambda_{\max}\bm{I}_{r} for some λmaxλmin>0\lambda_{\max}\geq\lambda_{\min}>0.

To begin with, we intend to show that under the condition λminIrA,BλmaxIr\lambda_{\min}\bm{I}_{r}\preceq\bm{A},\bm{B}\preceq\lambda_{\max}\bm{I}_{r} for some λmaxλmin>0\lambda_{\max}\geq\lambda_{\min}>0, there is a unique solution to the matrix equation XA+BX=C\bm{X}\bm{A}+\bm{B}\bm{X}=\bm{C}. Use the notation of Kronecker product to obtain an equivalent form of XA+BX=C\bm{X}\bm{A}+\bm{B}\bm{X}=\bm{C} as follows

where \otimes denotes the Kronecker product and vec(A)\mathsf{vec}(\bm{A}) stands for the vectorization of the matrix A\bm{A}. Given that A0\bm{A}\succ\bm{0} and B0\bm{B}\succ\bm{0}, it is straightforward to see that AIr+IrB\bm{A}^{\top}\otimes\bm{I}_{r}+\bm{I}_{r}\otimes\bm{B} is invertible, thus justifying the uniqueness of X\bm{X}.

The next step is to characterize X\bm{X} explicitly. The argument herein is adapted from [Smi68] and [Sch92]. Specifically, it has been shown in [Smi68] that the equation XA+BX=C\bm{X}\bm{A}+\bm{B}\bm{X}=\bm{C} is equivalent to

where U=(qIr+B)1(qIrB)\bm{U}=(q\bm{I}_{r}+\bm{B})^{-1}(q\bm{I}_{r}-\bm{B}), V=(qIrA)(qIr+A)1\bm{V}=(q\bm{I}_{r}-\bm{A})(q\bm{I}_{r}+\bm{A})^{-1} and W=2q(qIr+B)1C(qIr+A)1\bm{W}=2q(q\bm{I}_{r}+\bm{B})^{-1}\bm{C}(q\bm{I}_{r}+\bm{A})^{-1}, for any q>0q>0. In particular, when q>λminq>\lambda_{\min}, the matrix

is the unique solution to XUXV=W\bm{X}-\bm{U}\bm{X}\bm{V}=\bm{W} and hence to XA+BX=C\bm{X}\bm{A}+\bm{B}\bm{X}=\bm{C}. To show this, it suffices to verify that the matrix series is convergent. Note that when q>λminq>\lambda_{\min}, one has

and similarly V(qλmin)/(q+λmax)<1\|\bm{V}\|\leq(q-\lambda_{\min})/(q+\lambda_{\max})<1. These two bounds taken together immediately establish the convergence of the matrix series (J.5).

In the end, the explicit representation (J.5) allows us to upper bound X\|\bm{X}\|. A little algebra reveals that

where we make use of the fact UV<1\|\bm{U}\|\|\bm{V}\|<1. In addition, from the definition of W\bm{W} we know that

provided that q>0q>0. Combine this with the bounds on U\|\bm{U}\| and V\|\bm{V}\| to reach

References