Inference and Uncertainty Quantification for Noisy Matrix Completion

Yuxin Chen, Jianqing Fan, Cong Ma, Yuling Yan

stat.ML cs.IT cs.LG eess.SP math.OC math.ST

Introduction

Low-rank matrix completion is concerned with recovering a low-rank matrix, when only a small fraction of its entries are revealed to us [Sre04, CR09, KMO10a]. Tackling this problem in large-scale applications is computationally challenging, due to the intrinsic nonconvexity incurred by the low-rank structure. To further complicate matters, another inevitable challenge stems from the imperfectness of data acquisition mechanisms, wherein the acquired samples are contaminated by a certain amount of noise.

Fortunately, if the entries of the unknown matrix are sufficiently de-localized and randomly revealed, this problem may not be as hard as it seems. Substantial progress has been made over the past several years in designing computationally tractable algorithms — including both convex and nonconvex approaches — that allow to fill in unseen entries faithfully given only partial noisy samples [CP10, NW12, KLT11, KMO10b, CW15, MWCC17, CCF+19]. Nevertheless, modern decision making would often require one step further. It not merely anticipates a faithful estimate, but also seeks to quantify the uncertainty or “confidence” of the provided estimate, ideally in a reasonably accurate fashion. For instance, given an estimate returned by the convex approach, how to use it to compute a short interval that is likely to contain a missing entry?

Conducting effective uncertainty quantification for noisy matrix completion is, however, far from straightforward. For the most part, the state-of-the-art matrix completion algorithms require solving highly complex optimization problems, which often do not admit closed-form solutions. Of necessity, it is generally very challenging to pin down the distributions of the estimates returned by these algorithms. The lack of distributional characterizations presents a major roadblock to performing valid, yet efficient, statistical inference on the unknown matrix of interest.

It is worth noting that a number of recent papers have been dedicated to inference and uncertainty quantification for various high-dimensional problems in high-dimensional statistics, including Lasso [ZZ14, vdGBRD14, JM14a], generalized linear models [vdGBRD14, NL17, BFL+18], graphical models [JVDG15, RSZZ15, MLL17]), amongst others. Very little work, however, has looked into noisy matrix completion along this direction. While non-asymptotic statistical guarantees for noisy matrix completion have been derived in prior theory, most, if not all, of the estimation error bounds are supplied only at an order-wise level. Such order-wise error bounds either lose a significant factor relative to the optimal guarantees, or come with an unspecified (but often enormous) pre-constant. Viewed in this light, a confidence region constructed directly based on such results is bound to be overly conservative, resulting in substantial over-coverage.

2 A glimpse of our contributions

This paper takes a substantial step towards efficient inference and uncertainty quantification for noisy matrix completion. Specifically, we develop a simple procedure to compensate for the bias of the commonly used convex and nonconvex estimators. The resulting de-biased estimators admit nearly accurate non-asymptotic distributional guarantees. Such distributional characterizations in turn allow us to reason about the uncertainty of the obtained estimates vis-à-vis the unknown matrix. While details of our main findings are postponed to Section 3, we would like to immediately single out a few important merits of the proposed inferential procedures and theory:

Our results enable two types of uncertainty assessment, namely, we can construct (i) confidence intervals for each entry — either observed or missing — of the unknown matrix; (ii) confidence regions for the low-rank factors of interest (modulo some unavoidable global ambiguity).

Despite the complicated statistical dependency, our procedure and theory do not rely on sample splitting, thus avoiding the unnecessary widening of confidence intervals / regions due to insufficient data usage.

The confidence intervals / regions constructed based on the proposed procedures are, in some sense, optimal.

We present a unified approach that accommodates both convex and nonconvex estimators seamlessly.

As a byproduct, we characterize the Euclidean estimation errors of the proposed de-biased estimators. Such error bounds are sharp and match an oracle lower bound precisely (including the pre-constant). To the best of our knowledge, this is the first theory that demonstrates that a computationally feasible algorithm can achieve the statistical limit including the pre-constant.

All of this is built upon the intimate link between convex and nonconvex estimators [CCF+19], as well as the recent advances in analyzing the stability of nonconvex optimization against random noise [MWCC17].

Models and notation

To cast the noisy matrix completion problem in concrete statistical settings, we adopt a model commonly studied in the literature [CR09]. We also introduce some useful notation.

Observation models.

What we observe is a random subset of noisy entries of $\bm{M}^{\star}$ ; more specifically, we observe

Incoherence conditions.

Clearly, not all matrices can be reliably estimated from a highly incomplete set of measurements. To address this issue, we impose a standard incoherence condition [CR09, Che15] on the singular subspaces of $\bm{M}^{\star}$ (i.e. $\bm{U}^{\star}$ and $\bm{V}^{\star}$ ):

Asymptotic notation.

Here, $f(n)\lesssim h(n)$ (or $f(n)=O(h(n))$ ) means $|f(n)|\leq c_{1}|h(n)|$ for some constant $c_{1}>0$ , $f(n)\gtrsim h(n)$ means $|f(n)|\geq c_{2}|h(n)|$ for some constant $c_{2}>0$ , $f(n)\asymp h(n)$ means $c_{2}|h(n)|\leq|f(n)|\leq c_{1}|h(n)|$ for some constants $c_{1},c_{2}>0$ , and $f(n)=o(h(n))$ means $\lim_{n\rightarrow\infty}f(n)/h(n)=0$ . We write $f(n)\ll h(n)$ to indicate that $|f(n)|\leq c_{1}|h(n)|$ for some small constant $c_{1}>0$ (much smaller than 1), and use $f(n)\gg h(n)$ to indicate that $|f(n)|\geq c_{2}|h(n)|$ for some large constant $c_{2}>0$ (much larger than 1).

Inferential procedures and main results

The proposed inferential procedure lays its basis on two of the most popular estimation paradigms — convex relaxation and nonconvex optimization — designed for noisy matrix completion. Recognizing the complicated bias of these two highly nonlinear estimators, we shall first illustrate how to perform bias correction, followed by a distributional theory that establishes the near-Gaussianity and optimality of the proposed de-biased estimators.

We first review in passing two tractable estimation algorithms that are arguably the most widely used in practice. They serve as the starting point for us to design inferential procedures for noisy low-rank matrix completion. The readers familiar with this literature can proceed directly to Section 3.2.

Recall that the rank function $\mathsf{rank}(\cdot)$ is highly nonconvex, which often prevents us from computing a rank-constrained estimator in polynomial time. For the sake of computational feasibility, prior works suggest relaxing the rank function into its convex surrogate [Faz02, RFP10]; for example, one can consider the following penalized least-squares convex program

or using our notation $\mathcal{P}_{\Omega}$ ,

Here, $\|\cdot\|_{*}$ is the nuclear norm (the sum of singular values, which is a convex surrogate of the rank function), and $\lambda>0$ is some regularization parameter. Under mild conditions, the solution to the convex program (3.1) provably attains near-optimal estimation accuracy (in an order-wise sense), provided that a proper regularization parameter $\lambda$ is adopted [CCF+19].

Nonconvex optimization.

Intimate connections between convex and nonconvex estimates.

Denote by $\bm{Z}^{\mathsf{cvx}}$ any minimizer of the convex program (3.1), and denote by $(\bm{X}^{\mathsf{ncvx}},\bm{Y}^{\mathsf{ncvx}})$ the estimate returned by Algorithm 1 aimed at solving (3.4). As was recently shown in [CCF+19], when the regularization parameter $\lambda$ is properly chosen, these two estimates obey (see (A.12) in Appendix A.2 for a precise statement)

2 Constructing de-biased estimators

We are now well equipped to describe how to construct new estimators based on the convex estimate $\bm{Z}^{\mathsf{cvx}}$ and the nonconvex estimate $(\bm{X}^{\mathsf{ncvx}},\bm{Y}^{\mathsf{ncvx}})$ , so as to enable statistical inference. Motivated by the proximity of the convex and nonconvex estimates and for the sake of conciseness, we shall abuse notation by using the shorthand $\bm{Z},\bm{X},\bm{Y}$ for both convex and nonconvex estimates; see Table 1 and Appendix B for precise definitions. This allows us to unify the presentation for both convex and nonconvex estimators.

Given that both (3.1) and (3.4) are regularized least-squares problems, they behave effectively like shrinkage estimators, indicating that the provided estimates necessarily suffer from non-negligible bias. In order to enable desired statistical inference, it is natural to first correct the estimation bias.

A natural de-biasing strategy that immediately comes to mind is the following simple linear transformation (recall the notation in Table 1):

To remedy this issue, we propose to further project $\bm{Z}^{0}$ onto the set of rank- $r$ matrices, leading to the following de-biased estimator

The estimator (3.7) can be viewed as performing one iteration of singular value projection (SVP) [MJD09, DC18] on the current estimate $\bm{Z}$ .

The estimator (3.7) also bears a similarity to the de-biased estimator proposed by [Xia18] for low-rank trace regression; the disparity between them shall be discussed in Section 4.

An equivalent form: a de-shrunken estimator for the low-rank factors.

It turns out that the de-biased estimator (3.7) admits another almost equivalent representation that offers further insights. Specifically, we consider the following de-shrunken estimator for the low-rank factors

where we recall the definition of $\bm{X}$ and $\bm{Y}$ in Table 1. To develop some intuition regarding why this is called a de-shrunken estimator, let us look at a simple scenario where $\bm{U}\bm{\Sigma}\bm{V}^{\top}$ is the SVD of $\bm{X}\bm{Y}^{\top}$ and $\bm{X}=\bm{U}\bm{\Sigma}^{1/2}$ , $\bm{Y}=\bm{V}\bm{\Sigma}^{1/2}$ . It is then self-evident that

In words, $\bm{X}^{\mathsf{d}}$ and $\bm{Y}^{\mathsf{d}}$ are obtained by de-shrinking the spectrum of $\bm{X}$ and $\bm{Y}$ properly.

As we shall formalize in Section 5.1, the de-shrunken estimator (3.8) for the low-rank factors is nearly equivalent to the de-biased estimator (3.7) for the whole matrix, in the sense that

3 Main results: distributional guarantees

The proposed estimators admit tractable distributional characterizations in the large- $n$ regime, which facilitates the construction of confidence regions for many quantities of interest. In particular, this paper centers around two types of inferential problems:

Each entry of the matrix $\bm{M}^{\star}$ : the entry can be either missing (i.e. predicting an unseen entry) or observed (i.e. de-noising an observed entry). For example, in the Netflix challenge, one would like to infer a user’s preference about any movie, given partially revealed ratings [CR09]. Mathematically, this seeks to determine the distribution of

Suppose that the sample size and the noise obey

A more complete version can be found in Theorem 5.

Another interesting feature — which we shall make precise in the proof of this theorem — is that: for any given $1\leq i,j\leq n$ , the two random vectors $\bm{Z}_{\bm{X}}^{\top}\bm{e}_{i}$ and $\bm{Z}_{\bm{Y}}^{\top}\bm{e}_{j}$ are nearly statistically independent. This is crucial for deriving inferential guarantees for the entries of the matrix.

Theorem 1 is a non-asymptotic result. In words, Theorem 1 decomposes the estimation error $\bm{X}^{\mathsf{d}}\bm{H}^{\mathsf{d}}-\bm{X}^{\star}$ (resp. $\bm{Y}^{\mathsf{d}}\bm{H}^{\mathsf{d}}-\bm{Y}^{\star}$ ) into a Gaussian component $\bm{Z}_{\bm{X}}$ (resp. $\bm{Z}_{\bm{Y}}$ ) and a residual term $\bm{\Psi}_{\bm{X}}$ (resp. $\bm{\Psi}_{\bm{Y}}$ ). If the sample size is sufficiently large and the noise size is sufficiently small, then the residual terms are much smaller in size compared to $\bm{Z}_{\bm{X}}$ and $\bm{Z}_{\bm{Y}}$ . To see this, it is helpful to leverage the Gaussianity (3.15a) and compute that: for each $1\leq j\leq n$ , the $j$ th row of $\bm{Z}_{\bm{X}}$ obeys

in other words, the typical size of the $j$ th row of $\bm{Z}_{\bm{X}}$ is no smaller than the order of $\sigma\sqrt{r/(p\sigma_{\max})}$ . In comparison, the size of each row of $\bm{\Psi}_{\bm{X}}$ (see (3.16)) is much smaller than $\sigma\sqrt{r/(p\sigma_{\max})}$ (and hence smaller than the size of the corresponding row of $\bm{Z}_{\bm{X}}$ ) with high probability, provided that (3.13) is satisfied.

Equipped with the above master decompositions of the low-rank factors and Remark 4, we are ready to present a similar decomposition for the entry $M_{ij}^{\mathsf{d}}-M_{ij}^{\star}$ .

For each $1\leq i,j\leq n$ , define the variance $v_{ij}^{\star}$ as

where $\bm{U}_{i,\cdot}^{\star}$ (resp. $\bm{V}_{j,\cdot}^{\star}$ ) denotes the $i$ th (resp. $j$ th) row of $\bm{U}^{\star}$ (resp. $\bm{V}^{\star}$ ). Suppose that

where $g_{ij}\sim\mathcal{N}(0,v_{ij}^{\star})$ and the residual obeys $|\Delta_{ij}|=o(\sqrt{v_{ij}^{\star}})$ with probability exceeding $1-O(n^{-3})$ .

In the symmetric case where the noise $\bm{E}$ , the truth $\bm{M}^{\star}$ , and the sampling pattern are all symmetric (i.e. $\mathcal{P}_{\Omega}(\bm{E})=\big{(}\mathcal{P}_{\Omega}(\bm{E})\big{)}^{\top}$ and $\bm{M}^{\star}=\bm{M}^{\star\top}$ ), the variance $v_{ii}^{\star}$ (cf. (3.17)) for the diagonal entries has a different formula; more specifically, it is straightforward to extend our theory to show that

This additional multiplicative factor of 2 arises since $\bm{Z}_{\bm{X}}^{\top}\bm{e}_{i}$ and $\bm{Z}_{\bm{Y}}^{\top}\bm{e}_{i}$ are identical (and hence not independent) in this symmetric case. The variance formula for any $v_{ij}^{\star}$ ( $i\neq j$ ) remains unchanged.

Several remarks are in order. To begin with, we develop some intuition regarding where the variance $v_{ij}^{\star}$ comes from. By virtue of Theorem 1, one has the following Gaussian approximation

Assuming that the first-order expansion is reasonably tight, one has

According to Remark 4, $\bm{Z}_{\bm{X}}^{\top}\bm{e}_{i}$ and $\bm{Z}_{\bm{Y}}^{\top}\bm{e}_{j}$ are nearly independent. It is thus straightforward to compute the variance of (3.20) as

Here, (i) relies on (3.20) and the near independence between $\bm{Z}_{\bm{X}}^{\top}\bm{e}_{i}$ and $\bm{Z}_{\bm{Y}}^{\top}\bm{e}_{j}$ ; (ii) uses the variance formula in Theorem 1; (iii) arises from the definitions of $\bm{X}^{\star}$ and $\bm{Y}^{\star}$ (cf. (2.2)). This computation explains (heuristically) the variance formula $v_{ij}^{\star}$ .

Given that Theorem 2 reveals the tightness of Gaussian approximation under conditions (3.18), it in turn allows us to construct nearly accurate confidence intervals for each matrix entry $M_{ij}^{\star}$ . This is formally summarized in the following corollary, the proof of which is deferred to Appendix F. Here and throughout, we use $[a\pm b]$ to denote the interval $[a-b,a+b]$ .

Denote by $\Phi(t)$ the CDF of a standard Gaussian random variable and by $\Phi^{-1}(\cdot)$ its inverse function. Let

be the empirical estimate of the theoretical variance $v_{ij}^{\star}$ . Then one has

In words, Corollary 1 tells us that for any fixed significance level $0<\alpha<1$ , the interval

is a nearly accurate two-sided $(1-\alpha)$ confidence interval of $M_{ij}^{\star}$ .

In addition, we remark that when $\|\bm{U}_{i,\cdot}^{\star}\|_{2}=\|\bm{V}_{j,\cdot}^{\star}\|_{2}=0$ (and hence $V_{ij}^{\star}=0$ ), the above Gaussian approximation is completely off. In this case, one can still leverage Theorem 1 to show that

Last but not least, the careful readers might wonder how to interpret our conditions on the sample complexity and the signal-to-noise ratio. Take the case with $r,\mu,\kappa=O(1)$ for example: our conditions read

4 Lower bounds and optimality for inference

It is natural to ask how well our inferential procedures perform compared to other algorithms. Encouragingly, the de-biased estimator is optimal in some sense; for instance, it nearly attains the minimum covariance among all unbiased estimators. To formalize this claim, we shall

Quantify the performance of two ideal estimators with the assistance of an oracle;

Demonstrate that the performance of our de-biased estimators is arbitrarily close to that of the ideal estimators.

In what follows, we denote by $\bm{X}_{i,\cdot}^{\star}$ (resp. $\bm{Y}_{i,\cdot}^{\star}$ ) the $i$ th row of $\bm{X}^{\star}$ (resp. $\bm{Y}^{\star}$ ).

Suppose that there is an oracle informing us of $\bm{Y}^{\star}$ , and that we observe the same set of data as in (2.3). Under such an idealistic setting and for any given $1\leq i\leq n$ , the following least-squares estimator achieves the minimum covariance among all unbiased estimators for the $i$ th row $\bm{X}_{i,\cdot}^{\star}$ of $\bm{X}^{\star}$ (see e.g. [Sha03, Theorem 3.7])

In other words, for any unbiased estimator $\bm{u}$ of $\bm{X}_{i,\cdot}^{\star}$ (conditional on $\Omega$ ), one has

where $\mathsf{Cov}\big{(}\bm{X}_{i,\cdot}^{\mathsf{ideal}}\,\big{|}\,\Omega\big{)}$ is precisely the Cramér-Rao lower bound (conditional on $\Omega$ ) under this ideal setting. As it turns out, with high probability, this lower bound concentrates around $\sigma^{2}(\bm{\Sigma}^{\star})^{-1}/p$ , as stated in the following lemma. The proof is postponed to Appendix H.1.

Fix an arbitrarily small constant $\varepsilon>0$ . Suppose that $n^{2}p\geq C_{0}\varepsilon^{-2}\kappa^{4}\mu rn$ for some sufficiently large constant $C_{0}>0$ independent of $n$ . Then with probability at least $1-O(n^{-10})$ , one has

Given that $\varepsilon$ can be an arbitrarily small constant, Lemma 1 uncovers that the covariance of the de-shrunken estimator $\bm{X}_{i,\cdot}^{\mathsf{d}}$ (cf. Theorem 1) matches that of the ideal estimator $\bm{X}_{i,\cdot}^{\mathsf{ideal}}$ , thus achieving the Cramér-Rao lower bound with high probability. The same conclusion applies to $\bm{Y}_{j,\cdot}^{\mathsf{d}}$ as well.

Suppose that there is another oracle informing us of $\{\bm{X}_{k,\cdot}^{\star}\}_{k:k\neq i}$ and $\{\bm{Y}_{k,\cdot}^{\star}\}_{k:k\neq j}$ ; that is, everything about $\bm{X}^{\star}$ except $\bm{X}_{i,\cdot}^{\star}$ and everything about $\bm{Y}^{\star}$ except $\bm{Y}_{j,\cdot}^{\star}$ . In addition, we observe the same set of data as in (2.3), except that we do not get to see $M_{ij}$ .The exclusion of $M_{ij}$ is merely for ease of presentation. One can consider the model where all $M_{ij}$ with $(i,j)\in\Omega$ are observed with a slightly more complicated argument. Under this idealistic model, the Cramér-Rao lower bound [Sha03, Theorem 3.3] for estimating $M_{ij}^{\star}=\bm{X}_{i,\cdot}^{\star}(\bm{Y}_{j,\cdot}^{\star})^{\top}$ can be computed as

This means that any unbiased estimator of $M_{ij}^{\star}$ must have variance no smaller than $\mathsf{CRLB}(M_{ij}^{\star}\mid\Omega)$ . This quantity admits a much simpler lower bound as follows, whose proof can be found in Appendix H.2.

Fix an arbitrarily small constant $\varepsilon>0$ . Suppose that $n^{2}p\geq C_{0}\varepsilon^{-2}\kappa^{4}\mu rn\log n$ for some sufficiently large constant $C_{0}>0$ independent of $n$ . Then with probability at least $1-O(n^{-10})$ ,

where $v_{ij}^{\star}$ is defined in Theorem 2.

Similar to Lemma 1, Lemma 2 reveals that the variance of our de-biased estimator $M_{ij}^{\mathsf{d}}$ (cf. Theorem 2) — which certainly does not have access to the side information provided by the oracle — is arbitrarily close to the Cramér-Rao lower bound aided by an oracle.

All in all, the above lower bounds demonstrate that the degrees of uncertainty underlying our de-shrunken low-rank factors and de-biased matrix are, in some sense, statistically minimal.

5 Back to estimation: the de-biased estimator is optimal

While the emphasis of the current paper is on inference, we would nevertheless like to single out an important consequence that informs the estimation step. To be specific, the decompositions and distributional guarantees derived in Theorem 1 and Theorem 2 allow us to track the estimation accuracy of $\bm{M}^{\mathsf{d}}$ , as stated in the following theorem. The proof of this result is postponed to Appendix G.

In stark contrast to prior statistical estimation guarantees (e.g. [CP10, NW12, KLT11, CCF+19]), Theorem 3 pins down the estimation error of the proposed de-biased estimator in a sharp manner (namely, even the pre-constant is fully determined). Encouragingly, there is a sense in which the proposed de-biased estimator achieves the best possible statistical estimation accuracy, as revealed by the following result.

Fix an arbitrarily small constant $\varepsilon>0$ . Suppose that $n^{2}p\gtrsim\mu rn\log^{2}n$ , and that $r=o(n)$ . Then with probability exceeding $1-O(n^{-10})$ , any unbiased estimator $\widehat{\bm{M}}$ of $\bm{M}^{\star}$ obeys

Intuitively, the term $2nr$ reflects approximately the underlying degrees of freedom in the true subspace $T^{\star}$ of interest (i.e. the tangent space of the rank- $r$ matrices at the truth $\bm{M}^{\star}$ ), whereas the factor $1/p$ captures the effect due to sub-sampling. This result has already been established in [CP10, Section III.B] (together with [CR09, Theorem 4.1]). We thus omit the proof for conciseness. The key idea is to consider an oracle informing us of the true tangent space $T^{\star}$ . ∎

The implication of the above two theorems is remarkable: the de-biasing step not merely facilitates uncertainty assessment, but also proves crucial in minimizing the estimation errors. It achieves optimal statistical efficiency in terms of both the rate and the pre-constant. As far as we know, this is the first theory about a polynomial time algorithm that matches the statistical limit in terms of the pre-constant. This intriguing finding is further corroborated by numerical experiments; see Section 3.6 for details (in particular, Figure 3).

6 Numerical experiments

We conduct numerical experiments on synthetic data to verify the distributional characterizations provided in Theorem 1 and Theorem 2. Note that our main results hold for the de-biased estimators built upon $\bm{Z}^{\mathsf{cvx}}$ and $\bm{X}^{\mathsf{ncvx}}\bm{Y}^{\mathsf{ncvx}\top}$ . As we will formalize shortly in Section 5.1, these two de-biased estimators are extremely close to each other; see also Figure 4 for experimental evidence. Therefore, in order to save space, we use the de-biased estimator built upon the convex estimate $\bm{Z}^{\mathsf{cvx}}$ throughout the experiments.

We begin by checking the validity of Theorem 1. Suppose that one is interested in estimating the inner product $\bm{e}_{i}^{\top}\bm{X}^{\star}\bm{X}^{\star\top}\bm{e}_{j}$ between $\bm{X}^{\star\top}\bm{e}_{i}$ and $\bm{X}^{\star\top}\bm{e}_{j}$ ( $i\neq j$ ). In the Netflix challenge, this might correspond to the similarity between the $i$ th user and the $j$ th one. As a straightforward consequence of Theorem 1, the normalized estimation error

is extremely close to a standard Gaussian random variable. Here, similar to (3.22), we let

be the empirical estimate of the theoretically predicted variance $\sigma^{2}(\|\bm{U}_{i,\cdot}^{\star}\|_{2}^{2}+\|\bm{U}_{j,\cdot}^{\star}\|_{2}^{2})/p$ . As a result, a $95\%$ confidence interval of $\bm{e}_{i}^{\top}\bm{X}^{\star}\bm{X}^{\star\top}\bm{e}_{j}$ would be $[\bm{e}_{i}^{\top}\bm{X}^{\mathsf{d}}\bm{X}^{\mathsf{d}\top}\bm{e}_{j}\pm 1.96\sqrt{\rho_{ij}}]$ . For each $(i,j)$ , we define $\widehat{\mathsf{Cov}}_{\mathsf{L},(i,j)}$ to be the empirical coverage rate of $\bm{e}_{i}^{\top}\bm{X}^{\star}\bm{X}^{\star\top}\bm{e}_{j}$ over $200$ Monte Carlo simulations. Correspondingly, denote by $\mathsf{Mean}(\widehat{\mathsf{Cov}}_{\mathsf{L}})$ (resp. $\mathsf{Std}(\widehat{\mathsf{Cov}}_{\mathsf{L}})$ ) the average (resp. the standard deviation) of $\widehat{\mathsf{Cov}}_{\mathsf{L},(i,j)}$ over indices $1\leq i<j\leq n$ . Table 2 collects the simulation results for different values of $(r,p,\sigma)$ . As can be seen, the reported empirical coverage rates are reasonably close to the nominal level $95\%$ . In addition, Figure 1 depicts the Q-Q (quantile-quantile) plots of $T_{12},T_{13}$ and $T_{14}$ vs. the standard Gaussian random variable over 200 Monte Carlo simulations for $r=5$ , $p=0.4$ and $\sigma=10^{-3}$ . It is clearly seen that all of these are well approximated by a standard Gaussian random variable.

Next, we turn to Theorem 2, namely the distributional guarantee for the entries of the matrix. Denote

where $v_{ij}$ is the empirical variance defined in (3.22). In view of the $95\%$ confidence interval predicted by Corollary 1, and similar to what have done for the low-rank components, for each $(i,j)$ , we define $\widehat{\mathsf{Cov}}_{\mathsf{E},(i,j)}$ to be the empirical coverage rate of $M^{\star}_{ij}$ over $200$ Monte Carlo simulations. Correspondingly, denote by $\mathsf{Mean}(\widehat{\mathsf{Cov}}_{\mathsf{E}})$ (resp. $\mathsf{Std}(\widehat{\mathsf{Cov}}_{\mathsf{E}})$ ) the average (resp. the standard deviation) of $\widehat{\mathsf{Cov}}_{\mathsf{E},(i,j)}$ over indices $1\leq i,j\leq n$ . As before, Table 3 gathers the empirical coverage rates for $M^{\star}_{ij}$ and Figure 2 displays the Q-Q (quantile-quantile) plots of $S_{11},S_{12}$ and $S_{13}$ vs. the standard Gaussian random variable over 200 Monte Carlo trials for $r=5$ , $p=0.4$ and $\sigma=10^{-3}$ . It is evident that the distribution of $S_{ij}$ matches that of $\mathcal{N}(0,1)$ reasonably well.

We conclude this section with experiments on real data. Similar to [CP10], we use the daily temperature data [NCD19] for 1400 stations across the world in 2018, which results in a $1400\times 365$ data matrix. Inspection on the singular values reveals that the data matrix is nearly low-rank. We vary the observation probability $p$ from $0.5$ to $0.9$ and randomly subsample the data accordingly. Based on the observed temperatures, we then apply the proposed methodology to obtain $95\%$ confidence intervals for all the entries. Table 4 reports the empirical coverage probabilities, the average length of the confidence intervals as well as the estimation error of both $\bm{Z}^{\mathsf{cvx}}$ and $\bm{M}^{\mathsf{d}}$ over 20 independent experiments. It can be seen that the average coverage probabilities are reasonably close to $95\%$ and the confidence intervals are also quite short. In addition, the estimation error of $\bm{M}^{\mathsf{d}}$ is smaller than that of $\bm{Z}^{\mathsf{cvx}}$ , which corroborates our theoretical prediction. The discrepancy between the nominal coverage probability and the actual one might arise from the facts that (1) the underlying true temperature matrix is only approximately low-rank, and (2) the noise in the temperature might not be independent.

7 A bit of intuition

We pause to develop some intuition behind the distributional guarantees for the proposed estimators. Bearing in mind the intimate link between convex and nonconvex optimization (cf. (3.5)), it suffices to concentrate on the nonconvex problem (3.4). For the sake of clarity, we further restrict attention to the rank-1 positive semidefinite case where $\bm{M}^{\star}=\bm{x}^{\star}\bm{x}^{\star\top}$ and set $\lambda=0$ , where one can focus on

Any optimizer $\widehat{\bm{x}}$ of (3.34) would necessarily satisfy the first-order optimality condition

We shall also assume that $\widehat{\bm{x}}$ is a reasonably reliable estimate obeying $\widehat{\bm{x}}\approx\bm{x}^{\star}$ .

We begin with the no-missing-data case (i.e. $p=1$ ), which already conveys the key insight. The condition (3.35) simplifies to

which, through a little manipulation, leads to an equivalent decomposition:

Then: (1) the third term of (3.37), which can be viewed as a second-order term (in the sense that it is a quadratic term of $\widehat{\bm{x}}-\bm{x}^{\star}$ ), becomes vanishingly small when $\widehat{\bm{x}}\approx\bm{x}^{\star}$ ; (2) while the second term of (3.37) looks like a first-order term, it is natural to conjecture that $\widehat{\bm{x}}-\bm{x}^{\star}$ is sufficiently random and hence $(\widehat{\bm{x}}-\bm{x}^{\star})^{\top}\bm{x}^{\star}\ll\|\widehat{\bm{x}}-\bm{x}^{\star}\|_{2}\|\bm{x}^{\star}\|_{2}$ (i.e. the estimation error is not aligned with $\bm{x}^{\star}$ ), meaning that this term is also expected to be negligible compared to a typical first-order term (e.g. the term on the left-hand side of 3.37). In summary, these non-rigorous arguments suggest that

If one can be convinced that $\bm{E}$ and $\widehat{\bm{x}}$ are only weakly dependent, then this means

Returning to the missing data scenario with $p<1$ , everything is based on the following approximation

this is certainly expected — using standard concentration arguments — if we “pretend” that $\mathcal{P}_{\Omega}$ and $\widehat{\bm{x}}$ are statistically independent. With this approximation in mind, one can translate (3.35) into

Repeating the above argument then immediately yields

The case with $\lambda>0$ can be intuitively understood in a very similar way by first de-shrinking the estimate; we omit it here for brevity. We note that these hand-waving arguments can all be made rigorous, which is the main content of the proof.

8 Inference based on spectral estimates?

One would naturally be curious about whether there are other estimation procedures that also enable reasonable statistical inference. While this is beyond the scope of the current paper, we take a moment to discuss one alternative: the spectral method, as pioneered by [KMO10a, KMO10b] in the matrix completion problem. In a nutshell, this approach consists in computing a rank- $r$ approximation to $\mathcal{P}_{\Omega}(\bm{M})/p$ , which is precisely the spectral initialization widely used in a two-stage nonconvex algorithm (cf. Algorithm 1) [KMO10b, SL16, CW15, CCF18, MWCC17]. While inference has not been, as far as we know, the focus of prior work on spectral methods,We note that inference from spectral estimates has been investigated in other context beyond matrix completion (e.g. the model without missing data [Xia19, FFHL19]). the recent papers [AFWZ17, MWCC17] hinted at the possibility of characterizing the distribution of the spectral estimate. Take a simple symmetric rank-1 case for example (i.e. $\bm{M}^{\star}=\bm{x}^{\star}\bm{x}^{\star\top}$ with $\|\bm{x}^{\star}\|_{2}=1$ ): the leading eigenvector $\bm{u}^{\mathsf{spectral}}$ of $\mathcal{P}_{\Omega}(\bm{M})/p$ often admits the following approximation (up to a global sign)

Expanding $\mathcal{P}_{\Omega}(\bm{M})=p\bm{x}^{\star}\bm{x}^{\star\top}+\mathcal{P}_{\Omega}(\bm{x}^{\star}\bm{x}^{\star\top})-p\bm{x}^{\star}\bm{x}^{\star\top}+\mathcal{P}_{\Omega}(\bm{E})$ , we arrive at

In words, two major factors dictate the uncertainty of the spectral estimate: (1) the additive Gaussian noise (cf. the 1st term on the right-hand side of (3.42)), and (2) random sub-sampling (in particular, the randomness incurred by employing the sub-sampled $\mathcal{P}_{\Omega}(\bm{x}^{\star}\bm{x}^{\star\top})/p$ to approximate the truth $\bm{x}^{\star}\bm{x}^{\star\top}$ ). Given that the random sub-sampling effect cannot be ignored at all, the spectral estimates often suffer from a much larger estimation error (and hence a higher degree of uncertainty) compared to either the convex or the nonconvex estimates. In truth, this random sub-sampling effect does not go away even when the noise vanishes. Consequently, uncertainty quantification based on the spectral estimates may not be the most desirable option.

Prior art

Low-rank matrix completion, or more broadly, low-rank matrix recovery, is a fundamental task that permeates through a wide spectrum of applications in science, engineering, and finance (e.g. [RS05, SY07, CC14, FSZZ18, CCG15, ZPL15, BN06, CC18b, FWZ19, KS11, CZ16, KX15, FLM13, DR17, CDDD19, DPVW14, SZ12, FS11]). A paper of this length is unable to review all papers motivating and contributing to this enormous subject; interested readers are referred to [DR16, CC18a] for extensive discussions of motivating applications as well as the exciting recent development.

Numerous algorithms have been proposed to solve this problem efficiently, with two paradigms being arguably the most widely used: convex relaxation and nonconvex optimization. We briefly review the literature contributing to these two paradigms.

Nonconvex optimization algorithms, as pioneered by [KMO10a, Sre04], become increasingly more popular for solving various low-rank factorization problems, due to their appealing computational complexities [JNS13, CLS15, CC17, TBS+16, SL16, ZL16, CCFM19, WZG16, CLL19]. For instance, the gradient-based nonconvex methods have been analyzed for noisy matrix completion [KMO10b, CW15, MWCC17, CCF+19], which are shown to achieve near-optimal statistical accuracy and linear-time convergence guarantees all at once. Going beyond gradient methods, we note that other nonconvex methods (e.g. [RS05, JMD10, WYZ12, JNS13, FRW11, Van13, LXY13, Har14, JKN16, RT11, WCCL16, DC18, ZWL15, ZWYG18, MSL19, CCD+19]) and landscape properties [GLM16, CL17, GJZ17, ZSL19, ZJSL18, SXZ19] have been largely explored as well. The interested readers are referred to [CLC19] for an in-depth discussion. One limitation, however, is that the theoretical guarantees provided for nonconvex algorithms often exhibit sub-optimal dependency in the rank $r$ of the unknown matrix; for instance, most theory requires a sample complexity of at least $nr^{2}$ (in fact, often much larger than $nr^{2}$ ). This is outperformed by the convex relaxation approach.

Despite these recent developments, very little work has investigated statistical inference for noisy matrix completion. While [CKLN18, CKL16, CN15, CEGN15] discussed the construction of “honest” confidence regions, the volume of these regions is dependent on some (possibly huge) hidden constants, thus resulting in over-coverage. Perhaps the closest to our paper is the recent work [Xia18], which investigated inference for low-rank trace regression. Employing a closely related de-biased estimator with sample splitting, the paper [Xia18] established asymptotic normality of a certain projected distance between the estimate and the truth. The result therein, however, requires a sampling mechanism obeying the restricted isometry property (e.g. i.i.d. Gaussian designs), which fails to hold for matrix completion. Also, our approach does not require sample splitting — a technique that is convenient for analysis but conservative in constructing confidence regions. Another work by Cai et al. [CLR16] developed a unified approach to provide inference guarantees for linear inverse problems including low-rank matrix estimation. Their results, however, require the sample size to exceed the total dimension $n^{2}$ even under the Gaussian design. Finally, a recent line of work [MX17] explored uncertainty quantification under the Bayesian setting, hypothesizing on a special prior regarding the true matrix. This departs drastically from the scenario considered herein.

Inference in high-dimensional problems.

Inference in high-dimensional sparse regression has received much attention in the last few years [WR09, ZZ14, BCH11, vdGBRD14, JM14b, DBMM15, CG17, NL17, NNLL18, LSST16, LTTT14, MMB09, DBZ17, ZC17, BFL+18]. Our inferential approach is partly inspired by the recent developments on this topic, particularly with regard to the de-biased / de-sparsified estimators proposed for Lasso. More specifically, recognizing the non-negligible bias of the Lasso estimate

A line of work [ZZ14, vdGBRD14, JM14a] came up with a linear transformation of $\widehat{\bm{\beta}}$ of the form

Interestingly, our de-biased estimator (3.7) for matrix completion admits a very similar form as (4.2). To see this, recall that our de-biased estimator is given by

where $\bm{Z}$ can be either $\bm{Z}^{\mathsf{cvx}}$ or $\bm{X}^{\mathsf{ncvx}}\bm{Y}^{\mathsf{ncvx}\top}$ (see Table 1). Let $T$ be the tangent space of the set of rank- $r$ matrices at $\bm{Z}^{\mathsf{cvx},r}$ (resp. $\bm{X}^{\mathsf{ncvx}}\bm{Y}^{\mathsf{ncvx}\top}$ ) in the convex (resp. nonconvex) case, and $\mathcal{P}_{T}$ be the projection operator onto $T$ . Somewhat surprisingly, replacing $\mathcal{P}_{\text{rank-}r}$ by $\mathcal{P}_{T}$ does not affect the de-biased estimator by much, in the sense that

In addition, recognizing that $\bm{Z}$ almost lies within the tangent space $T$ ,More precisely, if $\bm{Z}=\bm{X}^{\mathsf{ncvx}}\bm{Y}^{\mathsf{ncvx}\top}$ , then $\bm{Z}\in T$ ; if $\bm{Z}=\bm{Z}^{\mathsf{cvx}}$ , one has $\mathcal{P}_{T}(\bm{Z})\approx\bm{Z}$ .one can rewrite

Finally, de-biased estimators have been put forward to tackle other high-dimensional problems, including but not limited to generalized linear models [vdGBRD14, NL17], graphical models [JVDG15, RSZZ15, MLL17, JvdG17], sparse PCA [JvdG18], treatment effects estimation [CCD+18, AIW18]. These are beyond the scope of the current paper.

Architecture of the proof

This section outlines the main steps for establishing Theorem 1 and Theorem 2. Before starting, we introduce some useful notation. For convenience of presentation, we insert the factor $1/p$ into (3.4) and redefine the nonconvex loss function as

In addition, for each $1\leq j,k\leq n$ , we define the indicator $\delta_{jk}\triangleq\operatorname{\mathds{1}}\{(j,k)\in\Omega\}$ , which is a Bernoulli random variable with mean $p$ .

We also note that Theorem 1 (resp. Theorem 2) is subsumed by Theorem 5 (resp. Theorem 6). As a result, we shall focus on establishing Theorem 5 (resp. Theorem 6) when it comes to estimating low-rank factors (resp. the entries of the matrix).

Instate the assumptions of Theorem 5. Recall the definition of $v_{ij}^{\star}$ in (3.17). Then one has the following decomposition

where $g_{ij}\sim\mathcal{N}(0,v_{ij}^{\star})$ and the residual obeys — with probability exceeding $1-O(n^{-10})$ — that

Note that Theorem 5 and Theorem 6 are concerned with the de-biased estimators built upon both convex and nonconvex estimates. At first glance, one needs to establish theoretical guarantees for each of them separately. Fortunately, as alluded to previously (cf. (3.5)), the convex and nonconvex estimates are extremely close — a fact that has been established in [CCF+19]. The proximity of these two estimates naturally extends to the de-biased estimators constructed based on them. As a result, it suffices to concentrate on proving the theorems for any of these estimators; the claims for the other one follow immediately.

The following key lemma formalizes this argument, which will be established in Appendix C (see also Figure 4 for numerical evidence). Before continuing, we remind the readers of the key notation (see Appendix B for precise definitions):

$(\bm{X}^{\mathsf{ncvx}},\bm{Y}^{\mathsf{ncvx}})$ : an approximate solution to the nonconvex problem (3.4) (see Appendix A.1);

$\bm{M}^{\mathsf{cvx,d}},\bm{X}^{\mathsf{cvx,d}},\bm{Y}^{\mathsf{cvx,d}}$ : the de-biased estimators built upon the convex optimizer $\bm{Z}^{\mathsf{cvx}}$ ;

$\bm{M}^{\mathsf{ncvx,d}},\bm{X}^{\mathsf{ncvx,d}},\bm{Y}^{\mathsf{ncvx,d}}$ : the de-biased estimators built upon the nonconvex estimate $(\bm{X}^{\mathsf{ncvx}},\bm{Y}^{\mathsf{ncvx}})$ .

Suppose that the sample size obeys $n^{2}p\geq C\kappa^{4}\mu^{2}r^{2}n\log^{3}n$ for some sufficiently large constant $C>0$ and the noise satisfies $\sigma\sqrt{(\kappa^{4}\mu nr\log n)/p}\leq c\sigma_{\min}$ for some sufficiently small constant $c>0$ . Set $\lambda=C_{\lambda}\sigma\sqrt{np}$ with some large enough constant $C_{\lambda}>0$ .

With probability at least $1-O(n^{-10})$ , one has

where $\mathcal{O}^{r\times r}$ is the set of $r\times r$ rotation matrices.

With probability exceeding $1-O(n^{-10})$ , one has

In short, the first part of Lemma 3 tells us that

whereas the second part of Lemma 3 justifies that the proposed de-biased estimator is closely approximated by a linearized version (cf. (4.4)). Note that this linearized form bears a resemblance to the de-biased estimators developed for sparse linear regression [ZZ14, vdGBRD14, JM14a].

With Lemma 3 in place, we shall, from now on, focus on proving the main theorems for the nonconvex estimators, viz.

establishing Theorem 5 for the de-shrunken low-rank factors $(\bm{X}^{\mathsf{ncvx,d}},\bm{Y}^{\mathsf{ncvx,d}})$ ;

establishing Theorem 6 for the de-biased matrix estimator $\bm{X}^{\mathsf{ncvx,d}}\bm{Y}^{\mathsf{ncvx,d}\top}$ .

To simplify the presentation hereafter, we shall use the following notation throughout the rest of this section:

$(\bm{X},\bm{Y})$ : the nonconvex estimate $(\bm{X}^{\mathsf{ncvx}},\bm{Y}^{\mathsf{ncvx}})$ ;

$(\bm{X}^{\mathsf{d}},\bm{Y}^{\mathsf{d}})$ : the de-shrunken estimate defined in (3.8) based on $(\bm{X},\bm{Y})=(\bm{X}^{\mathsf{ncvx}},\bm{Y}^{\mathsf{ncvx}})$ ;

$\bm{M}^{\mathsf{d}}\triangleq\bm{X}^{\mathsf{d}}\bm{Y}^{\mathsf{d}\top}$ .

2 A precise characterization of the de-shrunken low-rank factors

We start with a precise characterization of the de-shrunken low-rank factors $\bm{X}^{\mathsf{d}}$ and $\bm{Y}^{\mathsf{d}}$ , which paves the way for demonstrating both Theorem 5 and Theorem 6.

One has the following decompositions for $\bm{X}^{\mathsf{d}}$ and $\bm{Y}^{\mathsf{d}}$

which measures the imbalance between the low-rank factors $\bm{X}$ and $\bm{Y}$ .

The claims follow from straightforward algebraic manipulations; see Appendix D.1. ∎

We make a few observations regarding Lemma 4. Take the decomposition of $\bm{X}^{\mathsf{d}}$ (5.8a) as an example:

First, the term $\bm{A}\bm{Y}^{\mathsf{d}}(\bm{Y}^{\mathsf{d}\top}\bm{Y}^{\mathsf{d}})^{-1}$ vanishes when we have full observations, i.e. $p=1$ . Second, the terms involving $\nabla_{\bm{X}}f(\bm{X},\bm{Y})$ and $\bm{\Delta}_{\mathsf{balancing}}$ are both zero if $(\bm{X},\bm{Y})$ is an exact stationary point of $f(\cdot,\cdot)$ ; to see this, it is not hard to verify that any stationary point of $f(\cdot,\cdot)$ necessarily satisfies $\bm{X}^{\top}\bm{X}=\bm{Y}^{\top}\bm{Y}$ , which in turn implies $\bm{\Delta}_{\mathsf{balancing}}=\bm{0}$ . Consequently, the last three terms in (5.8a) are expected to be small when $p$ is sufficiently large and $(\bm{X},\bm{Y})$ is near a stationary point.

We shall make precise these arguments in subsequent subsections.

3 Taking global rotation into account

In order to invoke the decompositions of $\bm{X}^{\mathsf{d}}$ and $\bm{Y}^{\mathsf{d}}$ (cf. Lemma 4) to characterize the estimation errors, we still need to incorporate the (unrecoverable) rotation matrix. From now on, we shall focus primarily on the factor $\bm{X}^{\mathsf{d}}$ . The claims on the other factor $\bm{Y}^{\mathsf{d}}$ can be easily obtained via symmetry.

where we recall that $\bm{H}^{\mathsf{d}}$ is the rotation matrix that best aligns $(\bm{X}^{\mathsf{d}},\bm{Y}^{\mathsf{d}})$ and $(\bm{X}^{\star},\bm{Y}^{\star})$ (see (3.12)). Substituting the identity

into the decomposition (5.8a), we arrive at

4 Key lemmas for establishing Theorem 5

We now state five key lemmas. Taking these collectively and substituting them into (5.11) immediately establish Theorem 5.

We shall start by controlling the term $\bm{\Phi}_{1}$ as defined in (5.12).

Suppose that the sample complexity obeys $n^{2}p\geq C\kappa^{4}\mu^{2}r^{2}n\log^{3}n$ for some sufficiently large constant $C>0$ and the noise satisfies $\sigma\sqrt{(\kappa^{4}\mu rn\log n)/p}\leq c\sigma_{\min}$ for some sufficiently small constant $c<0$ . Then with probability at least $1-O(n^{-10})$ , we have

Fix any $1\leq j\leq n$ . If the de-shrunken estimate $\overline{\bm{Y}}^{\mathsf{d}}$ were independent of the randomness in the $j$ th row of the matrix, i.e. $\bm{e}_{j}^{\top}\mathcal{P}_{\Omega}(\bm{E})$ , then $\|\bm{e}_{j}^{\top}\bm{\Phi}_{1}\|_{2}$ would be well controlled. This hypothesis is certainly false, as $\overline{\bm{Y}}^{\mathsf{d}}$ clearly depends on $\bm{e}_{j}^{\top}\mathcal{P}_{\Omega}(\bm{E})$ . Nevertheless, by exploiting the leave-one-out technique recently used in [EKBB+13, EK15, AFWZ17, MWCC17, CFMW19, CCF+19, CLL19, DC18], one can properly decouple the dependency and establish the desired bound. See Appendix D.2.∎

The next lemma controls the size of $\|\bm{\Phi}_{2}\|_{2,\infty}$ . In essence, the term $\bm{\Phi}_{2}$ measures the difference between the estimate $\overline{\bm{Y}}^{\mathsf{d}}$ and the true signal $\bm{Y}^{\star}$ ; the closer these two are, the smaller $\|\bm{\Phi}_{2}\|_{2,\infty}$ should be. See Appendix D.3 for the proof of the following result.

Moving on to $\bm{\Phi}_{3}$ and $\bm{\Phi}_{4}$ , one has the following lemmas.

It is straightforward to check that when $p=1$ , one has $\|\bm{\Phi}_{3}\|_{2,\infty}=\|\bm{A}\|=0$ , where we recall the definition of $\bm{A}$ in (5.7). Therefore, one expects $\|\bm{\Phi}_{3}\|_{2,\infty}$ to be small when $p$ is sufficiently large. See Appendix D.4.∎

It is easily seen that the size of $\bm{\Phi}_{4}$ depends on how close $(\bm{X},\bm{Y})$ is to a stationary point of $f(\cdot,\cdot)$ . For instance, in the extreme case where $(\bm{X},\bm{Y})$ is an exact stationary point, then one would have $\bm{\Phi}_{4}=\bm{0}$ . See Appendix D.5.∎

The last lemma asserts that $\mathcal{P}_{\Omega}(\bm{E})\bm{Y}^{\star}(\bm{Y}^{\star\top}\bm{Y}^{\star})^{-1}/p$ is, in some sense, close to a zero-mean Gaussian random matrix with the desired covariance.

Suppose that the sample size obeys $n^{2}p\geq C\kappa^{2}\mu rn\log^{3}n$ for some sufficiently large constant $C>0$ . Then one has the decomposition

Fix any $1\leq j\leq n$ . The $j$ th row, namely $\bm{e}_{j}^{\top}[\mathcal{P}_{\Omega}(\bm{E})\bm{Y}^{\star}(\bm{Y}^{\star\top}\bm{Y}^{\star})^{-1}/p]$ is conditionally Gaussian in the sense that

where we recall that $\delta_{jk}=\operatorname{\mathds{1}}\{(j,k)\in\Omega\}$ . Recognize that the conditional covariance matrix concentrates sharply around its expectation, i.e. $\sigma^{2}(\bm{\Sigma}^{\star})^{-1}/p$ , which is the covariance matrix of $\bm{Z}_{\bm{X}}^{\top}\bm{e}_{j}$ that we are after. Hence, one can expect that $\mathcal{P}_{\Omega}(\bm{E})\bm{Y}^{\star}(\bm{Y}^{\star\top}\bm{Y}^{\star})^{-1}/p$ is, marginally, not too far from a Gaussian random matrix. This argument can be carried out formally; see Appendix D.6. ∎

5 From low-rank factors to matrix entries (Proof of Theorem 6)

We now turn attention to inference on the matrix entries, by establishing Theorem 6. Towards this, we first make the following observation: for any $1\leq i,j\leq n$ ,

One can readily apply the decompositions in Theorem 5 to obtain

Take the preceding three identities collectively to reach

Following the same route as in Section 5.4, one can verify that $\Theta_{ij}=\bm{e}_{i}^{\top}\bm{Z}_{\bm{X}}\bm{Y}^{\star\top}\bm{e}_{j}+\bm{e}_{i}^{\top}\bm{X}^{\star}\bm{Z}_{\bm{Y}}^{\top}\bm{e}_{j}$ is approximately Gaussian, whereas the residual term $\Lambda_{ij}$ is small in magnitude. These claims are formally stated in the next two lemmas, with the proofs deferred to Appendix E.

Suppose that $np\geq C\kappa^{2}\mu r^{2}\log^{2}n$ for some sufficiently large constant $C>0$ . Then we have the decomposition

where $g_{ij}\sim\mathcal{N}(0,v_{ij}^{\star})$ and the remaining term $\theta_{ij}$ satisfies — with probability exceeding $1-O(n^{-10})$ — that

Putting the above two lemmas together immediately establishes Theorem 6 and hence Theorem 2.

Discussion

The present paper makes progress towards inference and uncertainty quantification for noisy matrix completion, by developing simple de-biased estimators that admit tractable and accurate distributional characterizations. While we have achieved some early success in accomplishing this, our results are likely sub-optimal in the following aspects:

Dependency on the rank and the condition number. To enable valid inference, our sample complexity (cf. (3.13) and (3.18a)) scales sub-optimally with the rank $r$ and the condition number $\kappa$ . The sub-optimality can be understood through comparisons with the sample size requirement $O(nr\log^{2}n)$ in the noise-free settings, which is independent of $\kappa$ and matches the information limit (up to some $\log$ factor). Improving such dependency calls for more refined analysis techniques.

Detection of the size of the entries. On one hand, when the size of the entry $M_{ij}^{\star}$ is moderately large (cf. (3.18b)), Corollary 1 allows us to construct a valid confidence interval for it. On the other hand, when $\|\bm{U}_{i,\cdot}^{\star}\|_{2}+\|\bm{V}_{j,\cdot}^{\star}\|_{2}$ vanishes, Theorem 5 tells us that the estimation error $M_{ij}^{\mathsf{d}}-M_{ij}^{\star}$ is better approximated by the inner product of two independent Gaussian random vectors. It remains to be seen how to determine whether $\|\bm{U}_{i,\cdot}^{\star}\|_{2}+\|\bm{V}_{j,\cdot}^{\star}\|_{2}$ is too small.

Low signal-to-noise (SNR) regime. Our theory operates under the moderate-to-high SNR regime, where $\sigma^{2}_{\min}/\sigma^{2}$ (which is proportional to the SNR) is required to exceed the order of $n/p$ ; see the conditions in Theorem 5. It is unclear whether the connection between the convex and the nonconvex estimators hold in the low SNR regime. How to conduct inference in such a scenario is an important future direction.

In addition, our investigation has been dedicated to a natural random model, which by no means covers the most general settings of practical interest. There are numerous possible extensions that merit future investigation:

Approximate low-rank structure. Our current theory is built upon the exact low-rank structure of $\bm{{M}}^{\star}$ . Realistically, the matrix of interest is often only approximately low-rank. It is of great interest to study how to carry out statistical inference under such imperfect structural assumptions.

More general sampling patterns. This paper operates under the uniform random sampling assumption, which might sometimes be off in practical situations. It would be interesting to investigate whether our results in this paper can extend to more general non-uniform sampling patterns (e.g. [NW12]).

Extensions to robust PCA, sparse PCA, and 1-bit matrix completion. A variety of important extensions of matrix completion have been explored in prior literature, including but not limited to the case with sparse outliers (i.e. robust PCA [CLMW11, CSPW11]), the case where the matrix of interest is simultaneously sparse and low-rank (i.e. sparse PCA [ZHT06, CMW13]), and the case where only finite-bit observations are available (i.e. 1-bit matrix completion [DPVW14, CZ13]). Performing valid uncertainty assessment for these scenarios requires non-trivial extensions of the link between convex and nonconvex optimization.

Other loss functions. In the estimation stage, one might sometimes prefer other loss functions beyond the penalized squared loss. This might arise from either statistical considerations (e.g. employing a penalized Poisson log-likelihood to accommodate Poisson noise [CX16]), or computational concerns (e.g. adopting a non-smooth loss to improve convergence [CCD+19]). It would be of fundamental importance to develop a unified inferential framework that covers a broader family of loss functions.

Acknowledgements

Y. Chen is supported in part by the AFOSR YIP award FA9550-19-1-0030, by the ONR grant N00014-19-1-2120, by the ARO grant W911NF-18-1-0303, by the NSF grants CCF-1907661 and IIS-1900140, and by the Princeton SEAS innovation award. J. Fan is supported in part by NSF grants DMS-1662139 and DMS-1712591, ONR grant N00014-19-1-2120, and NIH grant 2R01-GM072611-13. C. Ma is supported in part by Hudson River Trading AI Labs (HAIL) Fellowship. This work was done in part while Y. Chen was visiting the Kavli Institute for Theoretical Physics (supported in part by the NSF grant PHY-1748958). We thank Weijie Su for helpful discussions.

Appendix A Preliminaries

In this section, we gather several notation and preliminary facts that are useful throughout the analysis. All the proofs, if needed, are deferred to Appendix I.

To begin with, we make precise the algorithm used to minimize the nonconvex loss function (5.1). Specifically, we describe the following details that are crucial for us to implement Algorithm 1:

Set the initial point to be $(\bm{X}^{0},\bm{Y}^{0})=(\bm{X}^{\star},\bm{Y}^{\star})$ or the spectral initialization as in [MWCC17, CLL19];

Set the stepsize $\eta\asymp 1/(n^{6}\kappa^{3}\sigma_{\max})$ ;

Set the maximum number of iterations to be $t_{0}\asymp n^{23}$ ;

The returned estimate is $(\bm{X}^{\mathsf{ncvx}},\bm{Y}^{\mathsf{ncvx}})\triangleq(\bm{X}^{t_{\star}},\bm{Y}^{t_{\star}})$ , where

In words, we run gradient descent in Algorithm 1 until we reach a point whose gradient is exceedingly small.

Many of the preliminary facts below were established for the case $(\bm{X}^{0},\bm{Y}^{0})=(\bm{X}^{\star},\bm{Y}^{\star})$ [CCF+19], which is certainly not implementable in practice, however, it serves as a good proxy for studying the convex estimator. Fortunately, the same theoretical guarantees stated in Appendix A.2 can be readily established for spectral initialization using almost the same arguments adopted in [MWCC17, CLL19, CCF+19]. We omit this part mainly for the sake of brevity.

To facilitate analysis, we introduce a set of auxiliary nonconvex loss functions. For any $1\leq j\leq n$ , define

be the nonconvex estimate returned by this auxiliary algorithm (i.e. Algorithm 2), which serves as an approximate solution to (A.2).

A.2 Properties of approximate nonconvex solutions

This subsection gathers the properties of the (approximate) nonconvex solutions. Throughout this subsection, we use the shorthand

and recall the definition of $(\bm{X}^{(j)},\bm{Y}^{(j)})$ in (A.3). The regularization parameter is chosen to satisfy

The claims stated below hold under the sample complexity and the noise condition presumed in [CCF+19, Theorem 1] (see also Theorem 5 in the current manuscript)

The first set of facts is related to $(\bm{X},\bm{Y})$ . In view of [CCF+19], $\bm{F}$ is a faithful estimateTechnically, the statements in [CCF+19, Lemma 5] are for $\eta\asymp 1/(n\kappa^{3}\sigma_{\max})$ and $t_{0}\asymp n^{18}$ . Nevertheless, inspecting their proofs reveals that the claims continue to hold for our choices $\eta\asymp 1/(n^{6}\kappa^{3}\sigma_{\max})$ and $t_{0}\asymp n^{23}$ . of $\bm{F}^{\star}$ , in the sense that

hold with probability exceeding $1-O(n^{-10})$ . In addition, on the same high-probability event, one has

In words, the first claim ensures that $(\bm{X},\bm{Y})$ is an approximate stationary point of $f(\cdot,\cdot)$ ; the second bound tells us that $(\bm{X},\bm{Y})$ is nearly balanced, in the sense that $\bm{X}^{\top}\bm{X}\approx\bm{Y}^{\top}\bm{Y}$ ; the last one formalizes the proximity between the convex solution and the nonconvex one; see also (3.5).

hold with probability at least $1-O(n^{-10})$ .

As has been shown in [CCF+19], the leave-one-out auxiliary point $(\bm{X}^{(j)},\bm{Y}^{(j)})$ satisfies

with probability exceeding $1-O(n^{-10})$ .

Parallel to the transition from $(\bm{X},\bm{Y})$ to $(\bm{X}^{\mathsf{d}},\bm{Y}^{\mathsf{d}})$ , we set

to be the de-shrunken estimators of $\bm{X}^{(j)}$ and $\bm{Y}^{(j)}$ , respectively. We shall demonstrate in Appendix I that, with probability at least $1-O(n^{-10})$ ,

In addition to these four sets of claims, we have the following immediate consequence of the incoherence condition (2.4)

Moreover, recall that $\bm{A}=(1/p)\cdot\mathcal{P}_{\Omega}\left(\bm{X}\bm{Y}^{\top}-\bm{X}^{\star}\bm{Y}^{\star\top}\right)-\left(\bm{X}\bm{Y}^{\top}-\bm{X}^{\star}\bm{Y}^{\star\top}\right)$ (cf. (5.7)). We obtain from the proof of [CCF+19, Lemma 8] that

Last but not least, we list a few simple but useful results: the nonconvex solution $\bm{F}$ satisfies

The same holds true if we replace $\bm{F}$ by either $\bm{F}^{\mathsf{d}}$ , $\bm{F}^{(j)}$ $\bm{F}^{\mathsf{d},(j)}$ or their corresponding low-rank factors. Here $j$ can vary from 1 to $n$ .

Appendix B Summary of the proposed estimators

Let $\bm{Z}^{\mathsf{cvx}}$ be the minimizer of the convex program (3.1), and let $(\bm{X}^{\mathsf{ncvx}},\bm{Y}^{\mathsf{ncvx}})$ be the solution returned by the Algorithm 1 (with algorithmic details specified in Appendix A.1). Recall that $\bm{Z}^{\mathsf{cvx},r}$ is the best rank- $r$ approximation of $\bm{Z}^{\mathsf{cvx}}$ , viz.

In addition, we let the matrix estimate obtained by the nonconvex algorithm be $\bm{Z}^{\mathsf{ncvx}}\triangleq\bm{X}^{\mathsf{ncvx}}\bm{Y}^{\mathsf{ncvx}\top}$ . We further denote by $(\bm{X}^{\mathsf{cvx}},\bm{Y}^{\mathsf{cvx}})$ the estimate of low-rank factors obtained by convex relaxation; more specifically, we set $(\bm{X}^{\mathsf{cvx}},\bm{Y}^{\mathsf{cvx}})$ to be the balanced rank- $r$ factorization of $\bm{Z}^{\mathsf{cvx},r}$ obeying $\bm{X}^{\mathsf{cvx}}\bm{Y}^{\mathsf{cvx}\top}=\bm{Z}^{\mathsf{cvx},r}$ and $\bm{X}^{\mathsf{cvx}\top}\bm{X}^{\mathsf{cvx}}=\bm{Y}^{\mathsf{cvx}\top}\bm{Y}^{\mathsf{cvx}}$ . With these notations in place, our de-biased and de-shrunken estimators can be summarized as follows.

De-shrunken estimators for low-rank factors:

Appendix C Proof of Lemma 3

Throughout this section, let $\bm{U}\bm{\Sigma}\bm{V}^{\top}$ be the rank- $r$ SVD of the nonconvex estimate $\bm{X}^{\mathsf{ncvx}}\bm{Y}^{\mathsf{ncvx}\top}$ and $T$ the tangent space of the set of rank- $r$ matrices at $\bm{X}^{\mathsf{ncvx}}\bm{Y}^{\mathsf{ncvx}\top}$ . Correspondingly, we denote by $\mathcal{P}_{T}$ the projection operator onto the tangent space $T$ , and let $\mathcal{P}_{T^{\perp}}=\mathcal{I}-\mathcal{P}_{T}$ , where $\mathcal{I}$ is the identity operator.

In essence, we intend to justify that $\bm{M}^{\mathsf{cvx,d}}$ , $\bm{M}^{\mathsf{ncvx,d}}$ and $\bm{X}^{\mathsf{ncvx,d}}\bm{Y}^{\mathsf{ncvx,d}\top}$ are all very close to $\bm{U}(\bm{\Sigma}+\frac{\lambda}{p}\bm{I}_{r})\bm{V}^{\top}$ .

Recall from the definition of the de-biased estimator $\bm{M}^{\mathsf{cvx,d}}$ (cf. (B.1a)) that

Replacing $\bm{Z}^{\mathsf{cvx}}$ by $\bm{X}^{\mathsf{ncvx}}\bm{Y}^{\mathsf{ncvx}\top}$ results in

Apply the proximity bound (A.12) to obtain (recall that in (A.12), one has $(\bm{X},\bm{Y})=(\bm{X}^{\mathsf{ncvx}},\bm{Y}^{\mathsf{ncvx}})$ )

as long as $n^{5}p\gg\kappa^{2}$ . In addition, in view of [CCF+19, Claim 2], one has the decomposition

where the middle line follows since $\bm{U}\bm{\Sigma}\bm{V}^{\top}$ is defined to be the SVD of $\bm{X}^{\mathsf{ncvx}}\bm{Y}^{\mathsf{ncvx}\top}$ .

where the last inequality results from (C.3) and (C.5). Combining the above two bounds with the fact that $\bm{U}(\bm{\Sigma}+\frac{\lambda}{p}\bm{I}_{r})\bm{V}^{\top}$ and $\mathcal{P}_{T}(\bm{\Delta}_{\bm{Z}}-\frac{1}{p}\bm{R})$ are orthogonal to each other, we arrive at the conclusion that $\bm{U}(\bm{\Sigma}+\frac{\lambda}{p}\bm{I}_{r})\bm{V}^{\top}$ is the top- $r$ SVD of $\bm{C}$ and

Second, let $\hat{\bm{U}}\hat{\bm{\Sigma}}\hat{\bm{V}}^{\top}$ be the top- $r$ SVD of $\bm{C}+\bm{\Delta}$ . By definition, one has $\hat{\bm{U}}\hat{\bm{\Sigma}}\hat{\bm{V}}^{\top}=\bm{M}^{\mathsf{cvx,d}}$ . We are left with checking the two conditions in Lemma 14. To begin with, the perturbation term $\bm{\Delta}$ obeys

where (i) comes from (C.3) and (C.5) and the last inequality (ii) arises since $np\gg\kappa^{2}.$ Clearly, the size of the perturbation is much smaller than $\lambda/p$ and hence $\|\bm{C}\|$ (cf. (C.8a)). In addition,

where the equality depends on (C.8b) and the last line results from (C.7) and (C.9). Consequently,

Here the first relation arises from (C.8a) and the second holds since $\sigma_{r}(\bm{\Sigma})\geq\sigma_{\min}/2$ , a simple consequence of (A.19). We are now ready to apply Lemma 14 to obtain

where we have used the fact that $\|\bm{\Sigma}+(\lambda/p)\bm{I}_{r}\|\lesssim\sigma_{\max}$ , which also can be derived from (A.19). The above bound combined with (C.9) yields

as long as $np\gg\kappa^{3}$ . We remark that by setting $\bm{\Delta}_{\bm{Z}}=\bm{0}$ , one also obtains the bound on $\bm{M}^{\mathsf{ncvx,d}}$ , i.e.

We move on to investigating $\|\bm{X}^{\mathsf{ncvx,d}}\bm{Y}^{\mathsf{ncvx,d}\top}-\bm{U}(\bm{\Sigma}+\frac{\lambda}{p}\bm{I}_{r})\bm{V}^{\top}\|$ , for which we have the following claim.

Taking the above three bounds collectively and recognizing that $\lambda\lesssim\sigma\sqrt{np}$ yield the advertised bound (5.4a).

The last inequality is the small-gradient condition (see (A.10), in which $(\bm{X},\bm{Y})=(\bm{X}^{\mathsf{ncvx}},\bm{Y}^{\mathsf{ncvx}}))$ . Employ the definitions for $\bm{X}^{\mathsf{ncvx,d}}$ and $\bm{Y}^{\mathsf{ncvx,d}}$ (cf. (B.2a) and (B.2b)) to see that

It then boils down to showing that (i) $\bm{{A}}_{1}$ is very close to $\bm{U}(\bm{\Sigma}+\frac{\lambda}{p}\bm{I}_{r})\bm{V}^{\top}$ , and (ii) $\bm{{A}}_{2}$ is small in size.

First, recall that $\bm{X}^{\mathsf{ncvx}}\bm{Y}^{\mathsf{ncvx}\top}=\bm{{U}}\bm{{\Sigma}}\bm{{V}}^{\top}$ , which combined with (C.12) gives

Here, the last inequality comes from (C.13) and its immediate consequence that $\|\bm{{\Sigma}}_{\bm{{Q}}}\|\asymp\|\bm{{\Sigma}}_{\bm{{Q}}}^{-1}\|\asymp 1$ .

Second, apply the perturbation bound for matrix square roots (cf. Lemma 13) to obtain

Here, the inequality (i) depends on the facts that

whereas the inequality (ii) holds because of the facts that $\|(\bm{X}^{\mathsf{ncvx}\top}\bm{X}^{\mathsf{ncvx}})^{-1}\|\lesssim 1/\sigma_{\min}$ , $\|(\bm{Y}^{\mathsf{ncvx}\top}\bm{Y}^{\mathsf{ncvx}})^{-1}\|\lesssim 1/\sigma_{\min}$ and the balancedness condition (A.11)

As a result, the operator norm of $\bm{{A}}_{2}$ is bounded by

Take (C.14), (C.15) and (C.17) collectively to arrive at

C.2 Proof of the inequality (5.4b)

Next, we switch attention to the low-rank factors. Our goal is to demonstrate that $(\bm{X}^{\mathsf{cvx,d}},\bm{Y}^{\mathsf{cvx,d}})$ and $(\bm{X}^{\mathsf{ncvx,d}},\bm{Y}^{\mathsf{ncvx,d}})$ are both extremely close to $(\bm{U}(\bm{\Sigma}+\frac{\lambda}{p}\bm{I}_{r})^{1/2},\bm{V}(\bm{\Sigma}+\frac{\lambda}{p}\bm{I}_{r})^{1/2})$ modulo some global rotation, which will be established in (C.19) and (C.20) shortly.

We start by justifying the proximity between $(\bm{X}^{\mathsf{ncvx,d}},\bm{Y}^{\mathsf{ncvx,d}})$ and $(\bm{U}(\bm{\Sigma}+\frac{\lambda}{p}\bm{I}_{r})^{1/2},\bm{V}(\bm{\Sigma}+\frac{\lambda}{p}\bm{I}_{r})^{1/2})$ . In view of (C.12), we know that

The latter bound follows from similar derivations as in (C.16). A similar bound holds for $\bm{Y}^{\mathsf{ncvx,d}}$ . Recognizing that

Next, we establish the connection between $(\bm{X}^{\mathsf{cvx,d}},\bm{Y}^{\mathsf{cvx,d}})$ and $(\bm{U}(\bm{\Sigma}+\frac{\lambda}{p}\bm{I}_{r})^{1/2},\bm{V}(\bm{\Sigma}+\frac{\lambda}{p}\bm{I}_{r})^{1/2})$ . To accomplish this, we first study the relationship between $(\bm{X}^{\mathsf{cvx}},\bm{Y}^{\mathsf{cvx}})$ and $(\bm{U}\bm{\Sigma}^{1/2},\bm{V}\bm{\Sigma}^{1/2})$ . Recall that $\bm{X}^{\mathsf{cvx}}$ and $\bm{Y}^{\mathsf{cvx}}$ constitute a balanced factorization of $\bm{Z}^{\mathsf{cvx},r}$ , while $(\bm{U}\bm{\Sigma}^{1/2},\bm{V}\bm{\Sigma}^{1/2})$ is a balanced one of $\bm{X}^{\mathsf{ncvx}}\bm{Y}^{\mathsf{ncvx}\top}=\bm{U}\bm{\Sigma}\bm{V}^{\top}$ . Hence one can view $\bm{Z}^{\mathsf{cvx},r}$ as a perturbation of $\bm{X}^{\mathsf{ncvx}}\bm{Y}^{\mathsf{ncvx}\top}=\bm{U}\bm{\Sigma}\bm{V}^{\top}$ and investigate the perturbation bounds on the balanced factorizations. Going through the same derivations as in [MWCC17, Appendix B.7] and [CLL19, Appendix B.2.1], one reaches

Here the last relation follows from the proximity of the convex estimator and the nonconvex estimator; see (A.12). Repeating the same argument as above to translate the bound between $(\bm{X}^{\mathsf{cvx}},\bm{Y}^{\mathsf{cvx}})$ and $(\bm{U}\bm{\Sigma}^{1/2},\bm{V}\bm{\Sigma}^{1/2})$ to that of $(\bm{X}^{\mathsf{cvx,d}},\bm{Y}^{\mathsf{cvx,d}})$ and $(\bm{U}(\bm{\Sigma}+\frac{\lambda}{p}\bm{I}_{r})^{1/2},\bm{V}(\bm{\Sigma}+\frac{\lambda}{p}\bm{I}_{r})^{1/2})$ , we conclude that

This together with (C.19) and the assumption $n\gg\kappa^{4}$ concludes the proof.

C.3 Proof of the inequality (5.5)

We shall focus on proving the claim for the nonconvex estimator $\bm{M}^{\mathsf{ncvx,d}}$ and $\bm{X}^{\mathsf{ncvx}}\bm{Y}^{\mathsf{ncvx}\top}$ ; the claim for the convex estimator $\bm{Z}^{\mathsf{cvx}}$ can be treated similarly.

To see this, it has been established in Appendix C.1 that

Appendix D Analysis of the low-rank factors

We concentrate on the factor $\bm{X}^{\mathsf{d}}$ ; the other factor $\bm{Y}^{\mathsf{d}}$ can be treated similarly. By definition of the gradient, one has

with $\bm{A}$ defined in (5.7), we can rearrange (D.1) as follows

By construction, the de-shrunken estimator $\bm{Y}^{\mathsf{d}}$ satisfies the following property

where the last identity follows since $(\bm{Y}^{\top}\bm{Y})^{1/2}$ and $(\bm{I}_{r}+\frac{\lambda}{p}(\bm{Y}^{\top}\bm{Y})^{-1})^{1/2}$ commute. Combining (D.3) with the identity (D.4) gives

Multiplying both sides of (D.5) by $(\bm{I}_{r}+\frac{\lambda}{p}(\bm{Y}^{\top}\bm{Y})^{-1})^{1/2}$ and recalling the definition of $\bm{Y}^{\mathsf{d}}$ in (3.8), we have

Since $\bm{Y}^{\mathsf{d}\top}\bm{Y}^{\mathsf{d}}$ and $(\bm{I}_{r}+\frac{\lambda}{p}(\bm{Y}^{\top}\bm{Y})^{-1})^{1/2}$ also commute, we have

where the last relation uses the definition of $\bm{X}^{\mathsf{d}}$ (see (3.8)).

Substituting the identity (D.7) back into (D.6) and making a few elementary algebraic manipulations yield the desired decomposition (5.8a).

D.2 Proof of Lemma 5

Recall that $\overline{\bm{Y}}^{\mathsf{d}}=\bm{Y}^{\mathsf{d}}\bm{H}^{\mathsf{d}}$ and similarly define

The triangle inequality tells us that for any fixed $1\leq j\leq n$ ,

In what follows, we shall control $\alpha_{1}$ and $\alpha_{2}$ separately.

To begin with, denoting $\bm{\Delta}^{(j)}\triangleq\overline{\bm{Y}}^{\mathsf{d},(j)}(\overline{\bm{Y}}^{\mathsf{d},(j)\top}\overline{\bm{Y}}^{\mathsf{d},(j)})^{-1}-\bm{Y}^{\star}(\bm{Y}^{\star\top}\bm{Y}^{\star})^{-1}$ results in

Before proceeding, we gather a few useful facts regarding $\bm{\Delta}^{(j)}$ , as summarized in the following claim.

With probability at least $1-O(n^{-11})$ , we have

With the bounds on $\|\bm{\Delta}^{(j)}\|$ and $\|\bm{\Delta}^{(j)}\|_{2,\infty}$ in place, we are ready to control $\alpha_{1}$ . By construction, $\bm{\Delta}^{(j)}$ is independent of $\bm{e}_{j}^{\top}\mathcal{P}_{\Omega}\left(\bm{E}\right)$ . Therefore, the vector on the right-hand side of (D.8), $\frac{1}{p}\sum_{k=1}^{n}E_{jk}\delta_{jk}\bm{\Delta}_{k,\cdot}^{(j)}$ , is a sum of conditionally independent random vectors. In particular, conditional on $\bm{\Delta}^{(j)}$ and $\{\delta_{jk}\}_{k:1\leq k\leq n}$ , one has

Invoke the concentration inequality for Gaussian random vectors [HKZ12, Proposition 1.1] to see that

with probability at least $1-e^{-t}$ . It remains to control $\|\hat{\bm{\Sigma}}\|$ , which we state in the following claim.

Suppose that $n^{2}p\gg\kappa^{2}\mu rn\log^{2}n$ . Then with probability exceeding $1-O(n^{-11})$ ,

Combine the upper bound on $\|\hat{\bm{\Sigma}}\|$ with (D.10) and choose $t\asymp\log n$ to arrive at

with probability exceeding $1-O(n^{-11})$ .

We move on to bounding $\alpha_{2}$ , for which we have

Here (i) uses the fact that $\|\mathcal{P}_{\Omega}(\bm{E})\|\lesssim\sigma\sqrt{np}$ (see [CCF+19, Lemma 3]), the perturbation bounds for pseudo-inverses (see Lemma 12) and (A.19); the penultimate inequality (ii) comes from the fact that $\|\overline{\bm{Y}}^{\mathsf{d}}-\overline{\bm{Y}}^{\mathsf{d},(j)}\|\lesssim\kappa\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n\log n}{p}}\|\bm{Y}^{\star}\|_{2,\infty}$ (see (A.16c)) and last one uses the incoherence condition $\|\bm{Y}^{\star}\|_{2,\infty}\leq\sqrt{\mu r\sigma_{\max}/n}$ (cf. (A.17)).

Combine the bounds on $\alpha_{1}$ and $\alpha_{2}$ to reach

Taking the maximum over $1\leq j\leq n$ establishes our bound on $\|\bm{\Phi}_{1}\|_{2,\infty}$ .

Apply the perturbation bound for pseudo-inverses (see Lemma 12) to obtain

Here we have utilized the facts that $\|\overline{\bm{Y}}^{\mathsf{d},(j)}-\bm{Y}^{\star}\|\lesssim\kappa\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}}\|\bm{X}^{\star}\|$ (see (A.16a)) and a simple consequence of (A.19), viz.

Moreover, the triangle inequality tells us that

where the penultimate inequality follows from the facts that $\|\overline{\bm{Y}}^{\mathsf{d},(j)}\|_{2,\infty}\leq 2\|\bm{F}^{\star}\|_{2,\infty}$ , $\|\overline{\bm{Y}}^{\mathsf{d},(j)}-\bm{Y}^{\star}\|_{2,\infty}\lesssim\kappa\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n\log n}{p}}\|\bm{F}^{\star}\|_{2,\infty}$ (see (A.16b)) and that

Here the penultimate inequality follows from (A.19). The proof of the claim is then complete. ∎

Conditional on $\bm{{\Delta}}^{(j)}$ , using Bernstein’s inequality and the fact that $\bm{\Delta}^{(j)}$ and $\{\delta_{jk}\}_{k:1\leq k\leq n}$ are independent, we arrive at that with probability exceeding $1-O(n^{-11})$ ,

As a result, with probability at least $1-O(n^{-11})$ , we have

as long as $np\gg\kappa^{2}\mu r\log^{2}n$ . Here the middle inequality uses Claim 2. In view of the triangle inequality,

with the proviso that $n^{2}p\gg\kappa^{2}\mu rn\log^{2}n$ . Again, the last line makes use of Claim 2. This concludes the proof of the claim. ∎

D.3 Proof of Lemma 6

where the second inequality follows from the incoherence assumption that $\|\bm{e}_{j}^{\top}\bm{X}^{\star}\|_{2}\leq\|\bm{X}^{\star}\|_{2,\infty}\leq\sqrt{\mu r\sigma_{\max}/n}$ (cf. (A.17)) and the fact that $\|(\overline{\bm{Y}}^{\mathsf{d}\top}\overline{\bm{Y}}^{\mathsf{d}})^{-1}\|\lesssim 1/\sigma_{\min}$ , a simple consequence of (A.19).

It remains to control $\|(\overline{\bm{Y}}^{\mathsf{d}}-\bm{Y}^{\star})^{\top}\overline{\bm{Y}}^{\mathsf{d}}\|$ . To simplify notation hereafter, define $\bm{\Delta}_{\bm{X}}\triangleq\overline{\bm{X}}^{\mathsf{d}}-\bm{X}^{\star}$ and $\bm{\Delta}_{\bm{Y}}\triangleq\overline{\bm{Y}}^{\mathsf{d}}-\bm{Y}^{\star}$ . First, observe that

Second, in view of the decomposition of $\bm{Y}^{\mathsf{d}}$ given in (5.8b), we have

The following claim connects $\bm{\Delta}_{\bm{Y}}^{\top}\bm{Y}^{\star}$ with $\bm{X}^{\star\top}\bm{\Delta}_{\bm{X}}$ .

This relation together with (D.18) yields

A little algebraic manipulation then gives

It is easy to check from (A.19) that $0.25\sigma_{\min}\bm{I}_{r}\preceq\overline{\bm{X}}^{\mathsf{d}\top}\overline{\bm{X}}^{\mathsf{d}},\bm{\Sigma}^{\star}\preceq 4\sigma_{\max}\bm{I}_{r}$ . Hence one can invoke Lemma 15 with $\bm{X}=\bm{\Delta}_{\bm{Y}}^{\top}\bm{Y}^{\star}$ , $\bm{A}=\bm{\Sigma}^{\star}$ , $\bm{B}=\overline{\bm{X}}^{\mathsf{d}\top}\overline{\bm{X}}^{\mathsf{d}}$ and $\bm{C}=\overline{\bm{X}}^{\mathsf{d}\top}\overline{\bm{X}}^{\mathsf{d}}\bm{S}+0.5(\bm{\Delta}_{\bm{X}}^{\top}\bm{\Delta}_{\bm{X}}-\bm{\Delta}_{\bm{Y}}^{\top}\bm{\Delta}_{\bm{Y}})\bm{\Sigma}^{\star}+\bm{{\Delta}}_{\bm{{XY}}}^{\mathsf{{d}}}\bm{\Sigma}^{\star}$ to obtain

where we have plugged in the definition of $\bm{S}$ (see (D.19)) and used the identity $\overline{\bm{X}}^{\mathsf{d}\top}\overline{\bm{X}}^{\mathsf{d}}\bm{H}^{\mathsf{d}\top}(\bm{X}^{\mathsf{d}\top}\bm{X}^{\mathsf{d}})^{-1}=\bm{H}^{\mathsf{d}\top}$ . Combine the above inequality with (D.16) to obtain

It then boils down to controlling the above terms $\alpha_{1},\alpha_{2},\alpha_{3}$ and $\alpha_{4}$ .

First, the term $\alpha_{4}$ can be upper bounded by

Here, the second line utilizes the facts that $\|(\bm{I}_{r}+\lambda(\bm{X}^{\top}\bm{X})^{-1}/p)^{1/2}\|\lesssim 1$ , $\|\overline{\bm{X}}^{\mathsf{d}\top}\overline{\bm{X}}^{\mathsf{d}}\|\asymp\|\bm{Y}^{\top}\bm{Y}^{\star}\|\lesssim\sigma_{\max}$ and the results in (A.10), (A.13e) and (C.16).

Moving on to $\alpha_{3}$ , we recall from (A.13b) that

Finally, for the term $\alpha_{1}$ , by the triangle inequality one has

Observe that conditional on $\{\delta_{jk}\}_{1\leq j,k\leq n}$ one has

As a result, we obtain that with probability at least $1-O(n^{-10})$ ,

Here, the second relation uses the fact that

with probability at least $1-O(n^{-10})$ as long as $n^{2}p\gg\mu rn\log n$ , which follows from [MWCC17, Lemma 38] or [CR09, Section 4.2] by observing that $\bm{{X}}_{\cdot,i}^{\star}(\bm{{Y}}_{\cdot,j}^{\star})^{\top}$ lies in the tangent space of $\bm{{M}}^{\star}$ . Take (D.21) and (D.22) collectively to reach

Substituting the bounds on $\alpha_{1}$ , $\alpha_{2}$ , $\alpha_{3}$ and $\alpha_{4}$ back to (D.20) results in

Taking the maximum over $1\leq j\leq n$ leads to the desired result.

Finally, we are left with proving Claim 4.

First, by $\bm{X}^{\star\top}\bm{X}^{\star}=\bm{Y}^{\star\top}\bm{Y}^{\star}$ , one can obtain

Second, since $\bm{{H}}^{\mathsf{{d}}}$ is the best rotation matrix to align $(\bm{X}^{\mathsf{{d}}},\bm{Y}^{\mathsf{{d}}})$ and $(\bm{X}^{\star},\bm{Y}^{\star})$ , we know from [MWCC17, Lemma 35] that

D.4 Proof of Lemma 7

Here (i) and (iv) rely on the unitary invariance of the operator norm, (ii) uses the definition of $\bm{Y}^{\mathsf{d}}$ (see (3.8)) and (iii) follows from the choice $\lambda\lesssim\sigma\sqrt{np}$ and immediate consequences of (A.19)

Therefore, it suffices to control $\|\bm{e}_{j}^{\top}\bm{A}\bm{Y}\bm{H}\|_{2}$ . To this end, we have the following decomposition

we can rewrite the first term of (D.25) as

Since $(\bm{X}^{(j)},\bm{Y}^{(j)})$ is independent of $\{\delta_{jk}\}_{1\leq k\leq n},$ the right hand side of the above equation can be viewed as a sum of independent random vectors, conditional on $(\bm{X}^{(j)},\bm{Y}^{(j)})$ . Invoke Bernstein’s inequality to see that

holds with probability at least $1-O(n^{-10})$ . Here, we denote

Here the penultimate inequality uses (A.14d) and the bound $\|\bm{Y}^{(j)}\|_{2,\infty}\lesssim\sqrt{\mu r\sigma_{\max}/n}$ . We arrive at the conclusion that: with probability exceeding $1-O(n^{-10})$ ,

Next, we move on to the second term $\bm{\Delta}_{2}$ of (D.25), which can be further decomposed as follows

In what follows, we bound $\bm{\theta}_{1}$ and $\bm{\theta}_{2}$ sequentially.

Regarding $\bm{\theta}_{1}$ , using the definition of $\bm{A}$ we obtain

where the second relation holds due to (A.14b) and the fact that $\|\bm{A}\|\lesssim\sigma\sqrt{\frac{n}{p}}\sqrt{\frac{\kappa^{4}\mu^{2}r^{2}\log n}{np}}$ (cf. (A.18)).

Moving on to $\bm{\theta}_{2}$ , we can utilize the identity

With regards to $\alpha_{1}$ , we have by Bernstein’s inequality and (A.14b) that

holds with probability exceeding $1-O(n^{-10})$ . Here, we define

provided that $np\gg\mu r\log n$ . Here we apply the bounds $\|\bm{Y}^{(j)}\|\lesssim\sqrt{\sigma_{\max}}$ and $\|\bm{Y}^{(j)}\|_{2,\infty}\lesssim\sqrt{\mu r\sigma_{\max}/n}$ (see (A.19) and the following remarks). In the end, we turn to the term $\alpha_{2}$ , which obeys

where the second line arises from the Cauchy-Schwarz inequality.

Take the previous bounds collectively to arrive at

as long as $\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{\kappa^{2}n\log n}{p}}\ll 1$ . Finally, we conclude that

D.5 Proof of Lemma 8

First, it is straightforward to verify that

where the last line arises from (A.10), the choice $\lambda\lesssim\sigma\sqrt{np}$ (cf. (A.6)), and the bounds

Here the latter two are immediate consequences of (A.19). Second, with regards to the term involving $\bm{\Delta}_{\mathsf{balancing}}$ , we have

where the middle line uses (A.19) and the last one follows from (C.16).

Combine (D.27), (D.28) and the triangle inequality to establish the advertised result, with the proviso that $n^{2}\gg\kappa^{3}\mu r$ .

D.6 Proof of Lemma 9

We invoke the identity $\bm{Y}^{\star}(\bm{Y}^{\star\top}\bm{Y}^{\star})^{-1}=\bm{V}^{\star}(\bm{\Sigma}^{\star})^{-1/2}$ (since $\bm{Y}^{\star}=\bm{V}^{\star}(\bm{\Sigma}^{\star})^{1/2}$ ) to see that for any $1\leq i\leq n$ ,

consists of a sum of independent random vectors, where we recall that $\delta_{ik}=\operatorname{\mathds{1}}\{(i,k)\in\Omega\}$ . In addition, the right-hand side of the above formula is conditionally Gaussian, namely,

Note that $\bm{S}$ depends on the index $i$ through $\{\delta_{ik}\}_{k:1\leq k\leq n}$ . Denote by $\bm{S}^{\star}$ the expectation of $\bm{S}$ , that is,

Clearly, when $np\gg\kappa^{2}\mu r\log n$ , one has $\bm{S}\succ\bm{0}$ on the event $\mathcal{E}$ and hence $\bm{S}^{-1/2}$ is well-defined. As a result, on the event $\mathcal{E}$ , we have

As can be easily seen from (D.29) and (D.30), each row of $\bm{Z}_{\bm{X}}$ follows the Gaussian distribution

It remains to show that, with high probability,

In what follows, we shall bound the two terms on the right-hand side of the above display sequentially.

First, observe that $\sum_{k=1}^{n}E_{ik}\delta_{ik}\bm{V}_{k,\cdot}^{\star}$ involves a sum of independent random vectors with

where $\|\cdot\|_{\psi_{1}}$ denotes the sub-exponential norm [Ver17]. One can then apply the matrix Bernstein inequality [Kol11, Theorem 2.7] to conclude that with probability at least $1-O(n^{-20})$ ,

Next, we move on to $\|\bm{I}_{r}-\bm{S}^{-1/2}(\bm{S}^{\star})^{1/2}\|$ . Recall that on the event $\mathcal{E}$ , one has

This together with the fact that $\sigma^{2}/(p\sigma_{\max})\leq\lambda_{\min}(\bm{S}^{\star})\leq\lambda_{\max}(\bm{S}^{\star})\leq\sigma^{2}/(p\sigma_{\min})$ gives

with the proviso that $np\gg\kappa^{2}\mu r\log n$ . Therefore, straightforward calculations yield

Here the second relation is the perturbation bound for the matrix square roots (see Lemma 13).

Combine the above two bounds to conclude that

with probability at least $1-O(n^{-10})$ . Here the last line utilizes the matrix Bernstein inequality, where

Consequently with probability exceeding $1-O(n^{-10})$ one has

Appendix E Analysis of the entries of the matrix

The term $\Lambda_{ij}$ can be naturally split into two terms, namely

In what follows, we shall bound each term individually.

Regarding the first term, one sees from Theorem 5 that with probability exceeding $1-O(n^{-10})$

where the last line follows since $\|\bm{X}_{i,\cdot}^{\star}\|_{2}\leq\sqrt{\sigma_{\max}}\|\bm{U}_{i,\cdot}^{\star}\|_{2}$ and $\|\bm{Y}_{j,\cdot}^{\star}\|_{2}\leq\sqrt{\sigma_{\max}}\|\bm{V}_{j,\cdot}^{\star}\|_{2}$ .

Turning to the second term, we have by the Cauchy-Schwarz inequality that

where the penultimate inequality uses (A.13d) and the last one depends on the incoherence assumption that $\|\bm{F}^{\star}\|_{2,\infty}\leq\sqrt{\mu r\sigma_{\max}/n}$ (see (A.17)).

Take collectively the above two bounds to complete the proof.

E.2 Proof of Lemma 11

If $\bm{Z}_{\bm{X}}^{\top}\bm{e}_{i}$ and $\bm{Z}_{\bm{Y}}^{\top}\bm{e}_{j}$ were independent, then clearly one would have

As such, the main ingredient of the proof boils down to demonstrating that $\bm{Z}_{\bm{X}}^{\top}\bm{e}_{i}$ and $\bm{Z}_{\bm{Y}}^{\top}\bm{e}_{j}$ are nearly independent.

Fortunately, this weak dependency can be easily decoupled, for which we have the following claim.

Suppose that $np\gg\kappa^{2}\mu r^{2}\log^{2}n$ . One has the decomposition

where $\widetilde{\bm{Z}}_{\bm{X}}^{\top}\bm{e}_{i}\sim\mathcal{N}(\bm{0},\sigma^{2}(\bm{\Sigma}^{\star})^{-1}/p)$ and is independent of $\{\delta_{kj},E_{k}j\}_{k:1\leq k\leq n}$ and hence of $\bm{Z}_{\bm{Y}}^{\top}\bm{e}_{j}$ . In addition, with probability at least $1-O(n^{-10})$ one has

The desired result follows immediately from Claim 5, since

Similarly, repeating the same argument above, we can also show that $\bm{e}_{i}^{\top}\bm{Z}_{\bm{X}}\bm{Y}^{\star\top}\bm{e}_{j}+\bm{e}_{i}^{\top}\bm{X}^{\star}\bm{Z}_{\bm{Y}}^{\top}\bm{e}_{j}$ can be decomposed as a Gaussian random variable $\mathcal{N}\left(0,v_{ij}^{\star}\right)$ as well as a residual term bounded above by $({\sigma}/{\sqrt{p}})\sqrt{({\kappa^{2}\mu r\log n})/({np})}\|\bm{U}_{i,\cdot}^{\star}\|_{2,\infty}$ with high probability. These together finish the proof.

Instate the notation used in Appendix D.6. Recall that

To remove the effect of ${\delta_{ij},E_{ij}}$ on $\bm{Z}_{\bm{X}}^{\top}\bm{e}_{i}$ , we construct an auxiliary random matrix $\widetilde{\bm{Z}}_{\bm{X}}$ as follows

where $\bm{S}^{\star}=p^{-1}\sigma^{2}\left(\bm{\Sigma}^{\star}\right)^{-1}$ ,

It is easily seen that $\widetilde{\bm{Z}}_{\bm{X}}^{\top}\bm{e}_{i}\sim\mathcal{N}(\bm{0},\sigma^{2}(\bm{\Sigma}^{\star})^{-1}/p)$ ; more importantly $\widetilde{\bm{Z}}_{\bm{X}}^{\top}\bm{e}_{i}$ is independent of $\{\delta_{kj},E_{kj}\}_{1\leq k\leq n}$ and hence of $\bm{Z}_{\bm{Y}}^{\top}\bm{e}_{j}$ .

which together with the triangle inequality and the fact $\|\bm{S}^{\star}\|=\sigma^{2}/(p\sigma_{\min})$ yields

Here we have used the results in (D.32) and (D.33). We are left with bounding $\|\bm{S}^{-1/2}-\bm{S}_{-j}^{-1/2}\|$ , for which we have

Take the above bound collectively with (D.33) to yield

as long as $np\gg\kappa\mu r$ . As a result, we have

where the middle line relies on the perturbation of matrix square roots; see Lemma 13. Combining all, we arrive at

with the proviso that $np\gg\kappa^{2}\mu r^{2}\log^{2}n$ . This finishes the proof. ∎

Appendix F Proof of Corollary 1

This section is dedicated to establishing the following result, which subsumes Corollary 1 as a special case.

Suppose that the conditions (3.18) hold, and recall the notation in Corollary 1. Then one has

Before entering the main proof of Corollary 2, we make a simple observation that

where we recall that $\overline{\bm{X}}^{\mathsf{d}}=\bm{X}^{\mathsf{d}}\bm{H}^{\mathsf{d}}$ and $\overline{\bm{Y}}^{\mathsf{d}}=\bm{Y}^{\mathsf{d}}\bm{H}^{\mathsf{d}}$ . Here, the first inequality arises from (A.13d) and the second one uses the assumption on $\|\bm{U}_{i,\cdot}^{\star}\|_{2}+\|\bm{V}_{j,\cdot}^{\star}\|_{2}$ (i.e. (3.18b)). A simple consequence of (F.1) is that

which in conjunction with Theorem 6 yields the following decomposition

With this decomposition at hand, we have that for any $\varepsilon>0$ ,

With probability exceeding $1-O(n^{-10})$ , the term $\Delta_{V}$ obeys

With Claim 6 at hand, we are ready to take

for any $t$ . This immediately establishes Corollary 2.

Suppose for the moment that $|v_{ij}-v_{ij}^{\star}|\leq cv_{ij}^{\star}$ for some $c\leq 1/2$ . Then it follows immediately that

Therefore if suffices to control $|M_{ij}^{\mathsf{d}}-M_{ij}^{\star}|$ and $|v_{ij}^{\star}-v_{ij}|$ (i.e. obtaining the quantity $c$ ).

First, expand $M_{ij}^{\mathsf{d}}$ and $M_{ij}^{\star}$ to see

where the middle line depends on (F.1) and (F.2), and the last inequality arises since $\sigma(\|\bm{U}_{i,\cdot}^{\star}\|_{2}+\|\bm{V}_{j,\cdot}^{\star}\|_{2})/\sqrt{p}\lesssim\sqrt{v_{ij}^{\star}}$ .

Now we move on to $|v_{ij}^{\star}-v_{ij}|$ . By the definition of $v_{ij}$ , one has

Focusing on the $\bm{X}$ factor, we have — with probability at least $1-O(n^{-10})$ — that

Here, the first relation comes from the identity $\bm{X}_{i,\cdot}^{\mathsf{d}}(\bm{X}^{\mathsf{d}\top}\bm{X}^{\mathsf{d}})^{-1}(\bm{X}_{i,\cdot}^{\mathsf{d}})^{\top}=\overline{\bm{X}}_{i,\cdot}^{\mathsf{d}}(\overline{\bm{X}}^{\mathsf{d}\top}\overline{\bm{X}}^{\mathsf{d}})^{-1}(\overline{\bm{X}}_{i,\cdot}^{\mathsf{d}})^{\top}$ , and the inequality arises from the triangle inequality. Notice that $\|(\overline{\bm{X}}^{\mathsf{d}\top}\overline{\bm{X}}^{\mathsf{d}})^{-1}\|\lesssim 1/\sigma_{\min}$ and that

where the last inequality follows from (A.13b). Using the bounds (F.1) and (F.2), we continue the upper bound in (F.4) as follows

A similar bound holds for the factor $\bm{Y}$ . Therefore, with high probability we have

where the last relation results from the condition on $\|\bm{U}_{i,\cdot}^{\star}\|_{2}+\|\bm{V}_{j,\cdot}^{\star}\|_{2}$ (cf. (3.18b)).

Combine the bounds on $|M_{ij}^{\mathsf{d}}-M_{ij}^{\star}|$ and $|v_{ij}^{\star}-v_{ij}|$ to see that with probability exceeding $1-O(n^{-10})$ ,

This establishes the desired upper bound on $|\Delta_{V}|$ . ∎

Appendix G Proof of Theorem 3

As we have argued in Section 5.1, it suffices to prove the claim for $\bm{M}^{\mathsf{d}}=\bm{X}^{\mathsf{d}}\bm{X}^{\mathsf{d}\top}=\overline{\bm{X}}^{\mathsf{d}}\overline{\bm{Y}}^{\mathsf{d}\top}$ . For simplicity of notation, we define

Apply the decompositions in Theorem 5 to obtain

where we use the fact that $\bm{Y}^{\star\top}\bm{Y}^{\star}=\bm{\Sigma}^{\star}$ . Theorem 5 tells us that $\bm{Z}_{\bm{X}}^{\top}\bm{e}_{i}\overset{\text{i.i.d.}}{\sim}\mathcal{N}(\bm{0},\sigma^{2}(\bm{\Sigma}^{\star})^{-1}/p)$ , which further implies

Now we turn to the term $\mathsf{rem}$ , for which we have the following two claims.

With probability at least $1-O(n^{-10})$ , one has

With probability exceeding $1-O(n^{-10})$ , we have

Combine all of the above bounds to yield the desired result.

Plug in the definition of $\bm{\Theta}$ (cf. (G.1)) and invoke the triangle inequality again to see that

Here the last relation depends on the assumption (3.18a). Second, we have already established in this section that

with probability exceeding $1-O(n^{-10})$ . Substitute the above two facts into (G.2) to arrive at

where $\max\big{\{}\|\bm{\Delta}_{\bm{X}}\|_{2,\infty},\|\bm{\Delta}_{\bm{Y}}\|_{2,\infty}\big{\}}\lesssim\frac{\sigma}{\sqrt{p\sigma_{\min}}}\sqrt{\frac{\kappa^{2}\mu r^{2}\log^{2}n}{np}}$ and hence

Consequently, use the triangle inequality and Cauchy-Schwarz to verify that

where (i) follows since $\bm{X}^{\star}=\bm{U}^{\star}\bm{\Sigma}^{\star 1/2}$ and $\|\bm{U}^{\star}\|=1$ , and (ii) makes use of (G.3) as well as the facts $\|\bm{Y}^{\star}\|,\|\bm{X}^{\star}\|=\sqrt{\sigma_{\max}}$ . In addition, invoke Lemma 9 to see that $\bm{Z}_{\bm{X}}\bm{\Sigma}^{\star 1/2}$ and $\bm{Z}_{\bm{Y}}\bm{\Sigma}^{\star 1/2}$ are both Gaussian matrices with i.i.d. $\mathcal{N}(0,\sigma^{2}/p)$ entries, which together with standard concentration results implies that

with the proviso that $np\gtrsim\kappa^{3}\mu r\log^{3}n$ . This means that, with high probability,

Everything then boils down to controlling $\mathsf{Tr}\big{(}\bm{Z}_{\bm{X},\bm{E}}\bm{Y}^{\star\top}\bm{Z}_{\bm{Y},\bm{E}}\bm{X}^{\star\top}\big{)}$ . Towards this end, we first note that

Similarly, $\bm{Z}_{\bm{Y},\bm{E}}\bm{X}^{\star\top}=[\mathcal{P}_{\Omega}(\bm{E})]^{\top}\bm{U}^{\star}\bm{U}^{\star\top}/p$ . These identities allow us to derive

Apply the same arguments in controlling (D.21) to obtain that with probability at least $1-O(n^{-10})$ ,

as long as $n\gtrsim r\log^{2}n$ . This combined with (G.6) yields the desired claim. ∎

Appendix H Proof of lower bounds

Fix any $\varepsilon>0$ . It suffices to prove that the matrix $\mathsf{CRLB}(\bm{X}_{i,\cdot}^{\star}\mid\Omega)$ defined in (3.27) satisfies

with probability at least $1-O(n^{-10})$ , provided that $np\geq C_{0}\varepsilon^{-2}\kappa^{4}\mu r$ . Towards this end, we first compute

where we recall that $\delta_{ik}=\operatorname{\mathds{1}}\{(i,k)\in\Omega\}$ . Next, define the following event

where $C>0$ is some large absolute constant. On the event $\mathcal{E}$ , in view of the fact $\sigma_{\min}\bm{I}_{r}\preceq\bm{\Sigma}^{\star}\preceq\sigma_{\max}\bm{I}_{r}$ , one has

with the proviso that $np\geq 4C^{2}\kappa^{2}\mu r\log n$ . This further implies that

on the event $\mathcal{E}$ . Clearly, the requirement (H.1) holds true if $np\geq C_{0}\varepsilon^{-2}\kappa^{4}\mu r\log n$ with $C_{0}=4C^{2}$ .

To finish up, we are left with proving that $\mathcal{E}$ occurs with probability at least $1-O(n^{-10})$ . Invoke the matrix Bernstein inequality to show that

holds with probability at least $1-O(n^{-10})$ , where we define

Here we have used the incoherence condition (A.17). Consequently, one reaches the conclusion that with probability exceeding $1-O(n^{-10})$ ,

as long as $np\gg\mu r\log n$ , thus concluding the proof.

H.2 Proof of Lemma 2

The proof strategy is similar to the one used in proving Lemma 1 (cf. Appendix H.1). Fix any $\varepsilon>0$ . It is sufficient to establish the following inequality

where the scalar $\mathsf{CRLB}(M_{ij}^{\star}\mid\Omega)$ is defined in (3.28) and $v_{ij}^{\star}$ is defined in Theorem 2. Expand the left-hand side to reach

where the last line follows from the observations that $\|\bm{Y}_{j,\cdot}^{\star}\|_{2}\leq\sqrt{\sigma_{\max}}\|\bm{V}_{j,\cdot}^{\star}\|_{2}$ and $\|\bm{X}_{i,\cdot}^{\star}\|_{2}\leq\sqrt{\sigma_{\max}}\|\bm{U}_{i,\cdot}^{\star}\|_{2}$ .

where $C>0$ is some large universal constant. Two observations are sufficient to derive the desired the result (H.2). First, the event $\mathcal{E}_{2}$ happens with probability at least $1-O(n^{-10})$ — an easy consequence of the proof of Lemma 1 (cf. Appendix H.1). Second, on the event $\mathcal{E}_{2}$ , repeating the same proof of Lemma 1 (cf. Appendix H.1), one can deduce that

Comparing (H.2) and (H.3), one arrives at the desired result as long as $np\geq 4C^{2}\varepsilon^{-2}\kappa^{4}\mu r\log n$ .

Appendix I Proofs in Section A

We start with (A.13a). Invoke the triangle inequality to get

where the last relation depends on the unitary invariance of the operator norm and (A.9b). It then boils down to controlling $\|\bm{F}^{\mathsf{d}}-\bm{F}\|$ . Notice that

where the last inequality uses $\|\bm{Y}\|\leq\|\bm{F}\|\leq 2\|\bm{X}^{\star}\|$ (cf. (A.19)), the fact that $\lambda\lesssim\sigma\sqrt{np}$ (see (A.6)), the bound (C.16) and the condition $n^{5}\gg\kappa$ . Apply the perturbation bound for matrix square roots (see Lemma 13) to obtain that

Here, (i) uses the facts that $\|(\bm{X}^{\top}\bm{X})^{-1}\|\lesssim 1/\sigma_{\min}$ and that $\lambda_{\min}[(\bm{I}_{r}+\lambda/p(\bm{X}^{\top}\bm{X})^{-1})^{1/2}]\geq 1$ , and (ii) follows from the condition that $\lambda\lesssim\sigma\sqrt{np}$ (see (A.6)). Combine the above two bounds with $\|\bm{F}\|\leq 2\|\bm{X}^{\star}\|$ (cf. (A.19)) to reach

Moving on to (A.13b), we apply the triangle inequality and (I.3) to see that

In order to control $\|\bm{H}^{\mathsf{d}}-\bm{H}\|$ , we leverage [MWCC17, Lemma 36] to get

where the last relation uses (I.2) and $\left\|\bm{F}^{\star}\right\|\asymp\left\|\bm{X}^{\star}\right\|\asymp\sqrt{\sigma_{\max}}$ . Taking these bounds collectively yields

Now we turn attention to (A.13d). Observe that

where the last bound arises from (A.9c). Going through the same calculation as in bounding $\|\bm{F}^{\mathsf{d}}-\bm{F}\|$ , we arrive at

as long as $\sigma\sqrt{n/p}\ll\sigma_{\min}$ . We can thus continue the upper bound in (I.5) to derive

Here, the second line results from (I.4).

Finally, we deal with (A.13e). From the definition of the de-shrunken estimator (3.8), we have

This combined with the triangle inequality reveals that

Making use of (A.11) and (C.16) allows us to establish the claim.

I.2 Proof of the inequalities (A.16)

The proofs of (A.16a) and (A.16b) are the same as those of (A.13b) and (A.13d), and are hence omitted for conciseness. We are left with (A.16c). Denoting

as long as $\sigma\sqrt{n/p}\ll\sigma_{\min}/\kappa$ . Here the first inequality follows from (A.13a). In addition, we have

Regarding $\theta$ , one can apply the bound (C.16) for $(\bm{X},\bm{Y})$ and a similar bound for $(\bm{X}^{(j)},\bm{Y}^{(j)})$ to obtain

Returning to (I.6), one has by the triangle inequality that

we can apply the perturbation bound for matrix square roots (see Lemma 13) to obtain

where the penultimate relation uses (A.14a) as well as the fact that $\|\bm{F}^{\star}\|_{2,\infty}\geq\sqrt{\sigma_{\min}r/n}$ .

With the above bound in place, we are ready to invoke [CCF+19, Lemma 22] to obtain

where the last line comes from (A.14a). This concludes the proof.

Appendix J Technical lemmas

This section collects a few useful matrix perturbation bounds. The first one is concerned with the perturbation of pseudo-inverses.

Let $\bm{A}^{\dagger}$ (resp. $\bm{B}^{\dagger}$ ) be the pseudo-inverse (i.e. Moore–Penrose inverse) of $\bm{A}$ (resp. $\bm{B}$ ). Then we have

The next lemma focuses on the perturbation bound for matrix square roots.

Consider two symmetric matrices obeying $\bm{A}_{1}\succeq\mu_{1}\bm{I}$ and $\bm{A}_{2}\succeq\mu_{2}\bm{I}$ for some $\mu_{1},\mu_{2}>0$ . Let $\bm{R}_{1}\succeq\bm{0}$ (resp. $\bm{R}_{2}\succeq\bm{0}$ ) be the (principal) matrix square root of $\bm{A}_{1}$ (resp. $\bm{A}_{2}$ ). Then one has

The following lemma concerns the perturbation of top- $r$ components of matrices.

From Wedin’s $\sin\bm{\Theta}$ theorem [Wed72], there exist orthonormal matrices $\bm{R}_{1},\bm{R}_{2}\in\mathcal{O}^{r\times r}$ such that

In addition, Weyl’s inequality tells us that

Here, the second inequality follows from the triangle inequality and the assumption that $\|\bm{E}\|\leq\|\bm{M}\|=\|\bm{\Sigma}\|$ . Expand $\bm{U}\bm{\Sigma}\bm{V}^{\top}-\hat{\bm{U}}\hat{\bm{\Sigma}}\hat{\bm{V}}^{\top}$ and apply the triangle inequality to obtain

Once again, employ (J.1) and (J.2) to arrive at

The last bound centers around the well-known Sylvester equation $\bm{X}\bm{A}+\bm{B}\bm{X}=\bm{C}$ .

as long as $\lambda_{\min}\bm{I}_{r}\preceq\bm{A}\preceq\lambda_{\max}\bm{I}_{r}$ and $\lambda_{\min}\bm{I}_{r}\preceq\bm{B}\preceq\lambda_{\max}\bm{I}_{r}$ for some $\lambda_{\max}\geq\lambda_{\min}>0$ .

To begin with, we intend to show that under the condition $\lambda_{\min}\bm{I}_{r}\preceq\bm{A},\bm{B}\preceq\lambda_{\max}\bm{I}_{r}$ for some $\lambda_{\max}\geq\lambda_{\min}>0$ , there is a unique solution to the matrix equation $\bm{X}\bm{A}+\bm{B}\bm{X}=\bm{C}$ . Use the notation of Kronecker product to obtain an equivalent form of $\bm{X}\bm{A}+\bm{B}\bm{X}=\bm{C}$ as follows

where $\otimes$ denotes the Kronecker product and $\mathsf{vec}(\bm{A})$ stands for the vectorization of the matrix $\bm{A}$ . Given that $\bm{A}\succ\bm{0}$ and $\bm{B}\succ\bm{0}$ , it is straightforward to see that $\bm{A}^{\top}\otimes\bm{I}_{r}+\bm{I}_{r}\otimes\bm{B}$ is invertible, thus justifying the uniqueness of $\bm{X}$ .

The next step is to characterize $\bm{X}$ explicitly. The argument herein is adapted from [Smi68] and [Sch92]. Specifically, it has been shown in [Smi68] that the equation $\bm{X}\bm{A}+\bm{B}\bm{X}=\bm{C}$ is equivalent to

where $\bm{U}=(q\bm{I}_{r}+\bm{B})^{-1}(q\bm{I}_{r}-\bm{B})$ , $\bm{V}=(q\bm{I}_{r}-\bm{A})(q\bm{I}_{r}+\bm{A})^{-1}$ and $\bm{W}=2q(q\bm{I}_{r}+\bm{B})^{-1}\bm{C}(q\bm{I}_{r}+\bm{A})^{-1}$ , for any $q>0$ . In particular, when $q>\lambda_{\min}$ , the matrix

is the unique solution to $\bm{X}-\bm{U}\bm{X}\bm{V}=\bm{W}$ and hence to $\bm{X}\bm{A}+\bm{B}\bm{X}=\bm{C}$ . To show this, it suffices to verify that the matrix series is convergent. Note that when $q>\lambda_{\min}$ , one has

and similarly $\|\bm{V}\|\leq(q-\lambda_{\min})/(q+\lambda_{\max})<1$ . These two bounds taken together immediately establish the convergence of the matrix series (J.5).

In the end, the explicit representation (J.5) allows us to upper bound $\|\bm{X}\|$ . A little algebra reveals that

where we make use of the fact $\|\bm{U}\|\|\bm{V}\|<1$ . In addition, from the definition of $\bm{W}$ we know that

provided that $q>0$ . Combine this with the bounds on $\|\bm{U}\|$ and $\|\bm{V}\|$ to reach