Completing Any Low-rank Matrix, Provably

Yudong Chen, Srinadh Bhojanapalli, Sujay Sanghavi, Rachel Ward

Introduction

Low-rank matrix completion has been the subject of much recent study due to its application in myriad tasks: collaborative filtering, dimensionality reduction, clustering, non negative matrix factorization and localization in sensor networks. Clearly, the problem is ill-posed in general; correspondingly, analytical work on the subject has focused on the joint development of algorithms, and sufficient conditions under which such algorithms are able to recover the matrix.

While they differ in scaling/constant factors, all existing sufficient conditions (Candès and Recht, 2009; Candès and Tao, 2010; Recht, 2009; Keshavan et al., 2010; Gross, 2011; Jain et al., 2012; Negahban and Wainwright, 2012)—with a couple of exceptions we describe in Section 2—require that (a) the subset of observed elements should be uniformly randomly chosen, independent of the values of the matrix elements, and (b) the low-rank matrix be “incoherent” or “not spiky”—i.e., its row and column spaces should be diffuse, having low inner products with the standard basis vectors. Under these conditions, the matrix has been shown to be provably recoverable—via methods based on convex optimization (Candès and Recht, 2009), alternating minimization (Jain et al., 2012), iterative thresholding (Cai et al., 2010), etc.—using as few as Θ(nrlogn)\Theta(nr\log n) observed elements for an n×nn\times n matrix of rank rr.

Actually, the incoherence assumption is required because of the uniform sampling: coherent matrices are those which have most of their mass in a relatively small number of elements. By sampling entries uniformly and independently at random, most of the mass of a coherent low-rank matrix will be missed; this could (and does) throw off all existing recovery methods. One could imagine that if the sampling is adapted to the matrix, roughly in a way that ensures that elements with more mass are more likely to be observed, then it may be possible for existing methods to recover the full matrix.

In this paper, we show that the incoherence requirement can be eliminated completely, provided the sampling distribution is dependent on the matrix to be recovered in the right way. Specifically, we have the following results.

If the probability of an element being observed is proportional to the sum of the corresponding row and column leverage scores (which are local versions of the standard incoherence parameter) of the underlying matrix, then an arbitrary rank-rr matrix can be exactly recovered from Θ(nrlog2n)\Theta(nr\log^{2}n) observed elements with high probability, using nuclear norm minimization (Theorem 2 and Corollary 3). In the case when all leverage scores are uniformly bounded from above, our results reduce to existing guarantees for incoherent matrices using uniform sampling. Our sample complexity bound Θ(nrlog2n)\Theta(nr\log^{2}n) is optimal up to a single factor of log2n\log^{2}n, since the degrees of freedom in an n×nn\times n matrix of rank rr is 2nr2nr. Moreover, we show that to complete a coherent matrix, it is necessary (in certain precise sense) to sample according to the leverage scores as above (Theorem 6).

For a matrix whose column space is incoherent and row space is arbitrarily coherent, our results immediately lead to a provably correct sampling scheme which requires no prior knowledge of the leverage scores of the underlying matrix and has near optimal sampling complexity (Corollary 4).

We provide numerical evidence that a two-phase adaptive sampling strategy, which assumes no prior knowledge about the leverage scores of the underlying matrix, can perform on par with the optimal sampling strategy in completing coherent matrices, and significantly outperforms uniform sampling (Section 4). Specifically, we consider a two-phase sampling strategy whereby given a fixed budget of mm samples, we first draw a fixed proportion of samples uniformly at random, and then draw the remaining samples according to the leverage scores of the resulting sampled matrix.

Using our theoretical results, we are able to quantify the benefit of weighted nuclear norm minimization over standard (unweighted) nuclear norm minimization, and provide a strategy for choosing the weights in such problems given non-uniformly distributed samples so as to reduce the sampling complexity of weighted nuclear norm minimization to that of the unweighted formulation (Theorem 7). Our results give the first exact recovery guarantee for weighted nuclear norm minimization, thus providing theoretical justification for its good empirical performance observed in Salakhutdinov and Srebro (2010); Foygel et al. (2011); Negahban and Wainwright (2012).

Related Work

There is now a vast body of literature on matrix completion, and an even bigger body of literature on matrix approximations; we restrict our literature review here to papers that are most directly related.

Exact Completion, Incoherent Matrices, Random Samples: The first algorithm and theoretical guarantees for exact low-rank matrix completion appeared in Candès and Recht (2009); there it was shown that nuclear norm minimization works when the low-rank matrix is incoherent, and the sampling is uniform random and independent of the matrix. Subsequent works have refined provable completion results for incoherent matrices under the uniform random sampling model, both via nuclear norm minimization (Candès and Tao, 2010; Recht, 2009; Gross, 2011; Chen, 2013), and other methods like SVD followed by local descent (Keshavan et al., 2010) and alternating minimization (Jain et al., 2012), etc. The setting with sparse errors and additive noise is also considered (Candès and Plan, 2010; Chandrasekaran et al., 2011; Chen et al., 2013; Candès et al., 2011; Negahban and Wainwright, 2012).

Weighted sampling in compressed sensing: This paper is similar in spirit to recent work in compressed sensing which shows that sparse recovery guarantees traditionally requiring mutual incoherence can be extended to systems which are only weakly incoherent, without any loss of approximation power, provided measurements from the sensing basis are subsampled according to their coherence with the sparsity basis. This notion of local coherence sampling seems to have originated in Rauhut and Ward (2012) in the context of sparse orthogonal polynomial expansions, and has found applications in uncertainty quantification (Yang and Karniadakis, 2013), interpolation with spherical harmonics (Burq et al., 2012), and MRI compressive imaging (Krahmer and Ward, 2012).

Finally, closely related to our paper is the recent work in Krishnamurthy and Singh (2013), which considers matrix completion where only the row space is allowed to be coherent. The proposed adaptive sampling algorithm selects columns to observe in their entirety and requires a total of O(r2nlogr)O(r^{2}n\log r) observed elements, which is quadratic in rr.

We present our main results for coherent matrix completion in Section 3. In Section 4 we propose a two-phase algorithm that requires no prior knowledge about the underlying matrix’s leverage scores. In Section 5 we provide guarantees for weighted nuclear norm minimization. We provide the proofs of the main theorems in the appendix.

Main Results

The results in this paper hold for what is arguably the most popular approach to matrix completion: nuclear norm minimization. If the true matrix is MM with its (i,j)(i,j)-th element denoted by MijM_{ij}, and the set of observed elements is Ω\Omega, this method guesses as the completion the optimum of the convex program:

where the nuclear norm \|\cdot\|_{*} of a matrix is the sum of its singular values.This becomes the trace norm for positive-definite matrices. It is now well-recognized to be a convex surrogate for the rank function (Fazel, 2002). Throughout, we use the standard notation f(n)=Θ(g(n))f(n)=\Theta(g(n)) to mean that cg(n)f(n)Cg(n)cg(n)\leq f(n)\leq Cg(n) for some positive universal constants c,Cc,C.

We focus on the setting where matrix elements are revealed according an underlying probability distribution. To introduce the distribution of interest, we first need a definition.

For an n1×n2n_{1}\times n_{2} real-valued matrix MM of rank rr whose rank-rr SVD is given by UΣVU\Sigma V^{\top}, its (normalized) leverage scoresIn the matrix sparsification literature (Drineas et al., 2012; Boutsidis et al., 2009) and beyond, the leverage scores of MM often refer to the un-normalized quantities Uei2\left\|U^{\top}e_{i}\right\|^{2} and Vej2\left\|V^{\top}e_{j}\right\|^{2}.—μi(M)\mu_{i}(M) for any row ii, and νj(M)\nu_{j}(M) for any column jj—are defined as

where eie_{i} denotes the ii-th standard basis with appropriate dimension.

Note that the leverage scores are non-negative, and are functions of the column and row spaces of the matrix MM. Since UU and VV have orthonormal columns, we always have relationship iμi(M)r/n1=jνj(M)r/n2=r.\sum_{i}\mu_{i}(M)r/n_{1}=\sum_{j}\nu_{j}(M)r/n_{2}=r. The standard incoherence parameter μ0\mu_{0} of MM used in the previous literature corresponds to a global upper bound on the leverage scores:

Therefore, the leverage scores can be considered as the localized versions of the standard incoherence parameter.

We are ready to state our main result, the theorem below.

Let M=(Mij)M=(M_{ij}) be an n1×n2n_{1}\times n_{2} matrix of rank rr, and suppose that its elements MijM_{ij} are observed only over a subset of elements Ω[n1]×[n2]\Omega\subset[n_{1}]\times[n_{2}]. There is a universal constant c0>0c_{0}>0 such that, if each element (i,j)(i,j) is independently observed with probability pijp_{ij}, and pijp_{ij} satisfies

then MM is the unique optimal solution to the nuclear norm minimization problem (1) with probability at least 15(n1+n2)101-5(n_{1}+n_{2})^{-10}.

We will refer to the sampling strategy (3) as leveraged sampling. Note that the expected number of observed elements is i,jpij\sum_{i,j}p_{ij}, and this satisfies

which is independent of the leverage scores, or indeed any other property of the matrix. Hoeffding’s inequality implies that the actual number of observed elements sharply concentrates around its expectation, leading to the following corollary:

Let M=(Mij)M=(M_{ij}) be an n1×n2n_{1}\times n_{2} matrix of rank rr. Draw a subset Ω\Omega of its elements by leveraged sampling according to the procedure described in Theorem 2. There is a universal constant c0>0c_{0}>0 such that the following holds with probability at least 110(n1+n2)101-10(n_{1}+n_{2})^{-10}: the number mm of revealed elements is bounded by

and MM is the unique optimal solution to the nuclear norm minimization program (1).

(A) Roughly speaking, the condition given in (3) ensures that elements in important rows/columns (indicated by large leverage scores μi\mu_{i} and νj\nu_{j}) of the matrix should be observed more often. Note that Theorem 2 only stipulates that an inequality relation hold between pijp_{ij} and {μi(M),νj(M)}\left\{\mu_{i}(M),\nu_{j}(M)\right\}. This allows for there to be some discrepancy between the sampling distribution and the leverage scores. It also has the natural interpretation that the more the sampling distribution {pij}\left\{p_{ij}\right\} is “aligned” to the leverage score pattern of the matrix, the fewer observations are needed.

(B) Sampling based on leverage scores provides close to the optimal number of sampled elements required for exact recovery (when sampled with any distribution). In particular, recall that the number of degrees of freedom of an n×nn\times n matrix of rank rr is 2nr(1r/2n)2nr(1-r/2n), and knowing the leverage scores of the matrix reduces the degrees of freedom by at most 2n2n. Hence, regardless how the elements are sampled, a minimum of Θ(nr)\Theta(nr) elements is required to recover the matrix. Theorem 2 matches this lower bound, with an additional O(log2(n))O(\log^{2}(n)) factor.

where cc is a constant, then MM will be the unique optimum of (1) with high probability. A direct corollary of our work improves on this result, by removing the need for extra constraints on the joint incoherence; in particular, it is easy to see that our theorem implies that a uniform sampling probability of pcμ0rlog2nnp\geq c\frac{\mu_{0}r\log^{2}n}{n}—that is, with no μstr\mu_{\text{str}}—guarantees recovery of MM with high probability. Note that μstr\mu_{\text{str}} can be as high as μ0r\mu_{0}r, for example, in the case when MM is positive semi-definite; our corollary thus removes this sub-optimal dependence on the rank and on the incoherence parameter. This improvement was recently observed in Chen (2013).

Theorem 2 immediately yields a useful result in scenarios where only the row space of a matrix is coherent and one has control over the sampling of the matrix. This setting is considered before by Krishnamurthy and Singh (2013) and is of interest in applications like recommender systems, network tomography and gene expression analysis.

then MM is the unique optimal solution to the nuclear norm minimization program (1). The total number of samples used by the above procedure is at most c1μ0rnlog2nc_{1}\mu_{0}rn\log^{2}n.

Our result improves on the Θ(μ02r2nlogr)\Theta(\mu_{0}^{2}r^{2}n\log r) sample complexity given in Krishnamurthy and Singh (2013), which is quadratic in μ0\mu_{0} and rr. We also note that our sampling strategy is different from theirs: we sample entire rows of MM, whereas they sample entire columns.

2 Necessity of Leveraged Sampling

{pij}\left\{p_{ij}\right\} is said to be location-invariant with respect to the matrix MM if the following are satisfied: (1) For any two rows iii\neq i^{\prime} that are identical, i.e., Mij=MijM_{ij}=M_{i^{\prime}j} for all jj, we have pij=pijp_{ij}=p_{i^{\prime}j} for all jj; (2) For any two columns jjj\neq j^{\prime} that are identical, i.e., Mij=MijM_{ij}=M_{ij^{\prime}} for all ii, we have pij=pijp_{ij}=p_{ij^{\prime}} for all ii.

In other words, {pij}\left\{p_{ij}\right\} is location-invariant with respect to MM if identical rows (or columns) of MM have identical sampling probabilities. We consider this assumption very mild, and it covers the leveraged sampling as well as many other typical sampling schemes, including:

uniform sampling, where pijpp_{ij}\equiv p,

Given two nn-dimensional vectors μ=(μ1,,μn)\vec{\mu}=\left(\mu_{1},\ldots,\mu_{n}\right) and ν=(ν1,,νn)\vec{\nu}=\left(\nu_{1},\ldots,\nu_{n}\right), we use Mr(μ,ν)\mathcal{M}_{r}\left(\vec{\mu},\vec{\nu}\right) to denote the set of rank-rr matrices whose leverage scores are bounded by μ\vec{\mu} and ν\vec{\nu}; that is,

Suppose nr2n\geq r\geq 2. Given any 2r2r numbers a1,,ara_{1},\ldots,a_{r} and b1,,brb_{1},\ldots,b_{r} with r4k=1r1ak,k=1r1bkr\frac{r}{4}\leq\sum_{k=1}^{r}\frac{1}{a_{k}},\sum_{k=1}^{r}\frac{1}{b_{k}}\leq r and 2rak,bk2nr,k[r]\frac{2}{r}\leq a_{k},b_{k}\leq\frac{2n}{r},\forall k\in[r], there exist two nn-dimensional vectors μ\vec{\mu} and ν\vec{\nu} and the corresponding set Mr(μ,ν)\mathcal{M}_{r}\left(\vec{\mu},\vec{\nu}\right) with the following properties:

For each i,j[n]i,j\in[n], μi=ak\mu_{i}=a_{k} and νj=bk\nu_{j}=b_{k^{\prime}} for some k,k[r]k,k^{\prime}\in[r]. That is, the values of the leverage scores are given by {ak}\left\{a_{k}\right\} and {bk}\left\{b_{k^{\prime}{}}\right\}.

There exists a matrix M(0)Mr(μ,ν)M^{(0)}\in\mathcal{M}_{r}\left(\vec{\mu},\vec{\nu}\right) for which the following holds. If {pij}\{p_{ij}\} is location-invariant w.r.t. M(0)M^{(0)}, and for some (i0,j0)(i_{0},j_{0}),

then with probability at least 14\frac{1}{4}, the following conclusion holds: There are infinitely many matrices M(1)M(0)M^{(1)}\neq M^{(0)} in Mr(μ,ν)\mathcal{M}_{r}\left(\vec{\mu},\vec{\nu}\right) such that {pij}\{p_{ij}\} is location-invariant w.r.t. M(1)M^{(1)}, and

then the conclusion above holds with probability at least 1n\frac{1}{n}.

In other words, if (4) holds, then with probability at least 1/41/4, no method can distinguish between M(0)M^{(0)} and M(1)M^{(1)}; similarly, if (5) holds, then with probability at least 1/n1/n no method succeeds. We shall compare these results with Theorem 2, which guarantees that if we use leveraged sampling,

for some universal constant c0c_{0}, then for any matrix M(0)M^{(0)} in Mr(μ,ν)\mathcal{M}_{r}\left(\vec{\mu},\vec{\nu}\right), the nuclear norm minimization approach (1) recovers M(0)M^{(0)} from its observed elements with failure probability no more than 1n\frac{1}{n}. Therefore, under the setting of Theorem 6, leveraged sampling is sufficient and necessary for matrix completion up to one logarithmic factor for a target failure probability 1n\frac{1}{n} (or up to two logarithmic factors for a target failure probability 14\frac{1}{4}).

Admittedly, the setting covered by Theorem 6 has several restrictions on the sampling distributions and the values of the leverage scores. Nevertheless, we believe this result captures some essential difficulties in recovering general coherent matrices, and highlights how the sampling probabilities should relate in a specific way with the leverage score structure of the underlying object.

A Two-Phase Sampling Procedure

We have seen that one can exactly recover an arbitrary n×nn\times n rank-rr matrix using Θ(nrlog2n)\Theta(nr\log^{2}n) elements if sampled in accordance with the leverage scores. In practical applications of matrix completion, even when the user is free to choose how to sample the matrix elements, she may not be privy to the leverage scores {μi(M),νj(M)}\{\mu_{i}(M),\nu_{j}(M)\}. In this section we propose a two-phase sampling procedure, described below and in Algorithm 1, which assumes no a priori knowledge about the matrix leverage scores, yet is observed to be competitive with the “oracle” leveraged sampling distribution (3).

To understand the performance of the two-phase algorithm, assume that the initial set of m1=βmm_{1}=\beta m samples PΩ(M)\mathcal{P}_{\Omega}(M) are generated uniformly at random. If the underlying matrix MM is incoherent, then already the algorithm will recover MM if m1=Θ(nrlog2(2n))m_{1}=\Theta(nr\log^{2}(2n)). On the other hand, if MM is highly coherent, having almost all energy concentrated on just a few elements, then the estimated leverage scores (6) from uniform sampling in the first step will be poor and hence the recovery algorithm suffers. Between these two extremes, there is reason to believe that the two-phase sampling procedure will provide a better estimate to the underlying matrix than if all mm elements were sampled uniformly. Indeed, numerical experiments suggest that the two-phase procedure can indeed significantly outperform uniform sampling for completing coherent matrices.

We now study the performance of the two-phase sampling procedure outlined in Algorithm 1 through numerical experiments. For this, we consider rank-55 matrices of size 500×500500\times 500 of the form M=DUVDM=DUV^{\top}D, where the elements of the matrices UU and VV are i.i.d. Gaussian N(0,1)\mathcal{N}(0,1) and DD is a diagonal matrix with power-law decay, Dii=iα,1i500.D_{ii}=i^{-\alpha},1\leq i\leq 500. We refer to such constructions as power-law matrices. The parameter α\alpha adjusts the leverage scores (and hence the coherence level) of MM with α=0\alpha=0 being maximal incoherence μ0=Θ(1)\mu_{0}=\Theta(1) and α=1\alpha=1 corresponding to maximal coherence μ0=Θ(n)\mu_{0}=\Theta(n).

Figure 1 suggests that the two-phase algorithm performs comparably to the theoretically optimal leverage scores-based distribution (3), despite not having access to the underlying leverage scores, in the regime of mild to moderate coherence. While the element-wise sampling strategies perform comparably for low values of α\alpha, the number of samples for successful recovery increases quickly for α>0.6\alpha>0.6. Completion from purely uniformly sampled elements requires significantly more samples at higher values of α\alpha.

Choosing β\beta: Recall that the parameter β\beta in Algorithm 1 is the fraction uniform samples used to estimate the leverage scores. Figure 2(a) plots the number of samples required for successful recovery (y-axis) as β\beta (x-axis) varies from 0 to 1 for different values of α\alpha. β=1\beta=1 reduces to purely uniform sampling, and for small values of β\beta, the leverage scores estimated in (6) will be far from the actual leverage scores. Then, as expected, the sample complexity goes up for β\beta near 0 and β=1\beta=1. We find the algorithm performs well for a wide range of β\beta, and setting β2/3\beta\approx 2/3 results in the lowest sample complexity. Surprisingly, even taking β=0.9\beta=0.9 as opposed to pure uniform sampling β=1\beta=1 results in a significant decrease in the sample complexity; see Figure 2(b) for more details. That is, even budgeting just a small fraction of samples to be drawn from the estimated leverage scores can significantly improve the success rate in low-rank matrix recovery as long as the underlying matrix is not completely coherent. In applications like collaborative filtering, this would imply that incentivizing just a small fraction of users to rate a few selected movies according to the estimated leverage score distribution obtained by previous samples has the potential to greatly improve the quality of the recovered matrix of preferences.

In Figure 3 we compare the performance of the two-phase algorithm for different values of the matrix dimension nn, and notice for each nn a phase transition occurring at Θ(nlog(n))\Theta(n\log(n)) samples. In Figure 4 we consider the scenario where the samples are noisy and compare the performance of Algorithm 1 to uniform sampling and the theoretically-optimal leveraged sampling from Theorem 2. Specifically we assume that the samples are generated from M+ZM+Z where ZZ is a Gaussian noise matrix. We consider two values for the noise σ=defZF/MF\sigma\stackrel{{\scriptstyle\textrm{def}}}{{=}}\|Z\|_{F}/\|M\|_{F}: σ=0.1\sigma=0.1 and σ=0.2\sigma=0.2. The figures plot error in Frobenius norm MM^F\|M-\hat{M}\|_{F} (y-axis), vs total number of samples mm (x-axis). These plots demonstrate the robustness of the algorithm to noise and once again show that sampling with estimated leverage scores can be as good as sampling with exact leverage scores for matrix recovery using nuclear norm minimization for α0.7\alpha\leq 0.7.

Weighted Nuclear Norm Minimization

Theorem 2 suggests that the more a set of observed elements is aligned with the leverage scores of a matrix, the better will be the performance of nuclear norm minimization. Interestingly, Theorem 2 can also be used in a reverse way: one may adjust the leverage scores to align with a given set of observations. Here we demonstrate an application of this idea in quantifying the benefit of weighted nuclear norm minimization for non-uniform sampling.

In many applications, the revealed elements are given to us, and distributed non-uniformly among the rows and columns. As observed in Salakhutdinov and Srebro (2010), standard unweighted nuclear norm minimization (1) is inefficient in this setting. They propose to instead use weighted nuclear norm minimization for low-rank matrix completion:

Without lost of generality, assume R1R2Rn1R_{1}\leq R_{2}\leq\cdots\leq R_{n_{1}} and C1C2Cn2C_{1}\leq C_{2}\leq\cdots\leq C_{n_{2}}. There exists a universal constant c0c_{0} such that MM is the unique optimum to (7) with probability at least 15(n1+n2)101-5(n_{1}+n_{2})^{-10} provided that for all i,ji,j, pij1min{n1,n2}10p_{ij}\geq\frac{1}{\min\{n_{1},n_{2}\}^{10}} and

We prove this theorem by drawing a connection between the weighted nuclear norm and the leverage scores (2). Define the scaled matrix Mˉ:=RMC\bar{M}:=RMC. Observe that the program (7) is equivalent to first solving the following unweighted problem with scaled observations

and then rescaling the solution Xˉ\bar{X} to return X^=R1XˉC1\hat{X}=R^{-1}\bar{X}C^{-1}. In other words, through the weighted nuclear norm, we convert the problem of completing MM to that of completing Mˉ\bar{M}. This leads to the following observation, which underlines the proof of Theorem 7:

If we can choose the weights RR and CC such that the leverage scores of Mˉ\bar{M}, denoted as μˉi:=μi(Mˉ),νˉj:=νi(Mˉ),i,j[n]\bar{\mu}_{i}:=\mu_{i}(\bar{M}),\bar{\nu}_{j}:=\nu_{i}(\bar{M}),i,j\in[n], are aligned with the non-uniform observations in a way that roughly satisfies the relation (3), then we gain in sample complexity compared to the unweighted approach.

We now quantify this more precisely for a particular class of matrix completion problems.

Suppose n1=n2=nn_{1}=n_{2}=n and the sampling probabilities have a product form: pij=pirpjcp_{ij}=p_{i}^{r}p_{j}^{c}, with p1rp2rpnrp_{1}^{r}\leq p_{2}^{r}\leq\cdots\leq p_{n}^{r} and p1cp2cpncp_{1}^{c}\leq p_{2}^{c}\leq\cdots\leq p_{n}^{c}. If we choose Ri=1npirjpjcR_{i}=\sqrt{\frac{1}{n}p_{i}^{r}\sum_{j^{\prime}}p_{j^{\prime}}^{c}} and Cj=1npjcipirC_{j}=\sqrt{\frac{1}{n}p_{j}^{c}\sum_{i^{\prime}}p_{i^{\prime}}^{r}}—which is suggested by the condition (8)—Theorem 7 asserts that the following is sufficient for recovery of MM with high probability:

We can compare this condition to that required by unweighted nuclear norm minimization: by Theorem 2, the latter requires

That is, the weighted nuclear norm approach succeeds under much less restrictive conditions. In particular, the unweighted approach imposes a condition on the least sampled row and least sampled column, whereas the condition (10) shows that the weighted approach can use the heavily sampled rows/columns to assist the less sampled. This benefit is most significant precisely when the observations are very non-uniform. Indeed, the advantage of the weighted formulation is empirically observed in Salakhutdinov and Srebro (2010); Foygel et al. (2011) with the weights RR and CC chosen as above using the empirical sampling distribution.

We remark that Theorem 7 is the first exact recovery guarantee for weighted nuclear norm minimization. It provides a theoretical explanation, complementary to those in Salakhutdinov and Srebro (2010); Foygel et al. (2011); Negahban and Wainwright (2012), for why the weighted approach is advantageous over the unweighted approach for non-uniform observations. It also serves as a testament to the power of Theorem 2 as a general result on the relationship between sampling and the coherence/leverage score structure of a matrix.

We would like to thank Petros Drineas, Michael Mahoney and Aarti Singh for helpful discussions. R. Ward was supported in part by an NSF CAREER award, AFOSR Young Investigator Program award, and ONR Grant N00014-12-1-0743. S. Sanghavi would like to acknowledge support from the NSF, ARO and DTRA.

A Proof of Theorem 2

We now turn to the details. To simplify the notion, we prove the results for square matrices (n1=n2=nn_{1}=n_{2}=n). The results for non-square matrices are proved in exactly the same fashion. In the sequel by with high probability (w.h.p.) we mean with probability at least 1n201-n^{-20}. In the proof we will show that each random event holds with high probability, and since there are no more than 5n105n^{10} such events, it follows from the union bound that all the events simultaneously hold with probability at least 15n101-5n^{-10}, which is the success probability in the statement of Theorem 2.

A few additional notations are needed. We drop the dependence of μi(M)\mu_{i}(M) and νj(M)\nu_{j}(M) on MM and simply use μi\mu_{i} and νj\nu_{j}. We use cc and its derivatives (c,c0c^{\prime},c_{0}, etc.) for universal positive constants, which may differ from place to place. The inner product between two matrices is given by Y,Z=\mboxtrace(YZ)\left\langle Y,Z\right\rangle=\mbox{trace}(Y^{\top}Z). Recall that UU and VV are the left and right singular vectors of the underlying matrix MM. We need several standard projection operators for matrices. The projections PTP_{T} and PTP_{T^{\bot}} are given by

Following our proof roadmap, we now state a sufficient condition for MM to be the unique optimal solution to the optimization problem (1). This is the content of Proposition 8 below (proved in Section A.1 to follow).

Suppose pij1n10p_{ij}\geq\frac{1}{n^{10}}. The matrix MM is the unique optimal solution to (1) if the following conditions hold.

PTRΩPTPTop12.\left\|P_{T}R_{\Omega}P_{T}-P_{T}\right\|_{op}\leq\frac{1}{2}.

PT(Y)UVF14n5,\left\|P_{T}(Y)-UV^{\top}\right\|_{F}\leq\frac{1}{4n^{5}},

PT(Y)12\left\|P_{T^{\bot}}(Y)\right\|\leq\frac{1}{2}.

We begin by proving that Condition 1 in Proposition 8 is satisfied under the conditions of Theorem 2. This is done in the following lemma (proved in Section A.2 to follow). The lemma shows that RΩR_{\Omega} is close to the identity operator on TT.

If pijmin{c0(μi+νj)rnlogn,1}p_{ij}\geq\min\{c_{0}\frac{(\mu_{i}+\nu_{j})r}{n}\log n,1\} for all (i,j)(i,j) and a sufficiently large c0c_{0}, then w.h.p.

where the operator RΩkR_{\Omega_{k}} is given by

The dual certificate is given Y:=Wk0.Y:=W_{k_{0}}. Clearly PΩ(Y)=YP_{\Omega}(Y)=Y by construction. The proof of Theorem 2 is completed if we show that under the condition in theorem, YY satisfies Conditions 2(a) and 2(b) in Proposition 8 w.h.p.

Suppose ZZ is a fixed n×nn\times n matrix. For some universal constant c>1c>1, we have w.h.p.

If pijmin{c0(μi+νj)rnlogn,1}p_{ij}\geq\min\{c_{0}\frac{(\mu_{i}+\nu_{j})r}{n}\log n,1\} for all (i,j)(i,j), then we further have w.h.p.

The next two lemmas further control the μ(,2)\mu(\infty,2) and μ()\mu(\infty) norms of a matrix after random projections.

Suppose ZZ is a fixed n×nn\times n matrix. If pijmin{c0(μi+νj)rnlogn,1}p_{ij}\geq\min\{c_{0}\frac{(\mu_{i}+\nu_{j})r}{n}\log n,1\} for all i,ji,j and sufficiently large c0c_{0}, then w.h.p.

Suppose ZZ is a fixed n×nn\times n matrix. If pijmin{c0(μi+νj)rnlogn,1}p_{ij}\geq\min\{c_{0}\frac{(\mu_{i}+\nu_{j})r}{n}\log n,1\} for all i,ji,j and c0c_{0} sufficiently large, then w.h.p.

We prove Lemmas 10–12 in Section A.2. Equipped with the three lemmas above, we are now ready to validate that YY satisfies Condition 2 in Proposition 8.

Set Δk=UVPT(Wk)\Delta_{k}=UV^{\top}-P_{T}(W_{k}) for k=1,,k0k=1,\ldots,k_{0}. By definition of WkW_{k}, we have

Note that Ωk\Omega_{k} is independent of Δk1\Delta_{k-1} and qijpij/k0c0(μi+νj)rlog(n)/nq_{ij}\geq p_{ij}/k_{0}\geq c_{0}^{\prime}(\mu_{i}+\nu_{j})r\log(n)/n under the condition in Theorem 2. Applying Lemma 9 with Ω\Omega replaced by Ωk\Omega_{k} , we obtain that w.h.p.

Applying the above inequality recursively with k=k0,k01,,1k=k_{0,}k_{0}-1,\ldots,1 gives

By definition, YY can be rewritten as Y=k=1k0RΩkPTΔk1.Y=\sum_{k=1}^{k_{0}}R_{\Omega_{k}}P_{T}\Delta_{k-1}. It follows that

We apply Lemma 10 with Ω\Omega replaced by Ωk\Omega_{k} to each summand in the last RHS to obtain w.h.p.

We bound each summand in the last RHS. Applying (k1)(k-1) times (14) and Lemma 12 (with Ω\Omega replaced by Ωk\Omega_{k}), we have w.h.p.

for each kk. Similarly, repeatedly applying (14), Lemma 11 and the inequality we just proved above, we obtain w.h.p.

Note that for all (i,j)(i,j), we have (UV)ij=eiUVejμirnνjrn\left|\left(UV^{\top}\right)_{ij}\right|=\left|e_{i}^{\top}UV^{\top}e_{j}\right|\leq\sqrt{\frac{\mu_{i}r}{n}}\sqrt{\frac{\nu_{j}r}{n}}, eiUV2=μirn\left\|e_{i}^{\top}UV^{\top}\right\|_{2}=\sqrt{\frac{\mu_{i}r}{n}} and UVej2=νjrn\left\|UV^{\top}e_{j}\right\|_{2}=\sqrt{\frac{\nu_{j}r}{n}}. Hence UVμ()1\left\|UV^{\top}\right\|_{\mu(\infty)}\leq 1 and UVμ(,2)=1\left\|UV^{\top}\right\|_{\mu(\infty,2)}=1. We conclude that

provided that the constant c0c_{0} in Theorem 2 is sufficiently large. This completes the proof of Theorem 2.

A.1 Proof of Optimality Condition (Proposition 8)

Proof Consider any feasible solution XX to (1) with PΩ(X)=PΩ(M)P_{\Omega}(X)=P_{\Omega}(M). Let GG be an n×nn\times n matrix which satisfies PTG=1\left\|P_{T^{\bot}}G\right\|=1, and PTG,PT(XM)=PT(XM)\left\langle P_{T^{\bot}}G,P_{T^{\bot}}(X-M)\right\rangle=\left\|P_{T^{\bot}}(X-M)\right\|_{*}. Such GG always exists by duality between the nuclear norm and spectral norm. Because UV+PTGUV^{\top}+P_{T^{\bot}}G is a sub-gradient of the function f(Z)=Zf(Z)=\left\|Z\right\|_{*} at Z=MZ=M, we have

But Y,XM=PΩ(Y),PΩ(XM)=0\left\langle Y,X-M\right\rangle=\left\langle P_{\Omega}(Y),P_{\Omega}(X-M)\right\rangle=0 since PΩ(Y)=YP_{\Omega}(Y)=Y. It follows that

where in the last inequality we use conditions 1 and 2 in the proposition. Using Lemma 13 below, we obtain

The RHS is strictly positive for all XX with PΩ(XM)=0P_{\Omega}(X-M)=0 and XMX\neq M. Otherwise we must have PT(XM)=XMP_{T}(X-M)=X-M and PTPΩPT(XM)=0P_{T}P_{\Omega}P_{T}(X-M)=0, contradicting the assumption PTRΩPTPTop12\left\|P_{T}R_{\Omega}P_{T}-P_{T}\right\|_{op}\leq\frac{1}{2}. This proves that MM is the unique optimum.

If pij1n10p_{ij}\geq\frac{1}{n^{10}} for all (i,j)(i,j) and PTRΩPTPTop12\left\|P_{T}R_{\Omega}P_{T}-P_{T}\right\|_{op}\leq\frac{1}{2}, then we have

Note that RΩ1/2R_{\Omega}^{1/2} is self-adjoint and satisfies RΩ1/2RΩ1/2=RΩR_{\Omega}^{1/2}R_{\Omega}^{1/2}=R_{\Omega}. Hence we have

where the last inequality follows from the assumption PTRΩPTPTop12\left\|P_{T}R_{\Omega}P_{T}-P_{T}\right\|_{op}\leq\frac{1}{2}. On the other hand, PΩ(Z)=0P_{\Omega}(Z)=0 implies RΩ1/2(Z)=0R_{\Omega}^{1/2}(Z)=0 and thus

Combining the last two display equations gives

A.2 Proof of Technical Lemmas

We prove the four technical lemmas that are used in the proof of our main theorem. The proofs use the matrix Bernstein inequality given as Theorem 16 in Section E. We also make frequent use of the following facts: for all ii and jj, we have max{μirn,νjrn}1\max\left\{\frac{\mu_{i}r}{n},\frac{\nu_{j}r}{n}\right\}\leq 1 and

We also use the shorthand ab:=min{a,b}.a\wedge b:=\min\{a,b\}.

Putting together, we have that Sij1c0logn\left\|\mathcal{S}_{ij}\right\|\leq\frac{1}{c_{0}\log n} under the condition of the lemma. On the other hand, we have

A.2.2 Proof of Lemma 10

We can write (RΩI)Z\left(R_{\Omega}-I\right)Z as the sum of independent matrices:

The second part of the lemma follows again from applying the matrix Bernstein inequality.

A.2.3 Proof of Lemma 11

Let X=(PTRΩPT)ZX=\left(P_{T}R_{\Omega}-P_{T}\right)Z. By definition we have

where XaX_{a\cdot} and XbX_{\cdot b} are the aa-th row and bb-th column of of XX, respectively. We bound each term in the maximum. Observe that nνbrXb\sqrt{\frac{n}{\nu_{b}r}}X_{\cdot b} can be written as the sum of independent column vectors:

where we use the triangle inequality and the definition of μi\mu_{i} and νb\nu_{b} . Similarly, if jbj\neq b, we have

where we use pib1c0μirlognnp_{ib}\geq 1\wedge\frac{c_{0}\mu_{i}r\log n}{n} and pib1c0μirnνbrnlognp_{ib}\geq 1\wedge c_{0}\sqrt{\frac{\mu_{i}r}{n}\frac{\nu_{b}r}{n}}\log n in the second inequality. For jbj\neq b, we have

where we use pij1c0μirnνjrnlognp_{ij}\geq 1\wedge c_{0}\sqrt{\frac{\mu_{i}r}{n}\frac{\nu_{j}r}{n}}\log n. We thus obtain Sij22c0lognZμ()\left\|S_{ij}\right\|_{2}\leq\frac{2}{c_{0}\log n}\left\|Z\right\|_{\mu(\infty)} for all (i,j)(i,j).

Applying (26), we can bound the first sum by

where we use pib1c0(μi+νb)rnlognp_{ib}\geq 1\wedge\frac{c_{0}(\mu_{i}+\nu_{b})r}{n}\log n in the second inequality. The second sum can be bounded using (27):

for c0c_{0} sufficiently large. Similarly we can bound nμarXa2\left\|\sqrt{\frac{n}{\mu_{a}r}}X_{a\cdot}\right\|_{2} by the same quantity. We take a union bound over all aa and bb to obtain the desired results.

A.2.4 Proof of Lemma 12

Fix a matrix index (a,b)(a,b) and let wab=μarnνbrnw_{ab}=\sqrt{\frac{\mu_{a}r}{n}\frac{\nu_{b}r}{n}}. We can write

which is the sum of independent zero-mean variables. We first compute the following bound:

where we use the fact that the matrices IUUI-UU^{\top} and IVVI-VV^{\top} have spectral norm at most 11. We proceed to bound sij.\left|s_{ij}\right|. Note that

We distinguish four cases. When i=ai=a and j=bj=b, we use (28) and pab1c0(μa+νb)rlog2(n)np_{ab}\geq 1\wedge\frac{c_{0}\left(\mu_{a}+\nu_{b}\right)r\log^{2}(n)}{n} to obtain sijZij/(wijc0logn)Zμ()/(c0logn).\left|s_{ij}\right|\leq\left|Z_{ij}\right|/\left(w_{ij}c_{0}\log n\right)\leq\left\|Z\right\|_{\mu(\infty)}/\left(c_{0}\log n\right). When i=ai=a and jbj\neq b, we apply (28) to get

where (a)(a) follows from pajmin{c0νjrlognn,1}.p_{aj}\geq\min\left\{c_{0}\frac{\nu_{j}r\log n}{n},1\right\}. In a similar fashion, we can show that the same bound holds when iai\neq a and j=bj=b. When iai\neq a and jbj\neq b, we use (28) to get

where (b)(b) follows from pij1c0μirnνjrnlognp_{ij}\geq 1\wedge c_{0}\sqrt{\frac{\mu_{i}r}{n}\frac{\nu_{j}r}{n}}\log n and max{μirn,νjrn}1\max\left\{\sqrt{\frac{\mu_{i}r}{n}},\sqrt{\frac{\nu_{j}r}{n}}\right\}\leq 1. We conclude that sijZμ()/(c0logn)\left|s_{ij}\right|\leq\left\|Z\right\|_{\mu(\infty)}/\left(c_{0}\log n\right) for all (i,j)(i,j).

We bound each of the four sums. By (28) and pab1c0(μa+νb)rlognn1c0(μa+νb)2r2logn2n2p_{ab}\geq 1\wedge\frac{c_{0}(\mu_{a}+\nu_{b})r\log n}{n}\geq 1\wedge\frac{c_{0}(\mu_{a}+\nu_{b})^{2}r^{2}\log n}{2n^{2}}, we have

By (28) and pajwab2wab2(c0waj2νbrnlogn)p_{aj}w_{ab}^{2}\geq w_{ab}^{2}\wedge\left(c_{0}w_{aj}^{2}\frac{\nu_{b}r}{n}\log n\right), we have

which implies i=a,jbZμ()2/(c0logn)\sum_{i=a,j\neq b}\leq\left\|Z\right\|_{\mu(\infty)}^{2}/(c_{0}\log n). Similarly we can bound ia,j=b\sum_{i\neq a,j=b} by the same quantity. Finally, by (28) and pij1(c0μirnνjrnlogn)p_{ij}\geq 1\wedge\left(c_{0}\frac{\mu_{i}r}{n}\frac{\nu_{j}r}{n}\log n\right), we have

which implies ia,jbZμ()2/(c0logn).\sum_{i\neq a,j\neq b}\leq\left\|Z\right\|_{\mu(\infty)}^{2}/(c_{0}\log n). Combining pieces, we obtain

Applying the Bernstein inequality (Theorem 16), we conclude that

w.h.p. for c0c_{0} sufficiently large. The desired result follows from a union bound over all (a,b)(a,b).

B Proof of Corollary 4

Recall the setting: for each row of MM, we pick it and observe all its elements with some probability pp. We need a simple lemma. Let J[n]J\subseteq[n] be the set of the indices of the row picked, and PJ(Z)P_{J}(Z) be the matrix that is obtained from ZZ by zeroing out the rows outside JJ. Recall that UΣVU\Sigma V^{\top} is the SVD of MM.

If μi(M)=maxinrUei2μ0\mu_{i}(M)=\max_{i}\frac{n}{r}\left\|U^{\top}e_{i}\right\|^{2}\leq\mu_{0} and pc0μ0rlognnp\geq c_{0}\frac{\mu_{0}r\log n}{n} for some universal constant c0c_{0}, then with probability at least 12n101-2n^{-10},

It follows from the matrix Bernstein (Theorem 16) that with probability at least 12n101-2n^{-10}

provided that the c0c_{0} in the statement of the lemma is sufficiently large.

and by Hoeffding’s inequality, the actual number of observations is at most two times the expectation with probability at least 1n101-n^{-10} provided c0c_{0} is sufficiently large. The corollary follows from the union bound.

C Proof of Theorem 6

We prove the theorem assuming k=1r1ak=k=1r1bk=r\sum_{k=1}^{r}\frac{1}{a_{k}}=\sum_{k=1}^{r}\frac{1}{b_{k}}=r; extension to the general setting in the theorem statement will only affect the pre-constant in (4) by a factor of at most 22. For each k[r]k\in[r], let sk:=2nakrs_{k}:=\frac{2n}{a_{k}r}, tk:=2nbkrt_{k}:=\frac{2n}{b_{k}r}. We assume the sks_{k}’s and tkt_{k}’s are all integers. Under the assumption on aka_{k} and bkb_{k}, we have 1sk,tkn1\leq s_{k},t_{k}\leq n and k=1rsk=k=1rtk=n\sum_{k=1}^{r}s_{k}=\sum_{k=1}^{r}t_{k}=n. Define the sets Ik:={l=1r1sl+i:i[sk]}I_{k}:=\left\{\sum_{l=1}^{r-1}s_{l}+i:i\in[s_{k}]\right\} and Jk:={l=1r1tl+j:j[tk]}J_{k}:=\left\{\sum_{l=1}^{r-1}t_{l}+j:j\in[t_{k}]\right\}; note that k=1rIk=k=1rJk=[n]\bigcup_{k=1}^{r}I_{k}=\bigcup_{k=1}^{r}J_{k}=[n]. The vectors μ\vec{\mu} and ν\vec{\nu} are given by

It is clear that μ\vec{\mu} and ν\vec{\nu} satisfy the property 1 in the statement of the theorem.

for all iIki\in I_{k}. All other elements of AA are set to zero. Therefore, the kk-th column of AA has sks_{k} non-zero elements equal to 1sk\sqrt{\frac{1}{s_{k}}}, and the columns of AA have disjoint supports.

for all jJkj\in J_{k}. All other elements of BB are set to zero.

Observe that AA is an orthonormal matrix, so

A similar argument shows that νj(M(0))νj,j[n]\nu_{j}\left(M^{(0)}\right)\leq\nu_{j},\forall j\in[n]. Hence M(0)Mr(μ,ν)M^{(0)}\in\mathcal{M}_{r}\left(\vec{\mu},\vec{\nu}\right). We note that M(0)M^{(0)} is a block diagonal matrix with rr blocks where the kk-th block has size sk×tks_{k}\times t_{k}, and M(0F=r\left\|M^{(0}\right\|_{F}=\sqrt{r}.

Consider the i0i_{0} and j0j_{0} in the statement of the theorem. There must exit some k1,k2[r]k_{1},k_{2}\in[r] such that i0Ik1i_{0}\in I_{k_{1}} and j0Jk2j_{0}\in J_{k_{2}}. Assume w.l.o.g. that sk1tk2s_{k_{1}}\geq t_{k_{2}}. then

where η=μi0r2n=1sk1\eta=\frac{\mu_{i_{0}}r}{2n}=\frac{1}{s_{k_{1}}} in part 2 of the theorem and η=2n\eta=\frac{2}{n} is part 3. Because {pij}\left\{p_{ij}\right\} is location-invariant w.r.t. M(0)M^{(0)}, we have

Let Wi:=({i}×Jk2)ΩW_{i}:=\left|\left(\{i\}\times J_{k_{2}}\right)\cap\Omega\right| be the number of observed elements on {i}×Jk2\{i\}\times J_{k_{2}}. Note that for each iIk1,i\in I_{k_{1}}, we have

where we use 1xe2x,0x121-x\geq e^{-2x},\forall 0\leq x\leq\frac{1}{2} in the second inequality. Therefore, there exists iIk1i^{*}\in I_{k_{1}} for which there is no observed element in {i}×Jk2\left\{i^{*}\right\}\times J_{k_{2}} with probability

Choose a number sˉsk1\bar{s}\geq s_{k_{1}}. Let M(1)=AˉBM^{(1)}=\bar{A}B^{\top}, where BB is the same as before and Aˉ\bar{A} is given by

By varying sˉ\bar{s} we can construct infinitely many such M(1)M^{(1)}. Clearly M(1)M^{(1)} is rank-rr. Observe that M(1)M^{(1)} differs from M(0)M^{(0)} only in {i}×Jk2\left\{i^{*}\right\}\times J_{k_{2}}, which are not observed, so

Moreover, the number of elements that M(0)M^{(0)} and M(1)M^{(1)} differ in is Jk2=tk2=2n/(bk2r)=2n/(νj0r),\left|J_{k_{2}}\right|=t_{k_{2}}=2n/\left(b_{k_{2}}r\right)=2n/\left(\nu_{j_{0}}r\right), and

It is also easy to check that any {pij}\left\{p_{ij}\right\} location-invariant w.r.t. M(0)M^{(0)} is also location-invariant to M(1)M^{(1)}. The following lemma guarantees that M(1)Mr(μ,ν)M^{(1)}\in\mathcal{M}_{r}\left(\vec{\mu},\vec{\nu}\right), which completes the proof of the theorem.

The matrix M(1)M^{(1)} constructed above satisfies

Proof Note that by the definition, the leverage scores of a rank-rr matrix MM with SVD M=UΣVM=U\Sigma V^{\top} can be expressed as

where col(M)\text{col}(M) denotes the column space of MM and Pcol(M)()\mathcal{P}_{\text{col}(M)}(\cdot) is the Euclidean projection onto the column space of MM. A similar relation holds for the row leverage scores and the row space of MM. In other words, the column/row leverage scores of a matrix are determined by its column/row space. Because M(0)M^{(0)} and M(1)M^{(1)} have the same row space (which is the span of the columns of BB), the second set of equalities in the lemma hold.

It remains to prove the first set of inequalities for the column leverage scores. If k1=k2k_{1}=k_{2}, then the columns of Aˉ\bar{A} have unit norms and are orthogonal to each other. Using the above expression for the leverage scores, we have

D Proof of Theorem 7

Suppose the rank-rr SVD of Mˉ\bar{M} is UˉΣˉVˉ\bar{U}\bar{\Sigma}\bar{V}^{\top}; so UˉΣˉVˉ=RMC=RUΣVC\bar{U}\bar{\Sigma}\bar{V}^{\top}=RMC=RU\Sigma V^{\top}C. By definition, we have

where σr()\sigma_{r}(\cdot) denotes the rr-th singular value and the last inequality follows from the standard incoherence assumption maxi,j{μi,νj}μ0\max_{i,j}\{\mu_{i},\nu_{j}\}\leq\mu_{0}. We now bound σr(RU)\sigma_{r}\left(RU\right). Since RURU has rank rr, we have

If we let zi:=eiUx2z_{i}:=\left|e_{i}^{\top}Ux\right|^{2} for each i[n]i\in[n], then ziz_{i} satisfies

and by the standard incoherence assumption,

Therefore, the value of the minimization above is lower-bounded by

From the theory of linear programming, we know the minimum is achieved at an extreme point zz^{*} of the feasible set. The extreme point zz^{*} satisfies zi0,iz_{i}^{*}\geq 0,\forall i and nn linear equalities

for some index sets I1I_{1} and I2I_{2} such that I1I2=φI_{1}\cap I_{2}=\varphi,I1+I2=n1\left|I_{1}\right|+\left|I_{2}\right|=n-1. It is easy to see that we must have I2=nμ0r\left|I_{2}\right|=\left\lfloor\frac{n}{\mu_{0}r}\right\rfloor. Since R1R2RnR_{1}\leq R_{2}\leq\ldots\leq R_{n}, the minimizer zz^{*} has the form

and the value of the minimization (30) is at least

This proves that σr2(RU)μ0rni=1n/(μ0r)Ri2.\sigma_{r}^{2}\left(RU\right)\geq\frac{\mu_{0}r}{n}\sum_{i=1}^{\left\lfloor n/(\mu_{0}r)\right\rfloor}R_{i}^{2}. Combining with (29), we obtain that

the proof for νˉj\bar{\nu}_{j} is similar. Applying Theorem 2 to the equivalent problem (9) with the above bounds on μˉi\bar{\mu}_{i} and νˉj\bar{\nu}_{j} proves the theorem.

E Matrix Bernstein Inequality

and XkB\left\|X_{k}\right\|\leq B almost surely for all kk. Then for any c>0c>0, we have

with probability at least 1(n1+n2)(c1).1-(n_{1}+n_{2})^{-(c-1)}.

References