Low-rank Matrix Completion using Alternating Minimization

Prateek Jain, Praneeth Netrapalli, Sujay Sanghavi

Introduction

Due to the above two reasons, in several applications, the target matrix XX is parameterized by X=UVX=UV^{\dagger}. For example, clustering , sparse PCA etc.

Using the bi-linear parametrization of the target matrix XX, the task of estimating XX now reduces to finding UU and VV that, for example, minimize an error metric. The resulting problem is typically non-convex due to bi-linearity. Correspondingly, a popular approach has been to use alternating minimization: iteratively keep one of U,VU,V fixed and optimize over the other, then switch and repeat, see e.g. . While the overall problem is non-convex, each sub-problem is typically convex and can be solved efficiently.

Despite wide usage of bi-linear representation and alternating minimization, there has been to date almost no theoretical understanding of when such a formulation works. Motivated by this disconnect between theory and practice in the estimation of low-rank matrices, in this paper we provide the first guarantees for performance of alternating minimization, for two low-rank matrix recovery problems: matrix completion, and matrix sensing.

Matrix completion involves completing a low-rank matrix, by observing only a few of its elements. Its recent popularity, and primary motivation, comes from recommendation systems , where the task is to complete a user-item ratings matrix using only a small number of ratings. As elaborated in Section 2, alternating minimization becomes particularly appealing for this problem as it provides a fast, distributed algorithm that can exploit both sparsity of ratings as well as the low-rank bi-linear parametrization of XX.

Our Results

In this section, we will first define the matrix sensing problem, and present our results for it. Subsequently, we will do the same for matrix completion. The matrix sensing setting – i.e. recovery of any low-rank matrix from linear measurements that satisfy matrix RIP – represents an easier analytical setting than matrix completion, but still captures several key properties of the problem that helps us in developing an analysis for matrix completion.We note that for either problem, ours represent the first global optimality guarantees for alternating minimization based algorithms.

As in the existing work on this problem, we are interested in the under-determined case, where d<mnd<mn. Note that this problem is a strict generalization of the popular compressed sensing problem ; compressed sensing represents the case when MM is restricted to be a diagonal matrix.

As mentioned earlier, alternating minimization algorithm for matrix sensing now alternately solves for UU and VV while fixing the other factor. See Algorithm 1 for a pseudo-code of AltMinSense algorithm that we analyze.

We note two key properties of AltMinSense : a) Each minimization – over UU with VV fixed, and vice versa – is a simple least-squares problem, which can be solved in time O(dn2k2+n3k3)O(dn^{2}k^{2}+n^{3}k^{3})Throughout this paper, we assume mnm\leq n., b) We initialize U0U^{0} to be the top-kk left singular vectors of iAibi\sum_{i}A_{i}b_{i} (step 2 of Algorithm 1). As we will see later in Section 4, this provides a good initialization point for the sensing problem which is crucial; if the first iterate U^0\widehat{U}^{0} is orthogonal, or almost orthogonal, to the true UU^{*} subspace, AltMinSense may never converge to the true space (this is easy to see in the simplest case, when the map is identity, i.e. A(X)=X{\mathcal{A}}(X)=X – in which case AltMinSense just becomes the power method).

In general, since d<mnd<mn, problem (LRMS) is not well posed as there can be multiple rank-kk solutions that satisfy A(X)=b\mathcal{A}(X)=b. However, inspired by a similar condition in compressed sensing , Recht et al. showed that if the linear map A{\mathcal{A}} satisfies a (matrix) restricted isometry property (RIP), then a trace-norm based convex relaxation of (LRMS) leads to exact recovery. This property is defined below.

Several random matrix ensembles with sufficiently many measurements (dd) satisfy matrix RIP . For example, if d=Ω(1δk2knlogn)d=\Omega(\frac{1}{\delta_{k}^{2}}kn\log n) and each entry of AiA_{i} is sampled i.i.d. from a -mean sub-Gaussian distribution then kk-RIP is satisfied with RIP constant δk\delta_{k}.

We now present our main result for AltMinSense.

The above theorem establishes geometric convergence (in O(log(1/ϵ))O(\log(1/\epsilon)) steps) of AltMinSense to the optimal solution of (LRMS) under standard RIP assumptions. This is in contrast to existing iterative methods for trace-norm minimization all of which require at least O(1ϵ)O(\frac{1}{\sqrt{\epsilon}}) steps; interior point methods for trace-norm minimization converge to the optimum in O(log(1/ϵ))O(\log(1/\epsilon)) steps but require storage of the full m×nm\times n matrix and require O(n5)O(n^{5}) time per step, which makes it infeasible for even moderate sized problems.

Recently, several projected gradient based methods have been developed for matrix sensing that also guarantee convergence to the optimum in O(log(1/ϵ))O(\log(1/\epsilon)) steps. But each iteration in these algorithms requires computation of the top kk singular components of an m×nm\times n matrix, which is typically significantly slower than solving a least squares problem (as required by each iteration of AltMinSense). Stagewise AltMinSense Algorithm: A drawback of our analysis for AltMinSense is the dependence of δ2k\delta_{2k} on the condition number (κ=σ1σk\kappa=\frac{\sigma^{*}_{1}}{\sigma^{*}_{k}}) of MM, which implies that the number of measurements dd required by AltMinSense grows quadratically with κ\kappa. We address this issue by using a stagewise version of AltMinSense (Algorithm 3) for which we are able to obtain near optimal measurement requirement.

The key idea behind our stagewise algorithm is that if one of the singular vectors of MM is very dominant, then we can treat the underlying matrix as a rank-11 matrix plus noise and approximately recover the top singular vector. Once we remove this singular vector from the measurements, we will have a relatively well-conditioned problem. Hence, at each stage of Algorithm 3, we seek to remove the remaining most dominant singular vector of MM. See Section 6 for more details; here we state the corresponding theorem regarding the performance of Stage-AltMin.

where T=Ω(log(MF2/ϵ))T=\Omega\left(\log(\|M\|_{F}^{2}/\epsilon)\right). That is, the TT-th step iterates of the kk-th stage, satisfy: MU^1:kT(V1:kT)F2ϵ.\|M-\widehat{U}^{T}_{1:k}\left(V^{T}_{1:k}\right)^{{\dagger}}\|_{F}^{2}\leq\epsilon.

The above theorem guarantees exact recovery using O(k4nlogn)O(k^{4}n\log n) measurements which is only O(k3)O(k^{3}) worse than the information theoretic lower bound. We also note that for simplicity of analysis, we did not optimize the constant factors in δ2k\delta_{2k}.

Matrix Completion via Alternating Minimization

Nevertheless, like matrix sensing, matrix completion has been shown to be possible once additional conditions are applied to the low-rank matrix MM and the observation set Ω\Omega. Starting with the first work , the typical assumption has been to have Ω\Omega generated uniformly at random, and MM to satisfy a particular incoherence property that, loosely speaking, makes it very far from a sparse matrix. In this paper, we show that once such assumptions are made, alternating minimization also succeeds. We now restate, and subsequently use, this incoherence definition.

where M=UΣVTM=U\Sigma V^{T} is the SVD of MM and u(i)u^{(i)}, v(j)v^{(j)} denote the ithi^{\textrm{th}} row of UU and the jthj^{\textrm{th}} row of VV respectively.

The alternating minimization algorithm can be viewed as an approximate way to solve the following non-convex problem:

Similar to AltMinSense, the altmin procedure proceeds by alternatively solving for UU and VV. As noted earlier, this approach has been popular in practice and has seen several variants and extensions being used in practice . However, for ease of analysis, our algorithm further modifies the standard alternating minimization method. In particular, we introduce partitioning of the observed set Ω\Omega, so that we use different partitions of Ω\Omega in each iteration. See Algorithm 2 for a pseudo-code of our variant of the alternating minimization approach.

We now present our main result for (LRMC):

where δ2kσk12kσ1\delta_{2k}\leq\frac{\sigma^{*}_{k}}{12k\sigma^{*}_{1}} and C>0C>0 is a global constant. Then w.h.p. for T=ClogMFϵT=C^{\prime}\log\frac{\left\|{M}\right\|_{\text{F}}}{\epsilon}, the outputs U^T\widehat{U}^{T} and VTV^{T} of Algorithm 2, with input (Ω,PΩ(M))(\Omega,P_{\Omega}(M)) (see Equation (2)) satisfy: MU^T(VT)Fϵ.\left\|{M-\widehat{U}^{T}\left(V^{T}\right)^{{\dagger}}}\right\|_{\text{F}}\leq\epsilon.

The above theorem implies that by observing Ω=O((σ1σk)4k4.5nlognlog(kMF/ϵ))\left|\Omega\right|=O\left((\frac{\sigma^{*}_{1}}{\sigma^{*}_{k}})^{4}k^{4.5}n\log n\log(k\|M\|_{F}/\epsilon)\right) random entries of an incoherent MM, AltMinComplete can recover MM in O(log(1/ϵ))O(\log(1/\epsilon)) steps. In terms of sample complexity (Ω|\Omega|), our results show alternating minimization may require a bigger Ω\Omega than convex optimization, as our result has Ω|\Omega| depend on the condition number, required accuracy (ϵ\epsilon) and worse dependence on kk than known bounds. In contrast, trace-norm minimization based methods require O(knlogn)O(kn\log n) samples only.

Empirically however, this is not seen to be the case and we leave further tightening of the sample complexity bounds for matrix completion as an open problem.

In terms of time complexity, we show that AltMinComplete needs time O(Ωk2log(1/ϵ))O(|\Omega|k^{2}\log(1/\epsilon)). This is in contrast to popular trace-norm minimization based methods that need O(1/ϵ)O(1/\sqrt{\epsilon}) steps and total time complexity of O(Ωn/ϵ)O(|\Omega|n/\sqrt{\epsilon}); note that the latter can be potentially quadratic in nn. Furthermore, each step of such methods requires computation of the SVD of an m×nm\times n matrix. As mentioned earlier, interior point methods for trace-norm minimization also converge in O(log(1/ϵ))O(\log(1/\epsilon)) steps but each iteration requires O(n5)O(n^{5}) steps and need storage of the entire m×nm\times n matrix XX.

Related Work

Alternating Minimization: Alternating minimization and its variants have been applied to several low-rank matrix estimation problems. For example, clustering , sparse PCA , non-negative matrix factorization , signed network prediction etc. There are three main reasons for such wide applicability of this approach: a) low-memory footprint and fast iterations, b) flexible modeling, c) amenable to parallelization. However, despite such empirical success, this approach has largely been used as a heuristic and has had no theoretical analysis other than the guarantees of convergence to the local minima . Ours is the first analysis of this approach for two practically important problems: a) matrix completion, b) matrix sensing.

Matrix Completion: This is the problem of completing a low-rank matrix from a few sampled entries. Candes and Recht provided the first results on this problem, showing that under the random sampling and incoherence conditions (detailed above), O(kn1.2logn)O(kn^{1.2}\log n) samples allow for recovery via convex trace-norm minimization; this was improved to O(knlogn)O(kn\log n) in . For large matrices, this approach is not very attractive due to the need to store and update the entire matrix, and because iterative methods for trace norm minimization require O(1ϵ)O(\frac{1}{\sqrt{\epsilon}}) steps to achieve additive error of ϵ\epsilon. Moreover, each such step needs to compute an SVD.

Another approach, in , involved taking a single SVD, followed by gradient descent on a Grassmanian manifold. However, (a) this is more expensive than alternating minimization as it needs to compute gradient over Grassmanian manifold which in general is a computationally intensive step, and (b) the analysis of the algorithm only guarantees asymptotic convergence, and in the worst case might take exponential time in the problem size.

Recently, several other matrix completion type of problems have been studied in the literature. For example, robust PCA , spectral clustering etc. Here again, under additional assumptions, convex relaxation based methods have rigorous analysis but alternating minimization based algorithms continue to be algorithms of choice in practice.

Matrix Sensing: The general problem of matrix sensing was first proposed by . They established recovery via trace norm minimization, assuming the sensing operator satisfies “restricted isometry” conditions. Subsequently, several other methods were proposed for this problem that also recovers the underlying matrix with optimal number of measurements and can give an ϵ\epsilon-additive approximation in time O(log(1/ϵ)O(\log(1/\epsilon). But, similar to matrix completion, most of these methods require computing SVD of a large matrix at each step and hence have poor scalability to large problems.

We show that AltMinSense and AltMin-Completion provide more scalable algorithms for their respective problems. We demonstrate that these algorithms have geometric convergence to the optima, while each iteration is relatively cheap. For this, we assume conditions similar to those required by existing algorithms; albeit, with one drawback: number of samples required by our analysis depend on the condition number of the underlying matrix MM. For the matrix sensing problem, we remove this requirement by using a stagewise algorithm; we leave similar analysis for matrix completion as an open problem.

Matrix Sensing

In this section, we study the matrix sensing problem (LRMS) and prove that if the measurement operator, A{\mathcal{A}}, satisfies RIP then AltMinSense (Algorithm 1) recovers the underlying low-rank matrix exactly (see Theorem 2.2).

At a high level, we prove Theorem 2.2 by showing that the “distance” between subspaces spanned by V^t\widehat{V}^{t} (iterate at time tt) and VV^{*} decreases exponentially with tt. This done based on the observation that once the (standard) matrix RIP condition (Definition 2.1) holds, alternating minimization can be viewed, and analyzed, as a perturbed version of the power method. This is easiest to see for the rank-1 case below; we detail this proof, and then the more general rank-kk case.

In this paper, we use the following definition of distance between subspaces:

where UU and WW are orthonormal bases of the spaces Span(U^)\textrm{Span}\left(\widehat{U}\right) and Span(W^)\textrm{Span}\left(\widehat{W}\right), respectively. Similarly, UU_{\perp} and WW_{\perp} are any orthonormal bases of the perpendicular spaces Span(U)\textrm{Span}\left(U\right)^{\perp} and Span(W)\textrm{Span}\left(W\right)^{\perp}, respectively.

We now present a theorem that bounds the distance between the subspaces spanned by V^t\widehat{V}^{t} and VV^{*} and show that it decreases exponentially with tt.

Let b=A(M)b={\mathcal{A}}(M) where MM and A{\mathcal{A}} satisfy assumptions given in Theorem 2.2. Then, the (t+1)(t+1)-th iterates U^t+1\widehat{U}^{t+1}, V^t+1\widehat{V}^{t+1} of AltMinSense satisfy:

Using Theorem 4.2, we are now ready to prove Theorem 2.2.

Assuming correctness of Theorem 4.2, Theorem 2.2 follows by using the following set of inequalities:

To complete the proof of Theorem 2.2, we now need to prove Theorem 4.2. In the next section, we illustrate the main ideas of the proof of Theorem 4.2 by applying it to a rank-11 matrix i.e., when k=1k=1. We then provide a proof of Theorem 4.2 for arbitrary kk in Section 4.2.

Consider the tt-th update step in the AltMinSense procedure. As v^t+1=\widehat{v}^{t+1}= argminv^i=1d(u^tAiv^σuAiv)2\operatorname*{argmin}_{\widehat{v}}\sum_{i=1}^{d}\left(\widehat{u}^{t{\dagger}}A_{i}^{\dagger}\widehat{v}-\sigma^{*}u^{*^{\dagger}}A_{i}^{\dagger}v^{*}\right)^{2}, setting the gradient of the above objective function to , we obtain:

where ut=u^t/u^t2u^{t}=\widehat{u}^{t}/\|\widehat{u}^{t}\|_{2}. Now, let B=i=1dAiut(ut)AiB=\sum_{i=1}^{d}A_{i}u^{t}(u^{t})^{\dagger}A_{i}^{\dagger} and C=i=1dAiut(u)AiC=\sum_{i=1}^{d}A_{i}u^{t}(u^{*})^{\dagger}A_{i}^{\dagger}. Then,

Note that the first term in the above expression is the power method iterate (i.e., MutM^{{\dagger}}u^{t}). The second term is an error term and the goal is to show that it becomes smaller as utu^{t} gets closer to uu^{*}. Note that when ut=uu^{t}=u^{*}, the error term is irrespective of the measurement operator A{\mathcal{A}}.

Below, we provide a precise bound on the error term:

Consider the error term defined in (4) and let A{\mathcal{A}} satisfy 22-RIP with constant δ2\delta_{2}. Then,

See Appendix B.1 for a detailed proof of the above lemma.

Using the above lemma, we now finish the proof of Theorem 4.2:

Let vt+1=v^t+1/v^t+12v^{t+1}=\widehat{v}^{t+1}/\|\widehat{v}^{t+1}\|_{2}. Now, using (4) and Lemma 4.3:,

where δ^2=3δ213δ2\widehat{\delta}_{2}=\frac{3\delta_{2}}{1-3\delta_{2}}. That is,

Therefore, u0,u12δ25δ^2\langle u^{0},u^{*}\rangle\geq\sqrt{1-2\delta_{2}}\geq 5\widehat{\delta}_{2} assuming δ21100\delta_{2}\leq\frac{1}{100}. ∎

2 Rank-k𝑘k Case

In this section, we present the proof of Theorem 4.2 for arbitrary kk, i.e., when MM is a rank-kk matrix (with SVD UΣ(V))U^{*}\Sigma^{*}\left(V^{*}\right)^{{\dagger}}).

Similar to the analysis for the rank-11 case (Section 4.1), we show that even for arbitrary kk, the updates of AltMinSense are essentially power-method type updates but with a bounded error term whose magnitude decreases with each iteration.

However, directly analyzing iterates of AltMinSense is a bit tedious due to non-orthonormality of intermediate iterates U^\widehat{U}. Instead, for analysis only we consider the iterates of a modified version of AltMinSense, where we explicitly orthonormalize each iterate using the QR-decompositionThe QR decomposition factorizes a matrix into an orthonormal matrix (a basis of its column space) and an upper triangular matrix; that is given S^\widehat{S} it computes S^=SR\widehat{S}=SR where SS has orthonormal columns and RR is upper triangular. If S^\widehat{S} is full-rank, so are SS and RR.. In particular, suppose we replace steps 4 and 5 of AltMinSensewith the following

Let U^t\widehat{U}^{t} be the ttht^{th} iterate of our AltMinSense algorithm, and U~t\widetilde{U}^{t} of the modified version stated above. Suppose also that both U^t,U~t\widehat{U}^{t},\widetilde{U}^{t} are full-rank, and span the same subspace. Then the same will be true for the subsequent iterates for the two algorithms, i.e. Span(V^t+1)=Span(V~t+1)Span(\widehat{V}^{t+1})=Span(\widetilde{V}^{t+1}), Span(U^t+1)=Span(U~t+1)Span(\widehat{U}^{t+1})=Span(\widetilde{U}^{t+1}), and all matrices at iterate t+1t+1 will be full-rank.

The proof of the above lemma can be found in Appendix B.2. In light of this, we will now prove Theorem 4.2 with the new QR-based iterates (5).

Let U^t\widehat{U}^{t} be the tt-th step iterate of AltMinSense and let Ut,V^t+1U^{t},\widehat{V}^{t+1} and Vt+1V^{t+1} be obtained by Update (5). Then,

where FF is an error matrix defined in (18) and R(t+1)R^{(t+1)} is a triangular matrix obtained using QR-decomposition of V^t+1\widehat{V}^{t+1}.

See Appendix B for a detailed proof of the above lemma.

Note that the notation above should have been Bt,CtB^{t},C^{t} and so on. We suppress the dependence on tt for notational simplicity. Now, from Update (6), we have

where VV_{\perp}^{*} is an orthonormal basis of Span(v1,v2,,vk)\textrm{Span}\left(v_{1}^{*},v_{2}^{*},\cdots,v_{k}^{*}\right)^{\perp}. Therefore,

Now, we break down the proof of Theorem 4.2 into the following two steps:

show that F(Σ)12\left\|{F(\Sigma^{*})^{-1}}\right\|_{2} is small (Lemma 4.6) and

show that ΣR(t+1)12\|\Sigma^{*}{R^{(t+1)}}^{-1}\|_{2} is small(Lemma 4.7).

We will now state the two corresponding lemmas. Complete proofs can be found in Appendix B.2 The first lemma bounds the spectral norm of F(Σ)1F(\Sigma^{*})^{-1}.

The following lemma bounds the spectral norm of ΣR(t+1)1\Sigma^{*}R^{(t+1)^{-1}}.

With the above two lemmas, we now prove Theorem 4.2.

Using (19), (20) and (21), we obtain the following:

Matrix Completion

In this section, we study the Matrix Completion problem (LRMC) and show that, assuming kk and σ1σk\frac{\sigma^{*}_{1}}{\sigma^{*}_{k}} are constant, AltMinComplete (Algorithm 2) recovers the underlying matrix MM using only O(nlogn)O(n\log n) measurements (i.e., we prove Theorem 2.5).

As mentioned, while observing elements in Ω\Omega constitutes a linear map, matrix completion is different from matrix sensing because the map does not satisfy RIP. The (now standard) approach is to assume incoherence of the true matrix MM, as done in Definition 2.4. With this, and the random sampling of Ω\Omega, matrix completion exhibits similarities to matrix sensing. For our analysis, we can again use the fact that incoherence allows us to view alternating minimization as a perturbed power method, whose error we can control.

However, there are important differences between the two problems, which make the analysis of completion more complicated. Chief among them is the fact that we need to establish the incoherence of each iterate. For the first initialization U^0\widehat{U}^{0}, this necessitates the “clipping” procedure (described in step 4 of the algorithm). For the subsequent steps, this requires the partitioning of the observed Ω\Omega into 2T+12T+1 sets (as described in step 2 of the algorithm).

As in the case of matrix sensing, we prove our main result for matrix completion (Theorem 2.5) by first establishing a geometric decay of the distance between the subspaces spanned by U^t,V^t\widehat{U}^{t},\widehat{V}^{t} and U,VU^{*},V^{*} respectively.

Under the assumptions of Theorem 2.5, the (t+1)th(t+1)^{\textrm{th}} iterates U^t+1\widehat{U}^{t+1} and V^t+1\widehat{V}^{t+1} satisfy the following property w.h.p.:

We use the above result along with incoherence of MM to prove Theorem 2.5. See Appendix C for a detailed proof.

Now, similar to the matrix sensing case, alternating minimization needs an initial iterate that is close enough to UU^{*} and VV^{*}, from where it will then converge. To this end, Steps 343-4 of Algorithm 2 use SVD of PΩ(M)P_{\Omega}(M) followed by clipping to initialize U^0\widehat{U}^{0}. While the SVD step guarantees that U^0\widehat{U}^{0} is close enough to UU^{*}, it might not remain incoherent. To maintain incoherence, we introduce an extra clipping step which guarantees incoherence of U^0\widehat{U}^{0} while also ensuring that U^0\widehat{U}^{0} is close enough to UU^{*} (see Lemma 5.2)

Let M,Ω,pM,\Omega,p be as defined in Theorem 2.5. Also, let U0U^{0} be the initial iterate obtained by step 44 of Algorithm 2. Then, w.h.p. we have

U0U^{0} is incoherent with parameter 4μk4\mu\sqrt{k}.

The above lemma guarantees a “good” starting point for alternating minimization. Using this, we now present a proof of Theorem 5.1. Similar to the sensing section, we first explain key ideas of our proof using rank-11 example. Then in Section 5.2 we extend our proof to general rank-kk matrices.

Consider the rank-11 matrix completion problem where M=σu(v)M=\sigma^{*}u^{*}(v^{*})^{\dagger}. Now, the tt-th step iterates v^t+1\widehat{v}^{t+1} of Algorithm 2 are given by:

Let ut=u^t/u^t2u^{t}=\widehat{u}^{t}/\|\widehat{u}^{t}\|_{2}. Then, j\forall j:

Note the similarities between the update (25) and the rank-11 update (4) for the sensing case. Here again, it is essentially a power-method update (first term) along with a bounded error term (see Lemma 5.3). Using this insight, we now prove Theorem 5.1 for the special case of rank-11 matrices. Our proof can be divided in three major steps:

Base Case: Show that u0=u^0/u^02u^{0}=\widehat{u}^{0}/\|\widehat{u}^{0}\|_{2} is incoherent and have small distance to uu^{*} (see Lemma 5.2).

Induction Step (distance): Assuming ut=u^t/u^t2u^{t}=\widehat{u}^{t}/\|\widehat{u}^{t}\|_{2} to be incoherent and that utu^{t} has a small distance to uu^{*}, vt+1v^{t+1} decreases distances to vv^{*} by at least a constant factor.

Induction Step (incoherence): Show incoherence of vt+1v^{t+1}, while assuming incoherence of utu^{t} (see Lemma 5.4)

We first prove the second step of our proof. To this end, we provide the following lemma that bounds the error term. See Appendix C.2 for a proof of the below given lemma.

Let MM, pp, Ω\Omega, utu^{t} be as defined in Theorem 2.5. Also, let utu^{t} be a unit vector with incoherence parameter μ1=6(1+δ2)μ1δ2\mu_{1}=\frac{6(1+\delta_{2})\mu}{1-\delta_{2}}.Then, w.p. at least 11n31-\frac{1}{n^{3}}:

Multiplying (25) with vv^{*} and using Lemma 5.3, we get:

where δ2<112\delta_{2}<\frac{1}{12} is a constant defined in the Theorem statement and is similar to the RIP constant in Section 4.

Similarly, by multiplying (25) with vv_{\perp} (where v,v=0\langle v_{\perp}^{*},v^{*}\rangle=0 and v2=1\|v_{\perp}^{*}\|_{2}=1) and using Lemma 5.3:

Assuming, vt+1,v6δ2\langle v^{t+1},v^{*}\rangle\geq 6\delta_{2},

We now provide the following lemma to prove the third step. We stress that vt+1v^{t+1} does not increase the incoherence parameter (μ1\mu_{1}) when compared to that of utu^{t}.

Let MM, pp, Ω\Omega be as defined in Theorem 2.5. Also, let utu^{t} be a unit vector with incoherence parameter μ1=6(1+δ2)μ1δ2\mu_{1}=\frac{6(1+\delta_{2})\mu}{1-\delta_{2}}. Then, w.p. at least 11n31-\frac{1}{n^{3}}, vt+1v^{t+1} is also μ1\mu_{1} incoherent.

See Appendix C.2 for a detailed proof of the lemma.

Finally, for the base case we need that u0u^{0} is μ1\mu_{1} incoherent and also u0,u6δ2\langle u^{0},u^{*}\rangle\geq 6\delta_{2}. This follows directly by using Lemma 5.2 and the fact that δ21/12\delta_{2}\leq 1/12.

Note that, to obtain an error of ϵ\epsilon, AltMinComplete needs to run for O(logMFϵ)O\left(\log\frac{\|M\|_{F}}{\epsilon}\right) iterations. Also, we need to sample a fresh Ω\Omega at each iteration of AltMinComplete. Hence, the total number of samples needed by AltMinComplete is O(logMFϵ)O\left(\log\frac{\|M\|_{F}}{\epsilon}\right) larger than the number of samples required per step.

2 Rank-k𝑘k case

We now extend our proof of Theorem 5.1 to matrices with arbitrary rank. Here again, we show that the AltMinComplete algorithm reduces to power method with bounded perturbation at each step.

Similar to the matrix sensing case, we analyze the following QR decomposition based update instead of directly analyzing the updates of Algorithm 2:

Here again, we would stress that the updates output exactly the same matrices at the end of each iteration and we prefer QR-based updates due to notational ease.

Now, as matrix completion is a special case of matrix sensing, Lemma 4.5 characterizes the updates of the AltMinComplete algorithm (see Algorithm 2). That is,

where FF is the error matrix defined in (18) and R(t+1)R^{(t+1)} is a upper-triangular matrix obtained using QR-decomposition of V^t+1\widehat{V}^{t+1}. See (13) for the definition of B,CB,C, DD, and SS.

Also, note that for the special case of matrix completion, Bpq,Cpq,1p,qkB_{pq},C_{pq},1\leq p,q\leq k are diagonal matrices with

and Dj=(Ut)UD^{j}=(U^{t})^{\dagger}U^{*}. Using the above notation, (29) decouples into nn equations of the form (1jn1\leq j\leq n):

where (Vt+1)(j)(V^{t+1})^{(j)} and (V)(j)(V^{*})^{(j)} denote the jthj^{\textrm{th}} rows of Vt+1V^{t+1} and VV^{*} respectively.

Using the above notation, we now provide a proof of Theorem 5.1 for the general rank-kk case.

Multiplying the update equation (29) on the left by (V)\left(V^{*}_{\perp}\right)^{{\dagger}}, we get: (V)V^t+1=(V)F(R(t+1))1.(V^{*}_{\perp})^{{\dagger}}\widehat{V}^{t+1}=-(V^{*}_{\perp})^{{\dagger}}F(R^{(t+1)})^{-1}. That is,

Now, similar to the sensing case (see Section 4.2) we break down our proof into the following two steps:

Bound F(Σ)12\left\|{F(\Sigma^{*})^{-1}}\right\|_{2} (Lemma 5.6) and

Bound ΣR(t+1)12\|\Sigma^{*}{R^{(t+1)}}^{-1}\|_{2}, i.e., the minimum singular value of (Σ)1R(t+1)\left(\Sigma^{*}\right)^{-1}R^{(t+1)} (Lemma 5.7).

Using Lemma 5.6 and Lemma 5.7, w.p. at least 11/n31-1/n^{3},

We now provide lemmas required by our above given proof. See Appendix C.3 for detailed proof of each of the lemmas.

We first provide a lemma to bound incoherence of Vt+1V^{t+1}, assuming incoherence of UtU^{t}.

Let M,Ω,pM,\Omega,p be as defined in Theorem 2.5. Also, let UtU^{t} be the tt-th step iterate obtained by (28). Let UtU^{t} be μ1=16σ1μkσk\mu_{1}=\frac{16\sigma_{1}^{*}\mu\sqrt{k}}{\sigma_{k}^{*}} incoherent. Then, w.p. at least 11/n31-1/n^{3}, iterate V(t+1)V^{(t+1)} is also μ1\mu_{1} incoherent.

We now bound the error term (FF) in AltMin update (29).

Let FF be the error matrix defined by (18) (also see (29)) and let UtU^{t} be a μ1\mu_{1}-incoherent orthonormal matrix obtained after (t1)th(t-1)^{\textrm{th}} update. Also, let MM, Ω\Omega, and pp satisfy assumptions of Theorem 2.5. Then, w.p. at least 11/n31-1/n^{3}:

Next, we present a lemma to bound (R(t+1))12\|(R^{(t+1)})^{-1}\|_{2}.

Let R(t+1)R^{(t+1)} be the lower-triangular matrix obtained by QR decomposition of V^t+1\widehat{V}^{t+1} ( see (29)) and let UtU^{t} be a μ1\mu_{1}-incoherent orthonormal matrix obtained after (t1)th(t-1)^{\textrm{th}} update. Also, let MM and Ω\Omega satisfy assumptions of Theorem 2.5. Then,

Lemma follows by exactly the same proof as that of Lemma 4.7 for the matrix sensing case. ∎

Stagewise AltMin Algorithm

In Section 4, we showed that if δ2k(σk)2(σ1)2k\delta_{2k}\leq\frac{(\sigma^{*}_{k})^{2}}{(\sigma^{*}_{1})^{2}k} then AltMinSense (Algorithm 1) recovers the underlying matrix. This means that, d=(σ1)4(σk)4k2nlognd=\frac{(\sigma^{*}_{1})^{4}}{(\sigma^{*}_{k})^{4}}k^{2}n\log n random Gaussian measurements (assume mnm\leq n) are required to recover MM. For matrices with large condition number (σ1/σk\sigma^{*}_{1}/\sigma^{*}_{k}), this would be significantly larger than the information theoretic bound of O(knlogn/k)O(kn\log n/k) measurements.

To alleviate this problem, we present a modified version of AltMinSense called Stage-AltMin. Stage-AltMin proceeds in kk stages where in the ii-th stage, a rank-ii problem is solved. The goal of the ii-th stage is to recover top ii-singular vectors of MM, up to O(σi+1)O(\sigma^{*}_{i+1}) error.

Recall that, the main problem with our analysis of AltMinSense is that if σiσi+1\sigma_{i}\gg\sigma_{i+1} (for some ii) then δ2k(σi+1)2(σi)2k\delta_{2k}\leq\frac{(\sigma^{*}_{i+1})^{2}}{(\sigma^{*}_{i})^{2}k} would need to be small. However, in such a scenario, the ii-th stage of Algorithm 3 can be thought of as solving a noisy sensing problem where the goal is to recover Mi=defU1:iΣ1:i(V1:i)M_{i}\stackrel{{\scriptstyle\textrm{def}}}{{=}}U^{*}_{1:i}\Sigma^{*}_{1:i}(V^{*}_{1:i})^{\dagger} using noisy measurements b=A(U1:iΣ1:i(V1:i)+N)b={\mathcal{A}}(U^{*}_{1:i}\Sigma^{*}_{1:i}(V^{*}_{1:i})^{\dagger}+N) where noise matrix N=defUi+1:kΣi+1:k(Vi+1:k)N\stackrel{{\scriptstyle\textrm{def}}}{{=}}U^{*}_{i+1:k}\Sigma^{*}_{i+1:k}(V^{*}_{i+1:k})^{\dagger}. Here MiM_{i} and NN represent the top ii singular components and last kik-i singular components of MM respectively. Hence, using noisy-case type analysis (see Section B.3) we show that the error MU^t(V^t)F\|M-\widehat{U}^{t}(\widehat{V}^{t})^{\dagger}\|_{F} decreases to O(σi+1)O(\sigma^{*}_{i+1}).

We now formally present the proof of our main result (see Theorem 2.3).

We prove the theorem using mathematical induction.

Base Case: After the -th step, error is: MF2j=1kσj2kσ12\|M\|_{F}^{2}\leq\sum_{j=1}^{k}\sigma_{j}^{2}\leq k\sigma_{1}^{2}. Hence, base case holds.

Induction Step: Here, assuming that the error bound holds for (i1)(i-1)-th stage, we prove the error bound for the ii-th stage.

Our proof proceeds in two steps. First, we show that the initial point U^1:i0\widehat{U}^{0}_{1:i}, V^1:i0\widehat{V}^{0}_{1:i} of the ii-th stage, obtained using Step 4, has c(σi)2+O(k(σi+1)2)c(\sigma^{*}_{i})^{2}+O\left(k(\sigma^{*}_{i+1})^{2}\right) error, with c<1c<1. In the second step, we show that using the initial points U^1:i0\widehat{U}^{0}_{1:i}, V^1:i0\widehat{V}^{0}_{1:i}, the AltMin algorithm iterations in the ii-th stage (Steps 6, 7) reduces the error to max(ϵ,16kσi+12)\max(\epsilon,16k\sigma_{i+1}^{2}).

We formalize the above mentioned first step in Lemma 6.1 and then prove the second step in Lemma 6.2. ∎

We now present two lemmas used by the above given proof. See Appendix B.4 for a proof of each of the lemmas.

Let assumptions of Theorem 2.3 be satisfied. Also, let U^1:i0\widehat{U}^{0}_{1:i}, V^1:i0\widehat{V}^{0}_{1:i} be the output of Step 4 of Algorithm 3. Then, assuming that MU^1:i1TV^1:i1TF216k(σi)2\|M-\widehat{U}^{T}_{1:i-1}\widehat{V}^{T}_{1:i-1}\|_{F}^{2}\leq 16k(\sigma^{*}_{i})^{2}, we obtain:

Let assumptions of Theorem 2.3 be satisfied. Also, let U^1:iT\widehat{U}^{T}_{1:i}, V^1:iT\widehat{V}^{T}_{1:i} be the TT-th step iterates of the ii-th stage of Algorithm 3. Then, assuming that MU^1:i0V1:i0F2j=i+1k(σj)2+1100(σi)2\|M-\widehat{U}^{0}_{1:i}V^{0}_{1:i}\|_{F}^{2}\leq\sum_{j=i+1}^{k}(\sigma^{*}_{j})^{2}+\frac{1}{100}(\sigma^{*}_{i})^{2}, we obtain:

where T=Ω(log(MF/ϵ))T=\Omega(\log(\|M\|_{F}/\epsilon)).

Summary and Discussion

Alternating minimization provides an empirically appealing and popular approach to solving several different low-rank matrix recovery problems. The main motivation, and result, of this paper was to provide the first theoretical guarantees on the global optimality of alternating minimization, for matrix completion and the related problem of matrix sensing. We would like to note the following aspects of our results and proofs:

For both the problems, we show that alternating minimization recovers the true matrix under similar problem conditions (RIP, incoherence) to those used by existing algorithms (based on convex optimization or iterated SVDs); computationally, our results show faster convergence to the global optima, but with possibly higher statistical (i.e. sample) complexity.

We develop a new framework for analyzing alternating minimization for low-rank problems. Key observation of our framework is that for some problems (under standard problem conditions) alternating minimization can be viewed as a perturbed version of the power method. In our case, we can control the perturbation error based on the extent of RIP / incoherence demonstrated by the problem. This idea is likely to have applications to other similar problems where trace-norm based convex relaxation techniques have rigorous theoretical results but alternating minimization has enjoyed more empirical success. For example, robust PCA , spectral clustering etc.

Our analysis also sheds light on two key aspects of the alternating minimization approach: Initialization: Due to its connection to power method, it is now easy to see that for alternating minimization to succeed, the initial iterate should not be orthogonal to the target vector. Our results indeed show that alternating minimization succeeds if the initial iterate is not “almost orthogonal” to the target subspace. This suggests that, selecting initial iterate smartly is preferable to random initialization. Dependence on the condition number: Our results for the alternating minimization algorithm depend on the condition number. However, using a stagewise adaptation of alternating minimization, we can remove this dependence for the matrix sensing problem. This suggests that (problem specific) modifications of the basic alternating minimization algorithm may in fact perform better than the original one, while (mostly) retaining the computational / implementational simplicity of the underlying method.

References

Appendix A Preliminaries

Let b=A(M)+eb={\mathcal{A}}(M)+e, where ee is a bounded error vector, MM is a rank-kk matrix and A{\mathcal{A}} is a linear measurement operator that satisfies 2k2k-RIP with constant δ2k\delta_{2k} (assume δ2k<1/3\delta_{2k}<1/3). Let Xt+1X^{t+1} be the t+1t+1-th step iterate of SVP, then the following holds:

In our analysis, we heavily use the following two results. The first result is the well-known Bernstein’s inequality.

The second result is a restatement of Theorem 3.1 from .

(Restatement of Theorem 3.1 from ) Suppose MM is an incoherent rank-kk matrix and let p,Ωp,\Omega be as in Theorem 2.5. Further, let MkM_{k} be the best rank-kk approximation of 1pPΩ(M)\frac{1}{p}P_{\Omega}\left(M\right). Then, w.h.p. we have:

Remark: Note that Theorem 3.1 from holds only for Tr(PΩ(M))T_{r}\left(P_{\Omega}(M)\right) where Tr(PΩ(M))T_{r}\left(P_{\Omega}(M)\right) is a trimmed version of PΩ(M)P_{\Omega}(M) obtained by setting all rows and columns of PΩ(M)P_{\Omega}(M) with too many observed entries to zero. However, using standard Chernoff bound we can argue that for our choice of pp, none of the rows and columns of PΩ(M)P_{\Omega}(M) have too many observed entries and hence Tr(PΩ(M))=PΩ(M)T_{r}\left(P_{\Omega}(M)\right)=P_{\Omega}(M), whp.

Appendix B Matrix Sensing: Proofs

The following is an alternate characterization of RIP that we use heavily in our proofs. At a conceptual level, it says that if A{\mathcal{A}} satisfies RIP, then it also preserves inner-product between any two rank-kk matrices (upto some additive error).

Consider the matrices X1=defU1V1TX_{1}\stackrel{{\scriptstyle\textrm{def}}}{{=}}U_{1}V_{1}^{T}, X2=defU2V2TX_{2}\stackrel{{\scriptstyle\textrm{def}}}{{=}}U_{2}V_{2}^{T} and X=X1+X2X=X_{1}+X_{2}. Since the rank of XX is at most 2k2k, we obtain the following using the RIP of A\mathcal{A}:

Concentrating on the second inequality, we obtain

A similar argument proves the other side of the inequality. This proves the lemma. ∎

We first show that the update (6) reduces to:

Let Err(V)=defi(Tr(AiM)Tr(AiU(t)V))2Err(V)\stackrel{{\scriptstyle\textrm{def}}}{{=}}\sum_{i}\left(\operatorname{Tr}\left(A_{i}M\right)-\operatorname{Tr}\left(A_{i}U^{(t)}V^{\dagger}\right)\right)^{2}. Since V^(t+1)\widehat{V}^{(t+1)} minimizes E(V)E(V), we have VE(V^(t+1))=0\nabla_{V}E(\widehat{V}^{(t+1)})=0.

where inverting BB is valid since the minimum singular value of BB is strictly positive (please refer Lemma B.2). Considering the pthp^{\textrm{th}} block of v^(t)\widehat{v}^{(t)}, we obtain

This gives us the following equation for V^(t)\widehat{V}^{(t)}:

where F=\left[\begin{array}[]{cccc}\left(B^{-1}\left(BD-C\right)Sv^{*}\right)_{1}&\left(B^{-1}\left(BD-C\right)Sv^{*}\right)_{2}&\cdots&\left(B^{-1}\left(BD-C\right)Sv^{*}\right)_{k}\end{array}\right].

Consider B=iAiut(ut)AiB=\sum_{i}A_{i}u^{t}(u^{t})^{\dagger}A_{i}^{\dagger}. Now, smallest eigenvalue of BB, i.e., λmin(B)\lambda_{min}(B) is given by:

where the last inequality follows using Lemma B.1. Using (38),

Now, consider G=u,utBC=iAi(u,utut(ut)ut(u))Ai=iAiut(u,ututu)AiG=\langle u^{*},u^{t}\rangle B-C=\sum_{i}A_{i}\left(\langle u^{*},u^{t}\rangle u^{t}(u^{t})^{\dagger}-u^{t}(u^{*})^{\dagger}\right)A_{i}^{\dagger}=\sum_{i}A_{i}u^{t}\left(\langle u^{*},u^{t}\rangle u^{t}-u^{*}\right)^{\dagger}A_{i}^{\dagger}. Using definition of the spectral norm:

where the last inequality follows by using Lemma B.1 and the fact that ut,(u,ututu)=0\langle u^{t},\left(\langle u^{*},u^{t}\rangle u^{t}-u^{*}\right)\rangle=0.

Lemma now follows using (37), (39), (40). ∎

B.2 Rank-k𝑘k Matrix Sensing

Since U^t\widehat{U}^{t} and U~t\widetilde{U}^{t} have full rank and span the same subspace, there exists a k×kk\times k, full rank matrix RR such that U^t=U~tR=UtRUtR\widehat{U}^{t}=\widetilde{U}^{t}R=U^{t}R_{U}^{t}R. We have:

Now, using RIP (see Definition 2.1) along with the above equation, we get:

Since ww was arbitrary, this proves the lemma. ∎

We calculate (BDC)pq\left(BD-C\right)_{pq} as follows:

Since ww and zz were arbitrary unit vectors, we can conclude that BDC2δ2kkdist(Ut,U)\left\|{BD-C}\right\|_{2}\leq\delta_{2k}\sqrt{k}\cdot dist(U^{t},U^{*}). Plugging this bound in (42) proves the lemma. ∎

Note that Σ(R(t+1))12σ1σmin(R(t+1))\|\Sigma^{*}(R^{(t+1)})^{-1}\|_{2}\leq\frac{\sigma^{*}_{1}}{\sigma_{\text{min}}(R^{(t+1)})}. Now,

Lemma now follows using above inequality with Lemma 4.6.

B.3 Noisy Matrix Sensing: Proofs

We now consider an extension of the matrix sensing problem where measurements can be corrupted arbitrarily using a bounded noise. That is, we observe b=A(M+N)b={\mathcal{A}}\left(M+N\right), where NN is the noise matrix. For this noisy case as well, we show that AltMinSense recovers MM upto an additive approximation depending on the Frobenius norm of NN.

Let MM and A()\mathcal{A}(\cdot) be as defined in Theorem 2.2. Suppose, AltMinSense algorithm (Algorithm 1) is supplied inputs A{\mathcal{A}}, b=A(M+N)b={\mathcal{A}}(M+N), where NN is the noise matrix s.t. NF<1100σk\left\|N\right\|_{F}<\frac{1}{100}\sigma^{*}_{k}. Then, after T=4log(2/ϵ)T=4\log(2/\epsilon) steps, iterates U^T\widehat{U}^{T}, V^T\widehat{V}^{T} of AltMinSense satisfy:

At a high level, our proof for noisy case follows closely, the exact case proof given in Section 4. That is, we show that the update of AltMinSense algorithm is similar to power-method type update but with two errors terms: one due to incomplete measurements and another due to the noise matrix.

Similar to our proof for sensing problem (Section 4), we analyze QR-decomposition based updates. That is,

Similar to Lemma 4.5, we can re-write the above given update equation as:

where, FF is the error matrix and is as defined in (18) and GG is the error matrix due to noise and is given by:

where BB, CC and DD defined in the previous section (See (13)) and CNC^{N} and DND^{N} are defined below:

Now, multiplying (44) with VV^{*}_{\perp}, we get:

where the last inequality follows using Lemma 4.6.

Now, we break down the proof in the following two steps:

Bound G2\left\|{G}\right\|_{2} (Lemma B.4, analogous to Lemma 4.6)

Bound (R(t+1))12\|({R^{(t+1)}})^{-1}\|_{2} (Lemma B.5, similar to Lemma 4.7)

Later in this section, we provide the above mentioned lemmas and their detailed proof.

As, U0U^{0} is obtained using SVD of iAibi\sum_{i}A_{i}b_{i}. Hence, using Lemma A.1, we have:

where last inequality follows using NFσk<1/100\frac{\|N\|_{F}}{\sigma^{*}_{k}}<1/100.

Theorem now follows using above equation with (47). ∎

Assuming conditions of Lemma B.4, we have the following bound on the minimum singular value of R(t)R^{(t)}:

Similar to the proof of Lemma 4.7, we have the following set of inequalities:

B.4 Stagewise Alternating Minimization for Matrix Sensing: Proofs

As the initial point of the ii-th stage is obtained by one step of SVP , using Lemma A.1, we obtain:

Now, by assumption over the (i1)(i-1)-th stage error (this assumption follows from the inductive hypothesis in proof of Theorem 2.3),

Lemma now follows by setting δ2k13200k\delta_{2k}\leq\frac{1}{3200k}. ∎

For our proof, we consider two cases: a) σiσi+1<5k\frac{\sigma^{*}_{i}}{\sigma^{*}_{i+1}}<5\sqrt{k}, b) σiσi+15k\frac{\sigma^{*}_{i}}{\sigma^{*}_{i+1}}\geq 5\sqrt{k}. Case (a): In this case, using monotonicity of the AltMin algorithm directly gives error bound. That is, MU^1:iT(V^1:iT)F2MU^1:i0V1:i0F2k(σi+1)2+25k100(σi+1)2.\|M-\widehat{U}^{T}_{1:i}(\widehat{V}_{1:i}^{T})^{{\dagger}}\|_{F}^{2}\leq\|M-\widehat{U}^{0}_{1:i}V^{0}_{1:i}\|_{F}^{2}\leq k(\sigma^{*}_{i+1})^{2}+\frac{25k}{100}(\sigma^{*}_{i+1})^{2}. Case (b): At a high level, if σiσi+15k\frac{\sigma^{*}_{i}}{\sigma^{*}_{i+1}}\geq 5\sqrt{k} then U1:i0U^{0}_{1:i} is “close” to U1:iU^{*}_{1:i} and hence the error bound follows by using an analysis similar to the noisy case. Note that σi+1\sigma^{*}_{i+1} being small implies that the “noise” is small. See Lemma B.6 for a formal proof of this case. ∎

Assume conditions given in Theorem 2.3 are satisfied and let σiσi+15k\frac{\sigma^{*}_{i}}{\sigma^{*}_{i+1}}\geq 5\sqrt{k}. Also, let

Then, U1:iTU^{T}_{1:i}, V1:iTV^{T}_{1:i} satisfy:

We first show that if σi\sigma_{i} and σi+1\sigma_{i+1} have large gap then   t\forall\;t, the ttht^{\textrm{th}} iterate of the ii-th stage, U^1:it\widehat{U}^{t}_{1:i} is close to U1:iU^{*}_{1:i}. Let UtU_{\perp}^{t} be a basis of the subspace orthogonal to U^1:it\widehat{U}^{t}_{1:i}.

where (ζ1)(\zeta_{1}) follows from the fact that lines 585-8 of Algorithm 3 never increases A(MU^1:it(V^1:it))2\left\|{{\mathcal{A}}\left(M-\widehat{U}^{t}_{1:i}(\widehat{V}^{t}_{1:i})^{\dagger}\right)}\right\|_{2}. Using (50), (51), and σiσi+15k\frac{\sigma^{*}_{i}}{\sigma^{*}_{i+1}}\geq 5\sqrt{k}, we obtain the following bound:

Now, we consider the update equation for V^t+1\widehat{V}^{t+1}:

Note that, the update is same as noisy case with noise matrix N=Ui+1:kΣi+1:k(Vi+1:k)N=U^{*}_{i+1:k}\Sigma^{*}_{i+1:k}(V^{*}_{i+1:k})^{\dagger} from (44):

where FF and GG are given by (18), (45). Multiplying (53) from the left by V=IVt+1(Vt+1)V_{\perp}^{{\dagger}}=I-V^{t+1}(V^{t+1})^{\dagger}, we obtain:

where the last inequality follows using the fact that σmin(A)BFABF\sigma_{\min}(A)\|B\|_{F}\leq\|AB\|_{F}. Using Lemma B.4, and a modification of Lemma 4.6, we get:

Using (54), (55), and the fact that σmin(UU1:i)=1UU1:i22\sigma_{\min}(U_{\perp}^{\dagger}U^{*}_{1:i})=\sqrt{1-\|U_{\perp}^{\dagger}U^{*}_{1:i}\|_{2}^{2}},

Assuming UU1:iΣ1:iF>2j=i+1kσj2\left\|{U_{\perp}^{{\dagger}}U^{*}_{1:i}\Sigma^{*}_{1:i}}\right\|_{\text{F}}>2\sqrt{\sum_{j=i+1}^{k}\sigma_{j}^{2}}, we obtain:

Using similar analysis, we can show that,

So after T8log(kσi)T\geq 8\log(k\sigma^{*}_{i}) iterations, we have:

Using the above inequality, we now bound the error after T8log(kσi)T\geq 8\log(k\sigma^{*}_{i}) iterations of the ii-th stage:

where (ζ1)(\zeta_{1}) follows from (53) and (ζ2)(\zeta_{2}) follows from (55). Using (57) and (58), we obtain the following bound:

Appendix C Matrix Completion: Proofs

Using Theorem 5.1, after O(log(1/ϵ))O(\log(1/\epsilon)) iterations, we get:

Now, using (29), the residual after tt-th step is given by:

where ζ1\zeta_{1} follows by Lemma 5.6 and setting δ2k\delta_{2k} appropriately. Theorem 2.5 now follows by setting ϵ=2kMFϵ\epsilon^{\prime}=2\sqrt{k}\|M\|_{F}\epsilon. ∎

We now provide the two results used in the above lemma.

After step 33 in Algorithm 2, whp we have:

From Theorem 3.1 in , we have the following result:

Let U(0)ΣVU^{(0)}\Sigma V^{{\dagger}} be the top kk singular components of MkM_{k}. We also have:

where (ζ1)(\zeta_{1}) follows from the fact that the column space of the first two terms in the equation is U(0)U_{\perp}^{(0)} where as the column space of the last two terms is U(0)U^{(0)}. Using the above two inequalities, we get:

if p>Ck4lognm(σ1)2(σk)2p>\frac{C^{\prime}k^{4}\log n}{m}\cdot\frac{(\sigma^{*}_{1})^{2}}{(\sigma^{*}_{k})^{2}} for a large enough constant CC^{\prime}. ∎

U~\widetilde{U} is incoherent with parameter 4μk4\mu\sqrt{k}.

Let uicu_{i}^{c} be the vector obtained by setting all the elements of uiu_{i} with magnitude greater than 2μkm\frac{2\mu\sqrt{k}}{\sqrt{m}} to zero and let uic=defuiuicu_{i}^{\overline{c}}\stackrel{{\scriptstyle\textrm{def}}}{{=}}u_{i}-u_{i}^{c}. Now, note that if for element jj of uiu_{i} we have uij>2μkm\left|u_{i}^{j}\right|>\frac{2\mu\sqrt{k}}{\sqrt{m}}, then, (uic)ju˘ij=u˘ijμkmuiju˘ij|(u_{i}^{c})^{j}-\breve{u}_{i}^{j}|=\left|\breve{u}_{i}^{j}\right|\leq\frac{\mu\sqrt{k}}{\sqrt{m}}\leq\left|u_{i}^{j}-\breve{u}_{i}^{j}\right|. Hence,

Let Uc=U~Λ1U^{c}=\widetilde{U}\Lambda^{-1} (QR decomposition). Then, for any uSpan(U)u_{\perp}^{*}\in\textrm{Span}(U_{\perp}^{*}) we have:

We now bound Λ2\left\|{\Lambda}\right\|_{2} as follows:

where we used the fact that d<116kd<\frac{1}{16k}. So we have:

Incoherence of U~\widetilde{U} follows using the following set of inequalities:

C.2 Rank-111 Matrix Completion: Proofs

As BB is a diagonal matrix, B12=1miniBii11δ2\|B^{-1}\|_{2}=\frac{1}{\min_{i}B_{ii}}\leq\frac{1}{1-\delta_{2}}, where the last inequality follows using Lemma C.3. The lemma now follows using the above observation and Lemma C.4.∎

Let M=σu(v)M=\sigma^{*}u^{*}(v^{*})^{\dagger}, pp, Ω\Omega, utu^{t} be as defined in Lemma 5.3. Then, w.p. at least 11n31-\frac{1}{n^{3}},

Using union bound (for all jj) and for p9μ12lognmδ22p\geq\frac{9\mu_{1}^{2}\log n}{m\delta_{2}^{2}}, w.p. 11n31-\frac{1}{n^{3}}: j,ut,uδ2Zjut,u+δ2\forall j,\langle u^{t},u^{*}\rangle-\delta_{2}\leq Z_{j}\leq\langle u^{t},u^{*}\rangle+\delta_{2}. ∎

Let M=σu(v)M=\sigma^{*}u^{*}(v^{*})^{\dagger}, pp, Ω\Omega, utu^{t} be as defined in Lemma 5.3. Then, w.p. at least 11n31-\frac{1}{n^{3}},

where C>0C>0 is a global constant and (ζ1)(\zeta_{1}) follows by using a modified version of Lemma 6.1 by (see Lemma C.5) and (ζ2)(\zeta_{2}) follows by using incoherence of vv^{*} and utu^{t}. Lemma now follows by observing that maxx,x2=1x(u,utBC)v=(u,utBC)v2\max_{x,\|x\|_{2}=1}x^{\dagger}(\langle u^{*},u^{t}\rangle B-C)v^{*}=\|(\langle u^{*},u^{t}\rangle B-C)v^{*}\|_{2} and p>Cμ12lognmδ22p>\frac{C\mu_{1}^{2}\log n}{m\delta_{2}^{2}}. ∎

Using (25) and using the fact that B,CB,C are diagonal matrices:

We bound the largest magnitude of elements in v^t+1\widehat{v}^{t+1} as follows. For every j[n]j\in[n], we have:

where (ζ1)(\zeta_{1}) follows from the fact that 1δ2Bjj1+δ21-\delta_{2}\leq B_{jj}\leq 1+\delta_{2} and Cjj(ut,u+δ2)\left|C_{jj}\right|\leq\left(\left|\langle u^{t},u^{*}\rangle\right|+\delta_{2}\right) (please refer Lemma C.3).

C.3 General Rank-k𝑘k Matrix Completion: Proofs

From the decoupled update equation, (30), we obtain:

We bound the two norm of the (Vt+1)(j)(V^{t+1})^{(j)} as follows:

where we used the following inequalities in (ζ1)(\zeta_{1}):

where (62) follows from the incoherence of VV^{*}, (63) follows from from an analysis similar to the proof of Lemma 4.7, (64) follows from (the proof of) Lemma C.6, (65) follows from Lemma C.7 and finally (66) follows from the fact that Dj=(Ut)UD^{j}=\left(U^{t}\right)^{{\dagger}}U^{*} with UtU^{t} and UU^{*} being orthonormal column matrices. ∎

where the last inequality follows using Lemma C.6 and Lemma C.8. ∎

We now bound B12\|B^{-1}\|_{2} and Cj2\|C^{j}\|_{2}, which is required by our bound for FF as well as for our incoherence proof.

Let M,Ω,p,M,\Omega,p, and UtU^{t} be as defined in Theorem 2.5 and Lemma 5.6. Then, w.p. at least 11n31-\frac{1}{n^{3}}:

Lemma would follow using the bound on σmin(Bj),j\sigma_{min}(B^{j}),\forall j that we show below.

That is, by using pp as in the statement of the lemma with the above equation and using union bound, we get (w.p. >11/n3>1-1/n^{3}): w,j  wBjw1δ2k\forall w,j\ \ w^{\dagger}B^{j}w\geq 1-\delta_{2k}. That is, j,σmin(Bj)(1δ2k)\forall j,\sigma_{min}(B^{j})\geq(1-\delta_{2k}). ∎

Finally, we provide a lemma to bound the second part of the error term (FF).

Let M,Ω,p,M,\Omega,p, and UtU^{t} be as defined in Theorem 2.5 and Lemma 5.6. Then, w.p. at least 11n31-\frac{1}{n^{3}}:

where v=vec(V)v^{*}=vec(V^{*}), i.e. v=[V1Vk]v^{*}=\left[\begin{matrix}V^{*}_{1}\\ \vdots\\ V^{*}_{k}\end{matrix}\right].

Let ui=(Ut)(i)u^{i}=(U^{t})^{(i)} and u(i)=(U)(i)u^{*(i)}=(U^{*})^{(i)}. Also, let Hj=(BjDCj)H^{j}=(B^{j}D-C^{j}), i.e.,

Now, x(BDC)v=j(xj)(BjDCj)(V)(j)=1ppq(i,j)ΩxpjVjq(Hij)pqx^{\dagger}(BD-C)v^{*}=\sum_{j}(x^{j})^{\dagger}(B^{j}D-C^{j})(V^{*})^{(j)}=\frac{1}{p}\sum_{pq}\sum_{(i,j)\in\Omega}x^{j}_{p}V^{*}_{jq}(H^{j}_{i})_{pq}. Also, using (71), (p,q)\forall(p,q):

Hence, applying Lemma C.5, we get w.p. at least 11n31-\frac{1}{n^{3}}:

Using (72), (73) and incoherence of VV^{*}, we get (w.p. 11/n31-1/n^{3}), x\forall x:

where we used the fact that pxp2kx2=k\sum_{p}\left\|{x_{p}}\right\|_{2}\leq\sqrt{k}\left\|{x}\right\|_{2}=\sqrt{k} in the last step. Lemma now follows by observing maxx,x=1x(BDC)v=(BDC)v2\max_{x,\|x\|=1}x^{\dagger}(BD-C)v^{*}=\|(BD-C)v^{*}\|_{2}. ∎