Low-rank Solutions of Linear Matrix Equations via Procrustes Flow

Stephen Tu, Ross Boczar, Max Simchowitz, Mahdi Soltanolkotabi, Benjamin Recht

Introduction

Low rank models are ubiquitous in machine learning, and over a decade of research has been dedicated to determining when such models can be efficiently recovered from partial information [Faz02, RS05, CR09]. See [DR16] for an extended survey on this topic. The simplest such recovery problem concerns how can we can find a low-rank matrix obeying a set of linear equations? What is the computational complexity of such an algorithm? More specifically, we are interested in solving problems of the form

Since the early seventies, a popular heuristic for solving such problems has been to replace M\bm{M} with a low-rank factorization M=UVT\bm{M}=\bm{U}\bm{V}^{\mathsf{T}} and solve matrix bilinear equations of the form

via a local search heuristic [Ruh74]. Many researchers have demonstrated that such heuristics work well in practice for a variety of problems [RS05, Fun06, LRS+10, RR13]. However, these procedures lack strong guarantees associated with convex programming heuristics for solving (1.1).

In this paper we show that a local search heuristic solves (1.2) under standard restricted isometry assumptions on the linear map A\mathcal{A}. For standard ensembles of equality constraints, we demonstrate that M\bm{M} can be estimated by such heuristics as long as we have Ω((n1+n2)r)\Omega((n_{1}+n_{2})r) equations.Here and throughout we use f(x)=Ω(g(x))f(x)=\Omega(g(x)) if there is a positive constant CC such that f(x)Cg(x)f(x)\geq Cg(x) for all xx sufficiently large. This is merely a constant factor more than the number of parameters needed to specify a n1×n2n_{1}\times n_{2} rank rr matrix. Specialized to a random Gaussian model and positive semidefinite matrices, our work improves upon recent independent work by Zheng and Lafferty [ZL15].

Algorithms

In this paper we study a local search heuristic for solving matrix bilinear equations of the form (1.2) which consists of two components: (1) a careful initialization obtained by a projected gradient scheme on n1×n2n_{1}\times n_{2} matrices, and (2) a series of successive refinements of this initial solution via a gradient descent scheme. This algorithm is a natural extension of the Wirtinger Flow algorithm developed in [CLS15] for solving vector quadratic equations. Following [CLS15], we shall refer to the combination of these two steps as the Procrustes Flow (PF) algorithm. We shall describe two variants of our algorithm based on whether the sought after solution M\bm{M} is positive semidefinite or not. The former is detailed in Algorithm 1, and the latter in Algorithm 2.

The initialization phase of both variants is rather similar and is described in Section 2.1. The successive refinement phase is explained in Section 2.2 for positive semidefinite (PSD) matrices and in Section 2.3 for arbitrary matrices. Throughout this paper when describing the PSD case, we assume the size of the matrix is M\bm{M} is n×nn\times n, i.e. n1=n2=nn_{1}=n_{2}=n.

In the initial phase of our algorithm we start from M~0=0n1×n2\widetilde{\bm{M}}_{0}=\bm{0}_{n_{1}\times n_{2}} and apply successive updates of the form

on rank rr matrices of size n1×n2n_{1}\times n_{2}. Here, Pr\mathcal{P}_{r} denotes projection onto either rank-rr matrices or rank-rr PSD matrices, both of which can be computed efficiently via Lanczos methods. We run (2.1) for T0T_{0} iterations and use the resulting matrix MT0\bm{M}_{T_{0}} for initialization purposes. In the PSD case, we set our initialization to an n×rn\times r matrix U0\bm{U}_{0} obeying M~T0=U0U0T\widetilde{\bm{M}}_{T_{0}}=\bm{U}_{0}\bm{U}_{0}^{\mathsf{T}}. In the more general case of rectangular matrices we need to use two factors. Let M~T0=CT0ΣT0DT0T\widetilde{\bm{M}}_{T_{0}}=\bm{C}_{T_{0}}\bm{\Sigma}_{T_{0}}\bm{D}_{T_{0}}^{\mathsf{T}} be the Singular Value Decomposition (SVD) of M~T0\widetilde{\bm{M}}_{T_{0}}. We initialize our algorithm in the rectangular case by setting U0=CT0ΣT01/2\bm{U}_{0}=\bm{C}_{T_{0}}\bm{\Sigma}_{T_{0}}^{1/2} and V0=DT0ΣT01/2\bm{V}_{0}=\bm{D}_{T_{0}}\bm{\Sigma}_{T_{0}}^{1/2}.

Updates of the form (2.1) have a long history in compressed sensing/matrix sensing literature (see e.g. [TG07, GK09, NT09, NV09, BD09, MJD09, CCS10]). Furthermore, using the first step of the update (2.1) for the purposes of initialization has also been proposed in previous work (see e.g. [AM07, KMO10, JNS13]).

2 Successive refinement via gradient descent – positive semidefinite case

3 Successive refinement via gradient descent – general case

Note that this is similar to (2.2) but adds a regularizer to measure mismatch between U\bm{U} and V\bm{V}. Given a factorization M=UVT\bm{M}=\bm{U}\bm{V}^{\mathsf{T}}, for any invertible r×rr\times r matrix P\bm{P}, UP\bm{U}\bm{P} and VPT\bm{V}\bm{P}^{-\mathsf{T}} is also a valid factorization. The purpose of the second term in (2.4) is to account for this redundancy and put the two factors on “equal footing”. To solve (2.4), starting from our initial estimates U0\bm{U}_{0} and V0\bm{V}_{0} we apply the successive updates

Again, (2.3) and (2.3) are essentially gradient descent with a carefully chosen step size.

Main Results

For our theoretical results we shall focus on affine maps A\mathcal{A} which obey the matrix Restricted Isometry Property (RIP).

The map A\mathcal{A} satisfies rr-RIP with constant δr\delta_{r}, if

We note that this distance is the solution to the classic orthogonal Procrustes problem (hence the name of the algorithm). It is known that the optimal rotation matrix R\bm{R} minimizing UXRF\left\|\bm{U}-\bm{XR}\right\|_{F} is equal to R=ABT\bm{R}=\bm{A}\bm{B}^{\mathsf{T}}, where AΣBT\bm{A}\bm{\Sigma}\bm{B}^{\mathsf{T}} is the singular value decomposition (SVD) of XTU\bm{X}^{\mathsf{T}}\bm{U}. We now have all of the elements in place to state our main results.

Furthermore, take a constant step size μτ=μ\mu_{\tau}=\mu for all τ=1,2,\tau=1,2,\ldots, with μ36/425\mu\leq 36/425. Then, starting from any initial solution obeying (3.3), the τ\tau-th iterate of Algorithm 1 satisfies

2 Bilinear measurements

Furthermore, take a constant step size μτ=μ\mu_{\tau}=\mu for all τ=1,2,\tau=1,2,\ldots and assume μ2/187\mu\leq 2/187. Then, starting from any initial solution obeying (3.6), the τ\tau-th iterate of Algorithm 2 satisfies

The above theorem shows that Procrustes Flow algorithm achieves a good initialization under the RIP assumptions on the mapping A\mathcal{A}. Also, starting from any sufficiently accurate initialization the algorithm exhibits geometric convergence to the unknown matrix M\bm{M}. We note that in the above result we have not attempted to optimize the constants. Furthermore, there is a natural tradeoff involved between the upper bound on the RIP constant, the radius in which PF is contractive (3.6), and its rate of convergence (3.7). In particular, as it will become clear in the proofs one can increase the radius in which PF is contractive (increase the constant 1/41/4 in (3.6)) and the rate of convergence (increase the constant 4/254/25 in (3.7)) by assuming a smaller upper bound on the RIP constant.

The most common measurement ensemble which satisfies the isotropy and RIP assumptions is the Gaussian ensemble here each matrix Ak\bm{A}_{k} has i.i.d. N(0,1/m)\mathcal{N}(0,1/m) entries.We note that in the PSD case the so called spiked Gaussian ensemble would be the right equivalent. In this case each symmetric matrix Ak\bm{A}_{k} has N(0,1/m)\mathcal{N}(0,1/m) entries on the diagonal and N(0,1/2m)\mathcal{N}(0,1/2m) entries elsewhere. For this ensemble to achieve a RIP constant of δr\delta_{r}, we require at least m=Ω(1δr2nr)m=\Omega(\frac{1}{\delta_{r}^{2}}nr) measurements. Using equation (3.7) together with a simple calculation detailed in Appendix D, we can conclude that for Mτ=UτVτT\bm{M}_{\tau}=\bm{U}_{\tau}\bm{V}_{\tau}^{\mathsf{T}}, we have

Thus, applying Theorem 3.3 to this measurement ensemble, we conclude that the Procrustes Flow algorithm yields a solution with relative error (MτMF/MFϵ\left\|\bm{M}_{\tau}-\bm{M}\right\|_{F}/\left\|\bm{M}\right\|_{F}\leq\epsilon) in O(κlog(1/ϵ))\mathcal{O}(\kappa\log(1/\epsilon)) iterations using only Ω(nr)\Omega(nr) measurements. We would like to note that if more measurements are available it is not necessary to use multiple projected gradient updates in the initialization phase. In particular, for the Gaussian model if m=Ω(nr2κ2)m=\Omega(nr^{2}\kappa^{2}), then (3.3) will hold after the first iteration (T0=1T_{0}=1).

Theorems 3.2 and 3.3 require that T0=Ω(log(rκ))T_{0}=\Omega(\log(\sqrt{r}\kappa)), but κ\kappa is a property of M\bm{M} and is hence unknown. However, under the same hypotheses regarding the RIP constant in Theorems 3.2 and 3.3, we can use each iterate of initialization to test whether or not we have entered the radius of convergence. The following lemma establishes a sufficient condition we can check using only information from M~τ\widetilde{\bm{M}}_{\tau}. We establish this result only in the symmetric case– the extension to the general case is straightforward. The proof is deferred to Appendix B.

One might consider using solely the projected gradient updates (i.e. set T0=T_{0}=\infty) as in previous approaches [TG07, GK09, NT09, NV09, BD09, MJD09, CCS10]. We note that the projected gradient updates in the initialization phase require computing the first rr singular vectors of a matrix whereas the gradient updates do not require any singular vector computations. Such singular computations may be prohibitive compared to the gradient updates, especially when n1n_{1} or n2n_{2} is large and for ensembles where matrix-vector multiplication is fast. We would like to emphasize, however, that for small n1,n2n_{1},n_{2} and dense matrices using projected gradient updates may be more efficient. Our scheme is a natural interpolation: one could only do projected gradient steps, or one could do one projected gradient step. Here we argue that very few projected gradients provide sufficient initialization such that gradient descent converges geometrically.

Related work

There is a vast literature dedicated to low-rank matrix recovery/sensing and semidefinite programming. We shall only focus on the papers most related to our framework.

Recht, Fazel, and Parrilo were the first to study low-rank solutions of linear matrix equations under RIP assumptions [RFP10]. They showed that if the rank-rr RIP constant of A\mathcal{A} is less than a fixed numerical constant, then the matrix with minimum trace satisfying the equality constraints coincided with the minimum rank solution. In particular, for the Gaussian ensemble the required number of measurements is Ω(nr)\Omega(nr) [CP11]. Subsequently, a series of papers [CR09, Gro11, Rec11, CLS14] showed that trace minimization and related convex optimization approaches also work for other measurement ensembles such as those arising in matrix completion and related problems. In this paper we have established a similar result to [RFP10]. We require the same order of measurements Ω(nr)\Omega(nr) but use a more computationally friendly local search algorithm. Also related to this work are projection gradient schemes with hard thresholding [TG07, GK09, NT09, NV09, BD09, MJD09, CCS10]. Such algorithms enjoy similar guarantees to that of [RFP10] and this work. Indeed, we utilize such results in the initialization phase of our algorithm. However, such algorithms require a rank-rr SVD in each iteration which may be expensive for large problem sizes. We would like to emphasize, however, that for small problem sizes and dense matrices (such as Gaussian ensembles) such algorithms may be faster than gradient descent approaches such as ours.

Our algorithm and analysis are inspired by the recent paper [CLS15] by Candes, Li and Soltanolkotabi. See also [Sol14, CLM15] for some stability results. In [CLS15] the authors introduced a local regularity condition to analyze the convergence of a gradient descent-like scheme for phase retrieval. We use a similar regularity condition but generalize it to ranks higher than one. Recently, independent of our work, Zheng and Lafferty [ZL15] provided an analysis of gradient descent using (2.2) via the same regularity condition. Zheng and Lafferty focus on the Gaussian ensemble, and establish a sample complexity of m=Ω(nr3κ2logn)m=\Omega(nr^{3}\kappa^{2}\log n). In comparison we only require Ω(nr)\Omega(nr) measurements removing both the dependence on κ\kappa in the sample complexity and improving the asymptotic rate. We would like to emphasize that the improvement in our result is not just due to the more sophisticated initialization scheme. In particular, Zheng and Lafferty show geometric convergence starting from any initial solution obeying dist(U0,X)cσr(X)\left(\bm{U}_{0},\bm{X}\right)\leq c\cdot\sigma_{r}(\bm{X}) as long as the number of measurements obeys m=Ω(nrκ2logn)m=\Omega(nr\kappa^{2}\log n). In contrast, we establish geometric convergence starting from the same neighborhood of U0\bm{U}_{0} with only Ω(nr)\Omega(nr) measurements. Our results also differs in terms of the convergence rate. We establish a convergence rate of the form 1μκ1-\frac{\mu}{\kappa} whereas [ZL15] establishes a slower convergence rate of the form 1μnr2κ21-\frac{\mu}{nr^{2}\kappa^{2}}. Moreover, the theory of restricted isometries in our work considerably simplifies the analysis.

Finally, we would also like to mention [SOR15] for guarantees using stochastic gradient algorithms. The results of [SOR15] are applicable to a variety of models; focusing on the Gaussian ensemble, the authors require Ω((nrlogn)/ϵ)\Omega\left((nr\log n)/\epsilon\right) samples to reach a relative error of ϵ\epsilon. In contrast, our sample complexity is independent of the desired relative error ϵ\epsilon. However, their algorithm only requires a random initialization.

Since the first version of this paper appeared on arXiv, a few recent papers have also studied low-rank recovery from RIP measurements via Procrustes Flow type schemes [BKS15, ZWL15, CW15]. We would like to point out that the results presented in these papers are suboptimal compared to ours. For example, by utilizing some of the results of the previous version of this paper, [BKS15] provides a similar convergence rate to ours. However, this convergence occurs in a smaller radius around the planted solution so that the required number of measurements is significantly higher. Furthermore, the results of [BKS15] only apply when the matrix is PSD and do not work for general rectangular. Similarly, result in [CW15] holds only for PSD matrices, and the convergence rate has a high-degree polynomial dependence on condition number. The algorithm from [CW15] does generalize to rectangular matrices, but the sample complexity is of the order of O(nr3logn)\mathcal{O}(nr^{3}\log n) rather than the complexity O(nr)\mathcal{O}(nr) we establish here. Moreover, our analysis of both the PSD and rectangular cases is far more concise.

Proofs

We first prove our results for the symmetric PSD case (Theorem 3.2). However, whenever possible we will state lemmas in the more general setting. The changes required for the proof of the general setting (Theorem 3.3) is deferred to Section 5.4.

Thus, any result proven for the update (5.1) will automatically carry over to the PF update with a simple rescaling of the upper bound on the step size via (5.3). Furthermore, we can upper bound the convergence rate of gradient descent using the PF update in terms of properties of X\bm{X} instead of U0\bm{U}_{0} via (5.4).

We start with a well known characterization of RIP.

[Can08] Let A\mathcal{A} satisfy 2r2r-RIP with constant δ2r\delta_{2r}. Then, for all matrices X,Y\bm{X},\bm{Y} of rank at most rr, we have

holds. Here, ρ(A)\rho(\mathcal{A}) is defined as

We would like to point out that the dependence on σr2(X)\sigma_{r}^{2}(\bm{X}) in the lemma above is unavoidable.

2 Proof of convergence of gradient descent updates (Equation (3.4))

We first outline the general proof strategy. See Sections 2.3 and 7.9 of [CLS15] for related arguments. We first will show that gradient descent on an approximate estimate of the function ff converges. The approximate function we use is F(U):=14UUTXXTF2F(\bm{U}):=\frac{1}{4}\left\|\bm{UU}^{\mathsf{T}}-\bm{XX}^{\mathsf{T}}\right\|_{F}^{2}. When the map A\mathcal{A} is random and isotropic in expectation, F(U)F(\bm{U}) can be interpreted as the expected value of f(U)f(\bm{U}), but we stress that our result is a purely deterministic result. We demonstrate that F(U)F(\bm{U}) exhibits geometric convergence in a small neighborhood around X\bm{X}. The standard approach in optimization to show this is to prove that the function exhibits strong convexity. However, due to the rotational degrees of freedom for any optimal point, it is not possible for F(U)F(\bm{U}) to be strongly convex in any neighborhood around X\bm{X} except in the special case when r=1r=1. Thus, we rely on the approach used by [CLS15], which establishes a sufficient condition that only relies on first-order information along certain trajectories. After showing the sufficient condition holds on F(U)F(\bm{U}), we use standard RIP results to show that this condition also holds for the function f(U)f(\bm{U}).

To begin our analysis, we start with the following formulas for the gradient of f(U)f(\bm{U}) and F(U)F(\bm{U})

with the dependence on U\bm{U} omitted for sake of exposition. The following definition defines a notion of strong convexity along certain trajectories of the function.

The function ff satisfies a regularity condition, denoted by RC(α,β,δ)\mathsf{RC}(\alpha,\beta,\delta), if for all matrices UB(δ)\bm{U}\in B(\delta) the following inequality holds:

If a function satisfies RC(α,β,δ)\mathsf{RC}(\alpha,\beta,\delta), then as long as gradient descent starts from a point U0B(δ)\bm{U}_{0}\in B(\delta), it will have a geometric rate of convergence to the optimum X\bm{X}. This is formalized by the following lemma.

[CLS15] If ff satisfies RC(α,β,δ)\mathsf{RC}(\alpha,\beta,\delta) and U0B(δ)\bm{U}_{0}\in B(\delta), then the gradient descent update

with step size 0<μ2/β0<\mu\leq 2/\beta obeys UτB(δ)\bm{U}_{\tau}\in B(\delta) and

The proof is complete by showing that the regularity condition holds. To this end, we first show in Lemma 5.7 below that the function F(U)F(\bm{U}) satisfies a slightly stronger variant of the regularity condition from Definition 5.5. We then show in Lemma 5.8 that the gradient of ff is always close to the gradient of FF, and in Lemma 5.9 that the gradient of ff is Lipschitz around the optimal value X\bm{X}.

Let F(U)=14UUTXXTF2F(\bm{U})=\frac{1}{4}\left\|\bm{UU}^{\mathsf{T}}-\bm{XX}^{\mathsf{T}}\right\|_{F}^{2}. For all U\bm{U} obeying

We shall prove these three lemmas in Sections 5.2.1, 5.2.2, and 5.2.3. However, we first explain how the regularity condition follows from these three lemmas. To begin, note that

where (a) holds from Cauchy-Schwarz followed by Lemma 5.8, using the fact that δ6r110\delta_{6r}\leq\frac{1}{10} as assumed in the statement of Theorem 3.2 and (b) follows from 2aba2+b22ab\leq a^{2}+b^{2}.

Combining (5.6) with Lemma 5.7 for any U\bm{U} obeying UXR14σr(X)\left\|\bm{U}-\bm{XR}\right\|\leq\frac{1}{4}\sigma_{r}(\bm{X}), we have

which shows that UTXR\bm{U}^{\mathsf{T}}\bm{XR} is a symmetric PSD matrix. Furthermore, note that since

we can conclude that HTXR\bm{H}^{\mathsf{T}}\bm{XR} is symmetric. To avoid carrying R\bm{R} in our equations we perform the change of variable XXR\bm{X}\leftarrow\bm{X}\bm{R}. That is, without loss of generality we assume R=I\bm{R}=\bm{I} and that UTX0\bm{U}^{\mathsf{T}}\bm{X}\succeq 0 and HTX=XTH\bm{H}^{\mathsf{T}}\bm{X}=\bm{X}^{\mathsf{T}}\bm{H}.

Note that for any U\bm{U} obeying dist(U,X)14X(\bm{U},\bm{X})\leq\frac{1}{4}\left\|\bm{X}\right\| we have

Using the latter along with the simplifications discussed above, to prove Lemma (5.5) it suffices to prove

Equation (5.8) can equivalently be written in the form

Here the constants c1,c2,c3,c4c_{1},c_{2},c_{3},c_{4} are defined as

Since c1<c224c3c_{1}<\frac{c_{2}^{2}}{4c_{3}}, to prove (5.9) it thus suffices to require that

2.2 Proof of gradient concentration (Lemma 5.8)

Define Δ:=UUTXXT\bm{\Delta}:=\bm{UU}^{\mathsf{T}}-\bm{XX}^{\mathsf{T}}. Then,

where (a) follows from Lemma 5.1, since rank(Δ)2r\operatorname*{rank}(\bm{\Delta})\leq 2r and rank(HUT)r\operatorname*{rank}(\bm{H}\bm{U}^{\mathsf{T}})\leq r. This proves the first part of the lemma. To prove the second part, by the variational form of the Frobenius norm, we have

The result now follows from HUTFHFUU\left\|\bm{H}\bm{U}^{\mathsf{T}}\right\|_{F}\leq\left\|\bm{H}\right\|_{F}\left\|\bm{U}\right\|\leq\left\|\bm{U}\right\|.

2.3 Proof of Lipschitz gradients around optimal solution (Lemma 5.9)

Define Δ:=UUTXXT\bm{\Delta}:=\bm{UU}^{\mathsf{T}}-\bm{XX}^{\mathsf{T}} and δ:=δ6r\delta:=\delta_{6r}. Suppose we show that

where γ:=1/2U2\gamma:=1/2\lVert\bm{U}\rVert^{2}. Then by 2r2r-RIP we have

which yields the claim after rearranging.

We now focus on proving Equation (5.10). Recall f(U)=AA(Δ)U\nabla f(\bm{U})=\mathcal{A}^{*}\mathcal{A}(\bm{\Delta})\cdot\bm{U}, which implies f(U)F2=A(Δ),A(AA(Δ)UUT)\lVert\nabla f(\bm{U})\rVert_{F}^{2}=\langle\mathcal{A}(\bm{\Delta}),\mathcal{A}(\mathcal{A}^{*}\mathcal{A}(\bm{\Delta})\cdot\bm{UU}^{\mathsf{T}})\rangle. Using this equality we have

where both (a) and (b) hold by Lemma 5.1 since rank(Δ)2r\operatorname*{rank}(\bm{\Delta})\leq 2r, rank(ΔγAA(Δ)UUT)3r\operatorname*{rank}(\bm{\Delta}-\gamma\mathcal{A}^{*}\mathcal{A}(\bm{\Delta})\cdot\bm{UU}^{\mathsf{T}})\leq 3r, and rank(ΔUUT)r\operatorname*{rank}(\bm{\Delta}\bm{UU}^{\mathsf{T}})\leq r. We now control ΔγAA(Δ)UUTF\lVert\bm{\Delta}-\gamma\mathcal{A}^{*}A(\bm{\Delta})\cdot\bm{UU}^{\mathsf{T}}\rVert_{F} from above. Using the variational form of the Frobenius norm,

where (a) holds again by Lemma 5.1 since rank(VUUT)r\operatorname*{rank}(\bm{V}\bm{UU}^{\mathsf{T}})\leq r, and (b) holds since VUUTFVFUUTUUT\lVert\bm{V}\bm{UU}^{\mathsf{T}}\rVert_{F}\leq\left\|\bm{V}\right\|_{F}\lVert\bm{UU}^{\mathsf{T}}\rVert\leq\lVert\bm{UU}^{\mathsf{T}}\rVert. Plugging this upper bound into Equation (5.11), we have

where (a) holds since ΔUUT,ΔU2Δ,Δ\langle\bm{\Delta}\bm{UU}^{\mathsf{T}},\bm{\Delta}\rangle\leq\lVert\bm{U}\rVert^{2}\langle\bm{\Delta},\bm{\Delta}\rangle. Using the fact that δ1/10\delta\leq 1/10 (5.12) implies,

For γ=12U2\gamma=\frac{1}{2\lVert\bm{U}\rVert^{2}}, we have

Plugging this bound into Equation (5.13), we get

3 Proof of initialization (Equation (3.3))

Using Lemma 5.1, we can conclude that ρ(A)\rho(\mathcal{A}) from Lemma 5.2 is bounded by ρ(A)2δ4r1/5\rho(\mathcal{A})\leq 2\delta_{4r}\leq 1/5. Setting M~0=0n×n\widetilde{\bm{M}}_{0}=\bm{0}_{n\times n} and applying Lemma 5.2 to our initialization iterates, we have that

Hence, if we want the RHS to be upper bounded by 14σr(X)\frac{1}{4}\sigma_{r}(\bm{X}), we require

Since XFrX\left\|\bm{X}\right\|_{F}\leq\sqrt{r}\left\|\bm{X}\right\|, it is enough to require that

Similarly, it is easy to check that if τ\tau satisfies (5.14), then

4 Proof for rectangular matrices (Theorem 3.3)

To simplify exposition we aggregate the pairs of matrices (U,V)(\bm{U},\bm{V}), (Uτ,Vτ)(\bm{U}_{\tau},\bm{V}_{\tau}), (X,Y)(\bm{X},\bm{Y}), and (X,Y)(\bm{X},-\bm{Y}) into larger “lifted” matrices as follows

In this lifted space the function gg takes the form

Note that the updates of the Procrustes Flow algorithm in Equations (2.3) and (2.3) are based on the gradients Ug(U,V)\nabla_{\bm{U}}g(\bm{U},\bm{V}) and Vg(U,V)\nabla_{\bm{V}}g(\bm{U},\bm{V}) given by

One can easily verify that this update has the following compact representation in terms of the lifted space

As in the proof for the PSD case, the crux of Theorem 3.3 lies in establishing that the regularity condition

To prove (5.15), we make use of the similarity of the expressions with the PSD case. We start, as before, by defining a reference function F(W):=14WWTZZTF2F(\bm{W}):=\frac{1}{4}\left\|\bm{WW}^{\mathsf{T}}-\bm{ZZ}^{\mathsf{T}}\right\|_{F}^{2} with gradient F(W)=(WWTZZT)W\nabla F(\bm{W})=(\bm{WW}^{\mathsf{T}}-\bm{ZZ}^{\mathsf{T}})\bm{W}. We now state two lemmas relating gg and FF, which together immediately imply (5.15). The first lemma relates the regularity condition of gg to that of FF by utilizing RIP. The second lemma provides a Lipschitz type property for the gradient of gg.

With these lemmas in place we have all the elements to prove (5.15). We use Lemma 5.10, Equation (5.16) together with the inequality 2aba2+b22ab\leq a^{2}+b^{2} to conclude that,

Applying Lemma 5.11 together with δ4r1/25\delta_{4r}\leq 1/25 to (5.4) completes the proof of (5.15) and hence the theorem. All that remains is to prove Lemma 5.10 and 5.11, which we do in Sections 5.4.1 and 5.4.2, respectively.

We begin the proof of Lemma 5.10 with the following RIP inequality about the map B\mathcal{B}. The proof of this lemma is almost identical to the proof of Lemma 5.1, so we omit the details.

Taking inner products of both sides of Equation (5.20) gives us

The first term is simple to control with RIP. Observe that

We now relate the second term to the gradient of FF. By exploiting the structure of Z\bm{Z} and Z~\widetilde{\bm{Z}}, we have

4.2 Lipschitz-gradient type condition for g𝑔g (Lemma 5.11)

The left-hand side of (5.17) has two terms. We start by bounding the second term. Fix any ε>0\varepsilon>0. Then,

Here, (a) holds since by Young’s inequality for any ε>0\varepsilon>0, we have (ab)2ε1+εa2εb2(a-b)^{2}\geq\frac{\varepsilon}{1+\varepsilon}a^{2}-\varepsilon b^{2} and (b) holds since 2M=Z22\left\|\bm{M}\right\|=\left\|\bm{Z}\right\|^{2} and W54Z\left\|\bm{W}\right\|\leq\frac{5}{4}\left\|\bm{Z}\right\|.

With this lemma in place, note that for any γ>0\gamma>0,

Here, (a) follows from Lemma 5.13, (b) follows because W54Z\left\|\bm{W}\right\|\leq\frac{5}{4}\left\|\bm{Z}\right\|, and (c) is another application of Young’s inequality. Combining (5.24) and (5.25) with the hypothesis that δ4r1/25\delta_{4r}\leq 1/25, and setting ε=4/625\varepsilon=4/625, γ=25/74\gamma=25/74 completes the proof.

4.3 Proofs for the initialization phase of Algorithm 2 (Theorem 3.3, Equation (3.6))

We start with the following generalization of Lemma 5.4, the proof of which is deferred to Appendix C.

The rest of the proof proceeds similarly to the proof of Equation (3.3). Using Lemma 5.1, we conclude that ρ(A)\rho(\mathcal{A}) from Lemma 5.2 is bounded by ρ(A)2δ4r2/25\rho(\mathcal{A})\leq 2\delta_{4r}\leq 2/25. Setting M~0=0n1×n2\widetilde{\bm{M}}_{0}=\bm{0}_{n_{1}\times n_{2}} and applying Lemma 5.2 to our initialization iterates, we have that

In order for the RHS to be bounded above by 12σr(M)\frac{1}{2}\sigma_{r}(\bm{M}), τ\tau must satisfy

When this happens, Lemma 5.14 tells us that

In order for this RHS to be bounded above by 18σr(M)\frac{1}{8}\sigma_{r}(\bm{M}), we require that M~τMF22116σr2(M)\left\|\widetilde{\bm{M}}_{\tau}-\bm{M}\right\|_{F}^{2}\leq\frac{\sqrt{2}-1}{16}\sigma_{r}^{2}(\bm{M}). Using Equation (5.26), it is sufficient for τ\tau to satisfy

Since MFrM\left\|\bm{M}\right\|_{F}\leq\sqrt{r}\left\|\bm{M}\right\|, setting T0T_{0} as T03log(rκ)+5T_{0}\geq 3\log(\sqrt{r}\kappa)+5 satisfies both (5.27) and (5.28).

Acknowledgements

BR is generously supported by ONR awards N00014-11-1-0723 and N00014-13-1-0129, NSF awards CCF-1148243 and CCF-1217058, AFOSR award FA9550-13-1-0138, and a Sloan Research Fellowship. RB is generously supported by ONR award N00014-11-1-0723 and the NDSEG Fellowship. This research is supported in part by NSF CISE Expeditions Award CCF-1139158, LBNL Award 7076018, and DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, SAP, The Thomas and Stacey Siebel Foundation, Adatao, Adobe, Apple, Inc., Blue Goji, Bosch, C3Energy, Cisco, Cray, Cloudera, EMC2, Ericsson, Facebook, Guavus, HP, Huawei, Informatica, Intel, Microsoft, NetApp, Pivotal, Samsung, Schlumberger, Splunk, Virdata and VMware.

References

Appendix A Proof of Lemma 5.4

Define H=UXR\bm{H}=\bm{U}-\bm{X}\bm{R}. Similar to the discussion at the beginning of Section 5.2.1, without loss of generality we can assume that (a) R=I\bm{R}=\bm{I}, (b) UTX0\bm{U}^{\mathsf{T}}\bm{X}\succeq 0, and (c) HTX=XTH\bm{H}^{\mathsf{T}}\bm{X}=\bm{X}^{\mathsf{T}}\bm{H}. With these simplifications, establishing the lemma is equivalent to showing that

holds with η=12(21)σr2(X)\eta=\frac{1}{2(\sqrt{2}-1)\sigma_{r}^{2}(\bm{X})}. We note that

Hence, a sufficient condition for (A.1) to hold is

Recalling that HTX=UTXXTX\bm{H}^{\mathsf{T}}\bm{X}=\bm{U}^{\mathsf{T}}\bm{X}-\bm{X}^{\mathsf{T}}\bm{X}, and that UTX0\bm{U}^{\mathsf{T}}\bm{X}\succeq 0, we have

Since UTX0\bm{U}^{\mathsf{T}}\bm{X}\succeq 0, to show (A.2) it suffices to show

The RHS trivially holds, concluding the proof.

Appendix B Proof of Lemma 3.4

From RIP and the assumption that δ2r1/10\delta_{2r}\leq 1/10, we have

We can upper bound the RHS by the following chain of inequalities,

where (a) follows from RIP, (b) follows since eτ320σr(M~)e_{\tau}\leq\frac{3}{20}\sigma_{r}(\widetilde{\bm{M}}) implies that

Appendix C Proof of Lemma 5.14

By simple algebraic manipulations we have

Applying Weyl’s inequality to (C), we have

Applying Lemma 5.4 to the matrices [X1X2Y1Y2]\begin{bmatrix}\bm{X}_{1}&\bm{X}_{2}\\ \bm{Y}_{1}&-\bm{Y}_{2}\end{bmatrix} and [X1X2Y1Y2]\begin{bmatrix}\bm{X}_{1}&\bm{X}_{2}\\ -\bm{Y}_{1}&\bm{Y}_{2}\end{bmatrix} and utilizing equations (C.1) and (C) we conclude that

Let ASBT\bm{A}\bm{S}\bm{B}^{\mathsf{T}} be the singular value decomposition of X1TX2+Y1TY2\bm{X}_{1}^{\mathsf{T}}\bm{X}_{2}+\bm{Y}_{1}^{\mathsf{T}}\bm{Y}_{2}. It is easy to verify that the solution R\bm{R} to the orthogonal Procrustes problem (equivalently the optimal rotation) between [X1X2Y1Y2]\begin{bmatrix}\bm{X}_{1}&\bm{X}_{2}\\ \bm{Y}_{1}&-\bm{Y}_{2}\end{bmatrix} and [X1X2Y1Y2]\begin{bmatrix}\bm{X}_{1}&\bm{X}_{2}\\ -\bm{Y}_{1}&\bm{Y}_{2}\end{bmatrix} is equal to R=[0ABTBAT0]\bm{R}=\begin{bmatrix}\bm{0}&\bm{AB}^{\mathsf{T}}\\ \bm{BA}^{\mathsf{T}}&\bm{0}\end{bmatrix}. From this we conclude that

Plugging the latter in (C.4) concludes the proof.

Appendix D Proof of the first inequality in Equation (3.2)

The first inequality is immediate from the following lemma.