ADMiRA: Atomic Decomposition for Minimum Rank Approximation

Kiryung Lee, Yoram Bresler

I Introduction

One method to solve the inverse problem by exploiting the prior that $X_{0}$ is low-rank is to solve the rank minimization problem P1, to minimize the rank within the affine space defined by $b$ and $\mathcal{A}$ :

In practice, in the presence of measurement noise or modeling error, a more appropriate measurement model is $b=\mathcal{A}X_{0}+\nu$ where the perturbation $\nu$ has bounded Euclidean norm, $\left\|\nu\right\|_{2}\leq\eta$ . In this case, the rank minimization problem is written as

Recently, several authors proposed methods for solving large scale SDP derived from rank minimization. These include interior point methods for SDP, projected subgradient methods, and low-rank parametrization combined with a customized interior point method . These methods can solve larger rank minimization problems, which the general purpose SDP solvers cannot. However, the dimension of the problem is still restricted and some of these methods do not guarantee convergence to the global minimum. Cai, Candes, and Shen proposed singular value thresholding (SVT), which penalizes the objective of nuclear norm minimization by the squared Frobenius norm. The dual of the penalized problem admits a projected subgradient method where the updates can be done by computing truncated singular value decompositions. They have shown that the solution given by SVT converges to the solution to nuclear norm minimization as the penalty parameter increases. However, an analysis of the convergence rate is missing and hence the quality of the solution obtained by this method is not guaranteed. Furthermore, the efficiency of SVT is restricted to the noiseless case where the constraint is affine (i.e., linear equality). Ma, Goldfarb, and Chen proposed a formulation of nuclear norm minimization by using the Bregman divergence that admits an efficient fixed point algorithm, which is also based on the singular value decomposition. They did not provide a convergence rate analysis and the efficiency is also restricted to the noiseless, affine constraint case. Meka et. al. used multiplicative updates and online convex programming to provide an approximate solution to rank minimization. However, their result depends on the (unverified) existence of an oracle that provides the solution to the rank minimization problem with a single linear constraint in constant time.

An alternative method to solve the inverse problem of compressed sensing of a matrix is minimum rank approximation,

Recently, several algorithms have been proposed to solve P2. Halder and Diego proposed an alternating least square approach by exploiting the explicit factorization of a rank- $r$ matrix. Their algorithm is computationally efficient but does not provide any performance guarantee. Keshavan, Oh, and Montanari proposed an algorithm based on optimization over the Grassmann manifold. Their algorithm first finds a good starting point by an operation called trimming and minimizes the objective of P2 using a line search and gradient descent over the Grassmann manifold. They provide a performance guarantee only for the matrix completion problem where the linear operator $\mathcal{A}$ takes a few entries from $X_{0}$ . Moreover, the performance guarantee is restricted to the noiseless case.

Minimum rank approximation, or rank- $r$ approximation for the matrix case, is analogous to $s$ -term approximation for the vector case. Like rank- $r$ matrix approximation, $s$ -term vector approximation is a way to find the sparsest solution of an ill-posed inverse problem in compressed sensing. For $s$ -term approximation, besides efficient greedy heuristics such as Matching Pursuit (MP) and Orthogonal Matching Pursuit (OMP) , there are recent algorithms, which are more efficient than convex relaxation and also have performance guarantees. These include Compressive Sampling Matching Pursuit (CoSaMP) and Subspace Pursuit (SP) . To date, no such algorithms have been available for the matrix case.

In this paper, we propose an iterative algorithm for the rank minimization problem, which is a generalization There is another generalization of CoSaMP, namely model-based CoSaMP . However, this generalization addresses a completely different and unrelated problem: sparse vector approximation subject to a special (e.g., tree) structure. Furthermore, the extensions of CoSaMP to model-based CoSaMP and to ADMiRA are independent: neither one follows from the other, and neither one is a special case of the other. of the CoSaMP algorithm to the matrix case. We call this algorithm “Atomic Decomposition for Minimum Rank Approximation,” abbreviated as ADMiRA. ADMiRA is computationally efficient in the sense that the core computation consists of least squares and truncated singular value decompositions, which are both basic linear algebra problems and admit efficient algorithms. Indeed, ADMiRA is the first guaranteed algorithms among those proposed to solve minimum rank approximation ADMiRA was followed by the algorithm by Keshavan et. al. . This short version will be presented at ISIT’09. . Furthermore, ADMiRA provides a strong performance guarantee for P2 that covers the general case where $X_{0}$ is only approximately low-rank and $b$ contains noise. The strong performance guarantee of ADMiRA is comparable to that of nuclear norm minimization in . In the noiseless case, SVT may be considered a competitor to ADMiRA. However, for the noisy case, SVT involves more than the simple singular value thresholding operation.

Matrix completion is a special case of low-rank matrix approximation from linear measurements where the linear operator takes a few random entries of the unknown matrix. It has received considerable attention owing to its important applications such as collaborative filtering. However, the linear operator in matrix completion does not satisfy the rank-restricted isometry property . Therefore, at the present time, ADMiRA does not have a guarantee for matrix completion. None the less, empirical performance on matrix completion is better than SVT (for the experiments in this paper).

The remaining of this paper is organized as follows: The atomic decomposition and the analogy between the greedy algorithm for the vector case and the matrix case are introduced in Section II. The new algorithm ADMiRA and its performance guarantee are explained in Section III and in Section IV, respectively. By using the tools in Section V, the performance guarantees are derived in Section VI and Section VII. Implementation issues and the computational complexity are discussed in Section VIII and numerical results in Section IX, followed by conclusions. Our exposition of ADMiRA follows the line of Needell and Tropp’s exposition of CoSaMP , to highlight, on the one hand, the close analogy, and on the other hand the differences between the two algorithms and their analysis. Indeed, there exist significant differences between rank- $r$ approximation for the matrix case and $s$ -term approximation for the vector case, which are discussed in some detail.

II Vector vs Matrix

II-B Atomic Decomposition

II-C Generalized Correlation Maximization

it follows that maximizing the correlation implies maximizing the norm of the projection of the image under $A^{H}$ of the residual $b-A\hat{x}$ onto the selected one dimensional subspace.

Applying the selection rule (2) to update $\widehat{X}$ recursively leads to greedy algorithms generalizing MP and OMP to rank minimization.

Next, consider the rule in recent algorithms such as CoSaMP and SP. The selection rule chooses the subset $J$ of $\{1,\ldots,n\}$ with $|J|=s$ defined by

III Algorithm

ADMiRA is guaranteed to converge to the global optimum in at most $6(r+1)$ iterations when the assumptions of ADMiRA in Section IV are satisfied. However, similarly to the vector case , it is more difficult to verify the satisfiability of the assumptions than solve the recovery problem itself, and to date there is no known algorithm to perform this verification. Instead of relying on the theoretical bound on the number of iterations, we use an empirical stopping criterion below. If either the monotone decrease of $\|b-\mathcal{A}\widehat{X}\|_{2}/\left\|b\right\|_{2}$ is broken or $\|b-\mathcal{A}\widehat{X}\|_{2}/\left\|b\right\|_{2}$ falls a given threshold, ADMiRA stops.

IV Main Results: Performance Guarantee

IV-B Performance Guarantee

Subject to the R-RIP, the Atomic Decomposition for Minimum Rank Approximation Algorithm (ADMiRA) has a performance guarantee analogous to that of CoSaMP.

The followings are the assumptions in ADMiRA:

The linear operator $\mathcal{A}$ satisfies $\delta_{4r}(\mathcal{A})\leq 0.04$ .

where $\nu$ is the discrepancy between the measurement and the linear model $\mathcal{A}X_{0}$ . No assumptions are made about the matrix $X_{0}$ underlying the measurement, and it can be arbitrary.

Assumption A2 plays a key role in deriving the performance guarantee of ADMiRA: it enforces the rank-restricted isometry property of the linear operator $\mathcal{A}$ . Although the verification of the satisfiability of A2 is as difficult as or more difficult than the recovery problem itself, as mentioned above, nearly isometric families that satisfy the condition in A2 have been demonstrated .

The performance guarantees are specified in terms of a measure of inherent approximation error, termed unrecoverable energy defined by

where $X_{0,r}$ denotes the best rank- $r$ approximation of $X_{0}$ . The first two terms in $\epsilon$ define a metric of the minimum distance between the “true” matrix $X_{0}$ and a rank- $r$ matrix. This is analogous to the notion of a measure of compressibility of a vector in sparse vector approximation. By the Eckart-Young-Mirsky theorem , no rank- $r$ matrix can come closer to $X_{0}$ in this metric. In particular, the optimal solution to P2 cannot come closer to $X_{0}$ in this metric. The third term is the norm of the measurement noise, which must also limit the accuracy of the approximation provided by a solution to P2.

Let $\widehat{X}_{k}$ denote the estimate of $X_{0}$ in the $k$ -th iteration of ADMiRA. For each $k\geq 0$ , $\widehat{X}_{k}$ satisfies the following recursion:

where $\epsilon$ is the unrecoverable energy. From the above relation, it follows that

Theorem IV.1 shows the geometric convergence of ADMiRA. In fact, convergence in a finite number of steps can be achieved as stated by the following theorem.

After at most $6(r+1)$ iterations, ADMiRA provides a rank- $r$ approximation $\widehat{X}$ of $X_{0}$ , which satisfies

where $\epsilon$ is the unrecoverable energy.

Depending on the spectral properties of the matrix $X_{0}$ , even faster convergence is possible (See Section VII for details).

IV-C Relationship between P1, P2, and ADMiRA

For the noisy measurement case, the linear constraint in P1 is replaced by a quadratic constraint and the rank minimization problem is written as:

V Properties of the Rank-Restricted Isometry

We introduce and prove a number of properties of the rank-restricted isometry. These properties serve as key tools for proving the performance guarantees for ADMiRA in this paper. These properties further extend the analogy between the sparse vector and the low-rank matrix approximation problems (P3 and P2, respectively), and are therefore also of interest in their own right. The proofs are contained in the Appendix.

The rank-restricted isometry constant $\delta_{r}(\mathcal{A})$ is nondecreasing in $r$ .

An operator satisfying the R-RIP satisfies, as a consequence, a number of other properties when composed with other linear operators defined by the atomic decomposition.

The following rank-restricted orthogonality property for the matrix case is analogous to the sparsity-restricted orthogonality property for the vector case (Lemma 2.1 in ).

For the real matrix case, Proposition 12 can be improved by dropping the constant $\sqrt{2}$ . This improvement is achieved by replacing the parallelogram identity in the proof to the version for the real scalar field case. This argument also applies to Corollary 13.

Finally, we relate the R-RIP to the nuclear norm, extending the analogous result from the $r$ -sparse vector case to the rank- $r$ matrix case.

VI Proof of Theorem IV.1

Theorem VI.1 is proved by applying a sequence of lemmata. We generalize the proof of the performance guarantee for CoSaMP to the matrix case by applying the generalized analogy proposed in this paper. The flow and the techniques used in the proofs are similar to those in . However, in the matrix case, there are additional unknowns in the form of the singular vectors. Therefore, the generalization of the proofs in to the matrix case is not straightforward and the proofs are sufficiently different from those for the vector case to warrant detailed exposition. The main steps in the derivation of the performance guarantee are stated in this section and the detailed proofs are in the Appendix.

For the proof, we study the $(k+1)$ -th iteration starting with the previous result in the $k$ -th iteration. Let $X_{0}$ denote the true solution with rank $r$ . Matrix $\widehat{X}$ denotes $\widehat{X}_{k}$ , which is the estimate of $X_{0}$ in the $k$ -th (previous) iteration. Set $\widehat{\Psi}$ is the set of orthogonal atoms obtained in the previous iteration. From $(b-\mathcal{A}\widehat{X})$ , we compute the proxy matrix $\mathcal{A}^{*}(b-\mathcal{A}\widehat{X})$ . Set $\Psi^{\prime}$ is the solution of the following low rank approximation problem:

Lemma VI.2 shows that subject to the rank-restricted isometry property, the set $\Psi^{\prime}$ of atoms chosen in Step 4 of ADMiRA is a good set: it captures $94\%(=1-0.24^{2})$ of the energy of the atoms in $X_{0}$ that were not captured by $\widehat{X}$ , and the effects of additive measurement noise are bounded by a small constant. In other words, the algorithm is guaranteed to make good progress in this step.

Lemma VI.3 shows that the augmented set of atoms $\widetilde{\Psi}$ produced in Step 5 of the algorithm is at least as good in explaining the unknown $X_{0}$ as was the set $\Psi^{\prime}$ in explaining the part of $X_{0}$ not captured by the estimate $\widehat{X}$ from the previous iteration.

As expected, reducing the rank of the estimate $\widetilde{X}$ from $3r$ to $r$ , to produce $\widetilde{X}_{r}$ , increases the approximation error. However, Lemma VI.5 shows that this increase is moderate – by no more than a factor of 2.

The update $\widehat{X}_{k+1}=\widetilde{X}_{r}$ completes the $(k+1)$ -th iteration. Combining all the results in the lemmata provides the proof of Theorem VI.1.

The recursion together with the fact that $\sum_{j=0}^{k}2^{-j}\leq\sum_{j=0}^{\infty}2^{-j}=2$ provide the final result. ∎

VI-B General Matrix Case

Theorem IV.1 is proved by combining Theorem VI.1 and the following lemma, which shows how to convert the mismodeling error (deviations of $X_{0}$ from a low rank matrix) to an equivalent additive measurement noise with a quantified norm.

Let $\widetilde{\nu}=\mathcal{A}(X_{0}-X_{0,r})+\nu$ . Then $b=\mathcal{A}X_{0,r}+\widetilde{\nu}$ .

where the last inequality holds by Proposition V.11. The inequality $\delta_{r}(\mathcal{A})\leq\delta_{4r}(\mathcal{A})\leq 0.04$ implies $\sqrt{1+\delta_{r}(\mathcal{A})}\leq 1.02$ . ∎

Applying the triangle inequality and the above inequality,

Using the upper bound on $\left\|\widetilde{\nu}\right\|_{2}$ yields

where $\epsilon$ is the unrecoverable energy. ∎

VII Required Number of Iterations

Theorem IV.2 provides a uniform bound on the number of iterations required to achieve the guaranteed approximation accuracy. In addition to this uniform iteration bound, Theorem VII.6 in this section shows that even faster convergence may be expected for matrices $X$ with clustered singular values.

In the analysis of the iteration number, the distribution of the singular values of the matrices involved is the only thing that matters. Indeed, the singular vectors do not play any role in the analysis. As a consequence, the proofs for the vector case (CoSaMP) and the matrix case (ADMiRA) are very similar, and the corresponding bounds on the number of iterations coincide. However, for the completeness, we provide the proofs for the matrix case.

For the vector case, the term analogous to the atomic band is the component band defined by

First, the number of iterations for the exactly low-rank case is bounded by the following theorem.

iterations, the estimate $\widehat{X}$ produced by ADMiRA satisfies

We introduce additional notations for the proof of Theorem VII.3. Let $\widehat{X}_{k}$ denote the estimate at the $k$ -th iteration of ADMiRA. For a nonnegative integer $j$ , we define an auxiliary matrix

The proof of Theorem VII.3 is done by a sequence of lemmata. The first lemma presents two possibilities in each iteration of ADMiRA: if the iteration is successful, the approximation error is small; otherwise, the approximation error is dominated by the un-identified portion of the matrix and the approximation error in the next iteration decreases by a constant ratio.

Next, the result is extended to the approximately low-rank case by using, once again, Lemma VI.6.

iterations, the estimate $\widehat{X}$ produced by ADMiRA satisfies

where $\epsilon$ is the unrecoverable energy.

(Theorem IV.2) As a function of $t$ , $(t\log_{4/3}(1+4.3\sqrt{r/t})+6)$ is maximized when $t=r$ . Since $\log_{4/3}5.6<6$ , the number of iterations is at most $6(r+1)$ . Therefore, the approximation error of ADMiRA is achieved within $6(r+1)$ iterations for any matrix $X_{0}$ . ∎

VIII Implementation and Scalability

We analyze the computational complexity of ADMiRA and will show that ADMiRA scales well to large problem instances. Each iteration of ADMiRA consists of procedures requiring the following basic operations: application of $\mathcal{A}$ and $\mathcal{A}^{*}$ , singular value decompositions, and solving a least square problem. We analyze the computational cost of the procedures in terms of the complexity of the basic operations, which will depend on the properties of $\mathcal{A}$ . First note that ADMiRA keeps the matrix variables (except the proxy matrix) in factorized form through their atomic decomposition, which is advantageous for both the computational efficiency and memory requirements. Furthermore, the proxy matrix is often sparse in applications such as the matrix completion problem.

: $\mathcal{A}$ is an arbitrary linear (dense) operator and the costs of computing $\mathcal{A}\widehat{X}$ and $\mathcal{A}^{*}y$ are $O(prmn)$ and $O(pmn)$ , respectively.

: $\mathcal{A}$ is a sparse linear operator – so the $Z_{k}$ have $O(m+n)$ non-zero elements, and and the costs of computing $\mathcal{A}\widehat{X}$ and $\mathcal{A}^{*}y$ are $O(pr(m+n))$ and $O(p(m+n))$ , respectively.

: $\mathcal{A}$ is an extremely sparse linear operator (such as in the matrix completion problem), so the $Z_{k}$ have $O(1)$ nonzeros, and the costs of computing $\mathcal{A}\widehat{X}$ and $\mathcal{A}^{*}y$ are $O(pr)$ and $O(p)$ , respectively.

Solving least square problems: ADMiRA requires the solution of an over-determined system with $p$ equations and $3r$ unknowns. The complexity is $O(pr^{2})$ . Similarly to CoSaMP, the Richardson iteration or the conjugate gradient method can be used to improve the complexity of this part. The convergence of the Richardson iteration is guaranteed owing to the R-RIP assumption of ADMiRA and the complexity is $O(pr)$ .

Applications of $\mathcal{A}$ and $\mathcal{A}^{*}$ are the most demanding procedures of ADMiRA for a dense linear operator $\mathcal{A}$ . These operations are also required in all other algorithms for P1, P1’, or P2. To overcome this computational complexity, the linear operator $\mathcal{A}$ should have some structure that admits efficient computation. Examples include random Toeplitz matrices and randomly subsampled Fourier measurements. For matrix completion, $\mathcal{A}$ is a sparse with $O(1)$ cost per measurement and hence these operations are dominated by the remaining operations. In this case, the computation of the truncated singular value decomposition is the most demanding procedure of ADMiRA. Equipped with the randomized low rank approximation, ADMiRA has complexity of $O(pr^{2}\log r)$ per iteration, or $O(pr^{3}\log r)$ to achieve the guarantee in Theorem IV.2. ADMiRA therefore has complexity linear in the size $p$ of the data, and it scales well to large problems.

IX Numerical Experiment

Table II shows that in most of the examples tested, ADMiRA provides slightly better performance with less computation than SVT . Roughly, the computational complexity of a single iteration of ADMiRA can be compared to two times that of SVT.

We emphasize that all comparisons with SVT were performed for the noiseless exactly low rank matrix case, because the current implementation and theory of SVT do not support the ellipsoidal constraint case. We are not aware of an efficient, scalable algorithm other than ADMiRA that supports the ellipsoidal constraint.

X Conclusion

The rank-restricted isometry constant $\delta_{r}(\mathcal{A})$ can be represented as

where $\sigma_{r,\max}(\mathcal{A})$ and $\sigma_{r,\min}(\mathcal{A})$ are defined by

respectively. As $r$ increases, the feasible sets of both problems increase and hence $\sigma_{r,\max}(\mathcal{A})$ and $\sigma_{r,\min}(\mathcal{A})$ are nondecreasing and nonincreasing, respectively. Therefore, (22) implies that $\delta_{r}(\mathcal{A})$ is nondecreasing in $r$ .

-B Proof of Proposition V.4

This implies that the operator norm of $\mathcal{A}\mathcal{L}_{\Phi}$ is bounded from above by $\sqrt{1+\delta_{r}(\mathcal{A})}$ . Since the adjoint operator $\left[\mathcal{A}\mathcal{L}_{\Phi}\right]^{*}$ has the same operator norm,

Then (9) follows from (24) with $\mathcal{L}_{\Phi}\left[\mathcal{A}\mathcal{L}_{\Phi}\right]^{*}=\mathcal{L}_{\Phi}\mathcal{L}_{\Phi}^{*}\mathcal{A}^{*}=\mathcal{P}_{\Psi}\mathcal{A}^{*}$ .

-C Proof of Proposition 10

-D Proof of Proposition 11

where the first equality holds since $\mathcal{L}_{\Phi}$ is an isometry. By the relationship between $\mathcal{A}\mathcal{L}_{\Phi}$ and $\mathcal{L}_{\Phi}^{*}\mathcal{A}^{*}\mathcal{A}\mathcal{L}_{\Phi}$ , it follow that

-E Proof of Proposition 12

In particular, the inequality holds for $\alpha=\pm 1,\pm i$ where $i=\sqrt{-1}$ . By the parallelogram identity,

-F Proof of Corollary 13

Since $Y$ was arbitrary, we can take $Y=\mathcal{A}^{*}\mathcal{A}\mathcal{P}_{\Psi}X$ . Then

-G Proof of Proposition V.11

We modify the proof the analogous result for the vector case in for our proposition.

The claim of the proposition is equivalent to

It suffices to show that $K\subset S$ . Let $X$ be an element in $K$ . Consider the singular value decomposition of $X$ ,

From the definition of $c_{0}$ , it follows that $c_{0}\leq\left\|X\right\|_{F}$ . Since $X\in K$ , we note

-H Proof of Lemma VI.2

as do $\mathcal{P}_{\Upsilon}$ and $\mathcal{P}_{\Psi^{\prime}}$ ,

From the commutativity, we note that $\mathcal{P}_{\Upsilon}^{\perp}\mathcal{P}_{\Phi}$ and $\mathcal{P}_{\Upsilon}^{\perp}\mathcal{P}_{\Psi^{\prime}}$ are projection operators. Furthermore, each of the projection operators $\mathcal{P}_{\Phi}$ and $\mathcal{P}_{\Psi^{\prime}}$ can be decomposed as the sum of two mutually orthogonal projection operators:

Applying (28) and (29) to (25), invoking the Pythagorean theorem, and removing the common term containing $\mathcal{P}_{\Upsilon}$ gives

First, we derive an upper bound on the right hand side of inequality (30).

Using Proposition V.4 and $|\Psi^{\prime}|\leq 2r$ , the second term of (31) is bounded by

The first term of (31) is further bounded by

Combining the previous results, we have the following upper bound on the right hand side of inequality (30).

Next, we derive a lower bound on the left hand side of inequality (30).

Using Proposition V.4, the second term of (33) is further bounded by

The first term of (33) is further bounded by

where the inequality (34) follows from Proposition 11 with $\mathcal{P}_{\Upsilon}^{\perp}\mathcal{P}_{\Phi}=\mathcal{P}_{\Phi}\mathcal{P}_{\Upsilon}^{\perp}$ and $|\Phi|\leq 2r$ and Corollary 13, and the last inequality (35) follows from the fact that $(\Psi^{\prime})^{\perp}\subset(\Upsilon)^{\perp}$ .

Combining the previous results, we have the following lower bound on the left hand side of inequality (30).

where the second inequality is obtained by maximizing over $\mathcal{P}_{\Upsilon}$ with the constraint

Substituting $\delta_{2r}(\mathcal{A})\leq\delta_{4r}(\mathcal{A})\leq 0.04$ gives the constants in the final inequality.

-I Proof of Lemma VI.3

Since $\widehat{\Psi}\subset\widetilde{\Psi}$ , $\mathcal{P}_{\widehat{\Psi}}^{\perp}\widehat{X}=0$ implies $\mathcal{P}_{\widetilde{\Psi}}^{\perp}\widehat{X}=0$ and hence

where the inequality holds since $\Psi^{\prime}\subset\widetilde{\Psi}$ implies $(\widetilde{\Psi})^{\perp}\subset(\Psi^{\prime})^{\perp}$ .

-J Proof of Lemma VI.4

where the last inequality follows from the R-RIP of $\mathcal{A}$ .

The first term in (38) has the following upper bound

Finally, combining (38), (39), and (40) yields

Applying $\delta_{3r}(\mathcal{A})\leq\delta_{4r}(\mathcal{A})\leq 0.04$ completes the proof.

-K Proof of Lemma VI.5

where the second inequality holds by the definition of the best rank- $r$ approximation.

-L Proof of Lemma VII.4

In the $k$ -th iteration, the generalized correlation maximization rule chooses $\Psi_{k}^{\prime}$ . Let $\widetilde{\Psi}_{k}=\widehat{\Psi}_{k-1}\cup\Psi_{k}^{\prime}$ in the $k$ -th iteration. Since $\widehat{\Psi}_{k}$ is chosen as a subset of $\widetilde{\Psi}_{k}$ ,

If $\left\|\mathcal{P}_{\widehat{\Psi}_{k}}^{\perp}X_{0}\right\|_{F}<30\left\|\nu\right\|_{2}$ for some $k$ , then (41) implies (19). The first possibility has been shown.

Otherwise, we need to show that (20) and (21) hold.

Therefore Theorem VI.1 ensures that (21) holds.

-M Proof of Lemma VII.5

Define $J$ be the set of indices for nonempty atomic bands

(Claim 1) First, we show that $B_{j}\subset\widehat{\Psi}_{k}$ . Assume that $B_{j}\not\subset\widehat{\Psi}_{k}$ . Then there exists $\psi\in B_{j}$ such that $\psi\not\in\widehat{\Psi}_{k}$ . This implies

which is a contradiction. From the assumption, (21) ensures that

where the inequality follows from (47). Also note that by assumption (20) holds. Combining (20) and (48), it follows that

The total number of iterations required to identify $B_{j}$ for all $j\in J$ is at most

Recall the bound on $\left\|Y_{j}\right\|_{F}$ in (18). We use Jensen’s inequality again and simplify the result.

Combining (LABEL:eq:gamean_ineq1) and (51), we have

Taking the logarithm, multiplying by $t$ , and dividing both sides by $\ln\beta$ , we have

-N Proof of Theorem VII.3

This contradicts the assumption that (19) never holds during the first $K$ iterations. Therefore, there exists $k\leq K$ where (19) holds. Repeated application of Theorem VI.1 gives

-O Proof of Theorem VII.6

iteration, the approximation error satisfies

where the third inequality follows from Lemma VI.6.