Non-convex Robust PCA

Praneeth Netrapalli, U N Niranjan, Sujay Sanghavi, Animashree Anandkumar, Prateek Jain

Keywords:

Robust PCA, matrix decomposition, non-convex methods, alternating projections.

Introduction

Principal component analysis (PCA) is a common procedure for preprocessing and denoising, where a low rank approximation to the input matrix (such as the covariance matrix) is carried out. Although PCA is simple to implement via eigen-decomposition, it is sensitive to the presence of outliers, since it attempts to “force fit” the outliers to the low rank approximation. To overcome this, the notion of robust PCA is employed, where the goal is to remove sparse corruptions from an input matrix and obtain a low rank approximation. Robust PCA has been employed in a wide range of applications, including background modeling [LHGT04], 3d reconstruction [MZYM11], robust topic modeling [Shi13], and community detection [CSX12], and so on.

Concretely, robust PCA refers to the following problem: given an input matrix M=L+SM=L^{*}+S^{*}, the goal is to decompose it into sparse SS^{*} and low rank LL^{*} matrices. The seminal works of [CSPW11, CLMW11] showed that this problem can be provably solved via convex relaxation methods, under some natural conditions on the low rank and sparse components. While the theory is elegant, in practice, convex techniques are expensive to run on a large scale and have poor convergence rates. Concretely, for decomposing an m×nm\times n matrix, say with mnm\leq n, the best specialized implementations (typically first-order methods) have a per-iteration complexity of O(m2n)O\left({m^{2}n}\right), and require O(1/ϵ)O(1/\epsilon) number of iterations to achieve an error of ϵ\epsilon. In contrast, the usual PCA, which carries out a rank-rr approximation of the input matrix, has O(rmn)O(rmn) complexity per iteration – drastically smaller when rr is much smaller than m,nm,n. Moreover, PCA requires exponentially fewer iterations for convergence: an ϵ\epsilon accuracy is achieved with only O(log(1/ϵ))O\left({\log(1/\epsilon)}\right) iterations (assuming constant gap in singular values).

In this paper, we design a non-convex algorithm which is “best of both the worlds” and bridges the gap between (the usual) PCA and convex methods for robust PCA. Our method has low computational complexity similar to PCA (i.e. scaling costs and convergence rates), and at the same time, has provable global convergence guarantees, similar to the convex methods. Proving global convergence for non-convex methods is an exciting recent development in machine learning. Non-convex alternating minimization techniques have recently shown success in many settings such as matrix completion [Kes12, JNS13, Har13], phase retrieval [NJS13], dictionary learning [AAJ+13], tensor decompositions for unsupervised learning [AGH+12], and so on. Our current work on the analysis of non-convex methods for robust PCA is an important addition to this growing list.

We propose a simple intuitive algorithm for robust PCA with low per-iteration cost and a fast convergence rate. We prove tight guarantees for recovery of sparse and low rank components, which match those for the convex methods. In the process, we derive novel matrix perturbation bounds, when subject to sparse perturbations. Our experiments reveal significant gains in terms of speed-ups over the convex relaxation techniques, especially as we scale the size of the input matrices.

Our method consists of simple alternating (non-convex) projections onto low-rank and sparse matrices. For an m×nm\times n matrix, our method has a running time of O(r2mnlog(1/ϵ))O(r^{2}mn\log(1/\epsilon)), where rr is the rank of the low rank component. Thus, our method has a linear convergence rate, i.e. it requires O(log(1/ϵ))O(\log(1/\epsilon)) iterations to achieve an error of ϵ\epsilon, where rr is the rank of the low rank component LL^{*}. When the rank rr is small, this nearly matches the complexity of PCA, (which is O(rmnlog(1/ϵ))O(rmn\log(1/\epsilon))).

We prove recovery of the sparse and low rank components under a set of requirements which are tight and match those for the convex techniques (up to constant factors). In particular, under the deterministic sparsity model, where each row and each column of the sparse matrix SS^{*} has at most α\alpha fraction of non-zeros, we require that α=O(1/(μ2r))\alpha=O\left({1/(\mu^{2}r)}\right), where μ\mu is the incoherence factor (see Section 3).

In addition to strong theoretical guarantees, in practice, our method enjoys significant advantages over the state-of-art solver for (1), viz., the inexact augmented Lagrange multiplier (IALM) method [CLMW11]. Our method outperforms IALM in all instances, as we vary the sparsity levels, incoherence, and rank, in terms of running time to achieve a fixed level of accuracy. In addition, on a real dataset involving the standard task of foreground-background separation [CLMW11], our method is significantly faster and provides visually better separation.

2 Related Work

Guaranteed methods for robust PCA have received a lot of attention in the past few years, starting from the seminal works of [CSPW11, CLMW11], where they showed recovery of an incoherent low rank matrix LL^{*} through the following convex relaxation method:

[CSPW11] and [CLMW11] consider two different models of sparsity for SS^{*}. Chandrasekaran et al. [CSPW11] consider a deterministic sparsity model, where each row and column of the m×nm\times n matrix, SS, has at most α\alpha fraction of non-zero entries. For guaranteed recovery, they require α=O(1/(μ2rn))\alpha=O\left({1/(\mu^{2}r\sqrt{n})}\right), where μ\mu is the incoherence level of LL^{*}, and rr is its rank. Hsu et al. [HKZ11] improve upon this result to obtain guarantees for an optimal sparsity level of α=O(1/(μ2r))\alpha=O\left({1/(\mu^{2}r)}\right). This matches the requirements of our non-convex method for exact recovery. Note that when the rank r=O(1)r=O(1), this allows for a constant fraction of corrupted entries. Candès et al. [CLMW11] consider a different model with random sparsity and additional incoherence constraints, viz., they require UV<μr/n\|UV^{\top}\|_{\infty}<\mu\sqrt{r}/n. Note that our assumption of incoherence, viz., U(i)<μr/n\|U^{(i)}\|<\mu\sqrt{r/n}, only yields UV<μ2r/n\|UV^{\top}\|_{\infty}<\mu^{2}r/n. The additional assumption enables [CLMW11] to prove exact recovery with a constant fraction of corrupted entries, even when LL^{*} is nearly full-rank. We note that removing the UV\|UV^{\top}\|_{\infty} condition for robust PCA would imply solving the planted clique problem when the clique size is less than n\sqrt{n} [Che13]. Thus, our recovery guarantees are tight upto constants without these additional assumptions.

A number of works have considered modified models under the robust PCA framework, e.g. [ANW12, XCS12]. For instance, Agarwal et al. [ANW12] relax the incoherence assumption to a weaker “diffusivity” assumption, which bounds the magnitude of the entries in the low rank part, but incurs an additional approximation error. Xu et al.[XCS12] impose special sparsity structure where a column can either be non-zero or fully zero.

In terms of state-of-art specialized solvers, [CLMW11] implements the in-exact augmented Lagrangian multipliers (IALM) method and provides guidelines for parameter tuning. Other related methods such as multi-block alternating directions method of multipliers (ADMM) have also been considered for robust PCA, e.g. [WHML13]. Recently, a multi-step multi-block stochastic ADMM method was analyzed for this problem [SAJ14], and this requires 1/ϵ1/\epsilon iterations to achieve an error of ϵ\epsilon. In addition, the convergence rate is tight in terms of scaling with respect to problem size (m,n)(m,n) and sparsity and rank parameters, under random noise models.

There is only one other work which considers a non-convex method for robust PCA [KC12]. However, their result holds only for significantly more restrictive settings and does not cover the deterministic sparsity assumption that we study. Moreover, the projection step in their method can have an arbitrarily large rank, so the running time is still O(m2n)O(m^{2}n), which is the same as the convex methods. In contrast, we have an improved running time of O(r2mn)O(r^{2}mn).

Algorithm

In this section, we present our algorithm for the robust PCA problem. The robust PCA problem can be formulated as the following optimization problem: find L,SL,S s.t. MLSFϵ\|M-L-S\|_{F}\leq\epsilonϵ\epsilon is the desired reconstruction error and

LL lies in the set of low-rank matrices,

A natural algorithm for the above problem is to iteratively project MLM-L onto the set of sparse matrices to update SS, and then to project MSM-S onto the set of low-rank matrices to update LL. Alternatively, one can view the problem as that of finding a matrix LL in the intersection of the following two sets: a) L={\mathcal{L}=\{ set of rank-rr matrices}\}, b) \mathcal{S}_{M}=\{M-S,\ where SS is a sparse matrix}\}. Note that these projections can be done efficiently, even though the sets are non-convex. Hard thresholding (HT) is employed for projections on to sparse matrices, and singular value decomposition (SVD) is used for projections on to low rank matrices. Rank-11 case: We first describe our algorithm for the special case when LL^{*} is rank 1. Our algorithm performs an initial hard thresholding to remove very large entries from input MM. Note that if we performed the projection on to rank-11 matrices without the initial hard thresholding, we would not make any progress since it is subject to large perturbations. We alternate between computing the rank-11 projection of MSM-S, and performing hard thresholding on MLM-L to remove entries exceeding a certain threshold. This threshold is gradually decreased as the iterations proceed, and the algorithm is run for a certain number of iterations (which depends on the desired reconstruction error). General rank case: When LL^{*} has rank r>1r>1, a naive extension of our algorithm consists of alternating projections on to rank-rr matrices and sparse matrices. However, such a method has poor performance on ill-conditioned matrices. This is because after the initial thresholding of the input matrix MM, the sparse corruptions in the residual are of the order of the top singular value (with the choice of threshold as specified in the algorithm). When the lower singular values are much smaller, the corresponding singular vectors are subject to relatively large perturbations and thus, we cannot make progress in improving the reconstruction error. To alleviate the dependence on the condition number, we propose an algorithm that proceeds in stages. In the k\mboxthk^{{\mbox{\tiny th}}} stage, the algorithm alternates between rank-kk projections and hard thresholding for a certain number of iterations. We run the algorithm for rr stages, where rr is the rank of LL^{*}. Intuitively, through this procedure, we recover the lower singular values only after the input matrix is sufficiently denoised, i.e. sparse corruptions at the desired level have been removed. Figure 1 shows a pictorial representation of the alternating projections in different stages.

Parameters: As can be seen, the only real parameter to the algorithm is β\beta, used in thresholding, which represents “spikiness” of LL^{*}. That is if the user expects LL^{*} to be “spiky” and the sparse part to be heavily diffused, then higher value of β\beta can be provided. In our implementation, we found that selecting β\beta aggressively helped speed up recovery of our algorithm. In particular, we selected β=1/n\beta=1/\sqrt{n}.

Complexity: The complexity of each iteration within a single stage is O(kmn)O(kmn), since it involves calculating the rank-kk approximationNote that we only require a rank-kk approximation of the matrix rather than the actual singular vectors. Thus, the computational complexity has no dependence on the gap between the singular values. of an m×nm\times n matrix (done e.g. via vanilla PCA). The number of iterations in each stage is O(log(1/ϵ))O\left({\log\left(1/\epsilon\right)}\right) and there are at most rr stages. Thus the overall complexity of the entire algorithm is then O(r2mnlog(1/ϵ))O(r^{2}mn\log(1/\epsilon)). This is drastically lower than the best known bound of O(m2n/ϵ)O\left({m^{2}n/\epsilon}\right) on the number of iterations required by convex methods, and just a factor rr away from the complexity of vanilla PCA.

Analysis

In this section, we present our main result on the correctness of AltProj. We assume the following conditions:

LL^{*} is μ\mu-incoherent, i.e., if L=UΣ(V)L^{*}=U^{*}\Sigma^{*}(V^{*})^{\top} is the SVD of LL^{*}, then (U)i2μrm,\|(U^{*})^{i}\|_{2}\leq\frac{\mu\sqrt{r}}{\sqrt{m}}, 1im\forall 1\leq i\leq m and (V)i2μrn\|(V^{*})^{i}\|_{2}\leq\frac{\mu\sqrt{r}}{\sqrt{n}}, 1in\forall 1\leq i\leq n, where (U)i(U^{*})^{i} and (V)i(V^{*})^{i} denote the ithi^{\textrm{th}} rows of UU^{*} and VV^{*} respectively.

Each row and column of SS have at most α\alpha fraction of non-zero entries such that α1512μ2r\alpha\leq\frac{1}{512\mu^{2}r}.

Note that in general, it is not possible to have a unique recovery of low-rank and sparse components. For example, if the input matrix MM is both sparse and low rank, then there is no unique decomposition (e.g. M=e1e1M=\bm{e}_{1}\bm{e}_{1}^{\top}). The above conditions ensure uniqueness of the matrix decomposition problem.

Additionally, we set the parameter β\beta in Algorithm 1 be set as β=4μ2rmn\beta=\frac{4\mu^{2}r}{\sqrt{mn}}.

We now establish that our proposed algorithm recovers the low rank and sparse components under the above conditions.

Under conditions (L1)(L1), (L2)(L2) and SS^{*}, and choice of β\beta as above, the outputs L^\widehat{L} and S^\widehat{S} of Algorithm 1 satisfy:

Remark (tight recovery conditions): Our result is tight up to constants, in terms of allowable sparsity level under the deterministic sparsity model. In other words, if we exceed the sparsity limit imposed in S1, it is possible to construct instances where there is no unique decompositionFor instance, consider the n×nn\times n matrix which has rr copies of the all ones matrix, each of size nr\frac{n}{r}, placed across the diagonal. We see that this matrix has rank rr and is incoherent with parameter μ=1\mu=1. Note that a fraction of α=O(1/r)\alpha=O\left({1/r}\right) sparse perturbations suffice to erase one of these blocks making it impossible to recover the matrix.. Our conditions L1, L2 and S1 also match the conditions required by the convex method for recovery, as established in [HKZ11]. Remark (convergence rate): Our method has a linear rate of convergence, i.e. O(log(1/ϵ))O(\log(1/\epsilon)) to achieve an error of ϵ\epsilon, and hence we provide a strongly polynomial method for robust PCA. In contrast, the best known bound for convex methods for robust PCA is O(1/ϵ)O(1/\epsilon) iterations to converge to an ϵ\epsilon-approximate solution.

Under conditions (L1)(L1), (L2)(L2) and SS^{*}, and choice of β\beta as in Theorem 1, when the noise Nσr(L)100n\left\lVert N^{*}\right\rVert_{\infty}\leq\frac{\sigma_{r}(L^{*})}{100n},the outputs L^,S^\widehat{L},\widehat{S} of Algorithm 1 satisfy:

We now present the key steps in the proof of Theorem 1. A detailed proof is provided in the appendix.

Step I: Reduce to the symmetric case, while maintaining incoherence of L\bm{L^{*}} and sparsity of S\bm{S^{*}}. Using standard symmetrization arguments, we can reduce the problem to the symmetric case, where all the matrices involved are symmetric. See appendix for details on this step.

Step II: Show decay in LL\|L-L^{*}\|_{\infty} after projection onto the set of rank-kk matrices. The tt-th iterate L(t+1)L^{(t+1)} of the kk-th stage is given by L(t+1)=Pk(L+SS(t))L^{(t+1)}=P_{k}(L^{*}+S^{*}-S^{(t)}). Hence, L(t+1)L^{(t+1)} is obtained by using the top principal components of a perturbation of LL^{*} given by L+(SS(t))L^{*}+(S^{*}-S^{(t)}). The key step in our analysis is to show that when an incoherent and low-rank LL^{*} is perturbed by a sparse matrix SS(t)S^{*}-S^{(t)}, then L(t+1)L\|L^{(t+1)}-L^{*}\|_{\infty} is small and is much smaller than SS(t)|S^{*}-S^{(t)}|_{\infty}. The following lemma formalizes the intuition; see the appendix for a detailed proof.

Let L,SL^{*},S^{*} be symmetric and satisfy the assumptions of Theorem 1 and let S(t)S^{(t)} and L(t)L^{(t)} be the ttht^{\textrm{th}} iterates of the kthk^{\textrm{th}} stage of Algorithm 1. Let σ1,,σn\sigma_{1}^{*},\dots,\sigma_{n}^{*} be the eigenvalues of LL^{*}, s.t., σ1σr|\sigma_{1}^{*}|\geq\dots\geq|\sigma_{r}^{*}|. Then, the following holds:

Moreover, the outputs L^\widehat{L} and S^\widehat{S} of Algorithm 1 satisfy:

Step III: Show decay in SS\|S-S^{*}\|_{\infty} after projection onto the set of sparse matrices. We next show that if L(t+1)L\|L^{(t+1)}-L^{*}\|_{\infty} is much smaller than S(t)S\|S^{(t)}-S^{*}\|_{\infty} then the iterate S(t+1)S^{(t+1)} also has a much smaller error (w.r.t. SS^{*}) than S(t)S^{(t)}. The above given lemma formally provides the error bound.

Experiments

We now present an empirical study of our AltProj method. The goal of this study is two-fold: a) establish that our method indeed recovers the low-rank and sparse part exactly, without significant parameter tuning, b) demonstrate that AltProj is significantly faster than Conv-RPCA (see (1)); we solve Conv-RPCA using the IALM method [CLMW11], a state-of-the-art solver [LCM10]. We implemented our method in Matlab and used a Matlab implementation of the IALM method by [LCM10].

We consider both synthetic experiments and experiments on real data involving the problem of foreground-background separation in a video. Each of our results for synthetic datasets is averaged over 55 runs.

Parameter Setting: Our pseudo-code (Algorithm 1) prescribes the threshold ζ\zeta in Step 4, which depends on the knowledge of the singular values of the low rank component LL^{*}. Instead, in the experiments, we set the threshold at the (t+1)(t+1)-th step of kk-th stage as ζ=μσk+1(MS(t))n\zeta=\frac{\mu\sigma_{k+1}(M-S^{(t)})}{\sqrt{n}}. For synthetic experiments, we employ the μ\mu used for data generation, and for real-world datasets, we tune μ\mu through cross-validation. We found that the above thresholding provides exact recovery while speeding up the computation significantly. We would also like to note that [CLMW11] sets the regularization parameter λ\lambda in Conv-RPCA (1) as 1/n1/\sqrt{n} (assuming mnm\leq n). However, we found that for problems with large incoherence such a parameter setting does not provide exact recovery. Instead, we set λ=μ/n\lambda=\mu/\sqrt{n} in our experiments.

There are three key problem parameters for RPCA with a fixed matrix size: a) sparsity of SS^{*}, b) incoherence of LL^{*}, c) rank of LL^{*}. We investigate performance of both AltProj and IALM by varying each of the three parameters while fixing the others. In our plots (see Figure 2), we report computational time required by each of the two methods for decomposing MM into L+SL+S up to a relative error (MLSF/MF\|M-L-S\|_{F}/\|M\|_{F}) of 10310^{-3}. Figure 2 shows that AltProj scales significantly better than IALM for increasingly dense SS^{*}. We attribute this observation to the fact that as S0\|S^{*}\|_{0} increases, the problem is “harder” and the intermediate iterates of IALM have ranks significantly larger than rr. Our intuition is confirmed by Figure 2 (b), which shows that when density (α\alpha) of SS^{*} is 0.40.4 then the intermediate iterates of IALM can be of rank over 500500 while the rank of LL^{*} is only 55. We observe a similar trend for the other parameters, i.e., AltProj scales significantly better than IALM with increasing incoherence parameter μ\mu (Figure 2 (c)) and increasing rank (Figure 2 (d)). See Appendix C for additional plots.

Real-world datasets: Next, we apply our method to the problem of foreground-background (F-B) separation in a video [LHGT04]. The observed matrix MM is formed by vectorizing each frame and stacking them column-wise. Intuitively, the background in a video is the static part and hence forms a low-rank component while the foreground is a dynamic but sparse perturbation.

Here, we used two benchmark datasets named Escalator and Restaurant dataset. The Escalator dataset has 34173417 frames at a resolution of 160×130160\times 130. We first applied the standard PCA method for extracting low-rank part. Figure 3 (b) shows the extracted background from the video. There are several artifacts (shadows of people near the escalator) that are not desirable. In contrast, both IALM and AltProj obtain significantly better F-B separation (see Figure 3(c), (d)). Interestingly, AltProj removes the steps of the escalator which are moving and arguably are part of the dynamic foreground, while IALM keeps the steps in the background part. Also, our method is significantly faster, i.e., our method, which takes 63.2s63.2s is about 2626 times faster than IALM, which takes 1688.9s1688.9s.

Restaurant dataset: Figure 4 shows the comparison of AltProj and IALM on a subset of the “Restaurant” dataset where we consider the last 20552055 frames at a resolution of 120×160120\times 160. AltProj was around 1919 times faster than IALM. Moreover, visually, the background extraction seems to be of better quality (for example, notice the blur near top corner counter in the IALM solution). Plot(b) shows the PCA solution and that also suffers from a similar blur at the top corner of the image, while the background frame extracted by AltProj does not have any noticeable artifacts.

Conclusion

In this work, we proposed a non-convex method for robust PCA, which consists of alternating projections on to low rank and sparse matrices. We established global convergence of our method under conditions which match those for convex methods. At the same time, our method has much faster running times, and has superior experimental performance. This work opens up a number of interesting questions for future investigation. While we match the convex methods, under the deterministic sparsity model, studying the random sparsity model is of interest. Our noisy recovery results assume deterministic noise; improving the results under random noise needs to be investigated. There are many decomposition problems beyond the robust PCA setting, e.g. structured sparsity models, robust tensor PCA problem, and so on. It is interesting to see if we can establish global convergence for non-convex methods in these settings.

Acknowledgements

AA and UN would like to acknowledge NSF grant CCF-1219234, ONR N00014-14-1-0665, and Microsoft faculty fellowship. SS would like to acknowledge NSF grants 1302435, 0954059, 1017525 and DTRA grant HDTRA1-13-1-0024. PJ would like to acknowledge Nikhil Srivastava and Deeparnab Chakrabarty for several insightful discussions during the course of the project.

References

Appendix A Proof of Theorem 1

We will start with some preliminary lemmas. The first lemma is the well known Weyl’s inequality in the matrix setting[Bha97].

Suppose B=A+EB=A+E be an n×nn\times n matrix. Let λ1,,λn\lambda_{1},\cdots,\lambda_{n} and σ1,,σn\sigma_{1},\cdots,\sigma_{n} be the eigenvalues of BB and AA respectively such that λ1λn\lambda_{1}\geq\cdots\geq\lambda_{n} and σ1σn\sigma_{1}\geq\cdots\geq{\sigma_{n}}. Then we have:

The following lemma is the Davis-Kahan theorem[Bha97], specialized for rank-11 matrices.

Suppose B=A+E{B}={A}+{E}. Let A=u(u){A}=\bm{u}^{*}{(\bm{u}^{*})}^{\top} be a rank-11 matrix with unit spectral norm. Suppose further that E2<12\left\lVert{E}\right\rVert_{2}<\frac{1}{2}. Then, we have:

where λ\lambda and u\bm{u} are the top eigenvalue eigenvector pair of B{B}.

As outlined in Section 3.1 (and formalized in the proof of Theorem 1), it is sufficient to prove the correctness of Algorithm 1 for the case of symmetric matrices. So, most of the lemmas we prove in this section assume that the matrices are symmetric.

Let x,yx,y be unit vectors such that S2=xTSy=ijxiyjSij\|S\|_{2}=x^{T}Sy=\sum_{ij}x_{i}y_{j}S_{ij}. Then, using ab(a2+b2)/2a\cdot b\leq(a^{2}+b^{2})/2, we have:

where the last inequality follows from the fact that SS has at most αn\alpha n non-zeros per row and per column. ∎

We prove the lemma using mathematical induction.

Base Case (p=0p=0): This is just a restatement of the incoherence of UU.

In what follows, we prove a number of lemmas concerning the structure of L(t)L^{(t)} and E(t):=SS(t){E}^{(t)}:=S^{*}-S^{(t)}. The following lemma shows that the threshold in (2) is close to that with MS(t)M-S^{(t)} replaced by LL^{*}.

Let L,SL^{*},S^{*} be symmetric and satisfy the assumptions of Theorem 1 and let S(t)S^{(t)} be the ttht^{\textrm{th}} iterate of the kthk^{\textrm{th}} stage of Algorithm 1. Let σ1,,σr\sigma_{1}^{*},\dots,\sigma_{r}^{*} be the eigenvalues of LL^{*}, such that σ1σr|\sigma_{1}^{*}|\geq\dots\geq|\sigma_{r}^{*}| and λ1,,λn\lambda_{1},\cdots,\lambda_{n} be the eigenvalues of MS(t)M-S^{(t)} such that λ1λn\left\lvert\lambda_{1}\right\rvert\geq\cdots\geq\left\lvert\lambda_{n}\right\rvert. Recall that E(t):=SS(t){E}^{(t)}:=S^{*}-S^{(t)}. Suppose further that

E(t)8μ2rn(σk+1+(12)t1σk)\left\lVert{E}^{(t)}\right\rVert_{\infty}\leq\frac{8\mu^{2}r}{n}\left(|\sigma_{k+1}^{*}|+\left(\frac{1}{2}\right)^{t-1}|\sigma_{k}^{*}|\right), and

Supp(E(t))Supp(S)\textrm{Supp}\left({E}^{(t)}\right)\subseteq\textrm{Supp}\left(S^{*}\right).

Note that MS(t)=L+E(t)M-S^{(t)}=L^{*}+{E}^{(t)}. Now, using Lemmas 2 and 4, we have:

where γt:=(σk+1+(12)t1σk)\gamma_{t}:=\left(|\sigma_{k+1}^{*}|+\left(\frac{1}{2}\right)^{t-1}|\sigma_{k}^{*}|\right). That is, λk+1σk+18μ2rαγt\left\lvert|\lambda_{k+1}|-|\sigma_{k+1}^{*}|\right\rvert\leq 8\mu^{2}r\alpha\gamma_{t}. Similarly, λkσk8μ2rαγt\left\lvert|\lambda_{k}|-|\sigma_{k}^{*}|\right\rvert\leq 8\mu^{2}r\alpha\gamma_{t}. So we have:

where the last inequality follows from the bound α1512μ2r\alpha\leq\frac{1}{512\mu^{2}r}. ∎

Assume the notation of Lemma 6. Also, let L(t),S(t)L^{(t)},S^{(t)} be the ttht^{\textrm{th}} iterates of kthk^{\textrm{th}} stage of Algorithm 1 and L(t+1),S(t+1)L^{(t+1)},S^{(t+1)} be the (t+1)th(t+1)^{\textrm{th}} iterates of the same stage. Also, recall that E(t):=SS(t){E}^{(t)}:=S^{*}-S^{(t)} and E(t+1):=SS(t+1){E}^{(t+1)}:=S^{*}-S^{(t+1)}. Suppose further that

E(t)8μ2rn(σk+1+(12)t1σk)\left\lVert{E}^{(t)}\right\rVert_{\infty}\leq\frac{8\mu^{2}r}{n}\left(|\sigma_{k+1}^{*}|+\left(\frac{1}{2}\right)^{t-1}|\sigma_{k}^{*}|\right), and

Supp(E(t))Supp(S)\textrm{Supp}\left({E}^{(t)}\right)\subseteq\textrm{Supp}\left(S^{*}\right).

Let L(t+1)=Pk(MS(t))=UΛUL^{(t+1)}=P_{k}(M-S^{(t)})=U\Lambda{U}^{\top} be the eigenvalue decomposition of L(t+1)L^{(t+1)}. Also, recall that MS(t)=L+E(t)M-S^{(t)}=L^{*}+{E}^{(t)}. Then, for every eigenvector ui\bm{u}_{i} of L(t+1)L^{(t+1)}, we have

Note that we used Lemmas 2 and 4 to guarantee the existence of (IE(t)λi)1{\left(I-\frac{{E}^{(t)}}{\lambda_{i}}\right)}^{-1}. Hence,

We now bound the two terms on the right hand side above.

where we denote UΣ(U)U^{*}\Sigma^{*}{(U^{*})}^{\top} to be the SVD of LL^{*}. Let L+E(t)=UΛU+U~Λ~U~L^{*}+{E}^{(t)}=U\Lambda{U}^{\top}+\widetilde{U}\widetilde{\Lambda}{\widetilde{U}}^{\top} be the eigenvalue decomposition of L+E(t)L^{*}+{E}^{(t)}. Note that U~U=0{\widetilde{U}}^{\top}U=0. Recall that, UΛU=Pk(MS(t))=Pk(L+E(t))=L(t+1)U\Lambda{U}^{\top}=P_{k}(M-S^{(t)})=P_{k}(L^{*}+{E}^{(t)})=L^{(t+1)}. Also note that,

Now, we will bound the (p,q)th(p,q)^{\textrm{th}} term of p+q1(E(t))pLUΛ(p+q+1)UL(E(t))q\sum_{p+q\geq 1}\left\lVert\left({E}^{(t)}\right)^{p}L^{*}U\Lambda^{-(p+q+1)}{U}^{\top}L^{*}\left({E}^{(t)}\right)^{q}\right\rVert_{\infty}:

where ζ1\zeta_{1} follows from Lemma 5 and the incoherence of LL^{*}. Now, similar to (8), we have:

Using the above bound, and the assumption on E(t)\left\lVert{E}^{(t)}\right\rVert_{\infty}:

where we used Lemma 4 and the assumption on E(t)\left\lVert{E}^{(t)}\right\rVert_{\infty}. ∎

We used the following technical lemma in the proof of Lemma 7.

Assume the notation of Lemma 7. Suppose further that

E(t)8μ2rn(σk+1+(12)t1σk)\left\lVert{E}^{(t)}\right\rVert_{\infty}\leq\frac{8\mu^{2}r}{n}\left(|\sigma_{k+1}^{*}|+\left(\frac{1}{2}\right)^{t-1}|\sigma_{k}^{*}|\right), and

Supp(E(t))Supp(S)\textrm{Supp}\left({E}^{(t)}\right)\subseteq\textrm{Supp}\left(S^{*}\right).

The result follows by using the bound on E(t)\left\lVert{E}^{(t)}\right\rVert_{\infty}. ∎

The following lemma bounds the support of E(t+1){E}^{(t+1)} and E(t+1)\left\lVert{E}^{(t+1)}\right\rVert_{\infty}, using an assumption on L(t+1)L\left\lVert L^{(t+1)}-L^{*}\right\rVert_{\infty}.

Supp(E(t+1))Supp(S)\textrm{Supp}\left({E}^{(t+1)}\right)\subseteq\textrm{Supp}\left(S^{*}\right).

E(t+1)7μ2rn(σk+1+(12)tσk)\left\lVert{E}^{(t+1)}\right\rVert_{\infty}\leq\frac{7\mu^{2}r}{n}\left(|\sigma_{k+1}^{*}|+\left(\frac{1}{2}\right)^{t}|\sigma_{k}^{*}|\right), and

We first prove the first conclusion. Recall that,

where ζ=4μ2rn(λk+1+(12)tλk)\zeta=\frac{4\mu^{2}r}{n}\left(|\lambda_{k+1}|+\left(\frac{1}{2}\right)^{t}|\lambda_{k}|\right) is as defined in Algorithm 1 and λ1,,λn\lambda_{1},\cdots,\lambda_{n} are the eigenvalues of MS(t)M-S^{(t)} such that λ1λn\left\lvert\lambda_{1}\right\rvert\geq\cdots\geq\left\lvert\lambda_{n}\right\rvert.

We now prove the second conclusion. We consider the following two cases:

MijLij(t+1)>ζ\left\lvert M_{ij}-L^{(t+1)}_{ij}\right\rvert>\zeta: Here, Sij(t+1)=Sij+LijLij(t+1)S^{(t+1)}_{ij}=S^{*}_{ij}+L^{*}_{ij}-L^{(t+1)}_{ij}. Hence, Sij(t+1)SijLijLij(t+1)2μ2rn(σk+1+(12)tσk)|S^{(t+1)}_{ij}-S^{*}_{ij}|\leq|L^{*}_{ij}-L^{(t+1)}_{ij}|\leq\frac{2\mu^{2}r}{n}\left(|\sigma_{k+1}^{*}|+\left(\frac{1}{2}\right)^{t}|\sigma_{k}^{*}|\right).

MijLij(t+1)ζ\left\lvert M_{ij}-L^{(t+1)}_{ij}\right\rvert\leq\zeta: In this case, Sij(t+1)=0S^{(t+1)}_{ij}=0 and Sij+LijLij(t+1)ζ\left\lvert S^{*}_{ij}+L^{*}_{ij}-L^{(t+1)}_{ij}\right\rvert\leq\zeta. So we have, Eij(t+1)=Sijζ+LijLij(t+1)7μ2rn(σk+1+(12)tσk)\left\lvert{E}^{(t+1)}_{ij}\right\rvert=\left\lvert S^{*}_{ij}\right\rvert\leq\zeta+\left\lvert L^{*}_{ij}-L^{(t+1)}_{ij}\right\rvert\leq\frac{7\mu^{2}r}{n}\left(|\sigma_{k+1}^{*}|+\left(\frac{1}{2}\right)^{t}|\sigma_{k}^{*}|\right). The last inequality above follows from Lemma 6.

We are now ready to prove Lemma 1. In fact, we prove the following stronger version.

Recall that in the kthk^{\textrm{th}} stage, the update L(t+1)L^{(t+1)} is given by: L(t+1)=Pk(MS(t))L^{(t+1)}=P_{k}(M-S^{(t)}) and S(t+1)S^{(t+1)} is given by: S(t+1)=Hζ(ML(t+1))S^{(t+1)}=H_{\zeta}(M-L^{(t+1)}). Also, recall that E(t):=SS(t){E}^{(t)}:=S^{*}-S^{(t)} and E(t+1):=SS(t+1){E}^{(t+1)}:=S^{*}-S^{(t+1)}.

We prove the lemma by induction on both kk and tt. For the base case (k=1k=1 and t=1t=-1), we first note that the first inequality on L(0)L\left\lVert L^{(0)}-L^{*}\right\rVert_{\infty} is trivially satisfied. Due to the thresholding step (step 33 in Algorithm 1) and the incoherence assumption on LL^{*}, we have:

So the base case of induction is satisfied.

We first do the inductive step over tt (for a fixed kk). By inductive hypothesis we assume that: a) E(t)8μ2rn(σk+1+(12)t1σk)\left\lVert{E}^{(t)}\right\rVert_{\infty}\leq\frac{8\mu^{2}r}{n}\left(|\sigma_{k+1}^{*}|+\left(\frac{1}{2}\right)^{t-1}|\sigma_{k}^{*}|\right), b) Supp(E(t))Supp(S)\textrm{Supp}\left({E}^{(t)}\right)\subseteq\textrm{Supp}\left(S^{*}\right). Then by Lemma 7, we have:

E(t+1)8μ2rn(σk+1+(12)tσk)\left\lVert{E}^{(t+1)}\right\rVert_{\infty}\leq\frac{8\mu^{2}r}{n}\left(|\sigma_{k+1}^{*}|+\left(\frac{1}{2}\right)^{t}|\sigma_{k}^{*}|\right), and

Supp(E(t+1))Supp(S)\textrm{Supp}\left({E}^{(t+1)}\right)\subseteq\textrm{Supp}\left(S^{*}\right).

This finishes the induction over tt. Note that we show a stronger bound than necessary on E(t+1)\left\lVert{E}^{(t+1)}\right\rVert_{\infty}.

We now do the induction over kk. Suppose the hypothesis holds for stage kk. Let TT denote the number of iterations in each stage. We first obtain a lower bound on TT. Since

we see that T10log(3μ2rσ1/ϵ)T\geq 10\log\left(3\mu^{2}r\left\lvert\sigma_{1}^{*}\right\rvert/\epsilon\right). So, at the end of stage kk, we have:

E(T)7μ2rn(σk+1+(12)Tσk)7μ2rσk+1n+ϵ10n\left\lVert{E}^{(T)}\right\rVert_{\infty}\leq\frac{7\mu^{2}r}{n}\left(|\sigma_{k+1}^{*}|+\left(\frac{1}{2}\right)^{T}|\sigma_{k}^{*}|\right)\leq\frac{7\mu^{2}r\left\lvert\sigma_{k+1}^{*}\right\rvert}{n}+\frac{\epsilon}{10n}, and

Supp(E(T))Supp(S)\textrm{Supp}\left({E}^{(T)}\right)\subseteq\textrm{Supp}\left(S^{*}\right).

Lemmas 4 and 2 tell us that σk+1(MS(T))σk+1E(T)2α(7μ2rσk+1+ϵ)\left\lvert\sigma_{k+1}\left(M-S^{(T)}\right)-\left\lvert\sigma_{k+1}^{*}\right\rvert\right\rvert\leq\left\lVert{E}^{(T)}\right\rVert_{2}\leq\alpha\left(7\mu^{2}r\left\lvert\sigma_{k+1}^{*}\right\rvert+\epsilon\right). We will now consider two cases:

Algorithm 1 terminates: This means that βσk+1(MS(T))<ϵ2n\beta\sigma_{k+1}\left(M-S^{(T)}\right)<\frac{\epsilon}{2n} which then implies that σk+1<ϵ6μ2r\left\lvert\sigma_{k+1}^{*}\right\rvert<\frac{\epsilon}{6\mu^{2}r}. So we have:

This proves the statement about L^\widehat{L}. A similar argument proves the claim on S^S\left\lVert\widehat{S}-S^{*}\right\rVert_{\infty}. The claim on Supp(S^)\textrm{Supp}\left(\widehat{S}\right) follows since Supp(E(T))Supp(S)\textrm{Supp}\left({E}^{(T)}\right)\subseteq\textrm{Supp}\left(S^{*}\right).

Algorithm 1 continues to stage (k+1)\bm{\left(k+1\right)}: This means that βσk+1(L(T))ϵ2n\beta\sigma_{k+1}\left(L^{(T)}\right)\geq\frac{\epsilon}{2n} which then implies that σk+1>ϵ8μ2r\left\lvert\sigma_{k+1}^{*}\right\rvert>\frac{\epsilon}{8\mu^{2}r}. So we have:

Similarly for L(T)L\left\lVert L^{(T)}-L^{*}\right\rVert_{\infty}.

Using Lemma 1, it suffices to show that the general case can be reduced to the case of symmetric matrices. We will now outline this reduction.

Recall that we are given an m×nm\times n matrix M=L+SM=L^{*}+S^{*} where LL^{*} is the true low-rank matrix and SS^{*} the sparse error matrix. Wlog, let mnm\leq n and suppose βmn<(β+1)m\beta m\leq n<(\beta+1)m, for some β1\beta\geq 1. We then consider the symmetric matrices

and S~=M~L~\widetilde{S}=\widetilde{M}-\widetilde{L}. A simple calculation shows that L~\widetilde{L} is incoherent with parameter 3μ\sqrt{3}\mu and S~\widetilde{S} satisfies the sparsity condition (S1) with parameter α2\frac{\alpha}{\sqrt{2}}. Moreover the iterates of AltProj with input M~\widetilde{M} have similar expressions as in (41) in terms of the corresponding iterates with input MM. This means that it suffices to obtain the same guarantees for Algorithm 1 for the symmetric case. Lemma 1 does precisely this, proving the theorem. ∎

Appendix B Proof of Theorem 2

In this section, we prove Theorem 2. The roadmap of the proofs in this section is essentially the same as that in Appendix A.

In what follows, we prove a number of lemmas concerning the structure of L(t)L^{(t)} and E(t):=SS(t){E}^{(t)}:=S^{*}-S^{(t)}. The first lemma is a generalization of Lemma 6 and shows that the threshold in (2) is close to that with M(t)M^{(t)} replaced by LL^{*}.

Let L,S,NL^{*},S^{*},N^{*} be symmetric and satisfy the assumptions of Theorem 2 and let S(t)S^{(t)} be the ttht^{\textrm{th}} iterate of the kthk^{\textrm{th}} stage of Algorithm 1. Let σ1,,σr\sigma_{1}^{*},\dots,\sigma_{r}^{*} be the eigenvalues of LL^{*}, such that σ1σr|\sigma_{1}^{*}|\geq\dots\geq|\sigma_{r}^{*}| and λ1,,λn\lambda_{1},\cdots,\lambda_{n} be the eigenvalues of MS(t)M-S^{(t)} such that λ1λn\left\lvert\lambda_{1}\right\rvert\geq\cdots\geq\left\lvert\lambda_{n}\right\rvert. Recall that E(t):=SS(t){E}^{(t)}:=S^{*}-S^{(t)}. Suppose further that

E(t)8μ2rn(σk+1+(12)t1σk+7N2+8nrN)\left\lVert{E}^{(t)}\right\rVert_{\infty}\leq\frac{8\mu^{2}r}{n}\left(|\sigma_{k+1}^{*}|+\left(\frac{1}{2}\right)^{t-1}|\sigma_{k}^{*}|+7\left\lVert N^{*}\right\rVert_{2}+\frac{8n}{\sqrt{r}}\left\lVert N^{*}\right\rVert_{\infty}\right), and

Supp(S(t))Supp(S)\textrm{Supp}\left(S^{(t)}\right)\subseteq\textrm{Supp}\left(S^{*}\right).

Note that MS(t)=L+N+E(t)M-S^{(t)}=L^{*}+N^{*}+{E}^{(t)}. Now, using Lemmas 2 and 4, we have:

where γt=(σk+1+(12)t1σk+7N2+8nrN)\gamma_{t}=\left(|\sigma_{k+1}^{*}|+\left(\frac{1}{2}\right)^{t-1}|\sigma_{k}^{*}|+7\left\lVert N^{*}\right\rVert_{2}+\frac{8n}{\sqrt{r}}\left\lVert N^{*}\right\rVert_{\infty}\right). That is, λk+1σk+18μ2rαγt\left\lvert|\lambda_{k+1}|-|\sigma_{k+1}^{*}|\right\rvert\leq 8\mu^{2}r\alpha\gamma_{t}. Similarly, λkσk8μ2rαγt\left\lvert|\lambda_{k}|-|\sigma_{k}^{*}|\right\rvert\leq 8\mu^{2}r\alpha\gamma_{t}. So we have:

where the last inequality follows from the bound α1512μ2r\alpha\leq\frac{1}{512\mu^{2}r} and the assumption on N\left\lVert N^{*}\right\rVert_{\infty}. ∎

Assume the notation of Lemma 6. Also, let L(t),S(t)L^{(t)},S^{(t)} be the ttht^{\textrm{th}} iterates of kthk^{\textrm{th}} stage of Algorithm 1 and L(t+1),S(t+1)L^{(t+1)},S^{(t+1)} be the (t+1)th(t+1)^{\textrm{th}} iterates of the same stage. Also, recall that E(t):=SS(t){E}^{(t)}:=S^{*}-S^{(t)} and E(t+1):=SS(t+1){E}^{(t+1)}:=S^{*}-S^{(t+1)}. Suppose further that

E(t)8μ2rn(σk+1+(12)t1σk+7N2+8nrN)\left\lVert{E}^{(t)}\right\rVert_{\infty}\leq\frac{8\mu^{2}r}{n}\left(|\sigma_{k+1}^{*}|+\left(\frac{1}{2}\right)^{t-1}|\sigma_{k}^{*}|+7\left\lVert N^{*}\right\rVert_{2}+\frac{8n}{\sqrt{r}}\left\lVert N^{*}\right\rVert_{\infty}\right), and

Supp(E(t))Supp(S)\textrm{Supp}\left({E}^{(t)}\right)\subseteq\textrm{Supp}\left(S^{*}\right).

Let L(t+1)=Pk(MS(t))=UΛUL^{(t+1)}=P_{k}(M-S^{(t)})=U\Lambda{U}^{\top} be the eigenvalue decomposition of L(t+1)L^{(t+1)}. Also, recall that MS(t)=L+N+E(t)M-S^{(t)}=L^{*}+N^{*}+{E}^{(t)}. Then, for every eigenvector ui\bm{u}_{i} of L(t+1)L^{(t+1)}, we have

Note that we used Lemmas 2 and 4 to guarantee the existence of (IE(t)λi)1{\left(I-\frac{{E}^{(t)}}{\lambda_{i}}\right)}^{-1}. Hence,

We now bound the two terms on the right hand side above.

For the first term, we again use triangle inequality to obtain

where we denote UΣ(U)U^{*}\Sigma^{*}{(U^{*})}^{\top} to be the SVD of LL^{*}. Let L+N+E(t)=UΛU+U~Λ~U~L^{*}+N^{*}+{E}^{(t)}=U\Lambda{U}^{\top}+\widetilde{U}\widetilde{\Lambda}{\widetilde{U}}^{\top} be the eigenvalue decomposition of L+N+E(t)L^{*}+N^{*}+{E}^{(t)}. Note that U~U=0{\widetilde{U}}^{\top}U=0. Recall that, UΛU=Pk(M(t))=L(t)U\Lambda{U}^{\top}=P_{k}(M^{(t)})=L^{(t)}. Also note that,

Coming to the second term of (44), we have:

Using an expansion along the lines of (46), we see that

A similar argument as in (49) gives us the following bound on the last term in (44):

Next, we analyze p+q1(E(t))p(L+N)UΛ(p+q+1)U(L+N)(E(t))q\sum_{p+q\geq 1}\left\lVert({E}^{(t)})^{p}(L^{*}+N^{*})U\Lambda^{-(p+q+1)}{U}^{\top}(L^{*}+N^{*})({E}^{(t)})^{q}\right\rVert_{\infty}. This can again be bounded by four quantities:

where (ζ1)(\zeta_{1}) follows from Lemma 5 and the incoherence of LL^{*}. Now, similar to (46), we have:

where (ζ1)(\zeta_{1}) follows from Lemma 12 and the bound on N\left\lVert N^{*}\right\rVert_{\infty}.

Coming to the second term of (54), we have

where (ζ1)(\zeta_{1}) follows from Lemma 5 and incoherence of UU^{*}. Proceeding along the lines of (56), we obtain:

Plugging the above inequality along with (57) and (59) into (54) gives us:

Using the above bound, and the assumption on E(t)\left\lVert{E}^{(t)}\right\rVert_{\infty}:

where (ζ1)(\zeta_{1}) follows from Lemma 4, and (ζ2)(\zeta_{2}) follows from the assumption on E(t)\left\lVert{E}^{(t)}\right\rVert_{\infty}. ∎

We used the following technical lemma in the proof of Lemma 11.

Assume the notation of Lemma 11. Suppose further that

E(t)8μ2rn(σk+1+(12)t1σk+7N2+8nrN)\left\lVert{E}^{(t)}\right\rVert_{\infty}\leq\frac{8\mu^{2}r}{n}\left(|\sigma_{k+1}^{*}|+\left(\frac{1}{2}\right)^{t-1}|\sigma_{k}^{*}|+7\left\lVert N^{*}\right\rVert_{2}+\frac{8n}{\sqrt{r}}\left\lVert N^{*}\right\rVert_{\infty}\right), and

Supp(E(t))Supp(S)\textrm{Supp}\left({E}^{(t)}\right)\subseteq\textrm{Supp}\left(S^{*}\right).

Using the bound on E(t)\left\lVert{E}^{(t)}\right\rVert_{\infty} and recalling the assumption that

The following lemma bounds the support of E(t+1){E}^{(t+1)} and E(t+1)\left\lVert{E}^{(t+1)}\right\rVert_{\infty}, using an assumption on L(t+1)L\left\lVert L^{(t+1)}-L^{*}\right\rVert_{\infty}.

Supp(E(t+1))Supp(S)\textrm{Supp}\left({E}^{(t+1)}\right)\subseteq\textrm{Supp}\left(S^{*}\right).

E(t+1)7μ2rn(σk+1+(12)tσk+7N2+8nrN)\left\lVert{E}^{(t+1)}\right\rVert_{\infty}\leq\frac{7\mu^{2}r}{n}\left(|\sigma_{k+1}^{*}|+\left(\frac{1}{2}\right)^{t}|\sigma_{k}^{*}|+7\left\lVert N^{*}\right\rVert_{2}+\frac{8n}{\sqrt{r}}\left\lVert N^{*}\right\rVert_{\infty}\right), and

We first prove the first conclusion. Recall that,

where ζ=4μ2rn(λk+1+(12)tλk)\zeta=\frac{4\mu^{2}r}{n}\left(|\lambda_{k+1}|+\left(\frac{1}{2}\right)^{t}|\lambda_{k}|\right) is as defined in Algorithm 1 and λ1,,λn\lambda_{1},\cdots,\lambda_{n} are the eigenvalues of MS(t)M-S^{(t)} such that λ1λn\left\lvert\lambda_{1}\right\rvert\geq\cdots\geq\left\lvert\lambda_{n}\right\rvert.

We now prove the second conclusion. We consider the following two cases:

MijLij(t+1)>ζ\left\lvert M_{ij}-L^{(t+1)}_{ij}\right\rvert>\zeta: Here, Sij(t+1)=Sij+LijLij(t+1)+NijS^{(t+1)}_{ij}=S^{*}_{ij}+L^{*}_{ij}-L^{(t+1)}_{ij}+N^{*}_{ij}. Hence, Sij(t+1)SijLijLij(t+1)+Nij2μ2rn(σk+1+(12)tσk)+N|S^{(t+1)}_{ij}-S^{*}_{ij}|\leq|L^{*}_{ij}-L^{(t+1)}_{ij}|+\left\lvert N^{*}_{ij}\right\rvert\leq\frac{2\mu^{2}r}{n}\left(|\sigma_{k+1}^{*}|+\left(\frac{1}{2}\right)^{t}|\sigma_{k}^{*}|\right)+\left\lVert N^{*}\right\rVert_{\infty}.

MijLij(t+1)ζ\left\lvert M_{ij}-L^{(t+1)}_{ij}\right\rvert\leq\zeta: In this case, Sij(t+1)=0S^{(t+1)}_{ij}=0 and Sij+LijLij(t+1)+Nijζ\left\lvert S^{*}_{ij}+L^{*}_{ij}-L^{(t+1)}_{ij}+N^{*}_{ij}\right\rvert\leq\zeta. So we have, Eij(t+1)=Sijζ+LijLij(t+1)+Nij7μ2rn(σk+1+(12)tσk)+N\left\lvert{E}^{(t+1)}_{ij}\right\rvert=\left\lvert S^{*}_{ij}\right\rvert\leq\zeta+\left\lvert L^{*}_{ij}-L^{(t+1)}_{ij}\right\rvert+\left\lvert N^{*}_{ij}\right\rvert\leq\frac{7\mu^{2}r}{n}\left(|\sigma_{k+1}^{*}|+\left(\frac{1}{2}\right)^{t}|\sigma_{k}^{*}|\right)+\left\lVert N^{*}\right\rVert_{\infty}. The last inequality above follows from Lemma 6.

The following lemma is a generalization of Lemma 1.

Let L,S,NL^{*},S^{*},N^{*} be symmetric and satisfy the assumptions of Theorem 2 and let M(t)M^{(t)} and L(t)L^{(t)} be the ttht^{\textrm{th}} iterates of the kthk^{\textrm{th}} stage of Algorithm 1. Let σ1,,σn\sigma_{1}^{*},\dots,\sigma_{n}^{*} be the eigenvalues of LL^{*}, s.t., σ1σr|\sigma_{1}^{*}|\geq\dots\geq|\sigma_{r}^{*}|. Then, the following holds:

Moreover, the outputs L^\widehat{L} and S^\widehat{S} of Algorithm 1 satisfy:

Recall that in the kthk^{\textrm{th}} stage, the update L(t+1)L^{(t+1)} is given by: L(t+1)=Pk(MS(t))L^{(t+1)}=P_{k}(M-S^{(t)}) and S(t+1)S^{(t+1)} is given by: S(t+1)=Hζ(ML(t+1))S^{(t+1)}=H_{\zeta}(M-L^{(t+1)}). Also, recall that E(t):=SS(t){E}^{(t)}:=S^{*}-S^{(t)} and E(t+1):=SS(t+1){E}^{(t+1)}:=S^{*}-S^{(t+1)}.

We prove the lemma by induction on both kk and tt. For the base case (k=1k=1 and t=1t=-1), we first note that the first inequality on L(0)L\left\lVert L^{(0)}-L^{*}\right\rVert_{\infty} is trivially satisfied. Due to the thresholding step (step 33 in Algorithm 1) and the incoherence assumption on LL^{*}, we have:

So the base case of induction is satisfied.

We first do the inductive step over tt (for a fixed kk). By inductive hypothesis we assume that: a) E(t)8μ2rn(σk+1+(12)t1σk+7N2+8nrN)\left\lVert{E}^{(t)}\right\rVert_{\infty}\leq\frac{8\mu^{2}r}{n}\left(|\sigma_{k+1}^{*}|+\left(\frac{1}{2}\right)^{t-1}|\sigma_{k}^{*}|+7\left\lVert N^{*}\right\rVert_{2}+\frac{8n}{\sqrt{r}}\left\lVert N^{*}\right\rVert_{\infty}\right), b) Supp(E(t))Supp(S)\textrm{Supp}\left({E}^{(t)}\right)\subseteq\textrm{Supp}\left(S^{*}\right). Then by Lemma 11, we have:

E(t+1)7μ2rn(σk+1+(12)tσk+7N2+8nrN)\left\lVert{E}^{(t+1)}\right\rVert_{\infty}\leq\frac{7\mu^{2}r}{n}\left(|\sigma_{k+1}^{*}|+\left(\frac{1}{2}\right)^{t}|\sigma_{k}^{*}|+7\left\lVert N^{*}\right\rVert_{2}+\frac{8n}{\sqrt{r}}\left\lVert N^{*}\right\rVert_{\infty}\right), and

Supp(E(t+1))Supp(S)\textrm{Supp}\left({E}^{(t+1)}\right)\subseteq\textrm{Supp}\left(S^{*}\right).

This finishes the induction over tt. Note that we show a stronger bound than necessary on E(t+1)\left\lVert{E}^{(t+1)}\right\rVert_{\infty}.

We now do the induction over kk. Suppose the hypothesis holds for stage kk. Let TT denote the number of iterations in each stage. We first obtain a lower bound on TT. Since

we see that T10log(3μ2rσ1/ϵ)T\geq 10\log\left(3\mu^{2}r\left\lvert\sigma_{1}^{*}\right\rvert/\epsilon\right). So, at the end of stage kk, we have:

E(T)7μ2rn(σk+1+(12)Tσk+7N2+8nrN)7μ2rσk+1n+ϵ10n\left\lVert{E}^{(T)}\right\rVert_{\infty}\leq\frac{7\mu^{2}r}{n}\left(|\sigma_{k+1}^{*}|+\left(\frac{1}{2}\right)^{T}|\sigma_{k}^{*}|+7\left\lVert N^{*}\right\rVert_{2}+\frac{8n}{\sqrt{r}}\left\lVert N^{*}\right\rVert_{\infty}\right)\leq\frac{7\mu^{2}r\left\lvert\sigma_{k+1}^{*}\right\rvert}{n}+\frac{\epsilon}{10n}, and

Supp(E(T))Supp(S)\textrm{Supp}\left({E}^{(T)}\right)\subseteq\textrm{Supp}\left(S^{*}\right).

Lemmas 4 and 2 tell us that σk+1(MS(T))σk+1E(T)2α(7μ2rσk+1+ϵ)\left\lvert\sigma_{k+1}\left(M-S^{(T)}\right)-\left\lvert\sigma_{k+1}^{*}\right\rvert\right\rvert\leq\left\lVert{E}^{(T)}\right\rVert_{2}\leq\alpha\left(7\mu^{2}r\left\lvert\sigma_{k+1}^{*}\right\rvert+\epsilon\right). We will now consider two cases:

Algorithm 1 terminates: This means that βσk+1(MS(T))<ϵ2n\beta\sigma_{k+1}\left(M-S^{(T)}\right)<\frac{\epsilon}{2n} which then implies that σk+1<ϵ6μ2r\left\lvert\sigma_{k+1}^{*}\right\rvert<\frac{\epsilon}{6\mu^{2}r}. So we have:

This proves the statement about L^\widehat{L}. A similar argument proves the claim on S^S\left\lVert\widehat{S}-S^{*}\right\rVert_{\infty}. The claim on Supp(S^)\textrm{Supp}\left(\widehat{S}\right) follows since Supp(E(T))Supp(S)\textrm{Supp}\left({E}^{(T)}\right)\subseteq\textrm{Supp}\left(S^{*}\right).

Algorithm 1 continues to stage (k+1)\bm{\left(k+1\right)}: This means that βσk+1(L(T))ϵ2n\beta\sigma_{k+1}\left(L^{(T)}\right)\geq\frac{\epsilon}{2n} which then implies that σk+1>ϵ8μ2r\left\lvert\sigma_{k+1}^{*}\right\rvert>\frac{\epsilon}{8\mu^{2}r}. So we have:

Similarly for L(T)L\left\lVert L^{(T)}-L^{*}\right\rVert_{\infty}.

Using Lemma 14, it suffices to show that the general case can be reduced to the case of symmetric matrices. We will now outline this reduction.

Recall that we are given an m×nm\times n matrix M=L+N+SM=L^{*}+N^{*}+S^{*} where LL^{*} is the true low-rank matrix, NN^{*} dense corruption matrix and SS^{*} the sparse error matrix. Wlog, let mnm\leq n and suppose βmn<(β+1)m\beta m\leq n<(\beta+1)m, for some β1\beta\geq 1. We then consider the symmetric matrices

and S~=M~L~\widetilde{S}=\widetilde{M}-\widetilde{L}. A simple calculation shows that L~\widetilde{L} is incoherent with parameter 3μ\sqrt{3}\mu, N~\widetilde{N} satisfies the assumption of Theorem 2 and S~\widetilde{S} satisfies the sparsity condition (S1) with parameter α2\frac{\alpha}{\sqrt{2}}. Moreover the iterates of AltProj with input M~\widetilde{M} have similar expressions as in (102) in terms of the corresponding iterates with input MM. This means that it suffices to obtain the same guarantees for Algorithm 1 for the symmetric case. Lemma 14 does precisely this, proving the theorem. ∎

Appendix C Additional experimental results

Extending Figure 2, the plots in Figure 5 illustrate the point that soft thresholding, i.e., the convex relaxation approach, leads to intermediate solutions with high ranks. Figures 5 (a)-(b) show the variation of the maximum rank of the intermediate low-rank solutions of IALM with rank and incoherence respectively; the results are averaged over 5 runs of the algorithm; we note that as the problem becomes harder, the maximum intermediate rank via soft thresholding (convex approach) increases, and this leads to higher running times. As an example of this phenomenon, Figure 5 (c) shows the rank of the intermediate iterates of IALM for a particular run with n=2000,r=10,α=100/n,μ=3n=2000,r=10,\alpha=100/n,\mu=3; here, while the rank of the final output is 1010, intermediate iterates have a rank as high as 800800. We run our synthetic simulations on a machine with Intel Dual 8-core Xeon (E5-2650) 2.0GHz CPU with 192GB RAM.

Real-world datasets:

We provide some additional results concerning foreground-background separation in videos The datasets are available at http://perception.i2r.a-star.edu.sg/bk_model/bk_index.html. We compare NcRPCA with IALM, and also with the low-rank solution obtained using vanilla PCA; we report the solutions obtained by NcRPCA and IALM methods for decomposing MM into L+SL+S up to a relative error (MLSF/MF\|M-L-S\|_{F}/\|M\|_{F}) of 10310^{-3}. We report the rank and the sparsity of the solutions obtained by the two methods along with the computational time. As mentioned before, the observed matrix MM is formed by vectorizing each frame and stacking them column-wise. For illustration purposes, we arbitrarily select one of the original frames in the sequence of image frames obtained from the video, i.e., one of the columns of MM, and the corresponding columns in LL and SS obtained using NcRPCA and IALM. We run our real data experiments on a machine with Intel Dual 8-core Xeon (E5-2650) 2.0GHz CPU with 192GB RAM.

Shopping Mall dataset: Figure 6 shows the comparison of NcRPCA and IALM on the “Shopping Mall” dataset which has 12861286 frames at a resolution of 256×320256\times 320. NcRPCA achieves a solution of better visual quality (for example, unlike NcRPCA, notice the artifact of the low-rank solution from IALM in the top right corner of the image where the person is walking over the reflection of a light source; also notice the shadows of people in the low-rank part obtained by IALM which are not present in the low-rank solution obtained by NcRPCA), in 292.1s292.1s, compared to IALM, which takes 783.4s783.4s until convergence. NcRPCA obtains a rank 2020 solution for LL with S0=95411896\|S\|_{0}=95411896 whereas IALM obtains a rank 286286 solution for LL with S0=86253965\|S\|_{0}=86253965.

Curtain dataset: We illustrate our recovery on one of the frames (frame 27732773) wherein a person enters a room with a curtain on the background. Figure 7 shows the comparison of NcRPCA and IALM on the “Curtain” dataset which has 29642964 frames at a resolution of 160×128160\times 128. NcRPCA achieves a solution, in 39.5s39.5s, which is of similar visual quality to that of IALM, which takes 989.0s989.0s until convergence. NcRPCA obtains a rank 11 solution for LL with S0=53897769\|S\|_{0}=53897769 whereas IALM obtains a rank 701701 solution for LL with S0=42310582\|S\|_{0}=42310582.