An Improved Approximation Algorithm for the Column Subset Selection Problem

Christos Boutsidis, Michael W. Mahoney, Petros Drineas

Introduction

We consider the problem of selecting the “best” set of exactly kk columns from an m×nm\times n matrix AA. More precisely, we consider the following Column Subset Selection Problem (CSSP):

is minimized over all possible (nk){n\choose k} choices for the matrix CC. Here, PC=CC+P_{C}=CC^{+} denotes the projection onto the kk-dimensional space spanned by the columns of CC and ξ=2 \mboxor F\xi=2\ \mbox{or}\ F denotes the spectral norm or Frobenius norm.

That is, the goal of the CSSP is to find a subset of exactly kk columns of AA that “captures” as much of AA as possible, with respect to the spectral norm and/or Frobenius norm, in a projection sense. The CSSP has been studied extensively in numerical linear algebra, where it has found applications in, e.g., scientific computing . More recently, a relaxation has been studied in theoretical computer science, where it has been motivated by applications to large scientific and internet data sets .

We briefly comment on the complexity of the problem. Clearly, in O(nk)O(n^{k}) time we can generate all possible matrices CC and thus find the optimal solution in O(nkmnk)O(n^{k}mnk) time. However, from a practical perspective, in data analysis applications of the CSSP (see Section 1.2), nn is often in the order of hundreds or thousands. Thus, in practice, algorithms whose running time depends exponentially on kk are prohibitively slow even if kk is, from a theoretical perspective, a constant. Finally, the NP-hardness of the CSSP (assuming kk is a function of nn) is an open problem. Note, though, that a similar problem, asking for the kk columns of the m×nm\times n matrix AA that maximize the volume of the parallelepiped spanned by the columns of CC, is provably NP-hard .

2 The CSSP in statistical data analysis

In data applications, where the input matrix AA models mm objects represented with respect to nn features, the CSSP corresponds to unsupervised feature selection. Standard motivations for feature selection include facilitating data visualization, reducing training times, avoiding overfitting, and facilitating data understanding.

Consider, in particular, Principal Components Analysis (PCA), which is the predominant linear dimensionality reduction technique, and which has been widely applied on datasets in all scientific domains, from the social sciences and economics, to biology and chemistry. In words, PCA seeks to map or embed data points from a high dimensional Euclidean space to a low dimensional Euclidean space while keeping all the relevant linear structure intact. PCA is an unsupervised dimensionality reduction technique, with the sole input parameters being the coordinates of the data points and the number of dimensions that will be retained in the embedding (say kk), which is typically a constant independent of mm and nn; often it is k{m,n}k\ll\{m,n\} too. Data analysts often seek a subset of kk actual features (that is, kk actual columns, as opposed to the kk eigenvectors or eigenfeatures returned by PCA) that can accurately reproduce the structure derived by PCA. The CSSP is the obvious optimization problem associated with such unsupervised feature selection tasks.

We should note that similar formulations appeared in . In addition, applications of such ideas include: (ii) , where a “compact CUR matrix decomposition” was applied to static and dynamic data analysis in large sparse graphs; (iiii) , where these ideas were used for compression and classification of hyperspectral medical data and the reconstruction of missing entries from recommendation systems data in order to make high-quality recommendations; and (iiiiii) , where the concept of “PCA-correlated SNPs” (Single Nucleotide Polymorphisms) was introduced and applied to classify individuals from throughout the world without the need for any prior ancestry information. See for a detailed evaluation of our main algorithm as an unsupervised feature selection strategy in three application domains of modern statistical data analysis (finance, document-term data, and genetics).

3 Our main results

We present a novel two-stage algorithm for the CSSP. This algorithm is presented in detail in Section 3 as Algorithm 1. In the first stage of this algorithm (the randomized stage), we randomly select Θ(klogk)\Theta(k\log k) columns of VkTV_{k}^{T}, i.e., of the transpose of the n×kn\times k matrix consisting of the top kk right singular vectors of AA, according to a judiciously-chosen probability distribution that depends on information in the top-kk right singular subspace of AA. Then, in the second stage (the deterministic stage), we apply a deterministic column-selection procedure to select exactly kk columns from the set of columns of VkTV_{k}^{T} selected by the first stage. The algorithm then returns the corresponding kk columns of AA. In Section 4 we prove the following theorem.

There exists an algorithm (the two-stage Algorithm 1) that approximates the solution to the CSSP. This algorithm takes as input an m×nm\times n matrix AA and a positive integer kk; it runs in O(min{mn2,m2n})O(\min\{mn^{2},m^{2}n\}) time; and it returns as output an m×km\times k matrix CC consisting of exactly kk columns of AA such that with probability at least 0.80.8:

Here, PC=CC+P_{C}=CC^{+} denotes a projection onto the column span of the matrix CC, and AkA_{k} denotes the best rank-kk approximation to the matrix AA as computed with the singular value decomposition.

Note that we can trivially boost the success probability in the above theorem to 1δ1-\delta by repeating the algorithm O(log(1/δ))O\left(\log\left(1/\delta\right)\right) times. Note also that the running time of our algorithm is linear in the larger of the dimensions mm and nn, quadratic in the smaller one, and independent of kk. Thus, it is practically useful and efficient.

To put our results into perspective, we compare them to the best existing results for the CSSP. Prior work provided bounds of the form

where p(k,n)p(k,n) is a polynomial on nn and kk. For ξ=2\xi=2, i.e., for the spectral norm, the best previously-known bound for approximating the CSSP is p(k,n)=Θ(k(nk)+1)p(k,n)=\Theta\left(\sqrt{k(n-k)+1}\right) , while for ξ=F\xi=F, i.e., for the Frobenius norm, the best bound is p(k,n)=(k+1)!p(k,n)=\sqrt{(k+1)!} . Both results are algorithmically efficient, running in time polynomial in all three parameters mm, nn, and kk. The former runs in O(mnklogn)O(mnk\log n) time and the latter runs in O(mnk+kn)O(mnk+kn) time. Our approach provides an algorithmic bound for the Frobenius norm version of the CSSP that is roughly O(k!)O(\sqrt{k!}) better than the best previously-known algorithmic result. It should be noted that also proves that by exhaustively testing all (nk){n\choose k} possibilities for the matrix CC, the best one will satisfy eqn. (1) with p(k,n)=k+1p(k,n)=\sqrt{k+1}. Our algorithmic result is only O(klogk)O(\sqrt{k\log k}) worse than this existential result. A similar existential result for the spectral norm version of the CSSP is proved in with p(k,n)=k(nk)+1p(k,n)=\sqrt{k(n-k)+1}. Our spectral norm bound depends on Θ(k3/4log1/4k)\mboxAAkF\Theta\left(k^{3/4}\log^{1/4}k\right)\mbox{}\left\|A-A_{k}\right\|_{F}. In a worst case setting (e.g., when all the bottom nk+1n-k+1 singular values of AA are equal) this quantity is upper bounded by Θ((nk)1/2k3/4log1/4k)\mboxAAk2\Theta\left(\left(n-k\right)^{1/2}k^{3/4}\log^{1/4}k\right)\mbox{}\left\|A-A_{k}\right\|_{2}. This is worse than the best results for the spectral norm version of the CSSP by a factor of Θ(k1/4log1/4k)\Theta\left(k^{1/4}\log^{1/4}k\right).

Finally, we should emphasize that a novel feature of the algorithm that we present in this paper is that it combines in a nontrivial manner recent algorithmic developments in the theoretical computer science community with more traditional techniques from the numerical linear algebra community in order to obtain improved bounds for the CSSP.

Background and prior work

First, recall that the Θ\Theta-notation can be used to denote an asymptotically tight bound: f(n)Θ(g(n))f(n)\in\Theta(g(n)) or f(n)=Θ(g(n))f(n)=\Theta(g(n)) if there exist positive constants c1c_{1}, c2c_{2}, and n0n_{0} such that 0c1g(n)f(n)c2g(n)0\leq c_{1}g(n)\leq f(n)\leq c_{2}g(n) for all nn0n\geq n_{0}. This is similar to the way in which the big-OO-notation can be used to denote an asymptotic upper bound: f(n)=O(g(n))f(n)=O(g(n)) if there exist positive constants cc and n0n_{0} such that 0f(n)cg(n)0\leq f(n)\leq cg(n) for all nn0n\geq n_{0}.

Finally, we will make frequent use of the following fundamental result from probability theory, known as Markov’s inequality . Let XX be a random variable assuming non-negative values with expectation \mboxE[X]\mbox{}{\bf{E}}\left[X\right]. Then, for all t>0t>0, Xt\mboxE[X]X\leq t\cdot\mbox{}{\bf{E}}\left[X\right] with probability at least 1t11-t^{-1}. We will also need the so-called union bound. Given a set of probabilistic events E1,E2,,En{\cal E}_{1},{\cal E}_{2},\ldots,{\cal E}_{n} holding with respective probabilities p1,p2,,pnp_{1},p_{2},\ldots,p_{n}, the probability that all events hold (a.k.a., the probability of the union of those events) is upper bounded by i=1npi\sum_{i=1}^{n}p_{i}.

2 Related prior work

Since solving the CSSP exactly is a hard combinatorial optimization problem, research has historically focused on computing approximate solutions to it. Since \mboxAAkξ\mbox{}\left\|A-A_{k}\right\|_{\xi} provides an immediate lower bound for \mboxAPCAξ\mbox{}\left\|A-P_{C}A\right\|_{\xi}, for ξ=2,F\xi=2,F and for any choice of CC, a large number of approximation algorithms have been proposed to select a subset of kk columns of AA such that the resulting matrix CC satisfies

for some function p(k,n)p(k,n). Within the numerical linear algebra community, most of the work on the CSSP has focused on spectral norm bounds and is related to the so-called Rank Revealing QR (RRQR) factorization:

(The RRQR factorization) Given a matrix ARm×nA\in R^{m\times n} (mnm\geq n) and an integer kk (knk\leq n), assume partial QRQR factorizations of the form:

where QRm×nQ\in R^{m\times n} is an orthonormal matrix, RRn×nR\in R^{n\times n} is upper triangular, R11Rk×kR_{11}\in R^{k\times k}, R12Rk×(nk)R_{12}\in R^{k\times(n-k)}, R22R(nk)×(nk)R_{22}\in R^{(n-k)\times(n-k)}, and ΠRn×n\Pi\in R^{n\times n} is a permutation matrix. The above factorization is called a RRQR factorization if it satisfies

where p1(k,n)p_{1}(k,n) and p2(k,n)p_{2}(k,n) are functions bounded by low degree polynomials in kk and nn.

The work of Golub on pivoted QRQR factorizations was followed by much research addressing the problem of constructing an efficient RRQR factorization. Most researchers improved RRQR factorizations by focusing on improving the functions p1(k,n)p_{1}(k,n) and p2(k,n)p_{2}(k,n) in Definition 2. Let Πk\Pi_{k} denote the first kk columns of a permutation matrix Π\Pi. Then, if C=AΠkC=A\Pi_{k} is an m×km\times k matrix consisting of kk columns of AA, it is straightforward to prove that

for both ξ=2,F\xi=2,F. Thus, in particular, when applied to the spectral norm, it follows that

i.e., any algorithm that constructs an RRQR factorization of the matrix AA with provable guarantees also provides provable guarantees for the CSSP. See Table 1 for a summary of existing results, and see for a survey and an empirical evaluation of some of these algorithms. More recently, proposed random-projection type algorithms that achieve the same spectral norm bounds as prior work while improving the running time.

Within the theoretical computer science community, much work has followed that of Frieze, Kannan, and Vempala on selecting a small subset of representative columns of AA, forming a matrix CC, such that the projection of AA on the subspace spanned by the columns of CC is as close to AA as possible. The algorithms from this community are randomized, which means that they come with a failure probability, and focus mainly on the Frobenius norm. It is worth noting that they provide a strong tradeoff between the number of selected columns and the desired approximation accuracy. A typical scenario for these algorithms is that the desired approximation error (see ϵ\epsilon below) is given as input, and then the algorithm selects the minimum number of appropriate columns in order to achieve this error. One of the most relevant results for this paper is a bound of , which states that there exist exactly kk columns in any m×nm\times n matrix AA such that

Here, CC contains exactly kk columns of AA. The only known algorithm to find these kk columns is to try all (nk){n\choose k} choices and keep the best. This existential result relies on the so-called volume sampling method . In , an adaptive sampling method is used to approximate the volume sampling method and leads to an O(mnk+kn)O(mnk+kn) algorithm which finds kk columns of AA such that

As mentioned above, much work has also considered algorithms choosing slightly more than kk columns. This relaxation provides significant flexibility and improved error bounds. For example, in , an adaptive sampling method leads to an O(mn(k/ϵ2+k2logk))O\left(mn\left(k/\epsilon^{2}+k^{2}\log k\right)\right) algorithm, such that

holds with high probability for some matrix CC consisting of Θ(k/ϵ2+k2logk)\Theta\left(k/\epsilon^{2}+k^{2}\log k\right) columns of AA. Similarly, in , Drineas, Mahoney, and Muthukrishnan leverage the subspace sampling method to give an O(min{mn2,m2n})O(\min\{mn^{2},m^{2}n\}) algorithm such that

holds with high probability if CC contains Θ(klogk/ϵ2)\Theta(k\log k/\epsilon^{2}) columns of AA.

A two-stage algorithm for the CSSP

In this section, we present and describe Algorithm 1, our main algorithm for approximating the solution to the CSSP. This algorithm takes as input an m×nm\times n matrix AA and a rank parameter kk. After an initial setup, the algorithm has two stages: a randomized stage and a deterministic stage. In the randomized stage, a randomized procedure is run to select Θ(klogk)\Theta\left(k\log k\right) columns from the k×nk\times n matrix VkTV_{k}^{T}, i.e., the transpose of the matrix containing the top-kk right singular vectors of AA. The columns are chosen by randomly sampling according to a judiciously-chosen nonuniform probability distribution that depends on information in the top-kk right singular subspace of AA. Then, in the deterministic stage, a deterministic procedure is employed to select exactly kk columns from the Θ(klogk)\Theta\left(k\log k\right) columns chosen in the randomized stage. The algorithm then outputs exactly kk columns of AA that correspond to those columns chosen from VkTV_{k}^{T}. Theorem 1 states that the projection of AA on the subspace spanned by these kk columns of AA is (up to bounded error) close to the best rank kk approximation to AA.

In more detail, Algorithm 1 first computes a probability distribution p1,p2,,pnp_{1},p_{2},\ldots,p_{n} over the set {1,,n}\{1,\ldots,n\}, i.e., over the columns of VkTV_{k}^{T}, or equivalently over the columns of AA. The probability distribution depends on information in the top-kk right singular subspace of AA. In particular, for all i[n]i\in[n], define

and note that pi0p_{i}\geq 0, for all i[n]i\in[n], and that i=1npi=1\sum_{i=1}^{n}p_{i}=1. We will describe the computation of probabilities of this form below.

In the randomized stage, Algorithm 1 employs the following randomized column selection algorithm to choose Θ(klogk)\Theta(k\log k) columns from VkTV_{k}^{T} to pass to the second stage. Let cc assume the value of eqn. (11). In cc independent identically distributed (i.i.d.) trials, the algorithm chooses a column of VkTV_{k}^{T} where in each trial the ii-th column of VkTV_{k}^{T} is kept with probability pip_{i}. Additionally, if the ii-th column is kept, then a scaling factor equal to 1/cpi1/\sqrt{cp_{i}} is kept as well. Thus, at the end of this process, we will be left with cc columns of VkTV_{k}^{T} and their corresponding scaling factors. Notice that due to random sampling in i.i.d. trials with replacement we might keep a particular column more than once.

In order to conveniently represent the cc selected columns and the associated scaling factors, we will use the following sampling matrix formalism. First, define an n×cn\times c sampling matrix S1S_{1} as follows: S1S_{1} is initially empty; at each of the cc i.i.d. trials, if the ii-th column of VkTV_{k}^{T} is selected by the random sampling process, then eie_{i} (an nn-vector of all-zeros, except for its ii-th entry which is set to one) is appended to S1S_{1}. Next, define the c×cc\times c diagonal rescaling matrix D1D_{1} as follows: if the ii-th column of VkTV_{k}^{T} is selected, then a diagonal entry of D1D_{1} is set to 1/cpi1/\sqrt{cp_{i}}. Thus, we may view the randomized stage as outputting the matrix VkTS1D1V_{k}^{T}S_{1}D_{1} consisting of a small number of rescaled columns of VkTV_{k}^{T}, or simply as outputting S1S_{1} and D1D_{1}.

In the deterministic stage, Algorithm 1 applies a deterministic column selection algorithm to the output of the first stage in order to choose exactly kk columns from the input matrix AA. To do so, we run the Algorithm 4 of (with the parameter ff set to 2\sqrt{2}) on the k×ck\times c matrix VkTS1D1V_{k}^{T}S_{1}D_{1}, i.e., the column-scaled version of the columns of VkTV_{k}^{T} chosen in the first stage. Thus, a matrix VkTS1D1S2V_{k}^{T}S_{1}D_{1}S_{2} is formed, or equivalently, in the sampling matrix formalism described previously, a new matrix S2S_{2} is constructed. Its dimensions are c×kc\times k, since it selects exactly kk columns out of the cc columns returned after the end of the randomized stage. The algorithm then returns the corresponding kk columns of the original matrix AA, i.e., after the second stage of the algorithm is complete, the m×km\times k matrix C=AS1S2C=AS_{1}S_{2} is returned as the final output.

2 Running time analysis

We now discuss the running time of our algorithm. Note that manipulating the probability distribution of eqn. (10) yields:

Thus, knowledge of VkV_{k}, i.e., the n×kn\times k matrix consisting of the top-kk right singular vectors of AA, suffices to compute the pip_{i}’s. By eqn. (12), O(min{mn2,m2n})O(\min\{mn^{2},m^{2}n\}) time suffices for our theoretical analysis. In practice iterative algorithms could be used to speed up the algorithm. Note also that in order to obtain the Frobenius norm bound of Theorem 1, our theoretical analysis holds if the sampling probabilities are of the form:

That is, the Frobenius norm bound of Theorem 1 holds even if the second term in the sampling probabilities of eqns. (10) and (12) is omitted.

which is the only guarantee that we need in the deterministic step (see Lemma 3). The running time of the deterministic stage of Algorithm 1 is O(c2klogc)O(c^{2}k\log\sqrt{c}) time, since the (padded) matrix VkTS1D1V_{k}^{T}S_{1}D_{1} has cc columns and rows.

An important open problem would be to identify other suitable importance sampling probability distributions that avoid the computation of a basis for the top-kk right singular subspace.

3 Intuition underlying our main algorithm

Intuitively, we achieve improved bounds for the CSSP because we apply the deterministic algorithm to a lower dimensional matrix (the matrix VkTS1D1V_{k}^{T}S_{1}D_{1} with Θ(klogk)\Theta\left(k\log k\right) columns, as opposed to the matrix AA with nn columns) in which the columns are “spread out” in a “nice” manner. To see this, note that the probability distribution of eqn. (13), and thus one of the two terms in the probability distribution of eqns. (10) or (12), equals (up to scaling) the diagonal elements of the projection matrix onto the span of the top-kk right singular subspace. In diagnostic regression analysis, these quantities have a natural interpretation in terms of statistical leverage, and thus they have been used extensively to identify “outlying” data points . Thus, the importance sampling probabilities that we employ in the randomized stage of our main algorithm provide a bias toward more “outlying” columns, which then provide a “nice” starting point for the deterministic stage of our main algorithm. This also provides intuition as to why using importance sampling probabilities of the form of eqn. (13) leads to relative-error low-rank matrix approximations .

Proof of Theorem 1

In this section, we provide a proof of Theorem 1. We start with an outline of our proof, pointing out conceptual improvements that were necessary in order to obtain improved bounds. An important condition in the first phase of the algorithm is that when we sample columns from the k×nk\times n matrix VkTV_{k}^{T}, we obtain a k×ck\times c matrix VkTS1D1V_{k}^{T}S_{1}D_{1} that does not lose any rank. To do so, we will apply a result from matrix perturbation theory to prove that if c=Θ(klogk)c=\Theta(k\log k) (see eqn. (11)) then σk2(VkTS1D1)11/2\left|\sigma_{k}^{2}\left(V_{k}^{T}S_{1}D_{1}\right)-1\right|\leq 1/2. (See Lemma 1 below.) Then, under the assumption that VkTS1D1V_{k}^{T}S_{1}D_{1} has full rank, we will prove that the m×km\times k matrix CC returned by the algorithm will satisfy:

for both ξ=2,F\xi=2,F. (See Lemma 2 below.) Next, we will provide a bound on σk1(VkTS1D1S2)\sigma_{k}^{-1}\left(V_{k}^{T}S_{1}D_{1}S_{2}\right). In order to get a strong accuracy guarantee for the overall algorithm, the deterministic column selection algorithm must satisfy

where p(k,c)p(k,c) is a polynomial in both kk and cc. Thus, for our main theorem, we will employ Algorithm 4 with f=2f=\sqrt{2}, which guarantees the above bound with p(k,c)=2k(ck)+1p(k,c)=\sqrt{2k\left(c-k\right)+1}. (See Lemma 3 below.) Finally, we will show, using relatively straightforward matrix perturbation techniques, that \mboxΣρkVρkTS1D1ξ\mbox{}\left\|\Sigma_{\rho-k}V_{\rho-k}^{T}S_{1}D_{1}\right\|_{\xi} is not too much more, in a multiplicative sense, than \mboxAAkξ\mbox{}\left\|A-A_{k}\right\|_{\xi}, where we note that the factors differ for ξ=2,F\xi=2,F. (See Lemmas 4 and 5 below.) By combining these results, the main theorem will follow.

The following lemma provides a bound on the singular values of the matrix VkTS1D1V_{k}^{T}S_{1}D_{1} computed by the randomized phase of Algorithm 1, from which it will follow that the matrix VkTS1D1V_{k}^{T}S_{1}D_{1} is full rank. To prove the lemma, we employ Theorem 2 of the Appendix (this theorem is a variant of a result of Rudelson and Vershynin in ). Note that probabilities of the form of eqn. (13) actually suffice to establish Lemma 1.

Let S1S_{1} and D1D_{1} be constructed using Algorithm 1. Then, with probability at least 0.90.9,

In particular, VkTS1D1V_{k}^{T}S_{1}D_{1} has full rank.

Proof: In order to bound σk(VkTS1D1)\sigma_{k}\left(V_{k}^{T}S_{1}D_{1}\right), we will bound \mboxVkTS1D1D1S1TVkIk2\mbox{}\left\|V_{k}^{T}S_{1}D_{1}D_{1}S_{1}^{T}V_{k}-I_{k}\right\|_{2}. Towards that end, we will use Theorem 2 with β=1/2\beta=1/2 and ϵ=1/20\epsilon=1/20, which results in the value for cc in eqn. (11). Note that the sampling probabilities in eqn. (12) satisfy

Now Theorem 2 and our construction of S1S_{1} and D1D_{1} guarantee that for cc as in eqn. (11)

We note here that the condition c02\mboxVkF24βϵ2c_{0}^{2}\mbox{}\left\|V_{k}\right\|_{F}^{2}\geq 4\beta\epsilon^{2} in Theorem 2 is trivially satisfied assuming that c0c_{0} is at least one (given our choices for β\beta, ϵ\epsilon, and \mboxVkF2=k1\mbox{}\left\|V_{k}\right\|_{F}^{2}=k\geq 1). Using VkTVk=IkV_{k}^{T}V_{k}=I_{k} and Markov’s inequality we get that with probability at least 0.90.9,

Standard matrix perturbation theory results now imply that for all i=1,,ki=1,\ldots,k,

Let S1S_{1}, D1D_{1}, and S2S_{2} be constructed as described in Algorithm 1 and recall that C=AS1S2C=AS_{1}S_{2}. If VkTS1D1V_{k}^{T}S_{1}D_{1} has full rank, then for ξ=2,F\xi=2,F,

Proof: We seek to bound the spectral and Frobenius norms of APCAA-P_{C}A, where C=AS1S2C=AS_{1}S_{2} is constructed by Algorithm 1. To do so, first notice that scaling the columns of a matrix (equivalently, post-multiplying the matrix by a diagonal matrix) by any non-zero scale factors does not change the subspace spanned by the columns of the matrix. Thus,

where, in the last line, we have introduced the convenient notation S=S1D1S2Rn×kS=S_{1}D_{1}S_{2}\in R^{n\times k} that we will use throughout the remainder of this proof. In the sequel we seek to bound the residual

This implies that in eqn. (15) we can replace (AS)+A(AS)^{+}A with any other k×nk\times n matrix and the equality with an inequality. In particular we replace (AS)+A(AS)^{+}A with (AkS)+Ak(A_{k}S)^{+}A_{k}, where AkA_{k} is the best rank-kk approximation to AA:

Let Aρk=UρkΣρkVρkTA_{\rho-k}=U_{\rho-k}\Sigma_{\rho-k}V_{\rho-k}^{T}. Then, A=Ak+AρkA=A_{k}+A_{\rho-k} and, using the triangle inequality,

We now bound γ1\gamma_{1}, γ2\gamma_{2}, and γ3\gamma_{3}. First, for γ1\gamma_{1}, note that:

In eqn. (17), we replaced (UkΣkVkTS)+(U_{k}\Sigma_{k}V_{k}^{T}S)^{+} by (VkTS)+(UkΣk)+(V_{k}^{T}S)^{+}(U_{k}\Sigma_{k})^{+}. This follows since the statement of our lemma assumes that the matrix VkTS1D1V_{k}^{T}S_{1}D_{1} has full rank. Also, the construction of S2S_{2} guarantees that the columns of VkTS1D1V_{k}^{T}S_{1}D_{1} that are selected in the second stage of Algorithm 1 are linearly independent, and thus the k×kk\times k matrix VkTS=VkTS1D1S2V_{k}^{T}S=V_{k}^{T}S_{1}D_{1}S_{2} has full rank and is invertible. In eqn. (18), UkU_{k} and VkTV_{k}^{T} can be dropped without increasing a unitarily invariant norm, while eqn. (19) follows since VkTSV_{k}^{T}S is a full-rank k×kk\times k matrix. Next, note that γ2=\mboxAρkξ=\mboxAAkξ\gamma_{2}=\mbox{}\left\|A_{\rho-k}\right\|_{\xi}=\mbox{}\left\|A-A_{k}\right\|_{\xi}. Finally, to conclude the proof, we bound γ3\gamma_{3} as follows:

Eqn. (20) follows by the orthogonality of UρkU_{\rho-k} and VkV_{k} and the fact that VkTSV_{k}^{T}S is a k×kk\times k invertible matrix (see above). Eqn. (21) follows from the fact that for any two matrices XX and YY and ξ=2,F\xi=2,F, \mboxXYξ\mboxXξ\mboxY2\mbox{}\left\|XY\right\|_{\xi}\leq\mbox{}\left\|X\right\|_{\xi}\mbox{}\left\|Y\right\|_{2}. Finally, eqn. (22) follows since S=S1DS2S=S_{1}DS_{2} and S2S_{2} is an orthogonal matrix.

Let S1S_{1}, D1D_{1}, and S2S_{2} be constructed using Algorithm 1. Then, with probability at least 0.90.9,

Proof: From Lemma 1 we know that σi(VkTS1D1)1/2\sigma_{i}\left(V_{k}^{T}S_{1}D_{1}\right)\geq 1/2 holds for all i=1,,ki=1,\ldots,k with probability at least 0.90.9. The deterministic construction of S2S_{2} (see Algorithm 4 of with the parameter ff set to 2\sqrt{2}) guarantees that

(ξ=2\xi=2) If S1S_{1} and D1D_{1} are constructed as described in Algorithm 1, then, with probability at least 0.90.9,

Proof: Let Γ=ΣρkVρkTVρkΣρk=Σρk2\Gamma=\Sigma_{\rho-k}V_{\rho-k}^{T}V_{\rho-k}\Sigma_{\rho-k}=\Sigma_{\rho-k}^{2}. We manipulate \mboxΣρkVρkTS1D122\mbox{}\left\|\Sigma_{\rho-k}V_{\rho-k}^{T}S_{1}D_{1}\right\|_{2}^{2} as follows:

Given our construction of S1S_{1} and D1D_{1} and applying eqn. (9) of Theorem 1 of with β=1/2\beta=1/2 and δ=0.1\delta=0.1, we get that with probability at least 0.90.9,

Thus, by combining the above results and using \mboxΣρk22=\mboxAAk22\mbox{}\left\|\Sigma_{\rho-k}^{2}\right\|_{2}=\mbox{}\left\|A-A_{k}\right\|_{2}^{2} and \mboxΣρkVρkTF2=\mboxAAkF2\mbox{}\left\|\Sigma_{\rho-k}V_{\rho-k}^{T}\right\|_{F}^{2}=\mbox{}\left\|A-A_{k}\right\|_{F}^{2} we get

To conclude the proof of the lemma we take the square roots of both sides of the above inequality.

(ξ=F\xi=F) If S1S_{1} and D1D_{1} are constructed as described in Algorithm 1, then, with probability at least 0.90.9,

Proof: It is straightforward to prove that with our construction of S1S_{1} and D1D_{1}, the expectation of \mboxΣρkVρkTS1D1F2\mbox{}\left\|\Sigma_{\rho-k}V_{\rho-k}^{T}S_{1}D_{1}\right\|_{F}^{2} is equal to \mboxΣρkVρkTF2\mbox{}\left\|\Sigma_{\rho-k}V_{\rho-k}^{T}\right\|_{F}^{2}. In addition, note that the latter quantity is exactly equal to \mboxAAkF2\mbox{}\left\|A-A_{k}\right\|_{F}^{2}. Applying Markov’s inequality, we get that, with probability at least 0.90.9,

Taking square roots of both sides of the above inequality concludes the proof of the lemma.

4 Completing the proof of Theorem 1

To prove the Frobenius norm bound of Theorem 1 we combine Lemma 2 (with ξ=F\xi=F) with Lemmas 3 and 5. Thus, we get

Using c=Θ(klogk)c=\Theta\left(k\log k\right) immediately derives the Frobenius norm bound of Theorem 1. Notice that Lemma 3 fails with probability at most 0.10.1 and that Lemma 5 fails with probability at most 0.10.1; thus, applying the standard union bound, it follows that the Frobenius norm bound of Theorem 1 holds with probability at least 0.80.8. To prove the spectral norm bound of Theorem 1 we combine Lemma 2 (with ξ=2\xi=2) with Lemmas 3 and 4. Thus, we get

Using c=Θ(klogk)c=\Theta\left(k\log k\right) immediately derives the spectral norm bound of Theorem 1. Notice that Lemma 3 fails with probability at most 0.10.1 and that Lemma 4 fails with probability at most 0.10.1; thus, applying the standard union bound, it follows that the spectral norm bound of Theorem 1 holds with probability at least 0.80.8.

Acknowledgements

We are grateful to Daniel Spielman and Ilse Ipsen for numerous useful discussions on the results of this paper. We would also like to thank an anonymous reviewer of an earlier version of this manuscript who provided a counterexample to Lemma 4.4 of , and thus helped us identify the error in the proof of that lemma.

References

Appendix

for all i[n]i\in[n] for some constant β(0,1]\beta\in(0,1]. Let ϵ(0,1)\epsilon\in(0,1) be an accuracy parameter, assume c02\mboxAF24βϵ2c_{0}^{2}\mbox{}\left\|A\right\|_{F}^{2}\geq 4\beta\epsilon^{2}, and let

(Here c0c_{0} is the unknown constant of Theorem 3.1, p. 8 of .) Then,