An Improved Approximation Algorithm for the Column Subset Selection Problem

Christos Boutsidis, Michael W. Mahoney, Petros Drineas

Introduction

We consider the problem of selecting the “best” set of exactly $k$ columns from an $m\times n$ matrix $A$ . More precisely, we consider the following Column Subset Selection Problem (CSSP):

is minimized over all possible ${n\choose k}$ choices for the matrix $C$ . Here, $P_{C}=CC^{+}$ denotes the projection onto the $k$ -dimensional space spanned by the columns of $C$ and $\xi=2\ \mbox{or}\ F$ denotes the spectral norm or Frobenius norm.

That is, the goal of the CSSP is to find a subset of exactly $k$ columns of $A$ that “captures” as much of $A$ as possible, with respect to the spectral norm and/or Frobenius norm, in a projection sense. The CSSP has been studied extensively in numerical linear algebra, where it has found applications in, e.g., scientific computing . More recently, a relaxation has been studied in theoretical computer science, where it has been motivated by applications to large scientific and internet data sets .

We briefly comment on the complexity of the problem. Clearly, in $O(n^{k})$ time we can generate all possible matrices $C$ and thus find the optimal solution in $O(n^{k}mnk)$ time. However, from a practical perspective, in data analysis applications of the CSSP (see Section 1.2), $n$ is often in the order of hundreds or thousands. Thus, in practice, algorithms whose running time depends exponentially on $k$ are prohibitively slow even if $k$ is, from a theoretical perspective, a constant. Finally, the NP-hardness of the CSSP (assuming $k$ is a function of $n$ ) is an open problem. Note, though, that a similar problem, asking for the $k$ columns of the $m\times n$ matrix $A$ that maximize the volume of the parallelepiped spanned by the columns of $C$ , is provably NP-hard .

2 The CSSP in statistical data analysis

In data applications, where the input matrix $A$ models $m$ objects represented with respect to $n$ features, the CSSP corresponds to unsupervised feature selection. Standard motivations for feature selection include facilitating data visualization, reducing training times, avoiding overfitting, and facilitating data understanding.

Consider, in particular, Principal Components Analysis (PCA), which is the predominant linear dimensionality reduction technique, and which has been widely applied on datasets in all scientific domains, from the social sciences and economics, to biology and chemistry. In words, PCA seeks to map or embed data points from a high dimensional Euclidean space to a low dimensional Euclidean space while keeping all the relevant linear structure intact. PCA is an unsupervised dimensionality reduction technique, with the sole input parameters being the coordinates of the data points and the number of dimensions that will be retained in the embedding (say $k$ ), which is typically a constant independent of $m$ and $n$ ; often it is $k\ll\{m,n\}$ too. Data analysts often seek a subset of $k$ actual features (that is, $k$ actual columns, as opposed to the $k$ eigenvectors or eigenfeatures returned by PCA) that can accurately reproduce the structure derived by PCA. The CSSP is the obvious optimization problem associated with such unsupervised feature selection tasks.

We should note that similar formulations appeared in . In addition, applications of such ideas include: ( $i$ ) , where a “compact CUR matrix decomposition” was applied to static and dynamic data analysis in large sparse graphs; ( $ii$ ) , where these ideas were used for compression and classification of hyperspectral medical data and the reconstruction of missing entries from recommendation systems data in order to make high-quality recommendations; and ( $iii$ ) , where the concept of “PCA-correlated SNPs” (Single Nucleotide Polymorphisms) was introduced and applied to classify individuals from throughout the world without the need for any prior ancestry information. See for a detailed evaluation of our main algorithm as an unsupervised feature selection strategy in three application domains of modern statistical data analysis (finance, document-term data, and genetics).

3 Our main results

We present a novel two-stage algorithm for the CSSP. This algorithm is presented in detail in Section 3 as Algorithm 1. In the first stage of this algorithm (the randomized stage), we randomly select $\Theta(k\log k)$ columns of $V_{k}^{T}$ , i.e., of the transpose of the $n\times k$ matrix consisting of the top $k$ right singular vectors of $A$ , according to a judiciously-chosen probability distribution that depends on information in the top- $k$ right singular subspace of $A$ . Then, in the second stage (the deterministic stage), we apply a deterministic column-selection procedure to select exactly $k$ columns from the set of columns of $V_{k}^{T}$ selected by the first stage. The algorithm then returns the corresponding $k$ columns of $A$ . In Section 4 we prove the following theorem.

There exists an algorithm (the two-stage Algorithm 1) that approximates the solution to the CSSP. This algorithm takes as input an $m\times n$ matrix $A$ and a positive integer $k$ ; it runs in $O(\min\{mn^{2},m^{2}n\})$ time; and it returns as output an $m\times k$ matrix $C$ consisting of exactly $k$ columns of $A$ such that with probability at least $0.8$ :

Here, $P_{C}=CC^{+}$ denotes a projection onto the column span of the matrix $C$ , and $A_{k}$ denotes the best rank- $k$ approximation to the matrix $A$ as computed with the singular value decomposition.

Note that we can trivially boost the success probability in the above theorem to $1-\delta$ by repeating the algorithm $O\left(\log\left(1/\delta\right)\right)$ times. Note also that the running time of our algorithm is linear in the larger of the dimensions $m$ and $n$ , quadratic in the smaller one, and independent of $k$ . Thus, it is practically useful and efficient.

To put our results into perspective, we compare them to the best existing results for the CSSP. Prior work provided bounds of the form

where $p(k,n)$ is a polynomial on $n$ and $k$ . For $\xi=2$ , i.e., for the spectral norm, the best previously-known bound for approximating the CSSP is $p(k,n)=\Theta\left(\sqrt{k(n-k)+1}\right)$ , while for $\xi=F$ , i.e., for the Frobenius norm, the best bound is $p(k,n)=\sqrt{(k+1)!}$ . Both results are algorithmically efficient, running in time polynomial in all three parameters $m$ , $n$ , and $k$ . The former runs in $O(mnk\log n)$ time and the latter runs in $O(mnk+kn)$ time. Our approach provides an algorithmic bound for the Frobenius norm version of the CSSP that is roughly $O(\sqrt{k!})$ better than the best previously-known algorithmic result. It should be noted that also proves that by exhaustively testing all ${n\choose k}$ possibilities for the matrix $C$ , the best one will satisfy eqn. (1) with $p(k,n)=\sqrt{k+1}$ . Our algorithmic result is only $O(\sqrt{k\log k})$ worse than this existential result. A similar existential result for the spectral norm version of the CSSP is proved in with $p(k,n)=\sqrt{k(n-k)+1}$ . Our spectral norm bound depends on $\Theta\left(k^{3/4}\log^{1/4}k\right)\mbox{}\left\|A-A_{k}\right\|_{F}$ . In a worst case setting (e.g., when all the bottom $n-k+1$ singular values of $A$ are equal) this quantity is upper bounded by $\Theta\left(\left(n-k\right)^{1/2}k^{3/4}\log^{1/4}k\right)\mbox{}\left\|A-A_{k}\right\|_{2}$ . This is worse than the best results for the spectral norm version of the CSSP by a factor of $\Theta\left(k^{1/4}\log^{1/4}k\right)$ .

Finally, we should emphasize that a novel feature of the algorithm that we present in this paper is that it combines in a nontrivial manner recent algorithmic developments in the theoretical computer science community with more traditional techniques from the numerical linear algebra community in order to obtain improved bounds for the CSSP.

Background and prior work

First, recall that the $\Theta$ -notation can be used to denote an asymptotically tight bound: $f(n)\in\Theta(g(n))$ or $f(n)=\Theta(g(n))$ if there exist positive constants $c_{1}$ , $c_{2}$ , and $n_{0}$ such that $0\leq c_{1}g(n)\leq f(n)\leq c_{2}g(n)$ for all $n\geq n_{0}$ . This is similar to the way in which the big- $O$ -notation can be used to denote an asymptotic upper bound: $f(n)=O(g(n))$ if there exist positive constants $c$ and $n_{0}$ such that $0\leq f(n)\leq cg(n)$ for all $n\geq n_{0}$ .

Finally, we will make frequent use of the following fundamental result from probability theory, known as Markov’s inequality . Let $X$ be a random variable assuming non-negative values with expectation $\mbox{}{\bf{E}}\left[X\right]$ . Then, for all $t>0$ , $X\leq t\cdot\mbox{}{\bf{E}}\left[X\right]$ with probability at least $1-t^{-1}$ . We will also need the so-called union bound. Given a set of probabilistic events ${\cal E}_{1},{\cal E}_{2},\ldots,{\cal E}_{n}$ holding with respective probabilities $p_{1},p_{2},\ldots,p_{n}$ , the probability that all events hold (a.k.a., the probability of the union of those events) is upper bounded by $\sum_{i=1}^{n}p_{i}$ .

2 Related prior work

Since solving the CSSP exactly is a hard combinatorial optimization problem, research has historically focused on computing approximate solutions to it. Since $\mbox{}\left\|A-A_{k}\right\|_{\xi}$ provides an immediate lower bound for $\mbox{}\left\|A-P_{C}A\right\|_{\xi}$ , for $\xi=2,F$ and for any choice of $C$ , a large number of approximation algorithms have been proposed to select a subset of $k$ columns of $A$ such that the resulting matrix $C$ satisfies

for some function $p(k,n)$ . Within the numerical linear algebra community, most of the work on the CSSP has focused on spectral norm bounds and is related to the so-called Rank Revealing QR (RRQR) factorization:

(The RRQR factorization) Given a matrix $A\in R^{m\times n}$ ( $m\geq n$ ) and an integer $k$ ( $k\leq n$ ), assume partial $QR$ factorizations of the form:

where $Q\in R^{m\times n}$ is an orthonormal matrix, $R\in R^{n\times n}$ is upper triangular, $R_{11}\in R^{k\times k}$ , $R_{12}\in R^{k\times(n-k)}$ , $R_{22}\in R^{(n-k)\times(n-k)}$ , and $\Pi\in R^{n\times n}$ is a permutation matrix. The above factorization is called a RRQR factorization if it satisfies

where $p_{1}(k,n)$ and $p_{2}(k,n)$ are functions bounded by low degree polynomials in $k$ and $n$ .

The work of Golub on pivoted $QR$ factorizations was followed by much research addressing the problem of constructing an efficient RRQR factorization. Most researchers improved RRQR factorizations by focusing on improving the functions $p_{1}(k,n)$ and $p_{2}(k,n)$ in Definition 2. Let $\Pi_{k}$ denote the first $k$ columns of a permutation matrix $\Pi$ . Then, if $C=A\Pi_{k}$ is an $m\times k$ matrix consisting of $k$ columns of $A$ , it is straightforward to prove that

for both $\xi=2,F$ . Thus, in particular, when applied to the spectral norm, it follows that

i.e., any algorithm that constructs an RRQR factorization of the matrix $A$ with provable guarantees also provides provable guarantees for the CSSP. See Table 1 for a summary of existing results, and see for a survey and an empirical evaluation of some of these algorithms. More recently, proposed random-projection type algorithms that achieve the same spectral norm bounds as prior work while improving the running time.

Within the theoretical computer science community, much work has followed that of Frieze, Kannan, and Vempala on selecting a small subset of representative columns of $A$ , forming a matrix $C$ , such that the projection of $A$ on the subspace spanned by the columns of $C$ is as close to $A$ as possible. The algorithms from this community are randomized, which means that they come with a failure probability, and focus mainly on the Frobenius norm. It is worth noting that they provide a strong tradeoff between the number of selected columns and the desired approximation accuracy. A typical scenario for these algorithms is that the desired approximation error (see $\epsilon$ below) is given as input, and then the algorithm selects the minimum number of appropriate columns in order to achieve this error. One of the most relevant results for this paper is a bound of , which states that there exist exactly $k$ columns in any $m\times n$ matrix $A$ such that

Here, $C$ contains exactly $k$ columns of $A$ . The only known algorithm to find these $k$ columns is to try all ${n\choose k}$ choices and keep the best. This existential result relies on the so-called volume sampling method . In , an adaptive sampling method is used to approximate the volume sampling method and leads to an $O(mnk+kn)$ algorithm which finds $k$ columns of $A$ such that

As mentioned above, much work has also considered algorithms choosing slightly more than $k$ columns. This relaxation provides significant flexibility and improved error bounds. For example, in , an adaptive sampling method leads to an $O\left(mn\left(k/\epsilon^{2}+k^{2}\log k\right)\right)$ algorithm, such that

holds with high probability for some matrix $C$ consisting of $\Theta\left(k/\epsilon^{2}+k^{2}\log k\right)$ columns of $A$ . Similarly, in , Drineas, Mahoney, and Muthukrishnan leverage the subspace sampling method to give an $O(\min\{mn^{2},m^{2}n\})$ algorithm such that

holds with high probability if $C$ contains $\Theta(k\log k/\epsilon^{2})$ columns of $A$ .

A two-stage algorithm for the CSSP

In this section, we present and describe Algorithm 1, our main algorithm for approximating the solution to the CSSP. This algorithm takes as input an $m\times n$ matrix $A$ and a rank parameter $k$ . After an initial setup, the algorithm has two stages: a randomized stage and a deterministic stage. In the randomized stage, a randomized procedure is run to select $\Theta\left(k\log k\right)$ columns from the $k\times n$ matrix $V_{k}^{T}$ , i.e., the transpose of the matrix containing the top- $k$ right singular vectors of $A$ . The columns are chosen by randomly sampling according to a judiciously-chosen nonuniform probability distribution that depends on information in the top- $k$ right singular subspace of $A$ . Then, in the deterministic stage, a deterministic procedure is employed to select exactly $k$ columns from the $\Theta\left(k\log k\right)$ columns chosen in the randomized stage. The algorithm then outputs exactly $k$ columns of $A$ that correspond to those columns chosen from $V_{k}^{T}$ . Theorem 1 states that the projection of $A$ on the subspace spanned by these $k$ columns of $A$ is (up to bounded error) close to the best rank $k$ approximation to $A$ .

In more detail, Algorithm 1 first computes a probability distribution $p_{1},p_{2},\ldots,p_{n}$ over the set $\{1,\ldots,n\}$ , i.e., over the columns of $V_{k}^{T}$ , or equivalently over the columns of $A$ . The probability distribution depends on information in the top- $k$ right singular subspace of $A$ . In particular, for all $i\in[n]$ , define

and note that $p_{i}\geq 0$ , for all $i\in[n]$ , and that $\sum_{i=1}^{n}p_{i}=1$ . We will describe the computation of probabilities of this form below.

In the randomized stage, Algorithm 1 employs the following randomized column selection algorithm to choose $\Theta(k\log k)$ columns from $V_{k}^{T}$ to pass to the second stage. Let $c$ assume the value of eqn. (11). In $c$ independent identically distributed (i.i.d.) trials, the algorithm chooses a column of $V_{k}^{T}$ where in each trial the $i$ -th column of $V_{k}^{T}$ is kept with probability $p_{i}$ . Additionally, if the $i$ -th column is kept, then a scaling factor equal to $1/\sqrt{cp_{i}}$ is kept as well. Thus, at the end of this process, we will be left with $c$ columns of $V_{k}^{T}$ and their corresponding scaling factors. Notice that due to random sampling in i.i.d. trials with replacement we might keep a particular column more than once.

In order to conveniently represent the $c$ selected columns and the associated scaling factors, we will use the following sampling matrix formalism. First, define an $n\times c$ sampling matrix $S_{1}$ as follows: $S_{1}$ is initially empty; at each of the $c$ i.i.d. trials, if the $i$ -th column of $V_{k}^{T}$ is selected by the random sampling process, then $e_{i}$ (an $n$ -vector of all-zeros, except for its $i$ -th entry which is set to one) is appended to $S_{1}$ . Next, define the $c\times c$ diagonal rescaling matrix $D_{1}$ as follows: if the $i$ -th column of $V_{k}^{T}$ is selected, then a diagonal entry of $D_{1}$ is set to $1/\sqrt{cp_{i}}$ . Thus, we may view the randomized stage as outputting the matrix $V_{k}^{T}S_{1}D_{1}$ consisting of a small number of rescaled columns of $V_{k}^{T}$ , or simply as outputting $S_{1}$ and $D_{1}$ .

In the deterministic stage, Algorithm 1 applies a deterministic column selection algorithm to the output of the first stage in order to choose exactly $k$ columns from the input matrix $A$ . To do so, we run the Algorithm 4 of (with the parameter $f$ set to $\sqrt{2}$ ) on the $k\times c$ matrix $V_{k}^{T}S_{1}D_{1}$ , i.e., the column-scaled version of the columns of $V_{k}^{T}$ chosen in the first stage. Thus, a matrix $V_{k}^{T}S_{1}D_{1}S_{2}$ is formed, or equivalently, in the sampling matrix formalism described previously, a new matrix $S_{2}$ is constructed. Its dimensions are $c\times k$ , since it selects exactly $k$ columns out of the $c$ columns returned after the end of the randomized stage. The algorithm then returns the corresponding $k$ columns of the original matrix $A$ , i.e., after the second stage of the algorithm is complete, the $m\times k$ matrix $C=AS_{1}S_{2}$ is returned as the final output.

2 Running time analysis

We now discuss the running time of our algorithm. Note that manipulating the probability distribution of eqn. (10) yields:

Thus, knowledge of $V_{k}$ , i.e., the $n\times k$ matrix consisting of the top- $k$ right singular vectors of $A$ , suffices to compute the $p_{i}$ ’s. By eqn. (12), $O(\min\{mn^{2},m^{2}n\})$ time suffices for our theoretical analysis. In practice iterative algorithms could be used to speed up the algorithm. Note also that in order to obtain the Frobenius norm bound of Theorem 1, our theoretical analysis holds if the sampling probabilities are of the form:

That is, the Frobenius norm bound of Theorem 1 holds even if the second term in the sampling probabilities of eqns. (10) and (12) is omitted.

which is the only guarantee that we need in the deterministic step (see Lemma 3). The running time of the deterministic stage of Algorithm 1 is $O(c^{2}k\log\sqrt{c})$ time, since the (padded) matrix $V_{k}^{T}S_{1}D_{1}$ has $c$ columns and rows.

An important open problem would be to identify other suitable importance sampling probability distributions that avoid the computation of a basis for the top- $k$ right singular subspace.

3 Intuition underlying our main algorithm

Intuitively, we achieve improved bounds for the CSSP because we apply the deterministic algorithm to a lower dimensional matrix (the matrix $V_{k}^{T}S_{1}D_{1}$ with $\Theta\left(k\log k\right)$ columns, as opposed to the matrix $A$ with $n$ columns) in which the columns are “spread out” in a “nice” manner. To see this, note that the probability distribution of eqn. (13), and thus one of the two terms in the probability distribution of eqns. (10) or (12), equals (up to scaling) the diagonal elements of the projection matrix onto the span of the top- $k$ right singular subspace. In diagnostic regression analysis, these quantities have a natural interpretation in terms of statistical leverage, and thus they have been used extensively to identify “outlying” data points . Thus, the importance sampling probabilities that we employ in the randomized stage of our main algorithm provide a bias toward more “outlying” columns, which then provide a “nice” starting point for the deterministic stage of our main algorithm. This also provides intuition as to why using importance sampling probabilities of the form of eqn. (13) leads to relative-error low-rank matrix approximations .

Proof of Theorem 1

In this section, we provide a proof of Theorem 1. We start with an outline of our proof, pointing out conceptual improvements that were necessary in order to obtain improved bounds. An important condition in the first phase of the algorithm is that when we sample columns from the $k\times n$ matrix $V_{k}^{T}$ , we obtain a $k\times c$ matrix $V_{k}^{T}S_{1}D_{1}$ that does not lose any rank. To do so, we will apply a result from matrix perturbation theory to prove that if $c=\Theta(k\log k)$ (see eqn. (11)) then $\left|\sigma_{k}^{2}\left(V_{k}^{T}S_{1}D_{1}\right)-1\right|\leq 1/2$ . (See Lemma 1 below.) Then, under the assumption that $V_{k}^{T}S_{1}D_{1}$ has full rank, we will prove that the $m\times k$ matrix $C$ returned by the algorithm will satisfy:

for both $\xi=2,F$ . (See Lemma 2 below.) Next, we will provide a bound on $\sigma_{k}^{-1}\left(V_{k}^{T}S_{1}D_{1}S_{2}\right)$ . In order to get a strong accuracy guarantee for the overall algorithm, the deterministic column selection algorithm must satisfy

where $p(k,c)$ is a polynomial in both $k$ and $c$ . Thus, for our main theorem, we will employ Algorithm 4 with $f=\sqrt{2}$ , which guarantees the above bound with $p(k,c)=\sqrt{2k\left(c-k\right)+1}$ . (See Lemma 3 below.) Finally, we will show, using relatively straightforward matrix perturbation techniques, that $\mbox{}\left\|\Sigma_{\rho-k}V_{\rho-k}^{T}S_{1}D_{1}\right\|_{\xi}$ is not too much more, in a multiplicative sense, than $\mbox{}\left\|A-A_{k}\right\|_{\xi}$ , where we note that the factors differ for $\xi=2,F$ . (See Lemmas 4 and 5 below.) By combining these results, the main theorem will follow.

The following lemma provides a bound on the singular values of the matrix $V_{k}^{T}S_{1}D_{1}$ computed by the randomized phase of Algorithm 1, from which it will follow that the matrix $V_{k}^{T}S_{1}D_{1}$ is full rank. To prove the lemma, we employ Theorem 2 of the Appendix (this theorem is a variant of a result of Rudelson and Vershynin in ). Note that probabilities of the form of eqn. (13) actually suffice to establish Lemma 1.

Let $S_{1}$ and $D_{1}$ be constructed using Algorithm 1. Then, with probability at least $0.9$ ,

In particular, $V_{k}^{T}S_{1}D_{1}$ has full rank.

Proof: In order to bound $\sigma_{k}\left(V_{k}^{T}S_{1}D_{1}\right)$ , we will bound $\mbox{}\left\|V_{k}^{T}S_{1}D_{1}D_{1}S_{1}^{T}V_{k}-I_{k}\right\|_{2}$ . Towards that end, we will use Theorem 2 with $\beta=1/2$ and $\epsilon=1/20$ , which results in the value for $c$ in eqn. (11). Note that the sampling probabilities in eqn. (12) satisfy

Now Theorem 2 and our construction of $S_{1}$ and $D_{1}$ guarantee that for $c$ as in eqn. (11)

We note here that the condition $c_{0}^{2}\mbox{}\left\|V_{k}\right\|_{F}^{2}\geq 4\beta\epsilon^{2}$ in Theorem 2 is trivially satisfied assuming that $c_{0}$ is at least one (given our choices for $\beta$ , $\epsilon$ , and $\mbox{}\left\|V_{k}\right\|_{F}^{2}=k\geq 1$ ). Using $V_{k}^{T}V_{k}=I_{k}$ and Markov’s inequality we get that with probability at least $0.9$ ,

Standard matrix perturbation theory results now imply that for all $i=1,\ldots,k$ ,

Let $S_{1}$ , $D_{1}$ , and $S_{2}$ be constructed as described in Algorithm 1 and recall that $C=AS_{1}S_{2}$ . If $V_{k}^{T}S_{1}D_{1}$ has full rank, then for $\xi=2,F$ ,

Proof: We seek to bound the spectral and Frobenius norms of $A-P_{C}A$ , where $C=AS_{1}S_{2}$ is constructed by Algorithm 1. To do so, first notice that scaling the columns of a matrix (equivalently, post-multiplying the matrix by a diagonal matrix) by any non-zero scale factors does not change the subspace spanned by the columns of the matrix. Thus,

where, in the last line, we have introduced the convenient notation $S=S_{1}D_{1}S_{2}\in R^{n\times k}$ that we will use throughout the remainder of this proof. In the sequel we seek to bound the residual

This implies that in eqn. (15) we can replace $(AS)^{+}A$ with any other $k\times n$ matrix and the equality with an inequality. In particular we replace $(AS)^{+}A$ with $(A_{k}S)^{+}A_{k}$ , where $A_{k}$ is the best rank- $k$ approximation to $A$ :

Let $A_{\rho-k}=U_{\rho-k}\Sigma_{\rho-k}V_{\rho-k}^{T}$ . Then, $A=A_{k}+A_{\rho-k}$ and, using the triangle inequality,

We now bound $\gamma_{1}$ , $\gamma_{2}$ , and $\gamma_{3}$ . First, for $\gamma_{1}$ , note that:

In eqn. (17), we replaced $(U_{k}\Sigma_{k}V_{k}^{T}S)^{+}$ by $(V_{k}^{T}S)^{+}(U_{k}\Sigma_{k})^{+}$ . This follows since the statement of our lemma assumes that the matrix $V_{k}^{T}S_{1}D_{1}$ has full rank. Also, the construction of $S_{2}$ guarantees that the columns of $V_{k}^{T}S_{1}D_{1}$ that are selected in the second stage of Algorithm 1 are linearly independent, and thus the $k\times k$ matrix $V_{k}^{T}S=V_{k}^{T}S_{1}D_{1}S_{2}$ has full rank and is invertible. In eqn. (18), $U_{k}$ and $V_{k}^{T}$ can be dropped without increasing a unitarily invariant norm, while eqn. (19) follows since $V_{k}^{T}S$ is a full-rank $k\times k$ matrix. Next, note that $\gamma_{2}=\mbox{}\left\|A_{\rho-k}\right\|_{\xi}=\mbox{}\left\|A-A_{k}\right\|_{\xi}$ . Finally, to conclude the proof, we bound $\gamma_{3}$ as follows:

Eqn. (20) follows by the orthogonality of $U_{\rho-k}$ and $V_{k}$ and the fact that $V_{k}^{T}S$ is a $k\times k$ invertible matrix (see above). Eqn. (21) follows from the fact that for any two matrices $X$ and $Y$ and $\xi=2,F$ , $\mbox{}\left\|XY\right\|_{\xi}\leq\mbox{}\left\|X\right\|_{\xi}\mbox{}\left\|Y\right\|_{2}$ . Finally, eqn. (22) follows since $S=S_{1}DS_{2}$ and $S_{2}$ is an orthogonal matrix.

Let $S_{1}$ , $D_{1}$ , and $S_{2}$ be constructed using Algorithm 1. Then, with probability at least $0.9$ ,

Proof: From Lemma 1 we know that $\sigma_{i}\left(V_{k}^{T}S_{1}D_{1}\right)\geq 1/2$ holds for all $i=1,\ldots,k$ with probability at least $0.9$ . The deterministic construction of $S_{2}$ (see Algorithm 4 of with the parameter $f$ set to $\sqrt{2}$ ) guarantees that

( $\xi=2$ ) If $S_{1}$ and $D_{1}$ are constructed as described in Algorithm 1, then, with probability at least $0.9$ ,

Proof: Let $\Gamma=\Sigma_{\rho-k}V_{\rho-k}^{T}V_{\rho-k}\Sigma_{\rho-k}=\Sigma_{\rho-k}^{2}$ . We manipulate $\mbox{}\left\|\Sigma_{\rho-k}V_{\rho-k}^{T}S_{1}D_{1}\right\|_{2}^{2}$ as follows:

Given our construction of $S_{1}$ and $D_{1}$ and applying eqn. (9) of Theorem 1 of with $\beta=1/2$ and $\delta=0.1$ , we get that with probability at least $0.9$ ,

Thus, by combining the above results and using $\mbox{}\left\|\Sigma_{\rho-k}^{2}\right\|_{2}=\mbox{}\left\|A-A_{k}\right\|_{2}^{2}$ and $\mbox{}\left\|\Sigma_{\rho-k}V_{\rho-k}^{T}\right\|_{F}^{2}=\mbox{}\left\|A-A_{k}\right\|_{F}^{2}$ we get

To conclude the proof of the lemma we take the square roots of both sides of the above inequality.

( $\xi=F$ ) If $S_{1}$ and $D_{1}$ are constructed as described in Algorithm 1, then, with probability at least $0.9$ ,

Proof: It is straightforward to prove that with our construction of $S_{1}$ and $D_{1}$ , the expectation of $\mbox{}\left\|\Sigma_{\rho-k}V_{\rho-k}^{T}S_{1}D_{1}\right\|_{F}^{2}$ is equal to $\mbox{}\left\|\Sigma_{\rho-k}V_{\rho-k}^{T}\right\|_{F}^{2}$ . In addition, note that the latter quantity is exactly equal to $\mbox{}\left\|A-A_{k}\right\|_{F}^{2}$ . Applying Markov’s inequality, we get that, with probability at least $0.9$ ,

Taking square roots of both sides of the above inequality concludes the proof of the lemma.

4 Completing the proof of Theorem 1

To prove the Frobenius norm bound of Theorem 1 we combine Lemma 2 (with $\xi=F$ ) with Lemmas 3 and 5. Thus, we get

Using $c=\Theta\left(k\log k\right)$ immediately derives the Frobenius norm bound of Theorem 1. Notice that Lemma 3 fails with probability at most $0.1$ and that Lemma 5 fails with probability at most $0.1$ ; thus, applying the standard union bound, it follows that the Frobenius norm bound of Theorem 1 holds with probability at least $0.8$ . To prove the spectral norm bound of Theorem 1 we combine Lemma 2 (with $\xi=2$ ) with Lemmas 3 and 4. Thus, we get

Using $c=\Theta\left(k\log k\right)$ immediately derives the spectral norm bound of Theorem 1. Notice that Lemma 3 fails with probability at most $0.1$ and that Lemma 4 fails with probability at most $0.1$ ; thus, applying the standard union bound, it follows that the spectral norm bound of Theorem 1 holds with probability at least $0.8$ .

Acknowledgements

We are grateful to Daniel Spielman and Ilse Ipsen for numerous useful discussions on the results of this paper. We would also like to thank an anonymous reviewer of an earlier version of this manuscript who provided a counterexample to Lemma 4.4 of , and thus helped us identify the error in the proof of that lemma.

References

Appendix

for all $i\in[n]$ for some constant $\beta\in(0,1]$ . Let $\epsilon\in(0,1)$ be an accuracy parameter, assume $c_{0}^{2}\mbox{}\left\|A\right\|_{F}^{2}\geq 4\beta\epsilon^{2}$ , and let

(Here $c_{0}$ is the unknown constant of Theorem 3.1, p. 8 of .) Then,