Relative Errors for Deterministic Low-Rank Matrix Approximations

Mina Ghashami, Jeff M. Phillips

Introduction

The data streaming paradigm considers computation on a large data set AA where one data item arrives at a time, is processed, and then is not read again. It enforces that only a small amount of memory is available at any given time. This small space constraint is critical when the full data set cannot fit in memory or disk. Typically, the amount of space required is traded off with the accuracy of the computation on AA. Usually the computation results in some summary S(A)S(A) of AA, and this trade-off determines how accurate one can be with the available space resources. Although computational runtime is important, in this paper we mainly focus on space constraints and the types of approximation guarantees that can be made.

In truly large datasets, one processor (and memory) is often incapable of handling all of the dataset AA in a feasible amount of time. Even reading a terabyte of data on a single processor can take many hours. Thus this computation is often spread among some set of processors, and then the summary of AA is combined after (or sometimes during ) its processing on each processor. Again often each item is read once, whether it comes from a single large source or is being generated on the fly. The key computational problem shifts from updating a summary S(A)S(A) when witnessing a single new data item (the streaming model), to taking two summaries S(A1)S(A_{1}) and S(A2)S(A_{2}) and constructing a single new summary S(A1A2)S(A_{1}\cup A_{2}). In this new paradigm the goal is to have the same space-approximation trade-offs in S(A1A2)S(A_{1}\cup A_{2}) as possible for a streaming algorithm. When such a process is possible, the summary is known as a mergeable summary . Linear sketches are trivially mergeable, so this allows many streaming algorithms to directly translate to this newer paradigm. Again, space is a critical resource since it directly corresponds with the amount of data needed to transmit across the network, and emerging cost bottleneck in big data systems.

In this paper we focus on deterministic mergeable summaries for low-rank matrix approximation, based on recent work by Liberty , that is already known to be mergeable . Thus our focus is a more careful analysis of the space-error trade off for the algorithm, and we describe them under the streaming setting for simplicity; all bounds directly carry over into mergeable summary results.

In particular we re-analyze the Frequent Directions algorithm of Liberty to show it provides relative error bounds for matrix sketching, and conjecture it achieves the optimal space, up to log\log factors, for any row-update based summary. This supports the strong empirical results of Liberty . His analysis only provided additive error bounds which are hard to compare to more conventional ways of measuring accuracy of matrix approximation algorithms.

In this problem AA is an n×dn\times d matrix and the stream processes each row aia_{i} (of length dd) at a time. Typically the matrix is assumed to be tall so ndn\gg d, and sometimes the matrix will be assumed to be sparse so the number of non-zero entries nnz(A){\small\textsf{nnz}}(A) of AA will be small, nnz(A)nd{\small\textsf{nnz}}(A)\ll nd (e.g. nnz(A)=O((n+d)log(nd))){\small\textsf{nnz}}(A)=O((n+d)\log(nd))).

The best rank-kk approximation to AA (under Frobenius or 2 norm) is denoted as AkA_{k} and can be computed in O(nd2)O(nd^{2}) time on a tall matrix using the singular value decomposition. The svd(A){\small\textsf{svd}}(A) produces three matrices UU, SS, and VV where UU and VV are orthonormal, of size n×nn\times n and d×dd\times d, respectively, and SS is n×dn\times d but only has non-zero elements on its diagonal {σ1,,σd}\{\sigma_{1},\ldots,\sigma_{d}\}. Let UkU_{k}, SkS_{k}, and VkV_{k} be the first kk columns of each matrix, then A=USVTA=USV^{T} and Ak=UkSkVkTA_{k}=U_{k}S_{k}V_{k}^{T}. Note that although AkA_{k} requires O(nd)O(nd) space, the set of matrices {Uk,Sk,Vk}\{U_{k},S_{k},V_{k}\} require only a total of O((n+d)k)O((n+d)k) space (or O(nk)O(nk) if the matrix is tall). Moreover, even the set {U,S,V}\{U,S,V\} really only takes O(nd+d2)O(nd+d^{2}) space since we can drop the last ndn-d columns of UU, and the last ndn-d rows of SS without changing the result. In the streaming version, the goal is to compute something that replicates the effect of AkA_{k} using less space and only seeing each row once.

The strongest version, (providing construction bounds) for some parameter ε(0,1)\varepsilon\in(0,1), is some representation of a rank kk matrix A^\hat{A} such that AA^ξ(1+ε)AAkξ\|A-\hat{A}\|_{\xi}\leq(1+\varepsilon)\|A-A_{k}\|_{\xi} for ξ={2,F}\xi=\{2,F\}. Unless AA is sparse, then storing A^\hat{A} explicitly may require Ω(nd)\Omega(nd) space, so that is why various representations of A^\hat{A} are used in its place. This can include decompositions similar to the SVD, e.g. a CUR decomposition where A^=CUR\hat{A}=CUR and where UU is small and dense, and CC and RR are sparse and skinny, or others where the middle matrix is still diagonal. The sparsity is often preserved by constructing the wrapper matrices (e.g. CC and RR) from the original columns or rows of AA. There is an obvious Ω(n+d)\Omega(n+d) space bound for any construction result in order to preserve the column and the row space.

Projection bounds.

Streaming algorithms.

Many of these algorithms are not streaming algorithms. To the best of our understanding, the best streaming algorithm is due to Clarkson and Woodruff. All bounds assume each matrix entry requires O(lognd)O(\log nd) bits. It is randomized and it constructs a decomposition of a rank kk matrix A^\hat{A} that satisfies AA^F(1+ε)AAkF\|A-\hat{A}\|_{F}\leq(1+\varepsilon)\|A-A_{k}\|_{F}, with probability at least 1δ1-\delta. This provides a relative error construction bound of size O((k/ε)(n+d)log(nd))O((k/\varepsilon)(n+d)\log(nd)) bits. They also show an Ω((k/ε)(n+d))\Omega((k/\varepsilon)(n+d)) bits lower bound.

Although not explicitly described in their paper, one can directly use their techniques and analysis to achieve a weak form of a projection bound. One maintains a matrix B=ASB=AS with m=O((k/ε)log(1/δ))m=O((k/\varepsilon)\log(1/\delta)) columns where SS is a d×md\times m matrix where each entry is chosen from {1,+1}\{-1,+1\} at random. Then setting A^=πB(A)\hat{A}=\pi_{B}(A), achieves a (1+ε)(1+\varepsilon) projection bound, however BB is rank O((k/ε)log(1/δ))O((k/\varepsilon)\log(1/\delta)) and hence that is the only bound on A^\hat{A} as well. The construction lower bound suggests that there is an Ω(dk/ε)\Omega(dk/\varepsilon) bits lower bound for projection, but this is not directly proven.

They also study this problem in the turnstile model where each element of the matrix can be updated at each step (including subtractions). In this model they require O((k/ε2)(n+d/ε2)log(nd))O((k/\varepsilon^{2})(n+d/\varepsilon^{2})\log(nd)) bits, and show an Ω((k/ε)(n+d)log(nd))\Omega((k/\varepsilon)(n+d)\log(nd)) bits lower bound.

Another more general “coreset” result is provided by Feldman et al. . In the streaming setting it requires O((k/ε)logn)O((k/\varepsilon)\log n) space and can be shown to provide a rank O(k/ε)O(k/\varepsilon) matrix BB that satisfies a relative error bound of the form AπB(A)F2(1+ε)AAkF2\|A-\pi_{B}(A)\|_{F}^{2}\leq(1+\varepsilon)\|A-A_{k}\|_{F}^{2}.

Column sampling.

Although not typically described as streaming algorithms (perhaps because the focus was on sampling columns which already have length nn) when a matrix is processed row wise there exists algorithms that can use reservoir sampling to become streaming. The best streaming algorithm samples O(k/ε2)O(k/\varepsilon^{2}) rows (proportional to their squared norm) to obtain a matrix RR so that AπR(A)F2AAkF2+εAF2\|A-\pi_{R}(A)\|_{F}^{2}\leq\|A-A_{k}\|_{F}^{2}+\varepsilon\|A\|_{F}^{2}, a weaker additive error bound. These techniques can also build approximate decompositions of A^\hat{A} instead of using πR(A)\pi_{R}(A), but again these decompositions are only known to work with at least 22 passes, and are thus not streaming.

Other.

There is a wealth of literature on this problem; most recently two algorithms showed how to construct a decomposition of A^\hat{A} that has rank kk with error bound AA^F2(1+ε)AAkF2\|A-\hat{A}\|_{F}^{2}\leq(1+\varepsilon)\|A-A_{k}\|_{F}^{2} with constant probability in approximately O(nnz(A))O({\small\textsf{nnz}}(A)) time. We refer to these papers for a more thorough survey of the history of the area, many other results, and other similar approximate linear algebra applications. But we attempt to report many of the most important related results in Appendix A.

Finally we mention a recent algorithm by Liberty which runs in O(nd/ε)O(nd/\varepsilon) time, maintains a matrix with 2/ε2/\varepsilon rows in a row-wise streaming algorithm, and produces a matrix A^\hat{A} of rank at most 2/ε2/\varepsilon so that for any unit vector xx of length dd satisfies 0Ax2A^x2εAF20\leq\|Ax\|^{2}-\|\hat{A}x\|^{2}\leq\varepsilon\|A\|^{2}_{F}. We examine a slight variation of this algorithm and describe bounds that it achieves in more familiar terms.

Incremental PCA.

We mention one additional line of work on incremental PCA . These approaches attempt to maintain the PCA of a dataset AA (using SVD and a constant amount of additional bookkeeping space) as each row of AA arrives in a stream. In particular, after i1i-1 rows they consider maintaining AkiA^{i}_{k}, and on a new row aia_{i} compute svd([Aki;ai])=UiSi(Vi)T{\small\textsf{svd}}([A^{i}_{k};a_{i}])=U^{i}S^{i}(V^{i})^{T} and, then only retain its top rank kk approximation as Aki+1=UkiSki(Vki)TA^{i+1}_{k}=U^{i}_{k}S^{i}_{k}(V^{i}_{k})^{T}. This is remarkably similar to Liberty’s algorithm , but is missing the Misra-Gries step (we describe Liberty’s algorithm in more detail in Section 2.2). As a result, incremental PCA can have arbitrarily bad error on adversarial data.

Consider an example where the first kk rows generate a matrix AkA_{k} with kkth singular value σk=10\sigma_{k}=10. Then each row thereafter aia_{i} for i>ki>k is orthogonal to the first kk rows of AA, and has norm 55. This will cause the (k+1)(k+1)th right singular vector and value σk+1\sigma_{k+1} of svd([Aki;ai]){\small\textsf{svd}}([A^{i}_{k};a_{i}]) to exactly describe the subspace of aia_{i} with σk+1=5\sigma_{k+1}=5. Thus this row aia_{i} will always be removed on the processing step and Aki+1A^{i+1}_{k} will be unchanged from AkiA^{i}_{k}. If all rows aia_{i} for i>ki>k are pointing in the same direction, this can cause arbitrarily bad errors of all forms of measuring approximation error considered above.

2 Our Results

Our main result is a deterministic relative error bound for low-rank matrix approximation. A major highlight is that all proofs are, we believe, quite easy to follow.

This is the smallest space streaming algorithm known for these bounds. Also, it is deterministic, whereas previous algorithms were randomized.

We note that it is common for the bounds to be written without squared norms, for instance as AπQk(Q)F(1+ε)AAkF\|A-\pi_{Q_{k}}(Q)\|_{F}\leq(1+\varepsilon)\|A-A_{k}\|_{F}. For ε>0\varepsilon>0, if we take the square root of both sides of the bound above AπQk(A)F2(1+ε)AAkF2\|A-\pi_{Q_{k}}(A)\|_{F}^{2}\leq(1+\varepsilon)\|A-A_{k}\|_{F}^{2}, then we still get a (1+ε)(1+ε)\sqrt{(1+\varepsilon)}\leq(1+\varepsilon) approximation.

No sparse Frequent Directions.

3 Matrix Notation

Here we quickly review some notation. An n×dn\times d matrix AA can be written as a set of nn rows as [a1;a2;,an][a_{1};a_{2};\ldots,a_{n}] where each aia_{i} is a row of length dd. Alternatively a matrix VV can be written as a set of columns [v1,v2,,vd][v_{1},v_{2},\ldots,v_{d}].

The Frobenius norm of a matrix AA is defined AF=i=1ai2\|A\|_{F}=\sqrt{\sum_{i=1}\|a_{i}\|^{2}} where ai\|a_{i}\| is Euclidean norm of aia_{i}. Let AkA_{k} be the best rank kk approximation of the matrix AA, specifically Ak=argmaxC:rank(C)kACFA_{k}={\arg\max}_{C:{\small\textsf{rank}}(C)\leq k}\|A-C\|_{F}.

Given a row rr and a matrix XX let πX(r)\pi_{X}(r) be a projection operation of rr onto the subspace spanned by XX. In particular, we will project onto the row space of XX, and this can be written as πX(r)=rXT(XXT)+X\pi_{X}(r)=rX^{T}(XX^{T})^{+}X where Y+Y^{+} indicates taking the Moore-Penrose psuedoinverse of YY. But whether it projects to the row space or the column space will not matter since we will always use the operator inside of a Frobenius norm.

This operator can be defined to project matrices RR as well, denoted as πX(R)\pi_{X}(R), where this can be thought of as projecting each row of the matrix RR individually.

Review of Related Algorithms

We begin by reviewing two streaming algorithms that our results can be seen as an extension. The first is an algorithm for heavy-hitters from Misra-Gries and its improved analysis by Berinde et al. . We re-prove the relevant part of these results in perhaps a simpler way. Next we describe the algorithm of Liberty for low-rank matrix approximation that our analysis is based on. We again re-prove his result, with a few additional intermediate results we will need for our extended analysis. One familiar with the work of Misra-Gries , Berinde et al. , and Liberty can skip this section, although we will refer to some lemmas re-proven below.

Let A={a1,,an}A=\{a_{1},\ldots,a_{n}\} be a set of nn elements where each ai[u]a_{i}\in[u]. Let fj={aiAai=j}f_{j}=|\{a_{i}\in A\mid a_{i}=j\}| for j[u]j\in[u]. Assume without loss of generality that fjfj+1f_{j}\geq f_{j+1} for all jj, and define Fk=j=1kfjF_{k}=\sum_{j=1}^{k}f_{j}. This is just for notation, and not known ahead of time by algorithms.

If aia_{i} matches a label, increment the associated counter.

If not, and there is an empty counter, change the label of the counter to aia_{i} and set its counter to 11.

Otherwise, if there are no empty counters, then decrement all counters by 11.

To return f^j\hat{f}_{j}, if there is a label with jj, then return the associated counter; otherwise return .

Define F^k=j=1kf^j\hat{F}_{k}=\sum_{j=1}^{k}\hat{f}_{j} and let Rk=j=k+1ufj=nFkR_{k}=\sum_{j=k+1}^{u}f_{j}=n-F_{k}. The value RkR_{k} represents the total counts that cannot be described (even optimally) if we only use kk counters. A bound on FkF^kF_{k}-\hat{F}_{k} in terms of RkR_{k} is more interesting than one in terms of nn, since this algorithm is only useful when there are only really kk items that matter and the rest can be ignored. We next reprove a result of Berinde et al. (in their Appendix A).

This result can be viewed as a warm up for the rank kk matrix approximation to come, as those techniques will follow a very similar strategy.

2 Additive Error Frequent Directions

Recently Liberty discovered how to apply this technique towards sketching matrices. Next we review his approach, and for perspective and completeness re-prove his main results.

Analysis.

Let YjY_{j} be the jjth column of YY, then

Subtracting Qix2\|Q^{i}x\|^{2} from both sides completes the proof. ∎

Notice that Cix2=Qi1x2+aix2\|C^{i}x\|^{2}=\|Q^{i-1}x\|^{2}+\|a_{i}x\|^{2} for all 2in2\leq i\leq n and that Q1x2=a1x2\|Q^{1}x\|^{2}=\|a_{1}x\|^{2}. By substituting this into inequality from Lemma 2.2, we get

Subtracting Qi1x2\|Q^{i-1}x\|^{2} from both sides and summing over ii reveals

Subtracting Qnx2=Qx2\|Q^{n}x\|^{2}=\|Qx\|^{2} from both sides proves the second inequality of the lemma.

To see the first inequality observe Qi1x2+aix2=Cix2Qix2\|Q^{i-1}x\|^{2}+\|a_{i}x\|^{2}=\|C^{i}x\|^{2}\leq\|Q^{i}x\|^{2} for all 1in1\leq i\leq n. Then we can expand

Algorithm 1 maintains for any unit vector xx that

New Relative Error Bounds for Frequent Directions

We now generalize the relative error type bounds for Misra-Gries (in Section 2.1) to the Frequent Directions algorithm (in Section 2.2).

This way QkQ_{k} is also rank kk (and size k×dk\times d), and will have nice approximation properties to AkA_{k}. Recall that Ak=UkΣkVkTA_{k}=U_{k}\Sigma_{k}V_{k}^{T} where [U,Σ,V]=svd(A)[U,\Sigma,V]={\small\textsf{svd}}(A) and UkU_{k}, Σk\Sigma_{k}, VkV_{k} are the first kk columns of these matrices, representing the first kk principal directions. Let V=[v1,,vd]V=[v_{1},\ldots,v_{d}] be the right singular vectors of AA.

Now we can show that projecting AA onto QkQ_{k} provides a relative error approximation.

AπQk(A)F2(1+ε)AAkF2.\|A-\pi_{Q_{k}}(A)\|_{F}^{2}\leq(1+\varepsilon)\|A-A_{k}\|_{F}^{2}.

Using the vectors viv_{i} as right singular vectors of AA, and letting r=rank(A)r={\small\textsf{rank}}(A), then we have

We would also like to relate the Frobenius norm of QkQ_{k} directly to that of AkA_{k}, instead of projecting AA onto it (which cannot be done in a streaming setting, at least not in ω(n)\omega(n) space). However AQkF\|A-Q_{k}\|_{F} does not make sense since QkQ_{k} has a different number of rows than AA. However, we can decompose AAkF2=AF2AkF2\|A-A_{k}\|_{F}^{2}=\|A\|_{F}^{2}-\|A_{k}\|_{F}^{2} since AkA_{k} is a projection of AA onto a (the best rank kk) subspace, and we can use the Pythagorean Theorem. Now we can compare AF2AkF2\|A\|_{F}^{2}-\|A_{k}\|_{F}^{2} to AF2QkF2\|A\|_{F}^{2}-\|Q_{k}\|_{F}^{2}.

AF2AkF2AF2QkF2(1+ε)(AF2AkF2)\|A\|_{F}^{2}-\|A_{k}\|_{F}^{2}\leq\|A\|_{F}^{2}-\|Q_{k}\|_{F}^{2}\leq(1+\varepsilon)(\|A\|_{F}^{2}-\|A_{k}\|_{F}^{2}).

One may ask why not compare AkF\|A_{k}\|_{F} to QkF\|Q_{k}\|_{F} directly, instead of subtracting from AF2\|A\|_{F}^{2}. First note that the above bound does guarantee that AkFQkF\|A_{k}\|_{F}\geq\|Q_{k}\|_{F}. Second, in situations where a rank kk approximation is interesting, then most of the mass from AA should be in its top kk components. Then AkF>AAkF\|A_{k}\|_{F}>\|A-A_{k}\|_{F} so the above bound is actually tighter. To demonstrate this we can state the following conditional statement comparing AkF\|A_{k}\|_{F} and QkF\|Q_{k}\|_{F}.

If AAkFAkF\|A-A_{k}\|_{F}\leq\|A_{k}\|_{F}, then

The second inequality follows from Lemma 3.3, by subtracting AF2\|A\|_{F}^{2}. The first inequality uses Lemma 2.3.

Finally, we summarize all of our bounds about Algorithm 1.

and the projection of QQ along its top kk right singular values is a k×dk\times d matrix QkQ_{k} which satisfies

No Sparse Frequent Directions

In this section we show that this is not possible by extending the frequent directions algorithm.

To make this process work we need the following requirements. Let δi=minjQj(Q)2\delta_{i}=\min_{j}\|{\perp}_{Q_{-j}}(Q)\|^{2} represent the smallest amount of squared norm resulting from removing one row from QQ by the procedure above. Define X(Y){\perp}_{X}(Y) as the orthogonal projection of YY onto XX. It projects each row of YY onto the subspace orthogonal to the basis of XX. It can be interpreted as X(Y)=YπX(Y){\perp}_{X}(Y)=Y-\pi_{X}(Y). Also, we let QjQ_{-j} be the matrix QQ after removing the jjth row. Assume we remove this row, although removing any row is just as difficult but would create even more error.

Hard construction.

In order to satisfy (P2) we consider the vector xx as defined above. We can observe

Acknowledgements

We thank Edo Liberty for encouragement and helpful comments, including pointing out several mistakes in an earlier version of this paper. And thank David P. Woodruff, Christos Boutsidis, Dan Feldman, and Christian Sohler for helping interpret some results.

References

Appendix A Tables of Previous Bounds

In this appendix we try to survey the landscape of work in low-rank matrix approximation. There are many types of bounds, models of construction, and algorithms. We tried to group them into three main categories: Streaming, Fast Runtime, and Column Sampling. We also tried to write bounds in a consistent compatible format. To do so, some parts needed to be slightly simplified – hopefully we got everything correct. The authors will be glad to know if we missed or misrepresented any results.

The size is sometimes measured in terms of the number of columns (#C) and/or the number of rows (#R). Otherwise, if #R or #C is not specified the space refers the number of words in the RAM model where it is assumed O(lognd)O(\log nd) bits fit in a single word.

A projection result builds a subspace GG so that A^=πG(A)\hat{A}=\pi_{G}(A), but does not actually construct πG(A)\pi_{G}(A). This is denoted by P. Ideally rank(G)=k{\small\textsf{rank}}(G)=k. When that is not the case, then it is denoted Pr\textsf{P}_{r} where rr is replaced by the rank of GG.

A construction result builds a series of (usually 3) matrices (say CC, UU, and RR) where A^=CUR\hat{A}=CUR. Note again, it does not construct A^\hat{A} since it may be of larger size than all of CC, UU, and RR together, but the three matrices can be used in place of A^\hat{A}. This is denoted C.

ε\varepsilon-relative error is of the form AA^F(1+ε)AAkF\|A-\hat{A}\|_{F}\leq(1+\varepsilon)\|A-A_{k}\|_{F} where AkA_{k} is the best rank-kk approximation to AA. This is denoted ε\varepsilonR.

ε\varepsilon-additive error is of the form AA^F2AAkF2+εAF2\|A-\hat{A}\|^{2}_{F}\leq\|A-A_{k}\|^{2}_{F}+\varepsilon\|A\|^{2}_{F}. This is denoted ε\varepsilonA. This can sometimes also be expressed as a spectral norm of the form AA^22AAk22+εAF2\|A-\hat{A}\|^{2}_{2}\leq\|A-A_{k}\|^{2}_{2}+\varepsilon\|A\|^{2}_{F} (note the error term εAF2\varepsilon\|A\|^{2}_{F} still has a Frobenius norm). This is denoted ε\varepsilonL2.

In a few cases the error does not follow these patterns and we specially denote it.

Algorithms are randomized unless it is specified. In all tables we state bounds for a constant probability of failure. If we want to decrease the probability of failure to some parameter δ\delta, we can generally increase the size and runtime by O(log(1/δ))O(\log(1/\delta)).