Relative Errors for Deterministic Low-Rank Matrix Approximations
Mina Ghashami, Jeff M. Phillips
Introduction
The data streaming paradigm considers computation on a large data set where one data item arrives at a time, is processed, and then is not read again. It enforces that only a small amount of memory is available at any given time. This small space constraint is critical when the full data set cannot fit in memory or disk. Typically, the amount of space required is traded off with the accuracy of the computation on . Usually the computation results in some summary of , and this trade-off determines how accurate one can be with the available space resources. Although computational runtime is important, in this paper we mainly focus on space constraints and the types of approximation guarantees that can be made.
In truly large datasets, one processor (and memory) is often incapable of handling all of the dataset in a feasible amount of time. Even reading a terabyte of data on a single processor can take many hours. Thus this computation is often spread among some set of processors, and then the summary of is combined after (or sometimes during ) its processing on each processor. Again often each item is read once, whether it comes from a single large source or is being generated on the fly. The key computational problem shifts from updating a summary when witnessing a single new data item (the streaming model), to taking two summaries and and constructing a single new summary . In this new paradigm the goal is to have the same space-approximation trade-offs in as possible for a streaming algorithm. When such a process is possible, the summary is known as a mergeable summary . Linear sketches are trivially mergeable, so this allows many streaming algorithms to directly translate to this newer paradigm. Again, space is a critical resource since it directly corresponds with the amount of data needed to transmit across the network, and emerging cost bottleneck in big data systems.
In this paper we focus on deterministic mergeable summaries for low-rank matrix approximation, based on recent work by Liberty , that is already known to be mergeable . Thus our focus is a more careful analysis of the space-error trade off for the algorithm, and we describe them under the streaming setting for simplicity; all bounds directly carry over into mergeable summary results.
In particular we re-analyze the Frequent Directions algorithm of Liberty to show it provides relative error bounds for matrix sketching, and conjecture it achieves the optimal space, up to factors, for any row-update based summary. This supports the strong empirical results of Liberty . His analysis only provided additive error bounds which are hard to compare to more conventional ways of measuring accuracy of matrix approximation algorithms.
In this problem is an matrix and the stream processes each row (of length ) at a time. Typically the matrix is assumed to be tall so , and sometimes the matrix will be assumed to be sparse so the number of non-zero entries of will be small, (e.g. .
The best rank- approximation to (under Frobenius or 2 norm) is denoted as and can be computed in time on a tall matrix using the singular value decomposition. The produces three matrices , , and where and are orthonormal, of size and , respectively, and is but only has non-zero elements on its diagonal . Let , , and be the first columns of each matrix, then and . Note that although requires space, the set of matrices require only a total of space (or if the matrix is tall). Moreover, even the set really only takes space since we can drop the last columns of , and the last rows of without changing the result. In the streaming version, the goal is to compute something that replicates the effect of using less space and only seeing each row once.
The strongest version, (providing construction bounds) for some parameter , is some representation of a rank matrix such that for . Unless is sparse, then storing explicitly may require space, so that is why various representations of are used in its place. This can include decompositions similar to the SVD, e.g. a CUR decomposition where and where is small and dense, and and are sparse and skinny, or others where the middle matrix is still diagonal. The sparsity is often preserved by constructing the wrapper matrices (e.g. and ) from the original columns or rows of . There is an obvious space bound for any construction result in order to preserve the column and the row space.
Projection bounds.
Streaming algorithms.
Many of these algorithms are not streaming algorithms. To the best of our understanding, the best streaming algorithm is due to Clarkson and Woodruff. All bounds assume each matrix entry requires bits. It is randomized and it constructs a decomposition of a rank matrix that satisfies , with probability at least . This provides a relative error construction bound of size bits. They also show an bits lower bound.
Although not explicitly described in their paper, one can directly use their techniques and analysis to achieve a weak form of a projection bound. One maintains a matrix with columns where is a matrix where each entry is chosen from at random. Then setting , achieves a projection bound, however is rank and hence that is the only bound on as well. The construction lower bound suggests that there is an bits lower bound for projection, but this is not directly proven.
They also study this problem in the turnstile model where each element of the matrix can be updated at each step (including subtractions). In this model they require bits, and show an bits lower bound.
Another more general “coreset” result is provided by Feldman et al. . In the streaming setting it requires space and can be shown to provide a rank matrix that satisfies a relative error bound of the form .
Column sampling.
Although not typically described as streaming algorithms (perhaps because the focus was on sampling columns which already have length ) when a matrix is processed row wise there exists algorithms that can use reservoir sampling to become streaming. The best streaming algorithm samples rows (proportional to their squared norm) to obtain a matrix so that , a weaker additive error bound. These techniques can also build approximate decompositions of instead of using , but again these decompositions are only known to work with at least passes, and are thus not streaming.
Other.
There is a wealth of literature on this problem; most recently two algorithms showed how to construct a decomposition of that has rank with error bound with constant probability in approximately time. We refer to these papers for a more thorough survey of the history of the area, many other results, and other similar approximate linear algebra applications. But we attempt to report many of the most important related results in Appendix A.
Finally we mention a recent algorithm by Liberty which runs in time, maintains a matrix with rows in a row-wise streaming algorithm, and produces a matrix of rank at most so that for any unit vector of length satisfies . We examine a slight variation of this algorithm and describe bounds that it achieves in more familiar terms.
Incremental PCA.
We mention one additional line of work on incremental PCA . These approaches attempt to maintain the PCA of a dataset (using SVD and a constant amount of additional bookkeeping space) as each row of arrives in a stream. In particular, after rows they consider maintaining , and on a new row compute and, then only retain its top rank approximation as . This is remarkably similar to Liberty’s algorithm , but is missing the Misra-Gries step (we describe Liberty’s algorithm in more detail in Section 2.2). As a result, incremental PCA can have arbitrarily bad error on adversarial data.
Consider an example where the first rows generate a matrix with th singular value . Then each row thereafter for is orthogonal to the first rows of , and has norm . This will cause the th right singular vector and value of to exactly describe the subspace of with . Thus this row will always be removed on the processing step and will be unchanged from . If all rows for are pointing in the same direction, this can cause arbitrarily bad errors of all forms of measuring approximation error considered above.
2 Our Results
Our main result is a deterministic relative error bound for low-rank matrix approximation. A major highlight is that all proofs are, we believe, quite easy to follow.
This is the smallest space streaming algorithm known for these bounds. Also, it is deterministic, whereas previous algorithms were randomized.
We note that it is common for the bounds to be written without squared norms, for instance as . For , if we take the square root of both sides of the bound above , then we still get a approximation.
No sparse Frequent Directions.
3 Matrix Notation
Here we quickly review some notation. An matrix can be written as a set of rows as where each is a row of length . Alternatively a matrix can be written as a set of columns .
The Frobenius norm of a matrix is defined where is Euclidean norm of . Let be the best rank approximation of the matrix , specifically .
Given a row and a matrix let be a projection operation of onto the subspace spanned by . In particular, we will project onto the row space of , and this can be written as where indicates taking the Moore-Penrose psuedoinverse of . But whether it projects to the row space or the column space will not matter since we will always use the operator inside of a Frobenius norm.
This operator can be defined to project matrices as well, denoted as , where this can be thought of as projecting each row of the matrix individually.
Review of Related Algorithms
We begin by reviewing two streaming algorithms that our results can be seen as an extension. The first is an algorithm for heavy-hitters from Misra-Gries and its improved analysis by Berinde et al. . We re-prove the relevant part of these results in perhaps a simpler way. Next we describe the algorithm of Liberty for low-rank matrix approximation that our analysis is based on. We again re-prove his result, with a few additional intermediate results we will need for our extended analysis. One familiar with the work of Misra-Gries , Berinde et al. , and Liberty can skip this section, although we will refer to some lemmas re-proven below.
Let be a set of elements where each . Let for . Assume without loss of generality that for all , and define . This is just for notation, and not known ahead of time by algorithms.
If matches a label, increment the associated counter.
If not, and there is an empty counter, change the label of the counter to and set its counter to .
Otherwise, if there are no empty counters, then decrement all counters by .
To return , if there is a label with , then return the associated counter; otherwise return .
Define and let . The value represents the total counts that cannot be described (even optimally) if we only use counters. A bound on in terms of is more interesting than one in terms of , since this algorithm is only useful when there are only really items that matter and the rest can be ignored. We next reprove a result of Berinde et al. (in their Appendix A).
This result can be viewed as a warm up for the rank matrix approximation to come, as those techniques will follow a very similar strategy.
2 Additive Error Frequent Directions
Recently Liberty discovered how to apply this technique towards sketching matrices. Next we review his approach, and for perspective and completeness re-prove his main results.
Analysis.
Let be the th column of , then
Subtracting from both sides completes the proof. ∎
Notice that for all and that . By substituting this into inequality from Lemma 2.2, we get
Subtracting from both sides and summing over reveals
Subtracting from both sides proves the second inequality of the lemma.
To see the first inequality observe for all . Then we can expand
Algorithm 1 maintains for any unit vector that
New Relative Error Bounds for Frequent Directions
We now generalize the relative error type bounds for Misra-Gries (in Section 2.1) to the Frequent Directions algorithm (in Section 2.2).
This way is also rank (and size ), and will have nice approximation properties to . Recall that where and , , are the first columns of these matrices, representing the first principal directions. Let be the right singular vectors of .
Now we can show that projecting onto provides a relative error approximation.
Using the vectors as right singular vectors of , and letting , then we have
We would also like to relate the Frobenius norm of directly to that of , instead of projecting onto it (which cannot be done in a streaming setting, at least not in space). However does not make sense since has a different number of rows than . However, we can decompose since is a projection of onto a (the best rank ) subspace, and we can use the Pythagorean Theorem. Now we can compare to .
.
One may ask why not compare to directly, instead of subtracting from . First note that the above bound does guarantee that . Second, in situations where a rank approximation is interesting, then most of the mass from should be in its top components. Then so the above bound is actually tighter. To demonstrate this we can state the following conditional statement comparing and .
If , then
The second inequality follows from Lemma 3.3, by subtracting . The first inequality uses Lemma 2.3.
Finally, we summarize all of our bounds about Algorithm 1.
and the projection of along its top right singular values is a matrix which satisfies
No Sparse Frequent Directions
In this section we show that this is not possible by extending the frequent directions algorithm.
To make this process work we need the following requirements. Let represent the smallest amount of squared norm resulting from removing one row from by the procedure above. Define as the orthogonal projection of onto . It projects each row of onto the subspace orthogonal to the basis of . It can be interpreted as . Also, we let be the matrix after removing the th row. Assume we remove this row, although removing any row is just as difficult but would create even more error.
Hard construction.
In order to satisfy (P2) we consider the vector as defined above. We can observe
Acknowledgements
We thank Edo Liberty for encouragement and helpful comments, including pointing out several mistakes in an earlier version of this paper. And thank David P. Woodruff, Christos Boutsidis, Dan Feldman, and Christian Sohler for helping interpret some results.
References
Appendix A Tables of Previous Bounds
In this appendix we try to survey the landscape of work in low-rank matrix approximation. There are many types of bounds, models of construction, and algorithms. We tried to group them into three main categories: Streaming, Fast Runtime, and Column Sampling. We also tried to write bounds in a consistent compatible format. To do so, some parts needed to be slightly simplified – hopefully we got everything correct. The authors will be glad to know if we missed or misrepresented any results.
The size is sometimes measured in terms of the number of columns (#C) and/or the number of rows (#R). Otherwise, if #R or #C is not specified the space refers the number of words in the RAM model where it is assumed bits fit in a single word.
A projection result builds a subspace so that , but does not actually construct . This is denoted by P. Ideally . When that is not the case, then it is denoted where is replaced by the rank of .
A construction result builds a series of (usually 3) matrices (say , , and ) where . Note again, it does not construct since it may be of larger size than all of , , and together, but the three matrices can be used in place of . This is denoted C.
-relative error is of the form where is the best rank- approximation to . This is denoted R.
-additive error is of the form . This is denoted A. This can sometimes also be expressed as a spectral norm of the form (note the error term still has a Frobenius norm). This is denoted L2.
In a few cases the error does not follow these patterns and we specially denote it.
Algorithms are randomized unless it is specified. In all tables we state bounds for a constant probability of failure. If we want to decrease the probability of failure to some parameter , we can generally increase the size and runtime by .