Simple and Deterministic Matrix Sketching

Edo Liberty

Introduction

Efficiently obtaining compact approximations, or sketches, of large matrices is a very common and important problem. It is essential for obtaining approximate matrix Singular Value Decompositions (SVD) or low rank approximations of large matrices. It is used in large scale data mining, Latent Semantic Indexing (LSI), kk-means clustering, Principal Component Analysis (PCA), Linear regression, solving large linear systems, matrix preconditioning, and many other important and commonly performed tasks.

Moreover, modern large data sets are often viewed as large matrices. For example, textual data in the bag-of-words model is stored such that each row in the data matrix corresponds to one document and non zero entries correspond to words in the documents. Such matrices are often extremely large and distributed across many machines. Sketching methods, therefore, are designed to be pass efficient which means the data is read at most a constant number of times. If only one pass is required the computational model is also referred to as the streaming model. The streaming model is especially attractive since the analysis can be performed immediately when the data is collected. In that case its storage can be eliminated altogether.

There are three main matrix sketching approaches which are presented here in an arbitrary order. The first generates a sparser version of the matrix. Sparser matrices are stored more efficiently and can be multiplied by other matrices faster . The second approach is to randomly combine matrix rows . These methods rely on properties of random low dimensional subspaces and strong concentration of measure phenomena. The third is to find a small subset of matrix rows (or columns) which approximate the entire matrix. This problem is known as the Column Subset Selection Problem and has been thoroughly investigated . Recent results obtain algorithms with almost matching lower bounds . Alas, It is not immediately clear how to compare these methods’ results to ours since their objectives are different. They aim to recover a low rank matrix whose column space contains most of space spanned by the matrix top kk singular vectors. Moreover, most of the above algorithms require several passes over the input matrix. A simple streaming solution to the Column Subset Selection problem is obtained by sampling columns (rows in this case) from the input matrix. The rows are sampled with probability proportional to their squared norm. Despite this algorithm’s apparent simplicity, providing tight bounds for its performance required over a decade of research .

This manuscript proposes a forth approach which draws on the matrix sketching problem’s similarity to the item frequency estimation problem. In what follows, we shortly describe the item frequency approximation problem and a well known solution for it which our proposed algorithm will be based on.

In the item frequency approximation problem there is a universe of mm items a1,,ama_{1},\ldots,a_{m} and a stream A1,,AnA_{1},\dots,A_{n} of item appearances. The frequency of an item is the number of times it appears in the stream. It is trivial to produce these counts using O(m)O(m) space simply by keeping a counter for each item. Our goal is to use o(m)o(m) space and produce approximate frequencies gjg_{j} such that fjgjεn|f_{j}-g_{j}|\leq\varepsilon n for all jj and prescribed precision ε\varepsilon.

The algorithm

Let BB be the result of applying Algorithm 1 to matrix AA with prescribed precision parameter ε\varepsilon then:

We begin by obtaining the value of i=1nδi\sum_{i=1}^{n}\delta_{i} by computing the Frobenius norm of BB. Let BiB^{i}, CiC^{i} and ViV^{i} be the values of BB, CC and VV after the main loop in the algorithm is executed ii times. For example, B0B^{0} is an all zeros matrix and BnB^{n} is the returned sketch matrix.

2 Parallelization and sketching sketches

A convenient property of this sketching technique is that it allows for combining sketches. In other words, let A1A_{1} and A2A_{2} denote two halves of a larger matrix AA. Also, let B1B_{1} and B2B_{2} be the sketches computed by the above technique for A1A_{1} and A2A_{2} respectively. Now let the final sketch, CC, be the sketch of a matrix BB which contains both B1B_{1} and B2B_{2}. It still holds that ATACTCεtr(ATACTC)\|A^{T}A-C^{T}C\|\leq\varepsilon\cdot tr(A^{T}A-C^{T}C). To see this we compute Cx2\|Cx\|^{2} for a test vector x=1\|x\|=1.

Here we use the fact that B1x2A1x2ε(A1f2B1f2)\|B_{1}x\|^{2}\geq\|A_{1}x\|^{2}-\varepsilon(\|A_{1}\|_{f}^{2}-\|B_{1}\|_{f}^{2}) for x=1\|x\|=1 which is a consequence of the derivation above. This property is especially useful when the matrix (or data) is distributed across many machines which is often the case in modern large scale data.

3 Connection to matrix low rank approximation

Experiments

Discussion

This paper draws upon a surprising similarity between two problems, the item frequency approximation problem and the matrix sketching problem. It seems that, in general, solutions to the first can be modified to solve the second but incur an additional factor of mm in both running time and space requirement. This is true, for example, about sampling. It is also the case for the memory footprint of Frequent-Items which is O(1/ε)O(1/\varepsilon) while for Frequent-Directions it is O(m/ε)O(m/\varepsilon). But, the update time of Frequent-Items is O(1)O(1) and that of Frequent-Directions is O(m/ε)O(m/\varepsilon). It is natural to seek a modified algorithm which exhibits an O(m)O(m) update time. Another question is whether more advanced algorithms for fining frequent items in streams could also be carried over. A good candidate is the Count Sketch algorithm . Alas, it depends on item hashing in a way which does not naturally translate to the matrix sketching domain.

Acknowledgments: The author truly thanks Petros Drineas, Jelani Nelson, Nir Ailon, Zohar Karnin, and Yoel Shkolnisky for very helpful discussions and pointers.

References