Simple and Deterministic Matrix Sketching

Edo Liberty

Introduction

Efficiently obtaining compact approximations, or sketches, of large matrices is a very common and important problem. It is essential for obtaining approximate matrix Singular Value Decompositions (SVD) or low rank approximations of large matrices. It is used in large scale data mining, Latent Semantic Indexing (LSI), $k$ -means clustering, Principal Component Analysis (PCA), Linear regression, solving large linear systems, matrix preconditioning, and many other important and commonly performed tasks.

Moreover, modern large data sets are often viewed as large matrices. For example, textual data in the bag-of-words model is stored such that each row in the data matrix corresponds to one document and non zero entries correspond to words in the documents. Such matrices are often extremely large and distributed across many machines. Sketching methods, therefore, are designed to be pass efficient which means the data is read at most a constant number of times. If only one pass is required the computational model is also referred to as the streaming model. The streaming model is especially attractive since the analysis can be performed immediately when the data is collected. In that case its storage can be eliminated altogether.

There are three main matrix sketching approaches which are presented here in an arbitrary order. The first generates a sparser version of the matrix. Sparser matrices are stored more efficiently and can be multiplied by other matrices faster . The second approach is to randomly combine matrix rows . These methods rely on properties of random low dimensional subspaces and strong concentration of measure phenomena. The third is to find a small subset of matrix rows (or columns) which approximate the entire matrix. This problem is known as the Column Subset Selection Problem and has been thoroughly investigated . Recent results obtain algorithms with almost matching lower bounds . Alas, It is not immediately clear how to compare these methods’ results to ours since their objectives are different. They aim to recover a low rank matrix whose column space contains most of space spanned by the matrix top $k$ singular vectors. Moreover, most of the above algorithms require several passes over the input matrix. A simple streaming solution to the Column Subset Selection problem is obtained by sampling columns (rows in this case) from the input matrix. The rows are sampled with probability proportional to their squared norm. Despite this algorithm’s apparent simplicity, providing tight bounds for its performance required over a decade of research .

This manuscript proposes a forth approach which draws on the matrix sketching problem’s similarity to the item frequency estimation problem. In what follows, we shortly describe the item frequency approximation problem and a well known solution for it which our proposed algorithm will be based on.

In the item frequency approximation problem there is a universe of $m$ items $a_{1},\ldots,a_{m}$ and a stream $A_{1},\dots,A_{n}$ of item appearances. The frequency of an item is the number of times it appears in the stream. It is trivial to produce these counts using $O(m)$ space simply by keeping a counter for each item. Our goal is to use $o(m)$ space and produce approximate frequencies $g_{j}$ such that $|f_{j}-g_{j}|\leq\varepsilon n$ for all $j$ and prescribed precision $\varepsilon$ .

The algorithm

Let $B$ be the result of applying Algorithm 1 to matrix $A$ with prescribed precision parameter $\varepsilon$ then:

We begin by obtaining the value of $\sum_{i=1}^{n}\delta_{i}$ by computing the Frobenius norm of $B$ . Let $B^{i}$ , $C^{i}$ and $V^{i}$ be the values of $B$ , $C$ and $V$ after the main loop in the algorithm is executed $i$ times. For example, $B^{0}$ is an all zeros matrix and $B^{n}$ is the returned sketch matrix.

2 Parallelization and sketching sketches

A convenient property of this sketching technique is that it allows for combining sketches. In other words, let $A_{1}$ and $A_{2}$ denote two halves of a larger matrix $A$ . Also, let $B_{1}$ and $B_{2}$ be the sketches computed by the above technique for $A_{1}$ and $A_{2}$ respectively. Now let the final sketch, $C$ , be the sketch of a matrix $B$ which contains both $B_{1}$ and $B_{2}$ . It still holds that $\|A^{T}A-C^{T}C\|\leq\varepsilon\cdot tr(A^{T}A-C^{T}C)$ . To see this we compute $\|Cx\|^{2}$ for a test vector $\|x\|=1$ .

Here we use the fact that $\|B_{1}x\|^{2}\geq\|A_{1}x\|^{2}-\varepsilon(\|A_{1}\|_{f}^{2}-\|B_{1}\|_{f}^{2})$ for $\|x\|=1$ which is a consequence of the derivation above. This property is especially useful when the matrix (or data) is distributed across many machines which is often the case in modern large scale data.

3 Connection to matrix low rank approximation

Experiments

Discussion

This paper draws upon a surprising similarity between two problems, the item frequency approximation problem and the matrix sketching problem. It seems that, in general, solutions to the first can be modified to solve the second but incur an additional factor of $m$ in both running time and space requirement. This is true, for example, about sampling. It is also the case for the memory footprint of Frequent-Items which is $O(1/\varepsilon)$ while for Frequent-Directions it is $O(m/\varepsilon)$ . But, the update time of Frequent-Items is $O(1)$ and that of Frequent-Directions is $O(m/\varepsilon)$ . It is natural to seek a modified algorithm which exhibits an $O(m)$ update time. Another question is whether more advanced algorithms for fining frequent items in streams could also be carried over. A good candidate is the Count Sketch algorithm . Alas, it depends on item hashing in a way which does not naturally translate to the matrix sketching domain.

Acknowledgments: The author truly thanks Petros Drineas, Jelani Nelson, Nir Ailon, Zohar Karnin, and Yoel Shkolnisky for very helpful discussions and pointers.