OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings
Jelani Nelson, Huy L. Nguyen
Introduction
There has been much recent work on applications of dimensionality reduction to handling large datasets. Typically special features of the data such as low “intrinsic” dimensionality, or sparsity, are exploited to reduce the volume of data before processing, thus speeding up analysis time. One success story of this approach is the applications of fast algorithms for the Johnson-Lindenstrauss lemma [JL84], which allows one to reduce the dimension of a set of vectors while preserving all pairwise distances. There have been two popular lines of work in this area: one focusing on fast embeddings for all vectors [AC09, AL09, AL11, HV11, KMR12, KW11, Vyb11], and one focusing on fast embeddings specifically for sparse vectors [Ach03, BOR10, DKS10, KN10, KN12].
In this work we focus on the problem of constructing an oblivious subspace embedding (OSE) [Sar06] and on applications of these embeddings. Roughly speaking, the problem is to design a data-independent distribution over linear mappings such that when data come from an unknown low-dimensional subspace, they are reduced to roughly their true dimension while their structure (all distances in the subspace in this case) is preserved at the same time. It can be seen as a continuation of the approach based on the Johnson-Lindenstrauss lemma to subspaces. Here we focus on the setting of sparse inputs, where it is important that the algorithms take time proportional to the input sparsity. These embeddings have found applications in numerical linear algebra problems such as least squares regression, low rank approximation, and approximating leverage scores [CW09, CW12, DMIMW12, NDT09, Sar06, Tro11]. We refer the interested reader to the surveys [HMT11, Mah11] for an overview of this area.
Here are given parameters of the problem and we would like as small as possible.
The running time of the above scheme to compute seems almost linear, and thus nearly optimal, since the input size is already to describe . While this is true for dense , in many practical instances one often expects the input matrix to be sparse, in which case linear time in the input description actually means , where is the number of non-zero entries. For example consider the case of being the Netflix matrix, where is user ’s score for movie : is very sparse since most users do not watch, let alone score, most movies [ZWSP08].
In the following paragraph we let be the space required to store implicitly (e.g. store the seed to some hash function that specifies ). We let be the running time required by an algorithm which, given a column index and the length- seed specifying , returns the list of all non-zeroes in that column in .
1 Problem Statements and Bounds
We now formally define all numerical linear algebra problems we consider. Plugging our new OSE’s into previous algorithms for the above problems yields the bounds in Figure 1; the value used in bounds denotes .
Least Squares Regression:
Low Rank Approximation:
2 Our Construction and Techniques
Bounding this latter quantity is a simple calculation and fits in under a page (Theorem 3).
The two constructions with smaller are the sparse Johnson-Lindenstrauss matrices of [KN12]. In particular, the only properties we need from our OSE in our analyses are the following. Let each matrix in the support of the OSE have entries in . For a randomly drawn , let be an indicator random variable for the event , and write , where the are random signs. Then the properties we need are
For any , with probability .
The second property says the are negatively correlated. We call any matrix drawn from an OSE with the above properties an oblivious sparse norm-approximating projection (OSNAP).
The work of [KN12] gave two OSNAP distributions, either of which suffice for our current OSE problem. In the first construction, each column is chosen to have exactly non-zero entries in random locations, each equal to uniformly at random. For our purposes the signs need only be -wise independent, and each column can be specified by a -wise independent permutation, and the seeds specifying the permutations in different columns need only be -wise independent. In the second construction we pick hash functions , , both -wise independent, and thus each representable using random bits. For each we set , and all other entries in are set to zero. Note also that the TZ sketch is itself an OSNAP with .
Our approach follows the classical moment method in random matrix theory; see [Tao12, Section 2] or [Ver12] introductions to this area. In particular, our approach is inspired by one taken by Bai and Yin [BY93], who in our notation were concerned with the case , , dense. Most of the complications in our proof arise because is not the identity matrix, so that rows of are not orthogonal. For example, in the case of having orthogonal rows all graphs in the last paragraph have no edges other than self-loops and are trivial to analyze.
Analysis
Let be Hermitian matrices where has eigenvalues , has eigenvalues , and has eigenvalues . Then , it holds that .
Let . Letting be the identity matrix, Weyl’s inequality with , , and implies that all the eigenvalues of lie in the range , where (resp. ) is the smallest (resp. largest) eigenvalue of . Since , it thus suffices to show
since implies that all eigenvalues of lie in .
Before proceeding with our proofs below, observe that for all
Noting and for , we have for all
For an OSNAP with and , with probability at least all singular values of are as long as , is -wise independent, and is pairwise independent.
Noting we have that , so
Before proving the next theorem, it is helpful to state a few facts that we will repeatedly use. Recall that denotes the th column of , and we will let denote the th row of .
and this inner product is for and otherwise.
A multigraph has edge-disjoint spanning trees iff
for every partition of the vertex set of , where is the set of edges of crossing between two different partitions in .
The following corollary is standard, and we will later only need it for the case .
Let be a multigraph formed by removing at most edges from a multigraph that has edge-connectivity at least . Then must have at least edge-disjoint spanning trees.
Proof. For any partition of the vertex set, each partition must have at least edges leaving it in . Thus the number of edges crossing partitions must be at least in , and thus at least in . Theorem 6 thus implies that has edge-disjoint spanning trees.
Proof. We have since . To show that unit norm exist which achieve , let be the singular value decomposition of . That is, are unitary and is diagonal with entries so that . We can then achieve by letting be the first column of and be the first column of .
For an OSNAP with and , with probability at least , all singular values of are as long as and are -wise independent.
Proof. We will again show Eq. (2). Recall that by Eq. (1) we have
Now we make some observations. Due to the random signs , a monomial has expectation zero unless every bond in has even multiplicity, in which case the product of random signs in is . Also, note the expectation of the product of the terms in is at most by OSNAP properties. Thus letting be the set of all such graphs with even bond multiplicity in that arise from some monomial appearing in Eq. (6), we have
Before continuing further it will be convenient to introduce a notion we will use in our analysis called a generalized dot product multigraph. Such a graph is just as in the case of a dot product multigraph, except that each edge is associated with some matrix . We call the edge-matrix of . Also since is undirected, we can think of an edge with edge-matrix also as an edge , in which case we say its associated edge-matrix is . We then associate with the product
Note that a dot product multigraph is simply a generalized dot product multigraph in which for all . Also, in such a generalized dot product multigraph, we treat multiedges as representing the same bond iff the associated edge-matrices are also equal (in general multiedges may have different edge-matrices).
Let be a connected generalized dot product multigraph on vertex set with and where every bond has even multiplicity. Also suppose that for all , . Define
where for , and equals some fixed vector with . Then .
Proof. Let be some permutation of . For a bond , let denote the multiplicity of in . Then by ordering the assignments of the in the summation
according to , we obtain the exactly equal expression
Here we have taken the product over as opposed to since there may be self-loops. By Lemma 5 and the fact that we have that for any , , so we obtain an upper bound on Eq. (8) by replacing each term with . We can thus obtain the sum
which upper bounds Eq. (8). Now note for that for any nonnegative integer and for non-empty (note the strict inequality ),
where Eq. (10) used Lemma 5, Eq. (11) used Lemma 4, and Eq. (12) used that . Now consider processing the alternating sum-product in Eq. (9) from right to left. We say that a bond is assigned to if . When arriving at the th sum-product and using the upper bound Eq. (11) on the previous sum-products, we will have a sum over raised to some nonnegative power (specifically the number of bonds incident upon but not assigned to , plus one if has a self-loop) multiplied by a product of over all bonds assigned to . There are two cases. In the first case has no bonds assigned to it. We will ignore this case since we will show that we can choose to avoid it.
The other case is that has at least one bond assigned to it. In this case we are in the scenario of Eq. (11) and thus summing over yields a non-empty product of for the for which is a bond assigned to . Thus in our final sum, as long as we choose to avoid the first case, we are left with an upper bound of raised to some power equal to the edge-degree of vertex in , which is at least . The lemma would then follow since for .
It now remains to show that we can choose to avoid the first case where some is such that has no bonds assigned to it. Let be a spanning tree in rooted at vertex . We then choose any with the property that for any , is not an ancestor of in . This can be achieved, for example, by assigning values in reverse breadth first search order.
Let be any dot product graph as in Eq. (7). Then
Proof. We first note that we have the inequality
We can view the sum over on the right hand side of the above as creating new dot product multigraphs, each with one fewer vertex where we eliminated vertex and associated it with vertex for some , and for each edge we effectively replaced it with . Also in first sum where we sum over all values of , we have eliminated the constraints for . By recursively applying this inequality to each of the resulting summations, we bound
by a sum of contributions from dot product multigraphs where in none of these multigraphs do we have the constraint that for . We will show that each one of these resulting multigraphs contributes at most , from which the lemma follows.
Let be one of the dot product multigraphs at a leaf of the above recursion so that we now wish to bound
where for all for . Before proceeding, we first claim that every connected component of is Eulerian. To see this, observe has an Eulerian tour, by following the edges of in increasing order of label, and thus all middle vertices have even edge-degree in . However they also have even edge-degree in , and thus the edge-degree of a middle vertex in must be even as well. Thus, every vertex in has even edge-degree, and thus every vertex in each of the recursively created leaf graphs also has even edge-degree since at every step when we eliminate a vertex, some other vertex’s degree increases by the eliminated vertex’s degree which was even. Thus every connected component of is Eulerian as desired.
We now upper bound . Let the connected components of be , where counts connected components. An observation we repeatedly use later is that for any generalized dot product multigraph with components ,
We treat as a generalized dot product multigraph so that each edge has an associated matrix (though in fact for all ). Define an undirected multigraph to be good if all its connected components have two edge-disjoint spanning trees. We will show that for some generalized dot product multigraph that is good then will show . If itself is good then we can set . Otherwise, we will show for smaller and smaller generalized dot product multigraphs (i.e. with successively fewer vertices) whilst maintaining the invariant that each has Eulerian connected components and has for all . We stop when some is good and we can set .
Let us now focus on constructing this sequence of in the case that is not good. Let . Suppose we have constructed for none of which are good, and now we want to construct . Since is not good it cannot be -edge-connected by Corollary 7, so there is some connected component of with some cut with edges crossing the cut (note that since is Eulerian, any cut has an even number of edges crossing it). Choose such an with minimum amongst all such cuts. Let the two edges crossing the cut be with (note that it may be the case that and/or ). Note that equals the magnitude of
We define to be but where in the th component we replace with and add an additional edge from to which we assign edge-matrix . We thus have that by Eq. (14). Furthermore each component of is still Eulerian since every vertex in has either been eliminated, or its edge-degree has been preserved and thus all edge-degrees are even. It remains to show that .
We first claim that has two edge-disjoint spanning trees. Define to be the graph with an edge from to added. We show that is -edge-connected so that has two edge-disjoint spanning trees by Corollary 7. Now to see this, consider some . Consider the cut . is Eulerian, so the number of edges crossing this cut is either or at least . If it , then since this is a contradiction since was chosen amongst such cuts to have minimum. Thus it is at least , and we claim that the number of edges crossing the cut in must also be at least . If not, then it is since is Eulerian. However since the number of edges leaving in is at least , it must then be that . But then the cut has edges crossing it so that is a smaller cut than with edges leaving it in , violating the minimality of , a contradiction. Thus is -edge-connected, implying has two edge-disjoint spanning trees as desired.
Now to show , by Fact 8 we have . We have that
where Eq. (16) used the AM-GM inequality, and Eq. (17) used Lemma 10 (note the graph with vertex set and edge set is connected since ). Thus we have shown that satisfies the desired properties. Now notice that the sequence must eventually terminate since the number of vertices is strictly decreasing in this sequence and any Eulerian graph on vertices is good. Therefore we have that is eventually good for some and we can set .
It remains to show that for our final good we have . We will show this in two parts by showing that both and . For the first claim, note that since every has the same number of connected components as , and . This latter inequality holds since in each level of recursion used to eventually obtain from , we repeatedly identified two vertices as equal and merged them, which can only decrease the number of connected components. Now, all middle vertices in lie in one connected component (since is connected) and has connected components. Thus the at least edges connecting these components in must come from , implying that (and thus ) has at most connected components, which thus must also be true for as argued above.
It only remains to show . Let have connected components with each having edge-disjoint spanning trees . We then have
where Eq. (18) used the AM-GM inequality, and Eq. (19) used Lemma 10, which applies since with edge set is connected, and with edge set is connected (since ).
Let be arbitrary constants. For an OSNAP with and , with probability at least , all singular values of are for and being -wise independent. The constants in the big- and big- depend on .
Applications
We use the fact that many matrix problems have the same time complexity as matrix multiplication including computing the matrix inverse [BH74][Har08, Appendix A], and QR decomposition [Sch73]. In this paper we only consider the real RAM model and state the running time in terms of the number of field operations. The algorithms for solving linear systems, computing inverse, QR decomposition, and approximating SVD based on fast matrix multiplication can be implemented with precision comparable to that of conventional algorithms to achieve the same error bound (with a suitable notion of approximation/stability). We refer readers to [DDH07] for details. Notice that it is possible that both algorithms based on fast matrix multiplication and conventional counterparts are unstable, see e.g. [AV97] for an example of a pathological matrix with very high condition number.
In this section we describe some applications of our subspace embeddings to problems in numerical linear algebra. All applications follow from a straightforward replacement of previously used embeddings with our new ones as most proofs go through verbatim. In the statement of our bounds we implicitly assume , since otherwise fully zero rows of can be ignored without affecting the problem solution.
This section describes the application of our subspace embedding from Theorem 9 or Theorem 12 to approximating the leverage scores. Consider a matrix of size and rank . Let be a matrix whose columns form an orthonormal basis of the column space of . The leverage scores of are the squared lengths of the rows of . The algorithm for approximating the leverage scores and the analysis are the same as those of [CW12], which itself uses essentially the same algorithm outline as Algorithm 1 of [DMIMW12]. The improved bound is stated below (cf. [CW12, Theorem 21]).
2 Least Squares Regression
There is an algorithm for least squares regression running in time and succeeding with probability at least .
Proof. Applying Theorem 3 to the subspace spanned by columns of and , we get a distribution over matrices of size such that preserves lengths of vectors in the subspace up to a factor with probability at least . Thus, we only need to find . Note that has size . By Theorem 12 of [Sar06], there is an algorithm that with probability at least , finds a approximate solution for least squares regression for the smaller input of and and runs in time .
The following theorem follows from using the embedding of Theorem 9 and the same argument as [CW12, Theorem 32].
Let be the rank of . There is an algorithm for least squares regression running in time and succeeding with probability at least .
3 Low Rank Approximation
In this section, we describe the application of our subspace embeddings to low rank approximation. Here given a matrix , one wants to find a rank matrix minimizing . Let be the minimum over all rank matrices . Notice that our matrices are of the same form as sparse JL matrices considered by [KN12] so the following property holds for matrices constructed in Theorem 9 (cf. [CW12, Lemma 24]).
[KN12, Theorem 19] Fix . Let be the distribution over matrices given in Theorem 9 with columns. For any matrices with rows,
The matrices of Theorem 3 are the same as those of [CW12] so the above property holds for them as well. Therefore, the same algorithm and analysis as in [CW12] work. We state the improved bounds using the embedding of Theorem 3 and Theorem 9 below (cf. [CW12, Theorem 36 and 38]).
Given a matrix of size , there are 2 algorithms that, with probability at least , find 3 matrices where is of size , is of size , is of size , , is a diagonal matrix, and
Proof. The proof is essentially the same as that of [CW12] so we only mention the difference. We use 2 bounds for the running time: multiplying an matrix and a matrix with takes time (simply dividing the matrices into blocks), and approximating SVD for an matrix with takes time (time to compute , approximate SVD of in time [DDH07], and compute to complete the SVD of ).
Acknowledgments
We thank Andrew Drucker for suggesting the SNAP acronym for the OSE’s considered in this work, to which we added the “oblivious” descriptor.