Sparser Johnson-Lindenstrauss Transforms
Daniel M. Kane, Jelani Nelson
Introduction
Proofs of the JL lemma can be found in . The value of in the JL lemma is optimal (also see a later proof in ).
The JL lemma is a key ingredient in the JL flattening theorem, which states that any points in Euclidean space can be embedded into dimensions so that all pairwise Euclidean distances are preserved up to . The JL lemma is a useful tool for speeding up solutions to several high-dimensional problems: closest pair, nearest neighbor, diameter, minimum spanning tree, etc. It also speeds up some clustering and string processing algorithms, and can further be used to reduce the amount of storage required to store a dataset, e.g. in streaming algorithms. Recently it has also found applications in approximate numerical algebra problems such as linear regression and low-rank approximation . See for further discussions on applications.
Standard proofs of the JL lemma take a distribution over dense matrices (e.g. i.i.d. Gaussian or Bernoulli entries), and thus performing the embedding naïvely takes time where has non-zero entries. Several works have devised other distributions which give faster embedding times , but all these methods require embedding time even for sparse vectors (even when ). This feature is particularly unfortunate in streaming applications, where a vector receives coordinate-wise updates of the form , so that to maintain some linear embedding of we should repeatedly calculate during updates. Since , even the naïve embedding time method is faster than these approaches.
Even aside from streaming applications, several practical situations give rise to vectors with . For example, a common similarity measure for comparing text documents in data mining and information retrieval is cosine similarity , which is approximately preserved under any JL embedding. Here, a document is represented as a bag of words with the dimensionality being the size of the lexicon, and we usually would not expect any single document to contain anywhere near distinct words (i.e., we expect sparse vectors). In networking applications, if counts bytes sent from source to destination in some time interval, then is the total number of IP pairs, whereas we would not expect most pairs of IPs to communicate with each other. In linear algebra applications, a rating matrix may for example have as user ’s score for item (e.g. the Netflix matrix where columns correspond to movies), and we would expect that most users rate only small fraction of all available items.
It is also worth nothing that after the preliminary version of this work was published in , it was shown in that our bound is optimal up to an factor. That is, for any fixed constant , any distribution satisfying Lemma 1 that is supported on matrices with and at most non-zero entries per column must have as long as . Note that once one can always take the distribution supported solely on the identity matrix, giving and satisfying Lemma 1 with .
1 Our Approach
Our constructions are depicted in Figure 1. Figure 1(a) represents the DKS construction of in which each item is hashed to random target coordinates with replacement. Our two schemes achieving are as follows. Construction (b) is much like (a) except that we hash coordinates times without replacement; we call this the graph construction, since hash locations are specified by a bipartite graph with left vertices, right vertices, and left-degree . In (c), the target vector is divided into contiguous blocks each of equal size , and a given coordinate in the original vector is hashed to a random location in each block (essentially this is the CountSketch of , though we use a higher degree of independence in our hash functions); we call this the block construction. In all cases (a), (b), and (c), we randomly flip the sign of a coordinate in the original vector and divide by before adding it in any location in the target vector.
We give two different analyses for both our constructions (b) and (c). Since we consider linear embeddings, without loss of generality we can assume , in which case the JL lemma follows by showing that , which is implied by . Thus it suffices to show that for any unit norm ,
We furthermore observe that both our graph and block constructions have the property that the entries of our embedding matrix can be written as
where the are independent and uniform in , and is an indicator random variable for the event (in fact in our analyses we will only need that the are -wise independent). Note that the are not independent, since in both constructions we have that there are exactly non-zero entries per column. Furthermore in the block construction, knowing that for in some block implies that for all other in the same block.
To outline our analyses, look at the random variable
In our second analysis approach, we define
We point out here that Figure 1(c) is somewhat simpler to implement, since there are simple constructions of -wise hash families . Figure 1(b) on the other hand requires hashing without replacement, which amounts to using random permutations and can be derandomized using almost -wise independent permutation families (see Remark 14).
Conventions and Notation
Code-Based Constructions
In this section, we provide analyses of our constructions (b) and (c) in Figure 1 when the non-zero entry locations are deterministic but satisfy a certain condition. In particular, in the analysis in this section we assume that for any ,
That is, no two columns have their non-zero entries in more than of the same rows. We show how to use error-correcting codes to ensure Eq. (6) in Remark 8 for the block construction, and in Remark 9 for the graph construction. Unfortunately this step will require setting to be slightly larger than the desired . We give an alternate analysis in Section 4 which avoids assuming Eq. (6) and obtains an improved bound for by not using deterministic .
We prove our construction satisfies the JL lemma by applying Theorem 4 with .
where the first inequality used Eq. (6).
By Eq. (1), it now suffices to prove the following theorem.
We now discuss how to choose the non-zero locations in to ensure Eq. (6).
It is also possible to use a code to specify the hash locations in the graph construction. In particular, let the th entry of the th column of the embedding matrix be the th symbol of the th codeword (which we call ) in a weight- binary code of minimum distance for . Define for as an indicator variable for . Then, the error is again exactly as in Eq. (3). Also, as in Remark 8, such a code can be shown to exist via the probabilistic method (the Chernoff bound can be applied using negative dependence, followed by a union bound) as long as . We omit the details since Section 4 obtains better parameters.
Only using Eq. (6), it is impossible to improve our sparsity bound further. For example, consider an instantiation of the block construction in which Eq. (6) is satisfied. Create a new set of which change only in the case so that for all , so that Eq. (6) still holds. In our construction this corresponds to all indices colliding in the first chunk of coordinates, which creates an error term of . Now, suppose consists of entries each with value . Then, with probability , all these entries receive the same sign under and contribute a total error of in the first chunk alone. We thus need , which implies .
Random Hashing Constructions
In this section, we show that if the hash functions described in Remark 8 and Remark 9 are not specified by fixed codes, but rather are chosen at random from some family of sufficiently high independence, then one can achieve sparsity (in the case of Figure 1(b), we actually need almost k-wise independent permutations). Recall our bottleneck in reducing the sparsity in Section 3 was actually obtaining the codes, discussed in Remark 8 and Remark 9.
Define as the set of directed multigraphs with edges having distinct labels in and no self-loops, with between and vertices (inclusive), and where every vertex has non-zero and even degree (we use degree to denote the sum of in- and out-degrees). Let map variable sequences to their corresponding graph. That is, we draw a directed edge labeled from the vertex representing to that representing for , where one vertex represents all the which are assigned the same element of (see Figure 2). For a graph , let be its number of vertices, and let be the degree of vertex . By construction every monomial maps to a graph with edges. Also we need only consider graphs with all even vertex degrees since a monomial whose graph has at least one vertex with odd degree will have at least one random sign appearing an odd number of times and thus have expectation zero. Then,
where is the set of all directed multigraphs as in , but in which vertices are labeled as well, with distinct labels in (see Figure 2; the vertex labels can be arbitrarily permuted).
Eq. (10) used that are independent for any . For Eq. (11), note that , and the coefficient of in its expansion for is . Meanwhile, the coefficient of this monomial when summing over all for a particular is at most . For Eq. (12), we move from graphs in to those in , and for any there are exactly ways to label vertices. This is because for any graph there is a canonical way of labeling the vertices as since there are no isolated vertices. Namely, the vertices can be labeled in increasing order of when they are first visited by an edge when processing edges in order of increasing label (if two vertices are both visited for the first time simultaneously by some edge, then we can break ties consistently using the direction of the edge). Thus the vertices are all identified by this canonical labeling, implying that the vertex labelings all give distinct graphs in . Eq. (13) follows since and
The summation over in Eq. (13) is over the with vertices. Let us bound this summation for some fixed choice of vertex degrees . For any given , consider the set of all graphs on labeled vertices with distinct labels in , and with edges with distinct labels in (that is, we do not require even edge degrees, and some vertices may even have degree ). For a graph , let represent the degree of vertex in . For define the function
Let be those graphs with vertices such that vertex has degree . Then
since . To upper bound , note . For , note any graph in can be formed by taking a graph and adding an edge labeled from to for some vertices in . This change causes to both increase by , whereas all other degrees stay the same. Thus considering Eq. (14),
with the last inequality using Cauchy-Schwarz. Thus by induction, . Since , we have . We then have that the summation in Eq. (13) is at most the number of choices of even summing to (there are such choices), times , implying
By differentiation, the quantity is maximized for (recall ), giving our lemma.
Proof. We use Lemma 11. In the case we can multiply the term by and still obtain an upper bound, and in the case of larger we have since . Also when we have , so that .
It is worth noting that if one wants distortion with probability simultaneously for all in some set , our proof of Theorem 13 reveals that it suffices to set and .
Tightness of analyses
The main theorem of this section is the following.
The DKS construction of requires sparsity to achieve distortion with success probability .
Our proof will use the following standard fact.
which is much larger than for smaller than some constant. Now, given a collision, the colliding items have the same sign with probability .
which is much larger than for smaller than some constant. Now, given a collision, the colliding items have the same sign with probability .
We lastly consider the case for some constant (depending on ) to be determined later. First note this case only exists when . Define . Suppose there exists an integer so that
First we show it is possible to satisfy the above conditions simultaneously for our range of . We set , satisfying item 1 trivially, and item 2 since . For item 3, Fact 17 gives
The term is at least by the settings of , and the term is also at least for sufficiently small.
2 Tightness of Figure 1(b) analysis
For smaller than a constant depending on for , the graph construction of Section 4 requires to obtain distortion with probability .
Proof. First suppose . We consider a vector with non-zero coordinates each of value . If there is exactly one set with such that are both non-zero for the embedding matrix (i.e., there is exactly one collision), then the total error is . It just remains to show that this happens with probability larger than . The probability of this occurring is
Now consider the case for some small constant . Consider the vector . Suppose there are exactly collisions, i.e. distinct values of such that are both non-zero (to avoid tedium we disregard floors and ceilings and just assume is an integer). Also, suppose that in each colliding row we have . Then, the total error would be . It just remains to show that this happens with probability larger than . The probability of signs agreeing in exactly chunks is , which is larger than for . The probability of exactly collisions is
It suffices for the right hand side to be at least since is independent of , and thus the total probability of error larger than would be greater than . Taking natural logarithms, it suffices to have
Writing and , the left hand side is . Taking a derivative shows is monotonically increasing for . Thus as long as for a sufficiently small constant , . Also, the term is at most for sufficiently small.
3 Tightness of Figure 1(c) analysis
For smaller than a constant depending on for , the block construction of Section 4 requires to obtain distortion with probability .
Proof. First suppose . Consider a vector with non-zero coordinates each of value . If there is exactly one set with such that (i.e. exactly one collision), then the total error is . It just remains to show that this happens with probability larger than .
The probability of exactly one collision is
which is larger than for smaller than a universal constant.
Now consider for some small constant . Consider the vector . Suppose there are exactly collisions, i.e. distinct values of such that (to avoid tedium we disregard floors and ceilings and just assume is an integer). Also, suppose that in each colliding chunk we have . Then, the total error would be . It just remains to show that this happens with probability larger than . The probability of signs agreeing in exactly chunks is , which is larger than for . The probability of exactly collisions is
The above is at most , by the analysis following Eq. (22). Since is independent of , the total probability of having error larger than is greater than .
Faster numerical linear algebra streaming algorithms
Now, the following theorem is a generalization of [10, Theorem 2.1]. The theorem states that any distribution with JL moments also provides a sketch for approximate matrix products. A similar statement was made in [34, Lemma 6], but that statement was slightly weaker in its parameters because it resorted to a union bound, which we avoid by using Minkowski’s inequality.
Often when one constructs a JL distribution over matrices, it is shown that for all with and for all ,
Any such distribution automatically satisfies the -JL moment property for any by converting the tail bound into a moment bound via integration by parts.
Now we arrive at the main point of this section. Several algorithms for approximate linear regression and best rank- approximation in simply maintain as is updated, where comes from the JL distribution with -wise independent entries. In fact though, their analyses of their algorithms only use the fact that this distribution satisfies the approximate matrix product sketch guarantees of Theorem 21. Due to Theorem 21 though, we know that any distribution satisfying the -JL moment condition gives an approximate matrix product sketch. Thus, random Bernoulli matrices may be replaced with our sparse JL distributions in this work. We now state some of the algorithmic results given in and describe how our constructions provide improvements in the update time (the time to process new columns, rows, or turnstile updates).
As in , when stating our results we will ignore the space and time complexities of storing and evaluating the hash functions in our JL distributions. We discuss this issue later in Remark 26.
There is a one-pass streaming algorithm for linear regression in the turnstile model where one maintains a sketch of size . Processing each update requires arithmetic operations and hash function evaluations.
Theorem 24 improves the update complexity of , which was .
2 Low rank approximation
Theorem 4.4 of gives a -pass algorithm where in the first pass, one maintains where is drawn from a distribution that simultaneously satisfies both the and -JL moment properties for some fixed constant in their proof. It is also assumed that . The first pass is thus sped up again as in Theorem 24.
One-pass algorithm for column/row-wise updates:
Theorem 4.5 of gives a one-pass algorithm in the case that is seen either one whole column or row at a time. The algorithm maintains both and where is drawn from a distribution that simultaneously satisfies both the and -JL moment properties. This implies the following.
There is a one-pass streaming algorithm for approximate low rank approximation with row/column-wise updates where one maintains a sketch of size . Processing each update requires amortized arithmetic operations and hash function evaluations per entry of .
Theorem 25 improves the amortized update complexity of , which was .
Three-pass algorithm for row-wise updates:
Theorem 4.6 of gives a three-pass algorithm using less space in the case that is seen one row at a time. Again, the first pass simply maintains where is drawn from a distribution that satisfies both the and -JL moment properties. This pass is sped up using our sparser JL distribution.
One-pass algorithm in the turnstile model, bi-criteria:
Theorem 4.7 of gives a one-pass algorithm under turnstile updates where and are maintained in the stream. is drawn from a distribution satisfying both the and -JL moment properties. is drawn from a distribution satisfying both the and -JL moment properties. Theorem 4.7 of then shows how to compute a matrix of rank which achieves the desired error guarantee given and .
One-pass algorithm in the turnstile model:
Theorem 4.9 of gives a one-pass algorithm under turnstile updates where and are maintained in the stream. is drawn from a distribution satisfying both the and -JL moment properties. is drawn from a distribution satisfying both the and -JL moment properties. Theorem 4.9 of then shows how to compute a matrix of rank which achieves the desired error guarantee given and .
Let be a ring, and let be a degree- polynomial. Then, given distinct , all the values can be computed using operations over .
Open Problems
In this section we state two explicit open problems. For the first, observe that our graph construction is quite similar to a sparse JL construction of Achlioptas . The work of proposes a random normalized sign matrix where each column has an expected number of non-zero entries, so that in the notation of this work, the are i.i.d. Bernoulli with expectation . Using this construction, was able to achieve without causing to increase over analyses of dense constructions, even by a constant factor. Meanwhile, our graph construction requires that there be exactly non-zero entries per column. This sole change was the reason we were able to obtain better asymptotic bounds on the sparsity of in this work, but in fact we conjecture an even stronger benefit than just asymptotic improvement. The first open problem is to resolve the following conjecture.
A positive resolution of this conjecture would imply that not only does our graph construction obtain better asymptotic performance than , but in fact obtains stronger performance in a very definitive sense.
Question: Can we obtain a tight lower bound of for distributional JL in the case that , thus removing the factor gap?