Sparser Johnson-Lindenstrauss Transforms

Daniel M. Kane, Jelani Nelson

Introduction

Proofs of the JL lemma can be found in . The value of $k$ in the JL lemma is optimal (also see a later proof in ).

The JL lemma is a key ingredient in the JL flattening theorem, which states that any $n$ points in Euclidean space can be embedded into $O(\varepsilon^{-2}\log n)$ dimensions so that all pairwise Euclidean distances are preserved up to $1\pm\varepsilon$ . The JL lemma is a useful tool for speeding up solutions to several high-dimensional problems: closest pair, nearest neighbor, diameter, minimum spanning tree, etc. It also speeds up some clustering and string processing algorithms, and can further be used to reduce the amount of storage required to store a dataset, e.g. in streaming algorithms. Recently it has also found applications in approximate numerical algebra problems such as linear regression and low-rank approximation . See for further discussions on applications.

Standard proofs of the JL lemma take a distribution over dense matrices (e.g. i.i.d. Gaussian or Bernoulli entries), and thus performing the embedding naïvely takes $O(k\cdot\|x\|_{0})$ time where $x$ has $\|x\|_{0}$ non-zero entries. Several works have devised other distributions which give faster embedding times , but all these methods require $\Omega(d\log d)$ embedding time even for sparse vectors (even when $\|x\|_{0}=1$ ). This feature is particularly unfortunate in streaming applications, where a vector $x$ receives coordinate-wise updates of the form $x\leftarrow x+v\cdot e_{i}$ , so that to maintain some linear embedding $Sx$ of $x$ we should repeatedly calculate $Se_{i}$ during updates. Since $\|e_{i}\|_{0}=1$ , even the naïve $O(k\cdot\|e_{i}\|_{0})$ embedding time method is faster than these approaches.
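The cost of a streaming update can be illustrated with a minimal sketch. It assumes that $S$ is stored column-wise with exactly $s$ non-zero entries per column; the helper names (`make_sparse_columns`, `apply_update`, `embed`) and the placeholder way the columns are sampled are illustrative choices, not part of the paper. Processing $x\leftarrow x+v\cdot e_{i}$ then touches only the $s$ non-zeros of $Se_{i}$.

```python
import random

def make_sparse_columns(d, k, s, seed=0):
    """Hypothetical helper: column i is a list of s (row, value) pairs."""
    rng = random.Random(seed)
    scale = 1.0 / s ** 0.5
    columns = []
    for _ in range(d):
        rows = rng.sample(range(k), s)  # s distinct rows per column
        columns.append([(r, rng.choice((-1.0, 1.0)) * scale) for r in rows])
    return columns

def apply_update(sketch, columns, i, v):
    """Process x <- x + v * e_i by adding v * (S e_i) to the sketch: O(s) work."""
    for r, value in columns[i]:
        sketch[r] += v * value

def embed(columns, k, x):
    """Compute S x from scratch, for comparison with the streamed sketch."""
    y = [0.0] * k
    for i, xi in enumerate(x):
        if xi != 0.0:
            apply_update(y, columns, i, xi)
    return y

d, k, s = 50, 16, 4
columns = make_sparse_columns(d, k, s)
updates = [(3, 2.0), (7, -1.5), (3, 0.5), (42, 4.0)]

x = [0.0] * d
sketch = [0.0] * k
for i, v in updates:
    x[i] += v                            # the underlying vector
    apply_update(sketch, columns, i, v)  # the maintained linear sketch

direct = embed(columns, k, x)  # agrees with sketch up to rounding, by linearity
```

By linearity the streamed sketch and the from-scratch embedding agree, while each update costs $O(s)$ rather than $O(k)$ operations.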

Even aside from streaming applications, several practical situations give rise to vectors with $\|x\|_{0}\ll d$ . For example, a common similarity measure for comparing text documents in data mining and information retrieval is cosine similarity , which is approximately preserved under any JL embedding. Here, a document is represented as a bag of words with the dimensionality $d$ being the size of the lexicon, and we usually would not expect any single document to contain anywhere near $d$ distinct words (i.e., we expect sparse vectors). In networking applications, if $x_{i,j}$ counts bytes sent from source $i$ to destination $j$ in some time interval, then $d$ is the total number of IP pairs, whereas we would not expect most pairs of IPs to communicate with each other. In linear algebra applications, a rating matrix $A$ may for example have $A_{i,j}$ as user $i$'s score for item $j$ (e.g. the Netflix matrix where columns correspond to movies), and we would expect that most users rate only a small fraction of all available items.

It is also worth noting that after the preliminary version of this work was published in , it was shown in that our bound is optimal up to an $O(\log(1/\varepsilon))$ factor. That is, for any fixed constant $c>0$ , any distribution satisfying Lemma 1 that is supported on matrices with $k=O(\varepsilon^{-c}\log(1/\delta))$ and at most $s$ non-zero entries per column must have $s=\Omega(\varepsilon^{-1}\log(1/\delta)/\log(1/\varepsilon))$ as long as $k=O(d/\log(1/\varepsilon))$ . Note that once $k\geq d$ one can always take the distribution supported solely on the $d\times d$ identity matrix, giving $s=1$ and satisfying Lemma 1 with $\varepsilon=0$ .

Our Approach

Our constructions are depicted in Figure 1. Figure 1(a) represents the DKS construction of in which each item is hashed to $s$ random target coordinates with replacement. Our two schemes achieving $s=\Theta(\varepsilon^{-1}\log(1/\delta))$ are as follows. Construction (b) is much like (a) except that we hash coordinates $s$ times without replacement; we call this the graph construction, since hash locations are specified by a bipartite graph with $d$ left vertices, $k$ right vertices, and left-degree $s$ . In (c), the target vector is divided into $s$ contiguous blocks each of equal size $k/s$ , and a given coordinate in the original vector is hashed to a random location in each block (essentially this is the CountSketch of , though we use a higher degree of independence in our hash functions); we call this the block construction. In all cases (a), (b), and (c), we randomly flip the sign of a coordinate in the original vector and divide by $\sqrt{s}$ before adding it in any location in the target vector.
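The three ways of choosing hash locations can be sketched as follows. This is a minimal illustration using fully random choices from Python's `random` module rather than the limited-independence hash families analyzed in the paper; the function names are hypothetical, and drawing an independent sign for each hash location is an assumption made here. Each function returns one column of the embedding matrix as a map from row index to value.

```python
import random

def dks_column(k, s, rng):
    """(a) s locations chosen with replacement; repeated rows accumulate."""
    col = {}
    for _ in range(s):
        r = rng.randrange(k)
        col[r] = col.get(r, 0.0) + rng.choice((-1, 1)) / s ** 0.5
    return col

def graph_column(k, s, rng):
    """(b) Graph construction: s distinct rows (hashing without replacement)."""
    rows = rng.sample(range(k), s)
    return {r: rng.choice((-1, 1)) / s ** 0.5 for r in rows}

def block_column(k, s, rng):
    """(c) Block construction: one random row in each of s blocks of size k/s."""
    if k % s != 0:
        raise ValueError("this sketch assumes s divides k")
    block = k // s
    col = {}
    for b in range(s):
        r = b * block + rng.randrange(block)
        col[r] = rng.choice((-1, 1)) / s ** 0.5
    return col

rng = random.Random(1)
example_a = dks_column(12, 3, rng)
example_b = graph_column(12, 3, rng)
example_c = block_column(12, 3, rng)
```

In (b) and (c) every column has exactly $s$ non-zero entries of magnitude $1/\sqrt{s}$ , so each column has unit norm; in (a) a row may be hit more than once, so the number of distinct non-zero rows can be smaller than $s$ .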

We give two different analyses for both our constructions (b) and (c). Since we consider linear embeddings, without loss of generality we can assume $\|x\|_{2}=1$ , in which case the JL lemma follows by showing that $\|Sx\|_{2}^{2}\in[(1-\varepsilon)^{2},(1+\varepsilon)^{2}]$ , which is implied by $|\|Sx\|_{2}^{2}-1|\leq 2\varepsilon-\varepsilon^{2}$ . Thus it suffices to show that for any unit norm $x$ ,

$$\Pr_{S}\left(\left|\|Sx\|_{2}^{2}-1\right|>2\varepsilon-\varepsilon^{2}\right)<\delta. \qquad (1)$$

We furthermore observe that both our graph and block constructions have the property that the entries of our embedding matrix $S$ can be written as

$$S_{i,j}=\eta_{i,j}\sigma_{i,j}/\sqrt{s},$$

where the $\sigma_{i,j}$ are independent and uniform in $\{-1,1\}$ , and $\eta_{i,j}$ is an indicator random variable for the event $S_{i,j}\neq 0$ (in fact in our analyses we will only need that the $\sigma_{i,j}$ are $O(\log(1/\delta))$ -wise independent). Note that the $\eta_{i,j}$ are not independent, since in both constructions we have that there are exactly $s$ non-zero entries per column. Furthermore in the block construction, knowing that $\eta_{i,j}=1$ for $j$ in some block implies that $\eta_{i,j^{\prime}}=0$ for all other $j^{\prime}$ in the same block.

To outline our analyses, look at the random variable
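As a worked sketch of how such a random variable arises, assume the indexing used in the later sections, where $\eta_{r,i}$ and $\sigma_{r,i}$ refer to row $r\in[k]$ and column $i\in[d]$ of $S$ , so that $S_{r,i}=\eta_{r,i}\sigma_{r,i}/\sqrt{s}$ . Expanding the squared norm and separating diagonal from off-diagonal terms (using $\eta_{r,i}^{2}=\eta_{r,i}$ and $\sigma_{r,i}^{2}=1$ ) gives

$$\|Sx\|_{2}^{2}=\sum_{r=1}^{k}\Big(\sum_{i=1}^{d}S_{r,i}x_{i}\Big)^{2}=\frac{1}{s}\sum_{i=1}^{d}x_{i}^{2}\sum_{r=1}^{k}\eta_{r,i}+\frac{1}{s}\sum_{r=1}^{k}\sum_{i\neq j}\eta_{r,i}\eta_{r,j}\sigma_{r,i}\sigma_{r,j}x_{i}x_{j}.$$

Since every column has exactly $s$ non-zero entries, $\sum_{r}\eta_{r,i}=s$ and the first term equals $\|x\|_{2}^{2}=1$ . The deviation $\|Sx\|_{2}^{2}-1$ is therefore exactly the off-diagonal sum, a quadratic form in the random signs.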

In our second analysis approach, we define

We point out here that Figure 1(c) is somewhat simpler to implement, since there are simple constructions of $O(\log(1/\delta))$ -wise independent hash families . Figure 1(b) on the other hand requires hashing without replacement, which amounts to using random permutations and can be derandomized using almost $O(\log(1/\delta))$ -wise independent permutation families (see Remark 14).
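One standard way to obtain a $t$ -wise independent hash family is to evaluate a random polynomial of degree less than $t$ over a prime field. The following is a minimal sketch of that textbook technique, not necessarily the construction the paper cites; the prime `P`, the function name `sample_hash`, and the final reduction modulo `m` are choices made here. The reduction modulo `m` makes the output only approximately uniform, with bias on the order of `m / P`.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime; keys are assumed to lie in [0, P)

def sample_hash(t, m, rng):
    """Draw h: [0, P) -> [0, m) from a polynomial family of degree < t.

    Over the field Z_P the values at any t distinct keys are independent
    and uniform; reducing mod m afterwards introduces a small bias.
    """
    coeffs = [rng.randrange(P) for _ in range(t)]

    def h(x):
        acc = 0
        for c in reversed(coeffs):  # Horner's rule
            acc = (acc * x + c) % P
        return acc % m

    return h

h = sample_hash(8, 100, random.Random(5))
values = [h(x) for x in range(1000)]
```

Evaluating such a hash function costs $O(t)$ field operations per key, which is the kind of cost the paper sets aside when it counts hash function evaluations separately.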

Conventions and Notation

Code-Based Constructions

In this section, we provide analyses of our constructions (b) and (c) in Figure 1 when the non-zero entry locations are deterministic but satisfy a certain condition. In particular, in the analysis in this section we assume that for any $i\neq j\in[d]$ ,

$$\sum_{r=1}^{k}\eta_{r,i}\eta_{r,j}=O(s^{2}/k). \qquad (6)$$

That is, no two columns have their non-zero entries in more than $O(s^{2}/k)$ of the same rows. We show how to use error-correcting codes to ensure Eq. (6) in Remark 8 for the block construction, and in Remark 9 for the graph construction. Unfortunately this step will require setting $s$ to be slightly larger than the desired $O(\varepsilon^{-1}\log(1/\delta))$ . We give an alternate analysis in Section 4 which avoids assuming Eq. (6) and obtains an improved bound for $s$ by not using deterministic $\eta_{r,i}$ .

We prove our construction satisfies the JL lemma by applying Theorem 4 with $z=\sigma,B=T$ .

where the first inequality used Eq. (6). $\blacksquare$

By Eq. (1), it now suffices to prove the following theorem.

We now discuss how to choose the non-zero locations in $S$ to ensure Eq. (6).

It is also possible to use a code to specify the hash locations in the graph construction. In particular, let the $j$ th entry of the $i$ th column of the embedding matrix be the $j$ th symbol of the $i$ th codeword (which we call $h(i,j)$ ) in a weight- $s$ binary code of minimum distance $2s-O(s^{2}/k)$ for $s\geq 2\varepsilon^{-1}\log(1/\delta)$ . Define $\eta_{i,j,r}$ for $i,j\in[d],r\in[s]$ as an indicator variable for $h(i,r)=h(j,r)=1$ . Then, the error is again exactly as in Eq. (3). Also, as in Remark 8, such a code can be shown to exist via the probabilistic method (the Chernoff bound can be applied using negative dependence, followed by a union bound) as long as $s=\Omega(\varepsilon^{-1}\sqrt{\log(d/\delta)\log(1/\delta)})$ . We omit the details since Section 4 obtains better parameters.

Only using Eq. (6), it is impossible to improve our sparsity bound further. For example, consider an instantiation of the block construction in which Eq. (6) is satisfied. Create a new set of $\eta_{r,i}$ which change only in the case $r=1$ so that $\eta_{1,i}=1$ for all $i$ , so that Eq. (6) still holds. In our construction this corresponds to all indices colliding in the first chunk of $k/s$ coordinates, which creates an error term of $(1/s)\cdot\sum_{i\neq j}x_{i}x_{j}\sigma_{r,i}\sigma_{r,j}$ . Now, suppose $x$ consists of $t=(1/2)\cdot\log(1/\delta)$ entries each with value $1/\sqrt{t}$ . Then, with probability $\sqrt{\delta}\gg\delta$ , all these entries receive the same sign under $\sigma$ and contribute a total error of $\Omega(t/s)$ in the first chunk alone. We thus need $t/s=O(\varepsilon)$ , which implies $s=\Omega(\varepsilon^{-1}\log(1/\delta))$ .
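The arithmetic behind this example can be spelled out as follows, taking logarithms base 2 (an assumption about the intended base) and conditioning on the event that all $t$ signs agree in the first chunk:

$$\frac{1}{s}\sum_{i\neq j}x_{i}x_{j}\sigma_{1,i}\sigma_{1,j}=\frac{1}{s}\cdot t(t-1)\cdot\frac{1}{t}=\frac{t-1}{s},\qquad\Pr[\text{all }t\text{ signs agree}]=2\cdot 2^{-t}=2\sqrt{\delta}.$$

The sum has $t(t-1)$ ordered pairs, each contributing $x_{i}x_{j}=1/t$ , and $2^{-t}=\sqrt{\delta}$ for $t=(1/2)\log_{2}(1/\delta)$ . Since this event has probability well above $\delta$ , the error $(t-1)/s$ must itself be $O(\varepsilon)$ .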

Random Hashing Constructions

In this section, we show that if the hash functions $h$ described in Remark 8 and Remark 9 are not specified by fixed codes, but rather are chosen at random from some family of sufficiently high independence, then one can achieve sparsity $O(\varepsilon^{-1}\log(1/\delta))$ (in the case of Figure 1(b), we actually need almost k-wise independent permutations). Recall our bottleneck in reducing the sparsity in Section 3 was actually obtaining the codes, discussed in Remark 8 and Remark 9.

Define $\mathcal{G}_{t}$ as the set of directed multigraphs with $t$ edges having distinct labels in $[t]$ and no self-loops, with between $2$ and $t$ vertices (inclusive), and where every vertex has non-zero and even degree (we use degree to denote the sum of in- and out-degrees). Let $f$ map variable sequences to their corresponding graph. That is, we draw a directed edge labeled $u$ from the vertex representing $i_{u}$ to that representing $j_{u}$ for $u=1,\ldots,t$ , where one vertex represents all the $i_{u},j_{u}$ which are assigned the same element of $[d]$ (see Figure 2). For a graph $G$ , let $v$ be its number of vertices, and let $d_{u}$ be the degree of vertex $u$ . By construction every monomial maps to a graph with $t$ edges. Also we need only consider graphs with all even vertex degrees since a monomial whose graph has at least one vertex with odd degree will have at least one random sign $\sigma_{i,r_{u}}$ appearing an odd number of times and thus have expectation zero. Then,

where $\mathcal{G}^{\prime}_{t}$ is the set of all directed multigraphs as in $\mathcal{G}_{t}$ , but in which vertices are labeled as well, with distinct labels in $[v]$ (see Figure 2; the vertex labels can be arbitrarily permuted).

Eq. (10) used that $\eta_{r,1},\ldots,\eta_{r,d}$ are independent for any $r$ . For Eq. (11), note that $(\|x\|_{2}^{2})^{t}=1$ , and the coefficient of $\prod_{u=1}^{v}x_{a_{u}}^{d_{u}}$ in its expansion for $\sum_{u=1}^{v}d_{u}=2t$ is $\binom{t}{d_{1}/2,\ldots,d_{v}/2}$ . Meanwhile, the coefficient of this monomial when summing over all $i_{1}\neq j_{1},\ldots,i_{t}\neq j_{t}$ for a particular $G\in\mathcal{G}_{t}$ is at most $v!$ . For Eq. (12), we move from graphs in $\mathcal{G}_{t}$ to those in $\mathcal{G}_{t}^{\prime}$ , and for any $G\in\mathcal{G}_{t}$ there are exactly $v!$ ways to label vertices. This is because for any graph $G\in\mathcal{G}_{t}$ there is a canonical way of labeling the vertices as $1,\ldots,v$ since there are no isolated vertices. Namely, the vertices can be labeled in increasing order of when they are first visited by an edge when processing edges in order of increasing label (if two vertices are both visited for the first time simultaneously by some edge, then we can break ties consistently using the direction of the edge). Thus the vertices are all identified by this canonical labeling, implying that the $v!$ vertex labelings all give distinct graphs in $\mathcal{G}^{\prime}_{t}$ . Eq. (13) follows since $t!\geq t^{t}/e^{t}$ and

The summation over $G$ in Eq. (13) is over the $G\in\mathcal{G}_{t}^{\prime}$ with $v$ vertices. Let us bound this summation for some fixed choice of vertex degrees $d_{1},\ldots,d_{v}$ . For any given $i$ , consider the set of all graphs $\mathcal{G}^{\prime\prime}_{i}$ on $v$ labeled vertices with distinct labels in $[v]$ , and with $i$ edges with distinct labels in $[i]$ (that is, we do not require even edge degrees, and some vertices may even have degree $0$ ). For a graph $G\in\mathcal{G}^{\prime\prime}_{i}$ , let $d_{u}^{\prime}$ represent the degree of vertex $u$ in $G$ . For $a_{1},\ldots,a_{v}>0$ define the function

Let $\mathcal{G}^{\prime}_{t}(d_{1},\ldots,d_{v})$ be those graphs $G\in\mathcal{G}^{\prime}_{t}$ with $v$ vertices such that vertex $u$ has degree $d_{u}$ . Then

since $\mathcal{G}^{\prime}_{t}(d_{1},\ldots,d_{v})\subset\mathcal{G}^{\prime\prime}_{t}$ . To upper bound $S_{t}(a_{1},\ldots,a_{v})$ , note $S_{0}(a_{1},\ldots,a_{v})=1$ . For $i\geq 1$ , note any graph in $\mathcal{G}^{\prime\prime}_{i}$ can be formed by taking a graph $G\in\mathcal{G}^{\prime\prime}_{i-1}$ and adding an edge labeled $i$ from $u$ to $w$ for some vertices $u\neq w$ in $G$ . This change causes $d_{u}^{\prime},d_{w}^{\prime}$ to both increase by $1$ , whereas all other degrees stay the same. Thus considering Eq. (14),

with the last inequality using Cauchy-Schwarz. Thus by induction, $S_{t}(a_{1},\ldots,a_{v})\leq(\sum_{u=1}^{v}a_{u})^{t}\cdot v^{t}$ . Since $\sum_{u=1}^{v}d_{u}=2t$ , we have $S_{t}(d_{1},\ldots,d_{v})\leq(2tv)^{t}$ . We then have that the summation in Eq. (13) is at most the number of choices of even $d_{1},\ldots,d_{v}$ summing to $2t$ (there are $\binom{t-1}{v-1}<2^{t}$ such choices), times $(2tv)^{t}$ , implying

By differentiation, the quantity $(s/k)^{v}v^{t}$ is maximized for $v=\max\left\{2,t/\ln(k/s)\right\}$ (recall $v\geq 2$ ), giving our lemma. $\blacksquare$
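The differentiation step can be made explicit by treating $v$ as a real variable and taking logarithms, assuming $k>s$ :

$$g(v)=\ln\left((s/k)^{v}v^{t}\right)=-v\ln(k/s)+t\ln v,\qquad g^{\prime}(v)=-\ln(k/s)+\frac{t}{v},\qquad g^{\prime\prime}(v)=-\frac{t}{v^{2}}<0.$$

Thus $g$ is concave with its unique stationary point at $v=t/\ln(k/s)$ . When this point is below $2$ , $g$ is decreasing on the feasible range $v\geq 2$ and the maximum is attained at $v=2$ , which yields the stated maximizer $\max\{2,t/\ln(k/s)\}$ .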

Proof. We use Lemma 11. In the case $t<2\ln(k/s)$ we can multiply the $(s/k)^{2}$ term by $t^{t}$ and still obtain an upper bound, and in the case of larger $t$ we have $(t/\ln(k/s))^{t}\leq t^{t}$ since $k\geq s$ . Also when $t\geq 2\ln(k/s)$ we have $e^{t}(s/k)^{2}\geq 1$ , so that $t(2e^{2})^{t}t^{t}\leq t(2e^{3})^{t}(s/k)^{2}t^{t}$ . $\blacksquare$

It is worth noting that if one wants distortion $1\pm\varepsilon_{i}$ with probability $1-\delta_{i}$ simultaneously for all $i$ in some set $S$ , our proof of Theorem 13 reveals that it suffices to set $s=C\cdot\sup_{i\in S}\varepsilon_{i}^{-1}\log(1/\delta_{i})$ and $k=C\cdot\sup_{i\in S}\varepsilon_{i}^{-2}\log(1/\delta_{i})$ .

Tightness of analyses

The main theorem of this section is the following.

The DKS construction of requires sparsity $s=\Omega(\varepsilon^{-1}\cdot\left\lceil\log^{2}(1/\delta)/\log^{2}(1/\varepsilon)\right\rceil)$ to achieve distortion $1\pm\varepsilon$ with success probability $1-\delta$ .

Our proof will use the following standard fact.

which is much larger than $\delta/2$ for $\delta$ smaller than some constant. Now, given a collision, the colliding items have the same sign with probability $1/2$ .

which is much larger than $\delta/2$ for $\delta$ smaller than some constant. Now, given a collision, the colliding items have the same sign with probability $1/8$ .

We lastly consider the case $4/\varepsilon<s\leq 2c\varepsilon^{-1}\log^{2}(1/\delta)/\log^{2}(1/\varepsilon)$ for some constant $c>0$ (depending on $C$ ) to be determined later. First note this case only exists when $\delta=O(\varepsilon)$ . Define $x=(1,0,\ldots,0)$ . Suppose there exists an integer $q$ so that

First we show it is possible to satisfy the above conditions simultaneously for our range of $s$ . We set $q=2\sqrt{\varepsilon s}$ , satisfying item 1 trivially, and item 2 since $s>4/\varepsilon$ . For item 3, Fact 17 gives

The $e^{-s/k}\cdot(1-(s/k^{2}))$ term is at least $\delta^{1/6}$ by the settings of $s,k$ , and the $(s/(qk))^{q}$ term is also at least $\delta^{1/6}$ for $c$ sufficiently small.

Tightness of Figure 1(b) analysis

For $\delta$ smaller than a constant depending on $C$ for $k=C\varepsilon^{-2}\log(1/\delta)$ , the graph construction of Section 4 requires $s=\Omega(\varepsilon^{-1}\log(1/\delta))$ to obtain distortion $1\pm\varepsilon$ with probability $1-\delta$ .

Proof. First suppose $s\leq 1/(2\varepsilon)$ . We consider a vector with $t=\left\lfloor 1/(s\varepsilon)\right\rfloor$ non-zero coordinates each of value $1/\sqrt{t}$ . If there is exactly one set $i,j,r$ with $i\neq j$ such that $S_{r,i},S_{r,j}$ are both non-zero for the embedding matrix $S$ (i.e., there is exactly one collision), then the total error is $2/(ts)\geq 2\varepsilon$ . It just remains to show that this happens with probability larger than $\delta$ . The probability of this occurring is

Now consider the case $1/(2\varepsilon)<s<c\cdot\varepsilon^{-1}\log(1/\delta)$ for some small constant $c$ . Consider the vector $(1/\sqrt{2},1/\sqrt{2},0,\ldots,0)$ . Suppose there are exactly $2s\varepsilon$ collisions, i.e. $2s\varepsilon$ distinct values of $r$ such that $S_{r,1},S_{r,2}$ are both non-zero (to avoid tedium we disregard floors and ceilings and just assume $s\varepsilon$ is an integer). Also, suppose that in each colliding row $r$ we have $\sigma(1,r)=\sigma(2,r)$ . Then, the total error would be $2\varepsilon$ . It just remains to show that this happens with probability larger than $\delta$ . The probability of signs agreeing in all $2\varepsilon s$ colliding rows is $2^{-2\varepsilon s}>2^{-2c\log(1/\delta)}$ , which is larger than $\sqrt{\delta}$ for $c<1/4$ . The probability of exactly $2\varepsilon s$ collisions is

It suffices for the right hand side to be at least $\sqrt{\delta}$ since $h$ is independent of $\sigma$ , and thus the total probability of error larger than $2\varepsilon$ would be greater than $\sqrt{\delta}^{2}=\delta$ . Taking natural logarithms, it suffices to have

Writing $s=q/\varepsilon$ and $a=4C\log(1/\delta)$ , the left hand side is $2q\ln(a/q)+\Theta(s^{2}/k)$ . Taking a derivative shows $2q\ln(a/q)$ is monotonically increasing for $q<a/e$ . Thus as long as $q<ca$ for a sufficiently small constant $c$ , $2q\ln(a/q)<\ln(1/\delta)/4$ . Also, the $\Theta(s^{2}/k)$ term is at most $\ln(1/\delta)/4$ for $c$ sufficiently small. $\blacksquare$

Tightness of Figure 1(c) analysis

For $\delta$ smaller than a constant depending on $C$ for $k=C\varepsilon^{-2}\log(1/\delta)$ , the block construction of Section 4 requires $s=\Omega(\varepsilon^{-1}\log(1/\delta))$ to obtain distortion $1\pm\varepsilon$ with probability $1-\delta$ .

Proof. First suppose $s\leq 1/(2\varepsilon)$ . Consider a vector with $t=\left\lfloor 1/(s\varepsilon)\right\rfloor$ non-zero coordinates each of value $1/\sqrt{t}$ . If there is exactly one set $i,j,r$ with $i\neq j$ such that $h(i,r)=h(j,r)$ (i.e. exactly one collision), then the total error is $2/(ts)\geq 2\varepsilon$ . It just remains to show that this happens with probability larger than $\delta$ .
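The error value quoted above can be checked directly. If coordinates $i\neq j$ collide in a single location and nowhere else, only the two ordered pairs $(i,j)$ and $(j,i)$ contribute to the off-diagonal error, so its magnitude is

$$\frac{2}{s}\cdot|x_{i}x_{j}|=\frac{2}{s}\cdot\frac{1}{t}=\frac{2}{ts}\geq 2\varepsilon,$$

where the inequality uses $t=\left\lfloor 1/(s\varepsilon)\right\rfloor\leq 1/(s\varepsilon)$ , i.e. $ts\leq 1/\varepsilon$ . The assumption $s\leq 1/(2\varepsilon)$ guarantees $1/(s\varepsilon)\geq 2$ , so $t\geq 2$ and the vector does have two coordinates that can collide. Note also that $2\varepsilon>2\varepsilon-\varepsilon^{2}$ , so an error of this size violates Eq. (1).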

The probability of exactly one collision is

which is larger than $\delta$ for $\delta$ smaller than a universal constant.

Now consider $1/(2\varepsilon)<s<c\cdot\varepsilon^{-1}\log(1/\delta)$ for some small constant $c$ . Consider the vector $x=(1/\sqrt{2},1/\sqrt{2},0,\ldots,0)$ . Suppose there are exactly $2s\varepsilon$ collisions, i.e. $2s\varepsilon$ distinct values of $r$ such that $h(1,r)=h(2,r)$ (to avoid tedium we disregard floors and ceilings and just assume $s\varepsilon$ is an integer). Also, suppose that in each colliding chunk $r$ we have $\sigma(1,r)=\sigma(2,r)$ . Then, the total error would be $2\varepsilon$ . It just remains to show that this happens with probability larger than $\delta$ . The probability of signs agreeing in exactly $2\varepsilon s$ chunks is $2^{-2\varepsilon s}>2^{-2c\log(1/\delta)}$ , which is larger than $\sqrt{\delta}$ for $c<1/4$ . The probability of exactly $2\varepsilon s$ collisions is

The above is at least $\sqrt{\delta}$ , by the analysis following Eq. (22). Since $h$ is independent of $\sigma$ , the total probability of having error larger than $2\varepsilon$ is greater than $\sqrt{\delta}^{2}=\delta$ . $\blacksquare$
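The probabilities in this argument can be explored numerically. The sketch below assumes fully independent, uniform hashing within each of the $s$ blocks of size $k/s$ , so that coordinates 1 and 2 collide in each block independently with probability $s/k$ and the number of colliding blocks is binomial. The function names and the example parameters are illustrative only.

```python
from math import comb

def prob_exact_collisions(s, k, m):
    """P[exactly m of the s blocks have h(1, r) == h(2, r)], idealized model."""
    p = s / k  # each block has k/s cells, so a collision has probability s/k
    return comb(s, m) * p ** m * (1 - p) ** (s - m)

def prob_bad_event(s, k, eps):
    """P[exactly 2*s*eps collisions, with signs agreeing in every one]."""
    m = round(2 * s * eps)
    return prob_exact_collisions(s, k, m) * 2.0 ** (-m)

def error_given_bad_event(s, eps):
    """Each colliding block with equal signs adds 2 * (1/2) / s = 1/s."""
    return round(2 * s * eps) / s

s, k, eps = 20, 400, 0.1
p_bad = prob_bad_event(s, k, eps)
err = error_given_bad_event(s, eps)  # equals 2 * eps when s * eps is an integer
```

Comparing `p_bad` against a target failure probability $\delta$ for various $s$ gives a concrete feel for why $s$ cannot be much smaller than $\varepsilon^{-1}\log(1/\delta)$ in this model.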

Faster numerical linear algebra streaming algorithms

Now, the following theorem is a generalization of [10, Theorem 2.1]. The theorem states that any distribution with JL moments also provides a sketch for approximate matrix products. A similar statement was made in [34, Lemma 6], but that statement was slightly weaker in its parameters because it resorted to a union bound, which we avoid by using Minkowski’s inequality.

Often when one constructs a JL distribution $\mathcal{D}$ over $k\times d$ matrices, it is shown that for all $x$ with $\|x\|_{2}=1$ and for all $\varepsilon>0$ ,

Any such distribution automatically satisfies the $(\varepsilon,e^{-\Omega(\varepsilon^{2}k+\varepsilon k)},\min\{\varepsilon^{2}k,\varepsilon k\})$ -JL moment property for any $\varepsilon>0$ by converting the tail bound into a moment bound via integration by parts.

Now we arrive at the main point of this section. Several algorithms for approximate linear regression and best rank- $k$ approximation in simply maintain $SA$ as $A$ is updated, where $S$ comes from the JL distribution with $\Omega(\log(1/\delta))$ -wise independent $\pm 1/\sqrt{k}$ entries. In fact, the analyses of these algorithms only use the fact that this distribution satisfies the approximate matrix product sketch guarantees of Theorem 21, and by Theorem 21 any distribution satisfying the $(\varepsilon,\delta)$ -JL moment condition gives an approximate matrix product sketch. Thus, random Bernoulli matrices may be replaced with our sparse JL distributions in this work. We now state some of the algorithmic results given in and describe how our constructions provide improvements in the update time (the time to process new columns, rows, or turnstile updates).

As in , when stating our results we will ignore the space and time complexities of storing and evaluating the hash functions in our JL distributions. We discuss this issue later in Remark 26.

There is a one-pass streaming algorithm for linear regression in the turnstile model where one maintains a sketch of size $O(n^{2}\varepsilon^{-1}\log(1/\delta)\log(nd))$ . Processing each update requires $O(n+\sqrt{n/\varepsilon}\cdot\log(1/\delta))$ arithmetic operations and hash function evaluations.

Theorem 24 improves the update complexity of , which was $O(n\varepsilon^{-1}\log(1/\delta))$ .

Low rank approximation

Theorem 4.4 of gives a $2$ -pass algorithm where in the first pass, one maintains $SA$ where $S$ is drawn from a distribution that simultaneously satisfies both the $(1/2,\eta^{-r}\delta)$ and $(\sqrt{\varepsilon/r},\delta)$ -JL moment properties for some fixed constant $\eta>1$ in their proof. It is also assumed that $\rho\geq 2r+1$ . The first pass is thus sped up again as in Theorem 24.

One-pass algorithm for column/row-wise updates:

Theorem 4.5 of gives a one-pass algorithm in the case that $A$ is seen either one whole column or row at a time. The algorithm maintains both $SA$ and $SAA^{T}$ where $S$ is drawn from a distribution that simultaneously satisfies both the $(1/2,\eta^{-r}\delta)$ and $(\sqrt{\varepsilon/r},\delta)$ -JL moment properties. This implies the following.

There is a one-pass streaming algorithm for approximate low rank approximation with row/column-wise updates where one maintains a sketch of size $O(r\varepsilon^{-1}(n+d)\log(1/\delta)\log(nd))$ . Processing each update requires $O(r+\sqrt{r/\varepsilon}\cdot\log(1/\delta))$ amortized arithmetic operations and hash function evaluations per entry of $A$ .

Theorem 25 improves the amortized update complexity of , which was $O(r\varepsilon^{-1}\log(1/\delta))$ .

Three-pass algorithm for row-wise updates:

Theorem 4.6 of gives a three-pass algorithm using less space in the case that $A$ is seen one row at a time. Again, the first pass simply maintains $SA$ where $S$ is drawn from a distribution that satisfies both the $(1/2,\eta^{-r}\delta)$ and $(\sqrt{\varepsilon/r},\delta)$ -JL moment properties. This pass is sped up using our sparser JL distribution.

One-pass algorithm in the turnstile model, bi-criteria:

Theorem 4.7 of gives a one-pass algorithm under turnstile updates where $SA$ and $RA^{T}$ are maintained in the stream. $S$ is drawn from a distribution satisfying both the $(1/2,\eta^{-r\log(1/\delta)/\varepsilon}\delta)$ and $(\varepsilon/\sqrt{r\log(1/\delta)},\delta)$ -JL moment properties. $R$ is drawn from a distribution satisfying both the $(1/2,\eta^{-r}\delta)$ and $(\sqrt{\varepsilon/r},\delta)$ -JL moment properties. Theorem 4.7 of then shows how to compute a matrix of rank $O(r\varepsilon^{-1}\log(1/\delta))$ which achieves the desired error guarantee given $SA$ and $RA^{T}$ .

One-pass algorithm in the turnstile model:

Theorem 4.9 of gives a one-pass algorithm under turnstile updates where $SA$ and $RA^{T}$ are maintained in the stream. $S$ is drawn from a distribution satisfying both the $(1/2,\eta^{-r\log(1/\delta)/\varepsilon^{2}}\delta)$ and $(\varepsilon\sqrt{\varepsilon/(r\log(1/\delta))},\delta)$ -JL moment properties. $R$ is drawn from a distribution satisfying both the $(1/2,\eta^{-r}\delta)$ and $(\sqrt{\varepsilon/r},\delta)$ -JL moment properties. Theorem 4.9 of then shows how to compute a matrix of rank $r$ which achieves the desired error guarantee given $SA$ and $RA^{T}$ .

Let $\mathbf{R}$ be a ring, and let $q\in\mathbf{R}[x]$ be a degree- $t$ polynomial. Then, given distinct $x_{1},\ldots,x_{t}\in\mathbf{R}$ , all the values $q(x_{1}),\ldots,q(x_{t})$ can be computed using $O(t\log^{2}t\log\log t)$ operations over $\mathbf{R}$ .

Open Problems

In this section we state two explicit open problems. For the first, observe that our graph construction is quite similar to a sparse JL construction of Achlioptas . The work of proposes a random normalized sign matrix where each column has an expected number $s$ of non-zero entries, so that in the notation of this work, the $\eta_{i,j}$ are i.i.d. Bernoulli with expectation $s/k$ . Using this construction, was able to achieve $s=k/3$ without causing $k$ to increase over analyses of dense constructions, even by a constant factor. Meanwhile, our graph construction requires that there be exactly $s$ non-zero entries per column. This sole change was the reason we were able to obtain better asymptotic bounds on the sparsity of $S$ in this work, but in fact we conjecture an even stronger benefit than just asymptotic improvement. The first open problem is to resolve the following conjecture.

A positive resolution of this conjecture would imply that not only does our graph construction obtain better asymptotic performance than , but in fact obtains stronger performance in a very definitive sense.
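The difference between the two sparsity patterns discussed above can be seen on a single standard basis vector. The sketch below is an illustration with fully random choices and hypothetical function names. It compares a column whose entries are non-zero independently with probability $s/k$ against a column with exactly $s$ non-zero entries, both scaled by $1/\sqrt{s}$ . For $x=e_{i}$ the embedded squared norm is the squared norm of column $i$ .

```python
import random

def bernoulli_column(k, s, rng):
    """Each entry is non-zero independently with probability s/k."""
    return {r: rng.choice((-1, 1)) / s ** 0.5
            for r in range(k) if rng.random() < s / k}

def exact_column(k, s, rng):
    """Exactly s non-zero entries, as in the graph construction."""
    return {r: rng.choice((-1, 1)) / s ** 0.5
            for r in rng.sample(range(k), s)}

def squared_norm(col):
    return sum(v * v for v in col.values())

rng = random.Random(0)
k, s = 300, 100  # s = k/3, the sparsity level mentioned above
bernoulli_norms = [squared_norm(bernoulli_column(k, s, rng)) for _ in range(200)]
exact_norms = [squared_norm(exact_column(k, s, rng)) for _ in range(200)]
```

With exactly $s$ non-zeros the squared norm is always $1$ , whereas with i.i.d. indicators it fluctuates around $1$ with the number of non-zero entries. This is only a sanity check on one-sparse inputs and says nothing about the conjecture itself.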

Question: Can we obtain a tight lower bound of $s=\Omega(\varepsilon^{-1}\log(1/\delta))$ for distributional JL in the case that $k=O(\varepsilon^{-2}\log(1/\delta))<d/2$ , thus removing the $O(\log(1/\varepsilon))$ factor gap?