Sparser Johnson-Lindenstrauss Transforms

Daniel M. Kane, Jelani Nelson

Introduction

Proofs of the JL lemma can be found in . The value of kk in the JL lemma is optimal (also see a later proof in ).

The JL lemma is a key ingredient in the JL flattening theorem, which states that any nn points in Euclidean space can be embedded into O(ε2logn)O(\varepsilon^{-2}\log n) dimensions so that all pairwise Euclidean distances are preserved up to 1±ε1\pm\varepsilon. The JL lemma is a useful tool for speeding up solutions to several high-dimensional problems: closest pair, nearest neighbor, diameter, minimum spanning tree, etc. It also speeds up some clustering and string processing algorithms, and can further be used to reduce the amount of storage required to store a dataset, e.g. in streaming algorithms. Recently it has also found applications in approximate numerical algebra problems such as linear regression and low-rank approximation . See for further discussions on applications.

Standard proofs of the JL lemma take a distribution over dense matrices (e.g. i.i.d. Gaussian or Bernoulli entries), and thus performing the embedding naïvely takes O(kx0)O(k\cdot\|x\|_{0}) time where xx has x0\|x\|_{0} non-zero entries. Several works have devised other distributions which give faster embedding times , but all these methods require Ω(dlogd)\Omega(d\log d) embedding time even for sparse vectors (even when x0=1\|x\|_{0}=1). This feature is particularly unfortunate in streaming applications, where a vector xx receives coordinate-wise updates of the form xx+veix\leftarrow x+v\cdot e_{i}, so that to maintain some linear embedding SxSx of xx we should repeatedly calculate SeiSe_{i} during updates. Since ei0=1\|e_{i}\|_{0}=1, even the naïve O(kei0)O(k\cdot\|e_{i}\|_{0}) embedding time method is faster than these approaches.

Even aside from streaming applications, several practical situations give rise to vectors with x0d\|x\|_{0}\ll d. For example, a common similarity measure for comparing text documents in data mining and information retrieval is cosine similarity , which is approximately preserved under any JL embedding. Here, a document is represented as a bag of words with the dimensionality dd being the size of the lexicon, and we usually would not expect any single document to contain anywhere near dd distinct words (i.e., we expect sparse vectors). In networking applications, if xi,jx_{i,j} counts bytes sent from source ii to destination jj in some time interval, then dd is the total number of IP pairs, whereas we would not expect most pairs of IPs to communicate with each other. In linear algebra applications, a rating matrix AA may for example have Ai,jA_{i,j} as user ii’s score for item jj (e.g. the Netflix matrix where columns correspond to movies), and we would expect that most users rate only small fraction of all available items.

It is also worth nothing that after the preliminary version of this work was published in , it was shown in that our bound is optimal up to an O(log(1/ε))O(\log(1/\varepsilon)) factor. That is, for any fixed constant c>0c>0, any distribution satisfying Lemma 1 that is supported on matrices with k=O(εclog(1/δ))k=O(\varepsilon^{-c}\log(1/\delta)) and at most ss non-zero entries per column must have s=Ω(ε1log(1/δ)/log(1/ε))s=\Omega(\varepsilon^{-1}\log(1/\delta)/\log(1/\varepsilon)) as long as k=O(d/log(1/ε))k=O(d/\log(1/\varepsilon)). Note that once kdk\geq d one can always take the distribution supported solely on the d×dd\times d identity matrix, giving s=1s=1 and satisfying Lemma 1 with ε=0\varepsilon=0.

1 Our Approach

Our constructions are depicted in Figure 1. Figure 1(a) represents the DKS construction of in which each item is hashed to ss random target coordinates with replacement. Our two schemes achieving s=Θ(ε1log(1/δ))s=\Theta(\varepsilon^{-1}\log(1/\delta)) are as follows. Construction (b) is much like (a) except that we hash coordinates ss times without replacement; we call this the graph construction, since hash locations are specified by a bipartite graph with dd left vertices, kk right vertices, and left-degree ss. In (c), the target vector is divided into ss contiguous blocks each of equal size k/sk/s, and a given coordinate in the original vector is hashed to a random location in each block (essentially this is the CountSketch of , though we use a higher degree of independence in our hash functions); we call this the block construction. In all cases (a), (b), and (c), we randomly flip the sign of a coordinate in the original vector and divide by s\sqrt{s} before adding it in any location in the target vector.

We give two different analyses for both our constructions (b) and (c). Since we consider linear embeddings, without loss of generality we can assume x2=1\|x\|_{2}=1, in which case the JL lemma follows by showing that Sx22[(1ε)2,(1+ε)2]\|Sx\|_{2}^{2}\in[(1-\varepsilon)^{2},(1+\varepsilon)^{2}], which is implied by Sx2212εε2|\|Sx\|_{2}^{2}-1|\leq 2\varepsilon-\varepsilon^{2}. Thus it suffices to show that for any unit norm xx,

We furthermore observe that both our graph and block constructions have the property that the entries of our embedding matrix SS can be written as

where the σi,j\sigma_{i,j} are independent and uniform in {1,1}\{-1,1\}, and ηi,j\eta_{i,j} is an indicator random variable for the event Si,j0S_{i,j}\neq 0 (in fact in our analyses we will only need that the σi,j\sigma_{i,j} are O(log(1/δ))O(\log(1/\delta))-wise independent). Note that the ηi,j\eta_{i,j} are not independent, since in both constructions we have that there are exactly ss non-zero entries per column. Furthermore in the block construction, knowing that ηi,j=1\eta_{i,j}=1 for jj in some block implies that ηi,j=0\eta_{i,j^{\prime}}=0 for all other jj^{\prime} in the same block.

To outline our analyses, look at the random variable

In our second analysis approach, we define

We point out here that Figure 1(c) is somewhat simpler to implement, since there are simple constructions of O(log(1/δ))O(\log(1/\delta))-wise hash families . Figure 1(b) on the other hand requires hashing without replacement, which amounts to using random permutations and can be derandomized using almost O(log(1/δ))O(\log(1/\delta))-wise independent permutation families (see Remark 14).

Conventions and Notation

Code-Based Constructions

In this section, we provide analyses of our constructions (b) and (c) in Figure 1 when the non-zero entry locations are deterministic but satisfy a certain condition. In particular, in the analysis in this section we assume that for any ij[d]i\neq j\in[d],

That is, no two columns have their non-zero entries in more than O(s2/k)O(s^{2}/k) of the same rows. We show how to use error-correcting codes to ensure Eq. (6) in Remark 8 for the block construction, and in Remark 9 for the graph construction. Unfortunately this step will require setting ss to be slightly larger than the desired O(ε1log(1/δ))O(\varepsilon^{-1}\log(1/\delta)). We give an alternate analysis in Section 4 which avoids assuming Eq. (6) and obtains an improved bound for ss by not using deterministic ηr,i\eta_{r,i}.

We prove our construction satisfies the JL lemma by applying Theorem 4 with z=σ,B=Tz=\sigma,B=T.

where the first inequality used Eq. (6). \blacksquare

By Eq. (1), it now suffices to prove the following theorem.

We now discuss how to choose the non-zero locations in SS to ensure Eq. (6).

It is also possible to use a code to specify the hash locations in the graph construction. In particular, let the jjth entry of the iith column of the embedding matrix be the jjth symbol of the iith codeword (which we call h(i,j)h(i,j)) in a weight-ss binary code of minimum distance 2sO(s2/k)2s-O(s^{2}/k) for s2ε1log(1/δ)s\geq 2\varepsilon^{-1}\log(1/\delta). Define ηi,j,r\eta_{i,j,r} for i,j[d],r[s]i,j\in[d],r\in[s] as an indicator variable for h(i,r)=h(j,r)=1h(i,r)=h(j,r)=1. Then, the error is again exactly as in Eq. (3). Also, as in Remark 8, such a code can be shown to exist via the probabilistic method (the Chernoff bound can be applied using negative dependence, followed by a union bound) as long as s=Ω(ε1log(d/δ)log(1/δ))s=\Omega(\varepsilon^{-1}\sqrt{\log(d/\delta)\log(1/\delta)}). We omit the details since Section 4 obtains better parameters.

Only using Eq. (6), it is impossible to improve our sparsity bound further. For example, consider an instantiation of the block construction in which Eq. (6) is satisfied. Create a new set of ηr,i\eta_{r,i} which change only in the case r=1r=1 so that η1,i=1\eta_{1,i}=1 for all ii, so that Eq. (6) still holds. In our construction this corresponds to all indices colliding in the first chunk of k/sk/s coordinates, which creates an error term of (1/s)ijxixjσr,iσr,j(1/s)\cdot\sum_{i\neq j}x_{i}x_{j}\sigma_{r,i}\sigma_{r,j}. Now, suppose xx consists of t=(1/2)log(1/δ)t=(1/2)\cdot\log(1/\delta) entries each with value 1/t1/\sqrt{t}. Then, with probability δδ\sqrt{\delta}\gg\delta, all these entries receive the same sign under σ\sigma and contribute a total error of Ω(t/s)\Omega(t/s) in the first chunk alone. We thus need t/s=O(ε)t/s=O(\varepsilon), which implies s=Ω(ε1log(1/δ))s=\Omega(\varepsilon^{-1}\log(1/\delta)).

Random Hashing Constructions

In this section, we show that if the hash functions hh described in Remark 8 and Remark 9 are not specified by fixed codes, but rather are chosen at random from some family of sufficiently high independence, then one can achieve sparsity O(ε1log(1/δ))O(\varepsilon^{-1}\log(1/\delta)) (in the case of Figure 1(b), we actually need almost k-wise independent permutations). Recall our bottleneck in reducing the sparsity in Section 3 was actually obtaining the codes, discussed in Remark 8 and Remark 9.

Define Gt\mathcal{G}_{t} as the set of directed multigraphs with tt edges having distinct labels in [t][t] and no self-loops, with between 22 and tt vertices (inclusive), and where every vertex has non-zero and even degree (we use degree to denote the sum of in- and out-degrees). Let ff map variable sequences to their corresponding graph. That is, we draw a directed edge labeled uu from the vertex representing iui_{u} to that representing juj_{u} for u=1,,tu=1,\ldots,t, where one vertex represents all the iu,jui_{u},j_{u} which are assigned the same element of [d][d] (see Figure 2). For a graph GG, let vv be its number of vertices, and let dud_{u} be the degree of vertex uu. By construction every monomial maps to a graph with tt edges. Also we need only consider graphs with all even vertex degrees since a monomial whose graph has at least one vertex with odd degree will have at least one random sign σi,ru\sigma_{i,r_{u}} appearing an odd number of times and thus have expectation zero. Then,

where Gt\mathcal{G}^{\prime}_{t} is the set of all directed multigraphs as in Gt\mathcal{G}_{t}, but in which vertices are labeled as well, with distinct labels in [v][v] (see Figure 2; the vertex labels can be arbitrarily permuted).

Eq. (10) used that ηr,1,,ηr,d\eta_{r,1},\ldots,\eta_{r,d} are independent for any rr. For Eq. (11), note that (x22)t=1(\|x\|_{2}^{2})^{t}=1, and the coefficient of u=1vxaudu\prod_{u=1}^{v}x_{a_{u}}^{d_{u}} in its expansion for u=1vdu=2t\sum_{u=1}^{v}d_{u}=2t is (td1/2,,dv/2)\binom{t}{d_{1}/2,\ldots,d_{v}/2}. Meanwhile, the coefficient of this monomial when summing over all i1j1,,itjti_{1}\neq j_{1},\ldots,i_{t}\neq j_{t} for a particular GGtG\in\mathcal{G}_{t} is at most v!v!. For Eq. (12), we move from graphs in Gt\mathcal{G}_{t} to those in Gt\mathcal{G}_{t}^{\prime}, and for any GGtG\in\mathcal{G}_{t} there are exactly v!v! ways to label vertices. This is because for any graph GGtG\in\mathcal{G}_{t} there is a canonical way of labeling the vertices as 1,,v1,\ldots,v since there are no isolated vertices. Namely, the vertices can be labeled in increasing order of when they are first visited by an edge when processing edges in order of increasing label (if two vertices are both visited for the first time simultaneously by some edge, then we can break ties consistently using the direction of the edge). Thus the vertices are all identified by this canonical labeling, implying that the v!v! vertex labelings all give distinct graphs in Gt\mathcal{G}^{\prime}_{t}. Eq. (13) follows since t!tt/ett!\geq t^{t}/e^{t} and

The summation over GG in Eq. (13) is over the GGtG\in\mathcal{G}_{t}^{\prime} with vv vertices. Let us bound this summation for some fixed choice of vertex degrees d1,,dvd_{1},\ldots,d_{v}. For any given ii, consider the set of all graphs Gi\mathcal{G}^{\prime\prime}_{i} on vv labeled vertices with distinct labels in [v][v], and with ii edges with distinct labels in [i][i] (that is, we do not require even edge degrees, and some vertices may even have degree ). For a graph GGiG\in\mathcal{G}^{\prime\prime}_{i}, let dud_{u}^{\prime} represent the degree of vertex uu in GG. For a1,,av>0a_{1},\ldots,a_{v}>0 define the function

Let Gt(d1,,dv)\mathcal{G^{\prime}}_{t}(d_{1},\ldots,d_{v}) be those graphs GGtG\in\mathcal{G}^{\prime}_{t} with vv vertices such that vertex uu has degree dud_{u}. Then

since Gt(d1,,dv)Gt\mathcal{G}^{\prime}_{t}(d_{1},\ldots,d_{v})\subset\mathcal{G}^{\prime\prime}_{t}. To upper bound St(a1,,av)S_{t}(a_{1},\ldots,a_{v}), note S0(a1,,av)=1S_{0}(a_{1},\ldots,a_{v})=1. For i>1i>1, note any graph in Gi\mathcal{G^{\prime\prime}}_{i} can be formed by taking a graph GGi1G\in\mathcal{G^{\prime\prime}}_{i-1} and adding an edge labeled ii from uu to ww for some vertices uwu\neq w in GG. This change causes du,dwd_{u}^{\prime},d_{w}^{\prime} to both increase by 11, whereas all other degrees stay the same. Thus considering Eq. (14),

with the last inequality using Cauchy-Schwarz. Thus by induction, St(a1,,av)(u=1vau)tvtS_{t}(a_{1},\ldots,a_{v})\leq(\sum_{u=1}^{v}a_{u})^{t}\cdot v^{t}. Since u=1vdu=2t\sum_{u=1}^{v}d_{u}=2t, we have St(d1,,dv)(2tv)tS_{t}(d_{1},\ldots,d_{v})\leq(2tv)^{t}. We then have that the summation in Eq. (13) is at most the number of choices of even d1,,dvd_{1},\ldots,d_{v} summing to 2t2t (there are (t1v1)<2t\binom{t-1}{v-1}<2^{t} such choices), times (2tv)t(2tv)^{t}, implying

By differentiation, the quantity (s/k)vvt(s/k)^{v}v^{t} is maximized for v=max{2,t/ln(k/s)}v=\max\left\{2,t/\ln(k/s)\right\} (recall v2v\geq 2), giving our lemma. \blacksquare

Proof. We use Lemma 11. In the case t<2ln(k/s)t<2\ln(k/s) we can multiply the (s/k)2(s/k)^{2} term by ttt^{t} and still obtain an upper bound, and in the case of larger tt we have (t/ln(k/s))ttt(t/\ln(k/s))^{t}\leq t^{t} since ksk\geq s. Also when t2ln(k/s)t\geq 2\ln(k/s) we have et(s/k)21e^{t}(s/k)^{2}\geq 1, so that t(2e2)tttt(2e3)t(s/k)2ttt(2e^{2})^{t}t^{t}\leq t(2e^{3})^{t}(s/k)^{2}t^{t}. \blacksquare

It is worth noting that if one wants distortion 1±εi1\pm\varepsilon_{i} with probability 1δi1-\delta_{i} simultaneously for all ii in some set SS, our proof of Theorem 13 reveals that it suffices to set s=CsupiSεi1log(1/δi)s=C\cdot\sup_{i\in S}\varepsilon_{i}^{-1}\log(1/\delta_{i}) and k=CsupiSεi2log(1/δi)k=C\cdot\sup_{i\in S}\varepsilon_{i}^{-2}\log(1/\delta_{i}).

Tightness of analyses

The main theorem of this section is the following.

The DKS construction of requires sparsity s=Ω(ε1log2(1/δ)/log2(1/ε))s=\Omega(\varepsilon^{-1}\cdot\left\lceil\log^{2}(1/\delta)/\log^{2}(1/\varepsilon)\right\rceil) to achieve distortion 1±ε1\pm\varepsilon with success probability 1δ1-\delta.

Our proof will use the following standard fact.

which is much larger than δ/2\delta/2 for δ\delta smaller than some constant. Now, given a collision, the colliding items have the same sign with probability 1/21/2.

which is much larger than δ/2\delta/2 for δ\delta smaller than some constant. Now, given a collision, the colliding items have the same sign with probability 1/81/8.

We lastly consider the case 4/ε<s2cε1log2(1/δ)/log2(1/ε)4/\varepsilon<s\leq 2c\varepsilon^{-1}\log^{2}(1/\delta)/\log^{2}(1/\varepsilon) for some constant c>0c>0 (depending on CC) to be determined later. First note this case only exists when δ=O(ε)\delta=O(\varepsilon). Define x=(1,0,,0)x=(1,0,\ldots,0). Suppose there exists an integer qq so that

First we show it is possible to satisfy the above conditions simultaneously for our range of ss. We set q=2εsq=2\sqrt{\varepsilon s}, satisfying item 1 trivially, and item 2 since s>4/εs>4/\varepsilon. For item 3, Fact 17 gives

The es/k(1(s/k2))e^{-s/k}\cdot(1-(s/k^{2})) term is at least δ1/6\delta^{1/6} by the settings of s,ks,k, and the (s/(qk))q(s/(qk))^{q} term is also at least δ1/6\delta^{1/6} for cc sufficiently small.

2 Tightness of Figure 1(b) analysis

For δ\delta smaller than a constant depending on CC for k=Cε2log(1/δ)k=C\varepsilon^{-2}\log(1/\delta), the graph construction of Section 4 requires s=Ω(ε1log(1/δ))s=\Omega(\varepsilon^{-1}\log(1/\delta)) to obtain distortion 1±ε1\pm\varepsilon with probability 1δ1-\delta.

Proof. First suppose s1/(2ε)s\leq 1/(2\varepsilon). We consider a vector with t=1/(sε)t=\left\lfloor 1/(s\varepsilon)\right\rfloor non-zero coordinates each of value 1/t1/\sqrt{t}. If there is exactly one set i,j,ri,j,r with iji\neq j such that Sr,i,Sr,jS_{r,i},S_{r,j} are both non-zero for the embedding matrix SS (i.e., there is exactly one collision), then the total error is 2/(ts)2ε2/(ts)\geq 2\varepsilon. It just remains to show that this happens with probability larger than δ\delta. The probability of this occurring is

Now consider the case 1/(2ε)<s<cε1log(1/δ)1/(2\varepsilon)<s<c\cdot\varepsilon^{-1}\log(1/\delta) for some small constant cc. Consider the vector (1/2,1/2,0,,0)(1/\sqrt{2},1/\sqrt{2},0,\ldots,0). Suppose there are exactly 2sε2s\varepsilon collisions, i.e. 2sε2s\varepsilon distinct values of rr such that Sr,i,Sj,rS_{r,i},S_{j,r} are both non-zero (to avoid tedium we disregard floors and ceilings and just assume sεs\varepsilon is an integer). Also, suppose that in each colliding row rr we have σ(1,r)=σ(2,r)\sigma(1,r)=\sigma(2,r). Then, the total error would be 2ε2\varepsilon. It just remains to show that this happens with probability larger than δ\delta. The probability of signs agreeing in exactly 2εs2\varepsilon s chunks is 22εs>22clog(1/δ)2^{-2\varepsilon s}>2^{-2c\log(1/\delta)}, which is larger than δ\sqrt{\delta} for c<1/4c<1/4. The probability of exactly 2εs2\varepsilon s collisions is

It suffices for the right hand side to be at least δ\sqrt{\delta} since hh is independent of σ\sigma, and thus the total probability of error larger than 2ε2\varepsilon would be greater than δ2=δ\sqrt{\delta}^{2}=\delta. Taking natural logarithms, it suffices to have

Writing s=q/εs=q/\varepsilon and a=4Clog(1/δ)a=4C\log(1/\delta), the left hand side is 2qln(a/q)+Θ(s2/k)2q\ln(a/q)+\Theta(s^{2}/k). Taking a derivative shows 2qln(a/q)2q\ln(a/q) is monotonically increasing for q<a/eq<a/e. Thus as long as q<caq<ca for a sufficiently small constant cc, 2qln(a/q)<ln(1/δ)/42q\ln(a/q)<\ln(1/\delta)/4. Also, the Θ(s2/k)\Theta(s^{2}/k) term is at most ln(1/δ)/4\ln(1/\delta)/4 for cc sufficiently small. \blacksquare

3 Tightness of Figure 1(c) analysis

For δ\delta smaller than a constant depending on CC for k=Cε2log(1/δ)k=C\varepsilon^{-2}\log(1/\delta), the block construction of Section 4 requires s=Ω(ε1log(1/δ))s=\Omega(\varepsilon^{-1}\log(1/\delta)) to obtain distortion 1±ε1\pm\varepsilon with probability 1δ1-\delta.

Proof. First suppose s1/(2ε)s\leq 1/(2\varepsilon). Consider a vector with t=1/(sε)t=\left\lfloor 1/(s\varepsilon)\right\rfloor non-zero coordinates each of value 1/t1/\sqrt{t}. If there is exactly one set i,j,ri,j,r with iji\neq j such that h(i,r)=h(j,r)h(i,r)=h(j,r) (i.e. exactly one collision), then the total error is 2/(ts)2ε2/(ts)\geq 2\varepsilon. It just remains to show that this happens with probability larger than δ\delta.

The probability of exactly one collision is

which is larger than δ\delta for δ\delta smaller than a universal constant.

Now consider 1/(2ε)<s<cε1log(1/δ)1/(2\varepsilon)<s<c\cdot\varepsilon^{-1}\log(1/\delta) for some small constant cc. Consider the vector x=(1/2,1/2,0,,0)x=(1/\sqrt{2},1/\sqrt{2},0,\ldots,0). Suppose there are exactly 2sε2s\varepsilon collisions, i.e. 2sε2s\varepsilon distinct values of rr such that h(1,r)=h(2,r)h(1,r)=h(2,r) (to avoid tedium we disregard floors and ceilings and just assume sεs\varepsilon is an integer). Also, suppose that in each colliding chunk rr we have σ(1,r)=σ(2,r)\sigma(1,r)=\sigma(2,r). Then, the total error would be 2ε2\varepsilon. It just remains to show that this happens with probability larger than δ\delta. The probability of signs agreeing in exactly 2εs2\varepsilon s chunks is 22εs>22clog(1/δ)2^{-2\varepsilon s}>2^{-2c\log(1/\delta)}, which is larger than δ\sqrt{\delta} for c<1/4c<1/4. The probability of exactly 2εs2\varepsilon s collisions is

The above is at most δ\sqrt{\delta}, by the analysis following Eq. (22). Since hh is independent of σ\sigma, the total probability of having error larger than 2ε2\varepsilon is greater than δ2=δ\sqrt{\delta}^{2}=\delta. \blacksquare

Faster numerical linear algebra streaming algorithms

Now, the following theorem is a generalization of [10, Theorem 2.1]. The theorem states that any distribution with JL moments also provides a sketch for approximate matrix products. A similar statement was made in [34, Lemma 6], but that statement was slightly weaker in its parameters because it resorted to a union bound, which we avoid by using Minkowski’s inequality.

Often when one constructs a JL distribution D\mathcal{D} over k×dk\times d matrices, it is shown that for all xx with x2=1\|x\|_{2}=1 and for all ε>0\varepsilon>0,

Any such distribution automatically satisfies the (ε,eΩ(ε2k+εk),min{ε2k,εk})(\varepsilon,e^{-\Omega(\varepsilon^{2}k+\varepsilon k)},\min\{\varepsilon^{2}k,\varepsilon k\})-JL moment property for any ε>0\varepsilon>0 by converting the tail bound into a moment bound via integration by parts.

Now we arrive at the main point of this section. Several algorithms for approximate linear regression and best rank-kk approximation in simply maintain SASA as AA is updated, where SS comes from the JL distribution with Ω(log(1/δ))\Omega(\log(1/\delta))-wise independent ±1/k\pm 1/\sqrt{k} entries. In fact though, their analyses of their algorithms only use the fact that this distribution satisfies the approximate matrix product sketch guarantees of Theorem 21. Due to Theorem 21 though, we know that any distribution satisfying the (ε,δ)(\varepsilon,\delta)-JL moment condition gives an approximate matrix product sketch. Thus, random Bernoulli matrices may be replaced with our sparse JL distributions in this work. We now state some of the algorithmic results given in and describe how our constructions provide improvements in the update time (the time to process new columns, rows, or turnstile updates).

As in , when stating our results we will ignore the space and time complexities of storing and evaluating the hash functions in our JL distributions. We discuss this issue later in Remark 26.

There is a one-pass streaming algorithm for linear regression in the turnstile model where one maintains a sketch of size O(n2ε1log(1/δ)log(nd))O(n^{2}\varepsilon^{-1}\log(1/\delta)\log(nd)). Processing each update requires O(n+n/εlog(1/δ))O(n+\sqrt{n/\varepsilon}\cdot\log(1/\delta)) arithmetic operations and hash function evaluations.

Theorem 24 improves the update complexity of , which was O(nε1log(1/δ))O(n\varepsilon^{-1}\log(1/\delta)).

2 Low rank approximation

Theorem 4.4 of gives a 22-pass algorithm where in the first pass, one maintains SASA where SS is drawn from a distribution that simultaneously satisfies both the (1/2,ηrδ)(1/2,\eta^{-r}\delta) and (ε/r,δ)(\sqrt{\varepsilon/r},\delta)-JL moment properties for some fixed constant η>1\eta>1 in their proof. It is also assumed that ρ2r+1\rho\geq 2r+1. The first pass is thus sped up again as in Theorem 24.

One-pass algorithm for column/row-wise updates:

Theorem 4.5 of gives a one-pass algorithm in the case that AA is seen either one whole column or row at a time. The algorithm maintains both SASA and SAATSAA^{T} where SS is drawn from a distribution that simultaneously satisfies both the (1/2,ηrδ)(1/2,\eta^{-r}\delta) and (ε/r,δ)(\sqrt{\varepsilon/r},\delta)-JL moment properties. This implies the following.

There is a one-pass streaming algorithm for approximate low rank approximation with row/column-wise updates where one maintains a sketch of size O(rε1(n+d)log(1/δ)log(nd))O(r\varepsilon^{-1}(n+d)\log(1/\delta)\log(nd)). Processing each update requires O(r+r/εlog(1/δ))O(r+\sqrt{r/\varepsilon}\cdot\log(1/\delta)) amortized arithmetic operations and hash function evaluations per entry of AA.

Theorem 25 improves the amortized update complexity of , which was O(rε1log(1/δ))O(r\varepsilon^{-1}\log(1/\delta)).

Three-pass algorithm for row-wise updates:

Theorem 4.6 of gives a three-pass algorithm using less space in the case that AA is seen one row at a time. Again, the first pass simply maintains SASA where SS is drawn from a distribution that satisfies both the (1/2,ηrδ)(1/2,\eta^{-r}\delta) and (ε/r,δ)(\sqrt{\varepsilon/r},\delta)-JL moment properties. This pass is sped up using our sparser JL distribution.

One-pass algorithm in the turnstile model, bi-criteria:

Theorem 4.7 of gives a one-pass algorithm under turnstile updates where SASA and RATRA^{T} are maintained in the stream. SS is drawn from a distribution satisfying both the (1/2,ηrlog(1/δ)/εδ)(1/2,\eta^{-r\log(1/\delta)/\varepsilon}\delta) and (ε/rlog(1/δ),δ)(\varepsilon/\sqrt{r\log(1/\delta)},\delta)-JL moment properties. RR is drawn from a distribution satisfying both the (1/2,ηrδ)(1/2,\eta^{-r}\delta) and (ε/r,δ)(\sqrt{\varepsilon/r},\delta)-JL moment properties. Theorem 4.7 of then shows how to compute a matrix of rank O(rε1log(1/δ))O(r\varepsilon^{-1}\log(1/\delta)) which achieves the desired error guarantee given SASA and RATRA^{T}.

One-pass algorithm in the turnstile model:

Theorem 4.9 of gives a one-pass algorithm under turnstile updates where SASA and RATRA^{T} are maintained in the stream. SS is drawn from a distribution satisfying both the (1/2,ηrlog(1/δ)/ε2δ)(1/2,\eta^{-r\log(1/\delta)/\varepsilon^{2}}\delta) and (εε/(rlog(1/δ)),δ)(\varepsilon\sqrt{\varepsilon/(r\log(1/\delta))},\delta)-JL moment properties. RR is drawn from a distribution satisfying both the (1/2,ηrδ)(1/2,\eta^{-r}\delta) and (ε/r,δ)(\sqrt{\varepsilon/r},\delta)-JL moment properties. Theorem 4.9 of then shows how to compute a matrix of rank rr which achieves the desired error guarantee given SASA and RATRA^{T}.

Let R\mathbf{R} be a ring, and let qR[x]q\in\mathbf{R}[x] be a degree-tt polynomial. Then, given distinct x1,,xtRx_{1},\ldots,x_{t}\in\mathbf{R}, all the values q(x1),,q(xt)q(x_{1}),\ldots,q(x_{t}) can be computed using O(tlog2tloglogt)O(t\log^{2}t\log\log t) operations over R\mathbf{R}.

Open Problems

In this section we state two explicit open problems. For the first, observe that our graph construction is quite similar to a sparse JL construction of Achlioptas . The work of proposes a random normalized sign matrix where each column has an expected number ss of non-zero entries, so that in the notation of this work, the ηi,j\eta_{i,j} are i.i.d. Bernoulli with expectation s/ks/k. Using this construction, was able to achieve s=k/3s=k/3 without causing kk to increase over analyses of dense constructions, even by a constant factor. Meanwhile, our graph construction requires that there be exactly ss non-zero entries per column. This sole change was the reason we were able to obtain better asymptotic bounds on the sparsity of SS in this work, but in fact we conjecture an even stronger benefit than just asymptotic improvement. The first open problem is to resolve the following conjecture.

A positive resolution of this conjecture would imply that not only does our graph construction obtain better asymptotic performance than , but in fact obtains stronger performance in a very definitive sense.

Question: Can we obtain a tight lower bound of s=Ω(ε1log(1/δ))s=\Omega(\varepsilon^{-1}\log(1/\delta)) for distributional JL in the case that k=O(ε2log(1/δ))<d/2k=O(\varepsilon^{-2}\log(1/\delta))<d/2, thus removing the O(log(1/ε))O(\log(1/\varepsilon)) factor gap?

Acknowledgments

References