Tight Lower Bounds for Data-Dependent Locality-Sensitive Hashing

Alexandr Andoni, Ilya Razenshteyn

Introduction

We study lower bounds for the high-dimensional nearest neighbor search problem, which is a problem of major importance in several areas, such as databases, data mining, information retrieval, computer vision, computational geometry, signal processing, etc. This problem suffers from the “curse of dimensionality” phenomenon: either space or query time are exponential in the dimension dd. To escape this curse, researchers proposed approximation algorithms for the problem. In the (c,r)(c,r)-approximate near neighbor problem, the data structure may return any data point whose distance from the query is at most crcr, for an approximation factor c>1c>1 (provided that there exists a data point within distance rr from the query). Many approximation algorithms are known for this problem: e.g., see surveys [Sam06, AI08, And09, WSSJ14].

An influential algorithmic technique for the approximate near neighbor search (ANN) is the Locality Sensitive Hashing (LSH) [IM98, HPIM12]. The main idea is to hash the points so that the probability of collision is much higher for points that are close to each other (at distance r\leq r) than for those which are far apart (at distance >cr>cr). Given such hash functions, one can retrieve near neighbors by hashing the query point and retrieving elements stored in buckets containing that point. If the probability of collision is at least p1p_{1} for the close points and at most p2p_{2} for the far points, the algorithm solves the (c,r)(c,r)-ANN using essentially O(n1+ρ/p1)O(n^{1+\rho}/p_{1}) extra space and O(dnρ/p1)O(dn^{\rho}/p_{1}) query time, where ρ=log(1/p1)log(1/p2)\rho=\tfrac{\log(1/p_{1})}{\log(1/p_{2})} [HPIM12]. The value of the exponent ρ\rho thus determines the “quality” of the LSH families used.

Here we prove that the exponent ρ=12cp1\rho=\tfrac{1}{2c^{p}-1} from [AR15] is essentially optimal even for data-dependent hashing, and cannot be improved upon. Stating the precise theorem requires introducing the precise model for the lower bound, which we accomplish below. For now, we state our main theorem informally:

as long as the description complexity of the hash functions is sufficiently small.

1 Lower Bound Model

To state the precise theorem, we need to formally describe what is data-dependent hashing. First, we state the definition of (data-independent) LSH, as well as define LSH for a fixed dataset PP.

We say that a hash family H\mathcal{H} over {0,1}d\{0,1\}^{d} is (r1,r2,p1,p2)(r_{1},r_{2},p_{1},p_{2})-sensitive, if for every u,v{0,1}du,v\in\{0,1\}^{d} one has:

We now refine the notion of data-independent LSH, where we require the distribution to work only for a particular dataset PP.

A hash family H\mathcal{H} over {0,1}d\{0,1\}^{d} is said to be (r1,r2,p1,p2)(r_{1},r_{2},p_{1},p_{2})-sensitive for a dataset P{0,1}dP\subseteq\{0,1\}^{d}, if:

for every v{0,1}dv\in\{0,1\}^{d} and every uPu\in P with uv1r1\|u-v\|_{1}\leq r_{1} one has

Note that the second definition is less stringent than the first one: in fact, if all the points in a dataset are at distance more than r2r_{2} from each other, then an LSH family H\cal H is also LSH for PP, but not necessarily vice versa! Furthermore, in the second definition, we require the second property to hold only on average (in contrast to every point as in the first definition). This aspect means that, while Definition 1.3 is certainly necessary for an ANN data structure, it is not obviously sufficient. Indeed, the algorithm from [AR15] requires proving additional properties of their partitioning scheme, and in particular analyzes triples of points. Since here we focus on lower bounds, this aspect will not be important.

We are now ready to introduce data-dependent hashing.

Intuitively, a data-dependent hashing scheme is one where we can pick the family H\cal H as a function of PP, and thus be sensitive for PP. An obvious solution would hence be to choose H\cal H to consist of a single hash function hh which is just the Voronoi diagram of the dataset PP: it will be (r,cr,1,0)(r,cr,1,0)-sensitive for PP and hence ρ=0\rho=0. However this does not yield a good data strcture for ANN since evaluating such a hash function on a query point qq is as hard as the original problem!

Hence, ideally, our lower bound model would require that the hash function is computationally efficient to evaluate. We do not know how to formulate such a condition which would not make the question as hard as circuit lower bounds or high cell-probe lower bounds, which would be well beyond the scope of this paper.

Instead, we introduce a condition on the hash family that can be roughly summarized as “hash functions from the family are succinct”. For a precise definition and discussion see below. For now, let us point out that all the known algorithmic results satisfy this condition.

Finally, we are ready to state the main result formally.

Fix the approximation c>1c>1 to be a constant. Suppose the dataset PP lives in the Hamming space {0,1}d\{0,1\}^{d}, where the dataset size n=Pn=|P| is such that d=ω(logn)d=\omega(\log n) as nn tends to infinity. There exist distance thresholds rr and (co(1))r(c-o(1))r with the following property.

Suppose there exist TT hash functions {hi}1iT\{h_{i}\}_{1\leq i\leq T} over {0,1}d\{0,1\}^{d} such that for every nn-point dataset PP there exists a distribution HP{\cal H}_{P} over hih_{i}’s such that the corresponding hash family is (r,(co(1))r,p1,p2)(r,(c-o(1))r,p_{1},p_{2})-sensitive for PP, where 0<p1,p2<0.990<p_{1},p_{2}<0.99. For any such data-dependent scheme it must either hold that ρ=log1/p1log1/p212c1o(1)\rho=\frac{\log 1/p_{1}}{\log 1/p_{2}}\geq\frac{1}{2c-1}-o(1) or that logTp1n1o(1)\frac{\log T}{p_{1}}\geq n^{1-o(1)}.

Interpreting Theorem 1.4

Let us explain the conditions and the conclusions of Theorem 1.4 in more detail.

We start by interpreting the conclusions. As explained above, the bound ρ=log1/p1log1/p212c1o(1)\rho=\frac{\log 1/p_{1}}{\log 1/p_{2}}\geq\frac{1}{2c-1}-o(1) directly implies the lower bound on the query time n12c1o(1)n^{\frac{1}{2c-1}-o(1)} for any scheme that is based on data-dependent hashing.

The second bound logTp1n1o(1)\frac{\log T}{p_{1}}\geq n^{1-o(1)} is a little bit more mysterious. Let us now explain what it means precisely. The quantity logT\log T can be interpreted as the description complexity of a hash function sampled from the family. At the same time, if we use a family with collision probability p1p_{1} for close points, we need at least 1/p11/p_{1} hash tables to achieve constant probability of success. Since in each hash table we evaluate at least one hash function, the quantity logTp1\frac{\log T}{p_{1}} can be interpreted as the lower bound for the total space occupied by hash functions we evaluate during each query. In all known constructions of (data-independent or data-dependent) LSH families [IM98, DIIM04, AI06, TT07, AINR14, AR15] the evaluation time of a single hash function is comparable to the space it occupies (for discussion regarding why is it true for [AR15], see Appendix A), thus, under this assumption, we can not achieve query time n1Ω(1)n^{1-\Omega(1)}, unless ρ12c1o(1)\rho\geq\frac{1}{2c-1}-o(1). On the other hand, we can achieve ρ=0\rho=0 by considering a data-dependent hash family that consists only of the Voronoi diagram of a dataset (trivially, p1=1p_{1}=1 and p2=0p_{2}=0 for this case), thus the conclusion logTp1n1o(1)\frac{\log T}{p_{1}}\geq n^{1-o(1)} can not be omitted in generalFor the Voronoi diagram, logTn\log T\geq n, since to specify it, one needs at least nn bits.. Note that the “Voronoi diagram family” is very slow to evaluate: to locate a point we need to solve an instance of exact Nearest Neighbor Search, that is unlikely to be possible to do in strongly sublinear time. Thus, this family satisfies the above assumption “evaluation time is comparable to the space”.

We now turn to interpreting the conditions. We require that d=ω(logn)d=\omega(\log n)When ANN for the general dimension dd is being solved, one usually first performs some form of the dimensionality reduction [JL84, KOR00, DG03]. Since at this stage we do not want distances to be distorted by a factor more than 1+o(1)1+o(1), the target dimension is precisely ω(logn)\omega(\log n). So, the assumption d=ω(logn)d=\omega(\log n) in Theorem 1.4 in some sense captures a truly high-dimensional case.. We conjecture that this requirement is necessary and for d=O(logn)d=O(\log n) there is an LSH family that gives a better value of ρ\rho (the improvement, of course, would depend on the hidden constant in the expression d=O(logn)d=O(\log n)). Moreover, if one steps outside the pure data-dependent LSH framework, in a recent paper [BDGL15] an improved data structure for ANN for the case d=O(logn)d=O(\log n) is presented, which achieves an improvement similar to what we conjectured above.

2 Techniques and Related Work

There are two components to our lower bound, i.e., Theorem 1.4.

The first component is a lower bound for data-independent LSH for a random dataset. We show that in this case, we must have ρ12c1o(1)\rho\geq\tfrac{1}{2c-1}-o(1). This is in contrast to the lower bound of [OWZ11], who achieve a higher lower bound but for the case when the (far) points are correlated. Our lower bound is closer in spirit to [MNP07], who also consider the case when the far points are random uncorrelated. In fact, this component is a strengthening of the lower bound from [MNP07], and is based crucially on an inequality proved there.

We mention that, in [Dub10], Dubiner has also considered the setting of a random dataset for a related problem—finding the closest pair in a given dataset PP. Dubiner sets up a certain related “bucketing” model, in which he conjectures the lower bound, which would imply a ρ12c1\rho\geq\tfrac{1}{2c-1} lower bound for data-independent LSH for a random set. Dubiner verifies the conjecture computationally and claims it is proved in a different manuscript.This manuscript does not appear to be available at the moment of writing of the present paper.

Our second component focuses on the data-dependent aspect of the lower bound. In particular, we prove that if there exists a data-dependent hashing scheme for a random dataset with a better ρ\rho, then in fact there is also a data-independent such hashing scheme. To accomplish this, we consider the “empirical” average p1p_{1} and p2p_{2} for a random dataset, and prove it is close to the average p1p_{1} and p2p_{2}, for which we can deduce a lower bound from the first component.

In terms of related work, we also must mention the papers of [PTW08, PTW10], who prove cell-probe lower bounds for the same problem of ANN. In particular, their work utilizes the lower bound of [MNP07] as well. Their results are however incomparable to our results: while their results are unconditional, our model allows us to prove a much higher lower bound, in particular matching the best algorithms.

Data-independent Lower Bound

In this section we prove the first component of Theorem 1.4. Overall we show a lower bound of ρ12c1+o(1)\rho\geq\frac{1}{2c-1}+o(1) for data-independent hash families for random datasets. Our proof is a strengthening of [MNP07]. The final statement appears as Corollary 2.7. In the second component, we will use a somewhat stronger statement, Lemma 2.6).

For u{0,1}du\in\{0,1\}^{d} and non-negative integer kk define a random variable Wk(u)W_{k}(u) distributed over {0,1}d\{0,1\}^{d} to be the resulting point of the standard random walk of length kk that starts in uu (at each step we flip a random coordinate).

We build on the following inequality from [MNP07] that is proved using Fourier analysis on {0,1}d\{0,1\}^{d}.

The above inequality can be thought of as a lower bound on ρ\rho already. In particular, the left-hand-side quantity is probability of collision of a point and a point generated via a random walk of length kk from it. The right-hand side corresponds to collision of random independent points, which are usually at distance d/2d/2. One already obtain a lower bound on ρ\rho by considering k=d/2ck=d/2c.

To strengthen the lower bound of [MNP07], we analyze carefully the distance between the endpoints of a random walk in the hypercube. In particular the next (somewhat folklore) lemmas show that this distance is somewhat smaller than the (trivial) upper bound of kk. Denote XkX_{k} the distance Wk(u)u1\|W_{k}(u)-u\|_{1}. Note that the distribution of XkX_{k} does not depend on a particular starting point uu.

We can now prove that the value of XkX_{k} indeed concentrates well around the expectation, using concentration inequalities.

For t1t\geq 1 define YtY_{t} to be the index of the coordinate that got flipped at time tt. Obviously, we have that XkX_{k} is a (deterministic) function of Y1,YkY_{1},\ldots Y_{k}. Furthermore, changing one YtY_{t} changes XkX_{k} by at most 2. Hence we can apply the McDiarmid’s inequality to Xk=f(Y1,Yk)X_{k}=f(Y_{1},\ldots Y_{k}):

Let Y1,YkY_{1},\ldots Y_{k} be independent random variables and X=f(Y1,Yk)X=f(Y_{1},\ldots Y_{k}) such that changing variable YtY_{t} only changes the value by at most ctc_{t}. Then we have that

Substituting ε=tk\varepsilon=t\cdot\sqrt{k}, we get the result. ∎

Note that choosing kd2lncc1k\approx\tfrac{d}{2}\cdot\ln\tfrac{c}{c-1} will mean that distance Xk=Wk(u)u1X_{k}=\|W_{k}(u)-u\|_{1} is now around d/2cd/2c, i.e., we can actually use random walks longer than the ones considered in [MNP07]. Indeed, from Lemma 2.2 and Lemma 2.3 one can immediately conclude the following corollary.

Let c>1c>1 be a fixed constant. Suppose that γ=γ(d)>0\gamma=\gamma(d)>0 is such that γ=o(1)\gamma=o(1) as dd\to\infty. Then, there exists α=α(d)\alpha=\alpha(d) such that

We are now ready to prove the main lemma of this section, which will be used in the later section on data-dependent hashing. We need to introduce two more definitions. We define “average p1p_{1}” as:

Let c>1c>1 be a fixed constant. Suppose that γ=γ(d)>0\gamma=\gamma(d)>0 is such that γ=o(1)\gamma=o(1) as dd\to\infty. Then, there exist α=α(d)\alpha=\alpha(d), β=β(d)\beta=\beta(d) and ρ=ρ(d)\rho=\rho(d) such that:

We use Lemma 2.5 to choose α=12lncc1+oc,γ(1)\alpha=\frac{1}{2}\cdot\ln\frac{c}{c-1}+o_{c,\gamma}(1) such that αd\alpha\cdot d is an odd integer and

We can choose β=oγ(1)\beta=o_{\gamma}(1) so that

(This follows from the standard Chernoff-type bounds.)

Finally, we combine (3), (1) and (2), and get the desired inequality. ∎

The following corollary shows how the above lemma implies a lower bound on data-independent LSH.

For every c>1c>1 and γ=γ(d)>0\gamma=\gamma(d)>0 such that γ=o(1)\gamma=o(1) there exists β=β(d)>0\beta=\beta(d)>0 with β=oγ(1)\beta=o_{\gamma}(1) such that if H\mathcal{H} is a data-independent (d/(2c),(1/2β)d,p1,p2)(d/(2c),(1/2-\beta)d,p_{1},p_{2})-sensitive family, then

We observe that for every α,β>0\alpha,\beta>0:

Now we apply Lemma 2.6 together with the following application of Jensen’s inequality:

Data-Dependent Hashing

We now prove the second component of the main Theorem 1.4 proof. In particular we show that a very good data-dependent hashing scheme would refute Lemma 2.6 from the previous section.

For a particular dataset PP, we will be interested in empirical probabilities p1,p2p_{1},p_{2} — i.e., the equivalents of ζ,μ\zeta,\mu for a given set PP — defined as follows. Let 0<δ(d)<1/30<\delta(d)<1/3 be some function. Let PP be a random set of points from {0,1}d\{0,1\}^{d} of size 2δ(d)d2^{\delta(d)\cdot d}. The empirical versions of ζ\zeta and η\eta with respect to PP are:

We now what to prove that, for a random dataset PP, the empirical ζ,μ\zeta,\mu are close to the true averages. For this we will need the following auxiliary lemma.

Let MM be an n×nn\times n symmetric matrix with entries from [0;1][0;1] and average ε\varepsilon. Let MM^{\prime} be a principal nδ×nδn^{\delta}\times n^{\delta} submatrix of MM sampled uniformly with replacement. Then, for every θ>0\theta>0, the probability that the maximum of the average over MM^{\prime} and θ\theta does not lie in [1/2;2]max{ε,θ}[1/2;2]\cdot\max\{\varepsilon,\theta\} is at most nδ2Ω(θnδ)n^{\delta}\cdot 2^{-\Omega(\theta n^{\delta})}.

We need the following version of Bernstein’s inequality.

We just apply Lemma 3.2 and take union bound over the rows of MM^{\prime}. ∎

The following two lemmas are immediate corollaries of Lemma 3.2 and Lemma 3.1, respectively.

2 Proof of Theorem 1.4

We are finally ready to complete the proof of the main result, Theorem 1.4.

Let us first assume that p1=o(1)p_{1}=o(1) and then show how to handle the general case. Suppose that n=P=2δdn=|P|=2^{\delta\cdot d}. By the assumption of Theorem 1.4, δ=o(1)\delta=o(1). Let {h1,h2,,hT}\{h_{1},h_{2},\ldots,h_{T}\} be a set of hash functions. We can assume that logTp1n1Ω(1)\frac{\log T}{p_{1}}\leq n^{1-\Omega(1)}, since otherwise we are done. Let us fix γ=γ(d)>0\gamma=\gamma(d)>0 such that γ=o(1)\gamma=o(1) and 2γdp1ω(1)2^{-\gamma d}\ll p_{1}^{\omega(1)}. We can do this, since if p1=2Ω(d)p_{1}=2^{-\Omega(d)}, then 1/p1=nω(1)1/p_{1}=n^{\omega(1)}, and the desired statement is true. Then, by Lemma 2.6, there is α=α(d)\alpha=\alpha(d) and β=β(d)=oγ(1)\beta=\beta(d)=o_{\gamma}(1) such that for every 1iT1\leq i\leq T

and θp1\theta\ll p_{1}. We can do it since, by the above assumption, logTp1n1Ω(1)=2δd(1Ω(1))\frac{\log T}{p_{1}}\leq n^{1-\Omega(1)}=2^{\delta\cdot d(1-\Omega(1))}.

Then, from Lemma 3.3 and Lemma 3.4 we get that, with high probability over the choice of PP, one has for every 1iT1\leq i\leq T:

max{ζ^(c,d,α,hi),θ}[1/2;2]max{ζ(c,d,α,hi),θ}\max\{\widehat{\zeta}(c,d,\alpha,h_{i}),\theta\}\in[1/2;2]\cdot\max\{\zeta(c,d,\alpha,h_{i}),\theta\};

max{η^(d,β,hi),θ}[1/2;2]max{η(d,β,hi),θ}\max\{\widehat{\eta}(d,\beta,h_{i}),\theta\}\in[1/2;2]\cdot\max\{\eta(d,\beta,h_{i}),\theta\}.

Suppose these conditions hold and assume there exist a distribution D\mathcal{D} over [T][T] such that the corresponding hash family is (d/2c,(1/2β(d))d,p1,p2)(d/2c,(1/2-\beta(d))d,p_{1},p_{2})-sensitive for PP. Then,

Averaging (4) and applying Jensen’s inequality, we have

which proves the theorem, since θp1\theta\ll p_{1}, 2γdp1ω(1)2^{-\gamma d}\ll p_{1}^{\omega(1)}, and p1=o(1)p_{1}=o(1).

Now let us deal with the case p1=Ω(1)p_{1}=\Omega(1). This can be reduced to the case p1=o(1)p_{1}=o(1) by choosing a slowly-growing super constant kk and replacing the set of TT functions with the set of TkT^{k} tuples of length kk. This replaces p1p_{1} and p2p_{2} with p1kp_{1}^{k} and p2kp_{2}^{k}, respectively. In the same time, we choose kk so that T=TkT^{\prime}=T^{k} still satisfy the hypothesis of the theorem. Then, we just apply the above proof.

References

Appendix A Upper Bounds Are in the Model

In this section, we show how the data-dependent hash family from [AR15] fits into the model of the lower bound from Section 1.1.

Let us briefly recall the hash family construction from [AR15]. For simplicity, assume that all points and queries lie on a sphere of radius RcrR\gg cr. First, consider a data-independent hash family from [AINR14, AR15], called Spherical LSH. It gives a good exponent ρ12c21+o(1)\rho\leq\frac{1}{2c^{2}-1}+o(1) for distance thresholds rr vs 2R\sqrt{2}R (the latter corresponds to a typical distance between a pair of points from the sphere). The main challenge that arises is how to handle distance thresholds rr vs crcr, where the latter may be much smaller than 2R\sqrt{2}R. Here comes the main insight of [AR15].

We would like to process the dataset so that the distance between a typical pair of data points is around 2R\sqrt{2}R, so to apply Spherical LSH. To accomplish this, we remove all the clusters of radius (2ε)R(\sqrt{2}-\varepsilon)R that contain lots of points (think of ε>0\varepsilon>0 being barely sub-constant). We will treat these clusters separately, and will focus on the remainder of the pointset for now. So for the remainder of the pointset, we just apply the Spherical LSH to it. This scheme turns out to satisfy the definition of the data-dependent hash family for the remaining points and for distance thresholds rr vs. crcr; in particular, the hash function is sensitive for the remaining set only! The intuition is that, for any potential query point, there is only a small number of data points within distance (2ε)R(\sqrt{2}-\varepsilon)R — otherwise, they would have formed yet another cluster, which we would have removed — and the larger distances are handled well by the Spherical LSH. Thus, for a typical pair of data points, Spherical LSH is sensitive for PP (see Definition 1.3).

How does [AR15] deal with the clusters? The main observation is that one can enclose such a cluster in a ball of radius (1Ω(ε2))R(1-\Omega(\varepsilon^{2}))R, which intuitively makes our problem a little bit simpler (after we reduce the radius enough times, the problem becomes trivial). To answer a query, we query every cluster, as well as one part of the remainder (partitioned by the Spherical LSH). This can be shown to work overall, in particular, we can control the overall branching and depth of the recursion (see [AR15] for the details).

For both the clusters, as well as the parts obtained from the Spherical LSH, the algorithm recurses on the obtained point subsets. The overall partitioning scheme from [AR15] can be seen as a tree, where the root corresponds to the whole dataset, and every node either corresponds to a cluster or to an application of the Spherical LSH. One nuance is that, in different parts of the tree the Spherical LSH partitioning obtains different p1,p2p_{1},p_{2} (depending on RR). Nonetheless, each time it holds that p1p2ρp_{1}\geq p_{2}^{\rho} for ρ12c21+o(1)\rho\leq\frac{1}{2c^{2}-1}+o(1). Hence, a node terminates as a leaf once its accumulated p2p_{2} (product of p2p_{2}’s of “Spherical LSH” nodes along the path from the root) drops below 1/n1/n.

We now want to argue that the above algorithm can be recast in the framework of data-dependent hashing as per Definition 1.3. We consider a subtree of the overall tree that contains the root and is defined as follows. Fix a parameter l=no(1)l=n^{-o(1)} (it is essentially the target p2p_{2} of the partition). We perform DFS of the tree and cut the tree at any “Spherical LSH” node where the cumulative p2p_{2} drops below ll. This subtree gives a partial partition of the dataset PP as follows: for “Spherical LSH” nodes we just apply the corresponding partition, and for cluster nodes we “carve” them in the random order. It turns out that if we choose l=no(1)l=n^{-o(1)} carefully, the partition will satisfy Definition 1.3 and the preconditions of Theorem 1.4. In particular, the description complexity of the resulting hash function is no(1)n^{o(1)}.

Let us emphasize that, while Definition 1.3 is certainly necessary for an ANN data structure based on data-dependent hashing, it is not sufficient. In fact, [AR15] prove additional properties of the above partitioning scheme, essentially because the “p2p_{2} property” is “on average” one (thus, we end up having to understand how this partitioning scheme treats triples of points).