Optimal Data-Dependent Hashing for Approximate Near Neighbors

Alexandr Andoni, Ilya Razenshteyn

We show an optimal data-dependent hashing scheme for the approximate near neighbor problem. For an nn-point data set in a dd-dimensional space our data structure achieves query time O(dnρ+o(1))O(d n^{ρ+o(1)}) and space O(n1+ρ+o(1)+dn)O(n^{1+ρ+o(1)} + dn), where ρ=12c21ρ=\tfrac{1}{2c^2-1} for the Euclidean space and approximation c>1. For the Hamming space, we obtain an exponent of ρ=12c1ρ=\tfrac{1}{2c-1}. Our result completes the direction set forth in [AINR14] who gave a proof-of-concept that data-dependent hashing can outperform classical Locality Sensitive Hashing (LSH). In contrast to [AINR14], the new bound is not only optimal, but in fact improves over the best (optimal) LSH data structures [IM98,AI06] for all approximation factors c>1. From the technical perspective, we proceed by decomposing an arbitrary dataset into several subsets that are, in a certain sense, pseudo-random.