Beyond Locality Sensitive Hashing Alex Andoni (Microsoft Research) Joint with: Piotr Indyk (MIT), Huy L. Nguyen (Princeton), Ilya Razenshteyn (MIT)



Nearest Neighbor Search (NNS)

Motivation
Generic setup:
  Points model objects (e.g. images)
  Distance models (dis)similarity measure
Application areas:
  machine learning: k-NN rule
  image/video/music recognition, deduplication, bioinformatics, etc.
Distance can be: Hamming, Euclidean, …
Primitive for other problems: find the similar pairs, clustering, …

Approximate NNS
c-approximate
[Figure: points q and p, radii r and cr]
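The c-approximate guarantee can be stated operationally: if some data point lies within distance r of the query q, it is enough to return any point within distance cr. A minimal sketch as a brute-force linear scan, assuming Hamming distance over equal-length bit strings; the function names `hamming` and `approx_near_neighbor` are illustrative and do not come from the talk.

```python
from typing import Optional, Sequence


def hamming(a: str, b: str) -> int:
    """Number of positions where two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def approx_near_neighbor(points: Sequence[str], q: str, r: int, c: float) -> Optional[str]:
    """Return any point within distance c*r of q, or None if there is none.

    This meets the c-approximate guarantee: whenever some point lies
    within distance r of q, a point within distance c*r is returned.
    """
    for p in points:
        if hamming(p, q) <= c * r:
            return p
    return None


points = ["0000", "0111", "1111"]
result = approx_near_neighbor(points, "0001", r=1, c=2)
```

The scan takes time linear in the number of points, which is the baseline that sublinear-time structures such as LSH aim to beat.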

Locality-Sensitive Hashing [Indyk-Motwani’98]
[Figure: points q and p; annotation “not-so-small”]

Locality sensitive hash functions [Indyk-Motwani’98]
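As one concrete instance of a locality-sensitive family, the bit-sampling scheme for Hamming distance from [Indyk-Motwani’98] hashes a point to a few randomly chosen coordinates, so that close points collide more often than far ones. The sketch below is illustrative only: the class name `BitSamplingLSH`, the helper `collision_probability`, and the parameters `k` (bits per hash) and `L` (number of tables) are assumptions, not notation recovered from the slides.

```python
import random
from collections import defaultdict


def collision_probability(dist: int, dim: int, k: int = 1) -> float:
    """Chance that two points at Hamming distance `dist` share a k-bit key
    when the k coordinates are sampled independently with replacement."""
    return (1 - dist / dim) ** k


class BitSamplingLSH:
    """Hypothetical LSH index for Hamming space using bit sampling."""

    def __init__(self, dim: int, k: int, L: int, seed: int = 0):
        rng = random.Random(seed)
        # Each of the L tables hashes a point to k randomly chosen coordinates.
        self.coords = [[rng.randrange(dim) for _ in range(k)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, t: int, p: str) -> str:
        return "".join(p[i] for i in self.coords[t])

    def insert(self, p: str) -> None:
        for t, table in enumerate(self.tables):
            table[self._key(t, p)].append(p)

    def query(self, q: str, radius: float):
        """Return a colliding point within `radius` of q, or None."""
        for t, table in enumerate(self.tables):
            for p in table.get(self._key(t, q), []):
                if sum(a != b for a, b in zip(p, q)) <= radius:
                    return p
        return None


index = BitSamplingLSH(dim=8, k=2, L=10)
index.insert("00000000")
index.insert("11111111")
answer = index.query("00000001", radius=2)
```

A query only inspects points that share a bucket with q, so it can miss a near neighbor with some probability; using more tables lowers that failure probability at the cost of space.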

Algorithms and Lower Bounds
Table columns: Space, Time, Comment, Reference
References, in row order: [IM’98]; [PTW’08, PTW’10]; [IM’98]; [DIIM’04, AI’06]; [MNP’06]; [OWZ’11]; [PTW’08, PTW’10]; [MNP’06]; [OWZ’11]

LSH is tight… leave the rest to cell-probe lower bounds?

Main Result

A look at LSH lower bounds [O’Donnell-Wu-Zhou’11]

Why not NNS lower bound?

Our algorithm: intuition

Nice Configuration: “sparsity”

Reduction: into spherical LSH

Two-level algorithm

Details

Practice
Practice uses data-dependent partitions!
  “wherever theoreticians suggest to use random dimensionality reduction, use PCA”
Lots of variants
  Trees: kd-trees, quad-trees, ball-trees, rp-trees, PCA-trees, sp-trees, …
  no guarantees: e.g., they are deterministic
Is there a better way to do partitions in practice?
Why do PCA-trees work?
  [Abdullah-A-Kannan-Krauthgamer]: if the data has more structure
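A minimal sketch of one data-dependent split, of the kind a PCA-tree performs: project the points onto their top principal direction and cut at the median projection. It is an illustration under assumptions (pure-Python power iteration, a fixed iteration count, a deterministic start vector, hypothetical function names), not the procedure analyzed in the talk.

```python
from statistics import median
from typing import List, Sequence, Tuple

Point = Sequence[float]


def top_principal_direction(points: Sequence[Point], iters: int = 100) -> List[float]:
    """Approximate the top eigenvector of the covariance by power iteration."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    v = [1.0 / d ** 0.5] * d  # deterministic start; an assumption of this sketch
    for _ in range(iters):
        # Apply X^T X to v without forming the covariance matrix.
        proj = [sum(x[j] * v[j] for j in range(d)) for x in centered]
        w = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = sum(c * c for c in w) ** 0.5
        if norm == 0.0:
            break  # no variance along v; keep the current direction
        v = [c / norm for c in w]
    return v


def pca_split(points: Sequence[Point]) -> Tuple[List[Point], List[Point]]:
    """Split points at the median of their projection onto the top direction."""
    v = top_principal_direction(points)
    proj = [sum(a * b for a, b in zip(p, v)) for p in points]
    cut = median(proj)
    left = [p for p, s in zip(points, proj) if s <= cut]
    right = [p for p, s in zip(points, proj) if s > cut]
    return left, right


data = [(0, 0), (1, 0.1), (2, -0.1), (3, 0)]
left, right = pca_split(data)
```

Applying `pca_split` recursively to each half would give a tree. Unlike a random projection, the split direction here is determined entirely by the data, which is the sense in which such partitions are deterministic.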

Finale