Nearest Neighbor Search in High Dimensions
Seminar in Algorithms and Geometry
Mica Arie-Nachimson and Daniel Glasner, April 2009

Talk Outline
– Nearest neighbor problem: motivation
– Classical nearest neighbor methods: KD-trees
– Efficient search in high dimensions: bucketing method, Locality Sensitive Hashing
– Conclusion
Main results: Indyk and Motwani 1998; Gionis, Indyk and Motwani 1999

Nearest Neighbor Problem
Input: a set P of points in R^d (or any metric space).
Output: given a query point q, find the point p* in P which is closest to q.

What is it good for? Many things! Examples: Optical Character Recognition, Spell Checking (e.g., candidates for the misspelling "abaut": about, boat, bat, abate, able, scout, shout), Computer Vision, DNA sequencing, Data compression, and many more… [Figure: a query mapped into a feature space of known examples.]

Approximate Nearest Neighbor (ε-NN)
Input: a set P of points in R^d (or any metric space). Given a query point q, let:
– p* be the point in P closest to q
– r* be the distance ||p* - q||
Output: some point p' with distance at most r*(1+ε).

Approximate vs. Exact Nearest Neighbor: many applications give similar results with approximate NN. An example from Computer Vision:

Retiling [slide from Lihi Zelnik-Manor]

Exact NNS: ~27 sec. Approximate NNS: ~0.6 sec. [Slide from Lihi Zelnik-Manor]

Solution Method
Input: a set P of n points in R^d.
Method: construct a data structure to answer nearest neighbor queries.
Complexity measures:
– Preprocessing: space and time to construct the data structure
– Query: time to return an answer

Solution Method
Naïve approach: preprocessing O(nd); query time O(nd).
Reasonable requirements:
– Preprocessing time and space poly(nd)
– Query time sublinear in n
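For reference, the naïve approach is a few lines of code. A minimal Python sketch (the function name and sample points are our own illustration):

import math

def nearest_neighbor(P, q):
    # Scan all n points; each distance evaluation costs O(d), so a query is O(nd).
    best, best_dist = None, math.inf
    for p in P:
        dist = math.dist(p, q)  # Euclidean distance in R^d
        if dist < best_dist:
            best, best_dist = p, dist
    return best, best_dist

P = [(12, 5), (6, 8), (17, 4), (23, 2), (20, 10), (9, 9), (1, 6)]
print(nearest_neighbor(P, (16, 6)))  # -> ((17, 4), 2.236...)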

Talk Outline
– Nearest neighbor problem: motivation
– Classical nearest neighbor methods: KD-trees
– Efficient search in high dimensions: bucketing method, Locality Sensitive Hashing
– Conclusion

Classical nearest neighbor methods
– Tree structures: kd-trees
– Voronoi diagrams: preprocessing poly(n), exp(d); query log(n), exp(d)
Nearest neighbor is a difficult problem in high dimensions: the classical solutions still work, but are exp(d)…

KD-tree, d=1 (a binary search tree)
[Figure: a search tree over the 1-D points 5, 7, 8, 10, 12, 13, 15, 18, 20; a query for 17 walks down the tree keeping the minimum distance seen so far (min dist = 1); a query for 16 first records min dist = 2, then improves it to min dist = 1.]

KD-tree, d>1: alternate between the dimensions. Example for d=2: split on x, then y, then x.
[Figure: a 2-D kd-tree over the points (12,5), (6,8), (17,4), (23,2), (20,10), (9,9), (1,6).]
[Animated figure: NN search descending the tree and backtracking.]

KD-tree: complexity
– Preprocessing: O(nd)
– Query: O(log n) if the points are randomly distributed; worst case O(k·n^(1-1/k)) for dimension k, which is almost linear in n once k is large (on the order of log n), i.e., the search may need to visit essentially the whole tree.
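To make the kd-tree concrete, here is a compact Python sketch (ours, not the authors'): construction alternates the split dimension with depth, and the search backtracks into the far subtree only when the splitting plane is closer than the best distance found so far.

import math

def build(points, depth=0):
    # Split on dimensions in round-robin order; the median point becomes the node.
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build(points[:mid], depth + 1),
            "right": build(points[mid + 1:], depth + 1)}

def nn(node, q, best=None):
    # Branch-and-bound nearest neighbor search.
    if node is None:
        return best
    p, axis = node["point"], node["axis"]
    if best is None or math.dist(q, p) < math.dist(q, best):
        best = p
    near, far = ((node["left"], node["right"]) if q[axis] < p[axis]
                 else (node["right"], node["left"]))
    best = nn(near, q, best)
    # Visit the far side only if the splitting plane could hide a closer point;
    # in the worst case both subtrees are searched, hence the near-linear bound.
    if abs(q[axis] - p[axis]) < math.dist(q, best):
        best = nn(far, q, best)
    return best

tree = build([(12, 5), (6, 8), (17, 4), (23, 2), (20, 10), (9, 9), (1, 6)])
print(nn(tree, (16, 6)))  # -> (17, 4)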

Talk Outline
– Nearest neighbor problem: motivation
– Classical nearest neighbor methods: KD-trees
– Efficient search in high dimensions: bucketing method, Locality Sensitive Hashing
– Conclusion

Sublinear solutions (query times are linear in d; log n factors are not counted; ε-NN is solved by reduction to r-PLEB):

Method    | Preprocessing                        | Query time
Bucketing | n^O(1/ε²)                            | O(log n)
LSH       | O(n^(1+1/(1+ε))) [n^(3/2) when ε=1]  | O(n^(1/(1+ε))) [sqrt(n) when ε=1]

r-PLEB (Point Location in Equal Balls)
Given n balls of radius r, for every query q: find a ball that q resides in, if one exists, and return its center p_i; if q doesn't reside in any ball, return NO.

Reduction from ε-NN to r-PLEB
The two problems are connected: r-PLEB is like a decision problem for ε-NN.

Reduction from ε-NN to r-PLEB: naïve approach
– Set R = the ratio between the largest and the smallest distance between two points
– Define the radii r ∈ {(1+ε)^0, (1+ε)^1, …, R}
– For each r_i construct an r_i-PLEB
– Given q, use binary search to find the smallest r_i which gives a YES

Naïve approach, correctness: the binary search stopped at r_i = (1+ε)^k giving NO, while r_{i+1} = (1+ε)^(k+1) gave YES, so (1+ε)^k ≤ r* ≤ (1+ε)^(k+1) and the returned point is a (1+ε)-approximate nearest neighbor.

Naïve approach, reduction overhead:
– Space: O(log_{1+ε} R) r-PLEB constructions (the size of {(1+ε)^0, (1+ε)^1, …, R} is log_{1+ε} R)
– Query: O(log log_{1+ε} R) calls to r-PLEB
– Drawback: dependency on R
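A sketch of the naïve reduction in Python, assuming a hypothetical oracle pleb(radius, q) that answers r-PLEB (returning a center within the given radius of q, or None) and taking d_min as the smallest pairwise distance. Since YES answers are monotone in the radius, binary search over the exponent is valid:

import math

def approx_nn(pleb, q, d_min, R, eps):
    # Candidate radii d_min*(1+eps)^0, ..., up to about d_min*R.
    m = math.ceil(math.log(R, 1 + eps))
    lo, hi, answer = 0, m, None
    while lo <= hi:
        k = (lo + hi) // 2
        p = pleb(d_min * (1 + eps) ** k, q)
        if p is not None:   # YES: a smaller radius might also succeed
            answer, hi = p, k - 1
        else:               # NO: r* exceeds this radius
            lo = k + 1
    return answer           # a point within (1+eps)*r* of q

The O(log log) query bound comes from this binary search running over the log_{1+ε} R candidate exponents.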

Reduction from ε-NN to r-PLEB: better approach [Har-Peled 2001]
– Set r_med as the radius at which the balls B(p_i, r_med) form n/2 connected components (C.C.)
– Set r_top = 4n·r_med·log(n)/ε

Better approach, query procedure:
– If q ∈ B(p_i, r_med) for some i, continue recursively on that connected component
– If q ∉ B(p_i, r_med) for all i but q ∈ B(p_i, r_top) for some i, set R = r_top/r_med and perform binary search on r ∈ {(1+ε)^0, (1+ε)^1, …, R}; this R is independent of the input points
– If q ∉ B(p_i, r_top) ∀i, then q is "far away": it is enough to choose one point from each C.C. and continue recursively with these points (accumulating error ≤ 1+ε/3)
Complexity overhead: how many r-PLEB queries? Each recursion continues with at most half of the points, and each binary search makes O(log log R) = O(log(n/ε)) queries. Total: O(log n).

(r,ε)-PLEB (Point Location in Equal Balls, relaxed)
Given n balls of radius r, for query q:
– If q resides in a ball of radius r, return the ball
– If q doesn't reside in any ball (even of radius r(1+ε)), return NO
– If q resides only in the "border" of a ball, i.e., between radius r and r(1+ε), return either the ball or NO

Talk Outline Nearest neighbor problem –Motivation Classical nearest neighbor methods –KD-trees Efficient search in high dimensions –Bucketing method –Locality Sensitive Hashing Conclusion

Bucketing Method for r-PLEB [Indyk and Motwani, 1998]
– Impose a grid with cells of side rε/sqrt(d)
– Every ball is covered by at most k cubes; one can show k ≤ C^d/ε^d for some constant C < 5
– kn cubes cover all the balls
– There is a finite number of cubes, so a hash table can be used: key = cube, value = a ball it covers
– Space required: O(nk)

Bucketing Method, query: given a query q,
– Compute the cube it resides in [O(d)]
– Look up a ball this cube intersects [O(1) expected]
– The returned center answers (r,ε)-PLEB for q: a cube of side rε/sqrt(d) has diameter rε, so any point of a cube intersecting B(p, r) lies within r(1+ε) of p. Accordingly the answer is YES inside the balls, NO far away, and either one in the border zone.

Bucketing Method, complexity:
– Space required: O(nk) = O(n·(1/ε)^d); query time: O(d)
– If d = O(log n) [i.e., n = 2^Ω(d)], the space is n^O(log(1/ε))
– Otherwise, first apply dimensionality reduction in l_2 from d to O(ε^(-2)·log n) dimensions [Johnson–Lindenstrauss lemma], giving space n^O(1/ε²)
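A minimal sketch of the bucketing construction in Python (an illustration under our own naming, not the paper's exact data structure). The per-ball cube enumeration is exponential in d, matching k ≤ C^d/ε^d:

import itertools, math

def build_grid(P, r, eps):
    # Hash every grid cube that intersects some ball B(p, r) to one such center p.
    d = len(P[0])
    side = r * eps / math.sqrt(d)      # grid cell side length
    span = math.ceil(r / side) + 1     # cubes per axis that a ball can reach
    table = {}
    for p in P:
        base = [math.floor(x / side) for x in p]
        for off in itertools.product(range(-span, span + 1), repeat=d):
            c = tuple(b + o for b, o in zip(base, off))
            # Distance from p to the nearest point of cube c.
            gap = [max(ci * side - x, x - (ci + 1) * side, 0.0)
                   for ci, x in zip(c, p)]
            if math.hypot(*gap) <= r:
                table.setdefault(c, p)  # value: a ball this cube covers
    return table, side

def grid_query(table, side, q):
    # O(d) to locate q's cube, O(1) expected hash lookup; None means NO.
    return table.get(tuple(math.floor(x / side) for x in q))

A hit means q's cube intersects some B(p, r); since the cube's diameter is rε, the returned p is within r(1+ε) of q, which is exactly the (r,ε)-PLEB guarantee.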

Break

Talk Outline
– Nearest neighbor problem: motivation
– Classical nearest neighbor methods: KD-trees
– Efficient search in high dimensions: bucketing method, Locality Sensitive Hashing
– Conclusion

Locality Sensitive Hashing [Indyk & Motwani 98; Gionis, Indyk & Motwani 99]
– A solution for (r,ε)-PLEB
– Probabilistic construction: a query succeeds with high probability
– Uses random hash functions g: X → U (for some finite range U)
– The functions preserve the "separation" of "near" and "far" points with high probability

Locality Sensitive Hashing
– If ||p-q|| ≤ r, then Pr[g(p)=g(q)] is "high"
– If ||p-q|| > (1+ε)r, then Pr[g(p)=g(q)] is "low"
[Figure: hash tables g_1, g_2, g_3; points within distance r of each other tend to share buckets.]

A locality sensitive family
A family H of functions h: X → U is called (P_1, P_2, r, (1+ε)r)-sensitive for the metric d_X if for any p, q:
– if ||p-q|| ≤ r then Pr[h(p)=h(q)] > P_1
– if ||p-q|| > (1+ε)r then Pr[h(p)=h(q)] < P_2
For this notion to be useful we require P_1 > P_2.

Intuition: if ||p-q|| ≤ r then Pr[h(p)=h(q)] > P_1, while if ||p-q|| > (1+ε)r then Pr[h(p)=h(q)] < P_2. [Illustration from Lihi Zelnik-Manor: two hash functions h_1, h_2 partitioning the space.]

Claim: if there is a (P_1, P_2, r, (1+ε)r)-sensitive family for d_X, then there exists an algorithm for (r,ε)-PLEB in d_X with:
– Space: O(dn + n^(1+ρ))
– Query: O(d·n^ρ)
where ρ = log(1/P_1)/log(1/P_2). For the family used below ρ ≤ 1/(1+ε), so when ε = 1 this gives space O(dn + n^(3/2)) and query O(d·sqrt(n)).

Algorithm, preprocessing: for i = 1, …, L,
– Uniformly select k functions h_1, …, h_k from H
– Set g_i(p) = (h_1(p), h_2(p), …, h_k(p))
– Compute g_i(p) for all p ∈ P
– Store the resulting values in a hash table
[Figure: with h_i: R^d → {0,1}, each g_i maps a point to a k-bit string, e.g., g_i(·) = (0,0,…,1) or (1,0,…,0).]

Algorithm, query:
S ← ∅, i ← 1
While |S| ≤ 2L:
– S ← S ∪ {points in bucket g_i(q) of table i}
– If ∃ p ∈ S s.t. ||p-q|| ≤ (1+ε)r, return p and exit
– i++
Return NO.
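A Python sketch of the whole scheme (our naming), parameterized by a sampler draw_h() for the family H; the concrete family for the Hamming cube appears below.

import random
from collections import defaultdict

def build_lsh(P, draw_h, k, L):
    # L tables; table i keys each point by g_i(p) = (h_1(p), ..., h_k(p)).
    tables = []
    for _ in range(L):
        hs = [draw_h() for _ in range(k)]
        g = lambda p, hs=hs: tuple(h(p) for h in hs)
        buckets = defaultdict(list)
        for p in P:
            buckets[g(p)].append(p)
        tables.append((g, buckets))
    return tables

def lsh_query(tables, q, r, eps, dist):
    # Inspect at most 2L candidates; return one within (1+eps)*r of q, else None.
    inspected = 0
    for g, buckets in tables:
        for p in buckets.get(g(q), []):
            if dist(p, q) <= (1 + eps) * r:
                return p
            inspected += 1
            if inspected >= 2 * len(tables):  # the 2L cutoff from the slide
                return None
    return None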

Correctness
– Property I: if ||q-p*|| ≤ r then g_i(p*) = g_i(q) for some i ∈ {1, …, L}
– Property II: the number of points p ∈ P with ||q-p|| ≥ (1+ε)r and g_i(p) = g_i(q) for some i is less than 2L
We show that Pr[I & II hold] ≥ 1/2 - 1/e, with the choice k = log_{1/P_2} n and L = n^ρ, where ρ = log(1/P_1)/log(1/P_2).

Complexity (with k = log_{1/P_2} n and L = n^ρ)
– Space: L hash tables of n entries, plus d·n for the data points = O(n^(1+ρ) + dn)
– Query: L hash-function evaluations plus O(L) distance calculations = O(d·n^ρ)

Significance of k and L
[Plots of the collision probability against ||p-q||: Pr[g(p)=g(q)] for a single table, and Pr[g_i(p)=g_i(q) for some i ∈ {1,…,L}] across L tables.]
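The shape of these curves has a closed form: if a single h collides with probability p at a given distance, then the concatenation g = (h_1, …, h_k) collides with probability p^k, and some table among L independent ones collides with probability 1-(1-p^k)^L. A quick check with illustrative values:

def collision_prob(p, k, L):
    # Pr[g_i(x) = g_i(y) for some i in 1..L] when Pr[h(x) = h(y)] = p.
    return 1 - (1 - p ** k) ** L

print(collision_prob(0.9, k=10, L=50))  # near pair: ~1.0
print(collision_prob(0.5, k=10, L=50))  # far pair:  ~0.05

Raising k drives collisions of far pairs toward 0; raising L pulls collisions of near pairs back toward 1.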

Application: NNS in R^d with the l_1 distance
Reduce the problem to NNS in H^d', the Hamming cube of dimension d':
– H^d' = binary strings of length d'
– d_Ham(s_1, s_2) = the number of coordinates where s_1 and s_2 disagree

Embedding l_1^d in H^d'
– W.l.o.g. all coordinates of all points in P are positive integers < C
– Map the integer i ∈ {1, …, C} to the string (1,1,…,1,0,0,…,0) of i ones followed by C-i zeros
– Map a vector by mapping each coordinate (so d' = dC)
Example: {(5,3,2),(2,4,1)} → {(11111,11100,11000),(11000,11110,10000)}

Embedding l_1^d in H^d'
– Distances are preserved exactly
– The actual computations can still be performed in the original space, with an O(log C) overhead
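A sketch of the embedding on the slide's example (C = 5):

def embed(v, C=5):
    # Unary code: coordinate x becomes x ones followed by C - x zeros.
    return "".join("1" * x + "0" * (C - x) for x in v)

def d_ham(s, t):
    return sum(a != b for a, b in zip(s, t))

u, v = (5, 3, 2), (2, 4, 1)
print(embed(u), embed(v))         # 111111110011000 110001111010000
print(d_ham(embed(u), embed(v)))  # 5 = |5-2| + |3-4| + |2-1|, the l_1 distance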

A sensitive family for the Hamming cube
H_d' = {h_i : h_i(b_1, …, b_d') = b_i, for i = 1, …, d'}, the projections onto single coordinates
– If d_Ham(s_1, s_2) < r, what is Pr[h(s_1)=h(s_2)]? At least 1 - r/d'
– If d_Ham(s_1, s_2) > (1+ε)r, what is Pr[h(s_1)=h(s_2)]? At most 1 - (1+ε)r/d'
So H_d' is (r, (1+ε)r, 1-r/d', 1-(1+ε)r/d')-sensitive.
Question: what are these projections in the original space?
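The family is trivial to sample. A sketch that doubles as an empirical check of the collision probability, which equals 1 - d_Ham/d' exactly since the hash disagrees precisely when the sampled coordinate is a differing one:

import random

def draw_h(d_prime):
    # h_i(b_1, ..., b_d') = b_i for a uniformly random coordinate i.
    i = random.randrange(d_prime)
    return lambda s: s[i]

s1, s2 = "111111110011000", "110001111010000"  # d_Ham = 5, d' = 15
hits = 0
for _ in range(100_000):
    h = draw_h(len(s1))
    hits += h(s1) == h(s2)
print(hits / 100_000)  # ~ 1 - 5/15 = 0.667

Partially applied, e.g. lambda: draw_h(15), this serves as the draw_h sampler in the earlier LSH sketch.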

Corollary
For this family we can bound ρ ≤ 1/(1+ε), so:
– Space: O(dn + n^(1+1/(1+ε)))
– Query: O(d·n^(1/(1+ε)))
When ε = 1: space O(dn + n^(3/2)), query O(d·sqrt(n)).

Recent results
– In Euclidean space: ρ ≤ 1/(1+ε)² + O(log log n / log^(1/3) n) [Andoni & Indyk 2008], and ρ ≥ 0.462/(1+ε)² [Motwani, Naor & Panigrahy 2006]
– An LSH family for l_s, s ∈ [0,2) [Datar, Immorlica, Indyk & Mirrokni 2004]
– And many more.

Conclusion NNS is an important problem with many applications. The problem can be efficiently solved in low dimensions. We saw some efficient approximate solutions in high dimensions, which are applicable to many metrics.