CS 361A (Advanced Data Structures and Algorithms)
Lecture 19 (Dec 5, 2005)
Nearest Neighbors: Dimensionality Reduction and Locality-Sensitive Hashing
Rajeev Motwani
Metric Space
Metric Space (M,D)
– For points p,q in M, D(p,q) is the distance from p to q
– Only reasonable model for high-dimensional geometric space
Defining Properties
– Reflexive: D(p,q) = 0 if and only if p = q
– Symmetric: D(p,q) = D(q,p)
– Triangle Inequality: D(p,q) is at most D(p,r) + D(r,q)
Interesting Cases
– M: points in d-dimensional space
– D: Hamming distance or Euclidean L_p-norms
High-Dimensional Near Neighbors
Nearest Neighbors Data Structure
– Given – N points P = {p_1, …, p_N} in metric space (M,D)
– Queries – “Which point p ∈ P is closest to point q?”
– Complexity – tradeoff of preprocessing space against query time
Applications
– vector quantization
– multimedia databases
– data mining
– machine learning
– …
Known Results
Query Time | Storage | Technique | Paper
dN | dN | Brute Force | –
2^d log N | N^(2^(d+1)) | Voronoi Diagram | Dobkin-Lipton 76
d^(d/2) log N | N^(d/2) | Random Sampling | Clarkson 88
d^5 log N | N^d | Combination | Meiser 93
log^(d-1) N | N log^(d-1) N | Parametric Search | Agarwal-Matousek 92
Some expressions are approximate
Bottom line – exponential dependence on d
Approximate Nearest Neighbor
Exact Algorithms
– Benchmark – brute-force needs space O(N), query time O(N)
– Known Results – exponential dependence on dimension
– Theory/Practice – no better than brute-force search
Approximate Near-Neighbors
– Given – N points P = {p_1, …, p_N} in metric space (M,D)
– Given – error parameter ε > 0
– Goal – for query q with nearest neighbor p, return a point r such that D(q,r) ≤ (1+ε)·D(q,p)
Justification
– Mapping objects to a metric space is a heuristic anyway
– Get tremendous performance improvement
Results for Approximate NN
Query Time | Storage | Technique | Paper
d^d ε^(-d) | dN | Balanced Trees | Arya et al 94
d^2 polylog(N,d) | N^(2d) | Random Projection | Kleinberg 97
N | dN polylog(N,d) | Random Projection | Kleinberg 97
log^3 N | N^(1/ε^2) | Search Trees + Dimension Reduction | Indyk-Motwani 98
dN^(1/(1+ε)) log^2 N | N^(1+1/(1+ε)) log N | Locality-Sensitive Hashing | Indyk-Motwani 98
– | – | External Memory Locality-Sensitive Hashing | Gionis-Indyk-Motwani 99
Some expressions are approximate
Will show the main ideas of the last 3 results
Approximate r-Near Neighbors
Given – N points P = {p_1, …, p_N} in metric space (M,D)
Given – error parameter ε > 0, distance threshold r > 0
Query
– If no point p with D(q,p) < r, return FAILURE
– Else, return any p' with D(q,p') < (1+ε)r
Application
– Solving Approximate Nearest Neighbor (sketched below)
– Assume maximum distance is R
– Run in parallel for r = 1, (1+ε), (1+ε)^2, …, R
– Time/space – O(log R) overhead
– [Indyk-Motwani] – reduce to O(polylog N) overhead
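As an illustration of the reduction just described, here is a minimal Python sketch (not from the lecture). It assumes a black-box routine query_r_near(q, r) implementing approximate r-near neighbors, which returns a point within (1+ε)r of q or None on FAILURE, and it probes a geometric sequence of radii; all names are illustrative.

    def approx_nearest_neighbor(query_r_near, q, R, eps):
        # Assumes all inter-point distances lie in [1, R].  The first radius at which
        # the r-near-neighbor query succeeds certifies that no point is closer than
        # r/(1+eps), while the returned point is within (1+eps)*r of q, so the answer
        # is approximately nearest (up to a rescaling of eps).
        r = 1.0
        while r <= R * (1.0 + eps):
            p = query_r_near(q, r)       # FAILURE is signalled by returning None
            if p is not None:
                return p
            r *= (1.0 + eps)             # next radius in the geometric sequence
        return None                      # no point within distance R of q

The loop runs about log_{1+ε} R times, which is the O(log R) overhead noted above.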
Hamming Metric
Hamming Space
– Points in M: bit-vectors {0,1}^d (can generalize to {0,1,2,…,q}^d)
– Hamming Distance: D(p,q) = number of positions where p,q differ
Remarks
– Simplest high-dimensional setting
– Still useful in practice
– In theory, as hard (or easy) as Euclidean space
– Trivial in low dimensions
Example
– Hypercube in d = 3 dimensions
– {000, 001, 010, 011, 100, 101, 110, 111}
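For concreteness, a tiny Python sketch of the Hamming distance on bit-vectors (points represented as equal-length tuples of 0/1; the function name is just illustrative):

    def hamming_distance(p, q):
        # Number of positions where the bit-vectors p and q differ.
        assert len(p) == len(q)
        return sum(1 for a, b in zip(p, q) if a != b)

    # On the d=3 hypercube above: hamming_distance((0,0,0), (1,0,1)) == 2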
Dimensionality Reduction
Overall Idea
– Map from high to low dimensions
– Preserve distances approximately
– Solve Nearest Neighbors in the new space
– Performance improvement at the cost of approximation error
Mapping?
– Hash function family H = {H_1, …, H_m}
– Each H_i : {0,1}^d → {0,1}^t with t << d
– Pick H_R from H uniformly at random
– Map each point in P using the same H_R
– Solve the NN problem on H_R(P) = {H_R(p_1), …, H_R(p_N)}
Reduction for Hamming Spaces
Theorem: For any r and small ε > 0, there is a hash family H such that, for any p,q and a random H_R ∈ H, with probability > 1−δ, provided t ≥ C·ε^(-2)·log(2/δ) for some constant C:
(a) D(p,q) < r ⟹ D(H_R(p), H_R(q)) ≤ (c + ε/12)·t
(b) D(p,q) > (1+ε)r ⟹ D(H_R(p), H_R(q)) > (c + ε/12)·t
(here c = ½(1 − e^(-1)), as chosen in the analysis below)
Remarks
For fixed threshold r, can distinguish between
– Near: D(p,q) < r
– Far: D(p,q) > (1+ε)r
For N points, need δ < 1/N^2 (union bound over all pairs), i.e. t = O(ε^(-2) log N)
Yet, can reduce to an O(log N)-dimensional space, while approximately preserving distances
Works even if the points are not known in advance
Hash Family
Projection Function
– Let S be an ordered multiset of s indexes from {1,…,d}
– p|S : {0,1}^d → {0,1}^s projects p onto an s-dimensional subspace
– Example: d = 5, p = 01100, s = 3, S = {2,2,4} ⟹ p|S = 110
Choosing the hash function H_R in H
– Repeat for i = 1, …, t
  Pick S_i randomly (with replacement) from {1,…,d}
  Pick a random hash function f_i : {0,1}^s → {0,1}
  h_i(p) = f_i(p|S_i)
– H_R(p) = (h_1(p), h_2(p), …, h_t(p))
Remark – note the similarity to Bloom Filters
Illustration of Hashing
[Figure: a point p ∈ {0,1}^d is projected to p|S_1, …, p|S_t; each projection is fed to a random function f_1, …, f_t, producing the bits h_1(p), …, h_t(p), which are concatenated into H_R(p).]
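A Python sketch of the hash family just defined, assuming points are bit-tuples; the random functions f_i are realized as lazily filled dictionaries, which is an implementation convenience rather than part of the slides.

    import random

    def make_hash_family(d, s, t, seed=0):
        # Build H_R = (h_1, ..., h_t): each h_i projects onto a random multiset S_i of
        # s coordinates (sampled with replacement) and applies a random function
        # f_i : {0,1}^s -> {0,1} to the projection.
        rng = random.Random(seed)
        index_sets = [tuple(rng.randrange(d) for _ in range(s)) for _ in range(t)]
        tables = [{} for _ in range(t)]            # the random functions f_i

        def H_R(p):
            out = []
            for S, f in zip(index_sets, tables):
                proj = tuple(p[j] for j in S)      # p|S_i
                if proj not in f:
                    f[proj] = rng.randrange(2)     # fresh random bit for a new input
                out.append(f[proj])                # h_i(p) = f_i(p|S_i)
            return tuple(out)

        return H_R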
Analysis I
Choose a random index multiset S
Claim: For any p,q: Pr[p|S = q|S] = (1 − D(p,q)/d)^s
Why?
– p,q differ in D(p,q) bit positions
– Need all s indexes of S to avoid these positions
– Sampling with replacement from {1,…,d}
Analysis II
Choose s = d/r
Since 1 − x < e^(-x) for |x| < 1, we obtain Pr[p|S = q|S] = (1 − D(p,q)/d)^(d/r) < e^(-D(p,q)/r)
Thus
– D(p,q) > (1+ε)r ⟹ Pr[p|S = q|S] < e^(-(1+ε))
– D(p,q) < r ⟹ Pr[p|S = q|S] ≥ e^(-1) (approximately)
Analysis III
Recall h_i(p) = f_i(p|S_i)
Thus Pr[h_i(p) ≠ h_i(q)] = ½·Pr[p|S_i ≠ q|S_i] = ½(1 − (1 − D(p,q)/d)^(d/r))
Choosing c = ½(1 − e^(-1)):
– D(p,q) < r ⟹ Pr[h_i(p) ≠ h_i(q)] ≤ c (approximately)
– D(p,q) > (1+ε)r ⟹ Pr[h_i(p) ≠ h_i(q)] ≥ c + ε/6
Analysis IV
Recall H_R(p) = (h_1(p), h_2(p), …, h_t(p))
D(H_R(p), H_R(q)) = number of i's where h_i(p), h_i(q) differ
By linearity of expectation, E[D(H_R(p), H_R(q))] = t·Pr[h_i(p) ≠ h_i(q)]
Theorem almost proved (in expectation)
For a high-probability bound, need the Chernoff Bound
Chernoff Bound
Consider Bernoulli random variables X_1, X_2, …, X_n
– Values are 0-1
– Pr[X_i = 1] = x and Pr[X_i = 0] = 1 − x
Define X = X_1 + X_2 + … + X_n with E[X] = nx
Theorem: For independent X_1, …, X_n and any 0 < λ < 1,
Pr[|X − nx| > λ·nx] < 2·e^(-λ^2·nx/3)
Analysis V
Define
– X_i = 0 if h_i(p) = h_i(q), and 1 otherwise
– n = t
– Then X = X_1 + X_2 + … + X_t = D(H_R(p), H_R(q))
Case 1 [D(p,q) < r ⟹ x ≤ c]: Chernoff gives X ≤ (c + ε/12)·t with high probability
Case 2 [D(p,q) > (1+ε)r ⟹ x ≥ c + ε/6]: Chernoff gives X > (c + ε/12)·t with high probability
Observe – sloppy bounding of constants in Case 2
Putting it all together
Recall t = C·ε^(-2)·log(2/δ)
Thus, by the Chernoff bound, the error probability is at most 2·exp(−Ω(ε^2·t)) ≤ δ
Choosing C = 1200/c
Theorem is proved!!
Algorithm I
Set the error probability to δ/N per point (union bound over the N data points)
Select hash H_R and map points p ↦ H_R(p)
Processing query q
– Compute H_R(q)
– Find the nearest neighbor H_R(p) of H_R(q)
– If D(H_R(q), H_R(p)) ≤ (c + ε/12)·t then return p, else FAILURE
Remarks
– Brute-force for finding H_R(p) implies query time O(N·t) = O(N·ε^(-2)·log(N/δ))
– Need another approach for lower dimensions
Algorithm II
Fact – Exact nearest neighbor in {0,1}^t requires
– Space O(2^t)
– Query time O(t)
How?
– Precompute/store the answers to all queries
– Number of possible queries is 2^t
Since t = O(ε^(-2)·log(N/δ)), we get 2^t = (N/δ)^O(1/ε^2)
Theorem – In Hamming space {0,1}^d, can solve approximate nearest neighbor with:
– Space – N^O(1/ε^2)
– Query time – O(d·ε^(-2)·log(N/δ)) (dominated by computing H_R(q))
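A sketch of the "precompute all answers" idea in Python: for small t, every possible query in {0,1}^t is enumerated once and its exact nearest neighbor among the mapped points is stored, so a query becomes a single dictionary lookup. The brute-force preprocessing and the names are illustrative.

    from itertools import product

    def build_lookup_table(points_t):
        # points_t: list of bit-tuples in {0,1}^t.
        # Returns a dict mapping every query in {0,1}^t to its nearest stored point:
        # O(2^t) space, and answering a query is one O(t) hash lookup.
        t = len(points_t[0])
        ham = lambda a, b: sum(x != y for x, y in zip(a, b))
        return {q: min(points_t, key=lambda p: ham(q, p))
                for q in product((0, 1), repeat=t)}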
Different Metric
Many applications have “sparse” points
– Many dimensions but few 1's
– Example – points ↔ documents, dimensions ↔ words
– Better to view the points as “sets”
The previous approach would require large s
For sets A,B, define sim(A,B) = |A ∩ B| / |A ∪ B|
Observe
– A = B ⟹ sim(A,B) = 1
– A,B disjoint ⟹ sim(A,B) = 0
Question – Handling D(A,B) = 1 − sim(A,B)?
Min-Hash
Random permutations π_1, …, π_t of the universe (the dimensions)
Define the mapping h_j(A) = min_{a ∈ A} π_j(a)
Fact: Pr[h_j(A) = h_j(B)] = sim(A,B)
Proof? – already seen!!
Overall hash function H_R(A) = (h_1(A), h_2(A), …, h_t(A))
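A Python sketch of min-hash using t explicit random permutations of the universe (fine for illustration; sets are collections of dimension indices, and the names are illustrative):

    import random

    def make_minhash(universe_size, t, seed=0):
        rng = random.Random(seed)
        perms = []
        for _ in range(t):
            perm = list(range(universe_size))
            rng.shuffle(perm)                  # pi_j: random permutation of the dimensions
            perms.append(perm)

        def H_R(A):
            # h_j(A) = min over a in A of pi_j(a); H_R(A) concatenates h_1,...,h_t.
            return tuple(min(perm[a] for a in A) for perm in perms)

        return H_R

    # Since Pr[h_j(A) = h_j(B)] = sim(A,B), the fraction of coordinates on which
    # H_R(A) and H_R(B) agree is an estimator of sim(A,B).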
Min-Hash Analysis
Select t = O(ε^(-2)·log(2/δ)) independent permutations
Hamming Distance
– D(H_R(A), H_R(B)) = number of j's such that h_j(A) ≠ h_j(B)
Theorem: For any A,B, with probability > 1−δ, D(H_R(A), H_R(B))/t estimates D(A,B) to within ε
Proof? – Exercise (apply the Chernoff Bound)
Obtain – an ANN algorithm similar to the earlier result
Generalization
Goal
– Abstract the technique used for Hamming space
– Enable application to other metric spaces
– Handle Dynamic ANN
Dynamic Approximate r-Near Neighbors
– Fix – threshold r
– Query – if any point is within distance r of q, return any point within distance (1+ε)r
– Allow insertions/deletions of points in P
Recall – the earlier method required preprocessing all possible queries in the hash range space…
Locality-Sensitive Hashing
Fix – metric space (M,D), threshold r, error ε
Choose – probability parameters Q_1 > Q_2 > 0
Definition – A hash family H = {h : M → S} for (M,D) is called (r, (1+ε)r, Q_1, Q_2)-sensitive if, for a random h and any p,q in M:
– D(p,q) < r ⟹ Pr[h(p) = h(q)] > Q_1
– D(p,q) > (1+ε)r ⟹ Pr[h(p) = h(q)] < Q_2
Intuition
– p,q near ⟹ likely to collide
– p,q far ⟹ unlikely to collide
Examples
Hamming Space M = {0,1}^d
– point p = b_1…b_d
– H = {h_i(b_1…b_d) = b_i, for i = 1…d}
– sampling one bit at random
– Pr[h_i(q) = h_i(p)] = 1 − D(p,q)/d
Set Similarity D(A,B) = 1 − sim(A,B)
– Recall sim(A,B) = |A ∩ B| / |A ∪ B|
– H = {min-hash functions h_π(A) = min_{a ∈ A} π(a), over random permutations π}
– Pr[h(A) = h(B)] = 1 − D(A,B)
Multi-Index Hashing
Overall Idea
– Fix an LSH family H
– Boost the Q_1, Q_2 gap by defining G = H^k
– Using G, each point hashes into l buckets
Intuition
– r-near neighbors likely to collide
– few non-near pairs in any bucket
Define
– G = { g | g(p) = h_1(p) h_2(p) … h_k(p) }
– Hamming metric ⟹ sample k random bits
Example (l = 4)
[Figure: points p, q, r hashed into the buckets of four tables g_1, g_2, g_3, g_4, where each g_j is a concatenation of h_1, …, h_k.]
Overall Scheme
Preprocessing
– Prepare a hash table for the range of G
– Select l hash functions g_1, g_2, …, g_l
Insert(p) – add p to buckets g_1(p), g_2(p), …, g_l(p)
Delete(p) – remove p from buckets g_1(p), g_2(p), …, g_l(p)
Query(q)
– Check buckets g_1(q), g_2(q), …, g_l(q)
– Report the nearest of (say) the first 3l points
Complexity
– Assume – computing D(p,q) needs O(d) time
– Assume – storing p needs O(d) space
– Insert/Delete/Query Time – O(d·l·k)
– Preprocessing/Storage – O(dN + N·l·k)
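A minimal Python sketch of this scheme for the Hamming case, where each g_j samples k random bit positions; buckets are dictionary entries and the 3l cutoff follows the slide (class and method names are illustrative):

    import random
    from collections import defaultdict

    class HammingLSH:
        def __init__(self, d, k, l, seed=0):
            rng = random.Random(seed)
            # g_j(p) = concatenation of k randomly chosen bits of p (one g per table)
            self.gs = [tuple(rng.randrange(d) for _ in range(k)) for _ in range(l)]
            self.tables = [defaultdict(list) for _ in range(l)]

        def _key(self, g, p):
            return tuple(p[i] for i in g)

        def insert(self, p):
            for g, table in zip(self.gs, self.tables):
                table[self._key(g, p)].append(p)

        def delete(self, p):
            for g, table in zip(self.gs, self.tables):
                table[self._key(g, p)].remove(p)

        def query(self, q):
            ham = lambda a, b: sum(x != y for x, y in zip(a, b))
            candidates = []
            for g, table in zip(self.gs, self.tables):
                candidates.extend(table.get(self._key(g, q), []))
            candidates = candidates[:3 * len(self.gs)]   # inspect at most 3l points
            return min(candidates, key=lambda p: ham(q, p)) if candidates else None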
Collision Probability vs. Distance
[Figure: collision probability as a function of D(p,q), decreasing from 1 to 0; it is at least Q_1 at distance r and at most Q_2 at distance (1+ε)r.]
Multi-Index versus Error
Set l = N^z where z = log(1/Q_1) / log(1/Q_2)
Theorem: For l = N^z, any query returns an r-near neighbor correctly with probability at least 1/6
Consequently (ignoring k = O(log N) factors)
– Time O(d·N^z)
– Space O(N^(1+z))
– Hamming Metric ⟹ z ≈ 1/(1+ε) (see the calculation below)
– Boost Probability – use several parallel hash tables
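As a small addition (not on the slide), here is the calculation behind z ≈ 1/(1+ε), assuming the bit-sampling family from the Examples slide with Q_1 = 1 − r/d, Q_2 = 1 − (1+ε)r/d and r << d, so that −ln(1−x) ≈ x:

    z = \frac{\log(1/Q_1)}{\log(1/Q_2)}
      = \frac{-\ln\bigl(1 - r/d\bigr)}{-\ln\bigl(1 - (1+\varepsilon)\,r/d\bigr)}
      \approx \frac{r/d}{(1+\varepsilon)\,r/d}
      = \frac{1}{1+\varepsilon}

so the time d·N^z and space N^(1+z) specialize to roughly d·N^(1/(1+ε)) and N^(1+1/(1+ε)), matching the Results for Approximate NN table.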
Analysis
Define (for fixed query q)
– p* – any point with D(q,p*) < r
– FAR(q) – all p with D(q,p) > (1+ε)r
– BUCKET(q,j) – all p with g_j(p) = g_j(q)
– Event E_size: at most 3l points of FAR(q) fall into q's buckets (⟹ query cost bounded by O(d·l))
– Event E_NN: g_j(p*) = g_j(q) for some j (⟹ the nearest point in the l buckets is an r-near neighbor)
Analysis
– Show: Pr[E_size] = x > 2/3 and Pr[E_NN] = y > 1/2
– Thus: Pr[not(E_size & E_NN)] ≤ (1−x) + (1−y) < 5/6
Analysis – Bad Collisions
Choose k = log_{1/Q_2} N
Fact: for p ∈ FAR(q), Pr[g_j(p) = g_j(q)] ≤ Q_2^k = 1/N
Clearly, for X = total number of points of FAR(q) in q's buckets, E[X] ≤ l
Markov Inequality – Pr[X > a·E[X]] < 1/a for any a > 0
Lemma 1: Pr[E_size] > 2/3
Analysis – Good Collisions
Observe: Pr[g_j(p*) = g_j(q)] ≥ Q_1^k = Q_1^(log_{1/Q_2} N) = N^(-z)
Since l = N^z, Pr[E_NN] ≥ 1 − (1 − N^(-z))^l ≥ 1 − 1/e > 1/2
Lemma 2: Pr[E_NN] > 1/2
Euclidean Norms
Recall
– x = (x_1, x_2, …, x_d) and y = (y_1, y_2, …, y_d) in R^d
– L_1-norm: D(x,y) = Σ_i |x_i − y_i|
– L_p-norm (for p > 1): D(x,y) = (Σ_i |x_i − y_i|^p)^(1/p)
Extension to L_1-Norm
Round coordinates to {1,…,M}
Embed L_1-{1,…,M}^d into Hamming-{0,1}^(dM)
Unary Mapping – each coordinate value v ↦ 1…1 0…0 (v ones followed by M−v zeros), as sketched below
Apply the algorithm for Hamming Spaces
– Error due to rounding of ~1/M
– Space/Time Overhead due to the mapping of d ↦ dM
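A Python sketch of the unary mapping, assuming the coordinates have already been rounded to integers in {1,…,M}; Hamming distance between the images equals L_1 distance between the originals.

    def unary_embed(x, M):
        # x: integer vector with entries in {1,...,M}; output has d*M bits.
        bits = []
        for v in x:
            bits.extend([1] * v + [0] * (M - v))   # v ones followed by M-v zeros
        return tuple(bits)

    # Example: x = (2,1), y = (3,3), M = 3.  L1 distance = |2-3| + |1-3| = 3, and
    # unary_embed(x,3) = (1,1,0, 1,0,0) and unary_embed(y,3) = (1,1,1, 1,1,1)
    # differ in exactly 3 bits.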
Extension to L_2-Norm
Observe
– Little difference between the L_1-norm and the L_2-norm for high d
– Additional error is small
More generally – L_p, for 1 ≤ p ≤ 2
– [Figiel et al 1977, Johnson-Schechtman 1982]
– Can embed L_p into L_1
– Dimensions d ↦ O(d)
– Distances preserved within a factor (1+a)
– Key Idea – random rotation of the space
Improved Bounds
[Indyk-Motwani 1998]
– For any L_p-norm
– Query Time – O(log^3 N)
– Space – N^O(1/ε^2)
Problem – impractical
Today – only a high-level sketch
Better Reduction
Recall
– Reduced Approximate Nearest Neighbors to Approximate r-Near Neighbors
– Space/Time Overhead – O(log R)
– R = max distance in the metric space
Ring-Cover Trees
– Removed dependence on R
– Reduced overhead to O(polylog N)
Approximate r-Near Neighbors
Idea
– Impose a regular grid on R^d
– Decompose into cubes of side length s
– Label cubes with the points at distance < r
Data Structure
– Query q – determine the cube containing q
– Cube labels – candidate r-near neighbors
Goals
– Small s ⟹ lower error
– Fewer cubes ⟹ smaller storage
[Figure illustrating the grid, with cubes labeled by nearby points p_1, p_2, p_3.]
Grid Analysis
Assume r = 1
Choose s = ε/√d ⟹ Cube Diameter = ε
Number of cubes = N·O(1/ε)^d
Theorem – For any L_p-norm, can solve Approximate r-Near Neighbor using
– Space – N·O(1/ε)^d
– Time – O(d)
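A Python sketch of the grid construction for the Euclidean norm (r = 1, cube side s = ε/√d). Labeling the cubes is done here by scanning a bounding box of cells around each point, which makes the N·O(1/ε)^d storage explicit but is only practical for small d; all names are illustrative.

    import itertools, math

    def build_grid(points, eps, r=1.0):
        d = len(points[0])
        s = eps * r / math.sqrt(d)              # cube diameter = s*sqrt(d) = eps*r
        reach = int(math.ceil(r / s)) + 1       # cells per axis to scan around a point
        grid = {}
        for p in points:
            base = [int(math.floor(c / s)) for c in p]
            for offs in itertools.product(range(-reach, reach + 1), repeat=d):
                cell = tuple(b + o for b, o in zip(base, offs))
                lo = [c * s for c in cell]      # lower corner of this cube
                # squared distance from p to the cube (0 if p lies inside it)
                dist2 = sum(max(lo[i] - p[i], p[i] - (lo[i] + s), 0.0) ** 2
                            for i in range(d))
                if dist2 <= r * r:
                    grid.setdefault(cell, p)    # label the cube with a nearby point
        return grid, s

    def query_grid(grid, s, q):
        # Any point labeling q's cube is within r + eps*r = (1+eps)*r of q; if some
        # point lies within r of q, that cube is guaranteed to be labeled.
        return grid.get(tuple(int(math.floor(c / s)) for c in q))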
Dimensionality Reduction
[Johnson-Lindenstrauss 84, Frankl-Maehara 88]
For any 0 < ε < 1, can map the points in P into a subspace of dimension O(ε^(-2)·log N) while preserving all inter-point distances to within a factor (1+ε)
Proof idea – project onto random lines
Result for NN
– Space – N^O(1/ε^2)
– Time – O(polylog N)
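A Python sketch of the random-projection map behind this lemma (Gaussian directions scaled by 1/√k; the caller chooses the target dimension k = O(ε^(-2) log N)); written in plain Python rather than numpy to stay self-contained.

    import math, random

    def random_projection(points, k, seed=0):
        # Project d-dimensional points onto k i.i.d. Gaussian directions, scaled so
        # that Euclidean distances are preserved up to (1+eps) with high probability
        # when k = O(log N / eps^2).
        rng = random.Random(seed)
        d = len(points[0])
        R = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]
        scale = 1.0 / math.sqrt(k)
        return [tuple(scale * sum(row[i] * p[i] for i in range(d)) for row in R)
                for p in points]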
References
– P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998.
– A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. VLDB 1999.