Similarity Search in High Dimensions via Hashing


1 Similarity Search in High Dimensions via Hashing
Aristides Gionis, Piotr Indyk, Rajeev Motwani
Presented by: Fatih Uzun

2 Outline
Introduction
Problem Description
Key Idea
Experiments and Results
Conclusions

3 Introduction
Similarity search over high-dimensional data: image databases, document collections, etc.
Curse of dimensionality: all space-partitioning techniques degrade to linear search in high dimensions
Exact vs. approximate answers: an approximate answer may be good enough and much faster, a time-quality trade-off

4 Problem Description
ε-Nearest Neighbor Search (ε-NNS)
Given a set P of points in a normed space, preprocess P so as to efficiently return a point p ∈ P for any given query point q, such that
dist(q,p) ≤ (1 + ε) · min_{r ∈ P} dist(q,r)
With ε = 0.1, for example, the returned point is at most 10% farther from q than the true nearest neighbor.
Generalizes to K-nearest neighbor search (K > 1)

5 Problem Description

6 Key Idea
Locality-Sensitive Hashing (LSH) gives sub-linear dependence on the data size even for high-dimensional data
Preprocessing: hash each data point with several LSH functions, chosen so that the probability of collision is higher for closer objects
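A concrete instance from the paper's Hamming-space setting: if h samples a single random coordinate, h(p) = p_i, then for d-dimensional binary vectors
Pr[h(p) = h(q)] = 1 − dH(p,q) / d
so the collision probability falls off with Hamming distance, which is exactly the locality-sensitive property.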

7 Algorithm: Preprocessing
Input: set of n points {p1, ..., pn}; L, the number of hash tables
Output: hash tables Ti, i = 1, 2, ..., L
For each i = 1, 2, ..., L
    Initialize Ti with a random hash function gi(·)
    For each j = 1, 2, ..., n
        Store point pj in bucket gi(pj) of hash table Ti
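A minimal runnable sketch of this preprocessing step, assuming binary vectors in Hamming space and bit-sampling hash functions as in the paper; the function and variable names are ours, and k (the number of sampled bits per gi) is a parameter the deck introduces on the analysis slide:

```python
import random
from collections import defaultdict

def build_tables(points, d, L, k, seed=0):
    """Preprocessing sketch: store every point in L hash tables.

    Each g_i concatenates k bit-sampling hash functions, i.e. it
    projects a d-dimensional binary vector onto k random coordinates.
    """
    rng = random.Random(seed)
    projections = [rng.sample(range(d), k) for _ in range(L)]  # one g_i per table
    tables = [defaultdict(list) for _ in range(L)]
    for j, p in enumerate(points):
        for i, coords in enumerate(projections):
            key = tuple(p[c] for c in coords)  # bucket g_i(p_j)
            tables[i][key].append(j)           # store the index of p_j
    return projections, tables
```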

8 LSH - Algorithm
(Diagram: each point pi is hashed by g1(pi), g2(pi), ..., gL(pi) into hash tables T1, T2, ..., TL.)

9 Algorithm: ε-NNS Query
Input: query point q; K, the number of approximate nearest neighbors
Access: hash tables Ti, i = 1, 2, ..., L
Output: set S of K (or fewer) approximate nearest neighbors
S ← ∅
For each i = 1, 2, ..., L
    S ← S ∪ {points found in bucket gi(q) of hash table Ti}
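Continuing the sketch above (same assumed names; returning the K candidates closest to q is the natural final step, though the slide only specifies collecting the buckets):

```python
def hamming(a, b):
    """Hamming distance between two equal-length binary vectors."""
    return sum(x != y for x, y in zip(a, b))

def query(q, points, projections, tables, K):
    """Collect collision candidates from all L tables, then return
    the K candidates closest to q in Hamming distance."""
    candidates = set()                           # S <- empty set
    for coords, table in zip(projections, tables):
        key = tuple(q[c] for c in coords)        # bucket g_i(q)
        candidates.update(table.get(key, []))    # S <- S U {...}
    return sorted(candidates, key=lambda j: hamming(q, points[j]))[:K]

# Usage: 1,000 random 64-bit points, 10 tables of 8 sampled bits each.
pts = [[random.randint(0, 1) for _ in range(64)] for _ in range(1000)]
proj, tabs = build_tables(pts, d=64, L=10, k=8)
print(query(pts[0], pts, proj, tabs, K=5))  # pts[0] itself should rank first
```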

10 LSH - Analysis
Family H of (r1, r2, p1, p2)-sensitive functions {h(·)}:
if dist(p,q) < r1 then PrH[h(q) = h(p)] ≥ p1
if dist(p,q) ≥ r2 then PrH[h(q) = h(p)] ≤ p2
where p1 > p2 and r1 < r2
LSH functions: gi(·) = (h1(·), ..., hk(·)), a concatenation of k functions from H
For a proper choice of k and L, a simpler problem, (r, ε)-Neighbor, and hence the actual problem, can be solved
Query time: O(d · n^(1/(1+ε))), where d is the dimension and n the data size
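The standard amplification argument behind these parameters (implicit on this slide): a point within distance r1 of q collides with q under a fixed gi with probability at least p1^k, so at least one of the L tables retrieves it with probability ≥ 1 − (1 − p1^k)^L. Choosing k = log_{1/p2} n and L = n^ρ with ρ = ln(1/p1) / ln(1/p2) makes this probability a constant, and for Hamming space ρ ≤ 1/(1+ε), which yields the stated query time.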

11 Experiments
Data sets:
Color images from the COREL Draw library (20,000 points, up to 64 dimensions)
Texture information of aerial photographs (270,000 points, 60 dimensions)
Evaluation: speed, miss ratio, and error (%) for various data sizes, dimensions, and values of K
Performance compared with the SR-tree (a spatial data structure)

12 Performance Measures
Speed: number of disk block accesses needed to answer the query, dominated by the number of hash tables accessed
Miss ratio: fraction of queries for which fewer than K points are found (for K-NNS)
Error: fractional error in the distance to the point found by LSH, relative to the true nearest-neighbor distance, averaged over the entire set of queries
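A hedged sketch of how the two quality measures could be computed (our own function names; the slide gives only verbal definitions):

```python
def miss_ratio(result_counts, K):
    """Fraction of queries that returned fewer than K points."""
    return sum(1 for c in result_counts if c < K) / len(result_counts)

def effective_error(lsh_dists, true_dists):
    """Average fractional error: how much farther, relative to the true
    nearest-neighbor distance, the LSH answer is, over all queries."""
    return sum(dl / dt - 1 for dl, dt in zip(lsh_dists, true_dists)) / len(lsh_dists)
```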

13 Speed vs. Data Size

14 Speed vs. Dimension

15 Speed vs. Nearest Neighbors

16 Speed vs. Error

17 Miss Ratio vs. Data Size

18 Conclusion
Better query time than spatial data structures
Scales well to higher dimensions and larger data sizes (sub-linear dependence)
Predictable running time
Extra storage overhead
Inefficient for data whose pairwise distances are concentrated around the average

19 Future Work
Investigate hybrid data structures obtained by merging tree-based and hash-based structures
Exploit the structure of the data set to systematically derive LSH functions
Explore other applications of LSH-type techniques to data mining

20 Questions?

