CS5112: Algorithms and Data Structures for Applications

Similar presentations
Song Intersection by Approximate Nearest Neighbours Michael Casey, Goldsmiths Malcolm Slaney, Yahoo! Inc.
Object Recognition Using Locality-Sensitive Hashing of Shape Contexts Andrea Frome, Jitendra Malik Presented by Ilias Apostolopoulos.
Lecture outline Nearest-neighbor search in low dimensions
Nonparametric Methods: Nearest Neighbors
Nearest Neighbor Search in High Dimensions Seminar in Algorithms and Geometry Mica Arie-Nachimson and Daniel Glasner April 2009.
Algorithmic High-Dimensional Geometry 1 Alex Andoni (Microsoft Research SVC)
Overcoming the L 1 Non- Embeddability Barrier Robert Krauthgamer (Weizmann Institute) Joint work with Alexandr Andoni and Piotr Indyk (MIT)
Theory of Computing Lecture 3 MAS 714 Hartmut Klauck.
Big Data Lecture 6: Locality Sensitive Hashing (LSH)
High Dimensional Search Min-Hashing Locality Sensitive Hashing
Searching on Multi-Dimensional Data
MIT CSAIL Vision interfaces Towards efficient matching with random hashing methods… Kristen Grauman Gregory Shakhnarovich Trevor Darrell.
Cse 521: design and analysis of algorithms Time & place T, Th pm in CSE 203 People Prof: James Lee TA: Thach Nguyen Book.
Linear Separators.
MMDS Secs Slides adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, October.
Similarity Search in High Dimensions via Hashing
Data Structures and Functional Programming Algorithms for Big Data Ramin Zabih Cornell University Fall 2012.
VLSH: Voronoi-based Locality Sensitive Hashing Sung-eui Yoon Authors: Lin Loi, Jae-Pil Heo, Junghwan Lee, and Sung-Eui Yoon KAIST
Prof. Ramin Zabih (CS) Prof. Ashish Raj (Radiology) CS5540: Computational Techniques for Analyzing Clinical Data.
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these.
CS 590M Fall 2001: Security Issues in Data Mining Lecture 3: Classification.
Similarity Search in High Dimensions via Hashing Aristides Gionis, Piotr Indyk and Rajeev Motwani Department of Computer Science Stanford University presented.
Nearest Neighbor Retrieval Using Distance-Based Hashing Michalis Potamias and Panagiotis Papapetrou supervised by Prof George Kollios A method is proposed.
DATA MINING LECTURE 7 Dimensionality Reduction PCA – SVD
Optimal Data-Dependent Hashing for Approximate Near Neighbors
Indexing Techniques Mei-Chen Yeh.
Approximation algorithms for large-scale kernel methods Taher Dameh School of Computing Science Simon Fraser University March 29 th, 2010.
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 2.
DATA MINING LECTURE 10 Classification k-nearest neighbor classifier Naïve Bayes Logistic Regression Support Vector Machines.
Nearest Neighbor Paul Hsiung March 16, Quick Review of NN Set of points P Query point q Distance metric d Find p in P such that d(p,q) < d(p’,q)
Finding Similar Items 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 10: Finding Similar Items Mining.
NEAREST NEIGHBORS ALGORITHM Lecturer: Yishay Mansour Presentation: Adi Haviv and Guy Lev 1.
Similarity Searching in High Dimensions via Hashing Paper by: Aristides Gionis, Piotr Indyk, Rajeev Motwani.
Clustering Prof. Ramin Zabih
Geometric Problems in High Dimensions: Sketching Piotr Indyk.
DATA MINING LECTURE 6 Sketching, Min-Hashing, Locality Sensitive Hashing.
Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality Piotr Indyk, Rajeev Motwani The 30 th annual ACM symposium on theory of computing.
11 Lecture 24: MapReduce Algorithms Wrap-up. Admin PS2-4 solutions Project presentations next week – 20min presentation/team – 10 teams => 3 days – 3.
Optimal Data-Dependent Hashing for Nearest Neighbor Search Alex Andoni (Columbia University) Joint work with: Ilya Razenshteyn.
Mining of Massive Datasets Edited based on Leskovec’s from
Mining Data Streams (Part 1)
School of Computing Clemson University Fall, 2012
SIMILARITY SEARCH The Metric Space Approach
Grassmannian Hashing for Subspace searching
Fast nearest neighbor searches in high dimensions Sami Sieranoja
Lecture 22: Linearity Testing Sparse Fourier Transform
Finding Similar Items: Locality Sensitive Hashing
Theory of Locality Sensitive Hashing
Sublinear Algorithmic Tools 3
Lecture 11: Nearest Neighbor Search
Instructor: Lilian de Greef Quarter: Summer 2017
K Nearest Neighbor Classification
Lecture 10: Sketching S3: Nearest Neighbor Search
Randomized Algorithms CS648
Lecture 7: Dynamic sampling Dimension Reduction
Near(est) Neighbor in High Dimensions
Lecture 16: Earth-Mover Distance
Randomized Algorithms
It Probably Works Jerome Vogt.
Preprocessing to accelerate 1-NN
Locality Sensitive Hashing
cse 521: design and analysis of algorithms
CS5112: Algorithms and Data Structures for Applications
CS5112: Algorithms and Data Structures for Applications
CS5112: Algorithms and Data Structures for Applications
Lecture 15: Least Square Regression Metric Embeddings
Minwise Hashing and Efficient Search
President’s Day Lecture: Advanced Nearest Neighbor Search
Data Mining CSCI 307, Spring 2019 Lecture 23
Lecture-Hashing.
Presentation transcript:

CS5112: Algorithms and Data Structures for Applications
Lecture 18: Locality-sensitive hashing
Ramin Zabih
Some content from: Piotr Indyk; Wikipedia/Google image search; J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

Today
- Prelim discussion
- Locality-sensitive hashing

Prelim discussion
- The final will be a lot like the prelim
  - But including material starting today
- Review session in class before the exam
- Regrades are done on the entire exam
  - Your score can go up or down
  - Clearly you should request one if we made an obvious mistake
- Regrades are on Gradescope; they open at 2PM and close a week later
- Staff available after lecture, informally

Prelim stats
- Overall people did quite well
  - 14 students got a perfect score
- Hardest questions:
  - 1.4 (collision handling data structures): 54% got this right
  - 4.5 (last 4 had c): 45%
  - Dynamic programming question: 72%

Curse of dimensionality
- High dimensions are not intuitive, and data becomes sparse
- Volume goes up so fast that an exponential amount of data is needed
- "Most of the volume is in the corners" theorem
  - Think of a 3D sphere inscribed in a box
- Particularly problematic for NN search
  - Quad trees require exponential space and time
  - k-d trees require exponential time
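To make "most of the volume is in the corners" concrete, here is a quick numeric illustration (my own, not from the slides): the fraction of the cube [-1,1]^d occupied by the inscribed unit ball collapses as d grows, so almost all of the cube's volume lies outside the sphere, in the corners.

```python
import math

# Fraction of the cube [-1,1]^d occupied by the inscribed unit ball.
# Ball volume: pi^(d/2) / Gamma(d/2 + 1); cube volume: 2^d.
for d in (2, 3, 10, 20, 50):
    ratio = math.pi ** (d / 2) / (math.gamma(d / 2 + 1) * 2 ** d)
    print(f"d={d:2d}  inscribed-ball fraction = {ratio:.2e}")
# d=2 gives ~0.79, d=10 gives ~2.5e-3, d=50 gives ~1.5e-28
```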

Locality sensitive hashing
- We can use hashing to solve 'near neighbors'
  - And thus nearest neighbors, if we really need this
- Ensuring collisions is the key idea!
  - Make nearby points hash to the same bucket, in a probabilistic manner
- By giving up on exactness we can overcome the curse of dimensionality
- This is an extremely important technique in practice

Basic idea of LSH
A family of hash functions is locality sensitive if, when we pick a random element h of that family, for any 2 points p, q:
- P[h(p) = h(q)] is 'high' if p is 'close' to q
- P[h(p) = h(q)] is 'low' if p is 'far' from q
Cool idea, assuming such hash functions exist! (they do)

How to use LSH for NN
- Pick l different hash functions at random: h_1, ..., h_l
- Put each data point p into the l buckets h_1(p), ..., h_l(p)
- Now for a query point q we look in the corresponding buckets h_1(q), ..., h_l(q)
- Return the closest point to q
How does the running time scale up with the dimension d?
- Nothing is exponential in d, which is awesome!
- Can prove the worst case is slightly better than O(nd)
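A minimal sketch of this multi-table scheme in Python (the class and method names are mine, not from the lecture). Points are assumed hashable (e.g. tuples or strings), and the hash functions are passed in, so any locality-sensitive family works; the bit-sampling family defined on the next slide is one concrete choice.

```python
from collections import defaultdict

class LSHIndex:
    """Sketch of the l-table LSH scheme: one hash table per hash function."""

    def __init__(self, hash_fns):
        self.hash_fns = hash_fns                        # h_1, ..., h_l
        self.tables = [defaultdict(list) for _ in hash_fns]

    def insert(self, p):
        # Each data point goes into one bucket per table.
        for h, table in zip(self.hash_fns, self.tables):
            table[h(p)].append(p)

    def query(self, q, dist):
        # Collect candidates from the l buckets q hashes to,
        # then return the closest one (None if all buckets are empty).
        candidates = set()
        for h, table in zip(self.hash_fns, self.tables):
            candidates.update(table[h(q)])
        return min(candidates, key=lambda p: dist(p, q), default=None)
```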

Do such functions exist?
- Generally LSH functions use random projections
- Main example: consider the hypercube, i.e.:
  - Points from {0,1}^d
  - Hamming distance D(p,q) = # positions on which p and q differ
- Define a hash function h by choosing a set S of k random coordinates, and setting h(p) = projection of p onto S

Example
Take d=10, p=0101110010, k=2, S={2,5}
Then h(p) = 11
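A sketch of this bit-sampling family in Python (function names are mine; coordinates are 1-indexed to match the example above):

```python
import random

def sample_bits_hash(S):
    # Project a binary string p onto the (1-indexed) coordinate set S.
    return lambda p: "".join(p[i - 1] for i in sorted(S))

# The example above: d=10, p=0101110010, k=2, S={2,5}
h = sample_bits_hash({2, 5})
print(h("0101110010"))                           # -> "11"

# Drawing a random member of the family for d=10, k=2:
h_rand = sample_bits_hash(random.sample(range(1, 11), 2))
```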

These hash functions are locality-sensitive
P[h(p) = h(q)] = (1 - D(p,q)/d)^k
We can vary the probability by changing k
[Plot: P[h(p) = h(q)] as a function of distance, for k=1 and k=2]
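A quick Monte Carlo check of this formula, with my own example values (drawing the k coordinates independently, i.e. with replacement, makes the formula exact):

```python
import random

def collision_prob(p, q, k, trials=100_000):
    # Estimate P[h(p) = h(q)] over random choices of the k coordinates.
    d, hits = len(p), 0
    for _ in range(trials):
        coords = [random.randrange(d) for _ in range(k)]
        hits += all(p[i] == q[i] for i in coords)
    return hits / trials

p, q = "0101110010", "0111110110"        # D(p,q) = 2, d = 10
for k in (1, 2, 4):
    print(k, collision_prob(p, q, k), (1 - 2 / 10) ** k)
# Empirical estimates land close to 0.8, 0.64, 0.41
```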

Random Projections + Binning
- Nearby points will often fall in the same bin
- Faraway points will rarely fall in the same bin

Other important LSH families
- We looked at Hamming distance via random subset projection
- There is a large literature on which distance functions support LSH, including lots of important cases and impossibility proofs
- Two particularly nice examples:
  - Jaccard similarity via MinHash
  - Angle similarity via SimHash

Jaccard similarity via MinHash
- Given a universe of items, we want to compare two subsets
- We need a hash function which maps them to the same bucket if they have a lot of overlap
- Hash functions are random re-orderings (permutations) of the universe
  - The hash of a subset is its minimum value under the permutation
- When there is a lot of overlap, the minimum value tends to be the same, and so the two subsets map to the same bucket
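A minimal MinHash sketch in Python (my own illustration; the collision probability of a single MinHash over random permutations equals the Jaccard similarity |A ∩ B| / |A ∪ B|):

```python
import random

def minhash_fn(universe, seed):
    # One MinHash function: rank elements by a random permutation;
    # the hash of a subset is the smallest rank among its elements.
    order = list(universe)
    random.Random(seed).shuffle(order)
    rank = {item: r for r, item in enumerate(order)}
    return lambda subset: min(rank[x] for x in subset)

A, B = set(range(0, 60)), set(range(30, 90))     # Jaccard = 30/90 = 1/3
hits = sum(minhash_fn(range(100), s)(A) == minhash_fn(range(100), s)(B)
           for s in range(10_000))
print(hits / 10_000)                             # close to 0.333
```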

Angle similarity via SimHash
- Angle similarity via projection onto a random vector
- VERY important for machine learning, etc.
- Pick a random unit vector r, and check if the two inputs x, y are on the same side of the half-space it defines
  - Compute the dot products <x,r> and <y,r>
  - Do they have the same sign?
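A minimal SimHash sketch in Python (my own example values). Gaussian components give a uniformly random direction, and normalizing r does not change the sign of the dot product, so the normalization is skipped; two vectors at angle theta get the same bit with probability 1 - theta/pi.

```python
import math
import random

def simhash_fn(d):
    # One SimHash bit: which side of a random hyperplane the point is on.
    r = [random.gauss(0, 1) for _ in range(d)]
    return lambda x: sum(xi * ri for xi, ri in zip(x, r)) >= 0

x, y = (1.0, 0.0), (1.0, 1.0)            # angle = 45 degrees = pi/4
bits = [simhash_fn(2) for _ in range(10_000)]
agree = sum(h(x) == h(y) for h in bits) / len(bits)
print(agree, 1 - (math.pi / 4) / math.pi)        # both close to 0.75
```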