Locality-sensitive hashing and its applications

Similar presentations
Object Recognition Using Locality-Sensitive Hashing of Shape Contexts Andrea Frome, Jitendra Malik Presented by Ilias Apostolopoulos.
Lecture outline Nearest-neighbor search in low dimensions
Equality Join R ⋈ S on R.A = S.B: Relation R has M pages, Relation S has N pages; Pr records per page, Ps records per page.
Nearest Neighbor Search in High Dimensions Seminar in Algorithms and Geometry Mica Arie-Nachimson and Daniel Glasner April 2009.
Clustering Paolo Ferragina Dipartimento di Informatica Università di Pisa This is a mix of slides taken from several presentations, plus my touch !
Near-Duplicates Detection
Latent Semantic Indexing (mapping onto a smaller space of latent concepts) Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 18.
Approximation, Chance and Networks Lecture Notes BISS 2005, Bertinoro March Alessandro Panconesi University La Sapienza of Rome.
Big Data Lecture 6: Locality Sensitive Hashing (LSH)
Searching on Multi-Dimensional Data
Data Structures and Functional Programming Algorithms for Big Data Ramin Zabih Cornell University Fall 2012.
1 Lecture 18 Syntactic Web Clustering CS
Similarity Search in High Dimensions via Hashing Aristides Gionis, Piotr Indyk and Rajeev Motwani Department of Computer Science Stanford University presented.
Cluster Analysis (1).
Nearest Neighbor Retrieval Using Distance-Based Hashing Michalis Potamias and Panagiotis Papapetrou supervised by Prof George Kollios A method is proposed.
Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.
Geometric Problems in High Dimensions: Sketching Piotr Indyk.
Nearest Neighbor Paul Hsiung March 16, Quick Review of NN Set of points P Query point q Distance metric d Find p in P such that d(p,q) < d(p’,q)
Locality Sensitive Hashing Basics and applications.
NEAREST NEIGHBORS ALGORITHM Lecturer: Yishay Mansour Presentation: Adi Haviv and Guy Lev 1.
Similarity Searching in High Dimensions via Hashing Paper by: Aristides Gionis, Piotr Indyk, Rajeev Motwani.
Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!
Geometric Problems in High Dimensions: Sketching Piotr Indyk.
DATA MINING LECTURE 6 Sketching, Min-Hashing, Locality Sensitive Hashing.
UNIT 5.  The related activities of sorting, searching and merging are central to many computer applications.  Sorting and merging provide us with a.
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.
KNN & Naïve Bayes Hongning Wang Today’s lecture Instance-based classifiers – k nearest neighbors – Non-parametric learning algorithm Model-based.
Basic Machine Learning: Clustering CS 315 – Web Search and Data Mining 1.
KNN & Naïve Bayes Hongning Wang
Big Data Infrastructure Week 9: Data Mining (4/4) This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States.
CS222: Principles of Data Management Lecture #4 Catalogs, Buffer Manager, File Organizations Instructor: Chen Li.
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Packing to fewer dimensions
New Characterizations in Turnstile Streams with Applications
Database Management System
Near Duplicate Detection
Lecture 22: Linearity Testing Sparse Fourier Transform
Database Applications (15-415) DBMS Internals- Part VII Lecture 16, October 25, 2016 Mohammad Hammoud.
Hashing Alexandra Stefan.
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Locality-sensitive hashing and its applications
Streaming & sampling.
Hashing Alexandra Stefan.
Chapter 12: Query Processing
Theory of Locality Sensitive Hashing
Sublinear Algorithmic Tools 3
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Lecture 11: Nearest Neighbor Search
Sublinear Algorithmic Tools 2
LSI, SVD and Data Management
Packing to fewer dimensions
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Quicksort analysis Bubble sort
Index Construction: sorting
Locality Sensitive Hashing
Hashing Alexandra Stefan.
Hashing Sections 10.2 – 10.3 Lecture 26 CS302 Data Structures
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
CS5112: Algorithms and Data Structures for Applications
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Compact routing schemes with improved stretch
Minwise Hashing and Efficient Search
Packing to fewer dimensions
President’s Day Lecture: Advanced Nearest Neighbor Search
Hashing.
CS222/CS122C: Principles of Data Management UCI, Fall 2018 Notes #04 Schema versioning and File organizations Instructor: Chen Li.
Three Essential Techniques for Similar Documents
CS222P: Principles of Data Management UCI, Fall 2018 Notes #04 Schema versioning and File organizations Instructor: Chen Li.
Presentation transcript:

Locality-sensitive hashing and its applications
Paolo Ferragina, University of Pisa (Algoritmi per "Information Retrieval")
ACM Kanellakis Award 2013

A frequent issue
Given U users, each described by a set of d features, the goal is to find the (largest) groups of similar users.
Features = personal data, preferences, purchases, navigational behavior, search behavior, followers/following, …
A feature is typically a numerical value: binary or real.
Similarity(u1, u2) is a function that, given the feature sets of users u1 and u2, returns a value in [0,1].
Users could also be Web pages (deduplication), products (recommendation), tweets/news/search results (visualization).
[Figure: three users as binary feature vectors, with pairwise similarity scores between them.]

Solution #1
Try all groups of users and, for each group, check the (average) similarity among all its users.
# Sim computations ≈ 2^U × U^2.
In the case of Facebook this is > 2^(1 billion) × (10^9)^2.
If we limit groups to have a size ≤ L users:
# Sim computations ≈ U^L × L^2.
(Even at 1 ns per Sim and L = 10, this is > (10^9)^10 / 10^9 secs > 10^70 years.)
No faster CPU/GPU, multi-cores, … could help!
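To make these orders of magnitude concrete, here is a small back-of-the-envelope computation (a sketch I added, not part of the original slides); it works in log10 space, since the raw counts are far too large to materialize.

```python
import math

U = 10**9          # number of users (Facebook-scale, as in the slide)
L = 10             # maximum group size
NS_PER_SIM = 1e-9  # one similarity computation per nanosecond (optimistic)

# log10 of the number of Sim computations for groups of size <= L: U^L * L^2
log10_sims = L * math.log10(U) + 2 * math.log10(L)

# log10 of the running time in years, at 1 ns per Sim computation
log10_secs = log10_sims + math.log10(NS_PER_SIM)
log10_years = log10_secs - math.log10(3600 * 24 * 365)

print(f"~10^{log10_sims:.0f} Sim computations, ~10^{log10_years:.0f} years")
# prints roughly 10^92 Sim computations and 10^76 years,
# consistent with the slide's bound of > 10^70 years
```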

Solution #2: introduce approximation
Interpret every user as a point in a d-dimensional space, and then apply a clustering algorithm such as K-means:
Pick K (e.g. K = 2) centroids at random.
Compute the clusters, re-determine the centroids, re-compute the clusters, … and repeat until the assignment has converged.
Each iteration takes ≈ K × U computations of Sim.
[Figure: a few K-means iterations on a 2-dimensional example with features f1, f2.]
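The following minimal K-means sketch (my illustration, not code from the slides) spells out the loop the slide animates: assign each point to its nearest centroid, recompute the centroids, and repeat until nothing moves.

```python
import numpy as np

def kmeans(points, K, iters=100, seed=0):
    """Minimal K-means: points is a (U, d) array; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=K, replace=False)]
    for _ in range(iters):
        # Assign every point to its closest centroid (K x U distance computations).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-determine each centroid as the mean of its cluster.
        new_centroids = np.array([
            points[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    return centroids, labels

# Toy usage: two well-separated blobs in d = 2 (features f1, f2).
pts = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, labels = kmeans(pts, K=2)
```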

Solution #2: a few considerations
Cost per iteration = K × U, and the number of iterations is typically small.
What about optimality? It is only locally optimal [recently, some researchers showed how to introduce some guarantees].
What about the Sim-cost? Comparing two users/points costs Θ(d) in time and space [notice that d may be in the millions or billions].
What about K? Iterating over K = 1, …, U costs about U^3, which is < U^L but still amounts to years of computation.
In T time we can manage U = T^(1/3) users; using an s-times faster CPU ≈ using s×T time on the old CPU, so we can manage (s×T)^(1/3) = s^(1/3) × T^(1/3) users.

Solution #3: introduce randomization
Generate a fingerprint for every user that is much shorter than d and allows us to transform similarity into equality of fingerprints.
It is randomized, and correct with high probability.
It guarantees local access to data, which is good for speed in a disk-based/distributed setting.
(Note: this could be implemented by sorting, instead of accessing all the buckets at random; there is a logarithmic term, but it is tiny.)
ACM Kanellakis Award 2013

A warm-up problem
Consider vectors p, q of d binary features.
Hamming distance D(p,q) = number of bit positions where p and q differ.
Define the hash function h by choosing a set I of k random coordinates:
h(p) = projection of vector p onto I's coordinates.
Example: pick I = {1,4} (k = 2); then h(p = 01011) = 01.
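A tiny sketch of this projection hash (my illustration; coordinate indices are 0-based here, while the slide's example uses 1-based positions):

```python
import random

def make_projection_hash(d, k, seed=42):
    """Choose a set I of k random coordinates out of d; h(p) = p projected onto I."""
    I = sorted(random.Random(seed).sample(range(d), k))
    def h(p):                      # p is a length-d bit-string such as "01011"
        return "".join(p[i] for i in I)
    return h, I

# The slide's example: I = {1,4} in 1-based positions is {0,3} in 0-based indexing.
h = lambda p: "".join(p[i] for i in (0, 3))
print(h("01011"))                  # -> "01"
```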

A key property
For a coordinate x picked uniformly at random: Pr[p[x] = q[x]] = (d − D(p,q)) / d = s, where s is the similarity between p and q.
Since h projects onto k coordinates picked independently at random, Pr[h(p) = h(q)] = s^k.
We can therefore vary this probability by changing k: the larger k, the smaller the chance that two distant vectors collide, i.e. fewer false positives.
[Figure: Pr[h(p) = h(q)] as a function of the distance, for k = 2 and k = 4; larger k gives a steeper drop.]
What about false negatives?
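A quick Monte Carlo check of the s^k formula (my sketch; the slide's s^k corresponds to picking the k coordinates independently, so the code samples them with replacement, and the example vectors are invented):

```python
import random

def estimate_collision_prob(p, q, k, trials=200_000, seed=1):
    """Estimate Pr[h(p) = h(q)] when h projects onto k independently chosen coordinates."""
    rng = random.Random(seed)
    d = len(p)
    hits = sum(
        all(p[i] == q[i] for i in rng.choices(range(d), k=k))
        for _ in range(trials)
    )
    return hits / trials

p = "0101100110"
q = "0111100010"                            # D(p,q) = 2, so s = (10 - 2) / 10 = 0.8
print(estimate_collision_prob(p, q, k=4))   # close to s**4 = 0.41
```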

Reiterate L times (larger L means fewer false negatives)
Repeat the k-projection L times, with independent coordinate sets I1, …, IL, obtaining h1(p), …, hL(p).
We set the sketch g(p) = ⟨h1(p), h2(p), …, hL(p)⟩.
Declare «p matches q» if at least one hi(p) = hi(q).
Example: set k = 2, L = 3, and let p = 01001 and q = 01101.
I1 = {3,4}: h1(p) = 00 and h1(q) = 10.
I2 = {1,3}: h2(p) = 00 and h2(q) = 01.
I3 = {1,5}: h3(p) = 01 and h3(q) = 01.
p and q are declared to match!
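The slide's example, replayed in a few lines of Python (my sketch; the slide's 1-based index sets are converted to 0-based indices):

```python
p, q = "01001", "01101"

# I1={3,4}, I2={1,3}, I3={1,5} in the slide's 1-based positions:
index_sets = [(2, 3), (0, 2), (0, 4)]

def h(x, I):
    return "".join(x[i] for i in I)

g_p = [h(p, I) for I in index_sets]   # ['00', '00', '01']
g_q = [h(q, I) for I in index_sets]   # ['10', '01', '01']

# p matches q because h3(p) == h3(q).
print(any(a == b for a, b in zip(g_p, g_q)))   # True
```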

Measuring the error probability
The sketch g() consists of L independent hashes hi, each colliding with probability s^k, so
Pr[g(p) matches g(q)] = 1 − Pr[hi(p) ≠ hi(q) for every i = 1, …, L] = 1 − (1 − s^k)^L.
[Figure: this probability plotted against the similarity s is an S-shaped curve, whose threshold sits around s ≈ (1/L)^(1/k).]
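A few lines to evaluate this S-curve (my sketch, with illustrative values of k and L that are not taken from the slides):

```python
def match_prob(s, k, L):
    """Probability that p and q are declared a match when Sim(p,q) = s."""
    return 1 - (1 - s**k) ** L

k, L = 5, 20
threshold = (1 / L) ** (1 / k)        # where the S-curve rises steeply, ~0.55 here
for s in (0.2, 0.4, threshold, 0.7, 0.9):
    print(f"s = {s:.2f}  ->  Pr[match] = {match_prob(s, k, L):.3f}")
```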

The case: groups of similar items
Buckets provide the candidate similar items: over the L rounds, «merge» the candidate sets whenever they share items.
If p ≈ q, then p and q fall into at least one common bucket.
No hash tables are actually needed: it suffices to SORT the items by their hi-value, round by round.
[Figure: items p and q hashed by h1, …, hL into tables T1, …, TL; buckets such as {p, q, …} and {q, z, …} are then merged.]
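One possible way to implement this grouping (my sketch, not the authors' code): for each of the L rounds, bucket the items by their hi-value (in practice this can be done by sorting) and merge the groups that share a bucket, here with a tiny union-find.

```python
from collections import defaultdict

def lsh_groups(items, hash_fns):
    """items: dict name -> bit-string; hash_fns: list of L functions h_i(bit-string) -> key."""
    parent = {x: x for x in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for h in hash_fns:                       # one round per h_i
        buckets = defaultdict(list)
        for name, vec in items.items():      # in practice: sort the pairs (h(vec), name)
            buckets[h(vec)].append(name)
        for bucket in buckets.values():      # items sharing a bucket end up in one group
            for other in bucket[1:]:
                union(bucket[0], other)

    groups = defaultdict(set)
    for x in items:
        groups[find(x)].add(x)
    return list(groups.values())

# Toy usage with the projections of the previous example (hypothetical data).
items = {"p": "01001", "q": "01101", "z": "11110"}
hash_fns = [lambda v, I=I: "".join(v[i] for i in I) for I in [(2, 3), (0, 2), (0, 4)]]
print(lsh_groups(items, hash_fns))           # p and q end up in the same group
```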

The case of an on-line query
Given a query w, find the similar indexed vectors: check the vectors stored in the buckets hj(w), for all j = 1, …, L.
[Figure: the query w is hashed by h1, …, hL into tables T1, …, TL, whose buckets contain e.g. {p, z, t}, {p, q}, {r, q}.]
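A minimal index for this on-line case (my sketch; names and data are invented): L dictionaries playing the role of the tables T1, …, TL, each keyed by one hi-value.

```python
from collections import defaultdict

class LSHIndex:
    """L hash tables T_1..T_L; table j is keyed by h_j(vector)."""

    def __init__(self, hash_fns):
        self.hash_fns = hash_fns
        self.tables = [defaultdict(list) for _ in hash_fns]

    def insert(self, name, vec):
        for h, table in zip(self.hash_fns, self.tables):
            table[h(vec)].append((name, vec))

    def query(self, w):
        # Candidates = union of the L buckets the query falls into.
        candidates = {}
        for h, table in zip(self.hash_fns, self.tables):
            for name, vec in table[h(w)]:
                candidates[name] = vec
        return candidates

# Toy usage with the same projections as before.
hash_fns = [lambda v, I=I: "".join(v[i] for i in I) for I in [(2, 3), (0, 2), (0, 4)]]
index = LSHIndex(hash_fns)
for name, vec in {"p": "01001", "q": "01101", "z": "11110"}.items():
    index.insert(name, vec)
print(index.query("01100"))    # the candidates sharing at least one bucket with w
```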

LSH versus K-means
What about optimality? K-means is only locally optimal [LSH finds the correct clusters with high probability].
What about the Sim-cost? K-means compares vectors of d components [LSH compares very short sketch vectors].
What about the cost per iteration? K-means typically requires few iterations, each costing K × U × d [LSH sorts U short items, with few scans].
What about K? K-means in principle has to iterate over K = 1, …, U [LSH does not need to know the number of clusters].
You could even apply K-means over the LSH-sketch vectors!

More applications

Sets & Jaccard similarity
The similarity between two sets SA and SB is measured by their Jaccard similarity:
Jaccard-sim(SA, SB) = |SA ∩ SB| / |SA ∪ SB|.
[Figure: two overlapping sets SA and SB.]

Compute Jaccard-sim(SA, SB) (Sec. 19.6)
Pick a random permutation of the universe [0, 2^64), e.g. x ↦ (a·x + b) mod 2^64, and apply it to the elements of Set A and of Set B.
Let mA = minimum of the permuted Set A and mB = minimum of the permuted Set B. Are these equal?
Lemma: Prob[mA = mB] is exactly Jaccard-sim(SA, SB).
Use 200 random permutations (taking the minimum under each), or pick the 200 smallest items under one random permutation, thus creating one 200-dim vector per set, and evaluate the Hamming distance between these vectors!
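A MinHash sketch along these lines (my illustration; it uses x ↦ (a·x + b) mod 2^64 with a odd so that the map is a bijection on the universe, which only approximates a truly random permutation, and the sets are invented):

```python
import random

M = 2**64

def minhash_signature(S, n_hashes=200, seed=0):
    """Return the n_hashes minima of S under maps x -> (a*x + b) mod 2^64, with a odd."""
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, M, 2), rng.randrange(M)) for _ in range(n_hashes)]
    return [min((a * x + b) % M for x in S) for a, b in coeffs]

def estimated_jaccard(sigA, sigB):
    """Fraction of positions where the two signatures agree."""
    return sum(x == y for x, y in zip(sigA, sigB)) / len(sigA)

SA = set(range(0, 1000))
SB = set(range(500, 1500))              # true Jaccard = 500 / 1500 = 1/3
sigA, sigB = minhash_signature(SA), minhash_signature(SB)
print(estimated_jaccard(sigA, sigB))    # close to 0.33
```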

Cosine distance between p and q
cos(α) = (p · q) / (||p|| · ||q||), where α is the angle between p and q.
Construct a random hyperplane through the origin, with a d-dimensional unit-norm normal vector r.
The sketch of a vector p is hr(p) = sign(p · r) = ±1; the sketch of q is hr(q) = sign(q · r) = ±1.
Lemma: Prob[hr(p) = hr(q)] = 1 − α/π.
Other distances can be handled with similar LSH families.
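A small numerical check of this random-hyperplane sketch (my illustration, using NumPy with invented vectors; by the lemma, the two sketches agree with probability 1 − α/π):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_planes = 100, 2000

p = rng.standard_normal(d)
q = p + 0.5 * rng.standard_normal(d)          # a vector correlated with p

angle = np.arccos(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))

# n_planes random unit-norm directions r; the sketches are the signs of the projections.
R = rng.standard_normal((n_planes, d))
R /= np.linalg.norm(R, axis=1, keepdims=True)
sketch_p, sketch_q = np.sign(R @ p), np.sign(R @ q)

print("empirical Pr[h_r(p) = h_r(q)]:", np.mean(sketch_p == sketch_q))
print("1 - angle/pi                :", 1 - angle / np.pi)
```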

The main theorem  It is correct with probability ≈ 0.3 Do exist nowadays many variants and improvements! The main theorem Whenever you have a LSH-function which maps close items to an equal value and far items to different values, then… Set k = (log n) / (log 1/p2) L=nr, with r = (ln p1 / ln p2 ) < 1 the LSH-construction described before guarantees Extra space ≈ nL = n1+r fingeprints, of size k Query time ≈ L = nr buckets accessed  It is correct with probability ≈ 0.3 Repeating 1/d times the LSH-construction described before the success probability becomes 1-d.