1 Locality-sensitive hashing and its applications
Paolo Ferragina, University of Pisa. Algoritmi per "Information Retrieval". ACM Kanellakis Award 2013.

2 A frequent issue
Given U users, each described by a set of d features, the goal is to find the (largest) group of similar users. Features = personal data, preferences, purchases, navigational behavior, search behavior, followers/following, … A feature is typically a numerical value: binary or real. Similarity(u1,u2) is a function that, given the feature sets of users u1 and u2, returns a value in [0,1]. Users could also be Web pages (dedup), products (recommendation), tweets/news/search results (visualization).

3 Solution #1
Try all groups of users and, for each group, check the (average) similarity among all its users. # Sim computations ≈ 2^U · U^2; in the case of Facebook this is > 2^(1 billion) · (10^9)^2. If we limit groups to have a size ≤ L users, # Sim computations ≈ U^L · L^2. (Even at 1 ns/sim and L=10, this takes > (10^9)^10 / 10^9 secs > 10^70 years.) No faster CPU/GPU, multi-cores, … could help!
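A quick back-of-the-envelope check of that last figure, as a minimal Python calculation (1 ns per similarity computation, U = 10^9 users, groups of size up to L = 10, all values taken from the slide):

    U, L = 10**9, 10
    sims = U**L * L**2                  # number of similarity computations, about U^L
    seconds = sims / 10**9              # at 1 ns per computation
    years = seconds / (365 * 24 * 3600)
    print(f"{years:.1e} years")         # far beyond the 10^70 years quoted above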

4 Solution #2: introduce approximation
Interpret every user as a point in a d-dimensional space (features f1, f2, …) and apply a clustering algorithm. K-means: pick K centroids at random (K=2 in the slide's figure), compute the clusters, re-determine the centroids, re-compute the clusters, and repeat until convergence. Each iteration takes ≈ K · U computations of Sim.
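A minimal sketch of this K-means loop in Python/NumPy (the function name, the convergence test, and the empty-cluster handling are illustrative assumptions, not from the slides):

    import numpy as np

    def kmeans(points, K=2, max_iters=100, seed=0):
        # points: U x d array, one user per row
        points = np.asarray(points, dtype=float)
        rng = np.random.default_rng(seed)
        # pick K centroids at random
        centroids = points[rng.choice(len(points), size=K, replace=False)]
        for _ in range(max_iters):
            # compute clusters: assign each point to its nearest centroid (about K*U Sim computations)
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # re-determine centroids as the mean of each cluster (empty clusters keep the old centroid)
            new_centroids = centroids.copy()
            for k in range(K):
                members = points[labels == k]
                if len(members) > 0:
                    new_centroids[k] = members.mean(axis=0)
            if np.allclose(new_centroids, centroids):
                break  # converged
            centroids = new_centroids
        return labels, centroids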

5 Solution #2: few considerations
Cost per iteration = K · U, and the number of iterations is typically small. What about optimality? It is only locally optimal [recently, some researchers showed how to introduce some guarantees]. What about the Sim-cost? Comparing users/points costs Θ(d) in time and space [notice that d may be in the millions or more]. What about K? Iterating over K = 1, …, U costs ≈ U^3 < U^L [still years]. In T time we can manage U = T^(1/3) users; using an s-times faster CPU ≈ using sT time on the old CPU, so we can manage (s·T)^(1/3) = s^(1/3) · T^(1/3) users.

6 Solution #3: introduce randomization
Generate a fingerprint for every user that is much shorter than d and allows us to transform similarity into equality of fingerprints. It is randomized, and correct with high probability. It guarantees local access to data, which is good for speed in a disk-based/distributed setting. [Note: this could be implemented by sorting, instead of accessing all the buckets at random; there is a logarithmic term, but it is tiny.]

7 A warm-up problem
Consider vectors p, q of d binary features. Hamming distance D(p,q) = # bits where p and q differ. Define the hash function h by choosing a set I of k random coordinates: h(p) = projection of vector p on I's coordinates. Example: pick I = {1,4} (k=2), then h(p = 01011) = 01.
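A possible rendering of this k-coordinate projection in Python (the helper names and the bit-string representation of vectors are assumptions for illustration):

    import random

    def pick_coordinates(d, k, seed=42):
        # choose the set I of k random coordinates, once, shared by all vectors
        rng = random.Random(seed)
        return sorted(rng.sample(range(d), k))

    def h(p, I):
        # p is a bit string of length d; h(p) is its projection on I's coordinates
        return "".join(p[i] for i in I)

    # example from the slide: I = {1, 4} in 1-based numbering, i.e. indices 0 and 3 here
    print(h("01011", [0, 3]))   # -> "01"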

8 A key property
For a random coordinate x, Pr[p[x] = q[x]] = (d - D(p,q)) / d. Hence, over the k coordinates of I, Pr[h(p) = h(q)] = ((d - D(p,q)) / d)^k = s^k, where s is the similarity between p and q. We can vary this probability by changing k: plotting Pr against the distance, a larger k (compare k=2 with k=4) makes the curve drop faster, hence fewer false positives. What about false negatives?
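A tiny numeric instance of the property (d, D(p,q) and k are example values assumed for illustration):

    d, D, k = 100, 20, 4          # 100 bits, 20 of which differ
    s = (d - D) / d               # similarity = fraction of equal coordinates = 0.8
    print(s ** k)                 # Pr[h(p) = h(q)] = s^k = 0.4096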

9 Smaller false negatives: reiterate L times
Repeat the k-projections L times, obtaining h1(p), …, hL(p), and set the sketch g(p) = <h1(p), h2(p), …, hL(p)>. Declare «p matches q» if at least one hi(p) = hi(q). A larger L gives smaller false negatives. Example: set k=2, L=3, and let p = … and q = 01101. With I1 = {3,4} we have h1(p) = 00 and h1(q) = 10; with I2 = {1,3} we have h2(p) = 00 and h2(q) = 01; with I3 = {1,5} we have h3(p) = 01 and h3(q) = 01. So p and q are declared to match!
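A sketch of the fingerprint g(p) and of the match test, building on the projection h above (helper names are illustrative):

    import random

    def make_projections(d, k, L, seed=0):
        # L independent sets I_1, ..., I_L of k random coordinates each
        rng = random.Random(seed)
        return [sorted(rng.sample(range(d), k)) for _ in range(L)]

    def g(p, projections):
        # sketch of p: the L k-bit projections h_1(p), ..., h_L(p)
        return ["".join(p[i] for i in I) for I in projections]

    def match(p, q, projections):
        # declare "p matches q" if at least one projection collides
        return any(hp == hq for hp, hq in zip(g(p, projections), g(q, projections)))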

10 Measuring the error probability
The sketch g() consists of L independent hashes hi, so Pr[g(p) matches g(q)] = 1 - Pr[hi(p) ≠ hi(q) for all i = 1, …, L] = 1 - (1 - s^k)^L. Plotted as a function of the similarity s, this probability is an S-shaped curve whose threshold falls around s ≈ (1/L)^(1/k).
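The formula can be evaluated directly; for instance, with k = 2 and L = 3 as in the previous example (a small Python check, the sample similarities are arbitrary):

    def match_probability(s, k, L):
        # Pr[g(p) matches g(q)] = 1 - (1 - s^k)^L
        return 1 - (1 - s**k) ** L

    for s in (0.2, 0.5, 0.8, 0.9):
        print(s, round(match_probability(s, k=2, L=3), 3))

which gives roughly 0.12, 0.58, 0.95 and 0.99: dissimilar pairs rarely match, similar pairs almost always do.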

11 The case: groups of similar items
Buckets provide the candidate similar items; «merge» similar sets over the L rounds if they share items. If p ≈ q, then they fall in at least one common bucket of the tables T1, T2, …, TL. In fact, no tables are needed: in each round, SORT the items by their projection hi(·), so that items with equal sketches (e.g. p,q,… in one bucket, q,z,… in another) become adjacent.
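A sketch of this sort-based grouping (plain Python; the data layout, a dict from item id to bit string, is an assumption for illustration):

    from itertools import groupby

    def candidate_groups(vectors, projections):
        # vectors: dict id -> bit string; projections: the L coordinate sets I_1, ..., I_L
        groups = []
        for I in projections:                                   # one round per projection
            key = lambda item: "".join(item[1][i] for i in I)   # h_i applied to the bit string
            for _, run in groupby(sorted(vectors.items(), key=key), key=key):
                ids = [ident for ident, _ in run]
                if len(ids) > 1:                                # a bucket with >= 2 items
                    groups.append(ids)                          # candidate similar items
        return groups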

12 The case of an on-line query
Given a query w, find the similar indexed vectors: check the vectors stored in the buckets hj(w), for all j = 1, …, L, of the tables T1, T2, …, TL.
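A minimal sketch of this query-time use, with the L tables kept as dictionaries (helper names and layout are illustrative assumptions):

    from collections import defaultdict

    def build_tables(vectors, projections):
        # one table per projection: bucket key h_j(p) -> list of item ids
        tables = [defaultdict(list) for _ in projections]
        for ident, p in vectors.items():
            for T, I in zip(tables, projections):
                T["".join(p[i] for i in I)].append(ident)
        return tables

    def query(w, tables, projections):
        # candidates = union of the buckets h_j(w), for j = 1, ..., L
        candidates = set()
        for T, I in zip(tables, projections):
            candidates.update(T.get("".join(w[i] for i in I), []))
        return candidates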

13 LSH versus K-means
What about optimality? K-means is locally optimal [LSH finds the correct clusters with high probability]. What about the Sim-cost? K-means compares vectors of d components [LSH compares very short (sketch) vectors]. What about the cost per iteration? K-means typically requires few iterations, each costing K · U · d [LSH sorts U short items, with a few scans]. What about K? In principle one has to iterate over K = 1, …, U [LSH does not need to know the number of clusters]. You could even apply K-means over the LSH-sketch vectors!

14 More applications

15 Sets & Jaccard similarity
Set similarity ≈ Jaccard similarity: Jaccard-sim(SA, SB) = |SA ∩ SB| / |SA ∪ SB|.

16 Compute Jaccard-sim(SA, SB)
[Sec. 19.6] Map the items of Set A and Set B into the universe [0, 2^64) and apply the same random permutation π(x) = (a·x + b) mod 2^64 to both sets. Let mA = minimum of the permuted Set A and mB = minimum of the permuted Set B: are these equal? Lemma: Prob[mA = mB] is exactly Jaccard-sim(SA, SB). Use 200 random permutations (taking the minimum of each), or pick the 200 smallest items from one random permutation; this creates one 200-dim vector per set, on which we evaluate the Hamming distance!
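A possible MinHash sketch along these lines (a large prime modulus stands in for the 2^64 universe so that a·x + b mod P is a true permutation; the 200-permutation count follows the slide, the rest of the names are illustrative):

    import random

    P = (1 << 61) - 1                      # a large Mersenne prime, stand-in for the 2^64 universe

    def make_permutations(n=200, seed=0):
        rng = random.Random(seed)
        return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(n)]

    def minhash_sketch(S, perms):
        # one minimum per permutation -> a 200-dim vector for the set S
        return [min((a * x + b) % P for x in S) for (a, b) in perms]

    def estimated_jaccard(sketch_a, sketch_b):
        # fraction of matching coordinates estimates Jaccard-sim(SA, SB)
        return sum(x == y for x, y in zip(sketch_a, sketch_b)) / len(sketch_a)

    perms = make_permutations()
    A, B = {1, 2, 3, 4}, {2, 3, 4, 5}      # true Jaccard similarity = 3/5
    print(estimated_jaccard(minhash_sketch(A, perms), minhash_sketch(B, perms)))   # close to 0.6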

17 Cosine distance between p and q
cos(α) = (p · q) / (||p|| · ||q||). Construct a random hyperplane r of d dimensions and unit norm. The sketch of a vector p is hr(p) = sign(p · r) = ±1, and the sketch of a vector q is hr(q) = sign(q · r) = ±1. Lemma: Pr[hr(p) = hr(q)] = 1 - θ(p,q)/π, where θ(p,q) is the angle between p and q. Similar constructions exist for other distances.
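A sketch of this random-hyperplane construction in NumPy (the number of hyperplanes m and the helper names are assumptions; recovering the cosine from the agreement rate uses the lemma above):

    import numpy as np

    def random_hyperplanes(d, m=128, seed=0):
        # m random directions of unit norm in d dimensions
        rng = np.random.default_rng(seed)
        r = rng.standard_normal((m, d))
        return r / np.linalg.norm(r, axis=1, keepdims=True)

    def sketch(p, planes):
        # one +/-1 bit per hyperplane: the sign of the projection of p on r
        return np.sign(planes @ p)

    def estimated_cosine(sk_p, sk_q):
        # fraction of equal signs estimates 1 - theta/pi, from which cos(theta) follows
        agree = np.mean(sk_p == sk_q)
        return np.cos(np.pi * (1 - agree))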

18 The main theorem
Many variants and improvements exist nowadays! Whenever you have an LSH-function which maps close items to an equal value (with probability ≥ p1) and far items to different values (colliding with probability ≤ p2), then setting k = (log n) / (log 1/p2) and L = n^ρ, with ρ = (ln p1) / (ln p2) < 1, the LSH-construction described before guarantees: extra space ≈ n·L = n^(1+ρ) fingerprints, each of size k; query time ≈ L = n^ρ buckets accessed; correctness with probability ≈ 0.3. Repeating the LSH-construction described before 1/δ times, the success probability becomes 1 - δ.
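A small worked computation of these parameters (the values of n, p1 and p2 are arbitrary examples, not from the slides):

    import math

    def lsh_parameters(n, p1, p2):
        # k = log n / log(1/p2),  rho = ln p1 / ln p2,  L = n^rho
        k = math.log(n) / math.log(1 / p2)
        rho = math.log(p1) / math.log(p2)
        L = n ** rho
        return math.ceil(k), rho, math.ceil(L)

    print(lsh_parameters(n=1_000_000, p1=0.9, p2=0.5))   # about 20 coordinates per projection, rho near 0.15, few buckets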


