1
Locality-sensitive hashing and its applications
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval". Locality-sensitive hashing and its applications. Paolo Ferragina, University of Pisa. ACM Kanellakis Award 2013.
2
A frequent issue
Given U users, each described by a set of d features, the goal is to find the (largest) group of similar users.
Features = personal data, preferences, purchases, navigational behavior, search behavior, followers/followees, …
A feature is typically a numerical value: binary or real.
Similarity(u1, u2) is a function that, given the feature sets of users u1 and u2, returns a value in [0,1] (e.g. 0.3, 0.7, 0.1).
Users could also be Web pages (dedup), products (recommendation), tweets/news/search results (visualization).
3
Solution #1
Try all groups of users and, for each group, check the (average) similarity among all its users.
# Sim computations ≈ 2^U · U^2. In the case of Facebook this is > 2^(1 billion) · (10^9)^2.
If we limit groups to a size of L users: # Sim computations ≈ U^L · L^2.
(Even at 1 ns per similarity and L = 10, it takes > (10^9)^10 / 10^9 secs > 10^70 years.)
No faster CPU/GPU, multi-cores, … could help!
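A quick back-of-the-envelope check of the parenthetical estimate above (a hypothetical sketch; it just redoes the arithmetic for U = 10^9 users, groups of L = 10, and 1 ns per similarity computation):

```python
U, L = 10**9, 10                     # 1 billion users, groups of 10 users
sims = U**L                          # (10^9)^10 similarity computations
seconds = sims // 10**9              # at 1 ns (= 10^-9 s) per computation
years = seconds / (3600 * 24 * 365)  # ~3e+73 years, indeed far beyond 10^70
print(f"{years:.1e} years")
```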
4
Solution #2: introduce approximation
Interpret every user as a point in a d-dim space, and then apply a clustering algorithm (K-means):
Pick K = 2 centroids at random.
Compute clusters.
Re-determine centroids.
Re-compute clusters.
Re-determine centroids.
Re-compute clusters. Converged!
Each iteration takes K · U computations of Sim.
[Figure: the iterations on a 2-dim example with features f1 and f2.]
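A minimal K-means sketch of the loop listed above (hypothetical helper code, not part of the slides; it assumes the users are already given as rows of a NumPy array of d features):

```python
import numpy as np

def kmeans(points, K=2, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    # Pick K centroids at random among the points
    centroids = points[rng.choice(len(points), size=K, replace=False)]
    for _ in range(iters):
        # Compute clusters: K*U distance computations
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-determine centroids as the mean of each cluster
        new_centroids = centroids.copy()
        for k in range(K):
            members = points[labels == k]
            if len(members) > 0:
                new_centroids[k] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    return labels, centroids
```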
5
Solution #2: few considerations
Cost per iteration = K · U; #iterations is typically small.
What about optimality? It is locally optimal [recently, some researchers showed how to introduce some guarantees].
What about the Sim-cost? Comparing users/points costs Θ(d) in time and space [notice that d may be in the millions or billions].
What about K? Iterating over K = 1, …, U costs ≈ U^3 (< U^L, but still years!).
In T time we can manage U = T^(1/3) users; using an s-times faster CPU ≈ using s·T time on an old CPU, so we can manage (s·T)^(1/3) = s^(1/3) · T^(1/3) users.
6
Solution #3: introduce randomization
Generate a fingerprint for every user that is much shorter than d and allows us to transform similarity into equality of fingerprints.
It is randomized, and correct with high probability.
It guarantees local access to data, which is good for speed in a disk-based/distributed setting.
[Speaker note, translated: this could be implemented by sorting, instead of accessing all the buckets at random. It does have the logarithmic term, but it is tiny.]
ACM Kanellakis Award 2013
7
A warm-up problem
A warm-up problem. Consider vectors p, q of d binary features.
Hamming distance D(p,q) = #bits where p and q differ.
Define the hash function h by choosing a set I of k random coordinates: h(p) = projection of vector p on I's coordinates.
Example: pick I = {1,4} (k = 2); then h(p = 01011) = 01.
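A minimal sketch of the projection hash h (hypothetical code; bit-vectors are plain strings of '0'/'1', and coordinates are 0-based rather than the 1-based ones of the slide's example):

```python
import random

def make_projection(d, k, seed=None):
    """h(p) = projection of the d-bit vector p on a set I of k random coordinates."""
    rng = random.Random(seed)
    I = sorted(rng.sample(range(d), k))
    return lambda p: "".join(p[i] for i in I)

h = make_projection(d=5, k=2, seed=1)
print(h("01011"))   # two of p's bits, in the order fixed by I
```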
8
A key property
A key property: comparing p versus q.
Pr[picking a coordinate x s.t. p[x] = q[x]] = (d − D(p,q))/d.
Over the k randomly chosen coordinates: Pr[h(p) = h(q)] = ((d − D(p,q))/d)^k = s^k, where s is the similarity between p and q.
We can vary this probability by changing k.
[Plot: Pr[h(p) = h(q)] vs. distance, for k = 2 and k = 4.]
Larger k ⇒ smaller false-positive probability. What about false negatives?
9
Smaller False Negatives
Larger L ⇒ smaller false-negative probability: reiterate L times.
Repeat L times the k-projections hi(p), and set g(p) = <h1(p), h2(p), …, hL(p)>; g(p) is the sketch of p.
Declare «p matches q» if at least one hi(p) = hi(q).
Example: we set k = 2, L = 3; let p = … and q = 01101.
I1 = {3,4}: we have h1(p) = 00 and h1(q) = 10.
I2 = {1,3}: we have h2(p) = 00 and h2(q) = 01.
I3 = {1,5}: we have h3(p) = 01 and h3(q) = 01.
p and q are declared to match!
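A hypothetical sketch of the full construction: g(p) concatenates L independent k-bit projections, and two items are declared to match if at least one projection coincides (the example strings are made up, since the value of p is not readable from the slide):

```python
import random

def make_sketcher(d, k, L, seed=0):
    rng = random.Random(seed)
    # L independent sets I_1, ..., I_L of k random coordinates each
    Is = [sorted(rng.sample(range(d), k)) for _ in range(L)]
    def g(p):
        return tuple("".join(p[i] for i in I) for I in Is)
    return g

def matches(gp, gq):
    # "p matches q" if at least one h_i(p) == h_i(q)
    return any(hp == hq for hp, hq in zip(gp, gq))

g = make_sketcher(d=5, k=2, L=3)
print(matches(g("01001"), g("01101")))
```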
10
Measuring the error probability
The function g() consists of L independent hashes hi.
Pr[g(p) matches g(q)] = 1 − Pr[hi(p) ≠ hi(q) for all i = 1, …, L] = 1 − (1 − s^k)^L.
[Plot: this probability as a function of the similarity s; it jumps from ≈0 to ≈1 around s ≈ (1/L)^(1/k).]
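A tiny numeric sketch (hypothetical values of k and L) of the S-shaped curve this formula produces as the similarity s grows:

```python
def match_probability(s, k, L):
    """Pr[g(p) matches g(q)] = 1 - (1 - s^k)^L, given Pr[h_i(p)=h_i(q)] = s^k."""
    return 1 - (1 - s**k) ** L

for s in (0.2, 0.4, 0.6, 0.8):
    print(s, round(match_probability(s, k=4, L=20), 3))
# dissimilar pairs almost never match, very similar pairs almost always do
```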
11
The case: Groups of similar items
Buckets provide the candidate similar items: if p ≈ q, then they fall into at least one common bucket.
«Merge» similar sets over the L rounds if they share items.
No tables needed! Just SORT the pairs <hi(x), x>: items with equal fingerprints (e.g. buckets {p, q, …}, {q, z, …}) become adjacent.
[Figure: p and q hashed by h1, h2, …, hL into tables T1, T2, …, TL.]
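A hypothetical sketch of the table-free variant: for each of the L hash functions, the pairs <hi(x), x> are sorted so that items with equal fingerprints become adjacent, and every run of length > 1 is a bucket of candidate similar items to be merged.

```python
from itertools import groupby

def candidate_buckets(items, sketchers):
    """items: dict name -> bit-string; sketchers: the L projection hashes h_i."""
    buckets = []
    for h in sketchers:                                  # one sorting round per h_i
        pairs = sorted((h(p), name) for name, p in items.items())
        for _, run in groupby(pairs, key=lambda t: t[0]):
            names = [name for _, name in run]
            if len(names) > 1:                           # candidate similar items
                buckets.append(names)
    return buckets                                        # to be merged across rounds
```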
12
The case of on-line query
Given a query w, find the similar indexed vectors: check the vectors in the buckets hj(w), for all j = 1, …, L.
[Figure: w hashed by h1, h2, …, hL into tables T1, T2, …, TL; its buckets contain the candidates, e.g. {p, z, t}, {p, q}, {r, q}.]
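A hypothetical sketch of the online variant with explicit tables: Tj maps each fingerprint hj(x) to the bucket of indexed items sharing it, and a query w only inspects its own L buckets.

```python
from collections import defaultdict

def build_tables(items, sketchers):
    tables = [defaultdict(list) for _ in sketchers]       # T_1, ..., T_L
    for name, p in items.items():
        for T, h in zip(tables, sketchers):
            T[h(p)].append(name)                           # put p into bucket h_j(p)
    return tables

def query(w, tables, sketchers):
    # Check the vectors in the buckets h_j(w), for all j = 1, ..., L
    candidates = set()
    for T, h in zip(tables, sketchers):
        candidates.update(T.get(h(w), []))
    return candidates
```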
13
LSH versus K-means.
What about optimality? K-means is locally optimal [LSH finds the correct clusters with high probability].
What about the Sim-cost? K-means compares vectors of d components [LSH compares very short (sketch) vectors].
What about the cost per iteration? Typically K-means requires few iterations, each costing K · U · d [LSH sorts U short items, few scans].
What about K? In principle one has to iterate over K = 1, …, U [LSH does not need to know the number of clusters].
You could also apply K-means over the LSH-sketch vectors!
14
Document duplication (exact or approximate)
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Document duplication (exact or approximate)
15
Duplicate documents The web is full of duplicated content
Sec. 19.6. Duplicate documents. The web is full of duplicated content:
few cases of exact duplicates;
many cases of near duplicates, e.g. the "Last modified" date is the only difference between two copies of a page.
16
Exact-Duplicate Detection
Obvious techniques:
Checksum – no worst-case collision-probability guarantees.
MD5 – cryptographically secure string hashes, but relatively slow.
Karp-Rabin's scheme:
rolling hash – split the doc into many pieces;
algebraic technique – arithmetic on primes;
efficient, and with other nice properties…
17
Karp-Rabin Fingerprints
Consider an m-bit string A = 1 a1 a2 … am (with a leading 1).
Basic values: choose a prime p in the universe U, such that 2p fits in a few memory words (hence U ≈ 2^64).
Fingerprint: f(A) = A mod p.
Nice property: if B = 1 a2 … am am+1 (the next window, again with the leading 1), then
f(B) = [ 2·(A − 2^m − a1·2^(m−1)) + 2^m + am+1 ] mod p,
so the fingerprint can be updated in O(1) arithmetic operations.
Prob[false hit between A and B] = Prob[p divides (A − B)] = #div(A − B) / #primes(U) ≈ log(A + B) / #primes(U) ≈ (m log U) / U.
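A hypothetical sketch of the rolling fingerprint: the first window is fingerprinted directly, and every shift by one bit is then updated in O(1) modular operations using the formula above.

```python
def karp_rabin_fingerprints(bits, m, p):
    """f(A) = (1 a_i ... a_{i+m-1}) mod p for every length-m window of bits."""
    pow_m  = pow(2, m, p)        # 2^m  mod p: contribution of the leading 1
    pow_m1 = pow(2, m - 1, p)    # 2^(m-1) mod p: contribution of the outgoing bit
    f = 1
    for b in bits[:m]:           # first window, computed directly (Horner's rule)
        f = (2 * f + b) % p
    fps = [f]
    for i in range(m, len(bits)):
        a_out, a_in = bits[i - m], bits[i]
        # f(B) = [ 2*(f(A) - 2^m - a_out*2^(m-1)) + 2^m + a_in ] mod p
        f = (2 * (f - pow_m - a_out * pow_m1) + pow_m + a_in) % p
        fps.append(f)
    return fps

print(karp_rabin_fingerprints([1, 0, 1, 1, 0, 1], m=3, p=2**61 - 1))
```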
18
Near-Duplicate Detection
Problem: given a large collection of documents, identify the near-duplicate documents.
Web search engines face a proliferation of near-duplicate documents:
legitimate – mirrors, local copies, updates, …
malicious – spam, spider traps, dynamic URLs, …
mistaken – spider errors.
About 30% of web pages were found to be near-duplicates [1997].
19
Shingling: from docs to sets of shingles
Dissect the document into q-grams (shingles).
T = "I live and study in Pisa, …". If we set q = 3, the 3-grams are: <I live and> <live and study> <and study in> <study in Pisa> …
Represent documents by sets of hashed shingles: Doc A → SA, Doc B → SB.
The near-duplicate document detection problem reduces to set intersection among sets of integers (the hashed shingles).
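A hypothetical sketch of word-level shingling: the document is cut into q-grams of consecutive words and each shingle is hashed to a 64-bit integer, so two documents can be compared by intersecting two sets of integers.

```python
import hashlib

def shingles(text, q=3):
    """Set of hashed q-grams (shingles) of the document."""
    words = text.split()
    grams = {" ".join(words[i:i + q]) for i in range(len(words) - q + 1)}
    # map every shingle to a 64-bit integer
    return {int.from_bytes(hashlib.sha1(g.encode()).digest()[:8], "big") for g in grams}

SA = shingles("I live and study in Pisa and I like it")
SB = shingles("I live and study in Rome and I like it")
print(len(SA & SB), len(SA | SB))   # shared vs. total shingles
```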
20
Desiderata
Storage: only small sketches of each document.
Computation: the fastest possible.
Stream processing: once the sketch is computed, the source is unavailable.
Error guarantees: at this problem scale, small biases have a large impact; we need formal guarantees – heuristics will not do.
21
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
More applications
22
Sets & Jaccard similarity
Set similarity = Jaccard similarity: Sim(SA, SB) = |SA ∩ SB| / |SA ∪ SB|.
23
Compute Jaccard-sim(SA, SB)
Sec. 19.6. Compute Jaccard-sim(SA, SB).
Take sets A and B over the universe [0, 2^64). Apply a random permutation π(x) = (a·x + b) mod 2^64 to both sets, and take the minimum of each permuted set, mA and mB. Are these equal?
Lemma: Prob[mA = mB] is exactly Jaccard-sim(SA, SB).
Use 200 random permutations (taking the minimum under each), or pick the 200 smallest items from one random permutation; this creates one 200-dim vector per set, and similarity is evaluated as a Hamming-style comparison between the two arrays of integers!
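A hypothetical min-hash sketch along the lines of the slide: each set of 64-bit shingle IDs is mapped by 200 random linear permutations of the universe, the per-permutation minima form a 200-dim vector, and the fraction of equal coordinates estimates the Jaccard similarity (a is kept odd so that x ↦ (a·x + b) mod 2^64 is really a permutation).

```python
import random

UNIVERSE = 2**64                                   # shingle IDs live in [0, 2^64)

def make_permutations(t=200, seed=0):
    rng = random.Random(seed)
    # pi(x) = (a*x + b) mod 2^64, with a odd so that pi is a bijection
    return [(rng.randrange(1, UNIVERSE, 2), rng.randrange(UNIVERSE)) for _ in range(t)]

def minhash(S, perms):
    return [min((a * x + b) % UNIVERSE for x in S) for a, b in perms]

def estimated_jaccard(sketch1, sketch2):
    # Prob[minima coincide] = Jaccard-sim, so count how many coordinates agree
    return sum(m1 == m2 for m1, m2 in zip(sketch1, sketch2)) / len(sketch1)

perms = make_permutations()
SA, SB = {3, 7, 11, 19, 23}, {3, 7, 11, 19, 42}    # toy sets; true Jaccard = 4/6
print(estimated_jaccard(minhash(SA, perms), minhash(SB, perms)))
```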
24
Cosine distance btw p and q
Cosine distance between p and q: cos(α) = (p · q) / (||p|| · ||q||).
Construct a random hyperplane, represented by a d-dim vector r of unit norm.
The sketch of a vector p is hr(p) = sign(p · r) = ±1; the sketch of a vector q is hr(q) = sign(q · r) = ±1.
Lemma: Prob[hr(p) = hr(q)] = 1 − θ(p,q)/π, where θ(p,q) is the angle between p and q.
Other distances admit similar LSH families.
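A hypothetical sketch of the random-hyperplane hash: r is a random d-dimensional direction of unit norm, each vector is reduced to the sign of its dot product with r, and stacking many independent hyperplanes yields a short ±1 sketch whose fraction of agreements estimates 1 − θ(p,q)/π.

```python
import numpy as np

def make_hyperplane_hashes(d, n_planes=64, seed=0):
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((n_planes, d))
    R /= np.linalg.norm(R, axis=1, keepdims=True)    # unit-norm random directions
    # sketch(p) = [ sign(p . r) for every hyperplane r ], values in {+1, -1}
    return lambda p: np.where(R @ p >= 0, 1, -1)

sketch = make_hyperplane_hashes(d=100)
p, q = np.random.default_rng(1).standard_normal((2, 100))
agreement = np.mean(sketch(p) == sketch(q))          # estimates 1 - theta(p,q)/pi
```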
25
The main theorem
Many variants and improvements exist nowadays!
The main theorem: whenever you have an LSH family that maps close items to equal values with probability ≥ p1 and far items to equal values with probability ≤ p2 (with p1 > p2), then…
Set k = (log n) / (log 1/p2) and L = n^ρ, with ρ = (ln p1) / (ln p2) < 1.
The LSH construction described before guarantees:
extra space ≈ n·L = n^(1+ρ) fingerprints, each of size k;
query time ≈ L = n^ρ buckets accessed;
it is correct with probability ≈ 0.3.
Repeating the LSH construction 1/δ times, the success probability becomes 1 − δ.
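A hypothetical numeric instantiation of the theorem's parameters, just to see the orders of magnitude (the values p1 = 0.8, p2 = 0.4 and n = 10^6 are made up for illustration):

```python
import math

n, p1, p2 = 10**6, 0.8, 0.4
rho = math.log(p1) / math.log(p2)                 # rho = ln p1 / ln p2 < 1
k = math.ceil(math.log(n) / math.log(1 / p2))     # k = log n / log(1/p2)
L = math.ceil(n ** rho)                           # L = n^rho
print(rho, k, L)   # rho ~ 0.24, k = 16, L = 29: space ~ n*L fingerprints, query ~ L buckets
```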