Theory of Locality Sensitive Hashing

Similar presentations
Lecture outline Nearest-neighbor search in low dimensions

Similarity and Distance Sketching, Locality Sensitive Hashing
Data Mining of Very Large Data
Near-Duplicates Detection
Latent Semantic Indexing (mapping onto a smaller space of latent concepts) Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 18.
High Dimensional Search Min-Hashing Locality Sensitive Hashing
MMDS Secs Slides adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, October.
Randomized / Hashing Algorithms
Min-Hashing, Locality Sensitive Hashing Clustering
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these.
Association Rule Mining
My Favorite Algorithms for Large-Scale Data Mining
Distance Measures LS Families of Hash Functions S-Curves
Near Duplicate Detection
Finding Similar Items. Set Similarity Problem: Find similar sets. Motivation: Many things can be modeled/represented as sets Applications: –Face Recognition.
Finding Similar Items.
1 Locality-Sensitive Hashing Basic Technique Hamming-LSH Applications.
Finding Near Duplicates (Adapted from slides and material from Rajeev Motwani and Jeff Ullman)
1 Low-Support, High-Correlation Finding Rare but Similar Items Minhashing Locality-Sensitive Hashing.
Finding Similar Items 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 10: Finding Similar Items Mining.
David Luebke 1 10/25/2015 CS 332: Algorithms Skip Lists Hash Tables.
1 Low-Support, High-Correlation Finding Rare but Similar Items Minhashing.
DATA MINING LECTURE 6 Similarity and Distance Sketching, Locality Sensitive Hashing.
Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!
DATA MINING LECTURE 6 Sketching, Min-Hashing, Locality Sensitive Hashing.
UNIT 5.  The related activities of sorting, searching and merging are central to many computer applications.  Sorting and merging provide us with a.
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.
Randomized / Hashing Algorithms Shannon Quinn (with thanks to William Cohen of Carnegie Mellon University, and J. Leskovec, A. Rajaraman, and J. Ullman.
DATA MINING LECTURE 6 Sketching, Locality Sensitive Hashing.
Jeffrey D. Ullman Stanford University. 2  Generalized LSH is based on some kind of “distance” between points.  Similar points are “close.”  Example:
Shingling Minhashing Locality-Sensitive Hashing
Big Data Infrastructure
Locality-sensitive hashing and its applications
CS276A Text Information Retrieval, Mining, and Exploitation
Linear Algebra Review.
Relations, Functions, and Matrices
Great Theoretical Ideas in Computer Science
15-499:Algorithms and Applications
Finding Similar Items Jeffrey D. Ullman Application: Similar Documents
Markov Chains Mixing Times Lecture 5
Clustering Algorithms
PageRank Random Surfers on the Web Transition Matrix of the Web Dead Ends and Spider Traps Topic-Specific PageRank Jeffrey D. Ullman Stanford University.
Chapter 2 Sets and Functions.
Streaming & sampling.
Sketching, Locality Sensitive Hashing
Haim Kaplan and Uri Zwick
Finding Similar Items: Locality Sensitive Hashing
Relational Algebra Chapter 4, Part A
More LSH Application: Entity Resolution Application: Similar News Articles Distance Measures LS Families of Hash Functions LSH for Cosine Distance Jeffrey.
More LSH LS Families of Hash Functions LSH for Cosine Distance Special Approaches for High Jaccard Similarity Jeffrey D. Ullman Stanford University.
Lecture 11: Nearest Neighbor Search
Fitting Curve Models to Edges
Clustering CS246: Mining Massive Datasets
Relational Algebra Chapter 4, Sections 4.1 – 4.2
Locality Sensitive Hashing
Searching Similar Segments over Textual Event Sequences
Copyright © Cengage Learning. All rights reserved.
CS5112: Algorithms and Data Structures for Applications
Minwise Hashing and Efficient Search
President’s Day Lecture: Advanced Nearest Neighbor Search
CS 3343: Analysis of Algorithms
Near Neighbor Search in High Dimensional Data (1)
Three Essential Techniques for Similar Documents
Locality-Sensitive Hashing
Presentation transcript:

Theory of Locality Sensitive Hashing CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu

Recap: Finding similar documents. Goal: given a large number N of text documents (N in the millions or billions), find the pairs that are "near duplicates". Application: detecting mirror and approximate-mirror sites/pages, so that a web search does not show both. Problems: many small pieces of one document can appear out of order in another; there are too many documents to compare all pairs; and the documents are so large and so numerous that they cannot fit in main memory.

Recap: 3 Essential Steps. Shingling: convert documents to large sets of items. Min-hashing: convert the large sets into short signatures while preserving similarity. Locality-sensitive hashing: focus on pairs of signatures likely to come from similar documents.

Recap: The Big Picture. The pipeline: Document → Shingling → the set of strings of length k that appear in the document → Min-hashing → signatures, short integer vectors that represent the sets and reflect their similarity → Locality-sensitive hashing → candidate pairs, those pairs of signatures that we need to test for similarity.

Recap: Shingles. A k-shingle (or k-gram) is a sequence of k tokens that appears in the document. Example: for k = 2 and D1 = abcab, the set of 2-shingles is C1 = S(D1) = {ab, bc, ca}. Represent a document by the set of hash values of its k-shingles. A natural document similarity measure is then the Jaccard similarity: Sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|.
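
To make this concrete, here is a minimal Python sketch (not from the lecture; the function names are illustrative) that computes 2-shingles and the Jaccard similarity:

```python
def shingles(doc: str, k: int = 2) -> set:
    """The set of k-shingles (length-k substrings) of a document."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

def jaccard(c1: set, c2: set) -> float:
    """Jaccard similarity |C1 intersect C2| / |C1 union C2|."""
    return len(c1 & c2) / len(c1 | c2)

C1 = shingles("abcab")  # {'ab', 'bc', 'ca'}, as on the slide
C2 = shingles("abcb")   # {'ab', 'bc', 'cb'}
print(jaccard(C1, C2))  # |{ab, bc}| / |{ab, bc, ca, cb}| = 0.5
```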

Recap: Min-hashing. The probability that two columns receive the same min-hash value equals the similarity of the underlying documents: Pr[h(C1) = h(C2)] = Sim(D1, D2). [Figure: a 0/1 input matrix, a random permutation π of its rows, and the resulting signature matrix M, whose entry for a column is the first row, in permuted order, in which that column has a 1.]
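
A small illustrative sketch of min-hashing (again not the lecture's code): instead of materializing row permutations, it uses random affine hash functions h(x) = (a·x + b) mod P, a standard substitute:

```python
import random

P = 2_147_483_647  # a large prime; (a*x + b) % P simulates a random permutation

def minhash_signature(items: set, n: int = 200, seed: int = 1) -> list:
    rng = random.Random(seed)
    funcs = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(n)]
    # hash() is stable within a single Python run, which is all we need here.
    return [min((a * hash(x) + b) % P for x in items) for a, b in funcs]

sig1 = minhash_signature({"ab", "bc", "ca"})
sig2 = minhash_signature({"ab", "bc", "cb"})
# The fraction of agreeing positions estimates the Jaccard similarity (0.5 here).
print(sum(s == t for s, t in zip(sig1, sig2)) / len(sig1))
```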

Recap: LSH. Hash the columns of the signature matrix M: similar columns are likely to hash to the same bucket. Divide M into b bands of r rows each; the candidate column pairs are those that hash to the same bucket for at least one band. [Figure: the matrix M cut into b bands of r rows, each band hashed into its own set of buckets, with similarity threshold s.]
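
A minimal banding sketch (illustrative names, assuming each signature has exactly b·r entries): hash each band of r rows to a bucket, and report every pair of documents that shares a bucket in at least one band.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(signatures: dict, b: int, r: int) -> set:
    """signatures maps doc id -> a length-(b*r) min-hash signature."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)  # a fresh hash table per band
        for doc, sig in signatures.items():
            buckets[tuple(sig[band * r:(band + 1) * r])].append(doc)
        for docs in buckets.values():  # every pair sharing a bucket
            candidates.update(combinations(sorted(docs), 2))
    return candidates
```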

Recap: The S-Curve. The S-curve is where the "magic" happens. Remember: the probability of equal hash values equals the similarity, Pr[h(C1) = h(C2)] = sim(D1, D2); that straight line is what one band gives you. What we want is a step function at a threshold t: no chance of becoming a candidate if the similarity s < t, and probability 1 if s > t. How do we approximate this? By picking r and b!

How Do We Make the S-curve? Remember: b bands of r rows each, and columns C1 and C2 have similarity s. Pick any band (r rows). The probability that all rows in the band are equal is s^r, so the probability that some row in the band is unequal is 1 - s^r. The probability that no band is identical is (1 - s^r)^b, and therefore the probability that at least one band is identical, making the pair a candidate, is 1 - (1 - s^r)^b.
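
The resulting curve is a one-liner to evaluate; a small sketch with an arbitrary choice of r and b:

```python
def candidate_prob(s: float, r: int, b: int) -> float:
    """Probability that a pair of similarity s becomes a candidate."""
    return 1 - (1 - s ** r) ** b

for s in (0.2, 0.4, 0.6, 0.8):
    print(s, round(candidate_prob(s, r=5, b=10), 4))
# Low-similarity pairs are almost never candidates; high-similarity pairs almost always are.
```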

S-curves as a Function of b and r. [Figure: plots of the probability of sharing a bucket, 1 - (1 - s^r)^b, against similarity: r = 1..10 with b = 1; r = 1 with b = 1..10; r = 5 with b = 1..50; and r = 10 with b = 1..50.]

Theory of LSH. (The same pipeline as before: Document → Shingling → k-shingles → Min-hashing → signatures → Locality-sensitive hashing → candidate pairs.)

Theory of LSH. We have used LSH to find similar documents; in reality, we found columns with high Jaccard similarity in large sparse matrices, e.g., customer/item purchase histories. Can we use LSH for other distance measures, e.g., Euclidean distance or cosine distance? Let's generalize what we've learned!

Families of Hash Functions. For min-hash signatures, we got one min-hash function for each permutation of the rows; that is an example of a family of hash functions. A "hash function" here is any function that takes two elements and says whether or not they are "equal"; shorthand: h(x) = h(y) means "h says x and y are equal". A family of hash functions is any set of such hash functions, typically a set of related functions generated by some mechanism, and we should be able to efficiently pick a hash function at random from the family.

Locality-Sensitive (LS) Families. Suppose we have a space S of points with a distance measure d. A family H of hash functions is said to be (d1, d2, p1, p2)-sensitive if for any x and y in S: if d(x, y) < d1, then the probability over all h ∈ H that h(x) = h(y) is at least p1; and if d(x, y) > d2, then the probability over all h ∈ H that h(x) = h(y) is at most p2.

A (d1, d2, p1, p2)-sensitive family. [Figure: Pr[h(x) = h(y)] as a function of d(x, y); for distances below d1 the probability is high, at least p1, and for distances above d2 it is low, at most p2.]

Example of an LS Family: MinHash. Let S be the space of sets, let d be the Jaccard distance, and let H be the family of min-hash functions for all permutations of rows. Then for any hash function h ∈ H: Pr[h(x) = h(y)] = 1 - d(x, y). This simply restates the theorem about min-hashing in terms of distances rather than similarities.

Example: LS Family – (2). Claim: H is a (1/3, 2/3, 2/3, 1/3)-sensitive family for S and d. That is, if the distance is < 1/3 (so the similarity is > 2/3), then the probability that the min-hash values agree is > 2/3. More generally, for Jaccard similarity, min-hashing gives a (d1, d2, 1-d1, 1-d2)-sensitive family for any d1 < d2. The theory leaves unknown what happens to pairs at distances between d1 and d2; as a consequence, there are no guarantees about the fraction of false positives in that range.

Amplifying an LS Family. Can we reproduce the "S-curve" effect we saw before for any LS family? Yes: the "bands" technique we learned for signature matrices carries over to this more general setting. There are two constructions: the AND construction, analogous to "rows in a band", and the OR construction, analogous to "many bands".

AND of Hash Functions. Given family H, construct a family H' consisting of r functions from H. For h = [h1, …, hr] in H', h(x) = h(y) if and only if hi(x) = hi(y) for all i. Theorem: if H is (d1, d2, p1, p2)-sensitive, then H' is (d1, d2, p1^r, p2^r)-sensitive. Proof: use the fact that the hi's are independent.

Subtlety Regarding Independence. Independence really means that the probability of two hash functions saying "yes" is the product of each saying "yes". But two min-hash functions, for example, can be highly correlated, say if their permutations agree in the first million places. However, the probabilities in the 4-tuples are taken over all possible members of H and H', so this is fine for min-hash and many other families; it simply must be part of the definition of an LSH family.

OR of Hash Functions. Given family H, construct a family H' consisting of b functions from H. For h = [h1, …, hb] in H', h(x) = h(y) if and only if hi(x) = hi(y) for at least one i. Theorem: if H is (d1, d2, p1, p2)-sensitive, then H' is (d1, d2, 1-(1-p1)^b, 1-(1-p2)^b)-sensitive. Proof: again use the fact that the hi's are independent.

Effect of the AND and OR Constructions. AND makes all probabilities shrink, but by choosing r correctly we can make the lower probability approach 0 while the higher one does not. OR makes all probabilities grow, but by choosing b correctly we can make the upper probability approach 1 while the lower one does not. [Figure: the probability of sharing a bucket, 1-(1-s^r)^b, against the similarity of a pair of items, for the AND construction with r = 1..10, b = 1, and the OR construction with r = 1, b = 1..10.]
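
A numeric sketch of these two effects, assuming we start from a (d1, d2, 0.8, 0.2)-sensitive family:

```python
def and_r(p: float, r: int) -> float:
    return p ** r  # all r functions must say "equal"

def or_b(p: float, b: int) -> float:
    return 1 - (1 - p) ** b  # at least one of b functions says "equal"

p1, p2 = 0.8, 0.2
print(and_r(p1, 4), and_r(p2, 4))  # 0.4096 0.0016 -- both shrink, p2 much faster
print(or_b(p1, 4), or_b(p2, 4))    # 0.9984 0.5904 -- both grow, p1 much closer to 1
```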

Composing Constructions. An r-way AND followed by a b-way OR is exactly what we did with min-hashing: a band hashes columns to the same bucket only if they match in all r of its values, and columns hashed into at least one common bucket become candidates. Take points x and y such that Pr[h(x) = h(y)] = p. A single h from H makes (x, y) a candidate pair with probability p; the composed construction makes (x, y) a candidate pair with probability 1-(1-p^r)^b. The S-curve! Example: take H and construct H' by the AND construction with r = 4; then, from H', construct H'' by the OR construction with b = 4.

Table for the function 1-(1-p^4)^4:
p = .2 → .0064; p = .3 → .0320; p = .4 → .0985; p = .5 → .2275; p = .6 → .4260; p = .7 → .6666; p = .8 → .8785; p = .9 → .9860.
Example: this transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .8785, .0064)-sensitive family.
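
The table is easy to reproduce; a short sketch:

```python
def and_then_or(p: float, r: int = 4, b: int = 4) -> float:
    """The 4-way AND followed by the 4-way OR: 1 - (1 - p**4)**4."""
    return 1 - (1 - p ** r) ** b

for p in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9):
    print(f"{p:.1f}  {and_then_or(p):.4f}")  # matches the table above
```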

Picking r and b: The S-curve. Picking r and b to make the best use of 50 hash functions (r = 5, b = 10). [Figure: the probability of sharing a bucket against similarity; the blue area marks the false-negative rate and the green area the false-positive rate.]

Picking r and b: The S-curve. Picking r and b to make the best use of 50 hash functions (b·r = 50). [Figure: the S-curves for r = 2, b = 25; r = 5, b = 10; and r = 10, b = 5.]

OR-AND Composition. Apply a b-way OR construction followed by an r-way AND construction. This transforms probability p into (1-(1-p)^b)^r: the same S-curve, mirrored horizontally and vertically. Example: take H and construct H' by the OR construction with b = 4; then, from H', construct H'' by the AND construction with r = 4.

Table for the function (1-(1-p)^4)^4:
p = .1 → .0140; p = .2 → .1215; p = .3 → .3334; p = .4 → .5740; p = .5 → .7725; p = .6 → .9015; p = .7 → .9680; p = .8 → .9936.
Example: this transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .9936, .1215)-sensitive family.

Cascading Constructions. Example: apply the (4, 4) OR-AND construction followed by the (4, 4) AND-OR construction. This transforms a (0.2, 0.8, 0.8, 0.2)-sensitive family into a (0.2, 0.8, 0.9999996, 0.0008715)-sensitive family. Note that this family uses 256 (= 4·4·4·4) of the original hash functions.
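
A quick check of the cascade's arithmetic (a sketch; the composition order follows the slide):

```python
def and_or(p: float, r: int = 4, b: int = 4) -> float:
    return 1 - (1 - p ** r) ** b  # AND then OR

def or_and(p: float, b: int = 4, r: int = 4) -> float:
    return (1 - (1 - p) ** b) ** r  # OR then AND

for p in (0.2, 0.8):
    print(p, and_or(or_and(p)))  # ~0.0008715 for p = 0.2, ~0.9999996 for p = 0.8
```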

Summary. Pick any two distances x < y and start with an (x, y, 1-x, 1-y)-sensitive family. Apply the constructions to produce an (x, y, p, q)-sensitive family, where p is almost 1 and q is almost 0. The closer we want to get to 1 and 0, the more hash functions must be used.

LSH for Other Distance Metrics. There are LSH methods for other distance metrics: for cosine distance, random hyperplanes; for Euclidean distance, projection onto lines. The recipe: design an (x, y, p, q)-sensitive family of hash functions for that particular distance metric, then amplify the family using the AND and OR constructions. (The pipeline is as before: points → hash functions → signatures, short integer signatures that reflect the points' similarity → locality-sensitive hashing → candidate pairs, those pairs of signatures that we need to test for similarity.)

LSH for Cosine Distance. For the cosine distance d(A, B) = θ = arccos(A·B / (‖A‖‖B‖)), there is a technique called random hyperplanes, similar in spirit to min-hashing. Random hyperplanes give a (d1, d2, 1-d1/180, 1-d2/180)-sensitive family for any d1 and d2. Reminder: (d1, d2, p1, p2)-sensitive means that if d(x, y) < d1, the probability that h(x) = h(y) is at least p1, and if d(x, y) > d2, it is at most p2.

Random Hyperplanes. Pick a random vector v, which determines a hash function hv with two buckets: hv(x) = +1 if v·x > 0, and hv(x) = -1 if v·x < 0. The LS family H is the set of all functions derived from all possible vectors v. Claim: for points x and y, Pr[h(x) = h(y)] = 1 - d(x, y)/180.

Proof of Claim. Look in the plane of x and y, and let θ = d(x, y) be the angle between them. [Figure: two cases for the hyperplane normal to the random vector v: if the hyperplane falls between x and y, then h(x) ≠ h(y) (the red case); otherwise h(x) = h(y).] The hyperplane separates x from y exactly when the direction of v lands in a wedge of angle θ, so Prob[red case] = θ/180, and hence Pr[h(x) = h(y)] = 1 - θ/180.

Signatures for Cosine Distance. Pick some number of random vectors and hash your data for each vector. The result is a signature (sketch) of +1's and -1's for each data point. It can be used for LSH just like the min-hash signatures for Jaccard distance, and amplified using the AND and OR constructions.

How to Pick Random Vectors? It is expensive to pick a truly random vector in M dimensions for large M: that takes M random numbers per vector. A more efficient approach: it suffices to consider only vectors v whose components are +1 and -1. Why is this more efficient? Each component needs only a single random bit, and the dot product v·x requires only additions and subtractions, no multiplications.
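
A minimal end-to-end sketch with ±1 random vectors (illustrative names; assumes numpy; note the 1 - θ/180 formula is exact for spherically symmetric random vectors and a good approximation for ±1 vectors in high dimensions):

```python
import numpy as np

def hyperplane_signature(x: np.ndarray, planes: np.ndarray) -> np.ndarray:
    """One +/-1 bit per random vector v: the sign of v . x."""
    return np.where(planes @ x >= 0, 1, -1)

rng = np.random.default_rng(0)
planes = rng.choice([-1.0, 1.0], size=(1000, 50))  # +/-1 components suffice
x, y = rng.normal(size=50), rng.normal(size=50)

agree = np.mean(hyperplane_signature(x, planes) == hyperplane_signature(y, planes))
theta = np.degrees(np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y))))
print(agree, 1 - theta / 180)  # empirical and predicted agreement should be close
```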

LSH for Euclidean Distance. Simple idea: hash functions correspond to random lines. Partition each line into buckets of size a, and hash each point to the bucket containing its projection onto the line. Projections of nearby points are always close, so such points usually share a bucket; distant points are rarely in the same bucket.
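
A minimal sketch of one such hash function (illustrative; assumes numpy; the random shift of the bucket boundaries is a common extra detail not shown on the slide):

```python
import numpy as np

def line_bucket(points: np.ndarray, a: float, seed: int = 0) -> np.ndarray:
    """Bucket index of each point's projection onto one random line."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=points.shape[1])
    v /= np.linalg.norm(v)         # direction of the random line
    shift = rng.uniform(0, a)      # randomize where bucket boundaries fall
    return np.floor((points @ v + shift) / a).astype(int)

pts = np.array([[0.0, 0.0], [0.1, 0.1], [7.0, 3.0]])
print(line_bucket(pts, a=1.0))     # nearby points usually share a bucket index
```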

Projection of Points. [Figure: two points at distance d projected onto a randomly chosen line with bucket width a.] If d << a, then the chance that the two points land in the same bucket is at least 1 - d/a.

Projection of Points. [Figure: two points at distance d, at angle θ to a randomly chosen line with bucket width a; the projected separation is d cos θ.] If d >> a, then θ must be close to 90° for there to be any chance that the points go to the same bucket.

An LS Family for Euclidean Distance. If two points are at distance d < a/2, the probability that they fall in the same bucket is at least 1 - d/a ≥ 1/2. If two points are at distance d > 2a, they can be in the same bucket only if d cos θ ≤ a, i.e., cos θ ≤ 1/2, which requires 60° < θ < 90°: at most a 1/3 probability. This yields an (a/2, 2a, 1/2, 1/3)-sensitive family of hash functions for any a, which we can amplify using AND-OR cascades.

Fixup: Euclidean Distance. For the previous distance measures, we could start with an (x, y, p, q)-sensitive family for any x < y and drive p and q to 1 and 0 by AND/OR constructions. Here, we seem to need y > 4x, since the family above requires d2 = 2a = 4·(a/2) = 4·d1.

Fixup – (2). But as long as x < y, the probability that points at distance x fall in the same bucket is greater than the probability that points at distance y do so. Thus the hash family formed by projecting onto lines is an (x, y, p, q)-sensitive family for some p > q, and we can then amplify it by AND/OR constructions.