Similar presentations
Similarity and Distance Sketching, Locality Sensitive Hashing

Longest Common Subsequence
Embedding the Ulam metric into ℓ 1 (embedding of the Ulam metric space into ℓ 1). For the course “Advanced Data Structures”, Antonis Achilleos.
Near-Duplicates Detection
Similarity and Distance Sketching, Locality Sensitive Hashing
High Dimensional Search Min-Hashing Locality Sensitive Hashing
Bounds on Code Length Theorem: Let l ∗ 1, l ∗ 2,..., l ∗ m be optimal codeword lengths for a source distribution p and a D-ary alphabet, and let L ∗ be.
MMDS Secs Slides adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, October.
12 VECTORS AND THE GEOMETRY OF SPACE.
Min-Hashing, Locality Sensitive Hashing Clustering
1 Discrete Structures & Algorithms Graphs and Trees: II EECE 320.
Clustering.
6 6.1 © 2012 Pearson Education, Inc. Orthogonality and Least Squares INNER PRODUCT, LENGTH, AND ORTHOGONALITY.
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these.
Chapter 4.1 Mathematical Concepts
1 Clustering Distance Measures Hierarchical Clustering k -Means Algorithms.
Chapter 4.1 Mathematical Concepts. 2 Applied Trigonometry Trigonometric functions Defined using right triangle  x y h.
1 Clustering Preliminaries Applications Euclidean/Non-Euclidean Spaces Distance Measures.
Distance Measures LS Families of Hash Functions S-Curves
Near Duplicate Detection
This material in not in your text (except as exercises) Sequence Comparisons –Problems in molecular biology involve finding the minimum number of edit.
1 Methods for High Degrees of Similarity Index-Based Methods Exploiting Prefixes and Suffixes Exploiting Length.
Review of Matrix Algebra
Finding Similar Items. Set Similarity Problem: Find similar sets. Motivation: Many things can be modeled/represented as sets Applications: –Face Recognition.
1 Clustering. 2 The Problem of Clustering uGiven a set of points, with a notion of distance between points, group the points into some number of clusters,
Finding Similar Items.
6 6.1 © 2012 Pearson Education, Inc. Orthogonality and Least Squares INNER PRODUCT, LENGTH, AND ORTHOGONALITY.
VECTORS AND THE GEOMETRY OF SPACE Vectors VECTORS AND THE GEOMETRY OF SPACE In this section, we will learn about: Vectors and their applications.
5.1 Orthogonality.
Chapter 2 Dimensionality Reduction. Linear Methods
1 Preliminaries Precalculus Review I Precalculus Review II
Linear Algebra Chapter 4 Vector Spaces.
Similarity and Distance Sketching, Locality Sensitive Hashing
Digital Image Processing, 3rd ed. © 1992–2008 R. C. Gonzalez & R. E. Woods Gonzalez & Woods Matrices and Vectors Objective.
Vectors and the Geometry of Space 9. Vectors 9.2.
Copyright © Cengage Learning. All rights reserved. 12 Vectors and the Geometry of Space.
Finding Similar Items 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 10: Finding Similar Items Mining.
Copyright © Cengage Learning. All rights reserved. CHAPTER 9 COUNTING AND PROBABILITY.
6 6.1 © 2016 Pearson Education, Inc. Orthogonality and Least Squares INNER PRODUCT, LENGTH, AND ORTHOGONALITY.
Vectors Addition is commutative (vi) If vector u is multiplied by a scalar k, then the product ku is a vector in the same direction as u but k times the.
Elementary Linear Algebra Anton & Rorres, 9th Edition
Relations and their Properties
DATA MINING LECTURE 6 Similarity and Distance Sketching, Locality Sensitive Hashing.
1 8. One Function of Two Random Variables Given two random variables X and Y and a function g(x,y), we form a new random variable Z as Given the joint.
DATA MINING LECTURE 6 Sketching, Min-Hashing, Locality Sensitive Hashing.
DATA MINING LECTURE 5 Similarity and Distance Recommender Systems.
College Algebra Sixth Edition James Stewart Lothar Redlin Saleem Watson.
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.
Problem Statement How do we represent relationship between two related elements ?
VECTORS AND THE GEOMETRY OF SPACE. The Cross Product In this section, we will learn about: Cross products of vectors and their applications. VECTORS AND.
Lesson 87: The Cross Product of Vectors IBHL - SANTOWSKI.
DATA MINING LECTURE 6 Sketching, Locality Sensitive Hashing.
1 Clustering Preliminaries Applications Euclidean/Non-Euclidean Spaces Distance Measures.
 An image is the new figure, and the preimage is the original figure  Transformations-move or change a figure in some way to produce an image.
Shingling Minhashing Locality-Sensitive Hashing
1 Objective To provide background material in support of topics in Digital Image Processing that are based on matrices and/or vectors. Review Matrices.
Dr Nazir A. Zafar Advanced Algorithms Analysis and Design Advanced Algorithms Analysis and Design By Dr. Nazir Ahmad Zafar.
Linear Algebra Review.
Near Duplicate Detection
Clustering Algorithms
Sketching, Locality Sensitive Hashing
Finding Similar Items: Locality Sensitive Hashing
Theory of Locality Sensitive Hashing
More LSH Application: Entity Resolution Application: Similar News Articles Distance Measures LS Families of Hash Functions LSH for Cosine Distance Jeffrey.
More LSH LS Families of Hash Functions LSH for Cosine Distance Special Approaches for High Jaccard Similarity Jeffrey D. Ullman Stanford University.
Enumerating Distances Using Spanners of Bounded Degree
Numerical Analysis Lecture 17.
Minwise Hashing and Efficient Search
Near Neighbor Search in High Dimensional Data (1)
Three Essential Techniques for Similar Documents
Presentation transcript:

Jeffrey D. Ullman Stanford University

2  Generalized LSH is based on some kind of “distance” between points.  Similar points are “close.”  Example: Jaccard similarity is not a distance; 1 minus Jaccard similarity is.

3  d is a distance measure if it is a function from pairs of points to real numbers such that: 1. d(x,y) ≥ 0. 2. d(x,y) = 0 iff x = y. 3. d(x,y) = d(y,x). 4. d(x,y) ≤ d(x,z) + d(z,y) (the triangle inequality).

4  L2 norm: d(x,y) = square root of the sum of the squares of the differences between x and y in each dimension.  The most common notion of “distance.”  L1 norm: sum of the absolute differences in each dimension.  Manhattan distance = distance if you had to travel along coordinates only.

5  a = (5,5); b = (9,8).  L2 norm: dist(a,b) = √((9−5)² + (8−5)²) = √(16+9) = √25 = 5.  L1 norm: dist(a,b) = 4 + 3 = 7.
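To make the arithmetic concrete, here is a minimal Python sketch (not part of the slides; the function names are mine) computing both norms for the example points:

    import math

    def l2_dist(x, y):
        # Square root of the sum of squared per-dimension differences.
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    def l1_dist(x, y):
        # Sum of absolute per-dimension differences (Manhattan distance).
        return sum(abs(xi - yi) for xi, yi in zip(x, y))

    a, b = (5, 5), (9, 8)
    print(l2_dist(a, b))  # 5.0, since sqrt(4**2 + 3**2) = 5
    print(l1_dist(a, b))  # 7, since 4 + 3 = 7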

6  Jaccard distance for sets = 1 minus Jaccard similarity.  Cosine distance for vectors = angle between the vectors.  Edit distance for strings = number of inserts and deletes to change one string into another.

7  Consider x = {1,2,3,4} and y = {1,3,5}.  Size of intersection = 2; size of union = 5; Jaccard similarity (not distance) = 2/5.  d(x,y) = 1 − (Jaccard similarity) = 3/5.
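A minimal Python sketch of this computation (not part of the slides):

    def jaccard_distance(x, y):
        # 1 minus the Jaccard similarity |intersection| / |union|.
        x, y = set(x), set(y)
        return 1 - len(x & y) / len(x | y)

    print(jaccard_distance({1, 2, 3, 4}, {1, 3, 5}))  # 0.6, i.e., 3/5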

8  d(x,y) ≥ 0 because |x∩y| ≤ |x∪y|.  d(x,x) = 0 because x∩x = x∪x.  And if x ≠ y, then the size of x∩y is strictly less than the size of x∪y, so d(x,y) > 0.  d(x,y) = d(y,x) because union and intersection are symmetric.  d(x,y) ≤ d(x,z) + d(z,y) is trickier – next slide.

9  We need: d(x,z) + d(z,y) ≥ d(x,y), i.e., (1 − |x∩z|/|x∪z|) + (1 − |z∩y|/|z∪y|) ≥ 1 − |x∩y|/|x∪y|.  Remember: |a∩b|/|a∪b| = probability that minhash(a) = minhash(b).  Thus, 1 − |a∩b|/|a∪b| = probability that minhash(a) ≠ minhash(b).

10  Claim: prob[minhash(x) ≠ minhash(y)] ≤ prob[minhash(x) ≠ minhash(z)] + prob[minhash(z) ≠ minhash(y)].  Proof: whenever minhash(x) ≠ minhash(y), at least one of minhash(x) ≠ minhash(z) and minhash(z) ≠ minhash(y) must be true, since if minhash(x) = minhash(z) and minhash(z) = minhash(y), then minhash(x) = minhash(y).
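As a sanity check, here is a minimal Python sketch (not part of the slides) that brute-forces the triangle inequality for Jaccard distance over random small sets:

    import itertools
    import random

    def jaccard_distance(x, y):
        # Define d = 0 for two empty sets to avoid division by zero.
        return 1 - len(x & y) / len(x | y) if x | y else 0.0

    rng = random.Random(1)
    sets = [{e for e in range(8) if rng.random() < 0.5} for _ in range(25)]
    for x, y, z in itertools.product(sets, repeat=3):
        assert jaccard_distance(x, y) <= (jaccard_distance(x, z)
                                          + jaccard_distance(z, y)) + 1e-12
    print("triangle inequality held on all sampled triples")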

11  Think of a point as a vector from the origin (0,0,…,0) to its location.  Two points’ vectors make an angle, whose cosine is the normalized dot-product of the vectors: p1·p2/(|p1||p2|).  Example: p1 = 00111; p2 = 10011.  p1·p2 = 2; |p1| = |p2| = √3.  cos(θ) = 2/3; θ is about 48 degrees.
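A minimal Python sketch (not part of the slides) that reproduces this example, treating the bit strings as 0/1 vectors:

    import math

    def angle_degrees(p1, p2):
        # Angle whose cosine is the normalized dot product.
        dot = sum(a * b for a, b in zip(p1, p2))
        norm1 = math.sqrt(sum(a * a for a in p1))
        norm2 = math.sqrt(sum(b * b for b in p2))
        return math.degrees(math.acos(dot / (norm1 * norm2)))

    p1 = [0, 0, 1, 1, 1]
    p2 = [1, 0, 0, 1, 1]
    print(angle_degrees(p1, p2))  # about 48.19, since cos(theta) = 2/3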

12  The edit distance of two strings is the number of inserts and deletes of characters needed to turn one into the other.  An equivalent definition: d(x,y) = |x| + |y| - 2|LCS(x,y)|.  LCS = longest common subsequence = any longest string obtained both by deleting from x and deleting from y.

13  x = abcde; y = bcduve.  Turn x into y by deleting a, then inserting u and v after d.  Edit distance = 3.  Or, computing edit distance through the LCS, note that LCS(x,y) = bcde.  Then: |x| + |y| − 2|LCS(x,y)| = 5 + 6 − 2·4 = 3 = edit distance.
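The LCS formula suggests a direct implementation; here is a minimal Python sketch (not part of the slides) using the classic dynamic program for LCS length:

    def lcs_length(x, y):
        # dp[i][j] = length of the LCS of x[:i] and y[:j].
        m, n = len(x), len(y)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if x[i - 1] == y[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                else:
                    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
        return dp[m][n]

    def edit_distance(x, y):
        # Insert/delete edit distance via |x| + |y| - 2|LCS(x,y)|.
        return len(x) + len(y) - 2 * lcs_length(x, y)

    print(lcs_length("abcde", "bcduve"))     # 4 (bcde)
    print(edit_distance("abcde", "bcduve"))  # 3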

15  There is a subtlety about what a “hash function” is, in the context of LSH families.  A hash function h really takes two elements x and y, and returns a decision whether x and y are candidates for comparison.  Example: the family of minhash functions computes minhash values and says “yes” iff they are the same.  Shorthand: “h(x) = h(y)” means h says “yes” for the pair of elements x and y.

16  Suppose we have a space S of points with a distance measure d.  A family H of hash functions is said to be (d1,d2,p1,p2)-sensitive if for any x and y in S: 1. If d(x,y) ≤ d1, then the probability over all h in H that h(x) = h(y) is at least p1. 2. If d(x,y) ≥ d2, then the probability over all h in H that h(x) = h(y) is at most p2.

17  [Figure: probability that h(x) = h(y) as a function of d(x,y): at least p1 for distances below d1, at most p2 for distances above d2, and unconstrained between d1 and d2.]

18  Let:  S = subsets of some universal set,  d = Jaccard distance,  H = the family formed from the minhash functions for all permutations of the universal set.  Then Prob[h(x) = h(y)] = 1 − d(x,y).  This restates the theorem about Jaccard similarity and minhashing in terms of Jaccard distance.

19  Claim: H is a (1/3, 3/4, 2/3, 1/4)-sensitive family for S and d: if distance ≤ 1/3 (so similarity ≥ 2/3), then the probability that minhash values agree is at least 2/3; if distance ≥ 3/4 (so similarity ≤ 1/4), then the probability that minhash values agree is at most 1/4.  In general, for Jaccard distance, minhashing gives us a (d1,d2,(1−d1),(1−d2))-sensitive family for any d1 < d2.

20  The “bands” technique we learned for signature matrices carries over to this more general setting.  Goal: the “S-curve” effect seen there.  AND construction like “rows in a band.”  OR construction like “many bands.”

21  Given family H, construct family H’ whose members each consist of r functions from H.  For h = {h1,…,hr} in H’, h(x) = h(y) if and only if hi(x) = hi(y) for all i.  Theorem: If H is (d1,d2,p1,p2)-sensitive, then H’ is (d1,d2,(p1)^r,(p2)^r)-sensitive.  Proof: Use the fact that the hi’s are independent.

22  Given family H, construct family H’ whose members each consist of b functions from H.  For h = {h1,…,hb} in H’, h(x) = h(y) if and only if hi(x) = hi(y) for some i.  Theorem: If H is (d1,d2,p1,p2)-sensitive, then H’ is (d1,d2,1−(1−p1)^b,1−(1−p2)^b)-sensitive.
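A minimal Python sketch (not part of the slides; function names are mine) showing how the two constructions transform the probabilities of a (d1,d2,p1,p2)-sensitive family:

    def and_construct(p, r):
        # All r independent functions must say "yes": p -> p**r.
        return p ** r

    def or_construct(p, b):
        # At least one of b independent functions says "yes": p -> 1-(1-p)**b.
        return 1 - (1 - p) ** b

    p1, p2 = 0.8, 0.2
    print(and_construct(p1, 4), and_construct(p2, 4))  # ~0.4096, ~0.0016
    print(or_construct(p1, 4), or_construct(p2, 4))    # ~0.9984, ~0.5904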

23  AND makes all probabilities shrink, but by choosing r correctly, we can make the lower probability approach 0 while the higher does not.  OR makes all probabilities grow, but by choosing b correctly, we can make the upper probability approach 1 while the lower does not.

24  As for the signature matrix, we can use the AND construction followed by the OR construction.  Or vice-versa.  Or any sequence of AND’s and OR’s alternating.

25  Each of the two probabilities p is transformed into 1−(1−p^r)^b.  The “S-curve” studied before.  Example: Take H and construct H’ by the AND construction with r = 4. Then, from H’, construct H’’ by the OR construction with b = 4.

26  The transformation is p → 1−(1−p^4)^4.  Example: Transforms a (.2,.8,.8,.2)-sensitive family into a (.2,.8,.8785,.0064)-sensitive family.

27  Each of the two probabilities p is transformed into (1−(1−p)^b)^r.  The same S-curve, mirrored horizontally and vertically.  Example: Take H and construct H’ by the OR construction with b = 4. Then, from H’, construct H’’ by the AND construction with r = 4.

28  The transformation is p → (1−(1−p)^4)^4.  Example: Transforms a (.2,.8,.8,.2)-sensitive family into a (.2,.8,.9936,.1215)-sensitive family.

29  Example: Apply the (4,4) OR-AND construction followed by the (4,4) AND-OR construction.  Transforms a (.2,.8,.8,.2)-sensitive family into a (.2,.8,.9999996,.00087)-sensitive family.
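A minimal Python sketch (not part of the slides) that reproduces the numbers quoted on the last three slides:

    def and_or(p, r=4, b=4):
        # AND then OR: p -> 1 - (1 - p**r)**b.
        return 1 - (1 - p ** r) ** b

    def or_and(p, b=4, r=4):
        # OR then AND: p -> (1 - (1 - p)**b)**r.
        return (1 - (1 - p) ** b) ** r

    for p in (0.2, 0.8):
        print(p, and_or(p), or_and(p), and_or(or_and(p)))
    # p = 0.2: ~0.0064, ~0.1215, ~0.00087
    # p = 0.8: ~0.8785, ~0.9936, ~0.9999996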

30  For each AND-OR S-curve 1−(1−p^r)^b, there is a threshold t for which 1−(1−t^r)^b = t.  Above t, high probabilities are increased; below t, low probabilities are decreased.  You improve the sensitivity as long as the low probability is less than t and the high probability is greater than t.  Iterate as you like.  A similar observation holds for the OR-AND type of S-curve: (1−(1−p)^b)^r.
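The threshold can be found numerically; here is a minimal Python sketch (not part of the slides) locating the nontrivial fixed point of the AND-OR curve by bisection:

    def and_or(p, r=4, b=4):
        return 1 - (1 - p ** r) ** b

    def threshold(r=4, b=4, tol=1e-12):
        # The curve is below the diagonal near 0 and above it near 1;
        # bisect to find where it crosses.
        lo, hi = 1e-6, 1 - 1e-6
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if and_or(mid, r, b) < mid:
                lo = mid  # still below the diagonal: t is to the right
            else:
                hi = mid  # above the diagonal: t is to the left
        return lo

    print(threshold())  # about 0.7245 for r = b = 4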

31  [Figure: the S-curve plotted against the diagonal; it crosses at the threshold t. Probabilities below t are lowered, probabilities above t are raised.]

33  For cosine distance, there is a technique analogous to minhashing for generating a (d1,d2,(1−d1/180),(1−d2/180))-sensitive family for any d1 and d2 (distances here are angles, measured in degrees).  It is called random hyperplanes.

34  Each vector v determines a hash function hv with two buckets: hv(x) = +1 if v·x > 0; hv(x) = −1 if v·x < 0.  LS-family H = set of all functions derived from any vector v.  Claim: Prob[h(x) = h(y)] = 1 − (angle between x and y, divided by 180).
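A minimal Python sketch (not part of the slides) that estimates the claim empirically. It draws v with Gaussian components, which makes the direction of v uniform; the slides note later that ±1 components also suffice in practice:

    import random

    def hyperplane_hash(v, x):
        # h_v(x) = +1 if v.x > 0, else -1.
        return 1 if sum(vi * xi for vi, xi in zip(v, x)) > 0 else -1

    def estimated_angle(x, y, n=20000, seed=0):
        rng = random.Random(seed)
        vectors = [[rng.gauss(0, 1) for _ in range(len(x))] for _ in range(n)]
        disagree = sum(1 for v in vectors
                       if hyperplane_hash(v, x) != hyperplane_hash(v, y))
        # Prob[h(x) != h(y)] = angle/180, so angle ~ 180 * (fraction disagreeing).
        return 180 * disagree / n

    x = [0, 0, 1, 1, 1]
    y = [1, 0, 0, 1, 1]
    print(estimated_angle(x, y))  # close to 48.19 degrees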

35  [Figure: in the plane of x and y, which meet at angle θ, a random hyperplane (normal to v) falls between x and y with probability θ/180; for those hyperplanes h(x) ≠ h(y), and for all others h(x) = h(y).]

36  Pick some number of vectors, and hash your data for each vector.  The result is a signature (sketch) of +1’s and –1’s that can be used for LSH like the minhash signatures for Jaccard distance.  But you don’t have to think this way.  The existence of the LSH-family is sufficient for amplification by AND/OR.

37  We need not pick from among all possible vectors v to form a component of a sketch.  It suffices to consider only vectors v consisting of +1 and –1 components.

39  We’ll again talk about Jaccard similarity and distance of sets.  However, now represent sets by strings (lists of symbols): 1. Order the universal set. 2. Represent a set by the string of its elements in sorted order.

40  If the universal set is k-shingles, there is a natural lexicographic order.  Think of each shingle as a single symbol.  Then the 2-shingling of abcad, which is the set {ab, bc, ca, ad}, is represented by the list (string) [ab, ad, bc, ca] of length 4.
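A minimal Python sketch (not part of the slides) of this representation for the 2-shingle example:

    def shingle_string(s, k=2):
        # The set of k-shingles of s, represented as a sorted list of symbols.
        shingles = {s[i:i + k] for i in range(len(s) - k + 1)}
        return sorted(shingles)

    print(shingle_string("abcad"))  # ['ab', 'ad', 'bc', 'ca']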

41  If we treat a document as a set of words, we could order the words lexicographically.  Better: Order words lowest-frequency-first.  Why? We shall bucketize documents based on the early words in their lists.  Documents spread over more buckets.

42  Suppose two sets have Jaccard distance J and are represented by strings s1 and s2. Let the LCS of s1 and s2 have length C and the (insert/delete) edit distance of s1 and s2 be E. Then:  1−J = Jaccard similarity = C/(C+E).  J = E/(C+E).  This works because these strings never repeat a symbol and symbols appear in the same order, so the LCS is exactly the intersection of the sets and C+E is the size of their union.  Example: s1 = acefh; s2 = bcdegh. LCS = ceh; C = 3; E = 5; 1−J = 3/8.
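A minimal Python sketch (not part of the slides) checking the identity on this example:

    def lcs_length(x, y):
        # Standard dynamic program for the length of the LCS.
        m, n = len(x), len(y)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                dp[i][j] = (dp[i - 1][j - 1] + 1 if x[i - 1] == y[j - 1]
                            else max(dp[i - 1][j], dp[i][j - 1]))
        return dp[m][n]

    s1, s2 = "acefh", "bcdegh"
    C = lcs_length(s1, s2)         # 3 (ceh)
    E = len(s1) + len(s2) - 2 * C  # 5
    print(C / (C + E))             # 0.375 = 3/8 = Jaccard similarity
    print(len(set(s1) & set(s2)) / len(set(s1) | set(s2)))  # also 0.375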

43  The simplest thing to do is create an index on the length of strings.  A set whose string has length L can be within Jaccard distance J of a set whose string has length M only if L·(1−J) ≤ M ≤ L/(1−J).  Example: if 1−J = 90% (Jaccard similarity), then M is between 90% and 111% of L.

44  [Figure: derivation of the length bounds. Shortest candidate: if M ≤ L, the similarity is at most M/L, so 1−J ≤ M/L requires M ≥ L·(1−J). Longest candidate: if M ≥ L, the similarity is at most L/M, so 1−J ≤ L/M requires M ≤ L/(1−J).]

45  Example: If two strings are 90% similar, they must share some symbol in their prefixes.  These prefixes are of length just above 10% of the length of each string.  In general: we can base an index on symbols in just the first ⌊JL+1⌋ positions of a string of length L.

46  Suppose a string of length L has E symbols before the first position that matches a symbol of a second string.  Extreme case: the second string has none of the first E symbols of the first string, but the two strings agree thereafter; even then, the union exceeds the intersection by the E unmatched symbols, so J ≥ E/L.  Thus, if the strings are within distance J, the first match must occur among the first ⌊JL⌋+1 positions; any larger E is impossible.  That is why we index E+1 = ⌊JL+1⌋ positions.

47  Think of a bucket for each possible symbol.  Each string of length L is placed in the bucket for the symbol in each of its first ⌊JL+1⌋ positions.

48  Given a probe string s of length L, with J the limit on Jaccard distance: for (each symbol a among the first ⌊ JL+1 ⌋ positions of s) look for other strings in the bucket for a;

49  Let J = 0.2.  String abcdef is indexed under a and b, since ⌊(0.2)·6 + 1⌋ = 2.  String acdfg is indexed under a and c, since ⌊(0.2)·5 + 1⌋ = 2.  String bcde is indexed only under b, since ⌊(0.2)·4 + 1⌋ = 1.  If we search for strings similar to cdef, we need look only in the bucket for c.
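Putting slides 47–49 together, here is a minimal Python sketch (not part of the slides; class and function names are mine) of the index and probe steps, reproducing this example:

    import math
    from collections import defaultdict

    def prefix_len(s, J):
        # Index (and probe) the first floor(J*L + 1) positions.
        return math.floor(J * len(s) + 1)

    class PrefixIndex:
        def __init__(self, J):
            self.J = J                        # limit on Jaccard distance
            self.buckets = defaultdict(list)  # one bucket per symbol

        def add(self, s):
            for a in s[:prefix_len(s, self.J)]:
                self.buckets[a].append(s)

        def probe(self, s):
            # Strings within distance J of s must share a symbol with
            # one of the first floor(J*L + 1) positions of s.
            candidates = set()
            for a in s[:prefix_len(s, self.J)]:
                candidates.update(self.buckets[a])
            return candidates

    idx = PrefixIndex(J=0.2)
    for s in ("abcdef", "acdfg", "bcde"):
        idx.add(s)
    print(idx.probe("cdef"))  # {'acdfg'}: only the bucket for c is examined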