Near-Neighbor Search: Applications, Matrix Formulation, Minhashing

Similar presentations
Lecture outline Nearest-neighbor search in low dimensions

Applications Shingling Minhashing Locality-Sensitive Hashing
Hash-Based Improvements to A-Priori
1 Frequent Itemset Mining: Computation Model uTypically, data is kept in a flat file rather than a database system. wStored on disk. wStored basket-by-basket.
Similarity and Distance Sketching, Locality Sensitive Hashing
Data Mining of Very Large Data
Near-Duplicates Detection
Latent Semantic Indexing (mapping onto a smaller space of latent concepts) Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 18.
High Dimensional Search Min-Hashing Locality Sensitive Hashing
1 Mining Associations Apriori Algorithm. 2 Computation Model uTypically, data is kept in a flat file rather than a database system. wStored on disk. wStored.
MMDS Secs Slides adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, October.
Randomized / Hashing Algorithms
Min-Hashing, Locality Sensitive Hashing Clustering
1 Association Rules Apriori Algorithm. 2 Computation Model uTypically, data is kept in a flat file rather than a database system. wStored on disk. wStored.
1 Improvements to A-Priori Park-Chen-Yu Algorithm Multistage Algorithm Approximate Algorithms Compacting Results.
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these.
Association Rule Mining
My Favorite Algorithms for Large-Scale Data Mining
Data Mining: Associations
1 Clustering Preliminaries Applications Euclidean/Non-Euclidean Spaces Distance Measures.
Improvements to A-Priori
Distance Measures LS Families of Hash Functions S-Curves
Near Duplicate Detection
1 Methods for High Degrees of Similarity Index-Based Methods Exploiting Prefixes and Suffixes Exploiting Length.
1 More Stream-Mining Counting How Many Elements Computing “Moments”
Finding Similar Items. Set Similarity Problem: Find similar sets. Motivation: Many things can be modeled/represented as sets Applications: –Face Recognition.
1 Clustering. 2 The Problem of Clustering uGiven a set of points, with a notion of distance between points, group the points into some number of clusters,
Finding Similar Items.
1 Finding Similar Pairs Divide-Compute-Merge Locality-Sensitive Hashing Applications.
1 Locality-Sensitive Hashing Basic Technique Hamming-LSH Applications.
Finding Near Duplicates (Adapted from slides and material from Rajeev Motwani and Jeff Ullman)
1 “Association Rules” Market Baskets Frequent Itemsets A-priori Algorithm.
1 Low-Support, High-Correlation Finding Rare but Similar Items Minhashing Locality-Sensitive Hashing.
CS2110 Recitation Week 8. Hashing Hashing: An implementation of a set. It provides O(1) expected time for set operations Set operations Make the set empty.
CHAPTER 09 Compiled by: Dr. Mohammad Omar Alhawarat Sorting & Searching.
1 Applications of LSH (Locality-Sensitive Hashing) Entity Resolution Fingerprints Similar News Articles.
Brief (non-technical) history Full-text index search engines Altavista, Excite, Infoseek, Inktomi, ca Taxonomies populated with web page Yahoo.
Finding Similar Items 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 10: Finding Similar Items Mining.
1 Low-Support, High-Correlation Finding Rare but Similar Items Minhashing.
1 Low-Support, High-Correlation Finding rare, but very similar items.
DATA MINING LECTURE 6 Similarity and Distance Sketching, Locality Sensitive Hashing.
Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!
DATA MINING LECTURE 6 Sketching, Min-Hashing, Locality Sensitive Hashing.
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.
Randomized / Hashing Algorithms Shannon Quinn (with thanks to William Cohen of Carnegie Mellon University, and J. Leskovec, A. Rajaraman, and J. Ullman.
DATA MINING LECTURE 6 Sketching, Locality Sensitive Hashing.
Jeffrey D. Ullman Stanford University. 2  Generalized LSH is based on some kind of “distance” between points.  Similar points are “close.”  Example:
Shingling Minhashing Locality-Sensitive Hashing
CS276A Text Information Retrieval, Mining, and Exploitation
Recommender Systems 01 , Content Based , Collaborative Filtering
Hash table CSC317 We have elements with key and satellite data
Finding Similar Items Jeffrey D. Ullman Application: Similar Documents
Sketching, Locality Sensitive Hashing
Finding Similar Items: Locality Sensitive Hashing
Theory of Locality Sensitive Hashing
Multidimensional Scaling
Minwise Hashing and Efficient Search
CSE 326: Data Structures Lecture #14
Near Neighbor Search in High Dimensional Data (1)
Three Essential Techniques for Similar Documents
Locality-Sensitive Hashing
Presentation transcript:

1 Near-Neighbor Search: Applications, Matrix Formulation, Minhashing

2 Example Application: Face Recognition
- We have a database of (say) 1 million face images.
- We want to find the most similar images in the database.
- Represent faces by (relatively) invariant values, e.g., the ratio of nose width to eye width.

3 Face Recognition – (2)
- Each image is represented by a large number (say 1000) of numerical features.
- Problem: given a face, find those in the DB that are close in at least ¾ (say) of the features.

4 Face Recognition – (3)
- Many-one problem: given a new face, see if it is close to any of the 1 million old faces.
- Many-many problem: which pairs of the 1 million faces are similar?

5 Simple Solution
- Represent each face by a vector of 1000 values and score the comparisons.
- Sort-of OK for the many-one problem.
- Out of the question for the many-many problem (10^6 × 10^6 × 1000 / 2 ≈ 5 × 10^14 numerical comparisons).
- We can do better!

6 Multidimensional Indexes Don't Work
- [Figure: a new face [6,14,…] probed against an index on dimension 1. Surely we'd better look where dimension 1 matches, and maybe at nearby values too, in case of a slight error.]
- But the first dimension could be one of those that is not close. So we'd better look everywhere!

7 Another Problem: Entity Resolution
- Two sets of 1 million name-address-phone records.
- Some pairs, one from each set, represent the same person.
- Errors of many kinds:
  - Typos, missing middle initial, area-code changes, St./Street, Bob/Robert, etc., etc.

8 Entity Resolution – (2)
- Choose a scoring system for how close names are.
  - Deduct so much for edit distance > 0; so much for a missing middle initial, etc.
- Similarly score differences in addresses and phone numbers.
- A sufficiently high total score -> the records represent the same entity.
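A minimal Python sketch of such a scoring system. The field weights, the 0.8 threshold, and the use of difflib's SequenceMatcher ratio as a stand-in for an edit-distance deduction are illustrative assumptions, not part of the original slides:

```python
from difflib import SequenceMatcher

WEIGHTS = {"name": 0.5, "address": 0.3, "phone": 0.2}  # illustrative weights

def field_score(a, b):
    # Closeness of two strings in [0, 1]; a stand-in for
    # "deduct so much for edit distance > 0."
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_score(r1, r2):
    # Weighted total over the name, address, and phone fields.
    return sum(w * field_score(r1[f], r2[f]) for f, w in WEIGHTS.items())

r1 = {"name": "Bob Smith", "address": "12 Main St.", "phone": "650-555-1212"}
r2 = {"name": "Robert Smith", "address": "12 Main Street", "phone": "650-555-1212"}
same_entity = record_score(r1, r2) > 0.8  # illustrative threshold
```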

9 Simple Solution
- Compare each pair of records, one from each set.
- Score the pair.
- Call them the same if the score is sufficiently high.
- Infeasible for 1 million records (10^12 pairs).
- We can do better!

10 Example: Similar Customers
- Common pattern: looking for sets with a relatively large intersection.
- Represent a customer, e.g., of Netflix, by the set of movies they rented.
- Similar customers have a relatively large fraction of their choices in common.

11 Example: Similar Products
- Dual view of the product-customer relationship.
- Products are similar if they are bought by many of the same customers.
- E.g., movies of the same genre are typically rented by similar sets of Netflix customers.
  - Tricky: Sony and Samsung TVs are "similar," but not typically bought by the same customers.

12 Yet Another Problem: Finding Similar Documents
- Given a body of documents, e.g., the Web, find pairs of docs that have a lot of text in common, e.g.:
  - Mirror sites, or approximate mirrors.
  - Plagiarism, including large quotations.
  - Repetitions of news articles at news sites.

13 Complexity of Document Similarity
- For the face problem, there is a way to represent a big image by a (relatively) small data-set.
- Entity records represent themselves.
- How do you represent a document so it is easy to compare with others?

14 Complexity – (2)
- Special cases are easy, e.g., identical documents, or one document contained verbatim in another.
- The general case, where many small pieces of one doc appear out of order in another, is very hard.

15 Roadmap
[Diagram: face-recognition and entity-resolution data, and documents (via the shingling technique), become sets or Boolean matrices (e.g., similar customers, similar products); the minhashing technique compresses these into signatures; the locality-sensitive-hashing technique then produces buckets containing mostly similar items.]

16 Representing Documents for Similarity Search
1. Represent a doc by its set of shingles (or k-grams).
2. Summarize the shingle set by a signature = a small data-set with the property:
   - Similar documents are very likely to have "similar" signatures.
- At that point, the doc problem becomes finding similar sets.

17 Shingles
- A k-shingle (or k-gram) for a document is a sequence of k characters that appears in the document.
- Example: k = 2; doc = abcab. Set of 2-shingles = {ab, bc, ca}.
  - Option: regard shingles as a bag, and count ab twice.
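A one-function Python sketch of character shingling, reproducing the slide's k = 2 example:

```python
def shingles(doc, k=2):
    # The set of all k-character substrings (k-shingles) of doc.
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

print(shingles("abcab"))  # {'ab', 'bc', 'ca'} (in some order)
```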

18 Shingles: Compression Option
- To compress long shingles, we can hash them to (say) 4 bytes.
- Represent a doc by the set of hash values of its k-shingles.
- Two documents could (rarely) appear to have shingles in common, when in fact only the hash values were shared.
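One way to realize the compression in Python, reusing shingles from the sketch above: hash each shingle and keep 4 bytes of the digest. The choice of MD5 is an assumption here, just a convenient stable hash:

```python
import hashlib

def shingle_token(shingle):
    # Hash the shingle and keep the first 4 bytes: a 32-bit token.
    digest = hashlib.md5(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

doc_tokens = {shingle_token(s) for s in shingles("abcab")}
```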

19 MinHashing
Data as Sparse Matrices – Jaccard Similarity Measure – Constructing Signatures

20 Basic Data Model: Sets
- Many similarity problems can be couched as finding subsets of some universal set that have a large intersection.
- Examples include:
  1. Documents represented by their set of shingles (or hashes of those shingles).
  2. Similar customers or products.

21 From Sets to Boolean Matrices
- Rows = elements of the universal set.
- Columns = sets.
- 1 in the row for element e and the column for set S iff e is a member of S.

22 In Matrix Form

      S  T  U  V  W
  a   1  1  0  1  0
  b   1  0  1  1  0
  c   1  0  0  1  0
  d   0  1  0  0  1
  e   1  0  1  0  1
  f   1  1  0  1  1
  g   0  1  0  1  1
  h   0  1  0  1  0

S = {a,b,c,e,f}; T = {a,d,f,g,h}; U = {b,e}; V = {a,b,c,f,g,h}; W = {d,e,f,g}
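A short Python sketch (the function name is just for illustration) that rebuilds this matrix from the five sets:

```python
def to_boolean_matrix(named_sets):
    # Rows = elements of the universal set, in sorted order;
    # one 0/1 column per set, with 1 iff the element is a member.
    rows = sorted(set().union(*named_sets.values()))
    cols = {name: [int(e in s) for e in rows] for name, s in named_sets.items()}
    return rows, cols

sets = {"S": set("abcef"), "T": set("adfgh"), "U": set("be"),
        "V": set("abcfgh"), "W": set("defg")}
rows, cols = to_boolean_matrix(sets)
print(cols["S"])  # [1, 1, 1, 0, 1, 1, 0, 0] -- the S column above
```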

23 Documents in Matrix Form
- Rows = shingles (or hashes of shingles).
- Columns = documents.
- 1 in row r, column c iff document c has shingle r.
- Expect the matrix to be sparse.

24 Aside
- We might not really represent the data by a Boolean matrix.
- Sparse matrices are usually better represented by the list of places where there is a non-zero value.
  - E.g., movies rented by a customer, shingle-sets.
- But the matrix picture is conceptually useful.

25 Assumptions
1. The number of items allows a small amount of main memory per item.
   - E.g., main memory = the number of items times a small per-item amount.
2. There are too many items to store anything in main memory for each pair of items.

26 Similarity of Columns
- Remember: a column is the set of rows in which it has 1.
- The similarity of columns C1 and C2, Sim(C1, C2), is the ratio of the sizes of the intersection and union of C1 and C2.
  - Sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2| = the Jaccard similarity.
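With columns viewed as sets, the measure is one line of Python; for example, columns S and T from slide 22 intersect in {a, f}, out of eight elements in their union:

```python
def jaccard(c1, c2):
    # |C1 ∩ C2| / |C1 ∪ C2|; by convention two empty sets are fully similar.
    return len(c1 & c2) / len(c1 | c2) if (c1 or c2) else 1.0

print(jaccard(set("abcef"), set("adfgh")))  # 2/8 = 0.25
```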

27 Example: Jaccard Similarity

  C1  C2
   0   1
   1   0
   1   1  *
   0   0
   1   1  *
   0   1

Sim(C1, C2) = 2/5 = 0.4 (the two starred rows are in the intersection; five rows are in the union).

28 Outline: Finding Similar Columns
1. Compute signatures of columns = small summaries of the columns.
   - Read from disk into main memory.
2. Examine signatures in main memory to find similar signatures.
   - Essential: the similarities of signatures and columns are related.
3. Optional: check that columns with similar signatures are really similar.

29 Warnings
1. Comparing all pairs of signatures may take too much time, even if not too much space.
   - A job for Locality-Sensitive Hashing.
2. These methods can produce false negatives, and even false positives if the optional check is not made.

30 Signatures
- Key idea: "hash" each column C to a small signature Sig(C), such that:
  1. Sig(C) is small enough that we can fit a signature in main memory for each column.
  2. Sim(C1, C2) is the same as the "similarity" of Sig(C1) and Sig(C2).

31 An Idea That Doesn't Work
- Pick 100 rows at random, and let the signature of column C be the 100 bits of C in those rows.
- Because the matrix is sparse, many columns would have all 0's as a signature, yet be very dissimilar because their 1's are in different rows.

32 Four Types of Rows
- Given columns C1 and C2, rows may be classified as:

    Type  C1  C2
     a     1   1
     b     1   0
     c     0   1
     d     0   0

- Also, let a = the number of rows of type a, etc.
- Note Sim(C1, C2) = a / (a + b + c).
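A quick Python check of this identity; the two 0/1 columns here are arbitrary examples:

```python
from collections import Counter

def row_types(c1, c2):
    # Counts of type-a (1,1), type-b (1,0), type-c (0,1), type-d (0,0) rows.
    counts = Counter(zip(c1, c2))
    return counts[1, 1], counts[1, 0], counts[0, 1], counts[0, 0]

a, b, c, d = row_types([1, 0, 1, 1, 0], [0, 1, 1, 0, 1])
print(a / (a + b + c))  # 1/5 = 0.2 = the Jaccard similarity of the columns
```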

33 Minhashing
- Imagine the rows permuted randomly.
- Define the "hash" function h(C) = the number of the first row (in the permuted order) in which column C has a 1.
- Use several (100?) independent hash functions to create a signature.
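A direct Python rendering of this definition with explicit permutations (fine for small matrices; slide 39 explains why it does not scale):

```python
import random

def minhash(column, perm):
    # h(C) = the position, in permuted order, of the first row where C has a 1.
    return min(perm[r] for r, bit in enumerate(column) if bit)

def signature(column, perms):
    return [minhash(column, p) for p in perms]

n_rows = 5
perms = []
for _ in range(100):              # 100 independent "hash functions"
    p = list(range(n_rows))
    random.shuffle(p)             # one random permutation of the rows
    perms.append(p)

sig = signature([1, 0, 1, 1, 0], perms)
```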

34 Minhashing Example
[Figure: an input matrix and the signature matrix M it yields under several row permutations.]

35 Surprising Property
- The probability (over all permutations of the rows) that h(C1) = h(C2) is the same as Sim(C1, C2).
- Both are a / (a + b + c)!
- Why?
  - Look down columns C1 and C2 until we see a 1.
  - If it's a type-a row, then h(C1) = h(C2). If it's a type-b or type-c row, then not.
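The property is easy to confirm empirically: over many random permutations, the collision rate approaches a/(a+b+c). A small simulation, reusing the minhash sketch above:

```python
import random

def collision_rate(c1, c2, trials=100_000):
    n, hits = len(c1), 0
    for _ in range(trials):
        perm = list(range(n))
        random.shuffle(perm)
        hits += minhash(c1, perm) == minhash(c2, perm)
    return hits / trials

# For the example columns used under slide 32, a/(a+b+c) = 1/5,
# and the estimate comes out near 0.2.
print(collision_rate([1, 0, 1, 1, 0], [0, 1, 1, 0, 1]))
```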

36 Similarity for Signatures
- The similarity of signatures is the fraction of the rows in which they agree.
  - Remember, each row of the signature matrix corresponds to a permutation or "hash function."

37 Min Hashing – Example
[Figure: an input matrix and its signature matrix M, comparing the column/column similarities against the signature/signature similarities.]

38 Minhash Signatures
- Pick (say) 100 random permutations of the rows.
- Think of Sig(C) as a column vector.
- Let Sig(C)[i] = the number of the first row, according to the i-th permutation, that has a 1 in column C.

39 Implementation – (1)
- Suppose there are 1 billion rows.
- It is hard to pick a random permutation of 1…billion.
- Representing a random permutation requires 1 billion entries.
- Accessing rows in permuted order leads to thrashing.

40 Implementation – (2)
- A good approximation to permuting rows: pick (say) 100 hash functions.
- For each column c and each hash function h_i, keep a "slot" M(i, c) for that minhash value.

41 Implementation – (3)

for each row r
  for each column c
    if c has 1 in row r
      for each hash function h_i do
        if h_i(r) is a smaller value than M(i, c) then
          M(i, c) := h_i(r);
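The same algorithm in Python; here the Boolean matrix is a list of 0/1 rows and every slot M(i, c) starts at infinity:

```python
def minhash_signatures(matrix, hash_funcs):
    # matrix[r][c] is the Boolean matrix; M[i][c] ends up holding the
    # minhash of column c under hash function i.
    n_cols = len(matrix[0])
    M = [[float("inf")] * n_cols for _ in hash_funcs]
    for r, row in enumerate(matrix):
        h_vals = [h(r) for h in hash_funcs]   # compute each h_i(r) once per row
        for c, bit in enumerate(row):
            if bit:                           # c has 1 in row r
                for i, hv in enumerate(h_vals):
                    if hv < M[i][c]:
                        M[i][c] = hv
    return M
```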

42 Example

  Row  C1  C2
   1    1   0
   2    0   1
   3    1   1
   4    1   0
   5    0   1

h(x) = x mod 5
g(x) = (2x + 1) mod 5

After processing each row, the slots hold:

              Sig1  Sig2
  h(1) = 1      1     -
  g(1) = 3      3     -

  h(2) = 2      1     2
  g(2) = 0      3     0

  h(3) = 3      1     2
  g(3) = 2      2     0

  h(4) = 4      1     2
  g(4) = 4      2     0

  h(5) = 0      1     0
  g(5) = 1      2     0

Final signatures: Sig1 = (1, 2), Sig2 = (0, 0).
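Running the minhash_signatures sketch from slide 41 on this data reproduces the trace; since the slide numbers rows from 1 but Python indexes from 0, the hash functions are applied to r + 1:

```python
matrix = [[1, 0],   # row 1: C1 = 1, C2 = 0
          [0, 1],   # row 2
          [1, 1],   # row 3
          [1, 0],   # row 4
          [0, 1]]   # row 5

h = lambda x: x % 5
g = lambda x: (2 * x + 1) % 5

M = minhash_signatures(matrix, [lambda r: h(r + 1), lambda r: g(r + 1)])
print(M)  # [[1, 0], [2, 0]]: Sig1 = (1, 2), Sig2 = (0, 0), matching the table
```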

43 Implementation – (4)
- If the data is stored row-by-row, then only one pass is needed.
- If the data is stored column-by-column:
  - E.g., if the data is a sequence of documents, represent it by (row, column) pairs and sort once by row.
  - This saves the cost of computing h(r) many times.