Locality-Sensitive Hashing: Basic Technique, Hamming-LSH, Applications

Similar presentations
Applications Shingling Minhashing Locality-Sensitive Hashing

Hash-Based Improvements to A-Priori
1 Frequent Itemset Mining: Computation Model uTypically, data is kept in a flat file rather than a database system. wStored on disk. wStored basket-by-basket.
Similarity and Distance Sketching, Locality Sensitive Hashing
Data Mining of Very Large Data
Lecture 8 Join Algorithms. Intro Until now, we have used nested loops for joining data – This is slow, n^2 comparisons How can we do better? – Sorting.
Near-Duplicates Detection
High Dimensional Search Min-Hashing Locality Sensitive Hashing
1 Mining Associations Apriori Algorithm. 2 Computation Model uTypically, data is kept in a flat file rather than a database system. wStored on disk. wStored.
MMDS Secs Slides adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, October.
Randomized / Hashing Algorithms
Min-Hashing, Locality Sensitive Hashing Clustering
1 Improvements to A-Priori Park-Chen-Yu Algorithm Multistage Algorithm Approximate Algorithms Compacting Results.
Nested-Loop joins “one-and-a-half” pass method, since one relation will be read just once. Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in.
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these.
Association Rule Mining
My Favorite Algorithms for Large-Scale Data Mining
Data Mining: Associations
Improvements to A-Priori
Improvements to A-Priori
Distance Measures LS Families of Hash Functions S-Curves
Near Duplicate Detection
1 More Stream-Mining Counting How Many Elements Computing “Moments”
Fitting a Model to Data Reading: 15.1,
Finding Similar Items. Set Similarity Problem: Find similar sets. Motivation: Many things can be modeled/represented as sets Applications: –Face Recognition.
CS345 Data Mining Crawling the Web. Web Crawling Basics get next url get page extract urls to visit urls visited urls web pages Web Start with a “seed.
1 Near-Neighbor Search Applications Matrix Formulation Minhashing.
Hashing General idea: Get a large array
Finding Similar Items.
1 Near-Neighbor Search Applications Matrix Formulation Minhashing.
1 Improvements to A-Priori Park-Chen-Yu Algorithm Multistage Algorithm Approximate Algorithms Compacting Results.
1 Finding Similar Pairs Divide-Compute-Merge Locality-Sensitive Hashing Applications.
Evaluation of Relational Operations. Relational Operations v We will consider how to implement: – Selection ( ) Selects a subset of rows from relation.
Finding Near Duplicates (Adapted from slides and material from Rajeev Motwani and Jeff Ullman)
1 Low-Support, High-Correlation Finding Rare but Similar Items Minhashing Locality-Sensitive Hashing.
CSE 373 Data Structures Lecture 15
CS2110 Recitation Week 8. Hashing Hashing: An implementation of a set. It provides O(1) expected time for set operations Set operations Make the set empty.
Signature files. Signature Files Important alternative to inverted indexes. Given a document, the signature is calculated as follows. - First, each word.
Lecture 6: The Ultimate Authorship Problem: Verification for Short Docs Moshe Koppel and Yaron Winter.
CHAPTER 09 Compiled by: Dr. Mohammad Omar Alhawarat Sorting & Searching.
1 Applications of LSH (Locality-Sensitive Hashing) Entity Resolution Fingerprints Similar News Articles.
Finding Similar Items 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 10: Finding Similar Items Mining.
1 Low-Support, High-Correlation Finding Rare but Similar Items Minhashing.
ITEC 2620A Introduction to Data Structures Instructor: Prof. Z. Yang Course Website: 2620a.htm Office: TEL 3049.
1 Low-Support, High-Correlation Finding rare, but very similar items.
CS411 Database Systems Kazuhiro Minami 11: Query Execution.
DATA MINING LECTURE 6 Similarity and Distance Sketching, Locality Sensitive Hashing.
Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!
DATA MINING LECTURE 6 Sketching, Min-Hashing, Locality Sensitive Hashing.
Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.
Query Processing CS 405G Introduction to Database Systems.
Implementation of Database Systems, Jarek Gryz1 Evaluation of Relational Operations Chapter 12, Part A.
DATA MINING LECTURE 6 Sketching, Locality Sensitive Hashing.
Jeffrey D. Ullman Stanford University. 2  Generalized LSH is based on some kind of “distance” between points.  Similar points are “close.”  Example:
Shingling Minhashing Locality-Sensitive Hashing
Finding Similar Items Jeffrey D. Ullman Application: Similar Documents
Sketching, Locality Sensitive Hashing
Finding Similar Items: Locality Sensitive Hashing
Finding Similar Items: Locality Sensitive Hashing
Shingling Minhashing Locality-Sensitive Hashing
Theory of Locality Sensitive Hashing
Evaluation of Relational Operations
Lecture 2- Query Processing (continued)
Database Design and Programming
ITEC 2620M Introduction to Data Structures
Sorting We may build an index on the relation, and then use the index to read the relation in sorted order. May lead to one disk block access for each.
Three Essential Techniques for Similar Documents
Locality-Sensitive Hashing
Index Structures Chapter 13 of GUW September 16, 2019
Presentation transcript:

1 Locality-Sensitive Hashing: Basic Technique, Hamming-LSH, Applications

2 Finding Similar Pairs
- Suppose we have in main memory data representing a large number of objects.
  - May be the objects themselves (e.g., summaries of faces).
  - May be signatures, as in minhashing.
- We want to compare each to each, finding those pairs that are sufficiently similar.

3 Candidate Generation From Minhash Signatures
- Pick a similarity threshold s, a fraction < 1.
- A pair of columns c and d is a candidate pair if their signatures agree in at least fraction s of the rows.
  - I.e., M(i, c) = M(i, d) for at least fraction s of the values of i.
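
A minimal sketch of this test in Python; the function name and the representation of a signature as a list of minhash values are assumptions for illustration:

```python
def is_candidate_pair(sig_c, sig_d, s):
    """Columns c and d are candidates if their minhash signatures
    agree in at least fraction s of the rows."""
    agree = sum(1 for x, y in zip(sig_c, sig_d) if x == y)
    return agree / len(sig_c) >= s

# Two 5-row signatures agreeing in 4 of 5 rows pass at s = 0.8:
print(is_candidate_pair([3, 1, 4, 1, 5], [3, 1, 4, 1, 9], 0.8))  # True
```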

4 Candidate Generation --- (2)
- For images, a pair of vectors is a candidate if they differ by at most a small threshold t in at least s% of the components.
- For entity records, a pair is a candidate if the sum of similarity scores of corresponding components exceeds a threshold.

5 The Problem with Checking for Candidates
- While the signatures of all columns may fit in main memory, comparing the signatures of all pairs of columns is quadratic in the number of columns.
- Example: 10^6 columns implies 5*10^11 comparisons.
- At 1 microsecond/comparison: about 6 days.
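
A quick back-of-the-envelope check of these numbers (illustrative only):

```python
# Number of pairs among 10^6 columns, and the time at 1 microsecond each.
columns = 10**6
pairs = columns * (columns - 1) // 2      # ~5 * 10**11 comparisons
seconds = pairs * 1e-6
print(pairs, seconds / 86400)             # 499999500000 pairs, ~5.8 days
```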

6 Solutions
1. Divide-Compute-Merge (DCM) uses external sorting and merging.
2. Locality-Sensitive Hashing (LSH) can be carried out in main memory, but admits some false negatives.
3. Hamming LSH --- a variant LSH method.

7 Divide-Compute-Merge
- Designed for "shingles" and docs.
- At each stage, divide data into batches that fit in main memory.
- Operate on individual batches and write out partial results to disk.
- Merge partial results from disk.

8 DCM Steps
doc1: s11, s12, …, s1k
doc2: s21, s22, …, s2k
…
  --(Invert)--> (s11, doc1), (s12, doc1), …, (s1k, doc1), (s21, doc2), …
  --(sort on shingleId)--> (t1, doc11), (t1, doc12), …, (t2, doc21), (t2, doc22), …
  --(Invert and pair)--> (doc11, doc12, 1), (doc11, doc13, 1), …, (doc21, doc22, 1), …
  --(sort on <docId1, docId2>)--> (doc11, doc12, 1), …, (doc11, doc13, 1), …
  --(Merge)--> (doc11, doc12, 2), (doc11, doc13, 10), …

9 DCM Summary
1. Start with the <shingleId, docId> pairs.
2. Sort by shingleId.
3. In a sequential scan, generate <docId1, docId2, 1> triplets for pairs of docs that share a shingle.
4. Sort on <docId1, docId2>.
5. Merge triplets with common docIds to generate triplets of the form <docId1, docId2, count>.
6. Output document pairs with count > threshold.
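
The following is a minimal in-memory sketch of these steps; real DCM operates on disk-resident batches with external sorting, and the input format (docId mapped to a set of shingleIds) is an assumption for illustration:

```python
from collections import defaultdict
from itertools import combinations

def dcm_pairs(docs, threshold):
    """In-memory illustration of the six DCM steps."""
    # Steps 1-2: invert to (shingleId, docId) pairs, grouped by shingleId.
    by_shingle = defaultdict(list)
    for doc, shingles in docs.items():
        for sh in shingles:
            by_shingle[sh].append(doc)
    # Steps 3-5: emit <docId1, docId2, 1> per shared shingle, merge counts.
    counts = defaultdict(int)
    for sh in sorted(by_shingle):
        for d1, d2 in combinations(sorted(by_shingle[sh]), 2):
            counts[(d1, d2)] += 1
    # Step 6: output document pairs with count > threshold.
    return {pair: c for pair, c in counts.items() if c > threshold}

print(dcm_pairs({"doc1": {1, 2, 3}, "doc2": {2, 3, 4}, "doc3": {9}}, 1))
# {('doc1', 'doc2'): 2}
```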

10 Some Optimizations
- "Invert and Pair" is the most expensive step.
- Speed it up by eliminating very common shingles.
  - "the", "404 not found", "<A HREF", etc.
- Also, eliminate exact-duplicate docs first.

11 Locality-Sensitive Hashing
- Big idea: hash columns of signature matrix M several times.
- Arrange that (only) similar columns are likely to hash to the same bucket.
- Candidate pairs are those that hash at least once to the same bucket.

12 Partition Into Bands
[Figure: matrix M divided into b bands, r rows per band.]

13 Partition into Bands --- (2)
- Divide matrix M into b bands of r rows.
- For each band, hash its portion of each column to a hash table with k buckets.
- Candidate column pairs are those that hash to the same bucket for ≥ 1 band.
- Tune b and r to catch most similar pairs, but few nonsimilar pairs.
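
A minimal sketch of the banding step, assuming signatures are stored as a dict from column id to a list of b*r minhash values (the band tuple itself stands in for hashing into k buckets, which is safe under the simplifying assumption on slide 15):

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(M, b, r):
    """Split each column's signature into b bands of r rows; columns
    sharing a bucket in >= 1 band become candidate pairs."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for col, sig in M.items():
            chunk = tuple(sig[band * r:(band + 1) * r])
            buckets[chunk].append(col)    # stand-in for a k-bucket hash table
        for cols in buckets.values():
            candidates.update(combinations(sorted(cols), 2))
    return candidates

# Example: 3 columns, b = 2 bands of r = 2 rows each.
M = {"c1": [5, 2, 9, 1], "c2": [5, 2, 0, 3], "c3": [7, 7, 9, 1]}
print(lsh_candidates(M, b=2, r=2))   # {('c1', 'c2'), ('c1', 'c3')}
```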

14 [Figure: one band of r rows of matrix M; each band hashes its columns into its own set of buckets, b bands in all.]

15 Simplifying Assumption
- There are enough buckets that columns are unlikely to hash to the same bucket unless they are identical in a particular band.
- Hereafter, we assume that "same bucket" means "identical."

16 Example
- Suppose 100,000 columns.
- Signatures of 100 integers.
- Therefore, signatures take 40 MB.
- But 5,000,000,000 pairs of signatures can take a while to compare.
- Choose 20 bands of 5 integers/band.

17 Suppose C1, C2 are 80% Similar
- Probability C1, C2 identical in one particular band: (0.8)^5 = 0.328.
- Probability C1, C2 are not similar in any of the 20 bands: (1 - 0.328)^20 = 0.00035.
  - I.e., we miss about 1/3000th of the 80%-similar column pairs.

18 Suppose C1, C2 Only 40% Similar
- Probability C1, C2 identical in any one particular band: (0.4)^5 = 0.01.
- Probability C1, C2 identical in ≥ 1 of 20 bands: ≤ 20 * 0.01 = 0.2.
- But false positives much lower for similarities << 40%.
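
These numbers are easy to reproduce (a sketch using the b = 20, r = 5 configuration of the running example):

```python
b, r = 20, 5
for sim in (0.8, 0.4):
    p_band = sim ** r                 # identical in one particular band
    p_hit = 1 - (1 - p_band) ** b     # candidate in >= 1 of the b bands
    print(sim, round(p_band, 4), round(p_hit, 5))
# 0.8: p_band ~ 0.328, miss rate 1 - p_hit ~ 0.00035, about 1/3000
# 0.4: p_band ~ 0.0102, p_hit ~ 0.186, below the 20 * 0.01 = 0.2 bound
```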

19 LSH Involves a Tradeoff
- Pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives/negatives.
- Example: if we had fewer than 20 bands, the number of false positives would go down, but the number of false negatives would go up.

20 LSH --- Graphically
- Example target: all pairs with Sim > t.
- Ideal: the probability of becoming a candidate is a step function, jumping from 0 to 1 at Sim = t.
- With only one hash function, the probability of sharing a bucket grows linearly: Prob = s (the similarity).
- Partition into bands gives an S-curve: Prob = 1 - (1 - s^r)^b, with threshold t ~ (1/b)^(1/r).
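
The S-curve formula can be evaluated directly; as a sketch, with the running example's b = 20 and r = 5, the threshold lands near the middle of the similarity range:

```python
def p_candidate(s, b, r):
    """Probability that two columns with similarity s become candidates."""
    return 1 - (1 - s ** r) ** b

b, r = 20, 5
t = (1 / b) ** (1 / r)                   # approximate threshold
print(round(t, 3))                       # ~0.549
print(round(p_candidate(t, b, r), 3))    # ~0.642; at s = t, s^r = 1/b,
                                         # so p = 1 - (1 - 1/b)^b ~ 1 - 1/e
```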

21 LSH Summary
- Tune to get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures.
- Check in main memory that candidate pairs really do have similar signatures.
- Optional: in another pass through the data, check that the remaining candidate pairs really are similar columns.

22 New Topic: Hamming LSH
- An alternative to minhash + LSH.
- Takes advantage of the fact that if columns are not sparse, random rows serve as a good signature.
- Trick: create data matrices of exponentially decreasing sizes and increasing densities.

23 Amplification of 1's
- Hamming LSH constructs a series of matrices, each with half as many rows, by OR-ing together pairs of rows.
- Candidate pairs from each matrix have (say) between 20% and 80% 1's and are similar in a selection of 100 rows.
- 20%-80% is OK for similarity thresholds ≥ 0.5. Otherwise, two "similar" columns with widely differing numbers of 1's could fail to both be in range for at least one matrix.
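
A minimal sketch of the halving construction, representing a column as a Python list of 0/1 values and assuming the number of rows is a power of two for simplicity:

```python
def halvings(column):
    """Yield the column, then successive halves formed by OR-ing
    adjacent pairs of rows; each round roughly doubles the density."""
    yield column
    while len(column) > 1:
        column = [column[i] | column[i + 1] for i in range(0, len(column), 2)]
        yield column

for m in halvings([0, 1, 0, 0, 0, 0, 1, 0]):
    print(m)
# [0, 1, 0, 0, 0, 0, 1, 0] -> [1, 0, 0, 1] -> [1, 1] -> [1]
```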

24 Example
[Figure: a column of the boolean matrix and its successively OR-ed halves.]

25 Using Hamming LSH
- Construct the sequence of matrices.
  - If there are R rows, then log_2 R matrices.
  - Total work = twice that of reading the original matrix.
- Use standard LSH on a random selection of rows to identify similar columns in each matrix, but restricted to columns of "medium" density.

26 LSH for Other Applications
1. Face recognition from 1000 measurements/face.
2. Entity resolution from name-address-phone records.
- General principle: find many hash functions for elements; candidate pairs share a bucket for ≥ 1 hash.

27 Face-Recognition Hash Functions
1. Pick a set of r of the 1000 measurements.
2. Each bucket corresponds to a range of values for each of the r measurements.
3. Hash a vector to the bucket such that each of its r components is in range.
4. Optional: if near the edge of a range, also hash to an adjacent bucket.
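
A sketch of one such hash function; the chosen indices and the range width are illustrative assumptions, and the optional step 4 would additionally emit neighboring bucket ids when a component sits near a range boundary:

```python
def face_bucket(vector, indices, width=8):
    """Bucket id: the range of size `width` into which each of the
    r chosen measurements falls."""
    return tuple(int(vector[i] // width) for i in indices)

v = [0] * 1000
v[17], v[42] = 27, 9                      # two of the 1000 measurements
print(face_bucket(v, indices=(17, 42)))   # (3, 1)
```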

28 Example: r = 2
[Figure: buckets tile the (x, y) plane. One bucket holds points with 10 < x < 16 and 0 < y < 4; the point (27, 9) goes in its own bucket, and a copy may also be placed in an adjacent bucket.]

29 Many-One Face Lookup
- As for boolean matrices, use many different hash functions.
  - Each based on a different set of the 1000 measurements.
- Each bucket of each hash function points to the images that hash to that bucket.

30 Face Lookup --- (2)
- Given a new image (the probe), hash it according to all the hash functions.
- Any member of any one of its buckets is a candidate.
- For each candidate, count the number of components in which the candidate and probe are close.
- Declare a match if the number of close components exceeds a threshold.
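
A sketch of this lookup; `hash_fns` and `indexes` are assumed to be parallel lists (one bucket table per hash function), and the closeness tolerance and match threshold are placeholders:

```python
def face_lookup(probe, images, hash_fns, indexes, tol=2.0, threshold=900):
    """Hash the probe under every hash function, gather every member
    of its buckets as candidates, then score by close components."""
    candidates = set()
    for h, index in zip(hash_fns, indexes):
        candidates.update(index.get(h(probe), ()))
    matches = []
    for c in candidates:
        close = sum(1 for p, q in zip(probe, images[c]) if abs(p - q) <= tol)
        if close > threshold:
            matches.append(c)
    return matches
```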

31 Hashing the Probe
[Figure: the probe is hashed by each of h1, h2, h3, h4, h5; look in all these buckets.]

32 Many-Many Problem
- Make each pair of images that fall in the same bucket, according to any hash function, a candidate pair.
- Score each candidate pair as for the many-one problem.

33 Entity Resolution
- You don't have the convenient multidimensional view of data that you do for "face recognition" or "similar columns."
- We actually used an LSH-inspired simplification.

34 Entity Resolution --- (2)
- Three hash functions:
  1. One bucket for each name string.
  2. One bucket for each address string.
  3. One bucket for each phone string.
- A pair is a candidate iff they mapped to the same bucket for at least one of the three hashes.
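
A sketch of this scheme, with a made-up record format (id mapped to a (name, address, phone) triple):

```python
from collections import defaultdict
from itertools import combinations

def er_candidates(records):
    """A pair of records is a candidate iff it lands in the same
    bucket for >= 1 of the three hashes (name, address, phone)."""
    candidates = set()
    for field in range(3):                  # one hash function per field
        buckets = defaultdict(list)
        for rid, rec in records.items():
            buckets[rec[field]].append(rid)
        for ids in buckets.values():
            candidates.update(combinations(sorted(ids), 2))
    return candidates

recs = {1: ("J. Smith", "12 Oak St", "555-0101"),
        2: ("John Smith", "12 Oak St", "555-0101"),
        3: ("A. Jones", "9 Elm Rd", "555-0202")}
print(er_candidates(recs))   # {(1, 2)} -- shared address and phone buckets
```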