My Favorite Algorithms for Large-Scale Data Mining

Similar presentations

Lecture outline Nearest-neighbor search in low dimensions

Applications Shingling Minhashing Locality-Sensitive Hashing

Similarity and Distance Sketching, Locality Sensitive Hashing

Data Mining of Very Large Data

Near-Duplicates Detection

Latent Semantic Indexing (mapping onto a smaller space of latent concepts) Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 18.

CSCE 3400 Data Structures & Algorithm Analysis

High Dimensional Search Min-Hashing Locality Sensitive Hashing

MMDS Secs Slides adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, October.

Randomized / Hashing Algorithms

Min-Hashing, Locality Sensitive Hashing Clustering

1 Improvements to A-Priori Park-Chen-Yu Algorithm Multistage Algorithm Approximate Algorithms Compacting Results.

CSC1016 Coursework Clarification Derek Mortimer March 2010.

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these.

Association Rule Mining

Tirgul 10 Rehearsal about Universal Hashing Solving two problems from theoretical exercises: –T2 q. 1 –T3 q. 2.

Data Mining: Associations

1 Clustering Preliminaries Applications Euclidean/Non-Euclidean Spaces Distance Measures.

Improvements to A-Priori

Distance Measures LS Families of Hash Functions S-Curves

Near Duplicate Detection

1 Methods for High Degrees of Similarity Index-Based Methods Exploiting Prefixes and Suffixes Exploiting Length.

Finding Similar Items. Set Similarity Problem: Find similar sets. Motivation: Many things can be modeled/represented as sets Applications: –Face Recognition.

1 Decidability Turing Machines Coded as Binary Strings Diagonalizing over Turing Machines Problems as Languages Undecidable Problems.

Tirgul 7. Find an efficient implementation of a dynamic collection of elements with unique keys Supported Operations: Insert, Search and Delete. The keys.

Lecture 10: Search Structures and Hashing

1 Near-Neighbor Search Applications Matrix Formulation Minhashing.

Finding Similar Items.

1 Near-Neighbor Search Applications Matrix Formulation Minhashing.

1 Finding Similar Pairs Divide-Compute-Merge Locality-Sensitive Hashing Applications.

1 Locality-Sensitive Hashing Basic Technique Hamming-LSH Applications.

Finding Near Duplicates (Adapted from slides and material from Rajeev Motwani and Jeff Ullman)

1 Low-Support, High-Correlation Finding Rare but Similar Items Minhashing Locality-Sensitive Hashing.

CHP - 9 File Structures. INTRODUCTION In some of the previous chapters, we have discussed representations of and operations on data structures. These.

Lecture 6: The Ultimate Authorship Problem: Verification for Short Docs Moshe Koppel and Yaron Winter.

Similarity and Distance Sketching, Locality Sensitive Hashing

1 Applications of LSH (Locality-Sensitive Hashing) Entity Resolution Fingerprints Similar News Articles.

Brief (non-technical) history Full-text index search engines Altavista, Excite, Infoseek, Inktomi, ca Taxonomies populated with web page Yahoo.

Finding Similar Items 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 10: Finding Similar Items Mining.

David Luebke 1 10/25/2015 CS 332: Algorithms Skip Lists Hash Tables.

Hashing Sections 10.2 – 10.3 CS 302 Dr. George Bebis.

1 Low-Support, High-Correlation Finding Rare but Similar Items Minhashing.

1 Low-Support, High-Correlation Finding rare, but very similar items.

Similarity Searching in High Dimensions via Hashing Paper by: Aristides Gionis, Poitr Indyk, Rajeev Motwani.

DATA MINING LECTURE 6 Similarity and Distance Sketching, Locality Sensitive Hashing.

Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!

DATA MINING LECTURE 6 Sketching, Min-Hashing, Locality Sensitive Hashing.

CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.

Google News Personalization Big Data reading group November 12, 2007 Presented by Babu Pillai.

Netflix Challenge: Combined Collaborative Filtering Greg Nelson Alan Sheinberg.

Randomized / Hashing Algorithms Shannon Quinn (with thanks to William Cohen of Carnegie Mellon University, and J. Leskovec, A. Rajaraman, and J. Ullman.

CS6045: Advanced Algorithms Data Structures. Hashing Tables Motivation: symbol tables –A compiler uses a symbol table to relate symbols to associated.

DATA MINING LECTURE 6 Sketching, Locality Sensitive Hashing.

Jeffrey D. Ullman Stanford University. 2  Generalized LSH is based on some kind of “distance” between points.  Similar points are “close.”  Example:

Shingling Minhashing Locality-Sensitive Hashing

CS276A Text Information Retrieval, Mining, and Exploitation

Near Duplicate Detection

Finding Similar Items Jeffrey D. Ullman Application: Similar Documents

Sketching, Locality Sensitive Hashing

Finding Similar Items: Locality Sensitive Hashing

Finding Similar Items: Locality Sensitive Hashing

Shingling Minhashing Locality-Sensitive Hashing

Theory of Locality Sensitive Hashing

CH 9.2 : Hash Tables Acknowledgement: These slides are adapted from slides provided with Data Structures and Algorithms in C++, Goodrich, Tamassia and.

Searching Similar Segments over Textual Event Sequences

CH 9.2 : Hash Tables Acknowledgement: These slides are adapted from slides provided with Data Structures and Algorithms in C++, Goodrich, Tamassia and.

Minwise Hashing and Efficient Search

Near Neighbor Search in High Dimensional Data (1)

Three Essential Techniques for Similar Documents

Locality-Sensitive Hashing

Presentation transcript:

My Favorite Algorithms for Large-Scale Data Mining Shingling Minhashing Locality-Sensitive Hashing

Similarity Search A universal set of “objects.” A collection of sets of objects. Find the pairs of sets that are “similar.” Jaccard similarity of sets = size of intersection divided by size of union.

Example: Jaccard Similarity 3 in intersection. 8 in union. Jaccard similarity = 3/8
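
The definition above can be sketched in a few lines of Python. The two sets here are hypothetical, chosen only so that 3 elements are shared and 8 appear overall, as in the slide; the choice to return 1.0 for two empty sets is an assumption, not something the slides specify.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of intersection divided by size of union."""
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical sets: 3 elements in the intersection, 8 in the union.
s1 = {1, 2, 3, 4, 5}
s2 = {3, 4, 5, 6, 7, 8}
print(jaccard(s1, s2))  # 0.375, i.e. 3/8
```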

Applications Collaborative Filtering : Amazon customers as the set of products they buy. Recommend what similar customers bought. Similar Documents : A document as its set of k-shingles = strings of k consecutive characters. Examples: news articles from same source, plagiarism.

Applications – (2) Fingerprint Checking : Represent a fingerprint by the set of positions of minutiae. Requires discretization. Entity Resolution : Represent records describing individuals by sets of attribute/value pairs.

Key Ideas Shingling : (Andrei Broder) Convert documents into sets. Minhashing : (Edith Cohen, Broder) Construct small signatures for sets so Jaccard similarity of sets can be determined from the signatures. Locality-Sensitive Hashing : (Rajeev Motwani, Piotr Indyk) Focus on (likely) similar pairs without looking at all pairs.

The Big Picture – Documents Document → Shingling → the set of strings of length k that appear in the document → Minhashing → Signatures: short integer vectors that represent the sets and reflect their similarity → Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity.

The Big Picture – Fingerprints Fingerprint → Extract minutiae and discretize → (optional minhashing here) → Locality-Sensitive Hashing → Candidate pairs: those pairs of fingerprints that we need to test for similarity.

When Is Similarity Interesting? When the sets are so large or so many that they cannot fit in main memory. When there are so many sets that comparing all pairs of sets takes too much time.

Shingling k-Shingles Documents as Sets

Shingles A k-shingle (or k-gram) for a document is a sequence of k characters that appears in the document. Example: k = 2; doc = abcab. Set of 2-shingles = {ab, bc, ca}. Option: regard shingles as a bag, and count ab twice. Represent a doc by its set of k-shingles.
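
A minimal sketch of shingling, using the slide's own example document. The function names `shingles` and `shingle_bag` are illustrative, not taken from the slides.

```python
from collections import Counter


def shingles(doc: str, k: int) -> set:
    """Set of all substrings of length k that appear in doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}


def shingle_bag(doc: str, k: int) -> Counter:
    """Bag (multiset) variant: each k-shingle with its number of occurrences."""
    return Counter(doc[i:i + k] for i in range(len(doc) - k + 1))


print(sorted(shingles("abcab", 2)))  # ['ab', 'bc', 'ca']
print(shingle_bag("abcab", 2)["ab"])  # 2
```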

Working Assumption Documents that have lots of shingles in common have similar text, even if the text appears in different order. Careful: you must pick k large enough, or most documents will have most shingles. k = 5 is OK for short documents; k = 10 is better for long documents.

Shingles: Compression Option To compress long shingles, we can hash them to (say) 4 bytes. Represent a doc by the set of hash values of its k-shingles. Two documents could (rarely) appear to have shingles in common, when in fact only the hash-values were shared.
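
One way to sketch the compression option is to map each shingle to a 32-bit (4-byte) value. CRC32 from Python's standard `zlib` module is used here purely as a convenient, deterministic 4-byte hash; the slides do not say which hash function to use, so treat that choice as an assumption.

```python
import zlib


def hashed_shingles(doc: str, k: int) -> set:
    """Represent doc by the set of 32-bit hash values of its k-shingles."""
    return {
        zlib.crc32(doc[i:i + k].encode("utf-8"))  # unsigned 32-bit integer
        for i in range(len(doc) - k + 1)
    }


tokens = hashed_shingles("the quick brown fox", 10)
print(len(tokens))  # number of distinct hash values (at most 10 here)
```

Two different shingles can collide on the same 32-bit value, which is the rare false sharing the slide warns about.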

Minhashing Matrix Formulation Signatures Similarity of Signatures

Similarity as a Matrix Problem Think of sets represented by a matrix of 0’s and 1’s. Row = object. Column = set. 1 means that object is in that set.

Example: Similarity of Columns
     C1  C2
u    0   1
v    1   0
w    1   1
x    0   0
y    1   1
z    0   1
C1 = {v, w, y}; C2 = {u, w, y, z}. Sim(C1, C2) = 2/5 = 0.4 (w and y are in both sets; u, v, w, y, z are in at least one).

Four Types of Rows Given columns C1 and C2, rows may be classified as:
type  C1  C2
a     1   1
b     1   0
c     0   1
d     0   0
Also, a = # rows of type a, etc. Note Sim(C1, C2) = a/(a + b + c).
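
As a quick check of the a/(a + b + c) formula, the sketch below counts row types for the columns C1 = {v, w, y} and C2 = {u, w, y, z} from the earlier example. The list encoding of the columns (rows in the order u, v, w, x, y, z) is an assumption made for illustration.

```python
def column_similarity(c1, c2):
    """Jaccard similarity of two 0/1 columns via the row-type counts a, b, c."""
    a = sum(1 for x, y in zip(c1, c2) if x == 1 and y == 1)  # type a: 1 1
    b = sum(1 for x, y in zip(c1, c2) if x == 1 and y == 0)  # type b: 1 0
    c = sum(1 for x, y in zip(c1, c2) if x == 0 and y == 1)  # type c: 0 1
    # Type-d rows (0 0) do not affect the similarity.
    return a / (a + b + c)


# Rows in the order u, v, w, x, y, z.
C1 = [0, 1, 1, 0, 1, 0]
C2 = [1, 0, 1, 0, 1, 1]
print(column_similarity(C1, C2))  # 0.4
```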

Minhashing Imagine the rows permuted randomly. Define “hash” function h(C) = the number of the first (in the permuted order) row in which column C has 1. Use several (100?) independent hash functions to create a signature with that number of integer hash-values.

Minhashing Example [Figure: row permutations, the input matrix, and the resulting signature matrix M.]

Surprising Property The probability (over all permutations of the rows) that h(C1) = h(C2) is the same as Sim(C1, C2). Both are a/(a + b + c)! Why? Look down columns C1 and C2 (in permuted order) until we see a 1. If it’s a type-a row, then h(C1) = h(C2). If a type-b or type-c row, then not.
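
The property can be verified exactly on a small case by enumerating every permutation of the rows. This sketch reuses the six-row columns C1 = {v, w, y} and C2 = {u, w, y, z} from the earlier example (rows indexed 0 to 5 for u to z, an encoding assumed here) and counts how often the two minhash values agree.

```python
from fractions import Fraction
from itertools import permutations


def minhash(column, order):
    """First row, in the given permuted order, in which the column has a 1."""
    for row in order:
        if column[row]:
            return row
    return None  # column is all zeros


C1 = [0, 1, 1, 0, 1, 0]  # rows u, v, w, x, y, z
C2 = [1, 0, 1, 0, 1, 1]

agree = 0
total = 0
for order in permutations(range(6)):  # all 720 row orders
    total += 1
    if minhash(C1, order) == minhash(C2, order):
        agree += 1

print(Fraction(agree, total))  # 2/5, the Jaccard similarity of C1 and C2
```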

Similarity for Signatures The similarity of signatures is the fraction of the rows in which they agree. Remember, each row corresponds to a permutation or “hash function.”
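
A minimal sketch of signature similarity, assuming each signature is a list with one minhash value per hash function. The two example signatures are made up for illustration.

```python
def signature_similarity(sig_a, sig_b):
    """Fraction of signature positions (hash functions) on which two signatures agree."""
    if len(sig_a) != len(sig_b) or not sig_a:
        raise ValueError("signatures must be non-empty and of equal length")
    agree = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return agree / len(sig_a)


# Hypothetical signatures from four hash functions; they agree in positions 0 and 2.
print(signature_similarity([1, 3, 0, 2], [1, 2, 0, 5]))  # 0.5
```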

Implementation – (1) You can’t really permute rows physically. Good approximation to permuting rows: pick 100 (?) hash functions. For each column c and each hash function h_i, keep a “slot” M(i, c) for that minhash value.

Implementation – (2)
for each row r
    for each column c
        if c has 1 in row r
            for each hash function h_i do
                if h_i(r) is a smaller value than M(i, c) then
                    M(i, c) := h_i(r);

Example
h(x) = x mod 5; g(x) = (2x + 1) mod 5
Row  C1  C2
1    1   0
2    0   1
3    1   1
4    1   0
5    0   1
Signatures after processing each row:
              Sig1  Sig2
h(1) = 1      1     -
g(1) = 3      3     -
h(2) = 2      1     2
g(2) = 0      3     0
h(3) = 3      1     2
g(3) = 2      2     0
h(4) = 4      1     2
g(4) = 4      2     0
h(5) = 0      1     0
g(5) = 1      2     0
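
The row-by-row update can be sketched in Python and run on the example above, with h(x) = x mod 5 and g(x) = (2x + 1) mod 5. The input encoding (each row given as a list of the columns that have a 1 in it) and the use of infinity as the initial slot value are assumptions made for this sketch.

```python
import math


def minhash_signatures(rows, hash_funcs, n_cols):
    """Build the signature matrix M, where M[i][c] is the minhash of column c under h_i.

    rows: iterable of (row_number, list of column indices that have a 1 in that row).
    """
    # Every slot starts at infinity so that any real hash value is smaller.
    M = [[math.inf] * n_cols for _ in hash_funcs]
    for r, cols in rows:
        hashed = [h(r) for h in hash_funcs]  # compute each h_i(r) once per row
        for c in cols:
            for i, value in enumerate(hashed):
                if value < M[i][c]:
                    M[i][c] = value
    return M


# Example from the slide: column 0 is C1, column 1 is C2.
rows = [(1, [0]), (2, [1]), (3, [0, 1]), (4, [0]), (5, [1])]
h = lambda x: x % 5
g = lambda x: (2 * x + 1) % 5
print(minhash_signatures(rows, [h, g], 2))  # [[1, 0], [2, 0]]
```

The printed rows correspond to h and g, so Sig1 = (1, 2) and Sig2 = (0, 0), matching the final state in the example.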

Implementation – (3) Often, data is given by column, not row. E.g., columns = documents, rows = shingles. If so, sort matrix once so it is by row. And always compute h_i(r) only once for each row.

Locality-Sensitive Hashing The All-Pairs Problem Banding of Signature Matrices Other LSH Techniques

Finding Similar Sets We can use minhashing to replace sets (columns of the matrix) by short lists of integers. But we still need to compare each pair of signatures. Example: 20 million Amazon customers; $2 \times 10^{14}$ pairs of customers to evaluate.

Locality-Sensitive Hashing What we want seems impossible. Map signatures to buckets so that: Two similar signatures have a very good chance of appearing in the same bucket. If two signatures are not very similar, they probably don’t appear in one bucket. Then, we only have to compare bucket-mates (candidate pairs ).

LSH for Signatures Think of the signature for each column (set) as a column of the signature matrix S. Divide the rows of S into b bands of r rows each.

Partition Into Bands – (1) r rows per band b bands One signature Matrix S

Partition into Bands – (2) For each band, hash its portion of each column to a hash table with many buckets. Candidate column pairs are those that hash to the same bucket for ≥ 1 band. Tune b and r to catch most similar pairs, but few nonsimilar pairs.
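
The banding step can be sketched as follows. This is a simplified illustration: each band's slice of a signature is used directly as a dictionary key, which behaves like a hash table with no accidental collisions, whereas a real system would hash into a fixed number of buckets. The names and the tiny signatures are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidates(signatures, b, r):
    """Return pairs of names whose signatures are identical in at least one band.

    signatures: dict mapping a name to a list of b * r minhash values.
    """
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            if len(sig) != b * r:
                raise ValueError("each signature must have b * r values")
            key = tuple(sig[band * r:(band + 1) * r])  # this band's portion
            buckets[key].append(name)
        for members in buckets.values():
            candidates.update(combinations(sorted(members), 2))
    return candidates


# Hypothetical signatures with b = 2 bands of r = 2 rows each.
sigs = {"A": [1, 2, 3, 4], "B": [1, 2, 9, 9], "C": [7, 7, 7, 4]}
print(lsh_candidates(sigs, b=2, r=2))  # {('A', 'B')}
```

A and B become a candidate pair because they agree on the whole first band; C shares only single values with A, never a whole band.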

Buckets Matrix S b bands r rows

Analysis of LSH – What We Want [Plot: probability of sharing a bucket versus similarity s of two sets; a step at threshold t.] Probability = 1 if s > t; no chance if s < t.

What One Band of One Row Gives You [Plot: probability of sharing a bucket versus similarity s of two sets, with threshold t marked.] Remember: probability of equal hash-values = similarity.

What b Bands of r Rows Gives You [Plot: probability of sharing a bucket versus similarity s of two sets; threshold $t \approx (1/b)^{1/r}$.] All rows of a band are equal: $s^r$. Some row of a band unequal: $1 - s^r$. No bands identical: $(1 - s^r)^b$. At least one band identical: $1 - (1 - s^r)^b$.

Example: b = 20; r = 5
s     $1 - (1 - s^r)^b$
0.2   0.006
0.3   0.047
0.4   0.186
0.5   0.470
0.6   0.802
0.7   0.975
0.8   0.9996
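
The table can be reproduced directly from the formula $1 - (1 - s^r)^b$. The sketch below also evaluates the approximate threshold $(1/b)^{1/r}$ for the same parameters, which comes out at roughly 0.55; the function names are illustrative.

```python
def candidate_probability(s: float, r: int, b: int) -> float:
    """Probability that two sets with similarity s agree on all rows of at least one band."""
    return 1 - (1 - s ** r) ** b


def approximate_threshold(r: int, b: int) -> float:
    """Rough similarity at which the S-curve rises most steeply: (1/b)^(1/r)."""
    return (1 / b) ** (1 / r)


for s in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
    print(f"{s:.1f}  {candidate_probability(s, r=5, b=20):.4f}")

print(approximate_threshold(r=5, b=20))
```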

Summary of Minhash/LSH Represent the objects you are comparing by sets (e.g., shingling). Represent the sets by signatures (minhashing). Use LSH to create buckets; candidate pairs are those in the same bucket. Evaluate only the candidate pairs.

Application: LSH for Fingerprints Place a grid on a fingerprint. Normalize so identical prints will overlap. Set of grid points where minutiae are located represents the fingerprint. Possibly, treat minutiae near a grid boundary as if also present in adjacent grid points.

Discretizing Minutiae located here Maybe pretend it is here also

Applying LSH to Fingerprints We could minhash the bit-vectors to obtain signatures. But since there probably aren’t too many grid points, we can work from the bit-vectors directly.

LSH/Fingerprints – (2) Pick 100 (?) sets of 3 (?) grid points, randomly. For each set of three grid points, those prints that have 1 for all three points are placed in a bucket. All pairs in this bucket are candidates.
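
A rough sketch of this scheme, assuming each print is stored as the set of grid-point indices where minutiae were found. The function name, the fixed seed, and the tiny example are illustrative only.

```python
import random
from itertools import combinations


def fingerprint_candidates(prints, n_points, n_sets=100, set_size=3, seed=0):
    """Candidate pairs: prints that have minutiae at all points of some random probe set.

    prints: dict mapping a print's name to the set of grid points with minutiae.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same probe sets
    candidates = set()
    for _ in range(n_sets):
        probe = set(rng.sample(range(n_points), set_size))
        # The bucket for this probe holds every print with a 1 at all its points.
        bucket = sorted(name for name, pts in prints.items() if probe <= pts)
        candidates.update(combinations(bucket, 2))
    return candidates


# Hypothetical prints on a 10-point grid.
prints = {
    "left": {0, 1, 2, 3, 4, 5, 6},
    "left_rescan": {0, 1, 2, 3, 4, 5, 7},
    "other": {8, 9},
}
print(fingerprint_candidates(prints, n_points=10))
```

A print with fewer than three minutiae, such as "other" here, can never land in a bucket, since no probe set of three points fits inside it.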

Application: Matching Customer Records I once took a consulting job solving the following problem: Company A agreed to solicit customers for Company B, for a fee. They then argued over how many customers. Neither recorded exactly which customers were involved.

Customer Records – (2) Company B had about 1 million records of all its customers. Company A had about 1 million records describing customers, some of which it had signed up for B. Records had name, address, and phone, but for various reasons, they could be different for the same person.

Customer Records – (3) Step 1: Design a measure (“score ”) of how similar records are: E.g., deduct points for small misspellings (“Jeffrey” vs. “Geoffery”) or same phone with different area code. Step 2: Score all pairs of records; report high scores as matches.

Customer Records – (4) Problem: (1 million)² is too many pairs of records to score. Solution: A simple LSH. Three hash functions: exact values of name, address, phone. Compare iff records are identical in at least one. Misses similar records with a small difference in all three fields.
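
The three-hash-function idea can be sketched with one index per field. The record layout (dictionaries with `name`, `address`, and `phone` keys) and the sample records are assumptions for illustration, not the format used in the actual engagement.

```python
from collections import defaultdict


def candidate_pairs(records_a, records_b, fields=("name", "address", "phone")):
    """Pairs (i, j) where record i of A and record j of B match exactly on some field."""
    candidates = set()
    for field in fields:
        # Bucket A's records by the exact value of this field.
        index = defaultdict(list)
        for i, rec in enumerate(records_a):
            index[rec[field]].append(i)
        # A record of B is compared only with A's records in the same bucket.
        for j, rec in enumerate(records_b):
            for i in index.get(rec[field], ()):
                candidates.add((i, j))
    return candidates


company_a = [
    {"name": "Jeffrey Ullman", "address": "1 Main St", "phone": "555-0100"},
    {"name": "Ann Lee", "address": "9 Oak Ave", "phone": "555-0199"},
]
company_b = [
    {"name": "Geoffery Ullman", "address": "1 Main St", "phone": "555-0100"},
    {"name": "Bob Ray", "address": "2 Elm Rd", "phone": "555-0142"},
]
print(candidate_pairs(company_a, company_b))  # {(0, 0)}
```

Only the candidate pairs then need to be scored, instead of all pairs of records.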

Aside: Validation of Results We were able to tell, in an interesting way, which values of the scoring function were reliable. Identical records had an average creation-date difference of 10 days. We only looked for records created within 90 days of each other, so bogus matches had a 45-day average difference.

Validation – (2) By looking at the pool of matches with a fixed score, we could compute the average time-difference, say x, and deduce that fraction (45 - x)/35 of them were valid matches, since a pool with valid fraction f has average x = 10f + 45(1 - f). Alas, the lawyers didn’t think the jury would understand.
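
The estimate follows from treating the observed average as a mixture of valid matches (10-day average) and bogus matches (45-day average). A small sketch, with a made-up observed average of 17 days:

```python
def valid_fraction(x, valid_avg=10.0, bogus_avg=45.0):
    """Fraction f of valid matches, solving x = valid_avg * f + bogus_avg * (1 - f)."""
    return (bogus_avg - x) / (bogus_avg - valid_avg)


# Hypothetical pool whose matches have an average date difference of 17 days.
print(valid_fraction(17.0))  # 0.8
```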

The End Thanks for Listening