My Favorite Algorithms for Large-Scale Data Mining

Name: My Favorite Algorithms for Large-Scale Data Mining
Uploaded: 2017-07-12T04:21:50+00:00
Duration: PTM16S31
Description: My Favorite Algorithms for Large-Scale Data Mining

My Favorite Algorithms for Large-Scale Data Mining
Shingling Minhashing Locality-Sensitive Hashing

Similarity Search A universal set of “objects.”
A collection of sets of objects. Find the pairs of sets that are “similar.” Jaccard similarity of sets = size of intersection divided by size of union.

Example: Jaccard Similarity
3 in intersection. 8 in union. Jaccard similarity = 3/8

Applications Collaborative Filtering : Amazon customers as the set of products they buy. Recommend what similar customers bought. Similar Documents : A document as its set of k-shingles = strings of k consecutive characters. Examples: news articles from same source, plagiarism.

Applications – (2) Fingerprint Checking : Represent a fingerprint by the set of positions of minutiae. Requires discretization. Entity Resolution : Represent records describing individuals by sets of attribute/value pairs.

Key Ideas Shingling : (Andrei Broder) Convert documents into sets.
Minhashing : (Edith Cohen, Broder) Construct small signatures for sets so Jaccard similarity of sets can be determined from the signatures. Locality-Sensitive Hashing : (Rajeev Motwani, Piotr Indyk) Focus on (likely) similar pairs without looking at all pairs.

The Big Picture – Documents
Locality- sensitive Hashing Candidate pairs : those pairs of signatures that we need to test for similarity. Minhash- ing Signatures : short integer vectors that represent the sets and reflect their similarity Shingling Docu- ment The set of strings of length k that appear in the document

The Big Picture – Fingerprints
Locality- sensitive Hashing Candidate pairs : those pairs of fingerprints that we need to test for similarity. Extract minutiae and discret- ize Finger- print Optional minhashing here

When Is Similarity Interesting?
When the sets are so large or so many that they cannot fit in main memory. When there are so many sets that comparing all pairs of sets takes too much time.

k -Shingles Documents as Sets
Shingling k -Shingles Documents as Sets

Shingles A k-shingle (or k-gram) for a document is a sequence of k characters that appears in the document. Example: k=2; doc = abcab. Set of 2-shingles = {ab, bc, ca}. Option: regard shingles as a bag, and count ab twice. Represent a doc by its set of k -shingles.

Working Assumption Documents that have lots of shingles in common have similar text, even if the text appears in different order. Careful: you must pick k large enough, or most documents will have most shingles. k = 5 is OK for short documents; k = 10 is better for long documents.

Shingles: Compression Option
To compress long shingles, we can hash them to (say) 4 bytes. Represent a doc by the set of hash values of its k-shingles. Two documents could (rarely) appear to have shingles in common, when in fact only the hash-values were shared.

Matrix Formulation Signatures Similarity of Signatures
Minhashing Matrix Formulation Signatures Similarity of Signatures

Similarity as a Matrix Problem
Think of sets represented by a matrix of 0’s and 1’s. Row = object. Column = set. 1 means that object is in that set.

Example: Similarity of Columns
C1 C2 u 0 1 v 1 0 w 1 1 Sim (C1, C2) = x /5 = 0.4 y 1 1 z 0 1 C1 = {v,w,y} C2 = {u,w,y,z} * *

Four Types of Rows Given columns C1 and C2, rows may be classified as:
C1 C2 a 1 1 b 1 0 c 0 1 d 0 0 Also, a = # rows of type a , etc. Note Sim (C1, C2) = a /(a +b +c ).

Minhashing Imagine the rows permuted randomly.
Define “hash” function h (C ) = the number of the first (in the permuted order) row in which column C has 1. Use several (100?) independent hash functions to create a signature with that number of integer hash-values.

Minhashing Example Input matrix 1 Signature matrix M 1 2 4 5 2 6 7 3 1 5 7 6 3 1 2 4 3 4 7 6 1 2 5

Surprising Property The probability (over all permutations of the rows) that h (C1) = h (C2) is the same as Sim (C1, C2). Both are a /(a +b +c )! Why? Look down columns C1 and C2 (in permuted order) until we see a 1. If it’s a type-a row, then h (C1) = h (C2). If a type-b or type-c row, then not.

Similarity for Signatures
The similarity of signatures is the fraction of the rows in which they agree. Remember, each row corresponds to a permutation or “hash function.”

Implementation – (1) You can’t really permute rows physically.
Good approximation to permuting rows: pick 100 (?) hash functions. For each column c and each hash function hi , keep a “slot” M (i, c ) for that minhash value.

Implementation – (2) for each row r for each column c
if c has 1 in row r for each hash function hi do if hi (r ) is a smaller value than M (i, c ) then M (i, c ) := hi (r );

Example Sig1 Sig2 h(1) = 1 1 - g(1) = 3 3 - Row C1 C2 h(2) = 2 1 2
h(2) = 2 1 2 g(2) = 0 3 0 h(3) = 3 1 2 g(3) = 2 2 0 h(4) = 4 1 2 g(4) = 4 2 0 h(x) = x mod 5 g(x) = 2x +1 mod 5 h(5) = 0 1 0 g(5) = 1 2 0

Implementation – (3) Often, data is given by column, not row.
E.g., columns = documents, rows = shingles. If so, sort matrix once so it is by row. And always compute hi (r ) only once for each row.

Locality-Sensitive Hashing
The All-Pairs Problem Banding of Signature Matrices Other LSH Techniques

Finding Similar Sets We can use minhashing to replace sets (columns of the matrix) by short lists of integers. But we still need to compare each pair of signatures. Example: 20 million Amazon customers; 2*1014 pairs of customers to evaluate.

Locality-Sensitive Hashing
What we want seems impossible. Map signatures to buckets so that: Two similar signatures have a very good chance of appearing in the same bucket. If two signatures are not very similar, they probably don’t appear in one bucket. Then, we only have to compare bucket-mates (candidate pairs ).

LSH for Signatures Think of the signature for each column (set) as a column of the signature matrix S. Divide the rows of S into b bands of r rows each.

Partition Into Bands – (1)
r rows per band b bands One signature Matrix S

Partition into Bands – (2)
For each band, hash its portion of each column to a hash table with many buckets. Candidate column pairs are those that hash to the same bucket for ≥ 1 band. Tune b and r to catch most similar pairs, but few nonsimilar pairs.

Buckets Matrix S b bands r rows

Analysis of LSH – What We Want
Probability = 1 if s > t Probability of sharing a bucket No chance if s < t t Similarity s of two sets

What One Band of One Row Gives You
Remember: probability of equal hash-values = similarity Probability of sharing a bucket t Similarity s of two sets

What b Bands of r Rows Gives You
( )b No bands identical 1 - At least one band identical 1 - Some row of a band unequal s r All rows of a band are equal t ~ (1/b)1/r Probability of sharing a bucket t Similarity s of two sets

Example: b = 20; r = 5 s 1-(1-sr)b .2 .006 .3 .047 .4 .186 .5 .470 .6
.802 .7 .975 .8 .9996

Summary of Minhash/LSH
Represent the objects you are comparing by sets (e.g., shingling). Represent the sets by signatures (minhashing). Use LSH to create buckets; candidate pairs are those in the same bucket. Evaluate only the candidate pairs.

Application: LSH for Fingerprints
Place a grid on a fingerprint. Normalize so identical prints will overlap. Set of grid points where minutiae are located represents the fingerprint. Possibly, treat minutiae near a grid boundary as if also present in adjacent grid points.

Discretizing Minutiae
located here Maybe pretend it is here also

Applying LSH to Fingerprints
We could minhash the bit-vectors to obtain signatures. But since there probably aren’t too many grid points, we can work from the bit-vectors directly.

LSH/Fingerprints – (2) Pick 100 (?) sets of 3 (?) grid points, randomly. For each set of three grid points, those prints that have 1 for all three points are placed in a bucket. All pairs in this bucket are candidates.

Application: Matching Customer Records
I once took a consulting job solving the following problem: Company A agreed to solicit customers for Company B, for a fee. They then argued over how many customers. Neither recorded exactly which customers were involved.

Customer Records – (2) Company B had about 1 million records of all its customers. Company A had about 1 million records describing customers, some of which it had signed up for B. Records had name, address, and phone, but for various reasons, they could be different for the same person.

Customer Records – (3) Step 1: Design a measure (“score ”) of how similar records are: E.g., deduct points for small misspellings (“Jeffrey” vs. “Geoffery”) or same phone with different area code. Step 2: Score all pairs of records; report high scores as matches.

Customer Records – (4) Problem: (1 million)2 is too many pairs of records to score. Solution: A simple LSH. Three hash functions: exact values of name, address, phone. Compare iff records are identical in at least one. Misses similar records with a small difference in all three fields.

Aside: Validation of Results
We were able to tell what values of the scoring function were reliable in an interesting way. Identical records had a creation date difference of 10 days. We only looked for records created within 90 days, so bogus matches had a 45-day average.

Validation – (2) By looking at the pool of matches with a fixed score, we could compute the average time-difference, say x, and deduce that fraction (45-x )/35 of them were valid matches. Alas, the lawyers didn’t think the jury would understand.

The End Thanks for Listening

My Favorite Algorithms for Large-Scale Data Mining

Similar presentations

Presentation on theme: "My Favorite Algorithms for Large-Scale Data Mining"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

My Favorite Algorithms for Large-Scale Data Mining

Similar presentations

Presentation on theme: "My Favorite Algorithms for Large-Scale Data Mining"— Presentation transcript:

Similar presentations

About project

Feedback