Low-Support, High-Correlation: Finding Rare but Similar Items. Minhashing. Locality-Sensitive Hashing.

Similar presentations
Lecture outline Nearest-neighbor search in low dimensions

Applications Shingling Minhashing Locality-Sensitive Hashing
Hash-Based Improvements to A-Priori
1 Frequent Itemset Mining: Computation Model. Typically, data is kept in a flat file rather than a database system; stored on disk, basket-by-basket.
Similarity and Distance Sketching, Locality Sensitive Hashing
Data Mining of Very Large Data
Near-Duplicates Detection
1 CPS : Information Management and Mining Association Rules and Frequent Itemsets.
Data Structures Using C++ 2E
High Dimensional Search Min-Hashing Locality Sensitive Hashing
1 Mining Associations Apriori Algorithm. 2 Computation Model. Typically, data is kept in a flat file rather than a database system; stored on disk.
MMDS Secs Slides adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, October.
1 Association Rules Market Baskets Frequent Itemsets A-Priori Algorithm.
Min-Hashing, Locality Sensitive Hashing Clustering
1 Association Rules Apriori Algorithm. 2 Computation Model. Typically, data is kept in a flat file rather than a database system; stored on disk.
1 Improvements to A-Priori Park-Chen-Yu Algorithm Multistage Algorithm Approximate Algorithms Compacting Results.
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these.
Association Rule Mining
My Favorite Algorithms for Large-Scale Data Mining
Data Mining: Associations
1 Association Rules Market Baskets Frequent Itemsets A-priori Algorithm.
Improvements to A-Priori
CS 361A1 CS 361A (Advanced Data Structures and Algorithms) Lecture 20 (Dec 7, 2005) Data Mining: Association Rules Rajeev Motwani (partially based on notes.
Improvements to A-Priori
Distance Measures LS Families of Hash Functions S-Curves
Near Duplicate Detection
Association Rules Prof. Sin-Min Lee Department of Computer Science.
1 More Stream-Mining Counting How Many Elements Computing “Moments”
Finding Similar Items. Set Similarity Problem: Find similar sets. Motivation: Many things can be modeled/represented as sets Applications: –Face Recognition.
More Stream-Mining Counting Distinct Elements Computing “Moments”
1 Near-Neighbor Search Applications Matrix Formulation Minhashing.
Finding Similar Items.
1 Near-Neighbor Search Applications Matrix Formulation Minhashing.
Data Structures Using C++ 2E Chapter 9 Searching and Hashing Algorithms.
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these.
1 Improvements to A-Priori Park-Chen-Yu Algorithm Multistage Algorithm Approximate Algorithms Compacting Results.
1 Finding Similar Pairs Divide-Compute-Merge Locality-Sensitive Hashing Applications.
1 Locality-Sensitive Hashing Basic Technique Hamming-LSH Applications.
Finding Near Duplicates (Adapted from slides and material from Rajeev Motwani and Jeff Ullman)
1 Still More Stream-Mining Frequent Itemsets Elephants and Troops Exponentially Decaying Windows.
1 “Association Rules” Market Baskets Frequent Itemsets A-priori Algorithm.
CSE 373 Data Structures Lecture 15
Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.
CHAPTER 09 Compiled by: Dr. Mohammad Omar Alhawarat Sorting & Searching.
1 Applications of LSH (Locality-Sensitive Hashing) Entity Resolution Fingerprints Similar News Articles.
Finding Similar Items 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 10: Finding Similar Items Mining.
1 Low-Support, High-Correlation Finding Rare but Similar Items Minhashing.
1 Low-Support, High-Correlation Finding rare, but very similar items.
DATA MINING LECTURE 6 Similarity and Distance Sketching, Locality Sensitive Hashing.
Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!
DATA MINING LECTURE 6 Sketching, Min-Hashing, Locality Sensitive Hashing.
DATA MINING LECTURE 5 MinHashing, Locality Sensitive Hashing, Clustering.
DATA MINING LECTURE 6 Sketching, Locality Sensitive Hashing.
Jeffrey D. Ullman Stanford University. 2  Generalized LSH is based on some kind of “distance” between points.  Similar points are “close.”  Example:
Shingling Minhashing Locality-Sensitive Hashing
CS276A Text Information Retrieval, Mining, and Exploitation
Data Structures Using C++ 2E
Near Duplicate Detection
Finding Similar Items Jeffrey D. Ullman Application: Similar Documents
Streaming & sampling.
Sketching, Locality Sensitive Hashing
Data Structures Using C++ 2E
Finding Similar Items: Locality Sensitive Hashing
Finding Similar Items: Locality Sensitive Hashing
Shingling Minhashing Locality-Sensitive Hashing
Theory of Locality Sensitive Hashing
Hash-Based Improvements to A-Priori
Near Neighbor Search in High Dimensional Data (1)
Three Essential Techniques for Similar Documents
Locality-Sensitive Hashing
Presentation transcript:

1 Low-Support, High-Correlation: Finding Rare but Similar Items. Minhashing. Locality-Sensitive Hashing

2 The Problem
- Rather than finding high-support item pairs in basket data, look for items that are highly "correlated."
  - If one appears in a basket, there is a good chance that the other does.
  - "Yachts and caviar" as an itemset: low support, but the items often appear together.

3 Correlation Versus Support
- A-Priori and similar methods are useless for low-support, high-correlation itemsets.
- When the support threshold is low, too many itemsets are frequent.
  - Memory requirements become too high.
- A-Priori does not address correlation.

4 Matrix Representation of Item/Basket Data
- Columns = items.
- Rows = baskets.
- Entry (r, c) = 1 if item c is in basket r; 0 if not.
- Assume the matrix is almost all 0's.

5 In Matrix Form

              m  c  p  b  j
    {m,c,b}   1  1  0  1  0
    {m,p,b}   1  0  1  1  0
    {m,b}     1  0  0  1  0
    {c,j}     0  1  0  0  1
    {m,p,j}   1  0  1  0  1
    {m,c,b,j} 1  1  0  1  1
    {c,b,j}   0  1  0  1  1
    {c,b}     0  1  0  1  0
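As a small illustration (not from the slides), the following Python sketch builds this matrix from the baskets above; the item order m, c, p, b, j matches the columns of the table.

```python
# Build the item/basket matrix from the example baskets: columns = items, rows = baskets.
items = ["m", "c", "p", "b", "j"]
baskets = [{"m", "c", "b"}, {"m", "p", "b"}, {"m", "b"}, {"c", "j"},
           {"m", "p", "j"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"c", "b"}]

matrix = [[1 if item in basket else 0 for item in items] for basket in baskets]
for basket, row in zip(baskets, matrix):
    print(sorted(basket), row)   # e.g. ['b', 'c', 'm'] [1, 1, 0, 1, 0]
```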

6 Applications --- (1)
- Rows = customers; columns = items.
  - (r, c) = 1 if and only if customer r bought item c.
  - Well-correlated columns are items that tend to be bought by the same customers.
  - Used by on-line vendors to select items to "pitch" to individual customers.

7 Applications --- (2)
- Rows = (footprints of) shingles; columns = documents.
  - (r, c) = 1 iff footprint r is present in document c.
  - Find similar documents, as in Anand's 10/10 lecture.

8 Applications --- (3)
- Rows and columns are both Web pages.
  - (r, c) = 1 iff page r links to page c.
  - Correlated columns are pages with many of the same in-links.
  - These pages may be about the same topic.

9 Assumptions --- (1)
1. The number of items allows a small amount of main memory per item (e.g., main memory = number of items × a small constant).
2. There are too many items to store anything in main memory for each pair of items.

10 Assumptions --- (2)
3. There are too many baskets to store anything in main memory for each basket.
4. The data is very sparse: it is rare for an item to be in a basket.

11 From Correlation to Similarity
- Statistical correlation is too hard to compute, and probably meaningless here.
  - Most entries are 0, so the correlation of columns is always high.
- Substitute "similarity," as in the shingles-and-documents study.

12 Similarity of Columns
- Think of a column as the set of rows in which it has a 1.
- The similarity of columns C1 and C2, Sim(C1, C2), is the ratio of the sizes of the intersection and union of C1 and C2.
  - Sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2| = the Jaccard measure.
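A minimal sketch (mine, not from the slides) of this measure in Python, treating each column as the set of row numbers in which it has a 1:

```python
def jaccard(c1, c2):
    """Jaccard similarity of two 0/1 columns, viewed as sets of row numbers."""
    s1 = {r for r, bit in enumerate(c1) if bit}
    s2 = {r for r, bit in enumerate(c2) if bit}
    return len(s1 & s2) / len(s1 | s2)

print(jaccard([1, 1, 0, 1, 0, 1], [0, 1, 1, 1, 0, 0]))   # intersection 2, union 5 -> 0.4
```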

13 Example
- For two columns C1 and C2 whose intersection has 2 rows and whose union has 5 rows, Sim(C1, C2) = 2/5 = 0.4.

14 Outline of Algorithm
1. Compute "signatures" ("sketches") of the columns = small summaries of the columns.
   - Read from disk into main memory.
2. Examine the signatures in main memory to find similar signatures.
   - Essential: the similarity of signatures and columns must be related.
3. Check that columns with similar signatures are really similar (optional).

15 Signatures
- Key idea: "hash" each column C to a small signature Sig(C), such that:
  1. Sig(C) is small enough that we can fit a signature in main memory for each column.
  2. Sim(C1, C2) is the same as the "similarity" of Sig(C1) and Sig(C2).

16 An Idea That Doesn't Work
- Pick 100 rows at random, and let the signature of column C be the 100 bits of C in those rows.
- Because the matrix is sparse, many columns would have all 0's as a signature, yet be very dissimilar because their 1's are in different rows.

17 Four Types of Rows
- Given columns C1 and C2, rows may be classified as:

    Type  C1  C2
     a     1   1
     b     1   0
     c     0   1
     d     0   0

- Also, let a = the number of rows of type a, etc.
- Note that Sim(C1, C2) = a / (a + b + c).
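To connect the two views, here is a small illustrative check (the columns are made up) that a/(a + b + c) agrees with the Jaccard similarity defined earlier:

```python
from collections import Counter

def row_types(c1, c2):
    """Count rows of type a (1,1), b (1,0), c (0,1), and d (0,0) for two 0/1 columns."""
    counts = Counter()
    for x, y in zip(c1, c2):
        counts["abcd"[(1 - x) * 2 + (1 - y)]] += 1
    return counts

t = row_types([1, 1, 0, 1, 0, 1], [0, 1, 1, 1, 0, 0])
print(t["a"] / (t["a"] + t["b"] + t["c"]))   # 2/5 = 0.4, the Jaccard similarity of these columns
```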

18 Minhashing
- Imagine the rows permuted randomly.
- Define the "hash" function h(C) = the number of the first row (in the permuted order) in which column C has a 1.
- Use several (100?) independent hash functions to create a signature.

19 Minhashing Example: an input matrix and the resulting signature matrix M (figure).

20 Surprising Property
- The probability (over all permutations of the rows) that h(C1) = h(C2) is the same as Sim(C1, C2).
- Both are a / (a + b + c)!
- Why?
  - Look down columns C1 and C2 until we see a 1.
  - If it is a type-a row, then h(C1) = h(C2). If it is a type-b or type-c row, then not.
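A quick simulation (illustrative, with made-up columns) of this property: over many random permutations of the rows, the fraction with h(C1) = h(C2) approaches a/(a + b + c).

```python
import random

def minhash(column, perm):
    """Position, in the permuted order, of the first row in which the column has a 1."""
    return min(perm[r] for r, bit in enumerate(column) if bit)

C1 = [1, 0, 1, 1, 0, 0, 1, 0]
C2 = [0, 0, 1, 1, 0, 1, 1, 0]   # a = 3, b = 1, c = 1, so Sim(C1, C2) = 3/5
random.seed(0)
trials, agree = 100_000, 0
for _ in range(trials):
    perm = random.sample(range(len(C1)), len(C1))   # a random permutation of the row numbers
    if minhash(C1, perm) == minhash(C2, perm):
        agree += 1
print(agree / trials)   # should be close to 0.6
```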

21 Similarity for Signatures
- The similarity of signatures is the fraction of the rows in which they agree.
  - Remember, each row of the signature matrix corresponds to a permutation or "hash function."

22 Minhashing Example, continued: the input matrix, the signature matrix M, and a comparison of the column-column and signature-signature similarities (figure).

23 Minhash Signatures
- Pick (say) 100 random permutations of the rows.
- Think of Sig(C) as a column vector.
- Let Sig(C)[i] = the number of the first row, in the i-th permutation, that has a 1 in column C.

24 Implementation --- (1)
- Number of rows = 1 billion.
- It is hard to pick a random permutation of 1...1 billion.
- Representing a random permutation requires 1 billion entries.
- Accessing rows in permuted order is tough!
  - The number of passes would be prohibitive.

25 Implementation --- (2)
1. Pick (say) 100 hash functions.
2. For each column c and each hash function h_i, keep a "slot" M(i, c) for that minhash value.
3. For each row r, for each column c with a 1 in row r, and for each hash function h_i: if h_i(r) is smaller than M(i, c), then set M(i, c) := h_i(r).
- Needs only one pass through the data.
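A minimal Python sketch of this one-pass computation, assuming the matrix arrives row by row as 0/1 lists and that each hash function takes the (1-based) row number:

```python
def minhash_signatures(rows, hash_funcs):
    """One pass over the rows; M[i][c] ends as the minhash of column c under hash_funcs[i]."""
    n_cols = len(rows[0])
    M = [[float("inf")] * n_cols for _ in hash_funcs]   # the "slots" M(i, c)
    for r, row in enumerate(rows, start=1):             # rows numbered 1, 2, 3, ...
        hashed = [h(r) for h in hash_funcs]              # compute each h_i(r) once per row
        for c, bit in enumerate(row):
            if bit:                                      # only columns with a 1 in row r
                for i, hr in enumerate(hashed):
                    if hr < M[i][c]:                     # keep the smallest value seen so far
                        M[i][c] = hr
    return M
```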

26 Example

    Row  C1  C2
     1    1   0
     2    0   1
     3    1   1
     4    1   0
     5    0   1

    h(x) = x mod 5
    g(x) = (2x + 1) mod 5

  Processing the rows in order, the slots (shown as C1, C2 for each hash function) evolve as:

    Row 1: h(1) = 1 -> h-slots 1, -    g(1) = 3 -> g-slots 3, -
    Row 2: h(2) = 2 -> h-slots 1, 2    g(2) = 0 -> g-slots 3, 0
    Row 3: h(3) = 3 -> h-slots 1, 2    g(3) = 2 -> g-slots 2, 0
    Row 4: h(4) = 4 -> h-slots 1, 2    g(4) = 4 -> g-slots 2, 0
    Row 5: h(5) = 0 -> h-slots 1, 0    g(5) = 1 -> g-slots 2, 0
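Running the sketch above on this example (the 0/1 rows follow the trace; rows 3 and 4 are taken as in the standard version of this example) reproduces the final slots:

```python
rows = [[1, 0],
        [0, 1],
        [1, 1],
        [1, 0],
        [0, 1]]
h = lambda x: x % 5
g = lambda x: (2 * x + 1) % 5
print(minhash_signatures(rows, [h, g]))   # [[1, 0], [2, 0]], matching the last line of the trace
```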

27 Comparison with "Shingling"
- The shingling paper proposed using one hash function and taking the first 100 (say) values.
- Almost the same, but:
  - Faster: saves on hash computation.
  - Admits some correlation among the rows of the signatures.

28 Candidate Generation
- Pick a similarity threshold s, a fraction < 1.
- A pair of columns c and d is a candidate pair if their signatures agree in at least a fraction s of the rows.
  - I.e., M(i, c) = M(i, d) for at least a fraction s of the values of i.
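A one-function sketch (mine, not from the slides) of this test on a pair of signature columns:

```python
def is_candidate(sig1, sig2, s):
    """True if the two signatures agree in at least a fraction s of their rows."""
    agree = sum(1 for x, y in zip(sig1, sig2) if x == y)
    return agree / len(sig1) >= s

print(is_candidate([1, 0, 2, 4], [1, 3, 2, 4], 0.7))   # True: 3 of 4 rows agree
```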

29 The Problem with Checking Candidates
- While the signatures of all columns may fit in main memory, comparing the signatures of all pairs of columns is quadratic in the number of columns.
- Example: 10^6 columns implies 5×10^11 comparisons.
- At 1 microsecond per comparison: about 6 days.

30 Solutions
1. The DCM (Divide-Compute-Merge) method (Anand's 10/10 slides) relies on external sorting, so several passes over the data are needed.
2. Locality-Sensitive Hashing (LSH) is a method that can be carried out in main memory, but admits some false negatives.

31 Locality-Sensitive Hashing
- Unrelated to "minhashing."
- Operates on signatures.
- Big idea: hash the columns of the signature matrix M several times.
- Arrange that similar columns are more likely to hash to the same bucket.
- Candidate pairs are those that hash to the same bucket at least once.

32 Partition into Bands
- Divide the matrix M into b bands of r rows each.
- For each band, hash its portion of each column to k buckets.
- Candidate column pairs are those that hash to the same bucket for at least one band.
- Tune b and r to catch most similar pairs, but few nonsimilar pairs.
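A minimal sketch of the banding technique, assuming one length-(b*r) signature per column and using Python's built-in hash of each band's tuple of values as the bucket id:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(signatures, b, r):
    """Return pairs of column indices that share a bucket in at least one band."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for col, sig in enumerate(signatures):
            chunk = tuple(sig[band * r:(band + 1) * r])   # this band's portion of the column
            buckets[hash(chunk)].append(col)
        for cols in buckets.values():
            candidates.update(combinations(cols, 2))       # all pairs that collided in this band
    return candidates
```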

33 Simplifying Assumption
- There are enough buckets that columns are unlikely to hash to the same bucket unless they are identical in a particular band.
- Hereafter, we assume that "same bucket" means "identical."

34 Matrix M, divided into b bands of r rows each, with each band hashed to buckets (figure).

35 Example
- Suppose 100,000 columns.
- Signatures of 100 integers.
- Therefore, the signatures take 40 MB.
- But 5,000,000,000 pairs of signatures can take a while to compare.
- Choose 20 bands of 5 integers per band.

36 Suppose C1, C2 Are 80% Similar
- Probability that C1 and C2 are identical in one particular band: (0.8)^5 = 0.328.
- Probability that C1 and C2 are not identical in any of the 20 bands: (1 - 0.328)^20 ≈ 0.00035.
  - I.e., we miss about 1/3000 of the 80%-similar column pairs.

37 Suppose C1, C2 Only 40% Similar
- Probability that C1 and C2 are identical in any one particular band: (0.4)^5 ≈ 0.01.
- Probability that C1 and C2 are identical in at least 1 of the 20 bands: ≤ 20 × 0.01 = 0.2.
- There is also a small probability that C1 and C2 are not identical in a band, but still hash to the same bucket.
- But false positives are much lower for similarities well below 40%.
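These two calculations, and the general banding formula they instantiate (with b bands of r rows, a pair of similarity s becomes a candidate with probability 1 - (1 - s^r)^b), can be checked directly:

```python
def p_candidate(s, r=5, b=20):
    """Probability that a pair with similarity s lands in the same bucket for at least one band."""
    return 1 - (1 - s ** r) ** b

print(0.8 ** 5)              # ~0.328: an 80%-similar pair is identical in one particular band
print((1 - 0.8 ** 5) ** 20)  # ~0.00035: such a pair is missed by all 20 bands (about 1 in 3000)
print(p_candidate(0.4))      # ~0.186: a 40%-similar pair becomes a candidate, consistent with <= 0.2
```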

38 LSH --- Graphically
- Example target: all pairs with Sim > 60%.
- Three plots of the probability of becoming a candidate (Prob) versus Sim, with the threshold s marked: the ideal case, the case where we use only one hash function, and what LSH (partition into bands) gives us (figure).

39 LSH Summary
- Tune to get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures.
- Check in main memory that candidate pairs really do have similar signatures.
- Then, in another pass through the data, check that the remaining candidate pairs really are similar columns.

40 New Topic: Hamming LSH
- An alternative to minhash + LSH.
- Takes advantage of the fact that if columns are not sparse, random rows serve as a good signature.
- Trick: create data matrices of exponentially decreasing sizes and increasing densities.

41 Amplification of 1's
- Hamming LSH constructs a series of matrices, each with half as many rows, by OR-ing together pairs of rows.
- Candidate pairs from each matrix have (say) between 20% and 80% 1's and are similar in a selected 100 rows.
- 20%-80% is OK for similarity thresholds ≥ 0.5. Otherwise, two "similar" columns could fail to both be in range for at least one matrix.
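A rough sketch (assumptions mine: the matrix is a list of 0/1 rows, and a leftover last row is simply dropped when the count is odd) of the amplification step:

```python
def amplify(matrix):
    """OR together consecutive pairs of rows, repeatedly halving the matrix; return every stage."""
    sequence = [matrix]
    while len(matrix) > 1:
        matrix = [[x | y for x, y in zip(matrix[i], matrix[i + 1])]
                  for i in range(0, len(matrix) - 1, 2)]
        sequence.append(matrix)
    return sequence

for m in amplify([[1, 0], [0, 0], [0, 1], [1, 0], [0, 1]]):
    print(m)   # each stage has half as many rows and a higher density of 1's
```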

42 Example (figure).

43 Using Hamming LSH
- Construct the sequence of matrices.
  - If there are R rows, there are log2(R) matrices.
  - Total work = twice that of reading the original matrix.
- Use standard LSH to identify similar columns in each matrix, but restricted to columns of "medium" density.