CS 361A (Advanced Data Structures and Algorithms), Lecture 18 (Nov 30, 2005): Fingerprints, Min-Hashing, and Document Similarity. Rajeev Motwani.


CS 361A 2 Game Plan for Week
- Fingerprints
- Document Similarity
- Shingling
- Min-Hashing
- Min-Wise Independent Permutations

CS 361A 3 Fingerprints
- $\Omega$: set of large objects (e.g., URLs)
- Goal
  - avoid storing large objects explicitly
  - quick-and-dirty equality testing
- Fingerprints?
  - Short tags for objects
  - Distinct fingerprints $\Rightarrow$ distinct objects
  - Distinct objects $\Rightarrow$ probably distinct fingerprints

CS 361A 4 Formalization
- Fingerprint length $k$ $\Rightarrow$ fingerprint space size $N = 2^k$
- Fingerprint function family $F = \{ f : \Omega \to \{0,1\}^k \}$, where $\Omega$ is the set of objects
- Random $f \in_R F$
  - $f(A) \ne f(B) \Rightarrow A \ne B$
  - Collisions: $P[\, f(A) = f(B) \mid A \ne B \,]$ should be small (ideally $2^{O(-k)}$)
- Typical Application
  - Adversarial object-set $S$ with $|S| = n \ll 2^k$
  - Goal: $|f(S)| = |S|$ with high probability
  - $n^2$ pairwise collisions possible $\Rightarrow$ need $2^k > n^2$ (to avoid the Birthday Paradox)

CS 361A 5 Example – URL Fingerprints
- Search Engines
  - Manage large numbers of URL strings
  - Long, variable-length strings (embedded objects/database queries)
- Desiderata
  - Small, fixed-length encodings that are, hopefully, unique
  - Some scenarios
    - Exact string irrelevant
    - Only need ability to distinguish distinct URLs
  - Even otherwise, unique IDs useful for indexing
- Numbers?
  - 4 billion webpages $\Rightarrow n = 2^{32}$
  - $N \ge n^2 \Rightarrow k = 64$
  - Fingerprints $\Rightarrow$ 8-byte representation
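The sizing rule above can be sanity-checked numerically. The sketch below is an illustration, not part of the lecture: it assumes fingerprints behave like independent uniform k-bit values and uses the standard birthday approximation. The function name `collision_probability` is hypothetical.

```python
import math

def collision_probability(n: int, k: int) -> float:
    """Approximate P[some pair among n objects shares a k-bit fingerprint].

    Assumption: fingerprints are independent and uniform over 2^k values
    (an idealization of a real fingerprint family).
    Birthday approximation: 1 - exp(-n(n-1) / 2^(k+1)).
    """
    pairs_over_space = n * (n - 1) / 2 ** (k + 1)
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-pairs_over_space)

if __name__ == "__main__":
    n = 2 ** 32  # roughly 4 billion URLs, as on the slide
    for k in (32, 64, 96, 128):
        print(f"k={k:3d}  P[collision] ~ {collision_probability(n, k):.3g}")
```

Under this idealized model, $N = n^2$ marks the point where a collision stops being near-certain; each extra bit of fingerprint length roughly halves the remaining collision probability.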

CS 361A 6 Fingerprinting vs Hashing
- Hashing $h : \Omega \to \{0,1\}^k$
  - Set membership testing for set $S$ of size $n$
  - Desire uniform distribution over bin addresses $\{0,1\}^k$
  - Minimize collisions per bin: reduces lookup time
  - Minimize hash table size $\Rightarrow n \approx N = 2^k$
- Fingerprinting $f : \Omega \to \{0,1\}^k$
  - Object equality testing over set $S$ of size $n$
  - Distribution over $\{0,1\}^k$ is irrelevant
  - Avoid collisions altogether
  - Tolerate larger $k$: typically $N > n^2$

CS 361A 7 Fingerprinting Strings
- Typical application, but the techniques extend to combinatorial objects (database tuples, trees/graphs)
- Obvious techniques
  - Checksum: no worst-case collision probability guarantees
  - MD5: cryptographically-secure string hashes
    - relatively slow
    - avoids leaking information about original string
- Rabin's Scheme
  - Algebraic technique: polynomial arithmetic
  - Efficient: needs (1 table lookup + 1 xor + 1 shift) per byte
  - other nice properties…

CS 361A 8 Rabin Fingerprints
- Consider: $m$-bit string $A = a_1 a_2 \ldots a_m$
- Assume: $a_1 = 1$ and fixed-length strings (wlog)
- Encoding
  - Strings = polynomials of degree $m-1$ over $Z_2$
  - $A(x) = a_1 x^{m-1} + a_2 x^{m-2} + \cdots + a_{m-1} x + a_m$
- Fingerprints
  - $P(x)$: random, irreducible degree-$k$ polynomial over $Z_2$ (easy to sample such polynomials)
  - Irreducible: cannot be factored over $Z_2$; $x^2 + x + 1$ is irreducible, whereas $x^2 + 1 = (x+1)^2$ is not
  - $f(A) = A(x) \bmod P(x)$
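The definition $f(A) = A(x) \bmod P(x)$ can be sketched directly with integers as bit-vectors of coefficients. This is a minimal illustration under stated assumptions: it uses a fixed small polynomial ($x^8 + x^4 + x^3 + x + 1$, which is irreducible over $Z_2$) instead of a randomly sampled degree-$k$ one, and it does bit-by-bit division, not the table-driven per-byte method mentioned in the lecture. The names `poly_mod`, `P_DEMO`, and `rabin_fingerprint` are hypothetical.

```python
def poly_mod(a: int, p: int) -> int:
    """Remainder of a(x) divided by p(x) over Z_2.

    Polynomials are encoded as ints: bit i is the coefficient of x^i.
    Addition and subtraction over Z_2 are both XOR.
    """
    deg_p = p.bit_length() - 1
    while a.bit_length() - 1 >= deg_p:
        # Cancel the leading term of a with a shifted copy of p.
        a ^= p << (a.bit_length() - 1 - deg_p)
    return a

# Assumption for illustration: a fixed degree-8 irreducible polynomial,
# x^8 + x^4 + x^3 + x + 1. The scheme itself draws P(x) at random.
P_DEMO = 0b1_0001_1011

def rabin_fingerprint(data: bytes, p: int = P_DEMO) -> int:
    """f(A) = A(x) mod P(x), with a leading 1 bit prepended so a_1 = 1."""
    a = (1 << (8 * len(data))) | int.from_bytes(data, "big")
    return poly_mod(a, p)

if __name__ == "__main__":
    print(rabin_fingerprint(b"http://example.com/"))
```

Prepending the 1 bit mirrors the slide's assumption $a_1 = 1$, so that strings differing only in leading zero bytes still map to different polynomials.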

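CS 361A 9 Analysis
- Fix $S$: $n$ strings of length $m$
- Consider $Q_S(x) = \prod_{A \ne B \in S} (A(x) - B(x))$
  - Collision $f(A) = f(B)$ $\Leftrightarrow$ $A(x) \equiv B(x) \pmod{P(x)}$ $\Rightarrow$ $Q_S(x) \equiv 0 \pmod{P(x)}$
  - Therefore $P(x)$ is a factor of $Q_S(x)$
- Collision Probability?
  - $\deg(Q_S) \le n^2 m$
  - Number of irreducible degree-$k$ factors of $Q_S(x)$ is $< n^2 m / k$
  - Fact: number of irreducible degree-$k$ polynomials $> (2^k - 2^{k/2})/k$
  - Prob[random $P(x)$ divides $Q_S(x)$] $< n^2 m / (2^k - 2^{k/2}) \approx n^2 m / 2^k$
- $\Rightarrow$ Prob[fingerprints not distinct] obeys the same bound, since the fingerprints are distinct unless $P(x)$ divides $Q_S(x)$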
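To get a feel for the bound, it can be evaluated for concrete parameters. The values below ($n = 2^{20}$ strings of $m = 2^{13}$ bits, $k = 64$) are made up for illustration, and `rabin_collision_bound` is a hypothetical helper that divides the factor count by the count of irreducible polynomials.

```python
from fractions import Fraction

def rabin_collision_bound(n: int, m: int, k: int) -> Fraction:
    """Upper bound on P[some two of n m-bit strings share a fingerprint].

    At most n^2 * m / k irreducible degree-k factors of Q_S, divided by
    the (2^k - 2^(k/2)) / k lower bound on irreducible degree-k
    polynomials. Assumes k is even so 2^(k/2) is an integer.
    """
    return Fraction(n * n * m, 2 ** k - 2 ** (k // 2))

if __name__ == "__main__":
    bound = rabin_collision_bound(n=2 ** 20, m=2 ** 13, k=64)
    print(float(bound))  # slightly above 2**-11
```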

CS 361A 10 Beneficial Properties
- Hardware-level implementation
  - $Z_2$-polynomials are the same as bit strings
  - simple shift-register operations
- Distributivity: $f(A + B) = f(A) + f(B)$ over $Z_2$
- Let $\circ$ denote concatenation
  - $f(A \circ B) = f(f(A) \circ B)$
  - $f(A \circ B) = A(x) \cdot x^{m} + B(x) \bmod P(x)$, where $m$ is the length of $B$
- Fingerprint sliding windows over strings: low incremental cost
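The concatenation identity is what makes incremental fingerprinting cheap: the fingerprint of a longer string can be computed from the fingerprint of its prefix alone. The sketch below assumes the same integer encoding of $Z_2$-polynomials and the same fixed demo polynomial as an illustration; `extend` is a hypothetical name, and no leading 1 bit is prepended here so that the identity can be checked exactly.

```python
def poly_mod(a: int, p: int) -> int:
    """Remainder of a(x) divided by p(x) over Z_2 (ints as bit-vectors)."""
    deg_p = p.bit_length() - 1
    while a.bit_length() - 1 >= deg_p:
        a ^= p << (a.bit_length() - 1 - deg_p)
    return a

# Assumption for illustration: fixed irreducible x^8 + x^4 + x^3 + x + 1.
P_DEMO = 0b1_0001_1011

def extend(fp: int, chunk: bytes, p: int = P_DEMO) -> int:
    """Fingerprint of (prefix o chunk), given only fp = f(prefix)."""
    shifted = fp << (8 * len(chunk))  # f(prefix) * x^|chunk|
    # The low bits of `shifted` are zero, so OR acts as addition over Z_2.
    return poly_mod(shifted | int.from_bytes(chunk, "big"), p)

if __name__ == "__main__":
    fp = 0
    for byte in b"near-duplicate documents":
        fp = extend(fp, bytes([byte]))  # one byte at a time
    print(fp)
```

Feeding the string one byte at a time gives the same value as reducing the whole string at once, which is the property the per-byte table-lookup implementation relies on.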

CS 361A 11 Duplicate Document Detection
- Problem
  - Given: large collection of arbitrary documents
  - Identify: near-duplicate documents
- Web search engines
  - Proliferation of near-duplicate documents
    - Legitimate: mirrors, local copies, updates, …
    - Malicious: spam, spider-traps, dynamic URLs, …
    - Mistaken: spider errors
  - 30% of web pages are near-duplicates [Broder et al 1997]
  - Cost: RAM/disk, search quality, unhappy users
- Enterprise search: even larger amount of duplication
- SCAM: plagiarism detection [Shivakumar et al 1998]

CS 361A 12 Natural Approaches
- Fingerprinting?
  - only works for exact matches
  - here we must identify even near-duplicates
- Random Sampling?
  - sample substrings (phrases, sentences, etc.)
  - hope: similar documents $\Rightarrow$ similar samples
  - No: even samples of the same document will differ
- Edit-distance?
  - metric for approximate string matching
  - expensive, even for one pair of strings
  - impossible for web documents

CS 361A 13 Desiderata
- Storage: only small sketches of each document
- Computation: $O(n \log n)$ time on $n$ documents
- Stream Processing: once the sketch is computed, the source is unavailable
- Error Guarantees
  - problem scale $\Rightarrow$ small biases have large impact
  - need formal guarantees; heuristics will not do

CS 361A 14 Basic Idea [Broder 1997]
- Shingling
  - dissect document into q-grams (shingles)
  - represent documents by shingle-sets
  - near-duplicates $\Rightarrow$ intersection of shingle-sets is large
  - reduce problem to set intersection
- Set Intersection
  - fingerprints of shingles
  - min-hash to estimate intersection sizes

CS 361A 15 Shingling
- Shingle: $q$ contiguous tokens/words (q-gram)
- Consider the following "document": a rose is a rose is a rose
- Choose $q = 4$ $\Rightarrow$ get multiset of shingles
  - a rose is a
  - rose is a rose
  - is a rose is
  - a rose is a
  - rose is a rose
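The shingling step above can be sketched in a few lines. This is an illustration assuming whitespace tokenization; the function name `shingles` is hypothetical.

```python
from collections import Counter

def shingles(text: str, q: int) -> list[tuple[str, ...]]:
    """All windows of q contiguous whitespace-separated tokens, in order."""
    tokens = text.split()
    return [tuple(tokens[i:i + q]) for i in range(len(tokens) - q + 1)]

if __name__ == "__main__":
    multiset = Counter(shingles("a rose is a rose is a rose", 4))
    for shingle, count in multiset.items():
        print(" ".join(shingle), count)
```

A `Counter` keeps multiplicities; wrapping the list in `set(...)` instead discards them.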

CS 361A 16 Multiset of Fingerprints
- Pipeline: Doc -> (shingling) -> Multiset of Shingles -> (fingerprint) -> Multiset of Fingerprints
- Documents $\Rightarrow$ sets of 64-bit fingerprints
- Fingerprints?
  - Use Rabin fingerprints
  - Fingerprint space $U = [0, \ldots, N-1]$
  - In practice, use 64-bit fingerprints, i.e., $N = 2^{64}$
  - Result: uniformity in length of strings

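CS 361A 17 Similarity of Documents
- [Figure: Doc A and Doc B with their fingerprint sets $S_A$ and $S_B$]
- Jaccard measure: similarity of $S_A, S_B \subseteq U = [0 \ldots N-1]$
  - $sim(S_A, S_B) = |S_A \cap S_B| \,/\, |S_A \cup S_B|$
- Containment
  - $con(S_A, S_B) = |S_A \cap S_B| \,/\, |S_A|$
- Claim: A and B are near-duplicates if $sim(S_A, S_B)$ is high
- Claim: A is contained in B if $con(S_A, S_B)$ is high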
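Both measures are one-liners on sets of fingerprints. The sketch below uses small integers as stand-in fingerprints; the convention of returning 1.0 for empty inputs is an assumption made here to avoid division by zero, not something the lecture specifies.

```python
def sim(s_a: set, s_b: set) -> float:
    """Jaccard similarity |A n B| / |A u B| (1.0 if both sets are empty)."""
    union = s_a | s_b
    return len(s_a & s_b) / len(union) if union else 1.0

def con(s_a: set, s_b: set) -> float:
    """Containment of A in B: |A n B| / |A| (1.0 if A is empty)."""
    return len(s_a & s_b) / len(s_a) if s_a else 1.0

if __name__ == "__main__":
    a, b = {1, 2, 3, 4}, {3, 4, 5, 6, 7, 8}
    print(sim(a, b), con(a, b), con({3, 4}, b))
```

Note that containment is asymmetric: `con(a, b)` and `con(b, a)` generally differ, while `sim` is symmetric.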

CS 361A 18 Remarks
- Multiplicities of q-grams: could retain or ignore
  - trade off efficiency against precision
- Shingle Size $q \in [3 \ldots 10]$
  - Short shingles $\Rightarrow$ increase similarity of unrelated documents
    - With $q = 1$, $sim(S_A, S_B) = 1 \Leftrightarrow$ A is a permutation of B
    - Need larger $q$ to sensitize to permutation changes
  - Long shingles $\Rightarrow$ small random changes have larger impact
- Similarity Measure
  - Similarity is non-transitive, non-metric
  - But dissimilarity $1 - sim(S_A, S_B)$ is a metric [Charikar 02]
- [Ukkonen 92]: relates q-gram and edit distance

CS 361A 19 Example
- A = "a rose is a rose is a rose"
- B = "a rose is a flower which is a rose"
- Preserving multiplicity
  - $q = 1$ $\Rightarrow$ $sim(S_A, S_B) = 0.7$
    - $S_A$ = {a, a, a, is, is, rose, rose, rose}
    - $S_B$ = {a, a, a, is, is, rose, rose, flower, which}
  - $q = 2$ $\Rightarrow$ $sim(S_A, S_B) = 0.5$
  - $q = 3$ $\Rightarrow$ $sim(S_A, S_B) = 0.3$
- Disregarding multiplicity
  - $q = 1$ $\Rightarrow$ $sim(S_A, S_B) = 0.6$
  - $q = 2$ $\Rightarrow$ $sim(S_A, S_B) = 0.5$
  - $q = 3$ $\Rightarrow$ $sim(S_A, S_B) = 3/7 \approx 0.43$
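The numbers in this example can be reproduced mechanically. The sketch below assumes whitespace tokenization and interprets "preserving multiplicity" as multiset Jaccard (minimum counts over maximum counts); the helper names are hypothetical.

```python
from collections import Counter

def shingles(text: str, q: int) -> list[tuple[str, ...]]:
    """All windows of q contiguous whitespace-separated tokens."""
    tokens = text.split()
    return [tuple(tokens[i:i + q]) for i in range(len(tokens) - q + 1)]

def multiset_sim(a: list, b: list) -> float:
    """Jaccard with multiplicities: sum of min counts / sum of max counts."""
    ca, cb = Counter(a), Counter(b)
    return sum((ca & cb).values()) / sum((ca | cb).values())

def set_sim(a: list, b: list) -> float:
    """Jaccard on distinct shingles only."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

if __name__ == "__main__":
    doc_a = "a rose is a rose is a rose"
    doc_b = "a rose is a flower which is a rose"
    for q in (1, 2, 3):
        sa, sb = shingles(doc_a, q), shingles(doc_b, q)
        print(q, round(multiset_sim(sa, sb), 2), round(set_sim(sa, sb), 2))
```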

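CS 361A 20 Min-Hashing
- Consider $S_A, S_B \subseteq U$
  - Pick a random permutation $\pi$ of $U$
  - Define $\alpha = \pi^{-1}(\min\{\pi(S_A)\})$ and $\beta = \pi^{-1}(\min\{\pi(S_B)\})$
  - Meaning? The minimal element of each set under permutation $\pi$
- Lemma: $P[\alpha = \beta] = sim(S_A, S_B)$
  - Let $\delta = \min\{\pi(S_A \cup S_B)\}$
  - Claim: $\alpha = \beta \Leftrightarrow \pi^{-1}(\delta) \in S_A \cap S_B$
  - Clearly $P[\pi^{-1}(\delta) \in S_A \cap S_B] = |S_A \cap S_B| \,/\, |S_A \cup S_B|$, since every element of $S_A \cup S_B$ is equally likely to be the minimum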
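The lemma and the claim used in its proof can both be checked empirically on a small universe. This is a simulation sketch with made-up sets and explicit random permutations, which is only feasible because the universe is tiny; `min_hash` and `check_lemma` are hypothetical names.

```python
import random

def min_hash(s: set, pi: list) -> int:
    """Element of s whose image under the permutation pi is smallest."""
    return min(s, key=lambda x: pi[x])

def check_lemma(trials: int = 5000, seed: int = 0) -> float:
    """Fraction of random permutations for which the min-hashes agree."""
    rng = random.Random(seed)
    universe = 50
    s_a, s_b = set(range(0, 30)), set(range(20, 50))  # sim = 10/50
    matches = 0
    for _ in range(trials):
        pi = list(range(universe))
        rng.shuffle(pi)
        alpha, beta = min_hash(s_a, pi), min_hash(s_b, pi)
        argmin_union = min_hash(s_a | s_b, pi)
        # The claim from the proof holds for every single permutation.
        assert (alpha == beta) == (argmin_union in s_a & s_b)
        matches += alpha == beta
    return matches / trials

if __name__ == "__main__":
    print(check_lemma())  # should land near the true similarity 0.2
```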

CS 361A 21 Min-Hashing
- Similarity Sketches
  - Succinct representation of fingerprint sets $S_A$
  - Allow efficient estimation of $sim(S_A, S_B)$
  - Basic idea: use min-hash of fingerprints
- $sk(A)$ = $k$ minimal elements under $\pi(S_A)$
- Claim: $E[\, sim(sk(A), sk(B)) \,] = sim(S_A, S_B)$
  - Argument: consider each element of $sk(A)$ and $sk(B)$ …
- Observe
  - sketch-similarity is an unbiased estimator of similarity
  - to reduce variance, use larger $k$
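A bottom-$k$ sketch can be simulated as follows. One caveat, stated as an assumption: instead of the literal Jaccard of the two sketches, this sketch uses the estimator from Broder's resemblance paper, which looks only at the $k$ smallest values of $sk(A) \cup sk(B)$ and counts how many lie in both sketches. The sets, universe size, and function names are made up for illustration.

```python
import random

def bottom_k(s: set, pi: list, k: int) -> set:
    """sk(A): the k smallest images of s under the permutation pi."""
    return set(sorted(pi[x] for x in s)[:k])

def sketch_similarity(sk_a: set, sk_b: set, k: int) -> float:
    """Fraction of the k smallest values overall that occur in both sketches."""
    smallest = set(sorted(sk_a | sk_b)[:k])
    return len(smallest & sk_a & sk_b) / len(smallest)

def average_estimate(k: int = 8, trials: int = 2000, seed: int = 0) -> float:
    """Mean sketch estimate over many random permutations."""
    rng = random.Random(seed)
    universe = 60
    s_a, s_b = set(range(0, 40)), set(range(20, 60))  # sim = 20/60
    total = 0.0
    for _ in range(trials):
        pi = list(range(universe))
        rng.shuffle(pi)
        total += sketch_similarity(bottom_k(s_a, pi, k), bottom_k(s_b, pi, k), k)
    return total / trials

if __name__ == "__main__":
    print(average_estimate())  # should land near 1/3
```

A single sketch pair gives a noisy estimate in multiples of $1/k$; averaging over permutations here only serves to show that the estimate is centred on the true similarity.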

CS 361A 22 Remarks
- Implementation
  - shingle/fingerprint/sketch documents as streams
  - Issue: cost of pairwise comparison of sketches?
    - cluster sketch-streams [Broder et al, Guha et al]
    - Open? Hashing sketches to identify similarity
- [Broder-Mitzenmacher 99]: Min-Hash is the only unbiased estimator
- [Indyk-Motwani 99]: Locality-Sensitive Hash
  - collisions more likely for similar items
  - Min-Hash is a special case

CS 361A 23 Multiple Permutations
- Better Variance Reduction
  - Instead of larger $k$, stick with $k = 1$
  - Multiple, independent permutations
- Sketch Construction
  - Pick $p$ random permutations of $U$: $\pi_1, \pi_2, \ldots, \pi_p$
  - $sk(A)$ = minimal elements under $\pi_1(S_A), \ldots, \pi_p(S_A)$
- Claim: $E[\, sim(sk(A), sk(B)) \,] = sim(S_A, S_B)$
  - Earlier lemma $\Rightarrow$ true for $p = 1$
  - Linearity of expectation
  - Variance reduction: independence of $\pi_1, \ldots, \pi_p$
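The $p$-permutation sketch can be simulated on a small universe, where explicit permutations are affordable. In the sketch below, similarity between sketches is taken to be the fraction of positions at which the two signatures agree; the sets, the value of $p$, and the function names are illustrative assumptions.

```python
import random

def signatures(sets: list, universe_size: int, p: int, seed: int = 0) -> list:
    """One min-hash per permutation, for each input set.

    The same p permutations are shared across all sets, which is what
    makes the resulting signatures comparable position by position.
    """
    rng = random.Random(seed)
    sigs = [[] for _ in sets]
    for _ in range(p):
        pi = list(range(universe_size))
        rng.shuffle(pi)
        for sig, s in zip(sigs, sets):
            sig.append(min(pi[x] for x in s))
    return sigs

def estimate_sim(sig_a: list, sig_b: list) -> float:
    """Fraction of permutations on which the two min-hashes agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

if __name__ == "__main__":
    s_a, s_b = set(range(0, 40)), set(range(20, 60))  # sim = 1/3
    sig_a, sig_b = signatures([s_a, s_b], universe_size=60, p=2000)
    print(estimate_sim(sig_a, sig_b))
```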

CS 361A 24 Min-Wise Indep Permutations
- Problem
  - Truly random $\pi$ over $U = [0 \ldots N-1]$ is infeasible
  - But do we really need true randomness?
- Solution
  - Poly-size family of permutations $F \subseteq S_N$ over $U$
  - Choosing/representing random $\pi \in F$ is easy
- Min-Wise Independence (MWI) Property: for all sets $X \subseteq U$ and all $x \in X$,
  - $P_{\pi \in F}[\, \min\{\pi(X)\} = \pi(x) \,] = 1 / |X|$

CS 361A 25 Minimum-Size MWI Families
- [Broder et al 98]
  - Upper/lower bounds of $\mathrm{lcm}(1, 2, \ldots, N)$
  - Problem: exponential in $N$
- Approximate MWI Families
  - Relax to $\left| P[\, \min\{\pi(X)\} = \pi(x) \,] - 1/|X| \right| \le \epsilon / |X|$
  - Non-constructive: polynomial-size
  - Constructive: size $N^{O(\log 1/\epsilon)}$ [Indyk 99]
- In practice, 2-universal hashes work well!
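The practical remark about 2-universal hashes can be illustrated by replacing each permutation with a linear map $x \mapsto (ax + b) \bmod P$. The sketch below is one common way to do this, not necessarily the construction the lecture has in mind: the modulus (the Mersenne prime $2^{61} - 1$), the number of hash functions, and the random stand-in fingerprints are all assumptions chosen for illustration.

```python
import random

MERSENNE_PRIME = (1 << 61) - 1  # prime modulus larger than any 32-bit input

def make_hashes(p: int, seed: int = 0) -> list:
    """p random (a, b) pairs defining x -> (a*x + b) mod MERSENNE_PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, MERSENNE_PRIME), rng.randrange(MERSENNE_PRIME))
            for _ in range(p)]

def signature(fingerprints: set, hashes: list) -> list:
    """Min-hash of the set under each linear hash function."""
    return [min((a * x + b) % MERSENNE_PRIME for x in fingerprints)
            for a, b in hashes]

def estimate_sim(sig_a: list, sig_b: list) -> float:
    """Fraction of hash functions on which the two min-hashes agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

if __name__ == "__main__":
    rng = random.Random(1)
    items = rng.sample(range(2 ** 32), 600)  # stand-in 32-bit fingerprints
    s_a, s_b = set(items[:400]), set(items[200:])  # true sim = 200/600
    hashes = make_hashes(400)
    print(estimate_sim(signature(s_a, hashes), signature(s_b, hashes)))
```

Each hash function costs two integers to store, in contrast to a truly random permutation of a universe of size $N$.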

CS 361A 26 References I
- M. Rabin. Fingerprinting by random polynomials. Technical Report TR-15-81, Harvard University (1981).
- A. Broder. Some applications of Rabin's fingerprinting method. Sequences II (1993).
- A. Broder. On the Resemblance and Containment of Documents. SEQUENCES.
- A. Broder, S. Glassman, M. Manasse, and G. Zweig. Syntactic Clustering of the Web. WWW.
- N. Shivakumar and H. Garcia-Molina. Finding near-replicas of documents on the web. WebDB.
- A. Broder. Identifying and Filtering Near-Duplicate Documents. CPM.

CS 361A 27 References II
- E. Ukkonen. Approximate String Matching with q-grams and Maximal Matches. Theoretical Computer Science (1992).
- A. Broder and M. Mitzenmacher. Completeness and Robustness Properties of Min-Wise Independent Permutations.
- A. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. Min-Wise Independent Permutations. JCSS (2000).
- P. Indyk. A Small Approximately min-wise Independent Family of Hash Functions. SODA.
- P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC.
- A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. VLDB.
- M. Charikar. Similarity Estimation Techniques from Rounding Algorithms. STOC.