Document duplication (exact or approximate)
Paolo Ferragina
Dipartimento di Informatica, Università di Pisa
Slides only!
Duplicate documents
- The web is full of duplicated content
- Few cases of exact duplicates
- Many cases of near duplicates, e.g., a Last-Modified date is the only difference between two copies of a page
Sec. 19.6
Near-Duplicate Detection
- Problem: given a large collection of documents, identify the near-duplicate documents
- Web search engines face a proliferation of near-duplicate documents:
  - Legitimate: mirrors, local copies, updates, …
  - Malicious: spam, spider traps, dynamic URLs, …
  - Mistaken: spider errors
- 30% of web pages were near-duplicates [1997]
Desiderata
- Storage: only small sketches of each document
- Computation: as fast as possible
- Stream processing: once the sketch is computed, the source is unavailable
- Error guarantees: at this problem scale, small biases have large impact, so we need formal guarantees; heuristics will not do
Natural Approaches
- Fingerprinting: only works for exact matches
  - Karp-Rabin (rolling hash): collision-probability guarantees
  - MD5: cryptographically secure string hashes
- Edit-distance metric for approximate string matching
  - expensive, even for one pair of documents
  - impossible for billions of web documents
- Random sampling
  - sample substrings (phrases, sentences, etc.)
  - hope: similar documents yield similar samples
  - but even samples of the same document will differ
Karp-Rabin Fingerprints
- Consider the m-bit string A = 1 a_1 a_2 … a_m (a leading 1 is prepended)
- Basic values: choose a prime p in the universe U ≈ 2^64
- Fingerprint: f(A) = A mod p
- Rolling hash: given B = a_2 … a_m a_{m+1},
  f(B) = [2 (A − 2^m − a_1 2^{m−1}) + 2^m + a_{m+1}] mod p
- Prob[false hit] = Prob[p divides (A − B)] = #div(A − B) / #primes(U) ≤ log(A + B) / #primes(U) ≈ (m log U) / U
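A minimal Python sketch of this rolling hash follows; the specific prime (the Mersenne prime 2^61 − 1 rather than one near 2^64) and all function names are my illustrative choices, not the lecture's.

    P = (1 << 61) - 1                        # a large prime p (my choice; the slide's U is ~2^64)

    def fingerprint(bits):
        # f(A) = A mod p, where A is the m-bit window with a 1 prepended
        a = 1
        for b in bits:
            a = (a * 2 + b) % P
        return a

    def roll(f, m, a1, a_next):
        # f(B) = [2 (A - 2^m - a1 2^(m-1)) + 2^m + a_next] mod p
        two_m = pow(2, m, P)
        f = (f - two_m - a1 * pow(2, m - 1, P)) % P
        return (2 * f + two_m + a_next) % P

    # Fingerprints of all m-bit windows of a bit string:
    bits = [1, 0, 1, 1, 0, 1, 0]
    m = 4
    f = fingerprint(bits[:m])
    for i in range(len(bits) - m):
        f = roll(f, m, bits[i], bits[i + m])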
Basic Idea [Broder 1997]: Shingling
- dissect each document into q-grams (shingles)
- represent documents by their shingle sets
- reduce the problem to set intersection [Jaccard]
- two documents are near-duplicates if their shingle sets intersect enough (see the sketch below)
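As an illustration of shingling (my sketch, not the lecture's code), here is a word-level q-gram dissection; character-level q-grams work the same way.

    def shingles(text, q=4):
        # dissect a document into the set of its q-grams (shingles)
        words = text.lower().split()
        return {tuple(words[i:i + q]) for i in range(len(words) - q + 1)}

    print(shingles("a rose is a rose is a rose", q=3))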
#1. Doc Similarity = Set Intersection
[Figure: two documents, Doc A and Doc B, with their shingle sets S_A and S_B]
- Jaccard measure of the similarity of S_A, S_B: sim(S_A, S_B) = |S_A ∩ S_B| / |S_A ∪ S_B|
- Claim: A and B are near-duplicates if sim(S_A, S_B) is high
- We need to cope with set intersection:
  - fingerprints of shingles (for space/time efficiency)
  - min-hash to estimate intersection sizes (further efficiency)
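The Jaccard measure written out as code (the function name and sample sets are illustrative):

    def sim(sa, sb):
        # sim(S_A, S_B) = |S_A ∩ S_B| / |S_A ∪ S_B|
        return len(sa & sb) / len(sa | sb)

    print(sim({1, 2, 3, 4}, {2, 3, 4, 5}))   # 3/5 = 0.6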
#2. Sets of 64-bit Fingerprints
[Figure: Doc → shingling → multiset of shingles → fingerprinting → multiset of fingerprints]
- Fingerprints: use Karp-Rabin fingerprints over q-gram shingles (of 8q bits each)
- In practice, use 64-bit fingerprints, i.e., U = 2^64
- Prob[collision] ≈ (8q · 64) / 2^64 ≪ 1
- This reduces the space for storing the multisets and the time to intersect them, but…
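A stand-in for this step (my sketch: it collapses each shingle to 64 bits via MD5, one of the hashes named earlier, rather than via Karp-Rabin as the slides prescribe):

    import hashlib

    def fp64(shingle):
        # 64-bit fingerprint of a shingle: first 8 bytes of its MD5 digest
        data = " ".join(shingle).encode()
        return int.from_bytes(hashlib.md5(data).digest()[:8], "big")

    def fingerprint_set(shingle_set):
        return {fp64(s) for s in shingle_set}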
#3. Sketch of a Document
- Shingle sets are large, so their intersection is still too costly
- Create a "sketch vector" (of size ~200) for each shingle set
- Documents that share ≥ t (say 80%) of the sketch elements are claimed to be near-duplicates
Sec. 19.6
Sketching by Min-Hashing
- Consider S_A, S_B ⊆ {0, …, p−1}
- Pick a random permutation π of the whole set {0, …, p−1} (such as π(x) = (ax + b) mod p)
- Define α = min{π(S_A)}, β = min{π(S_B)}: the minimal elements under the permutation π
- Lemma: Prob[α = β] = sim(S_A, S_B) (sketched below)
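One min-hash under the slide's linear permutation π(x) = (ax + b) mod p; the prime and the names are my illustrative choices:

    import random

    p = (1 << 61) - 1                        # a prime (illustrative choice)

    def random_pi():
        a, b = random.randrange(1, p), random.randrange(p)
        return lambda x: (a * x + b) % p     # a permutation of {0, ..., p-1}

    def min_hash(s, pi):
        # alpha = min{pi(S)}; Prob[min_hash(S_A, pi) = min_hash(S_B, pi)] = sim(S_A, S_B)
        return min(pi(x) for x in s)

    pi = random_pi()
    print(min_hash({3, 17, 42}, pi))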
Strengthening it…
- Similarity sketch sk(A) = the k minimal elements under π(S_A)
- We might also take k permutations and the min of each
- Note: we can reduce the variance by using a larger k (both variants are sketched below)
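Both variants in Python (illustrative, assuming pi and pis are built as above):

    import heapq

    # (a) sk(A): the k minimal elements under ONE permutation pi
    def sketch_k_mins(s, pi, k=200):
        return heapq.nsmallest(k, (pi(x) for x in s))

    # (b) the min under each of k DIFFERENT permutations
    def sketch_k_perms(s, pis):              # pis: a list of k permutations
        return [min(pi(x) for x in s) for pi in pis]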
Computing Sketch[i] for Doc1
- Start with the 64-bit fingerprints f(shingles), in the universe 2^64
- Permute them with π_i
- Pick the min value as Sketch[i]
Sec. 19.6
Test if Doc1.Sketch[i] = Doc2.Sketch[i]
- For each i, are the two sketch entries equal?
- Test for 200 random permutations: π_1, π_2, …, π_200
Sec. 19.6
However…
- Doc1.Sketch[i] = Doc2.Sketch[i] iff the shingle with the MIN value under π_i in the union of Doc1 and Doc2 is common to both (i.e., it lies in the intersection)
- Claim: this happens with probability size_of_intersection / size_of_union
Sec. 19.6
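Hence the fraction of sketch coordinates on which two documents agree estimates their Jaccard similarity; a sketch (assuming the k-permutations variant, so coordinates are aligned):

    def estimate_sim(sk_a, sk_b):
        # each coordinate agrees with probability sim(A, B)
        return sum(a == b for a, b in zip(sk_a, sk_b)) / len(sk_a)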
#4. Detecting all duplicates
- Brute force (quadratic time): compare sk(A) vs. sk(B) for all pairs of docs A and B
  - Still, (num docs)^2 is too much computation, even if executed in internal memory
- Locality-sensitive hashing (LSH) for sk(A) vs. sk(B):
  - Sample h elements of sk(A) as an ID (may induce false positives)
  - Create t IDs (to reduce the false negatives)
  - If at least one ID matches another one (wrt the same h-selection), then A and B are probably near-duplicates (hence compare them); a banding sketch follows below
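One way to realize this (my reading of the slide: each ID is taken as h consecutive sketch elements rather than a random h-sample):

    from collections import defaultdict

    def band_ids(sketch, h, t):
        # t IDs, each made of h sketch elements
        return [tuple(sketch[i * h:(i + 1) * h]) for i in range(t)]

    def candidate_pairs(sketches, h, t):
        # sketches: dict docID -> sketch vector of length >= h * t
        buckets = defaultdict(list)
        for doc, sk in sketches.items():
            for i, band in enumerate(band_ids(sk, h, t)):
                buckets[(i, band)].append(doc)       # same i = same h-selection
        pairs = set()
        for docs in buckets.values():
            for j in range(len(docs)):
                for k in range(j + 1, len(docs)):
                    pairs.add(tuple(sorted((docs[j], docs[k]))))
        return pairs                                 # candidates, then compare full sketches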
#4. How do you implement this?
GOAL: if at least one ID matches another one (wrt the same h-selection), then A and B are probably near-duplicates (hence compare them).
SOL 1: Create t hash tables (with chaining), one per ID [recall that an ID is an h-sample of sk()]. Insert each docID into the slots of each of its IDs (using some hash). Scan every bucket and check which docIDs share ≥ 1 ID.
SOL 2: Sort by each ID, and then check the consecutive equal ones (sketched below).
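SOL 2 in the same illustrative style (after sorting, equal IDs are consecutive, so a single scan of each run finds the docIDs sharing that ID):

    from itertools import groupby

    def candidates_by_sorting(sketches, h, t):
        out = set()
        for i in range(t):                           # one sorting pass per ID position
            rows = sorted((tuple(sk[i * h:(i + 1) * h]), doc)
                          for doc, sk in sketches.items())
            for _, run in groupby(rows, key=lambda r: r[0]):
                docs = [d for _, d in run]           # a run of consecutive equal IDs
                out.update((a, b) for j, a in enumerate(docs) for b in docs[j + 1:])
        return out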