Document duplication (exact or approximate)
Paolo Ferragina
Dipartimento di Informatica, Università di Pisa
Slides only!
Duplicate documents
- The web is full of duplicated content
- Few cases of exact duplicates
- Many cases of near duplicates, e.g., a Last-Modified date is the only difference between two copies of a page
Sec. 19.6
Near-Duplicate Detection
- Problem: given a large collection of documents, identify the near-duplicate documents
- Web search engines face a proliferation of near-duplicate documents:
  - Legitimate: mirrors, local copies, updates, …
  - Malicious: spam, spider traps, dynamic URLs, …
  - Mistaken: spider errors
- 30% of web pages were near-duplicates [1997]
Desiderata
- Storage: only small sketches of each document
- Computation: as fast as possible
- Stream processing: once the sketch is computed, the source is unavailable
- Error guarantees: at this problem scale, small biases have large impact, so we need formal guarantees; heuristics will not do
Natural Approaches
- Fingerprinting: only works for exact matches
  - Karp-Rabin (rolling hash): collision-probability guarantees
  - MD5: cryptographically secure string hashes
- Edit-distance metric for approximate string matching
  - expensive, even for one pair of documents
  - impossible for billions of web documents
- Random sampling
  - sample substrings (phrases, sentences, etc.)
  - hope: similar documents yield similar samples
  - but even samples of the same document will differ
Karp-Rabin Fingerprints
- Consider the m-bit string A = 1 a_1 a_2 … a_m (a leading 1 is prepended)
- Basic values: choose a prime p in the universe U ≈ 2^64
- Fingerprint: f(A) = A mod p
- Rolling hash: given B = a_2 … a_m a_{m+1},
  f(B) = [2 (A − 2^m − a_1 2^{m−1}) + 2^m + a_{m+1}] mod p
- Prob[false hit] = Prob[p divides (A − B)] = #div(A − B) / #primes(U) ≤ log(A + B) / #primes(U) ≈ (m log U) / U
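A minimal Python sketch of this rolling hash follows; the specific prime (the Mersenne prime 2^61 − 1 rather than one near 2^64) and all function names are my illustrative choices, not the lecture's.

    P = (1 << 61) - 1                        # a large prime p (my choice; the slide's U is ~2^64)

    def fingerprint(bits):
        # f(A) = A mod p, where A is the m-bit window with a 1 prepended
        a = 1
        for b in bits:
            a = (a * 2 + b) % P
        return a

    def roll(f, m, a1, a_next):
        # f(B) = [2 (A - 2^m - a1 2^(m-1)) + 2^m + a_next] mod p
        two_m = pow(2, m, P)
        f = (f - two_m - a1 * pow(2, m - 1, P)) % P
        return (2 * f + two_m + a_next) % P

    # Fingerprints of all m-bit windows of a bit string:
    bits = [1, 0, 1, 1, 0, 1, 0]
    m = 4
    f = fingerprint(bits[:m])
    for i in range(len(bits) - m):
        f = roll(f, m, bits[i], bits[i + m])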
Basic Idea [Broder 1997]: Shingling
- dissect each document into q-grams (shingles)
- represent documents by their shingle sets
- reduce the problem to set intersection [Jaccard]
- two documents are near-duplicates if their shingle sets intersect enough (see the sketch below)
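As an illustration of shingling (my sketch, not the lecture's code), here is a word-level q-gram dissection; character-level q-grams work the same way.

    def shingles(text, q=4):
        # dissect a document into the set of its q-grams (shingles)
        words = text.lower().split()
        return {tuple(words[i:i + q]) for i in range(len(words) - q + 1)}

    print(shingles("a rose is a rose is a rose", q=3))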
#1. Doc Similarity = Set Intersection
[Figure: two documents, Doc A and Doc B, with their shingle sets S_A and S_B]
- Jaccard measure of the similarity of S_A, S_B: sim(S_A, S_B) = |S_A ∩ S_B| / |S_A ∪ S_B|
- Claim: A and B are near-duplicates if sim(S_A, S_B) is high
- We need to cope with set intersection:
  - fingerprints of shingles (for space/time efficiency)
  - min-hash to estimate intersection sizes (further efficiency)
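The Jaccard measure written out as code (the function name and sample sets are illustrative):

    def sim(sa, sb):
        # sim(S_A, S_B) = |S_A ∩ S_B| / |S_A ∪ S_B|
        return len(sa & sb) / len(sa | sb)

    print(sim({1, 2, 3, 4}, {2, 3, 4, 5}))   # 3/5 = 0.6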
#2. Sets of 64-bit Fingerprints
[Figure: Doc → shingling → multiset of shingles → fingerprinting → multiset of fingerprints]
- Fingerprints: use Karp-Rabin fingerprints over q-gram shingles (of 8q bits each)
- In practice, use 64-bit fingerprints, i.e., U = 2^64
- Prob[collision] ≈ (8q · 64) / 2^64 ≪ 1
- This reduces the space for storing the multisets and the time to intersect them, but…
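A stand-in for this step (my sketch: it collapses each shingle to 64 bits via MD5, one of the hashes named earlier, rather than via Karp-Rabin as the slides prescribe):

    import hashlib

    def fp64(shingle):
        # 64-bit fingerprint of a shingle: first 8 bytes of its MD5 digest
        data = " ".join(shingle).encode()
        return int.from_bytes(hashlib.md5(data).digest()[:8], "big")

    def fingerprint_set(shingle_set):
        return {fp64(s) for s in shingle_set}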
#3. Sketch of a Document
- Shingle sets are large, so their intersection is still too costly
- Create a "sketch vector" (of size ~200) for each shingle set
- Documents that share ≥ t (say 80%) of the sketch elements are claimed to be near-duplicates
Sec. 19.6
Sketching by Min-Hashing
- Consider S_A, S_B ⊆ {0, …, p−1}
- Pick a random permutation π of the whole set {0, …, p−1} (such as π(x) = (ax + b) mod p)
- Define α = min{π(S_A)}, β = min{π(S_B)}: the minimal elements under the permutation π
- Lemma: Prob[α = β] = sim(S_A, S_B) (sketched below)
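One min-hash under the slide's linear permutation π(x) = (ax + b) mod p; the prime and the names are my illustrative choices:

    import random

    p = (1 << 61) - 1                        # a prime (illustrative choice)

    def random_pi():
        a, b = random.randrange(1, p), random.randrange(p)
        return lambda x: (a * x + b) % p     # a permutation of {0, ..., p-1}

    def min_hash(s, pi):
        # alpha = min{pi(S)}; Prob[min_hash(S_A, pi) = min_hash(S_B, pi)] = sim(S_A, S_B)
        return min(pi(x) for x in s)

    pi = random_pi()
    print(min_hash({3, 17, 42}, pi))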
Strengthening it…
- Similarity sketch sk(A) = the k minimal elements under π(S_A)
- We might also take k permutations and the min of each
- Note: we can reduce the variance by using a larger k (both variants are sketched below)
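Both variants in Python (illustrative, assuming pi and pis are built as above):

    import heapq

    # (a) sk(A): the k minimal elements under ONE permutation pi
    def sketch_k_mins(s, pi, k=200):
        return heapq.nsmallest(k, (pi(x) for x in s))

    # (b) the min under each of k DIFFERENT permutations
    def sketch_k_perms(s, pis):              # pis: a list of k permutations
        return [min(pi(x) for x in s) for pi in pis]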
Computing Sketch[i] for Doc1
- Start with the 64-bit fingerprints f(shingles), in the universe 2^64
- Permute them with π_i
- Pick the min value as Sketch[i]
Sec. 19.6
Test if Doc1.Sketch[i] = Doc2.Sketch[i]
- For each i, are the two sketch entries equal?
- Test for 200 random permutations: π_1, π_2, …, π_200
Sec. 19.6
However…
- Doc1.Sketch[i] = Doc2.Sketch[i] iff the shingle with the MIN value under π_i in the union of Doc1 and Doc2 is common to both (i.e., it lies in the intersection)
- Claim: this happens with probability size_of_intersection / size_of_union
Sec. 19.6
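Hence the fraction of sketch coordinates on which two documents agree estimates their Jaccard similarity; a sketch (assuming the k-permutations variant, so coordinates are aligned):

    def estimate_sim(sk_a, sk_b):
        # each coordinate agrees with probability sim(A, B)
        return sum(a == b for a, b in zip(sk_a, sk_b)) / len(sk_a)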
#4. Detecting all duplicates
- Brute force (quadratic time): compare sk(A) vs. sk(B) for all pairs of docs A and B
  - Still, (num docs)^2 is too much computation, even if executed in internal memory
- Locality-sensitive hashing (LSH) for sk(A) vs. sk(B):
  - Sample h elements of sk(A) as an ID (may induce false positives)
  - Create t IDs (to reduce the false negatives)
  - If at least one ID matches another one (wrt the same h-selection), then A and B are probably near-duplicates (hence compare them); a banding sketch follows below
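One way to realize this (my reading of the slide: each ID is taken as h consecutive sketch elements rather than a random h-sample):

    from collections import defaultdict

    def band_ids(sketch, h, t):
        # t IDs, each made of h sketch elements
        return [tuple(sketch[i * h:(i + 1) * h]) for i in range(t)]

    def candidate_pairs(sketches, h, t):
        # sketches: dict docID -> sketch vector of length >= h * t
        buckets = defaultdict(list)
        for doc, sk in sketches.items():
            for i, band in enumerate(band_ids(sk, h, t)):
                buckets[(i, band)].append(doc)       # same i = same h-selection
        pairs = set()
        for docs in buckets.values():
            for j in range(len(docs)):
                for k in range(j + 1, len(docs)):
                    pairs.add(tuple(sorted((docs[j], docs[k]))))
        return pairs                                 # candidates, then compare full sketches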
#4. How do you implement this?
GOAL: if at least one ID matches another one (wrt the same h-selection), then A and B are probably near-duplicates (hence compare them).
SOL 1: Create t hash tables (with chaining), one per ID [recall that an ID is an h-sample of sk()]. Insert each docID into the slots of each of its IDs (using some hash). Scan every bucket and check which docIDs share ≥ 1 ID.
SOL 2: Sort by each ID, and then check the consecutive equal ones (sketched below).
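SOL 2 in the same illustrative style (after sorting, equal IDs are consecutive, so a single scan of each run finds the docIDs sharing that ID):

    from itertools import groupby

    def candidates_by_sorting(sketches, h, t):
        out = set()
        for i in range(t):                           # one sorting pass per ID position
            rows = sorted((tuple(sk[i * h:(i + 1) * h]), doc)
                          for doc, sk in sketches.items())
            for _, run in groupby(rows, key=lambda r: r[0]):
                docs = [d for _, d in run]           # a run of consecutive equal IDs
                out.update((a, b) for j, a in enumerate(docs) for b in docs[j + 1:])
        return out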