Efficient Similarity Joins for Near Duplicate Detection

Chuan Xiao
The University of New South Wales, Australia
Joint work with Wei Wang (UNSW), Xuemin Lin (UNSW), Jeffrey Xu Yu (CUHK)

Outline
  Introduction
  Algorithms
  Experiments
  Conclusion
2018/9/19 Efficient Similarity Joins for Near Duplicate Detection

Near Duplicate Data
  On one end, a winded Pete Sampras tried to summon enough energy to give the New York fans another memorable win to talk about it on the subway ride home. On the other side, Roger Federer wore a sly grin like he knew age was about to catch up to the former world No. 1 - the man who owns the record of 14 Grand Slams he wants.
  By JAY COHEN, AP Sports Writer, Mar 11, 4:23 am EDT
  03/11/2008 | 11:28 AM

Applications
For Web search engines:
  Perform focused crawling
  Increase the quality and diversity of query results
  Identify spam
For Web mining:
  Perform document clustering
  Find replicated Web collections
  Detect plagiarism
SPAM TEMPLATE
  Sir/Madam, We happily announce to you the draw of the EURO MILLIONS SPANISH LOTTERY INTERNATIONAL WINNINGS PROGRAM PROMOTIONS held on the 27TH MARCH 2008 in SPAIN. Your company or your personal address attached to ticket number with serial main number <NUMBER> drew lucky star winning numbers <NUMBER> which consequently won in the 2ND category, you have therefore been approved for a lump sum pay out of Euros. (NINE HUNDRED AND SIXTY THOUSAND EUROS). CONGRATULATIONS!!! Sincerely yours, <NAME> <AFFILIATION>
Near-duplicate answers (first copy):
  Q. What are the advantages of RAID5 over RAID4?
  A. 1. Several write requests could be processed in parallel, since the bottleneck of a unique check disk has been eliminated. 2. Read requests have a higher level of parallelism. Since the data is distributed over all disks, read requests involve all disks, whereas in systems with a dedicated check disk the check disk never participates in read.
Near-duplicate answers (second copy):
  Q. What are the advantages of RAID5 over RAID4?
  A. 1. Several write requests could be processed in parallel, since the bottleneck of a single check disk has been eliminated. 2. Read requests have a higher level of parallelism on RAID5. Since the data is distributed over all disks, read requests involve all disks, whereas in systems with a check disk the check disk never participates in read.

Near duplicates = pairs of objects with high similarity
Similarity -> quantitative measure -> similarity function
Given a collection of records, the similarity join problem is to find all pairs of records <x,y> such that sim(x,y) >= t
Tokenize: each record is a set of tokens from a finite universe
Suppose each record is a single text document:
  x = “yes as soon as possible”
  y = “as soon as possible please”
  word:   yes  as  soon  as1  possible  please
  token:  A    B   C     D    E         F
  (as1 denotes the second occurrence of “as”)
  x = {A, B, C, D, E}
  y = {B, C, D, E, F}
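The tokenization step can be sketched in Python as follows. This is a minimal illustration, not the authors' code: it assumes whitespace splitting and assumes that a repeated word is made distinct by appending its repeat count, which is how the "as1" entry in the table is read here. The function name `tokenize` is hypothetical, and words are kept as tokens instead of being mapped to the letters A to F.

```python
from collections import Counter


def tokenize(text):
    """Turn a document into a set of tokens.

    Assumption: the k-th repeat of a word (k >= 1) becomes word + str(k),
    so the second "as" becomes "as1" and the record stays a proper set.
    """
    seen = Counter()
    tokens = set()
    for word in text.lower().split():
        repeats = seen[word]
        tokens.add(word if repeats == 0 else f"{word}{repeats}")
        seen[word] += 1
    return tokens


x = tokenize("yes as soon as possible")
y = tokenize("as soon as possible please")
shared = x & y  # tokens the two records have in common
```

Both records end up with five tokens and share four of them, matching x = {A, B, C, D, E} and y = {B, C, D, E, F} on the slide.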

Similarity Function
Common similarity functions:
  Jaccard: J(x,y) = |x ∩ y| / |x ∪ y|
  Cosine:  C(x,y) = |x ∩ y| / sqrt(|x| |y|)
  Overlap: O(x,y) = |x ∩ y|
Jaccard can be equivalently converted to Overlap
Example: x = {A,B,C,D,E}, y = {B,C,D,E,F}
  Jaccard = 4/6 = 0.67
  Cosine = 4/5 = 0.8
  Overlap = 4
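The three similarity functions and the Jaccard-to-Overlap conversion can be sketched in Python as below. The function names are illustrative choices, not taken from the slides. The conversion uses the standard identity J(x,y) >= t if and only if O(x,y) >= t / (1 + t) * (|x| + |y|); the threshold is passed as a `Fraction` here (an implementation choice) so that rounding up is not disturbed by floating-point error.

```python
from fractions import Fraction
from math import ceil, sqrt


def overlap(x, y):
    return len(x & y)


def jaccard(x, y):
    return len(x & y) / len(x | y)


def cosine(x, y):
    return len(x & y) / sqrt(len(x) * len(y))


def min_overlap_for_jaccard(t, size_x, size_y):
    """Smallest integer overlap that guarantees J(x,y) >= t.

    J = O / (|x| + |y| - O) >= t   <=>   O >= t / (1 + t) * (|x| + |y|).
    t must be an exact Fraction, e.g. Fraction(4, 5) for 0.8.
    """
    return ceil(t * (size_x + size_y) / (1 + t))


x = set("ABCDE")
y = set("BCDEF")
values = (jaccard(x, y), cosine(x, y), overlap(x, y))
needed = min_overlap_for_jaccard(Fraction(4, 5), len(x), len(y))
```

For the slide's x and y, `values` holds 4/6, 4/5 and 4, and `needed` shows that two 5-token records must share all 5 tokens to reach a Jaccard similarity of 0.8.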

Naïve and Index-Based Algorithms
Naïve algorithm: compare every pair of objects -> O(n^2) time complexity
Index-based algorithm [MIR, SIGMOD04]: inverted lists
  token   record_id
  A       w, x, y
  B       z, …
  C
Pipeline: Record Set -> Index Construction -> Candidate Generation (<w,x>, <w,y>, <x,y>, <x,z>, …) -> Verification -> Result Pairs
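For reference, the naïve algorithm can be sketched in a few lines of Python. The record set below is hypothetical toy data (record `z` in particular is made up for illustration), and `naive_join` is an illustrative name.

```python
from itertools import combinations


def overlap(x, y):
    return len(x & y)


def naive_join(records, sim, t):
    """Evaluate sim on every pair of records: n * (n - 1) / 2 comparisons."""
    return [
        (a, b)
        for (a, rec_a), (b, rec_b) in combinations(records.items(), 2)
        if sim(rec_a, rec_b) >= t
    ]


# Hypothetical toy records; any mapping from record id to token set works.
records = {"x": set("ABCDE"), "y": set("BCDEF"), "z": set("AGH")}
pairs = naive_join(records, overlap, 4)
```

Every pair is verified regardless of whether the two records share any token at all, which is the cost the index-based approach tries to avoid.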

Index-Based Algorithm
Example: suppose sim(x,y) = O(x,y) >= t = 3
  ID  Name
  w   Data Mining: Concepts and Techniques
  x   Web Data Mining Techniques
  y   Data Mining: Concepts, Models, Methods, and Algorithms
  z   Data Management Concepts
  v   Romeo and Juliet
  u   The Merchant of Venice
Result: <w,x>, <w,y>
Problem: stop words generate too many candidate pairs!
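The example above can be reproduced with a small inverted-index join in Python. This is a sketch under stated assumptions, not the implementation from [SIGMOD04]: titles are lowercased and split into alphabetic words, each record probes the inverted lists of the records indexed before it, and a candidate is any pair sharing at least one token. The names `tokens` and `index_join` are hypothetical.

```python
import re
from collections import defaultdict

titles = {
    "w": "Data Mining: Concepts and Techniques",
    "x": "Web Data Mining Techniques",
    "y": "Data Mining: Concepts, Models, Methods, and Algorithms",
    "z": "Data Management Concepts",
    "v": "Romeo and Juliet",
    "u": "The Merchant of Venice",
}


def tokens(title):
    # Assumed tokenization: lowercase alphabetic words, punctuation dropped.
    return set(re.findall(r"[a-z]+", title.lower()))


def index_join(records, t):
    """Return (result pairs with overlap >= t, number of candidate pairs)."""
    index = defaultdict(list)  # token -> ids of records indexed so far
    results, n_candidates = [], 0
    for rid, rec in records.items():
        counts = defaultdict(int)  # earlier record id -> shared-token count
        for tok in rec:
            for other in index[tok]:
                counts[other] += 1
        n_candidates += len(counts)
        results += [(other, rid) for other, c in counts.items() if c >= t]
        for tok in rec:
            index[tok].append(rid)
    return sorted(results), n_candidates


results, n_candidates = index_join({k: tokens(v) for k, v in titles.items()}, 3)
```

Only two pairs survive verification, yet frequent words such as "data" and "and" pull several more pairs into the candidate set, which illustrates the problem the slide points out.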

Prefix Filter [ICDE06, WWW07]
Sort the tokens by a global ordering (increasing order of document frequency)
Index only the first few tokens (the prefix) of each record
Two records must share at least one token in their prefixes to be a candidate pair
Example: suppose sim(x,y) = O(x,y) >= t = 4
  sorted x = [A B | C D E]
  sorted y = [C D | E F G]
  (tokens to the left of "|" form the prefix)
  The prefixes share no token, so uboundO(x,y) = 3 < 4!

Prefix Filter [ICDE06, WWW07]
O(x,y) >= t  =>  prefix length = |x| - t + 1
J(x,y) >= t  =>  O(x,y) >= t |x|  =>  prefix length = floor((1-t) |x|) + 1
Example: suppose sim(x,y) = J(x,y) >= t = 0.8
  w = {C, D, E, F}
  x = {B, C, D, E, F}
  y = {A, B, C, D, F}
  z = {G, A, B, E, F}
Candidate pairs: <w,x>, <x,y>, <y,z>
Results: <w,x>
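The Jaccard example can be checked with a short Python sketch of the prefix filter. Two things are assumptions here: the global token ordering G < A < B < C < D < E < F (inferred from the way z is written on the slide) and the use of `Fraction` for the threshold to avoid floating-point rounding. The prefix length is computed as |x| - ceil(t |x|) + 1, which equals floor((1-t) |x|) + 1 for integer |x|. Function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations
from math import ceil

ORDER = "GABCDEF"  # assumed global ordering, rarest token first


def prefix_length(size, t):
    # |x| - ceil(t * |x|) + 1  ==  floor((1 - t) * |x|) + 1 for integer |x|
    return size - ceil(t * size) + 1


def prefix(record, t):
    ordered = sorted(record, key=ORDER.index)
    return set(ordered[:prefix_length(len(ordered), t)])


records = {
    "w": set("CDEF"),
    "x": set("BCDEF"),
    "y": set("ABCDF"),
    "z": set("GABEF"),
}
t = Fraction(4, 5)

# Candidate generation: keep pairs whose prefixes share at least one token.
candidates = [
    (a, b)
    for a, b in combinations(records, 2)
    if prefix(records[a], t) & prefix(records[b], t)
]

# Verification: compute the exact Jaccard similarity of each candidate.
results = [
    (a, b)
    for a, b in candidates
    if Fraction(len(records[a] & records[b]), len(records[a] | records[b])) >= t
]
```

Three of the six possible pairs become candidates and only <w,x> passes verification, as stated on the slide.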

Prefix + Positional Information
We use the prefix filter (All-Pairs [WWW07]) as the basic framework
Intuition:
  Tokens are sorted -> each token has a rank, or position, within a record
  Estimate tighter upper bounds of the overlap between x and y with positional information
Contributions:
  Index construction: index not only tokens, but also their positions in the record -> ppjoin algorithm
  Candidate generation: probe tokens in the suffix and compare their positions in the record -> ppjoin+ algorithm

Positional Filter within Prefix (ppjoin)
Index both tokens and their positions
uboundO(x,y) = 1 + min(|x| - px, |y| - py)
Example:
  position  1 2 3 4 5
  x =       B C D E F
  y =       A B C D F
  The shared token B is at position px = 1 in x and py = 2 in y, so uboundO(x,y) = 1 + min(4, 3) = 4
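A minimal Python sketch of this bound follows. It assumes both records are already sorted by the global ordering and that the probed token is the first token the two records share, so nothing before it can contribute to the overlap. The Jaccard threshold of 0.8 is borrowed from the neighbouring examples as an assumption; the slide itself only states the bound. `positional_upper_bound` is a hypothetical name.

```python
from fractions import Fraction
from math import ceil


def positional_upper_bound(x, y, token):
    """1 + min(|x| - px, |y| - py), with 1-based positions px and py.

    Assumes x and y are sorted by the global ordering and `token` is the
    first token they have in common.
    """
    px = x.index(token) + 1
    py = y.index(token) + 1
    return 1 + min(len(x) - px, len(y) - py)


x = ["B", "C", "D", "E", "F"]
y = ["A", "B", "C", "D", "F"]

ubound = positional_upper_bound(x, y, "B")

# Assumed threshold: J(x,y) >= 0.8, converted to a minimum overlap.
t = Fraction(4, 5)
required = ceil(t * (len(x) + len(y)) / (1 + t))
pruned = ubound < required
```

Under that threshold the pair would need an overlap of 5, so the bound of 4 is enough to discard it without computing the exact similarity.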

Positional Filter within Suffix (ppjoin+)
Probe tokens in the suffix and compare their positions
Example: suppose sim(x,y) = J(x,y) >= t = 0.8 and |x| = |y| = 18, so O(x,y) >= 16
  [Figure: x with prefix A B D E and y with prefix A C D E; the suffix token Q of x is located in y by binary search]
  uboundO(x,y) = 15 < 16

Positional Filter within Suffix (ppjoin+)
Divide and conquer: probe the suffix recursively until either the candidate pair is pruned or max-depth is reached
  ubound at depth 1 = 4 + 6 + 1 + 7 = 18
  ubound at depth 2 = 4 + 3 + 1 + 1 + 1 + 3 + 1 + 3 = 17
  ubound at depth 3 = 4 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 = 15
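The divide-and-conquer idea can be sketched in Python as a recursive lower bound on the Hamming distance between two sorted suffixes, which in turn gives an upper bound on their overlap. This is a simplified illustration and not the authors' exact procedure: it always picks the middle token of x as the probe, locates it in y with binary search, and omits the early termination a real implementation would use. The integer records are hypothetical toy data and all names are illustrative.

```python
from bisect import bisect_left


def hamming_lower_bound(x, y, max_depth, depth=1):
    """Lower bound on |x - y| + |y - x| for sorted, duplicate-free lists."""
    if not x or not y or depth > max_depth:
        return abs(len(x) - len(y))
    mid = len(x) // 2
    probe = x[mid]
    pos = bisect_left(y, probe)  # binary search for the probe token in y
    found = pos < len(y) and y[pos] == probe
    x_left, x_right = x[:mid], x[mid + 1:]
    y_left = y[:pos]
    y_right = y[pos + 1:] if found else y[pos:]
    missing = 0 if found else 1  # a probe absent from y adds one mismatch
    return (
        hamming_lower_bound(x_left, y_left, max_depth, depth + 1)
        + hamming_lower_bound(x_right, y_right, max_depth, depth + 1)
        + missing
    )


def overlap_upper_bound(x, y, max_depth):
    # O(x,y) = (|x| + |y| - H(x,y)) / 2, so a lower bound on H caps O.
    return (len(x) + len(y) - hamming_lower_bound(x, y, max_depth)) // 2


# Hypothetical sorted suffixes; integers stand in for ranked tokens.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1, 3, 5, 7, 9, 11, 13, 15]
bounds = [overlap_upper_bound(x, y, d) for d in (1, 2, 3)]
```

Deeper recursion can only use more positional information, so the overlap bound tightens as max-depth grows, at the cost of more binary searches per candidate pair.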

Effect of Filters
sim(x,y) = J(x,y) >= t = 0.8
  u = {C, D, E, F}
  v = {B, C, D, E, F}
  w = {A, B, C, D, F}
  x = {G, A, B, E, F}
  y = {A, B, D, E, F}
  z = {G, A, C, D, E, F}
After prefix filter: <u,v>, <v,w>, <v,y>, <w,x>, <w,y>, <w,z>, <x,y>, <x,z>, <y,z>
After ppjoin: <u,v>, <w,y>, <w,z>, <x,z>, <y,z>
After ppjoin+ (max-depth = 1): <u,v>, <x,z>, <y,z>
Real result: <u,v>

Experiment Settings
Algorithms compared:
  All-Pairs [WWW07]
  PPJoin
  PPJoin+
Measure:
  Jaccard, Cosine
  Candidate size, running time
Near duplicate Web page detection:
  Compare with shingling [SEQS97]

Experiment Settings
Environment:
  Pentium D 3.00GHz CPU, 2GB RAM
  Debian 4.1, GCC with -O3
Dataset:
  dataset                                # of records   avg. size
  DBLP (author, title)                   0.9M           14.0
  ENRON ( )                              0.5M           142.4
  DBLP-3GRAM                                            102.5
  TREC-4GRAM (author, title, abstract)   0.35M          866.9
  TREC-32shingle                                        32

Experiment Results – DBLP, Jaccard
[Plots: Candidate Pairs; Running Time]

Exp. Results – Near Duplicate Web Page Detection
Extract qgram and shingle sets, and perform similarity join
rs = result from TREC-32shingle, rq = result from TREC-4gram
Precision = tp / rs
Recall = tp / rq
Results:
  threshold (Jaccard)   precision   recall   time (qgram-allpairs)   time (qgram-ppjoin+)   time (shingling + ssjoin)
  0.95                  0.38        0.11     41.98s                  11.76s                 1.00s
  0.90                  0.48        0.06     245.03s                 43.37s                 1.03s
  0.85                  0.58        0.04     926.54s                 202.65s
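The precision and recall figures follow the usual definitions, which can be sketched in Python as below. This assumes tp is the set of pairs reported by both methods, with the q-gram result rq treated as the reference. The pair identifiers are hypothetical toy data, not values from the experiment.

```python
def precision_recall(rs, rq):
    """rs: pairs found via shingling, rq: pairs found via q-grams."""
    tp = rs & rq  # assumed: true positives are pairs present in both results
    return len(tp) / len(rs), len(tp) / len(rq)


# Hypothetical result sets for illustration only.
rs = {("d1", "d2"), ("d3", "d4")}
rq = {("d1", "d2"), ("d5", "d6"), ("d7", "d8"), ("d9", "d10")}
precision, recall = precision_recall(rs, rq)
```

In this toy case one of the two shingling pairs is confirmed by the q-gram join, and one of the four q-gram pairs is recovered by shingling.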

Conclusion
Contributions:
  New algorithms for set-similarity joins
    Positional filtering within prefix -> ppjoin
    Positional filtering within suffix -> ppjoin+
Features:
  Exact
  Outperform existing algorithms
  Integrated with near duplicate Web page detection methods
Future work:
  Other similarity functions
    Edit distance
  Top-k similarity search queries

Related Work
Approximate:
  LSH: A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, 1999.
  Shingling: A. Z. Broder. On the resemblance and containment of documents. In SEQS, 1997.
Exact:
  Index-based: S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD, 2004.
  Prefix-based: S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006.
  All-Pairs: R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, 2007.
  Pigeon-hole principle based: PartEnum: A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, 2006.

Thank you! Questions?

References
[SEQS97] A. Z. Broder. On the resemblance and containment of documents. In SEQS, 1997.
[MIR] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1st edition, May 1999.
[VLDB99] LSH: A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, 1999.
[SIGMOD04] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD, 2004.
[ICDE06] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006.
[VLDB06] PartEnum: A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, 2006.
[WWW07] All-Pairs: R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, 2007.

Backup Slides
Memory Issues
  We need twice as much memory as All-Pairs when building the index
  Space / Time
  Some techniques to deal with memory:
    Do not build an index for widowed tokens (tokens that appear only once)
    Sort the records by increasing size; dynamically remove shorter records from inverted lists
Integrated with RDBMS
  Prefix filter in RDBMS [ICDE06]
  Need to implement positional filters in both prefix and suffix
Q: What if the probing tokens are not found in y?
  Convert overlap to Hamming distance
  Estimate the upper bound of Hamming distance
