Published by Peter Horn. Modified over 9 years ago.
1
Intelligent Database Systems Lab N.Y.U.S.T. I. M. SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections Presenter: Tsai Tzung Ruei Authors: Martin Theobald, Jonathan Siddharth, and Andreas Paepcke SIGIR. 2008 國立雲林科技大學 National Yunlin University of Science and Technology
2
Outline
Motivation, Objective, Methodology, Experiments, Conclusion, Comments
3
Motivation
Detecting near-duplicate documents and records in large data sets is a long-standing problem. Syntactically, near duplicates are pairs of items that are very similar along some dimensions, but different enough that simple byte-by-byte comparisons fail.
4
Objective
Exact duplicates are easy to avoid during the collection of Web archives, but near duplicates frequently slip into the corpus. The goal is to detect these near duplicates robustly and efficiently.
5
Methodology
Two stages: spot signature extraction, followed by spot signature matching.
[Figure: pipeline diagram with the labels Web, Database, and document]
6
Methodology: Spot Signature Extraction
Antecedents are specified as A = {a_j(d_j, c_j)}, where a_j is an antecedent word, d_j its spot distance, and c_j its chain length.
Example: with the antecedents a(1,2), an(1,2), the(1,2), and is(1,2), consider the sentence “At a rally to kick off a weeklong campaign for the South Carolina primary, Obama tried to set the record straight from an attack circulating widely on the Internet that is designed to play into prejudices against Muslims and fears of terrorism.”
Result: S = {a:rally:kick, a:weeklong:campaign, the:south:carolina, the:record:straight, an:attack:circulating, the:internet:designed, is:designed:play}
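The extraction step on this slide can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `spot_signatures`, the regex tokenizer, and the small stopword list are hypothetical and chosen only for illustration. It reads the first number of each antecedent as the spot distance and the second as the chain length, and it skips stopwords when building a chain, which is what the slide's result suggests ("to" is skipped in a:rally:kick).

```python
import re
from collections import Counter

# Assumed stopword list (illustrative only); chain words must not be stopwords.
STOPWORDS = {"a", "an", "the", "is", "at", "to", "off", "for", "from",
             "on", "that", "into", "against", "and", "of"}

def spot_signatures(text, antecedents, stopwords=STOPWORDS):
    """Return a multiset (Counter) of spot signatures found in `text`.

    `antecedents` maps an antecedent word to (spot distance d, chain length c).
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    sigs = Counter()
    for i, tok in enumerate(tokens):
        if tok not in antecedents:
            continue
        d, c = antecedents[tok]
        # Non-stopword tokens that follow this antecedent occurrence.
        content = [t for t in tokens[i + 1:] if t not in stopwords]
        # Take every d-th content word until the chain holds c words.
        chain = content[d - 1:d * c:d]
        if len(chain) == c:  # assumption: incomplete chains are dropped
            sigs[":".join([tok] + chain)] += 1
    return sigs

TEXT = ("At a rally to kick off a weeklong campaign for the South Carolina "
        "primary, Obama tried to set the record straight from an attack "
        "circulating widely on the Internet that is designed to play into "
        "prejudices against Muslims and fears of terrorism.")
A = {"a": (1, 2), "an": (1, 2), "the": (1, 2), "is": (1, 2)}

signatures = spot_signatures(TEXT, A)
for sig in signatures:
    print(sig)
```

With the stopword list above, the sketch yields the seven signatures listed in the slide's result. A different stopword list or tokenizer would change the chains.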
7
Methodology: Spot Signature Matching
Jaccard similarity for sets, and its generalization for multi-sets.
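The two similarity measures named on this slide can be sketched as follows, using their standard definitions: for sets, the size of the intersection divided by the size of the union; for multi-sets, the sum of per-element minimum counts divided by the sum of per-element maximum counts. The function names and the convention of returning 1.0 for two empty inputs are assumptions made for this sketch.

```python
from collections import Counter

def jaccard_set(a, b):
    """Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    # Assumption: two empty sets are treated as identical.
    return len(a & b) / len(union) if union else 1.0

def jaccard_multiset(a, b):
    """Multi-set Jaccard: sum of min counts / sum of max counts."""
    a, b = Counter(a), Counter(b)
    inter = sum((a & b).values())  # Counter & keeps the minimum counts
    union = sum((a | b).values())  # Counter | keeps the maximum counts
    return inter / union if union else 1.0

set_sim = jaccard_set({"x", "y", "z"}, {"y", "z", "w"})
multi_sim = jaccard_multiset(Counter(x=2, y=1), Counter(x=1, y=3))
print(set_sim, multi_sim)
```

When every count is 1, the multi-set version reduces to the set version, which is why it can be seen as a generalization.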
8
Methodology: Spot Signature Matching
[Figure: matching pipeline with the stages spot signatures, partition, inverted index pruning, and Jaccard similarity for sets]
9
Methodology: Optimal Partitioning
10
Methodology: Inverted Index Pruning
Example with threshold τ = 0.8:
d1 = {s1:5, s2:4, s3:4}, |d1| = 13
d2 = {s1:8, s2:4}, |d2| = 12
d3 = {s1:4, s2:5, s3:5}, |d3| = 14
δ1 = 0, δ2 = |d1| − |d3| = −1
[Figure: matching pipeline with the stages spot signatures, partition, inverted index pruning, and Jaccard similarity for sets]
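The example documents on this slide can be run through a small inverted-index sketch. This does not reproduce the δ1/δ2 pruning shown on the slide. As a stated substitute, it uses a simpler length-ratio filter based on the bound sim(A, B) ≤ min(|A|, |B|) / max(|A|, |B|), which follows directly from the multi-set Jaccard definition. The function names and the candidate-generation strategy are assumptions for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations

def multiset_jaccard(a, b):
    """Multi-set Jaccard: sum of min counts / sum of max counts."""
    inter = sum((a & b).values())
    union = sum((a | b).values())
    return inter / union if union else 1.0

def near_duplicates(docs, tau):
    """Return {(doc_i, doc_j): similarity} for pairs with similarity >= tau."""
    # Inverted index: signature -> ids of the documents that contain it.
    index = defaultdict(set)
    for doc_id, sigs in docs.items():
        for sig in sigs:
            index[sig].add(doc_id)
    # Only pairs that share at least one signature become candidates.
    candidates = set()
    for ids in index.values():
        candidates.update(combinations(sorted(ids), 2))
    matches = {}
    for i, j in sorted(candidates):
        len_i, len_j = sum(docs[i].values()), sum(docs[j].values())
        # Length-ratio upper bound (assumed filter, not the slide's δ test).
        if min(len_i, len_j) / max(len_i, len_j) < tau:
            continue
        sim = multiset_jaccard(docs[i], docs[j])
        if sim >= tau:
            matches[(i, j)] = sim
    return matches

docs = {
    "d1": Counter({"s1": 5, "s2": 4, "s3": 4}),  # |d1| = 13
    "d2": Counter({"s1": 8, "s2": 4}),           # |d2| = 12
    "d3": Counter({"s1": 4, "s2": 5, "s3": 5}),  # |d3| = 14
}
matches = near_duplicates(docs, tau=0.8)
print(matches)
```

For these three documents no pair is removed by the length filter alone (the ratios are 12/13, 13/14, and 12/14), so every pair is scored exactly. Only d1 and d3 reach the threshold, with similarity 12/15 = 0.8; d1 and d2 score 9/16, and d2 and d3 score 8/18.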
11
Experiments
Gold Set of Near Duplicate News Articles: SpotSigs vs. Shingling; Choice of Spot Signatures; SpotSigs vs. Hashing
TREC WT10g: SpotSigs vs. Hashing
12
Experiments: Gold Set of Near Duplicate News Articles
[Figures: SpotSigs vs. Shingling; Choice of Spot Signatures; SpotSigs vs. Hashing]
13
Experiments: TREC WT10g
[Figure: SpotSigs vs. Hashing]
14
Conclusion
Major contribution: SpotSigs provides both more robust signatures and highly efficient deduplication compared to various state-of-the-art approaches.
Future work: Future work will focus on efficient access to disk-based index structures, as well as generalizing the bounding approach toward other metrics such as Cosine.
15
Comments
Advantage: The SpotSigs deduplication algorithm runs “right out of the box” without the need for further tuning, while remaining exact and efficient.
Drawback: …..
Application: information retrieval