SpotSigs Robust & Efficient Near Duplicate Detection in Large Web Collections Martin Theobald Jonathan Siddharth Andreas Paepcke Stanford University Sigir.

Slides:

Advertisements

Similar presentations

SpotSigs Robust & Efficient Near Duplicate Detection in Large Web Collections Martin Theobald Jonathan Siddharth Andreas Paepcke Sigir 2008, Singapore.

Advertisements

Near-Duplicates Detection

3/13/2012Data Streams: Lecture 161 CS 410/510 Data Streams Lecture 16: Data-Stream Sampling: Basic Techniques and Results Kristin Tufte, David Maier.

Latent Semantic Indexing (mapping onto a smaller space of latent concepts) Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 18.

Supervised Learning Techniques over Twitter Data Kleisarchaki Sofia.

Theoretical Analysis. Objective Our algorithm use some kind of hashing technique, called random projection. In this slide, we will show that if a user.

Big Data Lecture 6: Locality Sensitive Hashing (LSH)

Bahman Bahmani  Fundamental Tradeoffs  Drug Interaction Example [Adapted from Ullman’s slides, 2012]  Technique I: Grouping 

High Dimensional Search Min-Hashing Locality Sensitive Hashing

Comparing Twitter Summarization Algorithms for Multiple Post Summaries David Inouye and Jugal K. Kalita SocialCom May 10 Hyewon Lim.

MMDS Secs Slides adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, October.

@ Carnegie Mellon Databases User-Centric Web Crawling Sandeep Pandey & Christopher Olston Carnegie Mellon University.

1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 11 June 1, 2005

Evaluating Search Engine

Information Retrieval Ling573 NLP Systems and Applications April 26, 2011.

Query Operations: Automatic Local Analysis. Introduction Difficulty of formulating user queries –Insufficient knowledge of the collection –Insufficient.

Ph.D. DefenceUniversity of Alberta1 Approximation Algorithms for Frequency Related Query Processing on Streaming Data Presented by Fan Deng Supervisor:

DAST, Spring © L. Joskowicz 1 Data Structures – LECTURE 1 Introduction Motivation: algorithms and abstract data types Easy problems, hard problems.

Page-level Template Detection via Isotonic Smoothing Deepayan ChakrabartiYahoo! Research Ravi KumarYahoo! Research Kunal PuneraUniv. of Texas at Austin.

Ph.D. SeminarUniversity of Alberta1 Approximation Algorithms for Frequency Related Query Processing on Streaming Data Presented by Fan Deng Supervisor:

Detecting Near Duplicates for Web Crawling Authors : Gurmeet Singh Mank Arvind Jain Anish Das Sarma Presented by Chintan Udeshi 6/28/ Udeshi-CS572.

1 Lecture 18 Syntactic Web Clustering CS

1 Hash-Based Indexes Chapter Introduction  Hash-based indexes are best for equality selections. Cannot support range searches.  Static and dynamic.

Near Duplicate Detection

Reverse Hashing for Sketch Based Change Detection in High Speed Networks Ashish Gupta Elliot Parsons with Robert Schweller, Theory Group Advisor: Yan Chen.

Finding Similar Items. Set Similarity Problem: Find similar sets. Motivation: Many things can be modeled/represented as sets Applications: –Face Recognition.

Query Optimization 3 Cost Estimation R&G, Chapters 12, 13, 14 Lecture 15.

1 Edit Distance and Large Data Sets Ziv Bar-Yossef Robert Krauthgamer Ravi Kumar T.S. Jayram IBM Almaden Technion.

Finding Similar Items.

Near-duplicates detection Comparison of the two algorithms seen in class Romain Colle.

Hashed Samples Selectivity Estimators for Set Similarity Selection Queries.

DETECTING NEAR-DUPLICATES FOR WEB CRAWLING Authors: Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma Presentation By: Fernando Arreola.

GmImgProc Alexandra Olteanu SCPD Alexandru Ştefănescu SCPD.

Detecting Near-Duplicates for Web Crawling Manku, Jain, Sarma

FINDING NEAR DUPLICATE WEB PAGES: A LARGE- SCALE EVALUATION OF ALGORITHMS - Monika Henzinger Speaker Ketan Akade 1.

A Markov Random Field Model for Term Dependencies Donald Metzler W. Bruce Croft Present by Chia-Hao Lee.

Approximate XML Joins Huang-Chun Yu Li Xu. Introduction XML is widely used to integrate data from different sources. Perform join operation for XML documents:

Finding Similar Items 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 10: Finding Similar Items Mining.

Intelligent Database Systems Lab N.Y.U.S.T. I. M. SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections Presenter: Tsai Tzung.

N-Gram-based Dynamic Web Page Defacement Validation Woonyon Kim Aug. 23, 2004 NSRI, Korea.

Scott Wen-tau Yih (Microsoft Research) Joint work with Hannaneh Hajishirzi (University of Illinois) Aleksander Kolcz (Microsoft Bing)

Similarity Searching in High Dimensions via Hashing Paper by: Aristides Gionis, Poitr Indyk, Rajeev Motwani.

DATA MINING LECTURE 6 Similarity and Distance Sketching, Locality Sensitive Hashing.

Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms Author: Monika Henzinger Presenter: Chao Yan.

© 2006 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice Applying Syntactic Similarity Algorithms.

Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!

DATA MINING LECTURE 6 Sketching, Min-Hashing, Locality Sensitive Hashing.

Chapter 5 Ranking with Indexes 1. 2 More Indexing Techniques n Indexing techniques:  Inverted files - best choice for most applications  Suffix trees.

CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.

Outline Problem Background Theory Extending to NLP and Experiment

KNN & Naïve Bayes Hongning Wang Today’s lecture Instance-based classifiers – k nearest neighbors – Non-parametric learning algorithm Model-based.

SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections Martin Theobald, Jonathan Siddharth, and Andreas Paepcke Dept. of Computer.

DATA MINING LECTURE 6 Sketching, Locality Sensitive Hashing.

Parameter Reduction for Density-based Clustering on Large Data Sets Elizabeth Wang.

KNN & Naïve Bayes Hongning Wang

Big Data Infrastructure

Cohesive Subgraph Computation over Large Graphs

Optimizing Parallel Algorithms for All Pairs Similarity Search

Near Duplicate Detection

Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms By Monika Henzinger Presented.

Accessing nearby copies of replicated objects

Hash-Based Indexes Chapter 10

Locality Sensitive Hashing

Detecting Phrase-Level Duplication on the World Wide Web

Minwise Hashing and Efficient Search

Hash-Based Indexes Chapter 11

President’s Day Lecture: Advanced Nearest Neighbor Search

Chapter 11 Instructor: Xin Zhang

Three Essential Techniques for Similar Documents

An Efficient Partition Based Method for Exact Set Similarity Joins

Presentation transcript:

SpotSigs Robust & Efficient Near Duplicate Detection in Large Web Collections Martin Theobald Jonathan Siddharth Andreas Paepcke Stanford University Sigir 2008, Singapore

Near-Duplicate News Articles (I) June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Near-Duplicate News Articles (II) June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Our Setting June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

… but   Many different news sites get their core articles delivered by the same sources (e.g., Associated Press)   Even within a news site, often more than 30% of articles are near duplicates (dynamically created content, navigational pages, advertisements, etc.) June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

What is SpotSigs? Robust signature extraction – Stopword-based signatures favor natural-language contents of web pages over navigational banners and advertisements Efficient near-duplicate matching – Self-tuning, highly parallelizable clustering algorithm – Threshold-based collection partitioning and inverted index pruning June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Case (I): What’s different about the core contents? = stopword occurrences: the, that, {be}, {have} June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Case(II): Do not consider for deduplication! no occurrences of: the, that, {be}, {have} June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Spot Signature Extraction “Localized” signatures: n-grams close to a stopword antecedent – E.g.: that:presidential:campaign:hit antecedent nearby n-gram Spot Signature s June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Spot Signature Extraction “Localized” signatures: n-grams close to a stopword antecedent – E.g.: that:presidential:campaign:hit – Parameters: Predefined list of (stopword) antecedents Spot distance d, chain length c   Spot Signatures occur uniformly and frequently throughout any piece of natural-language text   Hardly occur in navigational web page components or ads   Spot Signatures occur uniformly and frequently throughout any piece of natural-language text   Hardly occur in navigational web page components or ads June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Signature Extraction Example Consider the text snippet: “At a rally to kick off a weeklong campaign for the South Carolina primary, Obama tried to set the record straight from an attack circulating widely on the Internet that is designed to play into prejudices against Muslims and fears of terrorism.” (for antecedents {a, the, is}, uniform spot distance d=1, chain length c=2) June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections S = {a:rally:kick, a:weeklong:campain, the:south:carolina, the:record:straight, an:attack:circulating, the:internet:designed, is:designed:play}

Signature Extraction Algorithm Simple & efficient sliding window technique   O(|tokens|) runtime   Largely independent of input format (maybe remove markup)   No expensive and error-prone layout analysis required   O(|tokens|) runtime   Largely independent of input format (maybe remove markup)   No expensive and error-prone layout analysis required June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Choice of Antecedents F1 measure for different antecedents over a “Gold Set” of 2,160 manually selected near-duplicate news articles, 68 clusters June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

June 26, Done?

How to deduplicate a large collection efficiently? Given {S 1,…,S N } Spot Signature sets – For each S i, find all similar signature sets S i1,…,S ik with similarity sim(S i, S ij ) ≥ τ Common similarity measures: – Jaccard, Cosine, Kullback-Leibler, … Common matching algorithms: – Various clustering techniques, similarity hashing, … June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Which documents (not) to compare?   Idea: Two signature sets A, B can only have high (Jaccard) similarity if they are of similar cardinality! Given 3 Spot Signature sets: A with |A| = 345 B with |B| = 1045 C with |C| = 323 Which pairs would you compare first? Which pairs could you spare? ? June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Upper bound for Jaccard Consider Jaccard similarity Upper bound   Never compare signature sets A, B with |A|/|B| (1-τ) |B|   Never compare signature sets A, B with |A|/|B| (1-τ) |B| (for |B|≥|A|, w.l.o.g.) June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Multi-set Generalization Consider weighted Jaccard similarity Upper bound   Still skip pairs A, B with |B|-|A| > (1-τ) |B| |A||B| June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Partitioning the Collection #sigs per doc … but: there are many possible partitionings, s.t. (A) any similar pair is (at most) mapped into two neighboring partitions … but: there are many possible partitionings, s.t. (A) any similar pair is (at most) mapped into two neighboring partitions Given a similarity threshold τ, there is no contiguous partitioning (based on signature set lengths), s.t. (A) any potentially similar pair is within the same partition, and (B) any non-similar pair cannot be within the same partition June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections SiSi SjSj ?

Partitioning the Collection #sigs per doc … but: there are many possible partitionings, s.t. (A) any similar pair is (at most) mapped into two neighboring partitions … but: there are many possible partitionings, s.t. (A) any similar pair is (at most) mapped into two neighboring partitions Given a similarity threshold τ, there is no contiguous partitioning (based on signature set lengths), s.t. (A) any potentially similar pair is within the same partition, and (B) any non-similar pair cannot be within the same partition June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections SiSi SjSj Also: Partition widths should be a function of τ

Optimal Partitioning Given τ, find partition boundaries p 0,…,p k, s.t. (A) all similar pairs (based on length) are mapped into at most two neighboring partitions (no false negatives) (B) no non-similar pair (based on length) is mapped into the same partition (no false positives) (C) all partitions’ widths are minimized w.r.t. (A) & (B) (minimality)   But expensive to solve exactly … June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Approximate Solution   Converges to optimal partitioning when distribution is dense   Web collections typically skewed towards shorter document lengths   Progressively increasing bucket widths are even beneficial for more uniform bucket sizes (next slide!)   Converges to optimal partitioning when distribution is dense   Web collections typically skewed towards shorter document lengths   Progressively increasing bucket widths are even beneficial for more uniform bucket sizes (next slide!) June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections “Starting with p 0 = 1, for any given p k, choose p k+1 as the smallest integer p k+1 > p k s.t. p k+1 − p k > (1 − τ )p k+1 ” E.g. (for τ=0.7): p 0 =1, p 1 =3, p 2 =6, p 3 =10,…, p 7 =43, p 8 =59,…

Partitioning Effects   Optimal partitioning approach even smoothes skewed bucket sizes (plot for 1,274,812 TREC WT10g docs with at least 1 Spot Signature) Uniform bucket widthsProgressively increasing bucket widths #docs #sigs #docs #sigs June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

… but Comparisons within partitions still quadratic!   Can do better: – Create auxiliary inverted indexes within partitions – Prune inverted index traversals using the very same threshold-based pruning condition as for partitioning   Can do better: – Create auxiliary inverted indexes within partitions – Prune inverted index traversals using the very same threshold-based pruning condition as for partitioning June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Inverted Index Pruning Pass 1: – For each partition, create an inverted index: For each Spot Signature s j – Create inverted list L j with pointers to documents d i containing s j – Sort inverted list in descending order of freq i (s j ) in d i Pass 1: – For each partition, create an inverted index: For each Spot Signature s j – Create inverted list L j with pointers to documents d i containing s j – Sort inverted list in descending order of freq i (s j ) in d i Pass 2: – For each document d i, find its partition, then: Process lists in descending order of |L j | Maintain two thresholds: δ 1 – Minimum length distance to any document in the next list δ 2 – Minimum length distance to next document within the current list Break if δ 1 + δ 2 > (1- τ )|d i |, also iterate into right neighbor partition Pass 2: – For each document d i, find its partition, then: Process lists in descending order of |L j | Maintain two thresholds: δ 1 – Minimum length distance to any document in the next list δ 2 – Minimum length distance to next document within the current list Break if δ 1 + δ 2 > (1- τ )|d i |, also iterate into right neighbor partition June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections the:campaign an:attack d 7 :8 d 6 :6d 7 :4 … …d 2 :6d 5 :3d 1 :3 d 1 :5d 5 :4 Partition k

Deduplication Example Deduplicate d 1 S 3 : 1) δ 1 =0, δ 2 =1 → sim(d 1,d 3 ) = 0.8 2) d 1 =d 1 → continue 3) δ 1 =4, δ 2 =0 → break! June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections pkpk p k+1 … … d 3 :5 d 2 :8d 3 :4d 1 :5 d 1 :4 Partition k s3s3 s1s1 d 3 :5d 3 :4d 1 :4 s2s2 δ 2 =1 δ 1 =4 Given: d 1 = {s 1 :5, s 2 :4, s 3 :4}, |d 1 |=13 d 2 = {s 1 :8, s 2 :4}, |d 2 |=12 d 3 = {s 1 :4, s 2 :5, s 3 :5}, |d 3 |=13 Threshold: τ = 0.8 Break if: δ 1 + δ 2 > (1- τ)|d i |

SpotSigs Deduplication Algorithm   Still O(n 2 m) worst case runtime   Empirically much better, may outperform hashing   Tuning parameters: none   Still O(n 2 m) worst case runtime   Empirically much better, may outperform hashing   Tuning parameters: none June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections   See paper for more details!

Experiments Collections – “Gold Set” of 2,160 manually selected near-duplicate news articles from various news sites, 68 clusters – TREC WT10g reference collection (1.6 Mio docs) Hardware – Dual Xeon 3GHz, 32 GB RAM – 8 threads for sorting, hashing & deduplication For all approaches – Remove HTML markup – Simple IDF filter for signatures, remove most frequent & infrequent signatures June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Competitors Shingling [Broder, Glassman, Manasse & Zweig ‘97] – N-gram sets/vectors compared with Jaccard/Cosine similarity   in between O (n 2 m) and O (n m) runtime (using LSH for matching) I-Match [Chowdhury, Frieder, Grossman & McCabe ‘02] – Employs a single SHA-1 hash function – Hardly tunable   O (n m) runtime Locality Sensitive Hashing (LSH) [Indyk, Gionis & Motwani ‘99], [Broder et al. ‘03] – Employs l (random) hash functions, each concatenating k MinHash signatures – Highly tunable   O (k l n m) runtime Hybrids of I-Match and LSH with Spot Signatures (I-Match-S & LSH-S) June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

“Gold Set” of News Articles Manually selected set of 2,160 near-duplicate news articles (LA Times, SF Chronicle, Huston Chronicle, etc.), manually clustered into 68 topic directories Huge variations in layout and ads added by different sites Macro- Avg. Cosine ≈0.64 June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

SpotSigs vs. Shingling – Gold Set Using (weighted) Jaccard similarityUsing Cosine similarity (no pruning!) June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Runtime Results – TREC WT10g June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections SpotSigs vs. LSH-S using I-Match-S as recall base

Tuning I-Match & LSH on the Gold Set   SpotSigs does not need this tuning step! June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections Tuning I-Match-S: varying IDF thresholdsTuning LSH-S: varying #hash functions l for fixed #MinHashes k = 6

Summary – Gold Set June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections Summary of algorithms at their best F1 spots (τ = 0.44 for SpotSigs & LSH)

Summary – TREC WT10g June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections Relative recall of SpotSigs & LSH using I-Match-S as recall base at τ = 1.0 and τ = 0.9

Conclusions & Outlook   Robust Spot Signatures favor natural-language page components   Full-fledged clustering algorithm, returns complete graph of all near-duplicate pairs   Efficient & self-tuning collection partitioning and inverted index pruning, highly parallelizable deduplication step   Surprising: May outperform linear-time similarity hashing approaches for reasonably high similarity thresholds   Future Work: – Efficient (sequential) index structures for disk-based storage – Tight bounds for more similarity metrics, e.g., Cosine measure June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Related Work   Shingling [Broder, Glassman, Manasse & Zweig ‘97], [Broder ‘00], [Hod & Zobel ‘03]   Random Projection [Charikar ‘02], [Henzinger ‘06]   Signatures & Fingerprinting [Manbar ‘94], [Brin, Davis & Garcia-Molina ‘95], [Shivakumar ‘95], [Manku ‘06]   Constraint-based Clustering [Klein, Kamvar & Manning ‘02], [Yang & Callan ‘06]   Similarity Hashing I-Match: [Chowdhury, Frieder, Grossman & McCabe ‘02], [Chowdhury ‘04] LSH: [Indyk & Motwani ‘98], [Indyk, Gionis & Motwani ‘99] MinHashing: [Indyk ‘01], [Broder, Charikar & Mitzenmacher ‘03]   Various filtering techniques Entropy-based: [Büttcher & Clarke ‘06] IDF, rules & constraints, … June 26, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections