Detecting Near-Duplicates for Web Crawling
Manku, Jain, Sarma
Presented by Venkatesh Katari
Overview
- Why do we care?
- Purpose of the paper
- Proposed solution for finding near-duplicates
- Pros
- Cons
- Future research
Why Do We Care?
Why do we want to detect near-duplicates?
- Save storage
- Search quality
- Web mirrors
- Clustering for “related documents” queries
- Data extraction
- Plagiarism
- Spam detection
- Duplicates in domain-specific corpora
Purpose of the Paper
This paper addresses the following issues:
- Finding near-duplicates on the web
- Handling the scale of the web
  - Tens of billions of documents indexed
  - Millions of pages crawled every day
- Which features to select when detecting duplicates
- Algorithms for single-query and batch processing
- A survey of other techniques in this field
What Are Near-Duplicates?
Documents whose content is identical except for a small portion, such as:
- Advertisements
- Counters
- Timestamps
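Simplified Crawl Architecture
[Diagram: the crawler traverses links on the Web and fetches one HTML document at a time. Each newly crawled document is checked against the entire Web index for near-duplicates. If it is a near-duplicate (Yes), it goes to the trash; if not (No), it is inserted into the index.]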
Feature Set per Document
- Shingles from page content
- Connectivity information
- Anchor text, anchor window
- Phrases
- Document vector from page content, built by:
  - case-folding
  - stop-word removal
  - stemming
  - computing term frequencies and weighting each term by its inverse document frequency
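The document-vector steps listed on this slide can be sketched roughly as follows. This is a minimal illustration, not the paper's pipeline: the stop-word list is a toy, the tokenizer regex is an assumption, stemming is left out (a real stemmer would be applied to each token), and `tokens` and `tfidf_vectors` are hypothetical names.

```python
import math
import re
from collections import Counter

# Toy stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def tokens(text):
    """Case-fold, split into alphanumeric words, drop stop words.
    Stemming is omitted in this sketch."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return one {term: tf * idf} dictionary per document."""
    tokenised = [tokens(d) for d in docs]
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for toks in tokenised:
        df.update(set(toks))
    vectors = []
    for toks in tokenised:
        tf = Counter(toks)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

if __name__ == "__main__":
    for vec in tfidf_vectors(["The cat sat", "the cat ran", "A dog ran"]):
        print(vec)
```

Terms that occur in fewer documents receive larger weights, and a term present in every document gets weight zero under this particular IDF formula.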
Simhash
- A dimensionality-reduction technique
- Produces an f-bit fingerprint for each document
- Two documents are treated as near-duplicates if and only if their fingerprints differ in at most k bit positions
- Experimental results show that f = 64 and k = 3 work well for detecting near-duplicates
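Simhash: Computing the Fingerprint
- Input: the document's features with their weights: w1, w2, ..., wn
- Hash each feature to an f-bit value (f = 6 in this illustration)
- For each bit position, contribute +w if the hash bit is 1 and -w if it is 0:
  100110 (w1):  w1 -w1 -w1  w1  w1 -w1
  110000 (w2):  w2  w2 -w2 -w2 -w2 -w2
  ...
  001001 (wn): -wn -wn  wn -wn -wn  wn
- Add the contributions column by column: 13, 108, -22, -5, -32, 55
- Take the sign of each sum (positive gives 1, negative gives 0): fingerprint 110001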
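The hash, add and sign steps can be sketched in a few lines of Python. This is an illustrative sketch only: MD5 is an arbitrary stand-in for the per-feature hash function, and `feature_hash`, `simhash` and `hamming_distance` are hypothetical names.

```python
import hashlib
from typing import Mapping

def feature_hash(feature: str, f: int = 64) -> int:
    """Hash a feature to an f-bit integer (f <= 128).
    MD5 is an assumption; any well-mixed hash would do."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << f) - 1)

def simhash(weighted_features: Mapping[str, float], f: int = 64) -> int:
    """Compute an f-bit simhash fingerprint from {feature: weight}."""
    totals = [0.0] * f
    for feature, weight in weighted_features.items():
        h = feature_hash(feature, f)
        for i in range(f):
            # Add the weight where the hash bit is 1, subtract where it is 0.
            if (h >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight
    fingerprint = 0
    for i, total in enumerate(totals):
        # The sign of each column sum decides the fingerprint bit.
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

if __name__ == "__main__":
    doc1 = {"near": 1.0, "duplicate": 2.0, "web": 1.5, "crawl": 1.0}
    doc2 = {"near": 1.0, "duplicate": 2.0, "web": 1.5, "crawler": 1.0}
    print(hamming_distance(simhash(doc1), simhash(doc2)))
```

With a single feature the fingerprint equals that feature's hash, and a heavily weighted feature dominates the signs of the column sums.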
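Method One
- Keep the fingerprints in S pre-sorted
- For a 64-bit query Q, generate every Q' with hd(Q, Q') ≤ k = 3 (hd = Hamming distance)
- Look up each Q' in S with an exact probe
- Cost: about C(64, 3) = 41,664 probes per query!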
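To make the probe count concrete, the sketch below enumerates every Q' within Hamming distance k of Q and probes each one. It assumes a Python `set` as a stand-in for the pre-sorted table S, and `neighbours`, `probe_count` and `near_duplicates_method_one` are hypothetical names. Counting distances 0 through 3 gives 43,745 probes in total, of which C(64, 3) = 41,664 flip exactly three bits.

```python
from itertools import combinations
from math import comb

def neighbours(q: int, f: int = 64, k: int = 3):
    """Yield every f-bit value within Hamming distance k of q (q included)."""
    for r in range(k + 1):
        for positions in combinations(range(f), r):
            mask = 0
            for p in positions:
                mask |= 1 << p
            yield q ^ mask

def probe_count(f: int = 64, k: int = 3) -> int:
    """Number of exact probes needed per query."""
    return sum(comb(f, r) for r in range(k + 1))

def near_duplicates_method_one(q: int, stored: set, f: int = 64, k: int = 3):
    """Probe every neighbour of q; `stored` stands in for the sorted table S."""
    return [c for c in neighbours(q, f, k) if c in stored]

if __name__ == "__main__":
    print(probe_count())   # probes per 64-bit query with k = 3
    print(comb(64, 3))     # the dominant term
```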
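Method Two
- Start from the fingerprints in S
- Precompute S': all fingerprints at most k bits away from some fingerprint in S, and sort it
- Answer a 64-bit query Q with a single exact probe into S'
- Cost: |S'| ≈ |S| · C(64, 3), so the table grows by a factor of roughly 41,664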
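Final Implementation
Observation 1:
- Consider 2^d f-bit fingerprints in sorted order
- Most of the 2^d bit combinations in the d most significant bits exist
- So an exact probe on the first d' (≤ d) bits is fast
Observation 2:
- If hd(Q, Q') = 3, then Q and Q' differ in only 3 bit positions, so when the fingerprint is split into more than 3 blocks, at least one block of Q' is an exact match for the same block of Q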
Example
[Diagram: a 64-bit query Q is matched against the fingerprints in S by exact search on 16 bits. Labels in the figure: 16-bit blocks A, B, C, D; Q1, Q2, Q3.]
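Example: Analysis
- The 64 bits are split into 4 pieces of 16 bits
- 4 tables hold permuted copies of the fingerprints, one per piece
- Each query does an exact search on 16 bits in each table
- With 2^34 (≈ 17 billion) fingerprints, each probe returns about 2^(34-16) = 2^18 ≈ 262,000 candidate fingerprints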
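The four-table scheme can be sketched as below. This is a simplified illustration, not the paper's implementation: bit rotation is used as one convenient permutation that brings each 16-bit piece to the front, and `rotate_left`, `build_tables` and `query` are hypothetical names. Because k = 3 differing bits can touch at most 3 of the 4 pieces, at least one table gives an exact match on its leading 16 bits.

```python
from bisect import bisect_left

F = 64                      # fingerprint width in bits
BLOCKS = 4                  # number of pieces / tables
BLOCK_BITS = F // BLOCKS    # 16 bits per piece
MASK = (1 << F) - 1

def rotate_left(x: int, bits: int) -> int:
    """Rotate a 64-bit value left by `bits` positions."""
    bits %= F
    return ((x << bits) | (x >> (F - bits))) & MASK

def build_tables(fingerprints):
    """Table i holds every fingerprint rotated by i * 16 bits, sorted."""
    return [sorted(rotate_left(fp, i * BLOCK_BITS) for fp in fingerprints)
            for i in range(BLOCKS)]

def query(tables, q: int, k: int = 3):
    """Return the stored fingerprints within Hamming distance k of q."""
    found = set()
    for i, table in enumerate(tables):
        shift = i * BLOCK_BITS
        rq = rotate_left(q, shift)
        prefix = rq >> (F - BLOCK_BITS)
        # Binary search for the run of entries sharing the leading 16 bits.
        lo = bisect_left(table, prefix << (F - BLOCK_BITS))
        hi = bisect_left(table, (prefix + 1) << (F - BLOCK_BITS))
        for cand in table[lo:hi]:
            # Rotation preserves Hamming distance, so compare rotated values.
            if bin(cand ^ rq).count("1") <= k:
                found.add(rotate_left(cand, F - shift))  # undo the rotation
    return found

if __name__ == "__main__":
    stored = [0x0123456789ABCDEF, 0xFFFF0000FFFF0000, 0x0]
    tables = build_tables(stored)
    q = 0x0123456789ABCDEF ^ (1 << 0) ^ (1 << 20) ^ (1 << 40)
    print([hex(x) for x in query(tables, q)])
```

The trade-off shown on the slide is visible here: only 4 probes per query, at the price of storing 4 copies of the fingerprints and checking every candidate that shares the 16-bit prefix.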
Batch Algorithm
- Tens of billions of pages are indexed
- Millions of pages are crawled each day
- Goal: quickly find all new pages that have a near-duplicate in the index
MapReduce Framework
- The MapReduce framework is used within Google
- Map phase: massively parallel; operates individually on each object in a set
- Reduce phase: aggregates the results of the mapped objects
Batch Algorithm
Suppose:
- File F: 8B existing fingerprints (~32GB after compression)
- File B: 1M batch query fingerprints (~8MB)
- F is stored in GFS, split into chunks of roughly 64MB, each replicated at 3 random nodes
- B is stored with a much higher replication factor
Batch Algorithm (continued)
Map phase:
- Detect duplicates between each chunk Fi and the whole of B
- Build multiple tables for B (in memory)
- Scan Fi and probe each fingerprint into the tables for B
- Output the near-duplicates found in B
Reduce phase:
- Merge the outputs
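The map and reduce steps can be mimicked in a single process as a rough sketch. Everything here is an assumption made for illustration: the chunks are plain Python list slices rather than GFS chunks, the map calls run sequentially rather than in parallel, the in-memory tables for B are hash tables keyed by 16-bit block rather than sorted tables, and all function names are hypothetical.

```python
from collections import defaultdict

F_BITS, BLOCKS, K = 64, 4, 3
BLOCK_BITS = F_BITS // BLOCKS
BLOCK_MASK = (1 << BLOCK_BITS) - 1

def block(fp: int, i: int) -> int:
    """Return the i-th 16-bit block of a fingerprint."""
    return (fp >> (i * BLOCK_BITS)) & BLOCK_MASK

def build_tables(batch):
    """One in-memory table per block position: block value -> fingerprints of B."""
    tables = [defaultdict(list) for _ in range(BLOCKS)]
    for q in batch:
        for i in range(BLOCKS):
            tables[i][block(q, i)].append(q)
    return tables

def map_chunk(chunk, tables):
    """Map step: scan one chunk Fi and probe each fingerprint into B's tables."""
    hits = set()
    for fp in chunk:
        for i in range(BLOCKS):
            for q in tables[i].get(block(fp, i), ()):
                if bin(fp ^ q).count("1") <= K:
                    hits.add(q)
    return hits

def reduce_outputs(outputs):
    """Reduce step: merge the per-chunk outputs."""
    merged = set()
    for out in outputs:
        merged |= out
    return merged

def batch_near_duplicates(existing, batch, chunk_size):
    """Return the fingerprints in `batch` that have a near-duplicate in `existing`."""
    tables = build_tables(batch)
    chunks = [existing[i:i + chunk_size]
              for i in range(0, len(existing), chunk_size)]
    return reduce_outputs(map_chunk(c, tables) for c in chunks)

if __name__ == "__main__":
    existing = [0x0, 0xFFFFFFFFFFFFFFFF, 0x00FF00FF00FF00FF]
    batch = [0x7, 0xF, 0xFFFFFFFFFFFFFFFE]
    print(sorted(hex(q) for q in batch_near_duplicates(existing, batch, 2)))
```

Building the tables over the small file B and streaming the large file F past them keeps the in-memory state small. With the figures on the previous slide, about 32GB in 64MB chunks would give roughly 512 independent map tasks.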
Pros
- Addresses near-duplicate detection in a web-crawling system
- Proposes algorithms for the single-query and batch cases
- Includes experiments that validate the suitability of simhash
- Includes a mini-survey of near-duplicate detection techniques
Cons
- Weight selection for the feature set
- Handling of continuously changing IDF
- How to find near-duplicates when data is present in different formats
- Inadequate results
References
- G. Manku, A. Jain, A. Das Sarma. Detecting near-duplicates for web crawling. In Proc. WWW 2007, pages 141-150, 2007.
- M. Charikar. Similarity estimation techniques from rounding algorithms. In Proc. 34th Annual Symposium on Theory of Computing (STOC 2002), pages 380-388, 2002.
- J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. 6th Symposium on Operating System Design and Implementation (OSDI 2004), pages 137-150, Dec. 2004.
- Articles from Wikipedia etc.
Future Research
- Considering document size when detecting near-duplicates
- Pruning the space of existing fingerprints
- Categorizing web pages
- Removing portions of web pages that contain ads and timestamps
Q & A