1
Detecting Near Duplicates for Web Crawling. Authors: Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma. Presented by Chintan Udeshi, 6/28/2011, Udeshi-CS572.
2
Introduction: The web contains many duplicate documents. Many pages differ only in a small portion, for example because of the advertisements displayed. Such near-duplicate pages are irrelevant from a crawling point of view. This paper uses Charikar's fingerprinting technique (simhash) to find them. The technique is useful for both online queries and batch queries.
3
Advantages of duplicate detection: Saves bandwidth. Reduces storage cost. Improves the quality of the search engine. Reduces the load on remote hosts.
4
Limitations of duplicate detection: Scaling to a very large number of pages. Speed of detection. Need to use few resources.
5
Fingerprinting with simhash: Extract a set of features from a document, along with a corresponding weight for each feature. We use simhash to generate an f-bit fingerprint from the weighted features of a given document. With simhash, 64-bit fingerprints are good enough for 8B web pages.
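A minimal Python sketch of the fingerprinting step described on this slide. The per-feature hash (MD5 truncated to f bits), the function name `simhash`, and the example features and weights are assumptions made for illustration; they are not taken from the paper.

```python
import hashlib

def simhash(features, f=64):
    """Return an f-bit fingerprint for a dict mapping feature -> weight.

    Assumption: each feature is hashed with MD5 truncated to f bits,
    so f must be a multiple of 8 and at most 128.
    """
    v = [0.0] * f  # one signed counter per bit position
    for feature, weight in features.items():
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[: f // 8], "big")
        for i in range(f):
            # A 1-bit votes +weight for this position, a 0-bit votes -weight.
            if (h >> i) & 1:
                v[i] += weight
            else:
                v[i] -= weight
    fingerprint = 0
    for i in range(f):
        if v[i] > 0:  # positive total -> bit i of the fingerprint is 1
            fingerprint |= 1 << i
    return fingerprint

# Hypothetical documents that share most of their weighted features.
doc_a = {"near": 1.0, "duplicate": 2.0, "web": 1.0, "crawling": 1.0}
doc_b = {"near": 1.0, "duplicate": 2.0, "web": 1.0, "crawler": 1.0}
fp_a, fp_b = simhash(doc_a), simhash(doc_b)
```

Because every feature votes on every bit, heavily weighted shared features tend to dominate the sign of each counter, which is why similar documents tend to receive fingerprints that differ in few bits.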
6
Idea behind using the simhash algorithm: Simhash has 2 properties. A: The fingerprint of a document is a hash of its features. B: Similar documents have similar hash values. Our algorithms are designed assuming that Property A holds, and we experimentally measure the impact of the non-uniformity introduced by Property B on real datasets.
7
Hamming distance problem: Consider a collection of 8B 64-bit fingerprints, occupying 64GB. Given a query fingerprint F, we have to decide whether any of the existing 8B fingerprints differs from F in at most k = 3 bit positions. The algorithm is different for online queries and batch queries.
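To make the problem statement concrete, here is a brute-force Python sketch. The function names are hypothetical, and the linear scan over all stored fingerprints is only a baseline for small inputs; it is not the table-based method from the paper.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def is_near_duplicate(query, existing, k=3):
    """True if any stored fingerprint is within k bits of the query.

    Brute force: one comparison per stored fingerprint.
    """
    return any(hamming_distance(query, fp) <= k for fp in existing)
```

With 8B stored fingerprints, a scan like this touches all 64GB for every query, which motivates the separate online and batch algorithms.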
8
Algorithm for online queries: We build t tables: T1, T2, ..., Tt. Table Ti is constructed by applying a permutation to each existing fingerprint. A query for a fingerprint F has 2 steps: First, identify all permuted fingerprints in Ti whose top Pi bit positions match the top Pi bit positions of the permuted F. Then, check whether each of these candidates differs from the permuted F in at most k bit positions.
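The two query steps can be sketched in Python as follows. This is one possible instance under stated assumptions: 64-bit fingerprints are split into 4 blocks of 16 bits, table i uses a bit rotation as its permutation so that block i moves to the top, each table is kept sorted, and 16 top bits are matched exactly in every table. With k = 3, at most 3 blocks can contain a differing bit, so at least one block is identical and that table finds the match. All names are hypothetical.

```python
import bisect

F = 64            # fingerprint width in bits
BLOCKS = 4        # number of tables t (assumed design)
P = F // BLOCKS   # top bits matched exactly in each table
MASK = (1 << F) - 1

def rotate_left(x, r):
    """Rotate an F-bit value left by r bits (a permutation of bit positions)."""
    r %= F
    return ((x << r) | (x >> (F - r))) & MASK

def build_tables(fingerprints):
    """Table i holds every fingerprint rotated so block i is on top, sorted."""
    return [sorted(rotate_left(fp, i * P) for fp in fingerprints)
            for i in range(BLOCKS)]

def query(tables, fp, k=3):
    """Return stored fingerprints within Hamming distance k of fp."""
    matches = set()
    for i, table in enumerate(tables):
        q = rotate_left(fp, i * P)
        prefix = q >> (F - P)
        # Step 1: binary search for the run of entries sharing the top P bits.
        lo = bisect.bisect_left(table, prefix << (F - P))
        hi = bisect.bisect_left(table, (prefix + 1) << (F - P))
        # Step 2: check each candidate's full Hamming distance.
        for cand in table[lo:hi]:
            if bin(cand ^ q).count("1") <= k:
                matches.add(rotate_left(cand, F - i * P))  # undo the rotation
    return matches

stored = [0x0123456789ABCDEF, 0xFFFFFFFFFFFFFFFF, 0]
tables = build_tables(stored)
near = query(tables, 0x0123456789ABCDEF ^ 0b111)  # query differs in 3 low bits
```

Rotation preserves Hamming distance, so the distance check can be done directly on the permuted values.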
9
Design parameters for the algorithm: There is a trade-off between the number of tables and the value of Pi chosen for each table. Increasing the number of tables increases Pi and hence reduces the query time. Decreasing the number of tables reduces storage requirements, but reduces Pi and thus increases the query time.
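A rough way to see why a larger Pi shortens queries, sketched in Python. It assumes the stored fingerprints are spread uniformly over their top Pi bits, so a probe into one table returns about n / 2^Pi candidates that still need a full Hamming distance check. The function name and the sample values of Pi are illustrative assumptions.

```python
def expected_candidates(num_fingerprints, p):
    """Expected entries sharing a given p-bit prefix, assuming uniform prefixes."""
    return num_fingerprints / 2 ** p

n = 8_000_000_000  # 8B stored fingerprints, as on the earlier slide
for p in (16, 24, 32):  # illustrative values of Pi
    print(p, expected_candidates(n, p))
```

Each additional matched bit halves the expected number of candidates per probe, at the cost of needing more tables to keep every near duplicate reachable.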
10
Algorithm for batch queries: Files are first broken into 64 MB chunks. Each chunk is replicated at three randomly chosen machines in a cluster and is stored as a file in the local file system. First, we solve the Hamming distance problem for each 64 MB chunk. Later, we combine the outputs from all the chunks to produce the final output.
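The chunk-then-combine structure can be sketched in Python as below. In-memory lists stand in for the 64 MB chunk files, a plain loop stands in for the parallel per-chunk tasks, each chunk is solved by brute force, and replication is not modeled. All names and the tiny chunk size are assumptions for illustration.

```python
def solve_chunk(chunk, queries, k=3):
    """Per-chunk step: all (query, stored) pairs within Hamming distance k."""
    return {(q, fp) for q in queries for fp in chunk
            if bin(q ^ fp).count("1") <= k}

def batch_near_duplicates(fingerprints, queries, chunk_size, k=3):
    """Split the stored fingerprints into chunks, solve each, merge the results."""
    chunks = [fingerprints[i:i + chunk_size]
              for i in range(0, len(fingerprints), chunk_size)]
    results = set()
    for chunk in chunks:  # in a cluster, each chunk would be a separate task
        results |= solve_chunk(chunk, queries, k)
    return results

stored = [0b0000, 0b1111, 0b1000_0000, 0b1111_1111]
pairs = batch_near_duplicates(stored, queries=[0b0001], chunk_size=2, k=1)
```

Because each chunk is processed independently, the final answer is simply the union of the per-chunk outputs.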
11
Broder's shingle-based fingerprints: Broder's shingle-based fingerprints use Rabin fingerprints. Given an n-bit message m_0, ..., m_{n-1}, view it as a polynomial f(x) over GF(2) whose coefficients are the message bits. The fingerprint of m is the remainder r(x) after division of f(x) by a fixed irreducible polynomial p(x).
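The remainder computation can be sketched in Python with polynomials over GF(2) stored as integers, one bit per coefficient. The bit order (first message bit is the highest-degree coefficient), the function name, and the example modulus 0x11B (x^8 + x^4 + x^3 + x + 1, a well-known irreducible polynomial of degree 8) are assumptions for illustration; practical Rabin fingerprints use a randomly chosen irreducible polynomial of higher degree.

```python
def rabin_fingerprint(message: bytes, poly: int) -> int:
    """Remainder of the message polynomial f(x) divided by p(x) over GF(2)."""
    degree = poly.bit_length() - 1
    r = 0
    for byte in message:
        for bit in range(7, -1, -1):
            # Multiply the running remainder by x and add the next message bit.
            r = (r << 1) | ((byte >> bit) & 1)
            # If the degree reached deg(p), subtract p(x); over GF(2) this is XOR.
            if r >> degree:
                r ^= poly
    return r

fp = rabin_fingerprint(b"near duplicate", 0x11B)
```

The result always has degree below that of p(x), so it fits in as many bits as the degree of the modulus.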
12
Comparison with Broder's shingle-based fingerprints: For the comparison, 6 Rabin fingerprints are calculated per document. Two documents are then checked to see whether 2 or more of these fingerprints match. The resulting shingle-based fingerprint takes approximately 24 bytes per document. By contrast, simhash needs only a 64-bit fingerprint per page for 8B web pages.
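A short Python sketch of the storage arithmetic and of the "2 or more of 6" test. It assumes the 24 bytes cover all 6 Rabin fingerprints of one document and that fingerprints are compared position by position; both are assumptions made for illustration, as are the names.

```python
PAGES = 8_000_000_000      # 8B web pages
SHINGLE_BYTES = 24         # assumed: 6 Rabin fingerprints per document in total
SIMHASH_BYTES = 64 // 8    # one 64-bit simhash fingerprint

def total_storage_gb(pages, bytes_per_page):
    """Total fingerprint storage in GB (10^9 bytes)."""
    return pages * bytes_per_page / 1e9

def shingles_match(a, b, min_matches=2):
    """True if at least min_matches fingerprints agree at the same position."""
    return sum(x == y for x, y in zip(a, b)) >= min_matches

shingle_gb = total_storage_gb(PAGES, SHINGLE_BYTES)
simhash_gb = total_storage_gb(PAGES, SIMHASH_BYTES)
```

Under these assumptions the shingle-based scheme needs three times the storage of 64-bit simhash fingerprints for the same corpus.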
13
Experimental results: There is a trade-off between f and k when detecting duplicate web pages with simhash. Topics include: Choice of parameters. Distribution of fingerprints. Scalability.
14
Choice of parameters: Vary k from 1 to 10. Divide the results into categories: false positive, true positive, unknown. There is a trade-off. k = 3 gives reasonable results for 64-bit fingerprints.
15
Distribution of fingerprints (1): The left side of the plot does not drop as rapidly as the right side. This is because some pages are similar to each other, so their fingerprints differ by only a moderate number of bits.
16
Distribution of fingerprints (2): More or less uniform, with spikes in some places. Reasons: Empty pages. "File not found" pages. Multiple websites using a similar login page.
17
Nature of corpus: Corpora are mainly divided into 4 types: Web pages. Files in a file system. E-mail. Domain-specific corpora. This paper mainly involves finding near duplicates among web pages.
18
Scalability: For batch mode, the compressed version of file Q occupies almost 32GB. The file is usually processed at a rate of approximately 1GBps. So, the computation usually finishes in about 100 seconds.
19
Need to detect duplicates: Web mirrors. Clustering for "related documents" queries. Data extraction. Plagiarism. Spam detection. Duplicates in domain-specific corpora.
20
Feature set per document: Shingles from page content. Document vector from page content. Connectivity information. Anchor text and anchor window. Phrases.
21
Future research: Can we categorize web pages into categories and search for near duplicates only within the relevant categories? Is it feasible to devise algorithms for detecting portions of web pages that contain ads or timestamps? How sensitive is the simhash algorithm to changes in feature selection and in the assignment of weights to features? Algorithms for clustering the documents. Can we categorize documents based on language?
22
Thank you. Q & A?