Detecting Near-Duplicates for Web Crawling
Authors: Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma
Presentation by: Fernando Arreola
6/20/2011

Outline
- De-duplication
- Goal of the Paper
- Why is De-duplication Important?
- Algorithm
- Experiment
- Related Work
- Tying it Back to Lecture
- Paper Evaluation
- Questions
De-duplication
- The process of eliminating near-duplicate web documents in a generic crawl
- Identifying exact duplicates is easy: compare checksums
- The challenge is identifying near-duplicates: documents that are identical in content except for differences in small areas such as ads, counters, and timestamps
Goal of the Paper
Present a near-duplicate detection system that improves web crawling. The system has two parts:
- The simhash technique, which transforms a web page into an f-bit fingerprint
- A solution to the Hamming Distance Problem: given an f-bit fingerprint, find all fingerprints in a collection that differ from it in at most k bit positions
Why is De-duplication Important?
Eliminating near-duplicates:
- Saves network bandwidth: content similar to previously crawled content does not have to be crawled again
- Reduces storage cost: content similar to previously crawled content does not have to be stored in the local repository
- Improves the quality of search indexes: the local repository used to build search indexes is not polluted by near-duplicates
Algorithm: Simhash Technique
1. Convert the web page to a set of weighted features using information retrieval techniques (e.g. tokenization, phrase detection)
2. Hash each feature into an f-bit value
3. Start with an f-dimensional vector whose dimensions are all 0, and update it with each feature's weight:
   - If the i-th bit of the feature's hash value is zero, subtract the weight from the i-th vector component
   - If the i-th bit of the feature's hash value is one, add the weight to the i-th vector component
4. The final vector has positive and negative components; the sign (+/-) of each component gives one bit of the fingerprint
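The steps above can be sketched in Python. This is a minimal sketch, not the paper's implementation: md5 and f = 64 are assumptions, since the slides do not name a specific hash function or fingerprint width.

```python
import hashlib

def simhash(features, f=64):
    """Compute an f-bit simhash fingerprint from {feature: weight} pairs."""
    v = [0] * f  # f-dimensional vector; dimension values start at 0
    for feature, weight in features.items():
        # Hash each feature into an f-bit value (md5 is a stand-in here)
        h = int.from_bytes(hashlib.md5(feature.encode()).digest(), "big") % (1 << f)
        for i in range(f):
            if (h >> i) & 1:
                v[i] += weight   # i-th bit is one: add the weight
            else:
                v[i] -= weight   # i-th bit is zero: subtract the weight
    # The sign (+/-) of each component gives one bit of the fingerprint
    return sum(1 << i for i in range(f) if v[i] > 0)
```

Because every feature shifts every dimension by its weight, two pages sharing most weighted features end up with mostly matching sign bits, which is what makes the fingerprints comparable by Hamming distance.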
Algorithm: Simhash Technique (cont.)
A very simple example: one web page whose text is "Simhash Technique", reduced to two features hashed to 4 bits:
- "Simhash": weight = 2, hash = 1101
- "Technique": weight = 4, hash = 0110
Algorithm: Simhash Technique (cont.)
Start with a vector of all zeroes: [0, 0, 0, 0]
Algorithm: Simhash Technique (cont.)
Apply the "Simhash" feature (weight = 2):
  feature's f-bit value:   1    1    0    1
  calculation:            0+2  0+2  0-2  0+2
  resulting vector:       [2, 2, -2, 2]
Algorithm: Simhash Technique (cont.)
Apply the "Technique" feature (weight = 4):
  feature's f-bit value:   0    1     1    0
  calculation:            2-4  2+4  -2+4  2-4
  resulting vector:       [-2, 6, 2, -2]
Algorithm: Simhash Technique (cont.)
Final vector: [-2, 6, 2, -2]
The signs of the vector components are -, +, +, -
Final 4-bit fingerprint = 0110
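The worked example above can be replayed in code. This is a minimal sketch: the two 4-bit hash values come straight from the slides, and bit i is read most-significant-first so the columns line up with the tables above.

```python
def weighted_bit_sum(weighted_hashes, f=4):
    """Build the f-dimensional vector and read off the sign bits."""
    v = [0] * f
    for h, weight in weighted_hashes:
        for i in range(f):
            bit = (h >> (f - 1 - i)) & 1   # i-th bit, most significant first
            v[i] += weight if bit else -weight
    fingerprint = "".join("1" if x > 0 else "0" for x in v)
    return v, fingerprint

# "Simhash" -> 1101 with weight 2, "Technique" -> 0110 with weight 4
vector, fingerprint = weighted_bit_sum([(0b1101, 2), (0b0110, 4)])
```

Running this reproduces the slides' final vector [-2, 6, 2, -2] and fingerprint 0110.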
Algorithm: Solution to Hamming Distance Problem
Problem: given an f-bit fingerprint F, find all fingerprints in a collection that differ from F in at most k bit positions.
Solution:
- Build tables containing the fingerprints; each table t has a permutation π_t and a small integer p_t associated with it
- Apply each table's permutation to its fingerprints, then sort the table
- Store the tables in the main memory of a set of machines
- Probe the tables in parallel: in table t, find all permuted fingerprints whose top p_t bits match the top p_t bits of π_t(F)
- For each fingerprint that matched, check whether it differs from π_t(F) in at most k bits
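The probe can be sketched as follows. This is a minimal single-machine sketch under stated assumptions: the permutation is passed in as a function, f is a parameter rather than a fixed width, and the early exit relies on the table being sorted.

```python
import bisect

def build_table(fingerprints, permute):
    """Apply the table's permutation to its fingerprints, then sort."""
    return sorted(permute(fp) for fp in fingerprints)

def probe(table, permute, F, p, k, f):
    """Find permuted fingerprints that differ from pi(F) in at most k bits."""
    pf = permute(F)
    top = pf >> (f - p)                          # top p bits of pi(F)
    start = bisect.bisect_left(table, top << (f - p))
    matches = []
    for g in table[start:]:
        if g >> (f - p) != top:                  # past the matching prefix
            break
        if bin(g ^ pf).count("1") <= k:          # differ in at most k bits?
            matches.append(g)
    return matches
```

With the identity permutation as a toy case, probing `[0b1010, 0b1000, 0b0001]` for F = 0b1010 with p = 2, k = 1, f = 4 returns the two entries whose top bits match and whose distance is within k.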
Algorithm: Solution to Hamming Distance Problem (cont.)
A simple example: F = 0100 1101, k = 3, a collection of 8 fingerprints, and two tables.
Fingerprints: 1100 0101, 1111 0101, 1100 0111, 1110 1111, 1110 0000, 0001 1111, 0101 1101, 0010
Algorithm: Solution to Hamming Distance Problem (cont.)
Table 1: p = 3; π = swap the last four bits with the first four bits
  Permuted fingerprints: 0101 1100 1111 1100 0101 1110 0111
Table 2: p = 3; π = move the last two bits to the front
  Permuted fingerprints: 1011 1111 0100 1000 0111 1101 1011 0100
Algorithm: Solution to Hamming Distance Problem (cont.)
Sort each table:
Table 1 (p = 3; π = swap the last four bits with the first four bits), sorted: 0101 1100 1100 0101 1110 0111 1111
Table 2 (p = 3; π = move the last two bits to the front), sorted: 0100 1000 0111 1101 1011 0100 1011 1111
Algorithm: Solution to Hamming Distance Problem (cont.)
F = 0100 1101
Table 1: π(F) = 1101 0100; sorted entry 1100 0101 matches its top p = 3 bits (110). Match!
Table 2: π(F) = 0101 0011; sorted entry 0100 1000 matches its top p = 3 bits (010). Match!
Algorithm: Solution to Hamming Distance Problem (cont.)
With k = 3, only the matched fingerprint in the first table is a near-duplicate of F:
Table 1: π(F) = 1101 0100 and 1100 0101 differ in 2 bits (at most 3)
Table 2: π(F) = 0101 0011 and 0100 1000 differ in 4 bits (more than 3)
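The two candidate checks above can be confirmed in code, with the bit patterns taken directly from the slides:

```python
def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

# Table 1: pi(F) = 1101 0100 vs. matched entry 1100 0101
d1 = hamming_distance(0b11010100, 0b11000101)
# Table 2: pi(F) = 0101 0011 vs. matched entry 0100 1000
d2 = hamming_distance(0b01010011, 0b01001000)
```

d1 comes out to 2 (within k = 3, so a near-duplicate) and d2 to 4 (beyond k = 3, so not one), matching the slide's conclusion.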
Algorithm: Compression of Tables
1. Store the first fingerprint of a block (1024 bytes) verbatim
2. XOR the current fingerprint with the previous one
3. Append to the block the Huffman code for the position of the most significant 1 bit
4. Append to the block the bits after the most significant 1 bit
5. Repeat steps 2-4 until the block is full
To compare against a query fingerprint, use the last fingerprint (key) of each block and perform an interpolation search over the keys to find and decompress the appropriate block.
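The encoding steps can be sketched like this. This is a simplified sketch, not the paper's format: a fixed 6-bit field stands in for the Huffman code of the most-significant-1 position, and the 1024-byte block-size handling is omitted.

```python
def compress(sorted_fps, f=64):
    """Encode a sorted run of f-bit fingerprints as a bit string."""
    bits, prev = "", None
    for fp in sorted_fps:
        if prev is None:
            bits += format(fp, f"0{f}b")        # first fingerprint verbatim
        else:
            delta = fp ^ prev                    # XOR with the previous one
            msb = delta.bit_length() - 1         # position of the top 1 bit
            bits += format(msb, "06b")           # stand-in for the Huffman code
            if msb:                              # the bits after the top 1 bit
                bits += format(delta & ((1 << msb) - 1), f"0{msb}b")
        prev = fp
    return bits

def decompress(bits, n, f=64):
    """Invert compress() for a run of n fingerprints."""
    fps = [int(bits[:f], 2)]
    pos = f
    for _ in range(n - 1):
        msb = int(bits[pos:pos + 6], 2)
        pos += 6
        rest = int(bits[pos:pos + msb], 2) if msb else 0
        pos += msb
        fps.append(fps[-1] ^ ((1 << msb) | rest))
    return fps
```

Decompression walks a block forward from its first fingerprint, which is why the scheme keeps each block's last fingerprint as a key: interpolation search over the keys picks the one block worth decompressing.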
Algorithm: Extending to Batch Queries
Problem: we want near-duplicates for a batch of query fingerprints, not just one.
Solution: use the Google File System (GFS) and MapReduce
- Create two files: file F holds the collection of fingerprints and file Q holds the query fingerprints
- Store the files in GFS, which breaks them up into chunks
- Use MapReduce to solve the Hamming Distance Problem for each chunk of F against all queries in Q: MapReduce creates one task per chunk, and the chunks are processed in parallel
- Each task outputs the near-duplicates it found; the per-task outputs are combined into a sorted file, removing duplicates if necessary
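A single-machine sketch of the batch scheme, under loud assumptions: plain Python lists stand in for GFS chunks, a sequential loop stands in for parallel MapReduce tasks, and a direct Hamming check stands in for the per-chunk table probe.

```python
def hamming_distance(a, b):
    return bin(a ^ b).count("1")

def map_task(chunk, queries, k):
    """One task per chunk of F: emit (query, fingerprint) near-duplicate pairs."""
    return [(q, fp) for q in queries for fp in chunk
            if hamming_distance(q, fp) <= k]

def batch_near_duplicates(F, Q, k, chunk_size):
    """Split F into chunks, run one task per chunk, merge the outputs."""
    chunks = [F[i:i + chunk_size] for i in range(0, len(F), chunk_size)]
    outputs = []
    for chunk in chunks:                 # the real system runs tasks in parallel
        outputs.extend(map_task(chunk, Q, k))
    return sorted(set(outputs))          # sorted file, duplicates removed
```

The chunking is what makes the scheme scale: each task touches only its own chunk of F, so adding machines shortens the wall-clock time without changing the per-task logic.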
Experiment: Parameters
- 8 billion web pages used
- k = 1 ... 10
- Pairs were manually tagged as follows:
  - True positives: pairs that differ only slightly
  - False positives: radically different pairs
  - Unknown: pairs that could not be evaluated
Experiment: Results
Accuracy:
- A low k value yields many false negatives; a high k value yields many false positives
- The best value is k = 3: 75% of near-duplicates are reported, and 75% of reported cases are true positives
Running time:
- Solution to the Hamming Distance Problem: O(log(p))
- Batch query with compression: a 32 GB file and 200 tasks run in under 100 seconds
Related Work
- Clustering related documents: detect near-duplicates to show related pages
- Data extraction: determine the schema of similar pages to obtain information
- Plagiarism: detect pages that have borrowed from each other
- Spam: detect spam before the user receives it
Tying it Back to Lecture
Similarities:
- Both noted the importance of de-duplication in saving crawler resources
- Both briefly summarized several uses for near-duplicate detection
Differences:
- Lecture focus: a breadth-first look at algorithms for near-duplicate detection
- Paper focus: an in-depth look at the simhash and Hamming Distance algorithms, including how to implement them and how effective they are
Paper Evaluation: Pros
- Thorough step-by-step explanation of the algorithm implementation
- Thorough explanation of how the conclusions were reached
- Includes a brief description of how to improve the simhash + Hamming Distance algorithm (e.g. categorize web pages before running simhash, or devise an algorithm to remove ads and timestamps)
Paper Evaluation: Cons
- No comparison: how much more effective or faster is it than other algorithms? By how much did it improve the crawler?
- Batch queries are tied to a specific technology: the implementation required GFS; an approach not restricted to a particular technology might be more widely applicable
Any Questions?