
Slide 1: DETECTING NEAR-DUPLICATES FOR WEB CRAWLING
Authors: Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma
Presentation by: Fernando Arreola

Slide 2 (6/20/2011): Outline
- De-duplication
- Goal of the Paper
- Why is De-duplication Important?
- Algorithm
- Experiment
- Related Work
- Tying it Back to Lecture
- Paper Evaluation
- Questions

Slide 3: De-duplication
- The process of eliminating near-duplicate web documents in a generic crawl
- The challenge of near-duplicates:
  - Identifying exact duplicates is easy: use checksums
  - How do we identify near-duplicates? They are identical in content except for differences in small areas, such as ads, counters, and timestamps

Slide 4: Goal of the Paper
- Present a near-duplicate detection system that improves web crawling
- The system includes:
  - The simhash technique: transforms a web page into an f-bit fingerprint
  - A solution to the Hamming Distance Problem: given an f-bit fingerprint, find all fingerprints in a collection that differ from it in at most k bit positions

Slide 5: Why is De-duplication Important? Eliminating near-duplicates:
- Saves network bandwidth: content similar to previously crawled content need not be crawled again
- Reduces storage cost: such content need not be stored in the local repository
- Improves the quality of search indexes: the local repository used to build search indexes is not polluted by near-duplicates

Slide 6: Algorithm: Simhash Technique
- Convert the web page to a set of features using information retrieval techniques (e.g., tokenization, phrase detection)
- Assign a weight to each feature
- Hash each feature to an f-bit value
- Maintain an f-dimensional vector whose components start at 0
- Update the vector with each feature's weight:
  - If the i-th bit of the hash value is 0, subtract the feature's weight from the i-th vector component
  - If the i-th bit of the hash value is 1, add the feature's weight to the i-th vector component
- The resulting vector has positive and negative components
- The sign (+/-) of each component gives the corresponding bit of the fingerprint

Slide 7: Algorithm: Simhash Technique (cont.)
- A very simple example: one web page with the text "Simhash Technique"
- Reduced to two features:
  - "Simhash" -> weight = 2
  - "Technique" -> weight = 4
- Hash each feature to 4 bits:
  - "Simhash" -> 1101
  - "Technique" -> 0110

Slide 8: Algorithm: Simhash Technique (cont.)
- Start with a vector of all zeros: [0, 0, 0, 0]

Slide 9: Algorithm: Simhash Technique (cont.)
- Apply the "Simhash" feature (hash 1101, weight 2):
  [0 + 2, 0 + 2, 0 - 2, 0 + 2] = [2, 2, -2, 2]

Slide 10: Algorithm: Simhash Technique (cont.)
- Apply the "Technique" feature (hash 0110, weight 4):
  [2 - 4, 2 + 4, -2 + 4, 2 - 4] = [-2, 6, 2, -2]

Slide 11: Algorithm: Simhash Technique (cont.)
- Final vector: [-2, 6, 2, -2]
- The signs of the components are -, +, +, -
- Final 4-bit fingerprint = 0110
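The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation; it takes precomputed (hash, weight) pairs and uses the toy 4-bit values from the slides.

```python
def simhash(features, f):
    """Compute an f-bit simhash fingerprint.

    features: list of (f-bit feature hash, weight) tuples.
    """
    v = [0] * f                                  # f-dimensional vector, all zeros
    for h, w in features:
        for i in range(f):
            bit = (h >> (f - 1 - i)) & 1         # i-th bit, most significant first
            v[i] += w if bit else -w             # add weight on 1, subtract on 0
    fp = 0
    for component in v:
        fp = (fp << 1) | (1 if component > 0 else 0)  # sign -> fingerprint bit
    return fp

# The slides' example: "Simhash" hashes to 1101 (weight 2),
# "Technique" hashes to 0110 (weight 4).
fingerprint = simhash([(0b1101, 2), (0b0110, 4)], f=4)
print(format(fingerprint, "04b"))                # prints 0110, as on slide 11
```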

Slide 12: Algorithm: Solution to the Hamming Distance Problem
- Problem: given an f-bit fingerprint F, find all fingerprints in a collection that differ from F in at most k bit positions
- Solution:
  - Build tables containing the fingerprints; each table i has a permutation (π_i) and a small integer (p_i) associated with it
  - Apply each table's permutation to its fingerprints, then sort each table
  - Store the tables in the main memory of a set of machines and probe the tables in parallel:
    - In table i, find all permuted fingerprints whose top p_i bits match the top p_i bits of π_i(F)
    - For each fingerprint that matches, check whether it differs from π_i(F) in at most k bits

Slide 13: Solution to the Hamming Distance Problem (cont.)
- A simple example: F = 0100 1101, k = 3
- A collection of 8 fingerprints: 1100 0101, 1111 0101, 1100 0111, 1110 1111, 1110 0000, 0001 1111, 0101 1101, 0010 …
- Create two tables

Slide 14: Solution to the Hamming Distance Problem (cont.)
- Table 1 (p = 3; π = swap the last four bits with the first four bits), permuted fingerprints: 0101 1100, 1111 1100, 0101 1110, 0111 …
- Table 2 (p = 3; π = move the last two bits to the front), permuted fingerprints: 1011 1111, 0100 1000, 0111 1101, 1011 0100

Slide 15: Solution to the Hamming Distance Problem (cont.)
- Sort each table:
  - Table 1 sorted: 0101 1100, 1100 0101, 1110 0111, 1111 …
  - Table 2 sorted: 0100 1000, 0111 1101, 1011 0100, 1011 1111

Slide 16: Solution to the Hamming Distance Problem (cont.)
- F = 0100 1101
- Table 1: π(F) = 1101 0100; Table 2: π(F) = 0101 0011
- Probe each sorted table for permuted fingerprints whose top p = 3 bits match those of π(F) — a match is found

Slide 17: Solution to the Hamming Distance Problem (cont.)
- With k = 3, only a fingerprint in the first table is a near-duplicate of F: the permuted fingerprint 1100 0101 differs from π(F) = 1101 0100 in just 2 bit positions
- The second table's candidate 0100 1000 differs from π(F) = 0101 0011 in 4 bit positions, so it is rejected
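The table-probing scheme above can be sketched as follows. This is a toy Python illustration with 8-bit fingerprints and the two permutations from the slides; the collection used in the demo call is hypothetical, and a real system would binary-search the sorted tables rather than scan them.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def swap_halves(x: int) -> int:
    """Permutation 1 from the slides: swap the last four bits with the first four."""
    return ((x & 0x0F) << 4) | (x >> 4)

def rotate_last_two(x: int) -> int:
    """Permutation 2 from the slides: move the last two bits to the front."""
    return ((x & 0b11) << 6) | (x >> 2)

def near_duplicates(collection, query, k, perms, p, f=8):
    """Find all fingerprints within Hamming distance k of `query` by probing
    one sorted table of permuted fingerprints per permutation."""
    results = set()
    for perm in perms:
        # Build the sorted table of (permuted fingerprint, original fingerprint).
        table = sorted((perm(fp), fp) for fp in collection)
        pq = perm(query)
        top = pq >> (f - p)                      # top-p bits of the permuted query
        for pfp, fp in table:                    # a real system binary-searches here
            if pfp >> (f - p) == top and hamming(pfp, pq) <= k:
                results.add(fp)
    return results

# Toy run in the spirit of the slides' example: query F = 0100 1101, k = 3,
# with a hypothetical two-fingerprint collection.
perms = [swap_halves, rotate_last_two]
found = near_duplicates([0b11001101, 0b10110010], 0b01001101, k=3, perms=perms, p=3)
```

Note that with only two permutations some near-duplicates can be missed when the differing bits fall inside every probed prefix; the paper chooses enough permutations that every pattern of k differing bits avoids the top p bits of at least one table.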

Slide 18: Algorithm: Compression of Tables
1. Store the first fingerprint of a block (1024 bytes) verbatim
2. XOR the current fingerprint with the previous one
3. Append to the block the Huffman code for the position of the most significant 1 bit of the XOR
4. Append to the block the bits after the most significant 1 bit
5. Repeat steps 2-4 until the block is full
- To answer a query, use the last fingerprint (the key) of each block: interpolation-search the keys, then decompress the appropriate block
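Steps 1-5 amount to delta-encoding a sorted run of fingerprints. The sketch below uses 8-bit fingerprints for readability (the paper uses 64-bit ones) and a fixed-width 3-bit code where the paper uses a Huffman code for the most-significant-bit position; it also assumes the fingerprints are sorted and distinct, so each XOR delta is nonzero.

```python
F = 8  # fingerprint width in bits (64 in the paper; 8 here for readability)

def compress_block(fingerprints):
    """Delta-compress a sorted run of distinct fingerprints into a bit string.

    The first fingerprint is stored whole; each successor is encoded as
    h = position of the most significant 1 bit of (current XOR previous),
    followed by the h bits below that position.
    """
    bits = format(fingerprints[0], f"0{F}b")         # first fingerprint verbatim
    for prev, cur in zip(fingerprints, fingerprints[1:]):
        delta = prev ^ cur                           # nonzero for distinct inputs
        h = delta.bit_length() - 1                   # MSB position of the XOR
        bits += format(h, "03b")                     # fixed-width stand-in for Huffman
        if h:
            bits += format(delta & ((1 << h) - 1), f"0{h}b")  # bits below the MSB
    return bits

def decompress_block(bits, n):
    """Invert compress_block, recovering n fingerprints."""
    fps = [int(bits[:F], 2)]
    pos = F
    while len(fps) < n:
        h = int(bits[pos:pos + 3], 2); pos += 3
        low = int(bits[pos:pos + h], 2) if h else 0; pos += h
        delta = (1 << h) | low                       # re-attach the implicit MSB
        fps.append(fps[-1] ^ delta)
    return fps
```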

Slide 19: Algorithm: Extending to Batch Queries
- Problem: we want near-duplicates for a batch of query fingerprints, not just one
- Solution: use the Google File System (GFS) and MapReduce
  - Create two files: file F holds the collection of fingerprints, file Q the query fingerprints
  - Store both files in GFS, which breaks them into chunks
  - Use MapReduce to solve the Hamming Distance Problem for each chunk of F against all queries in Q: MapReduce creates one task per chunk, and the chunks are processed in parallel
  - Each task outputs the near-duplicates it finds; the task outputs are combined into a single sorted file, removing duplicates if necessary
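The flow above can be imitated in a single-process Python sketch, where list chunking stands in for GFS and a sequential loop stands in for the parallel MapReduce tasks; the function names are illustrative, not from the paper.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def map_task(chunk, queries, k):
    """One task: scan one chunk of file F against every query in Q."""
    out = []
    for q in queries:
        for fp in chunk:
            if hamming(fp, q) <= k:
                out.append((q, fp))                  # (query, near-duplicate) pair
    return out

def batch_near_duplicates(fingerprints, queries, k, chunk_size):
    """Split F into chunks (as GFS would), run one task per chunk
    (sequentially here; in parallel under real MapReduce), then
    merge the outputs into one sorted, duplicate-free list."""
    chunks = [fingerprints[i:i + chunk_size]
              for i in range(0, len(fingerprints), chunk_size)]
    merged = []
    for chunk in chunks:                             # MapReduce runs these in parallel
        merged.extend(map_task(chunk, queries, k))
    return sorted(set(merged))                       # sorted output, duplicates removed
```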

Slide 20: Experiment: Parameters
- 8 billion web pages used
- k = 1 … 10
- Pairs were manually tagged as follows:
  - True positives: pairs that differ only slightly
  - False positives: radically different pairs
  - Unknown: pairs that could not be evaluated

Slide 21: Experiment: Results
- Accuracy:
  - A low k value yields many false negatives; a high k value yields many false positives
  - The best value is k = 3: 75% of near-duplicates are reported, and 75% of reported cases are true positives
- Running time:
  - Hamming Distance solution: O(log(p))
  - Batch query + compression: a 32 GB file with 200 tasks runs in under 100 seconds

Slide 22: Related Work
- Clustering related documents: detect near-duplicates to show related pages
- Data extraction: determine the schema of similar pages to extract information
- Plagiarism: detect pages that have borrowed from each other
- Spam: detect spam before the user receives it

Slide 23: Tying it Back to Lecture
- Similarities:
  - Both stress the importance of de-duplication for saving crawler resources
  - Both briefly survey several uses of near-duplicate detection
- Differences:
  - Lecture focus: a breadth-first look at algorithms for near-duplicate detection
  - Paper focus: an in-depth look at simhash and the Hamming Distance algorithm, including how to implement them and how effective they are

Slide 24: Paper Evaluation: Pros
- Thorough step-by-step explanation of the algorithm implementation
- Thorough explanation of how the conclusions were reached
- Brief discussion of how to improve the simhash + Hamming Distance approach: categorize web pages before running simhash, create an algorithm to remove ads or timestamps, etc.

Slide 25: Paper Evaluation: Cons
- No comparison: how much more effective or faster is it than other algorithms? By how much did it improve the crawler?
- Batch queries are tied to a specific technology: the implementation requires GFS; an approach not restricted to a particular technology might be more broadly applicable

Slide 26: Any Questions?

