DETECTING NEAR-DUPLICATES FOR WEB CRAWLING
Authors: Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma
Presentation by: Fernando Arreola


Outline (6/20/2011)
- De-duplication
- Goal of the Paper
- Why is De-duplication Important?
- Algorithm
- Experiment
- Related Work
- Tying it Back to Lecture
- Paper Evaluation
- Questions

De-duplication
- The process of eliminating near-duplicate web documents in a generic crawl
- The challenge of near-duplicates:
  - Identifying exact duplicates is easy: use checksums
  - Identifying near-duplicates is harder: they are identical in content but differ in small areas such as ads, counters, and timestamps
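As the slide notes, exact duplicates can be caught with a simple checksum; a minimal Python sketch (SHA-1 is used here as the checksum, and the sample pages are invented for illustration) shows why the same test misses near-duplicates:

```python
import hashlib

def checksum(page: str) -> str:
    """Checksum of a page's raw content (SHA-1, purely for illustration)."""
    return hashlib.sha1(page.encode("utf-8")).hexdigest()

page_a = "Welcome! Today is 6/20/2011."
page_b = "Welcome! Today is 6/20/2011."  # byte-for-byte identical copy
page_c = "Welcome! Today is 6/21/2011."  # near-duplicate: only the timestamp changed

print(checksum(page_a) == checksum(page_b))  # True:  exact duplicate detected
print(checksum(page_a) == checksum(page_c))  # False: the near-duplicate slips through
```

Because one changed byte changes the whole checksum, catching near-duplicates calls for a similarity-preserving fingerprint such as simhash, described below.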

Goal of the Paper
- Present a near-duplicate detection system that improves web crawling
- The system includes:
  - The simhash technique: transforms a web page into an f-bit fingerprint
  - A solution to the Hamming Distance Problem: given an f-bit fingerprint, find all fingerprints in a collection that differ from it in at most k bit positions

Why is De-duplication Important?
- Eliminating near-duplicates:
  - Saves network bandwidth: content similar to previously crawled content need not be crawled
  - Reduces storage cost: such content need not be stored in the local repository
  - Improves the quality of search indexes: the local repository used to build them is not polluted by near-duplicates

Algorithm: Simhash Technique
- Convert the web page to a set of features, using information retrieval techniques (e.g., tokenization, phrase detection)
- Assign a weight to each feature
- Hash each feature into an f-bit value
- Maintain an f-dimensional vector, with every dimension starting at 0
- Update the vector with each feature's weight:
  - If the i-th bit of the feature's hash value is 0, subtract the weight from the i-th vector component
  - If the i-th bit is 1, add the weight to the i-th vector component
- The final vector has positive and negative components
- The sign (+/-) of each component gives one bit of the fingerprint

Algorithm: Simhash Technique (cont.)
- A very simple example
  - One web page, whose text is "Simhash Technique"
  - Reduced to two features:
    - "Simhash" -> weight = 2
    - "Technique" -> weight = 4
  - Hash each feature to 4 bits:
    - "Simhash" -> 1101
    - "Technique" -> 0110

Algorithm: Simhash Technique (cont.)
- Start with the vector (0, 0, 0, 0)

Algorithm: Simhash Technique (cont.)
- Apply the "Simhash" feature (weight = 2, hash = 1101) to the vector
  - The bits 1, 1, 0, 1 contribute +2, +2, -2, +2, giving the vector (2, 2, -2, 2)

Algorithm: Simhash Technique (cont.)
- Apply the "Technique" feature (weight = 4, hash = 0110) to the vector
  - The bits 0, 1, 1, 0 contribute -4, +4, +4, -4, giving the vector (-2, 6, 2, -2)

Algorithm: Simhash Technique (cont.)
- Final vector: (-2, 6, 2, -2)
- The signs of the vector components are -, +, +, -
- Final 4-bit fingerprint = 0110
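The worked example above can be reproduced with a short sketch of the simhash computation (the feature hashes and weights are taken as given, exactly as on the slides):

```python
def simhash(features, f):
    """Compute an f-bit simhash from (hash_value, weight) pairs.

    For each feature, add its weight to vector position i when bit i of the
    feature's hash is 1, and subtract it when the bit is 0 (bits counted from
    the most significant position). The sign of each final component yields
    one fingerprint bit.
    """
    v = [0] * f
    for h, w in features:
        for i in range(f):
            if (h >> (f - 1 - i)) & 1:
                v[i] += w
            else:
                v[i] -= w
    fp = 0
    for component in v:
        fp = (fp << 1) | (1 if component > 0 else 0)
    return v, fp

# The slides' example: "Simhash" (hash 1101, weight 2), "Technique" (hash 0110, weight 4)
vector, fingerprint = simhash([(0b1101, 2), (0b0110, 4)], f=4)
print(vector)                      # [-2, 6, 2, -2]
print(format(fingerprint, "04b"))  # 0110
```

A real system would use a much larger f (the paper works with 64-bit fingerprints) and derive the feature hashes from the page text.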

Algorithm: Solution to Hamming Distance Problem
- Problem: given an f-bit fingerprint F, find all fingerprints in a given collection that differ from it in at most k bit positions
- Solution:
  - Create tables containing the fingerprints
    - Each table t_i has a permutation π_i and a small integer p_i associated with it
    - Apply the table's permutation to its fingerprints
    - Sort each table
  - Store the tables in the main memory of a set of machines and probe them in parallel
    - In table t_i, find all permuted fingerprints whose top p_i bits match the top p_i bits of π_i(F)
    - For each fingerprint that matches, check whether it differs from π_i(F) in at most k bits

Algorithm: Solution to Hamming Distance Problem (cont.)
- A simple example
  - A query fingerprint F (its bits are shown in the slide's figure)
  - k = 3
  - A collection of 8 fingerprints
  - Create two tables of the fingerprints

Algorithm: Solution to Hamming Distance Problem (cont.)
- Table 1: p = 3; π = swap the last four bits with the first four bits
- Table 2: p = 3; π = move the last two bits to the front
- (The slide's figure lists the 8 fingerprints placed in each table)

Algorithm: Solution to Hamming Distance Problem (cont.)
- Apply each table's permutation to its fingerprints, then sort both tables
- (The slide's figure shows the permuted tables before and after sorting)

Algorithm: Solution to Hamming Distance Problem (cont.)
- Compute π_1(F) and π_2(F) and probe each sorted table for entries whose top p = 3 bits match
- A match is found in the first table

Algorithm: Solution to Hamming Distance Problem (cont.)
- With k = 3, only the matching fingerprint in the first table is a near-duplicate of F (it differs from F in at most 3 bit positions)
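The permute-sort-probe procedure illustrated above can be sketched in Python; the rotation permutation, the 8-bit fingerprint size, and the sample collection below are illustrative assumptions, not the values in the slides' figures:

```python
from bisect import bisect_left

def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def rotate_left(x: int, r: int, f: int) -> int:
    """A simple family of permutations: rotate the f-bit value x left by r bits."""
    mask = (1 << f) - 1
    return ((x << r) | (x >> (f - r))) & mask if r else x

def build_table(fingerprints, r, f):
    """One table of the scheme: permute every fingerprint, then sort."""
    return sorted(rotate_left(fp, r, f) for fp in fingerprints)

def probe(table, permuted_query, p, k, f):
    """Return table entries sharing the query's top p bits that lie within k bits."""
    low = permuted_query & ~((1 << (f - p)) - 1)  # smallest value with the same top p bits
    high = low | ((1 << (f - p)) - 1)             # largest such value
    i = bisect_left(table, low)                   # binary search into the sorted table
    matches = []
    while i < len(table) and table[i] <= high:
        if hamming(table[i], permuted_query) <= k:
            matches.append(table[i])
        i += 1
    return matches

# Illustrative 8-bit collection and query
f, p, k, r = 8, 3, 3, 2
collection = [0b10110101, 0b10110100, 0b01010101, 0b11100011]
table = build_table(collection, r, f)
query = rotate_left(0b10110111, r, f)
print([format(m, "08b") for m in probe(table, query, p, k, f)])
# ['11010010', '11010110'] -> the permuted forms of 10110100 and 10110101
```

In the full scheme, several tables with different permutations are probed in parallel, chosen so that any fingerprint within k bits of the query matches the top p_i bits in at least one table.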

Algorithm: Compression of Tables
- Store the first fingerprint of a block (1024 bytes) in full
- XOR the current fingerprint with the previous one
- Append to the block the Huffman code for the position of the most significant 1-bit of the XOR
- Append to the block the bits that follow the most significant 1-bit
- Repeat steps 2-4 until the block is full
- To compare against a query fingerprint:
  - Use the last fingerprint (the key) of each block and interpolation search over the keys to locate and decompress the appropriate block
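A simplified sketch of the block compression, keeping the raw XOR deltas instead of Huffman-coding the position of their most significant 1-bit (that coding step is what makes the paper's blocks compact):

```python
def compress_block(sorted_fps):
    """Store the first fingerprint whole; encode each later one as an XOR
    with its predecessor. Consecutive sorted fingerprints share high-order
    bits, so the XORs are small numbers a Huffman coder can pack tightly."""
    first = sorted_fps[0]
    deltas = [sorted_fps[i] ^ sorted_fps[i - 1] for i in range(1, len(sorted_fps))]
    return first, deltas

def decompress_block(first, deltas):
    """Undo the XOR chaining to recover the original fingerprints."""
    fps = [first]
    for d in deltas:
        fps.append(fps[-1] ^ d)
    return fps

block = [0b0001, 0b0011, 0b0111, 0b1111]
first, deltas = compress_block(block)
print(decompress_block(first, deltas) == block)  # True: lossless round trip
```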

Algorithm: Extending to Batch Queries
- Problem: we want near-duplicates for a batch of query fingerprints, not just one
- Solution: use the Google File System (GFS) and MapReduce
  - Create two files: file F holds the collection of fingerprints; file Q holds the query fingerprints
  - Store the files in GFS, which breaks them into chunks
  - Use MapReduce to solve the Hamming Distance Problem for each chunk of F against all queries in Q
    - MapReduce creates one task per chunk, and the chunks are processed in parallel
    - Each task outputs the near-duplicates it finds
  - Produce a single sorted file from the tasks' outputs, removing duplicates if necessary
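The batch scheme can be mimicked in plain Python: each "task" scans one chunk of file F against every query in file Q, and the task outputs are merged, sorted, and de-duplicated (GFS chunking and MapReduce scheduling are simulated here with list slicing and a loop):

```python
def map_task(chunk, queries, k):
    """One MapReduce task: emit (query, fingerprint) for every near-duplicate
    pair found between this chunk of F and the full query set Q."""
    return [(q, fp) for q in queries for fp in chunk
            if bin(q ^ fp).count("1") <= k]

def batch_near_duplicates(file_f, file_q, k, n_chunks=4):
    # GFS would split file F into chunks; striding over the list stands in for that.
    chunks = [file_f[i::n_chunks] for i in range(n_chunks)]
    outputs = [pair for chunk in chunks for pair in map_task(chunk, file_q, k)]
    return sorted(set(outputs))  # merge the tasks' outputs and drop duplicates

collection = [0b0000, 0b0001, 0b1111, 0b0011]
queries = [0b0011]
print(batch_near_duplicates(collection, queries, k=1))  # [(3, 1), (3, 3)]
```

Each chunk is independent of the others, which is what lets the real system fan the work out across machines.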

Experiment: Parameters
- 8 billion web pages used
- k = 1 ... 10
- Pairs were manually tagged as follows:
  - True positives: pairs that differ only slightly
  - False positives: radically different pairs
  - Unknown: pairs that could not be evaluated

Experiment: Results
- Accuracy
  - A low k value -> many false negatives
  - A high k value -> many false positives
  - Best value: k = 3
    - 75% of near-duplicates are reported
    - 75% of reported cases are true positives
- Running time
  - Hamming Distance solution: O(log(p))
  - Batch query + compression: a 32 GB file with 200 tasks runs in under 100 seconds

Related Work
- Clustering related documents: detect near-duplicates to show related pages
- Data extraction: determine the schema of similar pages to obtain information
- Plagiarism: detect pages that have borrowed from each other
- Spam: detect spam before the user receives it

Tying it Back to Lecture
- Similarities
  - Both indicated the importance of de-duplication in saving crawler resources
  - Both briefly summarized several uses of near-duplicate detection
- Differences
  - Lecture focus: a breadth-first look at algorithms for near-duplicate detection
  - Paper focus: an in-depth look at the simhash and Hamming Distance algorithms, including how to implement them and how effective they are

Paper Evaluation: Pros
- Thorough step-by-step explanation of the algorithm implementation
- Thorough explanation of how the conclusions were reached
- Brief description of how to improve the simhash + Hamming Distance approach
  - e.g., categorize web pages before running simhash, or add an algorithm to remove ads and timestamps

Paper Evaluation: Cons
- No comparison
  - How much more effective or faster is it than other algorithms?
  - By how much did it improve the crawler?
- Batch queries are tied to a specific technology
  - The implementation required the use of GFS
  - An approach not restricted to a particular technology might be more widely applicable

Any Questions?