Detecting Near Duplicates for Web Crawling
Authors: Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma
Published in May 2007
Presented by: Shruthi Venkateswaran

Near-duplicate documents: two documents that are identical in content but differ in small portions of the page, such as advertisements, counters, and timestamps. These differences are irrelevant for web search, so if a newly crawled web page is deemed a near-duplicate of an already crawled page, the crawl engine should ignore the duplicate and all its outgoing links. Eliminating near-duplicates saves network bandwidth, reduces storage costs, and improves the quality of search indexes. It also reduces the load on the remote hosts that serve such pages.

Challenges faced in detecting near-duplicate pages:
Scale: search engines index billions of web pages, which amounts to a multi-terabyte database, and the crawl engine should be able to crawl billions of web pages a day. So the decision to mark a newly crawled web page as a near-duplicate of an existing page must be made quickly.
The system should use as few machines as possible.

Major contributions of this paper:
It establishes that Charikar's simhash is practically useful for identifying near-duplicates in web documents belonging to a multi-billion-page repository. simhash is a fingerprinting technique with the property that fingerprints of near-duplicates differ in a small number of bit positions.
It presents a technique for solving the Hamming Distance Problem, useful for both online queries (single fingerprints) and batch queries (multiple fingerprints).
It also surveys algorithms and techniques for duplicate detection.

So what is so special about simhash? A conventional hash function maps even slightly different inputs to totally different hash values. simhash is one where similar items are hashed to similar hash values: the more similar two items are, the smaller the bitwise Hamming distance between their fingerprints.

Simhash Calculation
The simhash of a phrase is calculated as follows:
1) Pick a hash size, say 32 bits.
2) Let V = [0] * 32 (i.e., a vector of 32 zeros).
3) Break the phrase up into features, e.g., the shingles of 'the cat sat on the mat'.
4) Hash each feature using a normal 32-bit hash algorithm.
5) For each hash, if bit i of the hash is set, add 1 to V[i]; if bit i is not set, subtract 1 from V[i].
6) Bit i of the simhash is 1 if V[i] > 0, and 0 otherwise.
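
For concreteness, here is a minimal runnable Python sketch of these six steps. The choice of character bigrams as features and of a truncated MD5 as the "normal 32-bit hash" are illustrative assumptions, not the paper's exact choices:

    import hashlib

    def shingles(text, width=2):
        # Step 3: break the phrase into overlapping character bigrams.
        return [text[i:i + width] for i in range(len(text) - width + 1)]

    def hash32(feature):
        # Step 4: a "normal" 32-bit hash (MD5 truncated to 32 bits here).
        return int(hashlib.md5(feature.encode()).hexdigest(), 16) & 0xFFFFFFFF

    def simhash(text, bits=32):
        v = [0] * bits                       # Step 2
        for feature in shingles(text):
            h = hash32(feature)
            for i in range(bits):
                # Step 5: +1 where the hash bit is set, -1 where it is not.
                v[i] += 1 if (h >> i) & 1 else -1
        # Step 6: bit i of the simhash is 1 iff v[i] > 0.
        return sum(1 << i for i in range(bits) if v[i] > 0)

Near-duplicate phrases share most of their features, so most counters in V are pushed in the same direction, and the resulting fingerprints agree in most bit positions.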

THE HAMMING DISTANCE PROBLEM
Definition: given a collection of f-bit fingerprints and a query fingerprint F, identify whether an existing fingerprint differs from F in at most k bits. (In the batch-mode version of the problem, we have a set of query fingerprints instead of a single query fingerprint.)
In the online version of the problem, for a query fingerprint F, we have to ascertain within a few milliseconds whether any of the existing 8B 64-bit fingerprints differs from F in at most k = 3 bit positions. How can this be done?
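
Checking a single pair of fingerprints is easy; in Python, for example, the Hamming distance is just the population count of the XOR:

    def hamming_distance(a, b):
        # Number of bit positions in which fingerprints a and b differ.
        return bin(a ^ b).count('1')

    # An existing fingerprint f matches the query F if hamming_distance(f, F) <= k.

The difficulty is running this test against 8B stored fingerprints within a few milliseconds; a linear scan is far too slow.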

Two obvious approaches: build a sorted table of all existing fingerprints, OR pre-compute all fingerprints f' such that some existing fingerprint is at most Hamming distance k away from f'. Both approaches have high space and time complexity. So what is the solution?
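
A quick back-of-the-envelope calculation shows why. The number of 64-bit fingerprints within Hamming distance k = 3 of a given fingerprint is the number of ways to flip up to 3 of its 64 bits:

    from math import comb

    variants = sum(comb(64, d) for d in range(4))  # d = 0, 1, 2, 3
    print(variants)  # 43745

So probing a sorted table requires tens of thousands of lookups per query, and pre-computation inflates each of the 8B stored fingerprints into tens of thousands of entries.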

Algorithm for Online Queries
Build t tables. Associated with each table are two quantities: an integer i and a permutation P over the f bit positions. A table T is constructed by applying permutation P to each existing fingerprint; the resulting set of permuted f-bit fingerprints is sorted. Each table is compressed and stored in the main memory of a set of machines.
Given a fingerprint F and an integer k, we probe these tables in parallel:
Step 1: identify all permuted fingerprints in table T whose top i bit positions match the top i bit positions of P(F).
Step 2: for each of the permuted fingerprints identified in Step 1, check whether it differs from P(F) in at most k bit positions.
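
A single-table sketch in Python may make this concrete. This is hypothetical: the permutation encoding and the bit-by-bit shuffling below are illustrative choices, and the real system builds t compressed tables with carefully chosen (i, P) pairs, probed in parallel across machines:

    import bisect

    F_BITS = 64

    def permute(fp, perm):
        # Apply permutation P to the bits of an f-bit fingerprint.
        # perm[j] names the source bit position (0 = most significant)
        # that moves to position j of the output.
        out = 0
        for j, src in enumerate(perm):
            bit = (fp >> (F_BITS - 1 - src)) & 1
            out |= bit << (F_BITS - 1 - j)
        return out

    class Table:
        def __init__(self, fingerprints, perm, i):
            self.perm, self.i = perm, i
            # Permute every existing fingerprint, then sort the results.
            self.rows = sorted(permute(f, perm) for f in fingerprints)

        def probe(self, query, k):
            pf = permute(query, self.perm)
            shift = F_BITS - self.i
            # Step 1: range scan for rows whose top i bits match pf's.
            lo = bisect.bisect_left(self.rows, (pf >> shift) << shift)
            hi = bisect.bisect_left(self.rows, ((pf >> shift) + 1) << shift)
            # Step 2: keep rows within Hamming distance k of pf.
            return [r for r in self.rows[lo:hi]
                    if bin(r ^ pf).count('1') <= k]

The point of using t tables is coverage: the permutations are chosen so that any fingerprint within Hamming distance k of F agrees with P(F) in the top i bit positions of at least one table, while i is kept large enough that each Step 1 range scan returns only a few candidates.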

Algorithm for Batch Queries
In the batch version of the Hamming Distance Problem, we have a batch of query fingerprints instead of a solitary query fingerprint. Assume that existing fingerprints are stored in file F and that the batch of query fingerprints is stored in file Q. With 8B 64-bit fingerprints, file F occupies 64 GB; compression shrinks it to less than 32 GB. A batch has on the order of 1M fingerprints, so assume file Q occupies 8 MB.
Using the MapReduce framework, the overall computation splits conveniently into two phases. In the first phase, there are as many computational tasks as there are chunks of F (in MapReduce terminology, such tasks are called mappers). Each task solves the Hamming Distance Problem over some 64-MB chunk of F and the entire file Q as inputs, and produces as its output the list of near-duplicate fingerprints it discovers.
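
A hypothetical sketch of the two phases in Python (the brute-force inner scan here stands in for the permuted-table algorithm that the real mappers would run over each chunk):

    def hamming(a, b):
        return bin(a ^ b).count('1')

    def mapper(chunk_of_F, Q, k=3):
        # Phase 1: one mapper per 64-MB chunk of F; each mapper matches
        # the entire query file Q against its chunk.
        for q in Q:
            for f in chunk_of_F:
                if hamming(f, q) <= k:
                    yield (q, f)

    def reduce_phase(mapper_outputs):
        # Phase 2: collect all outputs, remove duplicates, emit a single
        # sorted result.
        return sorted({pair for out in mapper_outputs for pair in out})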

In the second phase, MapReduce collects all the outputs, removes duplicates, and produces a single sorted file. MapReduce strives to maximize locality: most mappers are co-located with the machines that hold the chunks assigned to them, which avoids shipping chunks over the network. In addition, file Q is placed in a GFS directory with a replication factor far greater than three, so copying file Q to the various mappers does not become a bottleneck.

EXPERIMENTAL RESULTS
No previous work has studied the trade-off between f and k for the purpose of detecting near-duplicate web pages using simhash. Is simhash a reasonable fingerprinting technique for near-duplicate detection? Experiments were conducted to study exactly this. Let's see what they were.

Choice of Parameters
With 8B simhash fingerprints, k was varied from 1 to 10.
Distribution of Fingerprints
The fingerprints are more or less equally spaced out.
Scalability
Compression plays an important role in speedup because, for a fixed number of mappers, the running time is governed by the amount of data each mapper has to scan.

FUTURE EXPLORATIONS
1) Document size
2) Space pruning
3) Categorizing web pages
4) Detecting specific portions and omitting them
5) Sensitivity of simhash to changes in the feature-selection algorithm
6) Relevance of simhash-based techniques for focused crawlers
7) Near-duplicate detection to facilitate document clustering

Summary
Most algorithms for near-duplicate detection run in batch mode over the entire collection of documents. For web crawling, an online algorithm is necessary because the decision to ignore the hyperlinks in a recently crawled page has to be made quickly. The scale of the problem (billions of documents) limits us to small fingerprints. Charikar's simhash technique with 64-bit fingerprints seems to work well in practice for a repository of 8B web pages.