
Mercator: A Scalable, Extensible Web Crawler
Allan Heydon and Marc Najork
Presented by Sumeet Takalkar

 Inspiration for Mercator
 What is Mercator?
 Crawling algorithm and its functional components
 Architecture
 Components of Mercator
 Extensibility
 Hazards
 Results
 Conclusions

 The design of a scalable web crawler was not well documented when this paper was published.
 The paper enumerates the major components of Mercator and describes its support for extensibility and customizability.

Mercator: a scalable, extensible web crawler.
 Scalable: designed to scale up to the entire web, by using data structures that require a bounded amount of memory regardless of the size of the crawl. The majority of the data structures are stored on disk, with only small parts kept in memory.
 Extensible: designed in a modular way, with the expectation that new functionality will be added by third parties.

Algorithm step → Functional component
1. Remove a URL from the URL list → URL frontier
2. Determine the IP address of its host name → domain name resolution
3. Download the corresponding document → HTTP protocol module
4. Extract any links contained in the document → content-seen test
5. For each extracted link, ensure that it is an absolute URL → URL filter
6. Add the URL to the list of URLs to download, provided it has not been encountered before → URL-seen test

1. Remove an absolute URL from the shared URL frontier for downloading.
2. Invoke the protocol module's fetch method, which downloads the document from the Internet into a per-thread RewindInputStream.
3. The worker thread invokes the content-seen test to determine whether this document has been seen before.
4. Based on the downloaded document's MIME type, the worker invokes the process method of each processing module associated with that MIME type.
5. Each extracted link is converted into an absolute URL and tested against a user-supplied URL filter to determine whether it should be downloaded.
6. If the URL passes the filter, the worker performs the URL-seen test, which checks whether the URL has been seen before.
7. If the URL is new, it is added to the frontier.
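As a rough illustration of this loop, here is a minimal Java sketch of one worker thread. All component interfaces (Frontier, ProtocolModule, ContentSeenTest, UrlFilter, UrlSeenTest, LinkExtractor) are hypothetical stand-ins, not Mercator's actual classes.

```java
// Sketch only: the interfaces below are hypothetical stand-ins, not Mercator's API.
import java.util.List;

interface Frontier { String removeUrl() throws InterruptedException; void addUrl(String url); }
interface ProtocolModule { byte[] fetch(String url); }
interface ContentSeenTest { boolean seenBefore(byte[] document); }
interface UrlFilter { boolean crawl(String url); }
interface UrlSeenTest { boolean seenBefore(String url); }
interface LinkExtractor { List<String> extractAbsoluteLinks(String baseUrl, byte[] document); }

/** One worker thread's loop, following the seven steps on the slide above. */
public class CrawlWorker implements Runnable {
    private final Frontier frontier;
    private final ProtocolModule protocol;
    private final ContentSeenTest contentSeen;
    private final UrlFilter urlFilter;
    private final UrlSeenTest urlSeen;
    private final LinkExtractor extractor;

    CrawlWorker(Frontier f, ProtocolModule p, ContentSeenTest c,
                UrlFilter uf, UrlSeenTest us, LinkExtractor e) {
        frontier = f; protocol = p; contentSeen = c; urlFilter = uf; urlSeen = us; extractor = e;
    }

    @Override
    public void run() {
        try {
            while (true) {
                String url = frontier.removeUrl();                 // 1. dequeue a URL
                byte[] doc = protocol.fetch(url);                  // 2. download the document
                if (doc == null || contentSeen.seenBefore(doc)) {  // 3. content-seen test
                    continue;
                }
                // 4.-5. extract links and convert them to absolute URLs
                for (String link : extractor.extractAbsoluteLinks(url, doc)) {
                    // 5.-6. apply the user-supplied filter, then the URL-seen test
                    if (urlFilter.crawl(link) && !urlSeen.seenBefore(link)) {
                        frontier.addUrl(link);                     // 7. enqueue the new URL
                    }
                }
            }
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();                    // allow clean shutdown
        }
    }
}
```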

 The URL frontier is the data structure that contains all the URLs that remain to be downloaded.
 To keep from overloading any single web server, Mercator uses distinct FIFO subqueues, one per worker thread.
 When a URL is added, the FIFO subqueue in which it is placed is determined by its canonical host name, so all URLs for a given host are handled by the same thread.
 Canonical host name: canonicalization maps the variety of host names that serve the same content onto a single name.
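A minimal sketch of such a frontier, assuming a fixed number of worker threads and URLs whose host names have already been canonicalized; the class and method names are illustrative, not Mercator's.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/**
 * Simplified, illustrative URL frontier (not Mercator's actual code):
 * one FIFO subqueue per worker thread, chosen by hashing the URL's host name
 * so that all URLs of a host are downloaded by the same thread.
 */
public class SimpleFrontier {
    private final List<Queue<String>> subqueues = new ArrayList<>();

    public SimpleFrontier(int numWorkerThreads) {
        for (int i = 0; i < numWorkerThreads; i++) {
            subqueues.add(new ArrayDeque<>());
        }
    }

    /** Adds a URL to the subqueue owned by its host's worker thread. */
    public synchronized void addUrl(String url) {
        // Canonicalization of the host name is assumed to have happened already.
        String host = URI.create(url).getHost();
        int index = Math.floorMod(host.toLowerCase().hashCode(), subqueues.size());
        subqueues.get(index).add(url);
    }

    /** Called by worker thread number threadId; returns null if its queue is empty. */
    public synchronized String removeUrl(int threadId) {
        return subqueues.get(threadId).poll();
    }
}
```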

 Fetches the document corresponding to a given URL.
 Protocol modules include HTTP, FTP, and Gopher.
 Robots Exclusion Protocol: defines the limits a web crawler must respect when visiting a website.
 These declarations are stored in a special document, robots.txt, which must be fetched before any real content is downloaded from a host.
 Mercator maintains a fixed-size cache mapping host names to their robots-exclusion rules.
 To prevent a malicious web server from causing a worker thread to hang indefinitely, the authors implemented a "lean and mean" HTTP protocol module with a one-minute request timeout and minimal synchronization and allocation overhead.
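A small sketch of the fixed-size robots-rules cache described above. LRU eviction is an assumption (the slide only says the cache has a fixed size), and the RobotRules type is a placeholder for the parsed robots.txt.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Illustrative fixed-size cache from host name to parsed robots-exclusion
 * rules. LRU eviction and the RobotRules placeholder are assumptions.
 */
public class RobotsCache {
    /** Placeholder for the rules parsed out of a host's robots.txt. */
    public static final class RobotRules { }

    private final Map<String, RobotRules> cache;

    public RobotsCache(int maxEntries) {
        // An access-ordered LinkedHashMap gives least-recently-used eviction.
        this.cache = new LinkedHashMap<String, RobotRules>(maxEntries, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, RobotRules> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached rules for a host, or null if robots.txt must be fetched first. */
    public synchronized RobotRules get(String host) {
        return cache.get(host);
    }

    public synchronized void put(String host, RobotRules rules) {
        cache.put(host, rules);
    }
}
```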

[Architecture diagram: protocol modules (HTTP, FTP) download documents into the RewindInputStream (RIS); the content-seen test and the processing modules (link extractor, tag counter, GIF stats) read from the RIS.]

 In Mercator, the same document is processed by multiple processing modules. The RIS is used to avoid reading the document over the network multiple times.
 The document is cached locally using a RewindInputStream (RIS).
 A RIS caches small documents (64 KB or less) entirely in memory, while larger documents are temporarily written to a backing file (with a 1 MB limit).
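The following toy Java class shows the basic idea of a rewindable stream: buffer the document once, then hand out fresh streams to each processing module. The disk-backed path for documents larger than 64 KB is deliberately omitted, and the class name is illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Toy rewindable stream: buffers the whole document in memory so it can be
 * re-read by several processing modules. The real RIS spills documents larger
 * than 64 KB to a backing file (up to a 1 MB limit); that part is omitted here.
 */
public class RewindableStream {
    private final byte[] buffer;

    public RewindableStream(InputStream source) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = source.read(chunk)) != -1) {
            out.write(chunk, 0, n);   // read the document from the network exactly once
        }
        buffer = out.toByteArray();
    }

    /** Each call returns a fresh stream positioned at the start of the document. */
    public InputStream rewind() {
        return new ByteArrayInputStream(buffer);
    }
}
```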

Many documents are available under multiple, different URLs, and there are also many cases in which documents are mirrored on multiple servers.
To prevent processing a document more than once, a web crawler may wish to perform a content-seen test to decide whether the document has already been processed.
To save space and time, Mercator uses a data structure called the document fingerprint set that stores a 64-bit checksum of the contents of each downloaded document.
Mercator computes the checksum using Broder's implementation of Rabin's fingerprinting algorithm.
Fingerprints offer provably strong probabilistic guarantees that two different strings will not have the same fingerprint.

Steps in the content-seen test:
1) Check whether the fingerprint is contained in the in-memory table. If not, go to step 2.
2) Check whether the fingerprint resides in the disk file, using interpolated binary search. If not, go to step 3.
3) Add the new fingerprint to the in-memory table.
4) If the in-memory table fills up, merge its contents into the disk file.
5) After the merge, update the in-memory index of the disk file.
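A compact sketch of the content-seen test's interface. It keeps everything in memory and uses a CRC checksum as a stand-in for Rabin fingerprints, so it shows the control flow, not Mercator's disk-backed fingerprint set.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.CRC32;

/**
 * Illustrative content-seen test: keep a set of document checksums and report
 * whether a checksum has been seen before. The real fingerprint set uses
 * 64-bit Rabin fingerprints and spills to a sorted disk file searched with
 * interpolated binary search; here an in-memory set and a CRC stand in.
 */
public class ContentSeenSet {
    private final Set<Long> fingerprints = new HashSet<>();

    private static long fingerprint(byte[] document) {
        CRC32 crc = new CRC32();          // stand-in for Rabin fingerprinting
        crc.update(document);
        return crc.getValue();
    }

    /** Returns true if the document content was already seen; records it otherwise. */
    public synchronized boolean seenBefore(byte[] document) {
        return !fingerprints.add(fingerprint(document));
    }

    public static void main(String[] args) {
        ContentSeenSet set = new ContentSeenSet();
        byte[] doc = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(set.seenBefore(doc)); // false: first time this content is seen
        System.out.println(set.seenBefore(doc)); // true: duplicate content
    }
}
```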

 The URL filtering mechanism provides a customizable way to control the set of URLs that are downloaded  The URL filter class has a single crawl method that takes a URL and returns a Boolean value indicating whether or not to crawl that URL
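A possible shape for such a filter in Java. Only the single boolean crawl(url) method comes from the slide; the interface name and the example rules (domain restriction, extension checks) are assumptions for illustration.

```java
/**
 * Illustrative URL filter with a single crawl method, as described on the slide.
 * The example rules below are assumptions, not part of Mercator's API.
 */
public interface UrlFilter {
    boolean crawl(String url);

    /** Example filter: stay inside one site and skip obvious non-HTML content. */
    static UrlFilter exampleFilter() {
        return url -> url.startsWith("http://example.com/")
                && !url.endsWith(".jpg")
                && !url.endsWith(".gif")
                && !url.endsWith(".zip");
    }
}
```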

[Diagram: DNS requests from the crawler to the Internet, comparing the standard Java DNS interface with Mercator's multi-threaded DNS resolver.]

 Maps a web server's host name to an IP address.
 For most web crawlers, DNS resolution is a well-documented bottleneck.
 Caching DNS results is only partially effective, because the Java interface to DNS lookups is synchronized.
 Mercator uses a multi-threaded DNS resolver that can resolve host names much more rapidly than either the Java or Unix resolver.
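One simple way to get many lookups in flight at once is a thread pool over the standard resolver plus a result cache, sketched below. Mercator's own resolver was custom-built precisely because the Java interface of the time serialized lookups, so this is an analogy for the idea, not its implementation.

```java
import java.net.InetAddress;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/**
 * Illustrative caching, multi-threaded host-name resolver: a thread pool lets
 * many lookups proceed at once, and a cache absorbs repeated lookups.
 */
public class ParallelDnsResolver {
    private final Map<String, InetAddress> cache = new ConcurrentHashMap<>();
    private final ExecutorService pool;

    public ParallelDnsResolver(int threads) {
        pool = Executors.newFixedThreadPool(threads);
    }

    /** Submits a lookup; many such lookups can be in flight concurrently. */
    public Future<InetAddress> resolve(String host) {
        return pool.submit((Callable<InetAddress>) () -> {
            InetAddress cached = cache.get(host);
            if (cached != null) {
                return cached;                                   // cache hit: no network round trip
            }
            InetAddress address = InetAddress.getByName(host);  // may block on DNS
            cache.put(host, address);
            return address;
        });
    }

    public void shutdown() {
        pool.shutdown();
    }
}
```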

 To avoid downloading and processing a document multiple times, a URL-seen test must be performed on each extracted link.
 To perform the URL-seen test, all of the URLs seen by Mercator are stored in canonical form in a large table called the URL set.
 To save space, each URL is stored as a fixed-size checksum rather than as its textual representation.
 To reduce the number of operations on the backing disk file, Mercator keeps an in-memory cache of popular URLs.
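A toy version of the URL-seen test: fixed-size checksums plus an LRU cache of recently seen URLs. The disk-backed URL set is replaced by an in-memory set, and the hash function is an arbitrary stand-in for Mercator's fingerprints.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/**
 * Illustrative URL-seen test: URLs are stored as fixed-size checksums rather
 * than text, and a small cache of recently seen URLs absorbs most lookups.
 * An in-memory set stands in for Mercator's disk-backed URL set.
 */
public class UrlSeenSet {
    private final Set<Long> urlChecksums = new HashSet<>();   // stand-in for the disk-backed URL set
    private final Map<Long, Boolean> recentCache;             // LRU cache of popular URLs

    public UrlSeenSet(int cacheSize) {
        recentCache = new LinkedHashMap<Long, Boolean>(cacheSize, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, Boolean> eldest) {
                return size() > cacheSize;
            }
        };
    }

    private static long checksum(String canonicalUrl) {
        long h = 1125899906842597L;   // any 64-bit hash works for the sketch
        for (int i = 0; i < canonicalUrl.length(); i++) {
            h = 31 * h + canonicalUrl.charAt(i);
        }
        return h;
    }

    /** Returns true if the URL was seen before; otherwise records it. */
    public synchronized boolean seenBefore(String canonicalUrl) {
        long cs = checksum(canonicalUrl);
        if (recentCache.containsKey(cs)) {
            return true;                          // popular URL: answered from the cache
        }
        boolean isNew = urlChecksums.add(cs);     // would be a disk lookup in Mercator
        recentCache.put(cs, Boolean.TRUE);
        return !isNew;
    }
}
```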

[Diagram: the Google and Internet Archive crawlers distribute crawling processes across multiple machines, whereas Mercator runs its worker threads on a single machine.]

Google and Internet Archive crawlers
 Use single-threaded crawling processes and asynchronous I/O to perform multiple downloads in parallel.
 They are designed from the ground up to scale to multiple machines.
Mercator
 Uses a multi-threaded process in which each thread performs synchronous I/O.
 It would not be too difficult to adapt Mercator to run on multiple machines.

 To allow a crawl of the entire web to run to completion, Mercator writes regular snapshots of its state to disk.
 An interrupted or aborted crawl can easily be restarted from the latest checkpoint.
 Mercator's core classes and all user-supplied modules are required to implement the checkpointing interface.
 Checkpointing is coordinated using a global readers-writer lock.
 Each worker thread acquires a read share of the lock while processing a downloaded document.
 Once a day, Mercator's main thread acquires the write lock and then invokes the checkpoint methods.
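The readers-writer coordination can be sketched with java.util.concurrent locks as below; the Checkpointable interface and the method names are illustrative, not Mercator's checkpointing API.

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Illustrative checkpoint coordination: workers hold a read share of a global
 * lock while processing a document; the main thread takes the write lock
 * (excluding all workers) before asking each component to write its state.
 */
public class CheckpointCoordinator {
    public interface Checkpointable {
        void checkpoint() throws Exception;   // write this component's state to disk
    }

    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    /** Called by a worker thread around the processing of one downloaded document. */
    public void processDocument(Runnable work) {
        lock.readLock().lock();
        try {
            work.run();
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Called (e.g., once a day) by the main thread to take a consistent snapshot. */
    public void checkpointAll(Iterable<Checkpointable> components) throws Exception {
        lock.writeLock().lock();              // waits until no document is in flight
        try {
            for (Checkpointable c : components) {
                c.checkpoint();
            }
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```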

Mercator can be:
1. Extended with new functionality
2. Reconfigured to use different versions of its major components
Key ingredients of an extensible system:
 Well-defined interfaces to each of the system's components
 A mechanism for specifying how the system is configured from various components
 Sufficient infrastructure to make it easy to write new components
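One common way to realize the second ingredient is to name component classes in a configuration file and instantiate them by reflection, as sketched below. The property name, interface, and default class here are assumptions for illustration; Mercator's actual configuration mechanism differs in detail.

```java
import java.util.Properties;

/**
 * Illustrative component loader: the configuration maps a component role to a
 * class name, and the class is instantiated by reflection so third-party
 * modules can be plugged in without changing the crawler's code.
 */
public class ComponentLoader {
    public interface UrlFilter {
        boolean crawl(String url);
    }

    /** Default component used when the configuration does not override it. */
    public static class AcceptAllFilter implements UrlFilter {
        public boolean crawl(String url) {
            return true;
        }
    }

    public static UrlFilter loadUrlFilter(Properties config) throws Exception {
        String className = config.getProperty("urlFilter", AcceptAllFilter.class.getName());
        // Any class on the classpath implementing UrlFilter can be named here.
        return (UrlFilter) Class.forName(className).getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        Properties config = new Properties();   // empty config: falls back to the default filter
        UrlFilter filter = loadUrlFilter(config);
        System.out.println(filter.crawl("http://example.com/"));   // prints true
    }
}
```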

 Protocol and Processing Modules
◦ Processing modules that do more with documents than extracting links
◦ Protocol modules for the FTP and Gopher protocols
 Alternative URL Frontier Implementation
◦ Dynamically assigns hosts to worker threads
◦ Avoids the case, typical on an intranet, in which multiple hosts are assigned to the same worker thread while other threads are left idle
 Random Walker
◦ Starts from a random page taken from a set of seeds
◦ Fetches the next page by choosing a random link from the current page

 URL Aliases
◦ Host name aliases
◦ Omitted port numbers
◦ Alternative paths on the same host
◦ Replication across different hosts
 Session IDs Embedded in URLs
◦ Session IDs create a potentially infinite set of URLs
 Crawler Traps
◦ URLs that cause a crawler to crawl indefinitely

 Performance (May 1999)

Crawler            HTTP requests   Days   Download rate    Download speed
Mercator           77.4 million    8      112 docs/sec     1,682 KB/sec
Google             26 million      9      33.5 docs/sec    200 KB/sec
Internet Archive   80 million      8      46.3 docs/sec    231 KB/sec

 Selected web statistics
 Each URL taken from the frontier leads to one HTTP request; however, two robots.txt-related issues change that count:
◦ If the appropriate version of robots.txt is not already cached, an extra HTTP request is required to fetch it.
◦ If robots.txt indicates that a document should not be downloaded, no request is made for it.
 80% of documents are between 1 KB and 32 KB in size.
 8.5% of successful HTTP requests were for duplicate documents.

 The paper describes the main components of any scalable crawler and their design alternatives.
 Scalability: Mercator runs on machines with memory sizes ranging from 128 MB to 2 GB.
 Extensibility: new functionality is added by writing new modules.

Q & A