Design and Implementation of a High-Performance Distributed Web Crawler
Vladislav Shkapenyuk, Torsten Suel
Polytechnic University
06/13/2006, Master's program, 2nd semester
Presented by 최해기

Contents
3. Implementation Details and Algorithmic Techniques
   3.4 Crawl Manager Data Structure
   3.5 Scheduling Policy and Manager Performance
4. Experimental Results and Experiences
   4.1 Results of a Large Crawl
   4.2 Network Limits and Speed Control
   4.3 System Performance and Configuration
   4.4 Other Experiences
5. Comparison with Other Systems
6. Conclusions and Future Work

Manager Data Structures

Crawl Manager Data Structure
The crawl manager maintains a number of data structures for scheduling requests on the downloaders:
- FIFO request queue: a list of request files, each containing a few hundred or a few thousand URLs
- The request files are located on disk and are not immediately loaded; they stay on disk as long as possible

Crawl Manager Data Structure (2)
- A number of FIFO host queues contain the URLs, organized by hostname
  - Implemented in Berkeley DB as a single B-tree, with the hostname as the key
- Once a host has been selected for download, we take the first entry in the corresponding host queue and send it to a downloader

Crawl Manager Data Structure (3)
Three different host structures (see the sketch below):
- Host dictionary: an entry for each host, with a pointer to the corresponding host queue
- Ready queue: pointers to the hosts that are ready for download
- Waiting queue: pointers to the hosts that have recently been accessed and are now waiting out the 30-second politeness interval
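A minimal sketch of these three structures, with names invented here. The paper's actual implementation keeps the per-host URL queues in a Berkeley DB B-tree keyed by hostname; a dict of in-memory deques stands in for it in this sketch.

```python
import heapq
from collections import deque

class HostEntry:
    """One entry in the host dictionary, pointing to this host's URL queue."""
    def __init__(self, name):
        self.name = name
        self.queue = deque()  # FIFO host queue of (request_number, url)

host_dict = {}       # host dictionary: hostname -> HostEntry
ready_queue = []     # min-heap of (request number of first URL, hostname)
waiting_queue = []   # min-heap of (time host may be contacted again, hostname)
```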

Crawl Manager Data Structure (4)
- Each host pointer in the ready queue has as its key the request number of the first URL in the corresponding host queue
- We select the URL with the lowest request number among all URLs that are ready to be downloaded and send it to the downloader
- After the page has been downloaded, a pointer to the host is inserted into the waiting queue, keyed by its waiting time
- Once a host's waiting time has passed, it is transferred back into the ready queue
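Continuing the sketch above, a hypothetical scheduling step that follows the selection and waiting rules just described. The 30-second value comes from the earlier slide; the function names and everything else are assumptions of this sketch, not the paper's code.

```python
import time

WAIT_SECONDS = 30  # politeness interval between accesses to the same host

def next_request():
    # Among all hosts that are ready, pick the one whose first URL has the
    # lowest request number and hand that URL to a downloader.
    if not ready_queue:
        return None
    request_number, hostname = heapq.heappop(ready_queue)
    host = host_dict[hostname]
    _, url = host.queue.popleft()
    return host, url

def on_page_downloaded(host):
    # After a download, the host must wait before being contacted again.
    heapq.heappush(waiting_queue, (time.time() + WAIT_SECONDS, host.name))

def release_waiting_hosts():
    # Transfer hosts whose waiting time has passed back into the ready
    # queue, keyed again by the request number of their first URL.
    now = time.time()
    while waiting_queue and waiting_queue[0][0] <= now:
        _, hostname = heapq.heappop(waiting_queue)
        host = host_dict.get(hostname)
        if host is not None and host.queue:
            heapq.heappush(ready_queue, (host.queue[0][0], hostname))
```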

Crawl Manager Data Structure (5)
- When a new host is encountered: create a host structure, put it into the host dictionary, and insert a pointer to the host into the ready queue
- When all URLs in a host queue have been downloaded: the host is deleted from the structures, though certain information from its robots file is kept
- If a host is not responding: put the host into the waiting queue for some time
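Continuing the same sketch, the host-lifecycle rules on this slide might look as follows; the function names and the backoff parameter are invented here.

```python
def add_url(hostname, request_number, url):
    # When a new host is encountered: create its structure, register it in
    # the host dictionary, and make it ready for download.
    host = host_dict.get(hostname)
    if host is None:
        host = HostEntry(hostname)
        host_dict[hostname] = host
        heapq.heappush(ready_queue, (request_number, hostname))
    host.queue.append((request_number, url))

def on_host_exhausted(host):
    # All URLs for this host downloaded: delete it from the structures
    # (the real system keeps certain robots-file information around).
    del host_dict[host.name]

def on_host_down(host, backoff_seconds):
    # A non-responding host is parked in the waiting queue for some time.
    heapq.heappush(waiting_queue, (time.time() + backoff_seconds, host.name))
```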

Scheduling Policy and Manager Performance
- If we immediately inserted all the URLs into the Berkeley DB B-tree structure, it would quickly grow beyond main-memory size and result in bad I/O behavior
- Thus, we would like to delay inserting the URLs into the structures for as long as possible

Scheduling Policy and Manager Performance (2)
Goal: significantly decrease the size of the structures while keeping enough hosts to run the crawl at the given speed
- The total number of host structures and corresponding URL queues at any time is about x + s·t + n_d + n_t, where
  - x: the number of hosts in the ready queue
  - s·t: an estimate of the number of hosts currently in the waiting queue
  - n_d: the number of hosts that are waiting because they were down
  - n_t: the number of hosts in the dictionary (ignored here)
- The number of host structures will usually be less than 2x
- Example: x = 10000, which for a speed of 120 pages per second kept the number of hosts in the manager well below 2x
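A worked reading of the bound above, assuming s·t stands for crawl speed times the 30-second wait: at s = 120 pages per second, s·t = 120 × 30 = 3600 hosts sit in the waiting queue at any time, so with x = 10000 the total is roughly 10000 + 3600 = 13600 host structures (plus n_d and n_t), comfortably under 2x = 20000.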

Scheduling Policy and Manager Performance (3)
- The resulting ordering is in fact the same as if we had immediately inserted all requested URLs into the manager
- This assumes that whenever the number of hosts in the ready queue drops below x, the manager can raise it back to at least x before the downloaders actually run out of work
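A sketch of the refill rule this assumption relies on, continuing the code above. Here load_next_request_file is a hypothetical callback that reads one request file from the on-disk FIFO of request files described earlier, returning None when none remain.

```python
X_TARGET = 10000  # x: desired number of hosts in the ready queue (example value)

def refill_ready_queue(load_next_request_file):
    # Pull request files off disk only when the ready queue runs low, so
    # that most URLs stay in their on-disk request files instead of being
    # inserted into the manager's B-tree structures right away.
    while len(ready_queue) < X_TARGET:
        batch = load_next_request_file()  # a few hundred/thousand URLs, or None
        if batch is None:
            break
        for hostname, request_number, url in batch:
            add_url(hostname, request_number, url)
```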

Results of a Large Crawl
- 120 million web pages downloaded from about 5 million hosts in 18 days
- In the last 4 days, the crawler ran at very low speed to drain the few hundred very large host queues that remained
- During operation, we limited the speed of the crawler to a rate that depended on the time of day, so that other users on campus were not inconvenienced

Results of a Large Crawl (2)
- Network errors occur when a server is down, does not exist, behaves incorrectly, or is extremely slow
- Some robots files were downloaded many times

Network Limits and Speed Control
- We had to control the speed of our crawler to minimize the impact on other campus users
- Rates were usually limited to about 80 pages per second (1 MB/s) during peak times, and up to 180 pages per second during the late night and early morning
- The limits can be changed and displayed via a web-based Java interface
- The campus is connected to the Internet by a T3 link, with a Cisco 3620 as the main campus router

Network Limits and Speed Control (2)
- The data includes all traffic going in and out of the poly.edu domain over the 24 hours of May 28
- At high crawl speed there is relatively little other traffic
- A checkpoint is performed every 4 hours
- The crawl is barely visible in the outgoing bytes, since the crawler only sends out small requests, but it is clearly visible in the number of outgoing frames, partly due to HTTP requests and the DNS system
[Figure: incoming bytes, outgoing bytes, and outgoing frames over the 24-hour period]

System Performance and Configuration
- Hardware: Sun Ultra 10 workstations and a dual-processor Sun E250
- Downloader: uses most of the CPU, but little memory
- Manager: uses little CPU time, but a reasonable amount (100 MB) of buffer space for Berkeley DB
- The downloader and the manager run on one machine, and all other components on the other

Comparison with Other Systems
- Mercator
  - Flexibility through pluggable components
  - A centralized crawler: data can be parsed directly in memory and does not have to be written to disk and read back
  - Uses caching to catch most of the random I/O, plus a fast disk system
  - Good I/O performance by hashing hostnames

Comparison with Other Systems (2)
- Atrax
  - A recent distributed version of Mercator that ties several Mercator systems together
  - We are not yet familiar with many details of Atrax
  - Uses a disk-efficient merge
  - Takes a very similar approach to scaling, using Mercator as its basic unit of replication

Conclusions and Future Work
- We have described the architecture and implementation details of our crawling system and presented some preliminary experiments
- There are obviously many possible improvements to the system
- Future work: a detailed study of the scalability of the system and the behavior of its components