
Slide 1: Design and Implementation of a High-Performance Distributed Web Crawler
Vladislav Shkapenyuk, Torsten Suel, Polytechnic University
Presented 06/13/2006 by 최해기 (Master's program, 2nd semester)

Slide 2: Contents (Database System Laboratory)
3. Implementation Details and Algorithmic Techniques
   3.4 Crawl Manager Data Structure
   3.5 Scheduling Policy and Manager Performance
4. Experimental Results and Experiences
   4.1 Results of a Large Crawl
   4.2 Network Limits and Speed Control
   4.3 System Performance and Configuration
   4.4 Other Experiences
5. Comparison with Other Systems
6. Conclusions and Future Work

Slide 3: Manager Data Structures

Slide 4: 3.4 Crawl Manager Data Structure
The crawl manager maintains a number of data structures for scheduling the requests on the downloaders.
FIFO request queue: a list of request files, each containing a few hundred or a few thousand URLs. The files are located on disk and are not loaded immediately; they stay on disk as long as possible.
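The request-file idea above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: the names (spool_requests, next_batch, BATCH_SIZE) are hypothetical, and a tiny batch size is used so the example stays small; the paper's batches hold hundreds to thousands of URLs.

```python
import os
import tempfile
from collections import deque

BATCH_SIZE = 3  # illustrative; the paper uses a few hundred to a few thousand URLs per file

request_files = deque()        # FIFO of request-file paths; only the paths stay in memory
spool_dir = tempfile.mkdtemp()

def spool_requests(urls):
    """Write incoming URLs to disk in batches ("request files")."""
    for i in range(0, len(urls), BATCH_SIZE):
        path = os.path.join(spool_dir, f"req_{len(request_files):06d}.txt")
        with open(path, "w") as f:
            f.write("\n".join(urls[i:i + BATCH_SIZE]))
        request_files.append(path)

def next_batch():
    """Load the oldest request file only when the manager needs the work."""
    path = request_files.popleft()
    with open(path) as f:
        urls = f.read().splitlines()
    os.remove(path)
    return urls

spool_requests([f"http://example.com/page{i}" for i in range(7)])
print(len(request_files))  # 3 files spooled to disk
print(next_batch())        # the first batch, in FIFO order
```

The point of the design is that the manager's memory footprint is proportional to the number of files, not the number of pending URLs.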

Slide 5: 3.4 Crawl Manager Data Structure (2)
There are a number of FIFO host queues containing URLs, organized by hostname. They are stored in Berkeley DB as a single B-tree, with the hostname as the key. Once a host has been selected for download, we take the first entry in the corresponding host queue and send it to a downloader.
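A minimal in-memory sketch of the per-host FIFO queues, using a plain dict of deques in place of the Berkeley DB B-tree (the function names enqueue/dequeue are illustrative, not from the paper):

```python
from collections import deque
from urllib.parse import urlparse

host_queues = {}  # hostname -> FIFO of URLs (stands in for the hostname-keyed B-tree)

def enqueue(url):
    """File a URL under its host, creating the host queue on first sight."""
    host = urlparse(url).hostname
    host_queues.setdefault(host, deque()).append(url)

def dequeue(host):
    """Once a host is selected for download, take the first entry
    in its queue and hand it to a downloader."""
    q = host_queues[host]
    url = q.popleft()
    if not q:                # queue exhausted: drop the host entry
        del host_queues[host]
    return url

enqueue("http://a.example/1")
enqueue("http://a.example/2")
enqueue("http://b.example/1")
print(dequeue("a.example"))  # http://a.example/1  (FIFO order within a host)
```

Berkeley DB buys the same key-ordered access with the queues spilled to disk, which matters once there are millions of hosts.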

Slide 6: 3.4 Crawl Manager Data Structure (3)
There are three different host structures:
Host dictionary: an entry for each host, with a pointer to the corresponding host queue.
Ready queue: pointers to the hosts that are ready for download.
Waiting queue: pointers to the hosts that have recently been accessed and are now waiting out the 30-second politeness interval.

Slide 7: 3.4 Crawl Manager Data Structure (4)
Each host pointer in the ready queue has as its key value the request number of the first URL in the corresponding host queue. The manager selects the URL with the lowest request number among all URLs that are ready to be downloaded and sends it to a downloader. After the page has been downloaded, a pointer to the host is inserted into the waiting queue, keyed by its waiting time. Once a host's waiting time has passed, it is transferred back into the ready queue.
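The ready/waiting machinery of the last three slides can be sketched with two heaps. This is a hedged reconstruction: heapq stands in for the paper's disk-backed structures, the names (add_request, next_download) are invented for illustration, and time is passed in explicitly to keep the example deterministic.

```python
import heapq
from collections import deque

WAIT = 30.0  # politeness interval between two accesses to the same host (seconds)

host_dict = {}  # host dictionary: hostname -> FIFO queue of (request number, URL)
ready = []      # heap of (request number of the host's first URL, hostname)
waiting = []    # heap of (time the host may be accessed again, hostname)

def add_request(req_no, host, url):
    if host not in host_dict:          # new host encountered: create its
        host_dict[host] = deque()      # structure and mark it ready
        heapq.heappush(ready, (req_no, host))
    host_dict[host].append((req_no, url))

def next_download(now):
    # move hosts whose waiting time has passed back into the ready queue
    while waiting and waiting[0][0] <= now:
        _, host = heapq.heappop(waiting)
        if host_dict.get(host):
            heapq.heappush(ready, (host_dict[host][0][0], host))
    if not ready:
        return None
    # pick the ready host whose first URL has the lowest request number
    _, host = heapq.heappop(ready)
    req_no, url = host_dict[host].popleft()
    if host_dict[host]:
        heapq.heappush(waiting, (now + WAIT, host))  # host must now wait 30 s
    else:
        del host_dict[host]  # all URLs downloaded: delete the host structure
    return url

add_request(0, "a.example", "http://a.example/0")
add_request(1, "b.example", "http://b.example/1")
add_request(2, "a.example", "http://a.example/2")
print(next_download(now=0.0))   # http://a.example/0  (lowest request number)
print(next_download(now=0.0))   # http://b.example/1  (a.example is still waiting)
print(next_download(now=31.0))  # http://a.example/2  (its 30 s wait has passed)
```

Note how the second call skips a.example even though it holds the globally lowest remaining request number: politeness takes priority over request order.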

Slide 8: 3.4 Crawl Manager Data Structure (5)
When a new host is encountered: create a host structure, put it into the host dictionary, and insert a pointer to the host into the ready queue.
When all URLs in a queue have been downloaded: the host is deleted from the structures, although certain information from its robots file is kept.
If a host is not responding: put the host into the waiting queue for some time.

Slide 9: 3.5 Scheduling Policy and Manager Performance
If we immediately inserted all the URLs into the Berkeley DB B-tree structure, it would quickly grow beyond main-memory size and result in bad I/O behavior. Thus, we would like to delay inserting the URLs into the structures as long as possible.

Slide 10: 3.5 Scheduling Policy and Manager Performance (2)
Goal: significantly decrease the size of the structures, while keeping enough hosts to run the crawl at the given speed.
The total number of host structures and corresponding URL queues at any time is about x + s*t + nd + nt, where
x: the number of hosts in the ready queue
s*t: an estimate of the number of hosts currently waiting (crawl speed s times the waiting time t)
nd: the number of hosts that are waiting because they were down
nt: the number of hosts in the dictionary (ignored here)
The number of host structures will usually be less than 2x.
Example: x = 10000, which for a speed of 120 pages per second resulted in at most 16000 hosts in the manager.
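The slide's numbers can be checked directly; a small sketch plugging the stated parameters into the formula (nd and nt are assumed negligible here, as the slide suggests):

```python
# Estimate of host structures in the manager: about x + s*t + nd + nt.
x = 10_000   # hosts kept in the ready queue
s = 120      # crawl speed, pages per second
t = 30       # politeness wait per host, seconds

st = s * t       # hosts sitting in the waiting queue at any moment
bound = 2 * x    # the slide's rule of thumb: usually fewer than 2x hosts

print(st)               # 3600 waiting hosts
print(x + st)           # 13600, consistent with the observed <= 16000 hosts
print(x + st < bound)   # True
```

So at 120 pages/s the waiting queue contributes only about a third of x, which is why the total stays comfortably under 2x.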

Slide 11: 3.5 Scheduling Policy and Manager Performance (3)
The resulting ordering is in fact the same as if we immediately inserted all request URLs into the manager. This assumes that, when the number of hosts in the ready queue drops below x, the manager is able to raise the number back to at least x before the downloaders actually run out of work.

Slide 12: 4.1 Results of a Large Crawl
120 million web pages were downloaded from about 5 million hosts in 18 days. For the last 4 days, the crawler was running at very low speed, draining URLs from the few hundred very large host queues that remained. During operation, the speed of the crawler was limited to a certain rate, depending on the time of day, so that other users on campus were not inconvenienced.

Slide 13: 4.1 Results of a Large Crawl (2)
Network errors occur when a server is down, does not exist, behaves incorrectly, or is extremely slow. Some robots files were downloaded many times.

Slide 14: 4.2 Network Limits and Speed Control
We had to control the speed of our crawler so that the impact on other campus users was minimized. Rates were usually limited to about 80 pages per second (1 MB/s) during peak times, and up to 180 pages per second during the late night and early morning. The limits can be changed and displayed via a web-based Java interface. The campus is connected to the Internet by a T3 link, with a Cisco 3620 as the main campus router.
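The slides do not say how the rate cap is enforced, so the following is only a generic token-bucket sketch of such a speed control, with the paper's day/night limits plugged in as example numbers; the class and method names are hypothetical.

```python
import time

class TokenBucket:
    """Illustrative token-bucket limiter: each download consumes one token,
    and tokens refill at the configured pages-per-second rate."""

    def __init__(self, rate, burst):
        self.rate = rate              # pages per second allowed
        self.burst = burst            # maximum stored tokens
        self.tokens = burst
        self.last = time.monotonic()

    def set_rate(self, rate):
        """Switch the limit, e.g. by time of day (80/s by day, 180/s at night)."""
        self.rate = rate

    def allow(self):
        """Return True if one page download may start now."""
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=80, burst=80)   # peak-time limit, ~80 pages/s
granted = sum(bucket.allow() for _ in range(200))
print(granted)  # about 80: the burst drains, then further requests are refused
```

In a crawler, the manager would simply hold back the next URL until allow() returns True, which smooths the load the campus link sees.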

Slide 15: 4.2 Network Limits and Speed Control (2)
The data includes all traffic going in and out of the poly.edu domain over the 24 hours of May 28, 2001. At high crawl speeds there is relatively little other traffic. A checkpoint is performed every 4 hours. The crawl traffic is barely visible in the outgoing bytes, since the crawler only sends out small requests, but it is clearly visible in the number of outgoing frames, partly due to the HTTP requests and partly due to the DNS system.
[Figure: incoming bytes, outgoing bytes, and outgoing frames over the 24-hour period]

Slide 16: 4.3 System Performance and Configuration
The hardware consisted of Sun Ultra 10 workstations and a dual-processor Sun E250.
Downloader: uses most of the CPU, but little memory.
Manager: uses little CPU time, but a reasonable amount (100 MB) of buffer space for Berkeley DB.
The downloader and the manager run on one machine, and all other components on the other.

Slide 17: 5. Comparison with Other Systems
Mercator:
Flexibility through pluggable components.
A centralized crawler: data can be parsed directly in memory and does not have to be read back from disk.
Uses caching to catch most of the random I/O, together with a fast disk system.
Achieves good I/O performance by hashing hostnames.

Slide 18: 5. Comparison with Other Systems (2)
Atrax:
A recent distributed version of Mercator that ties several Mercator systems together.
We are not yet familiar with many details of Atrax.
It uses a disk-efficient merge, and a very similar approach to scaling, with Mercator as its basic unit of replication.

Slide 19: 6. Conclusions and Future Work
We have described the architecture and implementation details of our crawling system and presented some preliminary experiments. There are obviously many possible improvements to the system. Future work includes a detailed study of the scalability of the system and the behavior of its components.

