MSRBot Web Crawler
Dennis Fetterly
Microsoft Research Silicon Valley Lab
© Microsoft Corporation

The Problem
Goal: fetch a large number (millions) of items (web pages, images, etc.) via HTTP
  – Politely, so that we minimize complaints
  – With good performance
  – Run-time configuration
  – A research crawler that is easily extensible to allow a variety of crawling tasks
Requires a fair amount of engineering to scale to several hundred documents per second
  – Distributed system
  – Fetch many URLs in parallel

High-level overview
  1. Get URL to download
  2. Perform DNS resolution
  3. Check filters / robots.txt rules
  4. Fetch document
  5. Process document / extract links

System Architecture
Distributed
  – Scale by adding more machines
  – URLs partitioned among crawlers by hostname hash
Multi-threaded to perform parallel URL fetches
Major components are pluggable via class inheritance; many have multiple implementations

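A minimal sketch of partitioning by hostname hash follows. The slides do not say which hash function MSRBot uses; MD5 is an assumption made here only because it is stable across processes and machines. The names `UrlPartitioner` and `CrawlerFor` are hypothetical.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class UrlPartitioner
{
    // Returns the index (0 .. crawlerCount - 1) of the crawler responsible
    // for the URL's host. Every URL on one host maps to the same crawler.
    public static int CrawlerFor(Uri url, int crawlerCount)
    {
        if (crawlerCount <= 0)
            throw new ArgumentOutOfRangeException(nameof(crawlerCount));

        // Lower-case defensively so "Example.com" and "example.com" agree.
        byte[] hostBytes = Encoding.UTF8.GetBytes(url.Host.ToLowerInvariant());

        using (MD5 md5 = MD5.Create())
        {
            // Assumption: the first 8 bytes of an MD5 digest serve as the hash.
            ulong hash = BitConverter.ToUInt64(md5.ComputeHash(hostBytes), 0);
            return (int)(hash % (ulong)crawlerCount);
        }
    }
}
```

Because the hash depends only on the hostname, all politeness decisions for a host can be made on a single machine. Note that changing `crawlerCount` in this simple modulo scheme reassigns most hosts.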
Run-Time Configuration
.NET application configuration file
  – Key/value pairs
  – Can choose implementations of components
  – Size data structures to fit the task/machine
Have run the crawler on a variety of hardware
  – 800 MHz Pentium III with 512 MB RAM
  – Quad-core Opteron with 16 GB RAM

Duplicate URL Eliminator
Many URLs, such as the Acrobat download link, occur on a significant number of pages
Need to be able to check whether the crawler has previously encountered a URL
For example, crawling just 108.5 million pages yields 1.6 billion unique URLs
Even with an 8-byte hash, this can't scale if the hashes are stored in memory
Also can't afford a disk seek per lookup, so lookups need to be buffered

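As a rough check of the memory claim, using only the figures on this slide (and ignoring any hash table overhead, which would make it worse):

```latex
1.6 \times 10^{9}\ \text{URLs} \times 8\ \text{bytes per hash}
  = 1.28 \times 10^{10}\ \text{bytes} \approx 12.8\ \text{GB}
```

That is most of the 16 GB on the larger machine listed earlier and far beyond the 512 MB machine, for a crawl the slide itself describes as small.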
Duplicate URL Eliminator
Current implementations:
  – In-memory hash table
  – In-memory table with recent hashes
      Full set of hashes kept sorted on disk
      Current URLs also on disk
      When the in-memory table reaches a certain load, sort and merge with the hashes on disk, and send new URLs to the Frontier

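The buffered sort-and-merge idea can be sketched as follows. This is an illustration under assumptions, not MSRBot's code: a sorted `List<ulong>` stands in for the sorted hash file on disk, the 8-byte hash is taken from an MD5 digest, and the class and member names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public sealed class BufferedDuplicateEliminator
{
    private readonly int _flushThreshold;
    // Stand-in for the full, sorted set of hashes kept on disk.
    private List<ulong> _sortedStore = new List<ulong>();
    // Recent hashes and their URLs, waiting for the next merge.
    private readonly Dictionary<ulong, string> _pending = new Dictionary<ulong, string>();

    public BufferedDuplicateEliminator(int flushThreshold)
    {
        _flushThreshold = flushThreshold;
    }

    // Buffers a URL. Returns the newly discovered URLs if this call
    // triggered a merge, otherwise an empty list.
    public IReadOnlyList<string> Add(string url)
    {
        _pending[Hash64(url)] = url;
        return _pending.Count >= _flushThreshold ? Flush() : Array.Empty<string>();
    }

    // Sorts the buffered hashes and merges them with the store in one
    // sequential pass, so no per-lookup seek is needed.
    public IReadOnlyList<string> Flush()
    {
        List<ulong> batch = _pending.Keys.OrderBy(h => h).ToList();
        var merged = new List<ulong>(_sortedStore.Count + batch.Count);
        var fresh = new List<string>();
        int i = 0;
        foreach (ulong h in batch)
        {
            while (i < _sortedStore.Count && _sortedStore[i] < h)
                merged.Add(_sortedStore[i++]);
            if (i < _sortedStore.Count && _sortedStore[i] == h)
                continue;              // already known; copied on a later pass
            merged.Add(h);
            fresh.Add(_pending[h]);    // would be sent to the frontier
        }
        while (i < _sortedStore.Count)
            merged.Add(_sortedStore[i++]);
        _sortedStore = merged;
        _pending.Clear();
        return fresh;
    }

    // Assumption: first 8 bytes of an MD5 digest as the 8-byte hash.
    private static ulong Hash64(string url)
    {
        using (MD5 md5 = MD5.Create())
            return BitConverter.ToUInt64(md5.ComputeHash(Encoding.UTF8.GetBytes(url)), 0);
    }
}
```

A URL is only reported as new at merge time, which is the price paid for replacing random lookups with one sequential pass over the stored hashes.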
Frontier
The frontier component manages the list of URLs that should be crawled
Supports politeness by telling the calling thread the earliest time at which the returned URL may be downloaded
Can cache results from the first successful DNS resolution

Polite Frontier
Maintains two types of queues, both with a head and tail in memory and the remainder buffered on disk
  – The main queue, containing all URLs to be crawled
  – Many per-host queues, each containing URLs from one hostname
The number of per-host queues is a configurable multiple of the number of threads; currently using 600 threads with a multiplier of 3
Politeness is maintained by a priority queue of per-host queues, ordered by the time at which each host can be contacted again
  – Entries are removed from the queue when a URL is returned to a worker thread
  – Entries are added when a download (or failure) is reported to the frontier; the delay currently used is 10 times as long as the previous download took
DNS results are cached for as long as the host has an active host queue in the frontier

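The priority queue of per-host queues can be sketched as below. This is a simplified, single-threaded illustration: it omits the main queue, the disk buffering, the cap on the number of host queues, and DNS caching. The class and method names are hypothetical, and `PriorityQueue<TElement, TPriority>` requires .NET 6 or later, which is not what the original crawler would have used.

```csharp
using System;
using System.Collections.Generic;

public sealed class PoliteFrontierSketch
{
    private const int DelayMultiplier = 10;   // "10 times as long as the previous download took"

    private readonly Dictionary<string, Queue<Uri>> _hostQueues =
        new Dictionary<string, Queue<Uri>>();
    // Hosts that may be handed out, ordered by earliest permitted contact time.
    private readonly PriorityQueue<string, DateTime> _ready =
        new PriorityQueue<string, DateTime>();

    public void Add(Uri url, DateTime now)
    {
        if (!_hostQueues.TryGetValue(url.Host, out Queue<Uri> queue))
        {
            queue = new Queue<Uri>();
            _hostQueues[url.Host] = queue;
            _ready.Enqueue(url.Host, now);    // a new host may be contacted at once
        }
        queue.Enqueue(url);
    }

    // Hands out the next URL of the host that may be contacted soonest.
    // The host's entry is removed until the download is reported back.
    public bool TryGetNext(out Uri url, out DateTime notBefore)
    {
        if (!_ready.TryDequeue(out string host, out notBefore))
        {
            url = null;
            return false;
        }
        url = _hostQueues[host].Dequeue();
        return true;
    }

    // Called when a download (or failure) is reported; re-adds the host
    // with a delay proportional to how long the last download took.
    public void ReportDone(Uri url, TimeSpan downloadTime, DateTime now)
    {
        string host = url.Host;
        if (_hostQueues[host].Count == 0)
        {
            _hostQueues.Remove(host);         // no more work for this host
            return;
        }
        TimeSpan delay = TimeSpan.FromTicks(downloadTime.Ticks * DelayMultiplier);
        _ready.Enqueue(host, now + delay);
    }
}
```

Because a host has no entry in the priority queue while one of its URLs is in flight, at most one worker thread contacts a given host at a time. The caller is responsible for waiting until `notBefore`.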
Processing Modules
After a document is downloaded with an HTTP result code of 200 ("OK"), the content needs to be processed
Processor modules are associated either with specific MIME types or with any MIME type
The Process method of each matching module is called with the document as an argument
Modules exist for:
  – Writing out all text documents
  – Writing out binary files
  – Writing out MD5 checksums for the content
  – Extracting links from text/html documents

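One way the MIME-type dispatch described above could look is sketched here. The interface `IDocumentProcessor`, the registry class, and the `"*"` wildcard key are all hypothetical simplifications for illustration; the crawler's actual module contract is the `ProcessorModule` class shown on the Extensibility slide.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified module contract for this sketch.
public interface IDocumentProcessor
{
    void Process(Uri url, byte[] content);
}

// Tiny example module: counts the bytes it has been given.
public sealed class ByteCounter : IDocumentProcessor
{
    public long TotalBytes { get; private set; }
    public void Process(Uri url, byte[] content) { TotalBytes += content.Length; }
}

public sealed class ProcessorRegistry
{
    private const string AnyMimeType = "*";   // assumption: wildcard key for "any MIME type"

    private readonly Dictionary<string, List<IDocumentProcessor>> _byMimeType =
        new Dictionary<string, List<IDocumentProcessor>>(StringComparer.OrdinalIgnoreCase);

    public void Register(string mimeType, IDocumentProcessor module)
    {
        if (!_byMimeType.TryGetValue(mimeType, out List<IDocumentProcessor> modules))
        {
            modules = new List<IDocumentProcessor>();
            _byMimeType[mimeType] = modules;
        }
        modules.Add(module);
    }

    public void RegisterForAny(IDocumentProcessor module)
    {
        Register(AnyMimeType, module);
    }

    // Calls every matching module for a successfully downloaded document
    // and returns how many modules were invoked.
    public int Dispatch(int statusCode, string mimeType, Uri url, byte[] content)
    {
        if (statusCode != 200) return 0;      // only "200 OK" documents are processed

        int called = 0;
        foreach (string key in new[] { mimeType ?? string.Empty, AnyMimeType })
        {
            if (!_byMimeType.TryGetValue(key, out List<IDocumentProcessor> modules))
                continue;
            foreach (IDocumentProcessor module in modules)
            {
                module.Process(url, content);
                called++;
            }
        }
        return called;
    }
}
```

Registering one module for `text/html` and another for any type means an HTML page reaches both, while an image reaches only the second.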
Saving documents
The TextFileWriter class writes out the following information for each text/html document:
  – URL
  – Referring URL, if any
  – List of IP addresses that the hostname resolved to
  – The length of the document in bytes, including HTTP headers
  – The document content
The TextFilePerThreadWriter keeps one TextFileWriter per thread in thread-local storage
The BinaryFileWriter is similar, but includes only the URL and the document content, excluding HTTP headers
The source and destination URLs are logged for all redirects

Extensibility
Easy to write additional processing modules

    public abstract class ProcessorModule {
        public abstract void Process(DocBundle db, ReuseableStream rs);
        …
    }

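To illustrate what an additional module might look like, here is a sketch of a subclass that logs an MD5 checksum per document. The slides show only the `Process` signature, so the `DocBundle` and `ReuseableStream` definitions below are hypothetical stand-ins (the real classes surely differ), as is the `Md5LoggingModule` name.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical stand-ins: the slides do not show these classes.
public class DocBundle
{
    public Uri Url { get; set; }
}

public class ReuseableStream : MemoryStream
{
    public ReuseableStream(byte[] content) : base(content) { }
}

// Same shape as the abstract class on the slide (other members elided there).
public abstract class ProcessorModule
{
    public abstract void Process(DocBundle db, ReuseableStream rs);
}

// Example module: writes "URL <tab> MD5 hex digest" for each document.
public sealed class Md5LoggingModule : ProcessorModule
{
    private readonly TextWriter _output;

    public Md5LoggingModule(TextWriter output)
    {
        _output = output;
    }

    public override void Process(DocBundle db, ReuseableStream rs)
    {
        rs.Position = 0;   // another module may already have read the stream
        using (MD5 md5 = MD5.Create())
        {
            byte[] digest = md5.ComputeHash(rs);
            string hex = BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
            _output.WriteLine("{0}\t{1}", db.Url, hex);
        }
    }
}
```

Rewinding the stream before reading reflects the assumption that several modules share one stream per document, which the name `ReuseableStream` suggests but the slides do not confirm.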
Checkpointing Crawl State
Implemented via a C# interface
  1. Acquire a global lock on all crawlers
  2. Call the checkpoint method on each module that implements the “Icheckpointable” interface
  3. After all nodes complete the checkpoint method, commit the checkpoint to disk, removing any unnecessary files from the previous checkpoint
  4. Release the global lock

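The checkpoint sequence above can be sketched for a single process as follows. The slides name the interface but not its members, so `ICheckpointableSketch` and its `Checkpoint(string)` method are assumptions, as are the coordinator class, the temporary-directory commit, and the sample `CounterModule`. A local `lock` stands in for the global lock across crawlers.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical shape of the checkpoint interface.
public interface ICheckpointableSketch
{
    void Checkpoint(string directory);
}

// Sample module that persists a single counter.
public sealed class CounterModule : ICheckpointableSketch
{
    public int Count { get; set; }

    public void Checkpoint(string directory)
    {
        File.WriteAllText(Path.Combine(directory, "counter.txt"), Count.ToString());
    }
}

public sealed class CheckpointCoordinator
{
    private readonly object _globalLock = new object();   // stand-in for the cross-crawler lock
    private readonly string _root;

    public CheckpointCoordinator(string root)
    {
        _root = root;
    }

    // Writes a checkpoint named `name` under the root directory and
    // returns its path. Assumes `name` has not been used before.
    public string Run(IEnumerable<object> modules, string name)
    {
        lock (_globalLock)
        {
            string pending = Path.Combine(_root, name + ".tmp");
            string committed = Path.Combine(_root, name);
            Directory.CreateDirectory(pending);

            // Only modules that implement the interface take part.
            foreach (object module in modules)
            {
                if (module is ICheckpointableSketch checkpointable)
                    checkpointable.Checkpoint(pending);
            }

            // Commit by renaming, then drop older checkpoints.
            Directory.Move(pending, committed);
            foreach (string dir in Directory.GetDirectories(_root))
            {
                if (Path.GetFileName(dir) != name)
                    Directory.Delete(dir, true);
            }
            return committed;
        }
    }
}
```

Writing into a temporary directory and renaming it on success means a crash mid-checkpoint leaves the previous checkpoint untouched.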
Recovering Crawl State
Also implemented via the “Icheckpointable” interface
Currently implemented as follows:
  – Initialize a new crawl
  – Move files from the previous checkpoint into the right spot in the new crawl directory via a batch file
  – Issue “restore ” command

Our setup
MB/s Fast Ethernet connections
2 host-based routers
  – Windows Server 2003 / ISA Server
  – ~10% CPU load with 100 MB/s traffic
4 crawlers
  – Quad-core Opteron
  – 16 GB memory
  – GB disks (5 disks in a single RAID 5 volume)