Allan Heydon and Marc Najork -- Sumeet Takalkar
Inspiration of Mercator
What is Mercator
Crawling Algorithm and its Functional Components
Architecture
Components of Mercator
Extensibility
Hazards
Results
Conclusions
The design of scalable web crawlers was not well documented when this paper was presented.
The paper enumerates the major components of Mercator and describes its support for extensibility and customizability.
a scalable, extensible web crawler…
Scalable: Designed to scale up to the entire Web by implementing data structures that use a bounded amount of memory regardless of the size of the crawl. The majority of each data structure is stored on disk, with only a small part kept in memory.
Extensible: Mercator is designed in a modular way, with the expectation that new functionality will be added by third parties.
Algorithm step | Component
1. Remove a URL from the URL list | URL frontier
2. Determine the IP address of its host name | Domain Name Resolution
3. Download the corresponding document | HTTP Protocol Module
4. Extract any links contained in the document | Content Seen Test
5. For each of the extracted links, ensure that it is an absolute URL | URL filter
6. Add a URL to the list of URLs to download, provided it has not been encountered before | URL Seen Test
1. Remove an absolute URL from the shared URL frontier for downloading.
2. Invoke the protocol module's fetch method, which downloads the document from the Internet into a per-thread RewindInputStream.
3. The worker thread invokes the content-seen test to determine whether this document has been seen before.
4. Based on the downloaded document's MIME type, the worker invokes the process method of each processing module associated with that MIME type.
5. Each extracted link is converted into an absolute URL and tested against a user-supplied URL filter to determine whether it should be downloaded.
6. If the URL passes the filter, the worker performs the URL-seen test, which checks whether the URL has been seen before.
7. If the URL is new, it is added to the frontier.
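The worker steps above can be sketched as a single-threaded loop. This is only an illustration under stated assumptions: the `PAGES` dictionary, `fetch`, `url_filter`, and `crawl` are hypothetical names, the "web" is an in-memory dictionary rather than real HTTP, the content-seen test stores whole documents instead of fingerprints, and dispatch on MIME type is reduced to one link-extraction step.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Hypothetical in-memory "web" standing in for real HTTP downloads.
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="http://other.org/x">X</a>',
    "http://example.com/a": '<a href="/">home</a> <a href="/b">B</a>',
    "http://example.com/b": '<a href="/a">A</a>',
}

def fetch(url):
    """Stand-in for the protocol module's fetch method."""
    return PAGES.get(url, "")

def url_filter(url):
    """Stand-in for a user-supplied filter: only crawl example.com."""
    return url.startswith("http://example.com/")

def crawl(seed):
    frontier = deque([seed])   # URLs that remain to be downloaded
    urls_seen = {seed}         # state for the URL-seen test
    docs_seen = set()          # state for the content-seen test
    downloaded = []
    while frontier:
        url = frontier.popleft()                 # 1. remove a URL from the frontier
        doc = fetch(url)                         # 2. download the document
        if doc in docs_seen:                     # 3. content-seen test
            continue
        docs_seen.add(doc)
        downloaded.append(url)
        for href in re.findall(r'href="([^"]+)"', doc):  # 4. extract links
            link = urljoin(url, href)            # 5. make absolute, then filter
            if not url_filter(link):
                continue
            if link not in urls_seen:            # 6. URL-seen test
                urls_seen.add(link)
                frontier.append(link)            # 7. new URL goes to the frontier
    return downloaded
```

Calling `crawl("http://example.com/")` visits the three example.com pages once each and never requests the other.org link, because the filter rejects it before the URL-seen test.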
The URL frontier is the data structure that contains all the URLs that remain to be downloaded.
To prevent overloading a web server:
  Use distinct FIFO subqueues, one FIFO subqueue per worker thread.
  When a URL is added, the FIFO subqueue in which it is placed is determined by the URL's canonical host name.
  Canonical host name: a single representative name for a host that may be reachable under a variety of host names serving the same content.
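A minimal sketch of the subqueue idea follows. The class name `Frontier` and its methods are hypothetical, and lowercasing the host is only a rough stand-in for real host-name canonicalization. Because every URL of a given host hashes to the same subqueue, and each subqueue is drained by one thread, at most one thread contacts that host at a time.

```python
import zlib
from collections import deque
from urllib.parse import urlsplit

class Frontier:
    """One FIFO subqueue per worker thread, selected by host name."""

    def __init__(self, num_threads):
        self.subqueues = [deque() for _ in range(num_threads)]

    def _index(self, url):
        # Assumption: the lowercased host approximates the canonical host name.
        host = (urlsplit(url).hostname or "").lower()
        # crc32 is used because it is stable across runs, unlike hash() on str.
        return zlib.crc32(host.encode("utf-8")) % len(self.subqueues)

    def add(self, url):
        self.subqueues[self._index(url)].append(url)

    def remove(self, thread_id):
        """Return the next URL for this worker thread, or None if it has none."""
        queue = self.subqueues[thread_id]
        return queue.popleft() if queue else None
```

The cost of this design is that one thread's subqueue can hold many hosts while other threads sit idle, which matters when a crawl covers only a few hosts.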
Fetch the document corresponding to a given URL.
Protocols include HTTP, FTP, and Gopher.
Robots Exclusion Protocol: a protocol that defines the limitations a web crawler must observe as it visits a website.
  These declarations are stored in a special document, robots.txt, which must be fetched before downloading any real content from the host.
  Mercator maintains a fixed-size cache mapping host names to their robots exclusion rules.
To prevent a malicious web server from causing a worker thread to hang indefinitely, the authors implemented a "lean and mean" HTTP protocol module whose requests time out after 1 minute and which has minimal synchronization and allocation overhead.
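The fixed-size cache of exclusion rules might look like the sketch below. It is not Mercator's implementation: `RobotsCache` and the injected `fetch_robots` callable are hypothetical, least-recently-used eviction is an assumption of this sketch, and rule parsing is delegated to Python's standard `urllib.robotparser`. Expiry of stale rules and the request timeout are left out.

```python
from collections import OrderedDict
from urllib.robotparser import RobotFileParser

class RobotsCache:
    """Fixed-size cache mapping host names to parsed robots.txt rules."""

    def __init__(self, fetch_robots, capacity=2):
        self.fetch_robots = fetch_robots  # hypothetical: host -> robots.txt text
        self.capacity = capacity
        self.rules = OrderedDict()        # oldest entry first
        self.fetches = 0                  # extra requests spent on robots.txt

    def allowed(self, host, url, agent="*"):
        parser = self.rules.get(host)
        if parser is None:
            # Cache miss: robots.txt must be fetched before any real content.
            parser = RobotFileParser()
            parser.parse(self.fetch_robots(host).splitlines())
            self.fetches += 1
            self.rules[host] = parser
            if len(self.rules) > self.capacity:
                self.rules.popitem(last=False)  # evict least recently used host
        else:
            self.rules.move_to_end(host)        # mark host as recently used
        return parser.can_fetch(agent, url)
```

With a `fetch_robots` that returns `"User-agent: *\nDisallow: /private/"` for a host, `allowed` accepts that host's public pages, rejects anything under `/private/`, and fetches the rules only once while the host stays in the cache.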
[Figure: protocol modules (HTTP, FTP), the RIS, the Content Seen test, and processing modules (Link Extractor, Tag Counter, GIF status)]
In Mercator, the same document is processed by multiple processing modules.
To avoid reading a document over the network multiple times, the document is cached locally in a RewindInputStream (RIS).
A RIS caches small documents (64 KB or less) entirely in memory, while larger documents are temporarily written to a backing file (limit 1 MB).
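The memory-versus-file split can be sketched as follows. `RewindStream` is a hypothetical Python analogue, not the Java class, and truncating documents at the 1 MB limit is an assumption of this sketch; the slide states only that the backing file has that limit.

```python
import io
import tempfile

MEMORY_LIMIT = 64 * 1024    # documents up to 64 KB stay in memory
FILE_LIMIT = 1024 * 1024    # backing file is capped at 1 MB

class RewindStream:
    """Holds one downloaded document so several modules can re-read it."""

    def __init__(self, data):
        data = data[:FILE_LIMIT]  # assumption: oversized documents are truncated
        self.in_memory = len(data) <= MEMORY_LIMIT
        if self.in_memory:
            self._buf = io.BytesIO(data)
        else:
            self._buf = tempfile.TemporaryFile()  # temporary backing file
            self._buf.write(data)
            self._buf.seek(0)

    def read(self, n=-1):
        return self._buf.read(n)

    def rewind(self):
        """Reset to the start so the next processing module sees the whole document."""
        self._buf.seek(0)

    def close(self):
        self._buf.close()
```

Each processing module reads the stream and the worker calls `rewind()` before handing it to the next module, so the document is downloaded exactly once.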
Many documents are available under multiple, different URLs, and there are also many cases in which documents are mirrored on multiple servers.
To prevent processing a document more than once, a web crawler may wish to perform a content-seen test to decide whether the document has already been processed.
To save space and time, Mercator uses a data structure called the document fingerprint set, which stores a 64-bit checksum of the contents of each downloaded document.
Mercator computes the checksum using Broder's implementation of Rabin's fingerprinting algorithm.
Fingerprints offer provably strong probabilistic guarantees that two different strings will not have the same fingerprint.
Steps in the Content Seen Test
1) Check whether the fingerprint is contained in the in-memory table. If not, go to step 2.
2) Check whether the fingerprint resides in the disk file (using interpolated binary search). If not, go to step 3.
3) Add the new fingerprint to the in-memory table.
4) If the in-memory hash table fills up, merge its contents into the disk file.
5) Update the in-memory index of the disk file.
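These steps can be sketched with two tiers. Several substitutions are made here and should not be read as Mercator's method: a 64-bit BLAKE2b digest stands in for Rabin fingerprints, a sorted Python list stands in for the sorted disk file, and plain binary search from `bisect` stands in for interpolated binary search. `FingerprintSet` and `table_capacity` are hypothetical names.

```python
import bisect
import hashlib

def fingerprint(content):
    """64-bit checksum of a document; BLAKE2b is a stand-in for Rabin fingerprints."""
    return int.from_bytes(hashlib.blake2b(content, digest_size=8).digest(), "big")

class FingerprintSet:
    def __init__(self, table_capacity=2):
        self.table = set()   # in-memory table of recent fingerprints
        self.disk = []       # sorted list standing in for the sorted disk file
        self.table_capacity = table_capacity

    def _on_disk(self, fp):
        i = bisect.bisect_left(self.disk, fp)   # binary search of the sorted "file"
        return i < len(self.disk) and self.disk[i] == fp

    def seen(self, content):
        """Return True if the document was seen before; otherwise record it."""
        fp = fingerprint(content)
        if fp in self.table:         # step 1: in-memory table
            return True
        if self._on_disk(fp):        # step 2: disk file
            return True
        self.table.add(fp)           # step 3: remember the new fingerprint
        if len(self.table) >= self.table_capacity:
            # steps 4 and 5: merge into the sorted "file"; with a real file,
            # the in-memory index would be rebuilt at this point
            self.disk = sorted(set(self.disk) | self.table)
            self.table.clear()
        return False
```

Memory use is bounded by `table_capacity` regardless of how many documents have been crawled, which is the scalability property the slides emphasize.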
The URL filtering mechanism provides a customizable way to control the set of URLs that are downloaded.
The URL filter class has a single crawl method that takes a URL and returns a Boolean value indicating whether or not to crawl that URL.
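A filter with a single crawl method could be written as below. This is a Python sketch of the idea rather than Mercator's Java class; `URLFilter` and `DomainFilter` are hypothetical names, and restricting a crawl to one domain is just one example policy.

```python
from urllib.parse import urlsplit

class URLFilter:
    """Base filter: accepts every URL."""

    def crawl(self, url):
        return True

class DomainFilter(URLFilter):
    """Accepts only URLs whose host is the given domain or one of its subdomains."""

    def __init__(self, domain):
        self.domain = domain.lower()

    def crawl(self, url):
        host = urlsplit(url).hostname or ""   # hostname is already lowercased
        return host == self.domain or host.endswith("." + self.domain)
```

For example, `DomainFilter("example.com").crawl("http://www.example.com/a")` is true, while a URL on `notexample.com` is rejected because the match requires a full label boundary.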
[Figure: DNS requests sent to the Internet through the Java interface versus through the multi-threaded DNS resolver interface]
Maps a web server's host name to an IP address.
For most web crawlers, DNS resolution is a well-documented bottleneck.
Caching DNS results is only partially effective, because the Java interface to DNS lookup is synchronized.
Mercator uses its own multi-threaded DNS resolver, which can resolve host names much more rapidly than either the Java or the Unix resolver.
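The difference between a synchronized lookup and a multi-threaded one can be illustrated with a small caching resolver. This is not Mercator's resolver: `CachingResolver`, `resolve_all`, and the injected `lookup` callable are hypothetical. The point of the sketch is that the lock guards only the cache, so slow lookups from different threads can overlap instead of queueing behind one another.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class CachingResolver:
    def __init__(self, lookup):
        self.lookup = lookup           # hypothetical: host name -> IP address string
        self.cache = {}
        self.lock = threading.Lock()   # guards the cache, not the lookup itself

    def resolve(self, host):
        with self.lock:
            if host in self.cache:
                return self.cache[host]
        ip = self.lookup(host)         # slow call runs outside the lock
        with self.lock:
            # If another thread resolved the host meanwhile, keep its answer.
            self.cache.setdefault(host, ip)
            return self.cache[host]

def resolve_all(resolver, hosts, workers=4):
    """Resolve many host names concurrently; returns {host: ip}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(hosts, pool.map(resolver.resolve, hosts)))
```

In a real crawler the standard `socket.gethostbyname` could be passed as `lookup`; a dictionary-backed function is enough to exercise the sketch without network access.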
To avoid downloading the same document multiple times, a URL-seen test must be performed on each extracted link.
To perform the URL-seen test, all of the URLs seen by Mercator are stored in canonical form in a large table called the URL set.
To save space, each URL is stored as a fixed-size checksum rather than in its text representation.
To reduce the number of operations on the backing disk file, Mercator keeps an in-memory cache of popular URLs.
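A sketch of the URL set with a small cache in front of the backing store follows. The names `URLSet` and `url_checksum` are hypothetical, a Python set stands in for the disk file, an 8-byte BLAKE2b digest stands in for Mercator's checksum, and least-recently-used eviction is an assumption of this sketch. URLs are assumed to be in canonical form already.

```python
import hashlib
from collections import OrderedDict

def url_checksum(url):
    """Fixed-size (8-byte) checksum stored instead of the URL text."""
    return hashlib.blake2b(url.encode("utf-8"), digest_size=8).digest()

class URLSet:
    def __init__(self, cache_size=2):
        self.backing = set()          # stands in for the backing disk file
        self.cache = OrderedDict()    # in-memory cache of recently seen URLs
        self.cache_size = cache_size
        self.backing_lookups = 0      # counts simulated disk operations

    def _remember(self, key):
        self.cache[key] = True
        self.cache.move_to_end(key)
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)   # evict least recently used entry

    def seen(self, url):
        """Return True if the URL was seen before; otherwise record it."""
        key = url_checksum(url)
        if key in self.cache:         # popular URL: no disk operation needed
            self.cache.move_to_end(key)
            return True
        self.backing_lookups += 1
        known = key in self.backing
        self.backing.add(key)
        self._remember(key)
        return known
```

Links to popular pages recur often, so most repeated URLs are answered from the cache and `backing_lookups` grows far more slowly than the number of tests performed.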
[Figure: Google and Internet Archive crawlers (Machine 1, Machine 2) versus Mercator (threads on Machine 1), each contacting web servers]
Google and Internet Archive crawlers
  Use single-threaded crawling processes and asynchronous I/O to perform multiple downloads in parallel.
  They are designed from the ground up to scale to multiple machines.
Mercator
  Uses a multi-threaded process in which each thread performs synchronous I/O.
  It would not be too difficult to adapt Mercator to run on multiple machines.
To complete a crawl of the entire Web despite interruptions, Mercator writes regular snapshots of its state to disk.
An interrupted or aborted crawl can easily be restarted from the latest checkpoint.
Mercator's core classes and all user-supplied modules are required to implement the checkpointing interface.
Checkpoints are coordinated using a global readers-writer lock.
Each worker thread acquires a read share of the lock while processing a downloaded document.
Once a day, Mercator's main thread acquires the write lock; once it holds the lock, it arranges for the checkpoint methods to be called on the core classes and all user-supplied modules.
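The coordination can be sketched with a simple readers-writer lock. Python's standard library has no such lock, so one is built here on `threading.Condition`; `ReadersWriterLock`, `CountingModule`, and `checkpoint_all` are hypothetical names, and the "snapshot" is just a returned value rather than a file on disk. This simple lock does not protect the writer from starvation.

```python
import threading

class ReadersWriterLock:
    """Many readers or one writer; built on a condition variable."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers > 0:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

class CountingModule:
    """Hypothetical module whose state is the number of documents processed."""

    def __init__(self, lock):
        self.lock = lock
        self.count = 0

    def process(self, doc):
        self.lock.acquire_read()     # read share held while processing a document
        try:
            self.count += 1
        finally:
            self.lock.release_read()

    def checkpoint(self):
        return {"count": self.count}

def checkpoint_all(lock, modules):
    """Main thread: wait for workers to finish their documents, then snapshot."""
    lock.acquire_write()
    try:
        return [m.checkpoint() for m in modules]
    finally:
        lock.release_write()
```

Because the writer waits until no read shares are held, every snapshot is taken between documents, never in the middle of one.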
Mercator can be:
1. Extended with new functionality
2. Reconfigured to use different versions of major components
Making an extensible system: key ingredients
  Define an interface to each of the system's components.
  Provide a mechanism for specifying how the system is configured from its various components.
  Provide sufficient infrastructure to make it easy to write new components.
Protocol and Processing Modules
  ◦ Processing modules that do more with documents than extract links
  ◦ Protocol modules for the FTP and Gopher protocols
Alternative URL Frontier Implementation
  ◦ Dynamically assigns hosts to worker threads
  ◦ With the default frontier, multiple hosts might be assigned to the same worker thread while other threads are left idle, typically on an intranet
Random Walker
  ◦ Start from a random page taken from a set of seeds
  ◦ Fetch the next page by choosing a random link from the current page
URL Aliases
  ◦ Host name aliases
  ◦ Omitted port numbers
  ◦ Alternative paths on the same host
  ◦ Replication across different hosts
Session IDs Embedded in URLs
  ◦ Session IDs create a potentially infinite set of URLs
Crawler Traps
  ◦ URLs that cause a crawler to crawl indefinitely
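Two of the aliasing hazards above, host name aliases and omitted port numbers, can be reduced by canonicalizing URLs before the URL-seen test. The sketch below is an illustration, not Mercator's canonicalization: `canonicalize`, `DEFAULT_PORTS`, and the optional `host_aliases` mapping are hypothetical, and any user-info part of the URL is dropped. It does not address alternative paths, replication across hosts, session IDs, or crawler traps.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}

def canonicalize(url, host_aliases=None):
    """Map equivalent spellings of a URL to one canonical string."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Hypothetical alias table: alias host name -> canonical host name.
    host = (host_aliases or {}).get(host, host)
    port = parts.port
    # An explicit default port and an omitted port mean the same thing.
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # The fragment never reaches the server, so it is dropped.
    return urlunsplit((scheme, netloc, path, parts.query, ""))
```

For instance, `canonicalize("HTTP://Example.COM:80/a")` and `canonicalize("http://example.com/a")` produce the same string, so the URL-seen test treats them as one URL.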
Performance (May 1999)
Crawler | HTTP requests | Days | Download rate | Download speed
Mercator | 77.4 million | 8 | 112 docs/sec | 1682 KB/sec
Google | 26 million | 9 | 33.5 docs/sec | 200 KB/sec
Internet Archive | 80 million | 8 | 46.3 docs/sec | 231 KB/sec
Selected Web statistics
Each URL removed from the frontier causes one HTTP request, with two exceptions related to robots.txt:
  If an appropriate version of the host's robots.txt is not already available, an extra HTTP request is required to fetch it.
  If robots.txt indicates that a document should not be downloaded, no HTTP request is made for that document.
80% of documents were between 1 KB and 32 KB in size.
8.5% of successful HTTP requests returned duplicates.
The paper describes the main components of any scalable crawler and their design alternatives.
Scalability: Mercator runs on machines with memory sizes ranging from 128 MB to 2 GB.
Extensibility: new functionality is added by writing new modules.
Q & A