Database System Laboratory
Mercator: A Scalable, Extensible Web Crawler
Allan Heydon and Marc Najork
International Journal of World Wide Web, v.2(4), p , Dec
May Sun Woo Kim

Contents
  - Extensibility
  - Crawler traps and other hazards
  - Results of an extended crawl
  - Conclusions

Extensibility
  - Extend with new functionality
    - New protocol and processing modules
    - Different versions of most of its major components
  - Ingredients
    - Interface: an abstract class
    - Mechanism: a configuration file
    - Infrastructure
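
The three ingredients can be sketched roughly as follows. This is a minimal Python illustration, not Mercator's code (Mercator is written in Java): the `ByteCounter` module, the `CONFIG` mapping, and the `load_analyzers` helper are all hypothetical names, and the configuration is an in-memory dictionary standing in for a configuration file.

```python
from abc import ABC, abstractmethod


class Analyzer(ABC):
    """Interface: every processing module implements process()."""

    @abstractmethod
    def process(self, url, body):
        ...


class ByteCounter(Analyzer):
    """Hypothetical module that totals the bytes it is shown."""

    def __init__(self):
        self.total = 0

    def process(self, url, body):
        self.total += len(body)


# Mechanism: a configuration maps MIME types to module class names.
# (Assumed format; a real crawler would read this from a file.)
CONFIG = {"text/html": "ByteCounter"}


def load_analyzers(config, namespace):
    """Infrastructure: resolve configured names to classes and instantiate them."""
    analyzers = {}
    for mime_type, class_name in config.items():
        cls = namespace[class_name]
        if not issubclass(cls, Analyzer):
            raise TypeError(f"{class_name} does not implement Analyzer")
        analyzers[mime_type] = cls()
    return analyzers


analyzers = load_analyzers(CONFIG, globals())
analyzers["text/html"].process("http://example.com/", b"<html></html>")
```

Swapping in a different module then only requires changing the configuration entry, not the code that dispatches documents.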

Protocol and processing modules
  - Abstract Protocol class
    - fetch method: download the document
    - newURL method: parse a given string as a URL
  - Abstract Analyzer class
    - process method: process the downloaded document appropriately
  - Different Analyzer subclasses
    - GifStats
    - TagCounter
    - WebLinter: runs the Weblint program
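
A rough Python rendering of the two abstract classes, with one subclass of each, might look like this. The method names follow the slide (`newURL` becomes `new_url`); the in-memory `MemoryProtocol` is a hypothetical stand-in used so the sketch runs without network access, and this `TagCounter` is only a guess at what the module of that name does.

```python
from abc import ABC, abstractmethod
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlsplit


class Protocol(ABC):
    @abstractmethod
    def fetch(self, url):
        """Download the document at url and return its bytes."""

    @abstractmethod
    def new_url(self, text):
        """Parse a string into a URL for this protocol."""


class Analyzer(ABC):
    @abstractmethod
    def process(self, url, body):
        """Process a downloaded document."""


class MemoryProtocol(Protocol):
    """Hypothetical protocol that serves pages from a dictionary."""

    def __init__(self, pages):
        self.pages = pages

    def fetch(self, url):
        return self.pages[url]

    def new_url(self, text):
        parts = urlsplit(text.strip())
        if not parts.scheme or not parts.netloc:
            raise ValueError(f"not an absolute URL: {text!r}")
        return parts.geturl()


class _TagParser(HTMLParser):
    def __init__(self, counts):
        super().__init__()
        self.counts = counts

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


class TagCounter(Analyzer):
    """Counts how often each HTML start tag occurs."""

    def __init__(self):
        self.counts = Counter()

    def process(self, url, body):
        parser = _TagParser(self.counts)
        parser.feed(body.decode("utf-8", errors="replace"))
        parser.close()


protocol = MemoryProtocol(
    {"http://example.com/a": b"<html><body><p>one</p><p>two</p></body></html>"}
)
url = protocol.new_url(" http://example.com/a ")
counter = TagCounter()
counter.process(url, protocol.fetch(url))
```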

Alternative URL frontier
  - Drawback on an intranet
    - Multiple hosts might be assigned to the same thread
  - Solution
    - A URL frontier component that dynamically assigns hosts to worker threads
    - Maximizes the number of busy worker threads
    - Is well suited to host-limited crawls
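
One way to picture dynamic host assignment is a frontier that keeps a queue per host and hands any idle host's next URL to whichever worker asks. The sketch below is an assumption about how such a component could work, not a description of Mercator's implementation; the class and method names are hypothetical.

```python
import threading
from collections import deque
from urllib.parse import urlsplit


class DynamicHostFrontier:
    """Per-host queues; a host is served by at most one worker at a time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._queues = {}    # host -> deque of pending URLs
        self._busy = set()   # hosts currently being fetched from

    def add(self, url):
        host = urlsplit(url).hostname
        with self._lock:
            self._queues.setdefault(host, deque()).append(url)

    def get(self):
        """Return a URL from any idle host, or None if none is available."""
        with self._lock:
            for host, queue in self._queues.items():
                if queue and host not in self._busy:
                    self._busy.add(host)
                    return queue.popleft()
        return None

    def done(self, url):
        """Mark the URL's host idle again once the download has finished."""
        host = urlsplit(url).hostname
        with self._lock:
            self._busy.discard(host)
            if not self._queues.get(host):
                self._queues.pop(host, None)


frontier = DynamicHostFrontier()
for u in ("http://a.example/1", "http://a.example/2", "http://b.example/1"):
    frontier.add(u)

first = frontier.get()    # a.example is idle, so its first URL is handed out
second = frontier.get()   # a.example is busy, so the worker gets b.example
third = frontier.get()    # every host with pending work is busy
frontier.done(first)
fourth = frontier.get()   # a.example is idle again
```

Because no host is bound to a particular thread, a worker is only left idle when every host with pending URLs is already being fetched from.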

As a random walker
  - Random walker
    - Starts at a random page taken from a set of seeds
    - The next page is selected by choosing a random link from the current page
  - Differences from an ordinary crawl
    - A page may be revisited multiple times
    - Only one link is followed each time
  - To support random walking
    - A new URL frontier
      - Records only the URLs discovered in the most recently fetched file
    - A new document fingerprint set
      - Never rejects documents as already having been seen
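
The two replacement components can be sketched in a few lines of Python. All names here are hypothetical, the toy `web` dictionary stands in for real downloads, and falling back to the seeds when a page has no links is an assumption the slide does not state.

```python
import random


class RandomWalkFrontier:
    """Remembers only the links of the most recently fetched page."""

    def __init__(self, seeds, rng=None):
        self._seeds = list(seeds)
        self._rng = rng or random.Random()
        self._links = []

    def record_links(self, links):
        # Links from earlier pages are discarded, not queued.
        self._links = list(links)

    def next_url(self):
        # Assumption: restart from a random seed at a dead end.
        pool = self._links or self._seeds
        return self._rng.choice(pool)


class NeverSeenSet:
    """Fingerprint set that never reports a document as already seen."""

    def add(self, fingerprint):
        pass

    def contains(self, fingerprint):
        return False


# Toy link graph: page -> outgoing links.
web = {"a": ["b"], "b": ["a"]}

frontier = RandomWalkFrontier(seeds=["a"])
seen = NeverSeenSet()
path = []
for _ in range(4):
    page = frontier.next_url()
    if not seen.contains(page):      # always true, so revisits are processed
        path.append(page)
        frontier.record_links(web[page])
```

With this two-page graph the walk alternates between the pages, which shows both differences from an ordinary crawl: pages are revisited, and only one link is followed per step.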

URL aliases
  - Four causes
    - Host name aliases: canonicalize
      - coke.com and cocacola.com
    - Omitted port numbers: default value 80
    - Alternative paths on the same host: cannot avoid
      - digital.com/index.html and digital.com/home.html
    - Replication across different hosts: cannot avoid
      - Mirror sites
  - Aliases that cannot be avoided are handled by the content-seen test
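
The two avoidable causes, host name aliases and omitted port numbers, can be handled by canonicalizing URLs before they enter the frontier. The following is a minimal sketch: the `HOST_ALIASES` table is hypothetical (including which of the two names is treated as canonical), and a real crawler would more likely derive canonical host names from DNS than from a hand-written table.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical alias table: alias host -> canonical host.
HOST_ALIASES = {"cocacola.com": "coke.com"}

DEFAULT_PORTS = {"http": 80}


def canonicalize(url):
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    host = HOST_ALIASES.get(host, host)
    port = parts.port
    # An explicit default port and an omitted port name the same server.
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    return urlunsplit((scheme, netloc, parts.path, parts.query, parts.fragment))


same_a = canonicalize("http://CocaCola.com:80/index.html")
same_b = canonicalize("http://coke.com/index.html")
other = canonicalize("http://digital.com:8080/home.html")
```

Alternative paths and mirror sites are untouched by this step, which is why the content-seen test is still needed.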

Session IDs embedded in URLs
  - Session identifiers
    - Web sites use them to track the browsing behavior of their visitors
    - Create a potentially infinite set of URLs
    - Represent a special case of alternative paths
  - Handled by the document fingerprinting technique
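
A content-seen test based on document fingerprints can be sketched as below. The class name is hypothetical, and a truncated SHA-256 digest is used here purely as a stand-in for whatever fingerprint function the crawler actually uses.

```python
import hashlib


class ContentSeenTest:
    """Remembers a short fingerprint of every document body seen so far."""

    def __init__(self):
        self._seen = set()

    def seen_before(self, body):
        # Stand-in fingerprint: first 8 bytes (64 bits) of SHA-256.
        fingerprint = hashlib.sha256(body).digest()[:8]
        if fingerprint in self._seen:
            return True
        self._seen.add(fingerprint)
        return False


test = ContentSeenTest()
# Two URLs that differ only in a session ID but return the same bytes.
first = test.seen_before(b"<html>same page</html>")
second = test.seen_before(b"<html>same page</html>")
third = test.seen_before(b"<html>another page</html>")
```

The test keys on content rather than on the URL, so a page reached through many session-specific URLs is processed only once, although it may still be downloaded more than once.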

Crawler traps
  - Crawler trap
    - Causes a crawler to crawl indefinitely
    - Unintentional: symbolic link
    - Intentional: trap using CGI programs
      - Antispam traps, traps to catch search engine crawlers
  - Solution
    - No automatic technique
    - But traps are easily noticed
    - Manually exclude the site
      - Using the customizable URL filter
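
Manual exclusion through a URL filter could look like the sketch below. The `HostExclusionFilter` class, its `allow` method, and the `trap.example` host are hypothetical; the slide only says that the URL filter is customizable.

```python
from urllib.parse import urlsplit


class HostExclusionFilter:
    """Rejects URLs on operator-listed hosts and on their subdomains."""

    def __init__(self, excluded_hosts):
        self._excluded = {h.lower() for h in excluded_hosts}

    def allow(self, url):
        host = (urlsplit(url).hostname or "").lower()
        return not any(
            host == bad or host.endswith("." + bad) for bad in self._excluded
        )


# The operator noticed a trap on trap.example and excludes the whole site.
url_filter = HostExclusionFilter(["trap.example"])
blocked = url_filter.allow("http://trap.example/a/a/a/a/index.html")
blocked_sub = url_filter.allow("http://www.trap.example/")
allowed = url_filter.allow("http://nottrap.example/")
```

The filter is applied before a URL is added to the frontier, so an excluded site stops generating work as soon as the operator adds it to the list.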

Performance
  - Digital Ultimate Workstation
    - Two 533 MHz Alpha processors
    - 2 GB of RAM and 118 GB of local disk
  - Run in May
    - million HTTP requests in 8 days
    - 112 docs/sec and 1,682 KB/sec
  - CPU cycles
    - 37%: JIT-compiled Java bytecode
    - 19%: Java runtime
    - 44%: Unix kernel

Selected Web statistics (1)
  Relationship between URLs and HTTP requests

      No. of URLs removed            76,732,515
    + No. of robots.txt requests      3,675,634
    - No. of excluded URLs            3,050,768
    = No. of HTTP requests           77,357,381
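
The arithmetic above can be checked directly, and the total can be compared against the throughput figure on the Performance slide. The rate calculation assumes the crawl lasted exactly 8 days and that every HTTP request counts as one document; both are simplifying assumptions.

```python
urls_removed = 76_732_515
robots_requests = 3_675_634
excluded_urls = 3_050_768

# URLs taken from the frontier, plus robots.txt fetches,
# minus URLs that robots.txt rules excluded.
http_requests = urls_removed + robots_requests - excluded_urls

seconds = 8 * 24 * 60 * 60          # assumed: exactly 8 days
requests_per_second = http_requests / seconds
```

The resulting rate of roughly 112 requests per second agrees with the 112 docs/sec reported for the run.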

Selected Web statistics (2)
  Breakdown of HTTP status codes

    Code   Meaning                  Number     Percent
    200    OK                       65,790,    %
    404    Not found                5,617,     %
    302    Moved temporarily        2,517,     %
    301    Moved permanently        842,       %
    403    Forbidden                322,       %
    401    Unauthorized             223,       %
    500    Internal server error    83,        %
    406    Not acceptable           81,        %
    400    Bad request              65,        %
    Other                           48,        %
    Total                           75,593,    %

  Slide annotation: relatively low

Selected Web statistics (3)
  Size of successfully downloaded documents
  [Figure not recovered; surviving annotation: 80%]

Selected Web statistics (4)
  Distribution of MIME types

    MIME type                  Number        Percent
    text/html                  41,490,       %
    image/gif                  10,729,       %
    image/jpeg                 4,846,257     8.1%
    text/plain                 869,911       1.5%
    application/pdf            540,656       0.9%
    audio/x-pn-realaudio       269,384       0.4%
    application/zip            213,089       0.4%
    application/postscript     159,869       0.3%
    other                      829,410       1.4%
    Total                      59,947,       %

Conclusions
  - Use of Java
    - Made implementation easier and more elegant
    - Threads, garbage collection, objects, exceptions, etc.
  - Scalability
  - Extensibility

Fin.