
Database System Laboratory Mercator: A Scalable, Extensible Web Crawler Allan Heydon and Marc Najork International Journal of World Wide Web, v.2(4), p.219-229,


1 Mercator: A Scalable, Extensible Web Crawler
Allan Heydon and Marc Najork
International Journal of World Wide Web, v.2(4), p.219-229, Dec. 1999
Database System Laboratory
Presented May 23, 2006 by Sun Woo Kim

2 Content
- Extensibility
- Crawler traps and other hazards
- Results of an extended crawl
- Conclusions

3 Extensibility
- Extend with new functionality
  - New protocol and processing modules
  - Different versions of most of its major components
- Ingredients
  - Interface → an abstract class
  - Mechanism → a configuration file
  - Infrastructure

4 Protocol and processing modules
- Abstract Protocol class
  - fetch method: download the document
  - newURL method: parse a given string
- Abstract Analyzer class
  - process method: process it appropriately
- Different Analyzer subclasses
  - GifStats
  - TagCounter
  - WebLinter: runs the Weblint program
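The Protocol/Analyzer split on this slide can be illustrated with a small sketch. Mercator itself is written in Java; the Python below only mirrors the shape of the two abstract classes. The method signatures, the name `new_url` (for the slide's newURL), and the body of `TagCounter` are assumptions made for illustration, not the authors' code.

```python
from abc import ABC, abstractmethod
from collections import Counter
from html.parser import HTMLParser


class Protocol(ABC):
    """Hypothetical counterpart of the abstract Protocol class."""

    @abstractmethod
    def fetch(self, url: str) -> bytes:
        """Download the document named by `url`."""

    @abstractmethod
    def new_url(self, text: str) -> str:
        """Parse a given string into a URL for this protocol."""


class Analyzer(ABC):
    """Hypothetical counterpart of the abstract Analyzer class."""

    @abstractmethod
    def process(self, url: str, document: bytes) -> None:
        """Process a downloaded document appropriately."""


class TagCounter(Analyzer):
    """Assumed behaviour: count how often each HTML start tag occurs."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def process(self, url: str, document: bytes) -> None:
        counts = self.counts

        class _Parser(HTMLParser):
            def handle_starttag(self, tag, attrs):
                counts[tag] += 1

        _Parser().feed(document.decode("utf-8", errors="replace"))
```

A new processing module is then just another Analyzer subclass; the crawler core only ever calls `process`.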

5 Alternative URL frontier
- Drawback on an intranet
  - Multiple hosts might be assigned to the same thread
- Solution
  - A URL frontier component that dynamically assigns hosts to worker threads
  - Maximizes the number of busy worker threads
  - Is well suited to host-limited crawls
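One way to picture dynamic host assignment is a frontier that keeps one queue per host and hands any idle host to whichever worker asks next. This is a single-threaded sketch under that assumption; the class and method names are hypothetical, and a real crawler would need locking around the shared state.

```python
from collections import deque


class DynamicHostFrontier:
    """Hypothetical frontier: one URL queue per host, hosts handed out on demand."""

    def __init__(self):
        self._queues = {}   # host -> deque of pending URLs
        self._busy = set()  # hosts currently checked out by a worker

    def add(self, host, url):
        self._queues.setdefault(host, deque()).append(url)

    def checkout(self):
        """Return (host, url) for some idle host with pending URLs, else None."""
        for host, queue in self._queues.items():
            if queue and host not in self._busy:
                self._busy.add(host)
                return host, queue.popleft()
        return None

    def release(self, host):
        """Mark the host idle again so another worker may pick it up."""
        self._busy.discard(host)
```

Because no host is bound to a fixed thread, a worker stays idle only when every host with pending URLs is already being fetched.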

6 As a random walker
- Random walker
  - Starts at a random page taken from a set of seeds
  - The next page is selected by choosing a random link
- Differences
  - A page may be revisited multiple times
  - Only one link is followed each time
- To support random walking
  - A new URL frontier
    - Records only the URLs discovered in the most recently fetched file
  - Document fingerprint set
    - Never rejects documents as already having been seen
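The random-walk behaviour can be sketched in a few lines. Here `link_graph` is a hypothetical in-memory stand-in for fetching a page and extracting its links, and restarting from a seed at a dead end is an assumption the slide does not state.

```python
import random


def random_walk(link_graph, seeds, steps, rng=None):
    """Follow one randomly chosen link per step; pages may be revisited."""
    if rng is None:
        rng = random.Random()
    page = rng.choice(seeds)
    visited = [page]
    for _ in range(steps):
        # The "frontier" holds only the links of the most recently fetched page.
        links = link_graph.get(page, [])
        # Assumption: restart from a random seed when the page has no links.
        page = rng.choice(links) if links else rng.choice(seeds)
        # No content-seen check here: revisits are never rejected.
        visited.append(page)
    return visited
```

For example, `random_walk({"a": ["b"], "b": ["a"]}, ["a"], 4)` alternates between the two pages, visiting each more than once.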

7 URL aliases
- Four causes
  - Host name aliases → canonicalize
    - coke.com and cocacola.com → 203.134.241.178
  - Omitted port numbers → default value: 80
  - Alternative paths on the same host → cannot avoid
    - digital.com/index.html and digital.com/home.html
  - Replication across different hosts → cannot avoid
    - Mirror sites
- Cannot avoid → content-seen test
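The first two causes can be handled by rewriting URLs into a canonical form. The sketch below assumes a caller-supplied `resolve_host` function (hypothetical; a real crawler would consult DNS) and always writes the port explicitly, using 80 for HTTP as on the slide.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80}  # default from the slide


def canonicalize(url, resolve_host=None):
    """Map host aliases to one name and make the port explicit."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if resolve_host is not None:
        host = resolve_host(host)  # hypothetical alias lookup
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    netloc = host if port is None else f"{host}:{port}"
    return urlunsplit((scheme, netloc, parts.path, parts.query, parts.fragment))


# Alias table built from the slide's example; stands in for a DNS lookup.
ALIASES = {"coke.com": "203.134.241.178", "cocacola.com": "203.134.241.178"}
same = (canonicalize("http://coke.com/", lambda h: ALIASES.get(h, h))
        == canonicalize("http://cocacola.com:80/", lambda h: ALIASES.get(h, h)))
```

Alternative paths and mirror sites survive this step unchanged, which is why the slide falls back on the content-seen test for them.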

8 Session IDs embedded in URLs
- Session identifiers
  - Used to track the browsing behavior of visitors
  - Create a potentially infinite set of URLs
  - Represent a special case of alternative paths
- Handled by the document fingerprinting technique
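A content-seen test based on document fingerprints might look like the following sketch. The fingerprint function here is truncated SHA-1, chosen only because it ships with Python; it is a stand-in, not the function Mercator uses, and the class name is hypothetical.

```python
import hashlib


class ContentSeenTest:
    """Remember a short fingerprint of every document processed so far."""

    def __init__(self):
        self._fingerprints = set()

    def seen_before(self, document: bytes) -> bool:
        # Stand-in fingerprint: first 8 bytes of SHA-1 over the content.
        fingerprint = hashlib.sha1(document).digest()[:8]
        if fingerprint in self._fingerprints:
            return True
        self._fingerprints.add(fingerprint)
        return False
```

Two URLs that differ only in their session ID but return byte-identical content produce the same fingerprint, so the second copy is recognized even though its URL is new.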

9 Crawler traps
- Crawler trap
  - Causes a crawler to crawl indefinitely
  - Unintentional: symbolic link
  - Intentional: trap using CGI programs
    - Antispam traps, traps to catch search engine crawlers
- Solution
  - No automatic technique
  - But traps are easily noticed
  - Manually exclude the site
  - Using the customizable URL filter
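Manual exclusion through a URL filter can be sketched as a host blocklist. The class name, the `allows` method, and the example host are hypothetical; the slide only says that the filter is customizable.

```python
from urllib.parse import urlsplit


class HostExclusionFilter:
    """Reject every URL whose host has been manually excluded."""

    def __init__(self, excluded_hosts):
        self.excluded_hosts = {host.lower() for host in excluded_hosts}

    def allows(self, url: str) -> bool:
        host = (urlsplit(url).hostname or "").lower()
        return host not in self.excluded_hosts


# An operator who notices a trap adds its host to the list by hand.
url_filter = HostExclusionFilter(["trap.example"])
```

With this filter, `url_filter.allows("http://trap.example/a/a/a/")` is false while URLs on other hosts still pass.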

10 Performance
- Digital Ultimate Workstation
  - Two 533 MHz Alpha processors
  - 2 GB of RAM and 118 GB of local disk
- Run in May 1999
  - 77.4 million HTTP requests in 8 days
  - 112 docs/sec and 1,682 KB/sec
- CPU cycles
  - 37%: JIT-compiled Java bytecode
  - 19%: Java runtime
  - 44%: Unix kernel
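The reported rate can be cross-checked against the totals on the slide. The calculation assumes the crawl ran for exactly 8 full days, which the slide states only approximately; the per-document size is derived here and does not appear on the slide.

```python
total_requests = 77.4e6              # HTTP requests, from the slide
crawl_seconds = 8 * 24 * 60 * 60     # assumption: exactly 8 days

docs_per_second = total_requests / crawl_seconds
# Derived figure: average transfer per document, from the two reported rates.
kb_per_document = 1682 / 112

print(f"{docs_per_second:.1f} docs/sec, about {kb_per_document:.0f} KB per document")
```

The result rounds to the 112 docs/sec on the slide and implies roughly 15 KB per document on average.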

11 Selected Web statistics (1)
Relationship between URLs and HTTP requests

    No. of URLs removed            76,732,515
  + No. of robots.txt requests      3,675,634
  - No. of excluded URLs            3,050,768
  = No. of HTTP requests           77,357,381
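The relationship is plain arithmetic and can be verified directly from the four figures; the variable names are chosen here for readability.

```python
urls_removed = 76_732_515
robots_txt_requests = 3_675_634
excluded_urls = 3_050_768

# URLs removed, plus robots.txt fetches, minus excluded URLs.
http_requests = urls_removed + robots_txt_requests - excluded_urls
print(http_requests)
```

The sum matches the 77,357,381 HTTP requests listed on the slide.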

12 Selected Web statistics (2)
Breakdown of HTTP status codes

Code    Meaning                  Number        Percent
200     OK                       65,790,953    87.03%
404     Not found                 5,617,491     7.43%
302     Moved temporarily         2,517,705     3.33%
301     Moved permanently           842,875     1.12%
403     Forbidden                   322,042     0.43%
401     Unauthorized                223,843     0.30%
500     Internal server error        83,744     0.11%
406     Not acceptable               81,091     0.11%
400     Bad request                  65,159     0.09%
Other                                48,628     0.06%
Total                            75,593,531   100.0%

(relatively low)

13 Selected Web statistics (3)
Size of successfully downloaded documents
[Figure not reproduced; the slide marks 80%]

14 Selected Web statistics (4)
Distribution of MIME types

MIME type                 Number        Percent
text/html                 41,490,044    69.2%
image/gif                 10,729,326    17.9%
image/jpeg                 4,846,257     8.1%
text/plain                   869,911     1.5%
application/pdf              540,656     0.9%
audio/x-pn-realaudio         269,384     0.4%
application/zip              213,089     0.4%
application/postscript       159,869     0.3%
other                        829,410     1.4%
Total                     59,947,946   100.0%

15 Conclusions
- Use of Java
  - Made implementation easier and more elegant
  - Threads, garbage collection, objects, exceptions, etc.
- Scalability
- Extensibility

Fin.

