Search engine structure [diagram]: Web Crawler, Page archive, Page Analyzer, Control, Query resolver, Ranker, Indexer (text, structure, and auxiliary indexes).
Information Retrieval: Crawling
The Web’s Characteristics
Size: billions of pages are available; at 5-40 KB per page, that is hundreds of terabytes, and the size grows every day!
Change: about 8% new pages and 25% new links appear each week; the life time of a page is about 10 days.
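A back-of-the-envelope check of the "hundreds of terabytes" estimate (the figures are only indicative):

\[
8 \times 10^{9}\ \text{pages} \times 25\ \text{KB/page} \approx 2 \times 10^{14}\ \text{bytes} = 200\ \text{TB}
\]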
Spidering: 24 hours a day, 7 days a week, "walking" over a graph and getting data.
What about the graph? It has a bow-tie shape. Directed graph G = (N, E):
- N changes (inserts, deletes): >> 8 × 10^9 nodes
- E changes (inserts, deletes): > 10 links per node, i.e. about 10 · 8 × 10^9 = 8 × 10^10 non-zero entries in the adjacency matrix
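A rough comparison, using the figures above, of storing the graph as a full adjacency matrix versus adjacency lists (back-of-the-envelope only):

\[
|N| \approx 8 \times 10^{9}, \qquad |E| \approx 10 \cdot |N| = 8 \times 10^{10}
\]
\[
\text{full adjacency matrix: } |N|^2 \approx 6.4 \times 10^{19}\ \text{cells} \qquad \text{vs.} \qquad \text{adjacency lists: } |E| \approx 8 \times 10^{10}\ \text{entries}
\]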
A Picture of the Web Graph [adjacency-matrix plot, axes i and j]: 21 million pages, 150 million links. Q: sparse or not sparse? (150 million non-zero entries out of 21M × 21M possible cells: extremely sparse.)
A special sorting [plot]: URLs sorted lexicographically, so that pages of the same host (e.g., Stanford, Berkeley) end up close to each other.
Crawler "cycle of life" [diagram: Link Extractor → PQ (Priority Queue) → Crawler Manager → AR (Assigned Repositories) → Downloaders → PR (Page Repository) → Link Extractor]

Link Extractor:
while (<PR is not empty>) {
  <take a page p from the Page Repository PR>
  <extract the links contained in p>
  <insert the extracted links into the Priority Queue PQ>
}

Crawler Manager:
while (<PQ is not empty>) {
  <extract some URLs u with the highest priority>
  foreach u extracted {
    if ( (u ∉ "Already Seen Pages") || (u ∈ "Already Seen Pages" && <its copy is too old>) ) {
      <resolve u with respect to DNS>
      <send u to the Assigned Repository AR of some downloader>
    }
  }
}

Downloaders:
while (<AR is not empty>) {
  <extract a URL u and download page(u)>
  <send page(u) to the Page Repository PR>
  <store page(u) in a proper archive, possibly compressed>
}
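The same cycle can be sketched as a minimal single-process Python program. The FIFO frontier, the regex link extractor, and the in-memory archive are illustrative assumptions, not the slides' architecture, which uses separate Link Extractor, Crawler Manager, and Downloader processes communicating through the PQ/AR/PR repositories.

```python
import re
import urllib.request
from collections import deque

# One process plays all three roles (Link Extractor, Crawler Manager,
# Downloader) in turn. Names and details are illustrative assumptions.

LINK_RE = re.compile(rb'href="(http[^"]+)"')

def fetch_page(url, timeout=5):
    """Downloader: fetch the raw bytes of a page (errors -> empty page)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except Exception:
        return b""

def extract_links(page):
    """Link Extractor: pull absolute http(s) links out of the raw page."""
    return [m.decode("ascii", "ignore") for m in LINK_RE.findall(page)]

def crawl(seeds, max_pages=100):
    pq = deque(seeds)          # PQ: frontier of URLs to visit (FIFO = BFS order)
    already_seen = set(seeds)  # the Crawler Manager's "Already Seen Pages" test
    archive = {}               # archive of downloaded pages, keyed by URL

    while pq and len(archive) < max_pages:
        url = pq.popleft()                  # pick the next URL to fetch
        page = fetch_page(url)              # download it ...
        archive[url] = page                 # ... and store it in the archive
        for link in extract_links(page):    # extract its out-links
            if link not in already_seen:    # enqueue only unseen URLs
                already_seen.add(link)
                pq.append(link)
    return archive

# Example: archive = crawl(["https://example.com/"], max_pages=10)
```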
Crawling Issues
- How to crawl? Quality ("best" pages first), efficiency (avoid duplication or near-duplication), etiquette (respect robots.txt, minimize server load).
- How much to crawl? How much to index? Coverage (how big is the Web? how much do we cover?) and relative coverage (how much do competitors have?).
- How often to crawl? Freshness: how much has changed?
- How to parallelize the process?
Page selection
Given a page P, define how "good" P is. Several metrics:
- BFS, DFS, random
- Popularity-driven (PageRank, full vs. partial)
- Topic-driven or focused crawling
- Combined
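A minimal sketch of how such metrics could plug into the URL frontier. The Frontier class and the score function are hypothetical names used for illustration: with a constant score the frontier degenerates to plain BFS order, while a popularity- or topic-driven crawl would supply a more meaningful score.

```python
import heapq
import itertools

class Frontier:
    """URL frontier ordered by a pluggable 'goodness' score (higher = better).

    The score function is an illustrative assumption: a constant gives plain
    BFS (insertion) order, a partial PageRank estimate gives popularity-driven
    crawling, a topic-similarity value gives focused crawling.
    """

    def __init__(self, score):
        self.score = score
        self.heap = []                     # min-heap of (-score, tie, url)
        self.counter = itertools.count()   # tie-breaker keeps FIFO among equals

    def push(self, url):
        heapq.heappush(self.heap, (-self.score(url), next(self.counter), url))

    def pop(self):
        return heapq.heappop(self.heap)[2]

    def __bool__(self):
        return bool(self.heap)

# BFS-like frontier: all scores equal, so insertion (FIFO) order decides.
bfs_frontier = Frontier(score=lambda url: 0)

# Crude "popularity" stand-in: prefer shorter URLs (closer to a site's root).
popularity_frontier = Frontier(score=lambda url: -len(url))
```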
BFS: "…BFS-order discovers the highest quality pages during the early stages of the crawl" (328 million URLs in the testbed) [Najork 01]
Is this page a new one?
Check whether the URL has been parsed or downloaded before:
- after 20 million pages, we have "seen" over 200 million URLs
- each URL is 50 to 75 bytes on average
- overall, about 10 GB of URLs
Options: compress the URLs in main memory, or use disk:
- Bloom filter (Archive)
- disk access with caching (Mercator, AltaVista)
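A minimal Bloom-filter sketch for the "already seen" test. The filter size, the number of hash functions, and the SHA-256 slicing are illustrative assumptions, not the parameters used by the Archive or Mercator; the price of the compactness is a small rate of false positives (a few unseen URLs are wrongly skipped), with no false negatives.

```python
import hashlib

class BloomFilter:
    """Compact, probabilistic 'already seen?' set: no false negatives,
    a tunable rate of false positives (URLs wrongly reported as seen)."""

    def __init__(self, num_bits=8 * 1024 * 1024, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, url):
        # Derive k bit positions from one SHA-256 digest (illustrative choice).
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            chunk = digest[4 * i: 4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))

seen = BloomFilter()
seen.add("http://example.com/")
print("http://example.com/" in seen)    # True
print("http://example.com/a" in seen)   # False (almost surely)
```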
Parallel Crawlers
The Web is too big to be crawled by a single crawler; the work should be divided while avoiding duplication.
- Dynamic assignment: a central coordinator dynamically assigns URLs to crawlers; extracted links are sent back to the central coordinator.
- Static assignment: the Web is statically partitioned and assigned to crawlers; each crawler crawls only its own part of the Web.
Two problems
Let D be the number of downloaders. hash(URL) maps a URL to [0, D); downloader x fetches the URLs U such that hash(U) ∈ [x-1, x).
- Load balancing the number of URLs assigned to the downloaders: static schemes based on hosts may fail; dynamic "relocation" schemes may be complicated.
- Managing fault tolerance: what about the death of a downloader? D → D-1, new hash!!! What about a new downloader? D → D+1, new hash!!!
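A quick, illustrative experiment (made-up URLs, SHA-1 as the hash) showing why "new hash!!!" hurts: with a plain hash-into-[0, D) assignment, changing D reassigns almost every URL to a different downloader.

```python
import hashlib

def bucket(url, num_downloaders):
    """Assign a URL to a downloader by hashing it into [0, D)."""
    h = int.from_bytes(hashlib.sha1(url.encode()).digest()[:8], "big")
    return h % num_downloaders

urls = [f"http://site{i}.example/page{j}" for i in range(100) for j in range(100)]

moved = sum(1 for u in urls if bucket(u, 10) != bucket(u, 11))
print(f"{moved / len(urls):.0%} of URLs change downloader when D goes 10 -> 11")
# Roughly 90% of the URLs move, i.e. nearly all per-downloader state
# (queues, seen-sets, politeness timers) would have to be reshuffled.
```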
A nice technique: Consistent Hashing
A tool for: spidering, Web caching, P2P, routers, load balancing, distributed file systems.
- Items and servers are mapped to IDs (hashes of m bits); servers are mapped onto a unit circle.
- Item k is assigned to the first server with ID ≥ k.
- What if a downloader goes down? What if a new downloader appears? Only the items adjacent to it on the circle are reassigned.
Theorem. Given S servers and I items, map (log S) copies of each server and the I items on the unit circle. Then
[load] any server gets ≤ (I/S) log S items
[spread] any URL is stored in ≤ (log S) servers
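A compact consistent-hashing sketch; the hash function, the number of virtual copies per server, and the server/URL names are illustrative assumptions. Adding or removing a downloader only moves the items that fall next to it on the circle, roughly I/S of them, instead of almost all of them as with the scheme above.

```python
import bisect
import hashlib

def ring_hash(key):
    """Map a string to a point on the circle (here: a 32-bit integer ring)."""
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:4], "big")

class ConsistentHash:
    def __init__(self, servers, replicas=16):
        # Each server gets `replicas` virtual copies on the ring; the theorem
        # uses Theta(log S) copies, 16 here is an arbitrary illustrative value.
        self.replicas = replicas
        self.ring = []                      # sorted list of (point, server)
        for s in servers:
            self.add_server(s)

    def add_server(self, server):
        for i in range(self.replicas):
            bisect.insort(self.ring, (ring_hash(f"{server}#{i}"), server))

    def remove_server(self, server):
        self.ring = [(p, s) for (p, s) in self.ring if s != server]

    def assign(self, url):
        """Item -> first server copy with ID >= hash(item), wrapping around."""
        point = ring_hash(url)
        idx = bisect.bisect_left(self.ring, (point, ""))
        if idx == len(self.ring):
            idx = 0
        return self.ring[idx][1]

urls = [f"http://example.org/page{i}" for i in range(10000)]
ch = ConsistentHash(["d0", "d1", "d2", "d3"])
before = {u: ch.assign(u) for u in urls}
ch.add_server("d4")                         # a new downloader appears
moved = sum(1 for u in urls if ch.assign(u) != before[u])
print(f"{moved / len(urls):.0%} of URLs moved")   # ~20% (about I/S), not ~90%
```

The virtual copies (the (log S) copies of the theorem) spread each server around the circle, which is what yields the [load] and [spread] bounds above.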
Examples: open source
- Nutch, also used by Overture
- Heritrix, used by Archive.org