Crawling
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Reading: 20.1, 20.2 and 20.3
Spidering
24h, 7 days "walking" over a graph. What about the graph? Bow-tie structure.
Directed graph G = (N, E):
  N changes (insert, delete): >> 50 * 10^9 nodes
  E changes (insert, delete): > 10 links per node
  10 * 50 * 10^9 = 500 * 10^9 1-entries in the adjacency matrix
Crawling Issues
How to crawl?
  Quality: "best" pages first
  Efficiency: avoid duplication (or near-duplication)
  Etiquette: robots.txt, server-load concerns (minimize load)
How much to crawl? How much to index?
  Coverage: how big is the Web? How much do we cover?
  Relative coverage: how much do competitors have?
How often to crawl?
  Freshness: how much has changed?
How to parallelize the process?
Page selection
Given a page P, define how "good" P is. Several metrics:
  BFS, DFS, random
  Popularity-driven (PageRank, full vs. partial)
  Topic-driven or focused crawling
  Combined
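A minimal sketch (not from the slides) of a quality-driven frontier: URLs sit in a priority queue and are popped best-first, with a plain in-degree count standing in for a popularity score such as a (partial) PageRank value; the class and method names are illustrative.

import heapq

class Frontier:
    """Crawl frontier that returns the currently 'best' URL first."""
    def __init__(self):
        self._heap = []        # entries are (-score, url); heapq is a min-heap
        self._score = {}       # url -> current in-degree (our stand-in for quality)

    def add_link(self, url):
        # Every discovered link to the page raises its priority.
        self._score[url] = self._score.get(url, 0) + 1
        heapq.heappush(self._heap, (-self._score[url], url))

    def next_url(self):
        # Pop the highest-scoring URL, skipping stale heap entries.
        while self._heap:
            neg_score, url = heapq.heappop(self._heap)
            if self._score.get(url) == -neg_score:
                del self._score[url]     # hand each URL out only once
                return url
        return None

frontier = Frontier()
for u in ["http://a.example/", "http://b.example/", "http://a.example/"]:
    frontier.add_link(u)
print(frontier.next_url())   # http://a.example/ comes out first (in-degree 2)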
Is this page a new one?
Check if the file has been parsed or downloaded before:
  after 20 million pages, we have "seen" over 200 million URLs
  each URL is at least 100 bytes on average
  overall, about 20 GB of URLs
Options: compress URLs in main memory, or use disk
  Bloom filter (Archive)
  Disk access with caching (Mercator, AltaVista)
  Also, two-level indexing with front-coding compression
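A minimal sketch of the Bloom-filter option for the URL-seen test; the filter size, the number of hash functions and the double-hashing trick are illustrative choices, not taken from the slides. The price paid is a rare false positive, which makes the crawler skip a page it has never actually fetched.

import hashlib

class BloomFilter:
    def __init__(self, size_bits=8 * 1024 * 1024, num_hashes=5):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        # Derive k bit positions from one SHA-1 digest via double hashing.
        digest = hashlib.sha1(url.encode()).digest()
        a = int.from_bytes(digest[:8], "big")
        b = int.from_bytes(digest[8:16], "big")
        return [(a + i * b) % self.size for i in range(self.k)]

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):
        # False positives are possible (a new URL reported as seen),
        # false negatives are not.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))

seen = BloomFilter()
url = "http://www.di.unipi.it/"
if url not in seen:
    seen.add(url)     # first time we meet it: schedule the page for download
print(url in seen)    # True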
Crawler "cycle of life"
Link Extractor:  while( <…> ){ <extract … > }
Downloaders:     while( <…> ){ <store page(u) in a proper archive, possibly compressed> }
Crawler Manager: while( <…> ){
  foreach u extracted {
    if ( (u not in "Already Seen Pages") || ( u in "Already Seen Pages" && <…> ) ) { <…> }
  }
}
[Slide diagram: queues PQ, PR, AR connecting the Link Extractor, Crawler Manager and Downloaders]
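A single-machine sketch of the cycle above, with the components wired through shared queues; the names (priority_queue for PQ, page_archive for the archive), the regex-based link extraction and the flat priority are all simplifications for illustration, not the slide's actual design.

import queue, re, threading, urllib.request

priority_queue = queue.PriorityQueue()   # PQ: URLs waiting to be downloaded
page_archive = queue.Queue()             # archive of downloaded pages to parse
already_seen = set()                     # the "Already Seen Pages" test

def downloader():
    while True:
        _, url = priority_queue.get()
        try:
            page = urllib.request.urlopen(url, timeout=5).read()
            page_archive.put((url, page))          # store page(u) in the archive
        except OSError:
            pass                                   # skip unreachable pages

def link_extractor_and_manager():
    while True:
        url, page = page_archive.get()
        for link in re.findall(rb'href="(http[^"]+)"', page):
            u = link.decode()
            if u not in already_seen:              # manager filters seen URLs
                already_seen.add(u)
                priority_queue.put((0, u))         # priority 0: no quality metric here

# threading.Thread(target=downloader, daemon=True).start()
# threading.Thread(target=link_extractor_and_manager, daemon=True).start()
# priority_queue.put((0, "http://nutch.apache.org/"))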
Parallel Crawlers
The Web is too big to be crawled by a single crawler; work should be divided while avoiding duplication.
Dynamic assignment:
  A central coordinator dynamically assigns URLs to crawlers
  Links are given to the central coordinator (bottleneck?)
Static assignment:
  The Web is statically partitioned and assigned to crawlers
  Each crawler only crawls its part of the Web
Two problems with static assignment
Let D be the number of downloaders; hash(URL) maps a URL to {0, ..., D-1}, and downloader x fetches the URLs U such that hash(U) = x.
Load balancing the #URLs assigned to downloaders:
  Static schemes based on hosts may fail (www.geocities.com/…. vs. www.di.unipi.it/)
  Dynamic "relocation" schemes may be complicated
Managing fault tolerance:
  What about the death of downloaders? D → D-1, new hash!!!
  What about new downloaders? D → D+1, new hash!!!
Which hash would you use?
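A small illustration (not from the slides) of why a plain modulo hash is fragile here: when D changes, almost every URL is reassigned to a different downloader, so the crawlers' pending queues and local "already seen" state become largely useless.

import hashlib

def assign(url, D):
    """Map a URL to a downloader id in {0, ..., D-1} with a plain modulo hash."""
    h = int.from_bytes(hashlib.md5(url.encode()).digest()[:8], "big")
    return h % D

urls = [f"http://www.di.unipi.it/page{i}" for i in range(10_000)]
moved = sum(assign(u, 10) != assign(u, 9) for u in urls)   # one downloader dies: D = 10 -> 9
print(f"{moved / len(urls):.0%} of the URLs change downloader")  # roughly 90%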
A nice technique: Consistent Hashing
A tool for: spidering, Web caches, P2P, routers' load balancing, distributed file systems
Items and servers are mapped to the unit circle; item K is assigned to the first server N such that ID(N) ≥ ID(K).
What if a downloader goes down? What if a new downloader appears?
Each server gets replicated log S times.
  [monotone] adding a new server moves items only from one old server to the new one
  [balance] the probability that an item goes to a given server is ≤ O(1)/S
  [load] any server gets ≤ (I/S) log S items w.h.p.
  [scale] you can replicate each server more times...
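A minimal sketch of consistent hashing as stated above: downloaders and URLs are hashed onto the same circle and a URL goes to the first downloader clockwise from it. The log S replication of each server is left out to keep the code short, and the names are illustrative.

import bisect, hashlib

def _h(key):
    # Position on the circle, taken as a 64-bit prefix of SHA-1.
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

class ConsistentHash:
    def __init__(self, servers):
        self.ring = sorted((_h(s), s) for s in servers)

    def lookup(self, url):
        # First server with ID >= ID(url), wrapping around the circle.
        ids = [i for i, _ in self.ring]
        pos = bisect.bisect_left(ids, _h(url)) % len(self.ring)
        return self.ring[pos][1]

    def add_server(self, s):
        bisect.insort(self.ring, (_h(s), s))   # only URLs just before s move to it

    def remove_server(self, s):
        self.ring.remove((_h(s), s))           # its URLs fall to the next server

downloaders = ConsistentHash(["d0", "d1", "d2"])
print(downloaders.lookup("http://www.di.unipi.it/"))
downloaders.add_server("d3")                   # most URLs keep their downloader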
Examples: Open Source
Nutch, also used by WikiSearch: http://nutch.apache.org/