Search engine structure (architecture diagram): the Web Crawler feeds a Page archive; a Page Analyzer, driven by a Control module, passes pages to the Indexer, which builds the text, structure and auxiliary indexes; at query time the Query resolver and the Ranker answer queries over these indexes.
Information Retrieval: Crawling
The Web’s Characteristics
Size: billions of pages are available, at 5-40 KB per page => hundreds of terabytes; and the size grows every day!!
Change: 8% new pages and 25% new links change weekly; the life time of a page is about 10 days.
Spidering
24h, 7 days: “walking” over a graph, getting data. What about the graph? It is a bow-tie-shaped directed graph G = (N, E).
N changes (insert, delete): >> 8 * 10^9 nodes.
E changes (insert, delete): > 10 links per node, i.e. 10 * 8*10^9 = 8*10^10 1-entries in the adjacency matrix.
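A back-of-the-envelope sketch (Python; the 8 bytes per stored edge is an illustrative assumption of mine, not a figure from the slides) of why the adjacency matrix can only be kept in sparse form:

```python
# Rough storage estimate for the web graph, using the slide's figures:
# ~8 * 10^9 nodes and ~10 out-links per node.
nodes = 8 * 10**9
links_per_node = 10
edges = nodes * links_per_node                 # 8 * 10^10 one-entries

# A dense adjacency matrix needs one bit per (i, j) pair.
matrix_bits = nodes * nodes
matrix_petabytes = matrix_bits / 8 / 10**15

# Adjacency lists store only the one-entries, say 8 bytes per edge.
adjlist_terabytes = edges * 8 / 10**12

print(f"1-entries:        {edges:.1e}")
print(f"dense matrix:     {matrix_petabytes:.1e} PB")
print(f"adjacency lists:  {adjlist_terabytes:.2f} TB")
```

The dense matrix would take thousands of petabytes, while the edges themselves fit in well under a terabyte, which is why crawlers and web-graph tools work with (compressed) adjacency lists.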
A Picture of the Web Graph (a plot of the adjacency matrix, with page indices i and j on the axes). Q: sparse or not sparse? 21 million pages, 150 million links.
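To answer the slide's question from its own numbers, a quick check (Python, purely illustrative):

```python
# Sparsity check for the dataset on this slide:
# 21 million pages, 150 million links.
pages = 21_000_000
links = 150_000_000

avg_out_degree = links / pages          # ~7 links per page
density = links / (pages * pages)       # fraction of 1-entries in the matrix

print(f"average out-degree: {avg_out_degree:.1f}")
print(f"matrix density:     {density:.1e}")   # ~3.4e-07
```

Only about 3 entries in every 10 million possible ones are non-zero, so the matrix is extremely sparse; this is what makes adjacency-list and compressed representations feasible.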
A special sorting (figure: the web-graph picture with pages sorted so that pages of the same site, e.g. Stanford and Berkeley, end up close together).
A Picture of the Web Graph
Crawler “cycle of life” (diagram: the Link Extractor, Crawler Manager and Downloaders cooperate through the queues/repositories PQ, PR and AR).

Link Extractor:
while( <the repository of downloaded pages is not empty> ){
  <take a page p and extract the links it contains>
  <insert the extracted links into the priority queue PQ>
}

Downloaders:
while( <there are URLs assigned for download> ){
  <download page(u) for the next assigned URL u>
  <store page(u) in a proper archive, possibly compressed>
  <hand page(u) over to the Link Extractor>
}

Crawler Manager:
while( <the priority queue PQ is not empty> ){
  <extract the URLs with highest priority from PQ>
  foreach u extracted {
    if ( (u ∉ “Already Seen Pages”) || ( u ∈ “Already Seen Pages” && <u’s copy on the Web has changed> ) ) {
      <resolve u with the DNS and assign u to a Downloader>
    }
  }
}
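A minimal single-process sketch of this cycle in Python (illustrative only: it folds the three components into one sequential loop, replaces the priority queue with a plain FIFO, and the seed URL, page limit and regex-based link extraction are simplifications of my own):

```python
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

def crawl(seed, max_pages=20):
    frontier = deque([seed])   # plays the role of the priority queue PQ (here a plain FIFO)
    seen = {seed}              # the "Already Seen Pages" structure
    archive = {}               # the page archive

    while frontier and len(archive) < max_pages:
        url = frontier.popleft()                   # Crawler Manager: pick the next URL
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:   # Downloader
                page = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue                               # unreachable page: skip it
        archive[url] = page                        # store page(u) in the archive

        # Link Extractor: pull href targets out of the page (crude regex parsing).
        for href in re.findall(r'href="([^"#]+)"', page):
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)              # feed the new URLs back into the queue

    return archive

if __name__ == "__main__":
    pages = crawl("https://example.com/")
    print(f"fetched {len(pages)} pages")
```

In a real crawler the three roles run as separate, parallel components (many downloader threads, a shared priority queue, a persistent “already seen” structure), which is what the following slides discuss.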
Crawling Issues
How to crawl?
  Quality: “Best” pages first
  Efficiency: avoid duplication (or near-duplication)
  Etiquette: robots.txt, server-load concerns (minimize load); see the sketch below
How much to crawl? How much to index?
  Coverage: how big is the Web? How much do we cover?
  Relative coverage: how much do competitors have?
How often to crawl?
  Freshness: how much has changed?
How to parallelize the process?
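For the etiquette point, a small sketch of how a crawler can honor robots.txt using Python's standard library (the user-agent string and the URLs are placeholders, not anything from the slides):

```python
from urllib import robotparser

AGENT = "MyCrawler"   # hypothetical user-agent name

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()                                   # fetch and parse robots.txt

url = "https://example.com/some/page.html"
if rp.can_fetch(AGENT, url):
    delay = rp.crawl_delay(AGENT) or 1      # be polite: pause between requests
    print(f"allowed to fetch {url}, waiting {delay}s between requests")
else:
    print(f"robots.txt forbids fetching {url}")
```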
Page selection
Given a page P, define how “good” P is. Several metrics (a pluggable-priority frontier is sketched below):
  BFS, DFS, Random
  Popularity driven (PageRank, full vs. partial)
  Topic driven or focused crawling
  Combined
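These metrics can all be viewed as choices of a scoring function over the same frontier machinery. A small sketch (the depth-based and random scoring functions are illustrative stand-ins; a popularity-driven crawler would plug in a PageRank estimate instead):

```python
import heapq
import itertools
import random

class Frontier:
    """URL frontier ordered by a pluggable 'goodness' score (higher = crawled sooner)."""

    def __init__(self, score):
        self.score = score                 # function (url, depth) -> priority
        self.heap = []
        self.counter = itertools.count()   # tie-breaker keeps ordering stable

    def push(self, url, depth):
        # heapq is a min-heap, so negate the score to pop the best page first.
        heapq.heappush(self.heap, (-self.score(url, depth), next(self.counter), url, depth))

    def pop(self):
        _, _, url, depth = heapq.heappop(self.heap)
        return url, depth

# BFS order: shallower pages are "better".
bfs = Frontier(score=lambda url, depth: -depth)

# Random order: every page gets a random priority.
rnd = Frontier(score=lambda url, depth: random.random())

bfs.push("https://example.com/", depth=0)
bfs.push("https://example.com/deep/page.html", depth=3)
print(bfs.pop())   # ('https://example.com/', 0) -- the shallower page comes first
```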
BFS: “…BFS-order discovers the highest quality pages during the early stages of the crawl” (328 million URLs in the testbed) [Najork 01]
Is this page a new one?
Check whether the file has been parsed or downloaded before: after 20 million pages, we have “seen” over 200 million URLs; each URL is 50 to 75 bytes on average, so overall we have about 10 GB of URLs.
Options: compress the URLs in main memory, or use disk
  Bloom filter (Archive); see the sketch below
  disk access with caching (Mercator, AltaVista)
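A minimal Bloom-filter sketch for the “already seen URL” test (the bit-array size, the number of hash functions and the md5/sha1 double-hashing scheme are illustrative choices of mine, not the Archive's actual implementation):

```python
import hashlib

class BloomFilter:
    """Approximate 'already seen' set: no false negatives, a few false positives."""

    def __init__(self, num_bits=8 * 1024 * 1024, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, url):
        # Derive k bit positions from two independent digests (double hashing).
        h1 = int.from_bytes(hashlib.md5(url.encode()).digest(), "big")
        h2 = int.from_bytes(hashlib.sha1(url.encode()).digest(), "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))

seen = BloomFilter()
seen.add("http://www.di.unipi.it/")
print("http://www.di.unipi.it/" in seen)     # True
print("http://www.unseen.example/" in seen)  # almost surely False
```

As a rule of thumb, about 10 bits per element with ~7 hash functions gives a false-positive rate around 1%, so the 200 million seen URLs fit in roughly 250 MB instead of 10 GB of raw strings, at the price of occasionally skipping a never-seen page.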
Parallel Crawlers
The Web is too big to be crawled by a single crawler; work should be divided while avoiding duplication.
Dynamic assignment: a central coordinator dynamically assigns URLs to crawlers; discovered links are given to the central coordinator.
Static assignment: the Web is statically partitioned and assigned to crawlers; each crawler only crawls its part of the Web.
Two problems
Load balancing the #URLs assigned to the downloaders:
  static schemes based on hosts may fail (compare www.geocities.com/… with www.di.unipi.it/)
  dynamic “relocation” schemes may be complicated
Managing the fault tolerance:
  What about the death of downloaders? D → D-1, new hash!!!
  What about new downloaders? D → D+1, new hash!!!
Let D be the number of downloaders; hash(URL) maps a URL to [0, D), and downloader x fetches the URLs U s.t. hash(U) ∈ [x-1, x). A small sketch of why re-hashing hurts follows below.
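A tiny experiment (Python; the URLs and the values of D are made up) showing why assigning URLs by hash mod D is fragile: when D changes, almost every URL ends up on a different downloader:

```python
import hashlib

def assign(url, d):
    """Naive static assignment: downloader index = hash(URL) mod D."""
    digest = hashlib.md5(url.encode()).digest()
    return int.from_bytes(digest, "big") % d

urls = [f"http://site{i}.example/page{i}.html" for i in range(100_000)]

d_old, d_new = 10, 9        # one downloader dies: D -> D-1
moved = sum(assign(u, d_old) != assign(u, d_new) for u in urls)
print(f"{moved / len(urls):.0%} of the URLs change downloader")   # ~90%
```

Ideally only about 1/D of the URLs should move when a single downloader joins or leaves; consistent hashing, on the next slide, achieves roughly that.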
A nice technique: Consistent Hashing
A tool for: spidering, Web caches, P2P, routers load balance, distributed FS.
Items and servers get an ID (a hash of m bits); servers are mapped on a unit circle; item k is assigned to the first server with ID ≥ k.
What if a downloader goes down? What if a new downloader appears?
Theorem. Given S servers and I items, map on the unit circle (log S) copies of each server and the I items. Then
  [load] any server gets ≤ (I/S) log S items
  [spread] any URL is stored in ≤ (log S) servers
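A compact consistent-hashing sketch in Python (the hash function, the number of virtual copies per server and the server names are illustrative choices, not part of the slides):

```python
import bisect
import hashlib

def h(key):
    """Map a string to a point on the (discretized) unit circle."""
    return int.from_bytes(hashlib.md5(key.encode()).digest(), "big") / 2**128

class ConsistentHash:
    def __init__(self, servers, copies=16):
        self.copies = copies
        self.ring = []                        # sorted list of (point, server)
        for s in servers:
            self.add_server(s)

    def add_server(self, server):
        # Insert several virtual copies of the server (the theorem uses ~log S of them).
        for i in range(self.copies):
            bisect.insort(self.ring, (h(f"{server}#{i}"), server))

    def remove_server(self, server):
        self.ring = [(p, s) for (p, s) in self.ring if s != server]

    def lookup(self, url):
        # Assign the URL to the first server copy clockwise from its point.
        point = h(url)
        i = bisect.bisect_right(self.ring, (point, ""))
        return self.ring[i % len(self.ring)][1]

ring = ConsistentHash(["d0", "d1", "d2", "d3"])
print(ring.lookup("http://www.di.unipi.it/"))
ring.remove_server("d1")                        # a downloader dies...
print(ring.lookup("http://www.di.unipi.it/"))   # ...only its URLs get reassigned
```

When d1 is removed, only the URLs that hashed onto one of d1's arcs move to the next server clockwise; all other assignments stay untouched, which is exactly the property the mod-D re-hashing of the previous slide lacks.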
Examples: Open Source
Nutch, also used by Overture: http://www.nutch.org
Heritrix, used by Archive.org: http://archive-crawler.sourceforge.net/index.html