1
UbiCrawler: a scalable fully distributed Web crawler
P. Boldi, B. Codenotti, M. Santini, and S. Vigna
Software – Practice and Experience, Vol. 34, No. 8, pp. 711–726, July 2004
June 20, 2006, Joo Yong Lee
2
Contents
- Performance evaluation
- Fault tolerance
- Scalability
- Related works
- Conclusions
3
Degree of Distribution
- Intra-site parallel crawler: runs on the same local network; processes communicate through a high-speed interconnect (such as a LAN)
- Distributed crawler: runs at geographically distant locations; processes communicate through the Internet (or a WAN)
- UbiCrawler: a distributed crawler that can run on any kind of network
4
Coordination
- Independent (no coordination): crawling processes download pages totally independently
- Dynamic assignment: a central coordinator assigns each partition to a crawling process dynamically
- Static assignment: partitions are assigned to crawling processes before the crawl starts
- UbiCrawler: distributed dynamic coordination with no central authority
5
Partitioning Techniques
- URL-hash based: partition by the hash value of each page's URL; produces many inter-partition links
- Site-hash based (or host-hash based): compute the hash value only on the site name of a URL; reduces the number of inter-partition links
- Hierarchical: partition the Web hierarchically based on page URLs; even fewer inter-partition links than the site-hash based scheme
- UbiCrawler: site-hash based partitioning implemented with consistent hashing (see the sketch below)
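To make the consistent-hashing idea concrete, here is a minimal, hypothetical sketch in Python (class name, replica count, and hash function are illustrative assumptions, not UbiCrawler's actual Java implementation): each agent owns many pseudo-random points on a ring, and a host is dispatched to the agent owning the first point at or after the host's own hash point.

```python
import hashlib
from bisect import bisect_left

class ConsistentHash:
    """Hypothetical host -> agent assignment via consistent hashing."""

    def __init__(self, agents, replicas=128):
        self.replicas = replicas   # ring points per agent (illustrative)
        self.ring = []             # sorted list of (point, agent) pairs
        for agent in agents:
            self.add_agent(agent)

    @staticmethod
    def _point(key):
        # Any well-spread hash works; SHA-1 is just a convenient choice.
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    def add_agent(self, agent):
        for i in range(self.replicas):
            self.ring.append((self._point(f"{agent}/{i}"), agent))
        self.ring.sort()

    def remove_agent(self, agent):
        self.ring = [e for e in self.ring if e[1] != agent]

    def agent_for_host(self, host):
        # First ring point at or after the host's point (wrapping around).
        i = bisect_left(self.ring, (self._point(host),)) % len(self.ring)
        return self.ring[i][1]
```

Because only the hosts adjacent to an agent's ring points move when that agent joins or leaves, the site-hash partition stays balanced without any central coordination.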
6
Coverage
- Definition: c / u, where c is the number of actually crawled pages and u is the number of pages the crawler as a whole had to visit
- UbiCrawler achieves the optimal coverage of 1 if no faults occur
[Figure: pages a–n partitioned among agents A1 (H1), A2 (H2), and A3 (H3)]
7
Overlap
- Definition: (n - u) / u, where n is the total number of pages crawled by the alive agents and u is the number of unique pages
- UbiCrawler achieves the optimal overlap of 0, even in the presence of crash faults
- Under transient faults the absence of duplications cannot be guaranteed, but UbiCrawler autonomously converges back to a state with overlap 0 after a transient fault (self-stabilization)
8
Communication Overhead
- Definition: e / n, where e is the number of URLs exchanged by the agents during the crawl and n is the number of crawled pages
- Assuming that every page contains links to other sites, n crawled pages give rise to n URLs that must potentially be communicated to other agents
- By the balancing property of the assignment function, a URL stays local exactly when the agent responsible for it is the agent that fetched the page, which happens for about 1/ℓ of the URLs when ℓ agents are alive; hence at most n(ℓ - 1)/ℓ messages are sent across the network
- The communication overhead is therefore less than 1
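For concreteness, the three metrics defined on the last slides are plain ratios over crawl counters; a minimal sketch, with variable names following the slides:

```python
def coverage(c, u):
    # c / u: actually crawled pages over the pages the crawler had to visit.
    return c / u

def overlap(n, u):
    # (n - u) / u: duplicated work relative to the u unique pages obtained.
    return (n - u) / u

def communication_overhead(e, n):
    # e / n: URLs exchanged between agents per crawled page.
    return e / n

# Example: 3 alive agents crawl n = 1200 pages, all u = 1200 unique, and
# exchange e = 800 URLs: coverage 1.0, overlap 0.0, overhead 0.67 (< 1).
```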
9
Quality (1/2)
- Challenge: crawlers cannot download the whole Web, so they try to download an "important" or "relevant" section of it; the goal is a crawler that tends to collect high-quality pages during the early stages of the crawl
- UbiCrawler: uses a parallel per-host breadth-first visit, without dealing with ranking or quality-of-page issues; a breadth-first visit tends to reach high-quality pages first
10
Quality (2/2)
- UbiCrawler (cont.): a limit on the depth of any host visit
- Setting the limit to 0 performs a pure breadth-first visit; setting it to higher values makes the visit resemble a depth-first one more and more (see the sketch below)
[Figure: cumulative PageRank of crawled pages as a function of the number of crawled URLs]
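The slide does not spell out the exact queueing discipline, so the following is only a plausible reading, not the paper's actual mechanism: the limit is taken as the number of extra consecutive pages fetched from the same host before control returns to a round-robin over hosts, so limit = 0 degenerates to a pure breadth-first visit while large limits dwell on one host and look increasingly depth-first. `fetch_links` is a hypothetical callback returning same-host links.

```python
from collections import deque

def crawl(seeds_by_host, fetch_links, limit=0):
    # seeds_by_host: {host: [seed URLs]}; fetch_links(url) -> same-host URLs.
    queues = {h: deque(urls) for h, urls in seeds_by_host.items()}
    hosts = deque(queues)                      # round-robin over hosts
    seen = {u for q in queues.values() for u in q}
    while hosts:
        host = hosts.popleft()
        q = queues[host]
        for _ in range(1 + limit):             # 1 + limit fetches per turn
            if not q:
                break
            url = q.popleft()
            for link in fetch_links(url):
                if link not in seen:
                    seen.add(link)
                    q.append(link)
        if q:
            hosts.append(host)                 # revisit this host later
```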
11
Fault Tolerance
- Metrics: up to now, no commonly accepted metrics exist for estimating the fault tolerance of distributed crawlers
- UbiCrawler: every agent has its own view of the set of alive agents, and views can differ; even so, no two agents will ever dispatch the same host to two different agents
- Agents can be dynamically added during a crawl (see the sketch below)
[Figure: agents a–e, one of which dies]
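A small, hedged demonstration of the graceful-degradation property, reusing the hypothetical ConsistentHash class sketched earlier (agent and host names are made up): when an agent dies, only the hosts it owned are reassigned, so the surviving agents' work is undisturbed.

```python
ring = ConsistentHash(["A1", "A2", "A3", "A4", "A5"])
hosts = [f"host{i}.example.org" for i in range(1000)]

before = {h: ring.agent_for_host(h) for h in hosts}
ring.remove_agent("A3")                 # simulate agent A3 crashing
after = {h: ring.agent_for_host(h) for h in hosts}

moved = [h for h in hosts if before[h] != after[h]]
assert all(before[h] == "A3" for h in moved)   # only A3's hosts moved
```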
12
Page Recovery
- An interesting feature of contravariant assignment functions is that they make it easy to guess who could have fetched a page previously: if agent a is responsible for host h, then the agent that was responsible for h before a started is the one associated with the next-nearest replica
- This allows a page-recovery protocol that avoids re-fetching the same page several times: each time an agent is about to fetch a page of a host, it first checks whether the next-nearest t agents have already fetched that page (see the sketch below)
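A rough sketch of the recovery lookup on top of the hypothetical ConsistentHash class from earlier (the paper's actual protocol details may differ): walking the ring clockwise from the host's point enumerates the current owner first and the previous owners next, so an agent can query the next-nearest t agents before re-fetching.

```python
from bisect import bisect_left

def nearest_agents(ring, host, k):
    """First k distinct agents clockwise from the host's ring point:
    index 0 is the current owner, index 1 the previous owner, and so on."""
    start = bisect_left(ring.ring, (ring._point(host),))
    agents = []
    for j in range(len(ring.ring)):
        agent = ring.ring[(start + j) % len(ring.ring)][1]
        if agent not in agents:
            agents.append(agent)
            if len(agents) == k:
                break
    return agents

# Before fetching a page of host h, an agent could ask
# nearest_agents(ring, h, t + 1)[1:] whether the page was already fetched.
```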
13
Scalability (1/3)
- A crawler should guarantee that the work performed by every thread stays constant as the number of threads changes
- System and communication overheads must not reduce the performance of each thread
- The performance of each UbiCrawler thread is independent of the number of agents
14
Scalability (2/3)
[Figure: how work changes when the number of agents changes (solid line = 1, long dashes = 2, short dashes = 3, dots = 5, dash-dots = 8)]
15
Scalability (3/3)
[Figure: how work changes when the number of threads changes (from 2 to 14, higher to lower)]
16
Related Works
- Mercator: a high-performance Web crawler
  - A unique, central element, the frontier, holds all information about the set of URLs that have been crawled
  - An ingenious mix of Rabin fingerprinting and compressed hash tables
  - Several protocol modules (Gopher, FTP, etc.) and a content-seen module
- Spider: developed using C++ and Python
  - Two central components: a crawl manager and a crawl application
  - Partitions the set of URLs statically into k classes to avoid a bottleneck
  - A domain-based throttling technique for polite crawling
17
Conclusions
- UbiCrawler is the first completely distributed crawler with identical agents
- UbiCrawler introduces the use of consistent hashing in parallel crawling:
  - completely decentralized coordination logic
  - graceful degradation in the presence of faults
  - linear scalability