
1 Efficient Crawling Through URL Ordering, by Junghoo Cho, Hector Garcia-Molina, and Lawrence Page, appearing in Computer Networks and ISDN Systems, vol. 30, no. 1-7, pp. 161-172, 1998. Presented by Crutcher Dunnavant, University of Alabama, in partial fulfillment of the requirements for the Internet Algorithms course, Spring 2004.

2 Motivation Crawler – n. a program that retrieves web pages, commonly for use by a search engine. A crawler has a set of visited URLs (for which it has a page downloaded) and a set of unvisited URLs (for which it does not). Crawlers visit URLs by downloading the pages they refer to and adding any new URLs found on those pages to the unvisited set. Given that most crawlers will not be able to visit the entire web, how should a crawler select which pages to visit from its unvisited URL set?
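To make the visited/unvisited bookkeeping concrete, here is a minimal Python sketch (not code from the paper) of a plain breadth-first crawl loop. The fetch and extract_links helpers are assumed, caller-supplied functions.

```python
from collections import deque

def crawl(seed_urls, fetch, extract_links, max_pages):
    """Breadth-first crawl that maintains visited and unvisited URL sets."""
    unvisited = deque(seed_urls)      # frontier: URLs not yet downloaded
    visited = set()                   # URLs whose pages we already hold
    pages = {}                        # url -> downloaded page content
    while unvisited and len(visited) < max_pages:
        url = unvisited.popleft()
        if url in visited:
            continue
        page = fetch(url)             # assumed caller-supplied download function
        visited.add(url)
        pages[url] = page
        for link in extract_links(page):   # assumed caller-supplied link extractor
            if link not in visited:
                unvisited.append(link)
    return pages
```

Ordering the frontier is exactly the question the following slides take up: a FIFO queue visits pages in discovery order, with no regard for importance.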

3 Limited Space Crawlers have limited storage capacity and may be unable to index or analyze all pages. At the time the paper was written in 1998, the Web contained about 1.5 TB of data and was growing rapidly. Today, in 2004, Google indexes over 6 billion pages. It is reasonable to expect that most clients will not want, or will not be able, to cope with all of that data.

4 Limited Time Crawling takes time, so at some point the crawler may need to start revisiting previously scanned pages to check for changes. This means that it may never get to some pages.

5 Ye Merrie Olde Internet We may safely assume that the internet is made up of pages of various levels of importance.

6 Important Pages The ideal situation: given a start page, immediately find the important pages, and crawl only those pages.

7 Most Important First Visit the most important pages first! But how do we determine the importance of a page? And how do we estimate that importance using only the information available in the pages we have already downloaded?

8 Importance Metrics Similarity: IS(p, Q) – importance as the similarity of p to a query string Q. Backlink: IB(p) – importance as the count of pages that point to p. PageRank: IR(p) – importance as the PageRank [1] of the page p.

9 Importance Metrics (cont.) Forward Link: IF(p) – importance as the number of forward links on a page. Location: IL(p) – importance as the domain of a page (.gov, .edu, etc.).
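Since IR(p) is the metric the paper ends up favoring, a compact power-iteration sketch of PageRank may help. This is an illustrative implementation under assumptions (a dict-of-lists link graph, a damping factor of 0.85), not the formulation from [1].

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a link graph given as {url: [outgoing urls]}."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new_rank[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank
```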

10 Importance Estimators Given an importance metric IX(p) defined for a page on the web, the metric IX'(p) is an estimator of IX(p) calculated using only the information available in the downloaded page set. IX'(p) trends towards IX(p) as the downloaded set grows towards a complete copy of the web.
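As a concrete example of an estimator, the backlink estimator IB'(p) can be computed from the downloaded pages alone. The sketch below assumes the crawler keeps an in-memory map from each visited URL to the links found on its page; the names are illustrative.

```python
from collections import Counter

def estimated_backlinks(downloaded_links):
    """IB'(p): backlink counts using only links seen in already-downloaded pages.

    downloaded_links maps each visited URL to the URLs its page links to. Pages
    we have not fetched contribute no edges, so the counts approach IB(p) from
    below as the crawl covers more of the web.
    """
    counts = Counter()
    for source, targets in downloaded_links.items():
        for target in set(targets):   # count each linking page at most once per target
            counts[target] += 1
    return counts
```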

11 Bad Metrics Not all metrics are of equal value, and not all metric estimates are useful. IS(p, Q) is only defined for a given driving query, so we must re-crawl the web for every new query. IF(p) is easily exploitable on the web and yields poor values. IL(p) is incredibly naive and does not permit ranking within a given domain.

12 Ordered Crawling Analyze the information already downloaded, and order the URLs not in the visited set by some importance metric on that data.
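A hedged sketch of such an ordered crawl, building on the plain loop above: estimate(url, pages) is an assumed scoring callback (for example, the IB' counts or a PageRank computed over the downloaded subgraph), and fetch/extract_links are the same assumed helpers as before.

```python
def ordered_crawl(seed_urls, fetch, extract_links, estimate, max_pages):
    """Crawl by always visiting the frontier URL with the highest estimated importance."""
    visited = set()
    pages = {}                        # url -> downloaded page content
    frontier = set(seed_urls)         # unvisited URLs known so far
    while frontier and len(pages) < max_pages:
        # Re-score the whole frontier each step; a real crawler would re-rank
        # in batches, since scoring after every single download is expensive.
        url = max(frontier, key=lambda u: estimate(u, pages))
        frontier.discard(url)
        visited.add(url)
        page = fetch(url)
        pages[url] = page
        for link in extract_links(page):
            if link not in visited:
                frontier.add(link)
    return pages
```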

13 Crawl & Stop with Threshold Assume that the crawler visits K pages out of T total pages on the web. Given an importance target G, any page with I(p) >= G is considered 'hot'. Assume that the total number of hot pages is H. The performance of the crawler, PST(C), is the fraction of the H hot pages that have been visited when the crawler stops. If K < H, an ideal crawler will have performance K/H; otherwise it will have the ideal performance 1. A random crawler that revisits pages is expected to visit (H/T)K hot pages when it stops, for a performance of K/T. Only when the random crawler visits all T pages is its performance expected to be 1.
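A small sketch of the PST(C) measurement itself, assuming we are given the order in which a crawler visited pages and an oracle importance score for every page (names and signature are illustrative, not the paper's):

```python
def crawl_and_stop_performance(visit_order, importance, threshold, k):
    """PST(C): fraction of hot pages (I(p) >= threshold) among the first k visits."""
    hot = {p for p, score in importance.items() if score >= threshold}
    if not hot:
        return 1.0                    # no hot pages at this threshold: trivially perfect
    seen_hot = sum(1 for p in visit_order[:k] if p in hot)
    return seen_hot / len(hot)
```

Plugging in the ideal crawler (which visits hot pages first) gives min(K, H)/H, and a random crawler gives roughly K/T in expectation, matching the figures above.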

14 Results

15 Personal Observations PageRank r0x0rs! This paper provides further justification for using PageRank as an estimator of importance in hyperlinked systems. Not only does it perform well on full data sets, it also provides an effective guide for selectively consuming partial data sets.

16 Personal Observations (cont.) Web crawling maps well to the problem of teaching a human a new domain of knowledge. We seek to learn only the important portion of a domain, and we seek to learn it as quickly as possible. Can we apply PageRank or similar ordering metrics to scheduling the order of instruction?

17 Related Work
1. Page, Lawrence; Brin, Sergey; Motwani, Rajeev; Winograd, Terry. "The PageRank Citation Ranking: Bringing Order to the Web." http://dbpubs.stanford.edu/pub/1999-66 (the initial PageRank paper)
2. Brin, Sergey; Page, Lawrence. "The Anatomy of a Large-Scale Hypertextual Web Search Engine." Computer Networks and ISDN Systems 30 (1998) 107-117 (the initial Google paper)
3. Haveliwala, Taher. "Efficient Computation of PageRank." Technical Report, September 1999. http://dbpubs.stanford.edu/pub/1999-31 (describes fast ways to calculate PageRank)

