Presentation is loading. Please wait.

Presentation is loading. Please wait.

How to Crawl the Web Hector Garcia-Molina Stanford University Joint work with Junghoo Cho.

Similar presentations


Presentation on theme: "How to Crawl the Web Hector Garcia-Molina Stanford University Joint work with Junghoo Cho."— Presentation transcript:

1 How to Crawl the Web Hector Garcia-Molina Stanford University Joint work with Junghoo Cho

2 Economic Concerns Information Loss Information Overload Service Heterogeneity Physical Barriers Interoperability Value Filtering Mobile Access IP Infrastructure Archival Repository Stanford InterLib Technologies

3 Repository Multicast Engine WWW Feature Repository Retrieval Indexes Webbase API Web Crawler Web Crawler Web Crawler Web Crawlers Client WebBase Architecture

4 4 What is a Crawler? web init get next url get page extract urls initial urls to visit urls visited urls web pages

5 5 Crawling at Stanford WebBase Project BackRub Search Engine, Page Rank Google New WebBase Crawler

6 6 Crawling Issues (1) Load at visited web sites –robots.txt files –space out requests to a site –limit number of requests to a site per day –limit depth of crawl –multicast

7 7 Crawling Issues (2) Load at crawler –parallelize init get next url get page extract urls initial urls to visit urls visited urls web pages init get next url get page extract urls ?

8 8 Crawling Issues (3) Scope of crawl –not enough space for “all” pages –not enough time to visit “all” pages Solution: Visit “important” pages visited pages microsoft

9 9 Crawling Issues (4) Incremental crawling –how do we keep pages “fresh”? –how do we avoid crawling from scratch? Focus of this talk...

10 10 Web Evolution Experiment How often does a web page change? What is the lifespan of a page? How long does it take for 50% of the web to change? How do we model web changes?

11 11 Experimental Setup February 17 to June 24, 1999 270 sites visited (with permission) –identified 400 sites with highest “page rank” –contacted administrators 720,000 pages collected –3,000 pages from each site daily –start at root, visit breadth first (get new & old pages) –ran only 9pm - 6am, 10 seconds between site requests

12 12 How Often Does a Page Change? Example: 50 visits to page, 5 changes  average change interval = 50/5 = 10 days Is this correct? 1 day changes page visited

13 13 Average Change Interval fraction of pages

14 14 Average Change Interval — By Domain fraction of pages

15 15 How Long Does a Page Live? experiment duration page lifetime experiment duration page lifetime experiment duration page lifetime experiment duration page lifetime

16 16 Page Lifespans fraction of pages

17 17 Page Lifespans Method 1 used fraction of pages

18 18 Time for a 50% Change days fraction of unchanged pages

19 19 Modeling Web Evolution Poisson process with rate T is time to next event f T (t) = e - t (t > 0)

20 20 Change Interval of Pages for pages that change every 10 days on average interval in days fraction of changes with given interval Poisson model

21 21 Change Metrics Freshness –Freshness of element e i at time t is F( e i ; t ) = 1 if e i is up-to-date at time t 0 otherwise eiei eiei... webdatabase –Freshness of the database S at time t is F( S ; t ) = F( e i ; t )  N 1 N i=1

22 22 Change Metrics Age –Age of element e i at time t is A( e i ; t ) = 0 if e i is up-to-date at time t t - (modification e i time) otherwise eiei eiei... webdatabase –Age of the database S at time t is A( S ; t ) = A( e i ; t )  N 1 N i=1

23 23 Change Metrics F(e i ) A(e i ) 0 0 1 time update refresh F( S ) = lim F(S ; t ) dt  t 1 t 0 t  F( e i ) = lim F(e i ; t ) dt  t 1 t 0 t  Time averages: similar for age...

24 24 Crawler Types In-place vs. shadow Steady vs. batch eiei eiei... webdatabase eiei... shadow database time crawler on crawler off

25 25 Comparison: Batch vs. Steady batch mode in-place crawler steady in-place crawler crawler running

26 26 Shadowing Steady Crawler crawler’s collection current collection without shadowing

27 27 Shadowing Batch Crawler crawler’s collection current collection without shadowing

28 28 Experimental Data  Freshness Pages change on average every 4 months Batch crawler works one week out of 4 1 2 0.63 0.50

29 29 Refresh Order Fixed order –Example: Explicit list of URLs to visit Random Order –Example: Start from seed URLs & follow links Purely Random –Example: Refresh pages on demand, as requested by user eiei eiei... web database

30 30 Freshness vs. Order steady, in-place crawler, pages change at same average rate r = / f = average change frequency / average visit frequency

31 31 Age vs. Order r = / f = average change frequency / average visit frequency = Age / time to refresh all N elements

32 32 Non-Uniform Change Rates fraction of elements g( )

33 33 Trick Question Two page database e 1 changes daily e 2 changes once a week Can visit pages once a week How should we visit pages? –e 1 e 1 e 1 e 1 e 1 e 1... –e 2 e 2 e 2 e 2 e 2 e 2... –e 1 e 2 e 1 e 2 e 1 e 2... [uniform] –e 1 e 1 e 1 e 1 e 1 e 1 e 2 e 1 e 1... [proportional] –? e1e1 e2e2 e1e1 e2e2 web database

34 34 Proportional Often Not Good! Visit fast changing e 1  get 1/2 day of freshness Visit slow changing e 2  get 1/2 week of freshness Visiting e 2 is a better deal!

35 35 Selecting Optimal Refresh Frequency Analysis is complex Shape of curve is the same in all cases Holds for any distribution g( )

36 36 Optimal Refresh Frequency for Age Analysis is also complex Shape of curve is the same in all cases Holds for any distribution g( )

37 37 Comparing Policies In-place, steady crawler; Based on our experimental data [Pages change at different frequencies, as measured in experiment.]

38 38 Building an Incremental Crawler Steady In-place Variable visit frequencies Fixed order improves freshness! Also need: –Page change frequency estimator –Page replacement policy –Page ranking policy (what are ‘important” pages?) See papers....

39 39 The End Thank you for your attention... For more information, visit http://www-db.stanford.edu Follow “Searchable Publications” link, and search for author = “Cho”


Download ppt "How to Crawl the Web Hector Garcia-Molina Stanford University Joint work with Junghoo Cho."

Similar presentations


Ads by Google