1 WebBase and Stanford Digital Library Project Junghoo Cho Stanford University
Economic Weaknesses Information Loss Information Overload Service Heterogeneity Physical Barriers Interoperability Value Filtering Mobile Access IP Infrastructure Archival Repository Technologies for Digital Libraries 2
Repository Multicast Engine WWW Feature Repository Retrieval Indexes Webbase API Web Crawler Web Crawler Web Crawler Web Crawlers Client WebBase Architecture 3
4 What is a Crawler? web init get next url get page extract urls initial urls to visit urls visited urls web pages
5 Crawling Issues (1) Load at visited web sites Load at crawlers Scope of the crawl
6 Crawling Issues (2) Maintaining pages “fresh” –How does the web change over time? –What does fresh page/database mean? –How can we increase “freshness”?
7 Outline Web evolution experiments Freshness metrics Crawling policy
8 Web Evolution Experiment How often does a web page change? What is the lifespan of a page? How long does it take for 50% of the web to change?
9 Experimental Setup February 17 to June 24, sites visited (with permission) –identified 400 sites with highest “page rank” –contacted administrators 720,000 pages collected –3,000 pages from each site daily –start at root, visit breadth first (get new & old pages) –ran only 9pm - 6am, 10 seconds between site requests
10 Page Change Example: 50 visits to page, 5 changes average change interval = 50/5 = 10 days Is this correct? 1 day changes page visited
11 Average Change Interval fraction of pages
12 Change Interval - By Domain fraction of pages
13 Modeling Web Evolution Poisson process with rate T is time to next event f T (t) = e - t (t > 0)
14 Change Interval of Pages for pages that change every 10 days on average interval in days fraction of changes with given interval Poisson model
15 Change Metrics Freshness –Freshness of page e i at time t is F( e i ; t ) = 1 if e i is up-to-date at time t 0 otherwise eiei eiei... webdatabase –Freshness of the database S at time t is F( S ; t ) = F( e i ; t ) N 1 N i=1
16 Change Metrics Age –Age of page e i at time t is A( e i ; t ) = 0 if e i is up-to-date at time t t - (modification e i time) otherwise eiei eiei... webdatabase –Age of the database S at time t is A( S ; t ) = A( e i ; t ) N 1 N i=1
17 Change Metrics F(e i ) A(e i ) time update refresh F( S ) = lim F(S ; t ) dt t 1 t 0 t F( e i ) = lim F(e i ; t ) dt t 1 t 0 t Time averages: similar for age...
18 Trick Question Two page database e 1 changes daily e 2 changes once a week Can visit pages once a week How should we visit pages? –e 1 e 1 e 1 e 1 e 1 e 1... –e 2 e 2 e 2 e 2 e 2 e 2... –e 1 e 2 e 1 e 2 e 1 e 2... [uniform] –e 1 e 1 e 1 e 1 e 1 e 1 e 1 e 2 e 1 e 1... [proportional] –? e1e1 e2e2 e1e1 e2e2 web database
19 Proportional Often Not Good! Visit fast changing e 1 get 1/2 day of freshness Visit slow changing e 2 get 1/2 week of freshness Visiting e 2 is a better deal!
20 Selecting Optimal Refresh Frequency Analysis is complex Shape of curve is the same in all cases Holds for any distribution g( )
21 Optimal Refresh Frequency for Age Analysis is also complex Shape of curve is the same in all cases Holds for any distribution g( )
22 Comparing Policies Based on Statistics from experiment and revisit frequency of every month
23 Summary Maintaining the collection fresh: –Web evolution experiment –Change metrics –Optimal policy Intuitive policy does not always perform well –Should be careful in deciding revisit policy