Presentation is loading. Please wait.

Presentation is loading. Please wait.

Synchronizing a Database To Improve Freshness Junghoo Cho Hector Garcia-Molina Stanford University.

Similar presentations


Presentation on theme: "Synchronizing a Database To Improve Freshness Junghoo Cho Hector Garcia-Molina Stanford University."— Presentation transcript:

1 Synchronizing a Database To Improve Freshness Junghoo Cho Hector Garcia-Molina Stanford University

2 2 Application –Web search engines/crawlers –Data warehouse... Problem Polling Remote database Local database Query Update

3 3 Challenge: How to maintain pages “fresh?” How does the web change over time? –Web evolution experiment What does fresh page/database mean? –Change metrics How can we increase “freshness”? –Crawl policy

4 4 Web Evolution Experiment How often does a web page change? How do we model web changes? What is the lifespan of a page? How long does it take for 50% of the web change?

5 5 Experimental Setup February 17 to June 24, 1999 270 sites visited (with permission) –identified 400 sites with highest “page rank” –contacted administrators 720,000 pages collected –3,000 pages from each site daily –start at root, visit breadth first (get new & old pages) –ran only 9pm - 6am, 10 seconds between site requests

6 6 How Often Does a Page Change? Example: 50 visits to page, 5 changes  average change interval = 50/5 = 10 days Is this correct? 1 day changes page visited

7 7 Average Change Interval fraction of pages

8 8 Modeling Web Evolution Poisson process with rate T is time to next event f T (t) = e - t (t > 0)

9 9 Change Interval of Pages for pages that change every 10 days on average interval in days fraction of changes with given interval Poisson model

10 10 Change Metrics Freshness –Freshness of page e i at time t is F( e i ; t ) = 1 if e i is up-to-date at time t 0 otherwise eiei eiei... webdatabase –Freshness of the database S at time t is F( S ; t ) = F( e i ; t )  N 1 N i=1

11 11 Change Metrics Age –Age of page e i at time t is A( e i ; t ) = 0 if e i is up-to-date at time t t - (modification e i time) otherwise eiei eiei... webdatabase –Age of the database S at time t is A( S ; t ) = A( e i ; t )  N 1 N i=1

12 12 Change Metrics F(e i ) A(e i ) 0 0 1 time update refresh F( S ) = lim F(S ; t ) dt  t 1 t 0 t  F( e i ) = lim F(e i ; t ) dt  t 1 t 0 t  Time averages: similar for age...

13 13 Refresh Order Fixed order –Example: Explicit list of URLs to visit Random Order –Example: Start from seed URLs & follow links Purely Random –Example: Refresh pages on demand, as requested by user eiei eiei... web database

14 14 Freshness vs. Order r = / f = average change frequency / average revisit frequency

15 15 Trick Question Two page database e 1 changes daily e 2 changes once a week Can visit pages once a week How should we visit pages? –e 1 e 1 e 1 e 1 e 1 e 1... –e 2 e 2 e 2 e 2 e 2 e 2... –e 1 e 2 e 1 e 2 e 1 e 2... [uniform] –e 1 e 1 e 1 e 1 e 1 e 1 e 1 e 2 e 1 e 1... [proportional] –? e1e1 e2e2 e1e1 e2e2 web database

16 16 Proportional Often Not Good! Visit fast changing e 1  get 1/2 day of freshness Visit slow changing e 2  get 1/2 week of freshness Visiting e 2 is a better deal!

17 17 Selecting Optimal Refresh Frequency Analysis is complex Shape of curve is the same in all cases Holds for any distribution g( )

18 18 Optimal Refresh Frequency for Age Analysis is also complex Shape of curve is the same in all cases Holds for any distribution g( )

19 19 Comparing Policies Based on Statistics from experiment and revisit frequency of every month

20 20 Summary Maintaining the collection fresh: –Web evolution experiment –Change metrics –Optimal policy Intuitive policy does not always perform well –Should be careful in deciding revisit policy

21 21 Future work Weighted freshness model Non-Poisson process model Change frequency estimation


Download ppt "Synchronizing a Database To Improve Freshness Junghoo Cho Hector Garcia-Molina Stanford University."

Similar presentations


Ads by Google