Download presentation
Presentation is loading. Please wait.
1
Synchronizing a Database To Improve Freshness Junghoo Cho Hector Garcia-Molina Stanford University
2
2 Application –Web search engines/crawlers –Data warehouse... Problem Polling Remote database Local database Query Update
3
3 Challenge: How to maintain pages “fresh?” How does the web change over time? –Web evolution experiment What does fresh page/database mean? –Change metrics How can we increase “freshness”? –Crawl policy
4
4 Web Evolution Experiment How often does a web page change? How do we model web changes? What is the lifespan of a page? How long does it take for 50% of the web change?
5
5 Experimental Setup February 17 to June 24, 1999 270 sites visited (with permission) –identified 400 sites with highest “page rank” –contacted administrators 720,000 pages collected –3,000 pages from each site daily –start at root, visit breadth first (get new & old pages) –ran only 9pm - 6am, 10 seconds between site requests
6
6 How Often Does a Page Change? Example: 50 visits to page, 5 changes average change interval = 50/5 = 10 days Is this correct? 1 day changes page visited
7
7 Average Change Interval fraction of pages
8
8 Modeling Web Evolution Poisson process with rate T is time to next event f T (t) = e - t (t > 0)
9
9 Change Interval of Pages for pages that change every 10 days on average interval in days fraction of changes with given interval Poisson model
10
10 Change Metrics Freshness –Freshness of page e i at time t is F( e i ; t ) = 1 if e i is up-to-date at time t 0 otherwise eiei eiei... webdatabase –Freshness of the database S at time t is F( S ; t ) = F( e i ; t ) N 1 N i=1
11
11 Change Metrics Age –Age of page e i at time t is A( e i ; t ) = 0 if e i is up-to-date at time t t - (modification e i time) otherwise eiei eiei... webdatabase –Age of the database S at time t is A( S ; t ) = A( e i ; t ) N 1 N i=1
12
12 Change Metrics F(e i ) A(e i ) 0 0 1 time update refresh F( S ) = lim F(S ; t ) dt t 1 t 0 t F( e i ) = lim F(e i ; t ) dt t 1 t 0 t Time averages: similar for age...
13
13 Refresh Order Fixed order –Example: Explicit list of URLs to visit Random Order –Example: Start from seed URLs & follow links Purely Random –Example: Refresh pages on demand, as requested by user eiei eiei... web database
14
14 Freshness vs. Order r = / f = average change frequency / average revisit frequency
15
15 Trick Question Two page database e 1 changes daily e 2 changes once a week Can visit pages once a week How should we visit pages? –e 1 e 1 e 1 e 1 e 1 e 1... –e 2 e 2 e 2 e 2 e 2 e 2... –e 1 e 2 e 1 e 2 e 1 e 2... [uniform] –e 1 e 1 e 1 e 1 e 1 e 1 e 1 e 2 e 1 e 1... [proportional] –? e1e1 e2e2 e1e1 e2e2 web database
16
16 Proportional Often Not Good! Visit fast changing e 1 get 1/2 day of freshness Visit slow changing e 2 get 1/2 week of freshness Visiting e 2 is a better deal!
17
17 Selecting Optimal Refresh Frequency Analysis is complex Shape of curve is the same in all cases Holds for any distribution g( )
18
18 Optimal Refresh Frequency for Age Analysis is also complex Shape of curve is the same in all cases Holds for any distribution g( )
19
19 Comparing Policies Based on Statistics from experiment and revisit frequency of every month
20
20 Summary Maintaining the collection fresh: –Web evolution experiment –Change metrics –Optimal policy Intuitive policy does not always perform well –Should be careful in deciding revisit policy
21
21 Future work Weighted freshness model Non-Poisson process model Change frequency estimation
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.