Published by Spencer Turner. Modified over 8 years ago.
1
Jan 27, 2005 | 791 Digital Preservation Seminar
Effective Page Refresh Policies for Web Crawlers
Written by: Junghoo Cho & Hector Garcia-Molina
Presenter: Suchitra Manepalli
2
Searching on the Web
- Information overload
- Indexing: Google, Alta-Vista
- Integration: BizRate
- Focus on indexing
3
How Does Google Work?
[Figure. Copyright © 2003 Google Inc.]
4
Crawling
[Figure: crawler loop over the web. Steps: init, get next url, get page, extract urls. Data stores: initial urls, to visit urls, visited urls, web pages. Taken from Cho's thesis.]
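The loop in the diagram can be sketched roughly as follows. This is a minimal illustration, not the crawler from Cho's thesis: the `FAKE_WEB` dictionary is a hypothetical stand-in for real page fetching and link extraction, and all names are invented for the example.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> URLs linked from that page.
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}

def crawl(initial_urls, web):
    """Breadth-first crawl; returns URLs in the order they were fetched."""
    to_visit = deque(initial_urls)   # the "to visit urls" store
    seen = set(initial_urls)         # URLs already queued or fetched
    visited = []                     # the "visited urls" store, in fetch order
    while to_visit:
        url = to_visit.popleft()     # get next url
        links = web.get(url, [])     # get page and extract urls (simulated)
        visited.append(url)
        for link in links:
            if link not in seen:     # queue each URL at most once
                seen.add(link)
                to_visit.append(link)
    return visited

print(crawl(["a"], FAKE_WEB))  # ['a', 'b', 'c', 'd']
```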
5
Challenges
- Page selection and scraping: which pages should we scrape?
- Page and index update: how do we update pages?
- Page ranking: which pages are “important” or “relevant”? How do we determine the “canonical” copy?
- Scalability: what is the maximum number of pages that we can afford to index?
6
Focusing
- Page selection and scraping: which pages should we scrape?
- Page and index update: how do we update pages?
- Page ranking: which pages are “important” or “relevant”? How do we determine the “canonical” copy?
- Scalability: what is the maximum number of pages that we can afford to index?
11
Presentation Outline
- Introduction
- Problems
- Framework – Effective Solutions
- Different policies
- Weighted Freshness
- Experiments
- Conclusion
12
Introduction
- Between crawls, web sites change non-deterministically.
- Main issue: how often should we crawl?
13
Questions
- How can we keep pages fresh?
- What are “fresh” pages?
- How often should the index be maintained?
- What constraints are imposed?
- What are the refresh policies?
- How effective are the refresh policies?
14
“Freshness”
Assuming each element is equally important:
- Freshness of element $e_i$ at time $t$:
  $$F(e_i; t) = \begin{cases} 1 & \text{if } e_i \text{ is up-to-date at time } t \\ 0 & \text{otherwise} \end{cases}$$
- Freshness of the database $S$ at time $t$:
  $$F(S; t) = \frac{1}{N} \sum_{i=1}^{N} F(e_i; t)$$
[Figure: elements $e_i$ on the web and their copies in the database]
15
“Age”
Assuming equal importance of pages:
- Age of element $e_i$ at time $t$:
  $$A(e_i; t) = \begin{cases} 0 & \text{if } e_i \text{ is up-to-date at time } t \\ t - (\text{modification time of } e_i) & \text{otherwise} \end{cases}$$
- Age of the database $S$ at time $t$:
  $$A(S; t) = \frac{1}{N} \sum_{i=1}^{N} A(e_i; t)$$
[Figure: elements $e_i$ on the web and their copies in the database]
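The two database-level metrics, freshness (the fraction of up-to-date elements) and age (the mean time since an unseen modification), can be computed with a few lines of Python. This is only a sketch: the function names and input layout are assumptions, and "modification time" is taken here to mean the earliest change that the local copy has not yet picked up.

```python
def freshness(up_to_date):
    """F(S; t): fraction of elements that are up-to-date at time t."""
    return sum(1 for ok in up_to_date if ok) / len(up_to_date)

def age(t, unsynced_change_times):
    """A(S; t): mean age over all elements at time t.

    unsynced_change_times[i] is the time of the earliest modification of e_i
    not yet reflected in the local copy (an assumption of this sketch),
    or None if e_i is up-to-date.
    """
    ages = [0.0 if m is None else t - m for m in unsynced_change_times]
    return sum(ages) / len(ages)

# Four elements at time t = 10; the second and third copies are stale.
print(freshness([True, False, False, True]))  # 0.5
print(age(10, [None, 4, 7, None]))            # 2.25
```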
16
“Freshness” and “Age”
[Figure: $F(e_i)$ (values 0 and 1) and $A(e_i)$ plotted against time, with “update” and “refresh” events marked]
17
Poisson process
- Real-world elements are modeled as being modified by a Poisson process.
- Changes happen randomly and independently, at a fixed rate over time.
18
“Expected” - Variables
- The next event occurs in a Poisson process with change rate $\lambda$.
- The probability that $e_i$ changes at least once in the time interval $(0, t]$ is $1 - e^{-\lambda t}$.
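As a sanity check on the Poisson assumption, the closed form $1 - e^{-\lambda t}$ for "at least one change in $(0, t]$" can be compared against a small simulation. The sketch below draws the waiting time to the first change from an exponential distribution with rate $\lambda$; the function names, trial count, and seed are arbitrary choices for illustration.

```python
import math
import random

def prob_changed(rate, t):
    """P(at least one change in (0, t]) for a Poisson process with this rate."""
    return 1.0 - math.exp(-rate * t)

def simulate(rate, t, trials=100_000, seed=0):
    """Empirical fraction of trials whose first change arrives by time t."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if rng.expovariate(rate) <= t)
    return hits / trials

# Example: one change every two days on average (rate 0.5/day), window of 2 days.
print(prob_changed(0.5, 2.0))
print(simulate(0.5, 2.0))
```

The two printed values should agree to about two decimal places; the closed form evaluates to $1 - e^{-1}$.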
19
“Expected” - Equations
- Expected freshness
- Expected age
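The equations on this slide did not survive extraction. Under the Poisson model with change rate $\lambda$, and assuming $e_i$ was last synchronized at time 0, the expected values work out as below. This is a reconstruction from the model, not a copy of the slide.

```latex
% Freshness is 1 only if no change has occurred in (0, t].
\mathrm{E}[F(e_i; t)] = e^{-\lambda t}

% Age is t - s when the first change occurs at time s <= t, and 0 otherwise.
\mathrm{E}[A(e_i; t)]
  = \int_0^t (t - s)\,\lambda e^{-\lambda s}\,ds
  = t - \frac{1 - e^{-\lambda t}}{\lambda}
```

Expected freshness decays from 1 toward 0, while expected age grows from 0 and becomes roughly linear in $t$ once $t$ is large compared with $1/\lambda$.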
20
“Expected” - Graphs
21
Evolution Model of Database
- Uniform change-frequency model: all real-world elements change at the same frequency $\lambda$. Individual elements change over time, but all elements change at the same average rate.
- Non-uniform change-frequency model: elements change at different rates.
22
Histogram of Change Frequencies
23
Synchronization Policies
- Synchronization frequency
- Resource allocation
- Synchronization order
- Synchronization points
24
Synchronization Policies
- Synchronization frequency: how frequently do we synchronize the database? The more often we synchronize, the fresher the database.
- Resource allocation: how frequently should we synchronize each individual element?
  - Uniform allocation policy
  - Non-uniform allocation policy
25
Synchronization Policies
- Synchronization order: in what order should we synchronize the elements?
  - Fixed order: the same order is repeated in every iteration.
  - Random order: the synchronization order is different in each iteration.
  - Purely random: at each synchronization point, we select a random element from the database and synchronize it.
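The three synchronization orders differ only in how the next element is chosen. A minimal sketch, with elements identified by their index and all function names hypothetical:

```python
import random

def fixed_order(n, rounds):
    """Same order e_0, e_1, ..., e_{n-1} in every iteration."""
    return [i for _ in range(rounds) for i in range(n)]

def random_order(n, rounds, rng):
    """Every element exactly once per iteration, reshuffled each iteration."""
    schedule = []
    for _ in range(rounds):
        order = list(range(n))
        rng.shuffle(order)
        schedule.extend(order)
    return schedule

def purely_random(n, rounds, rng):
    """At each synchronization point, pick any element uniformly at random."""
    return [rng.randrange(n) for _ in range(n * rounds)]

rng = random.Random(0)
print(fixed_order(3, 2))         # [0, 1, 2, 0, 1, 2]
print(random_order(3, 2, rng))   # each block of 3 is a permutation of 0..2
print(purely_random(3, 2, rng))  # an element may repeat or be skipped
```

Note that only the purely random policy can leave an element unsynchronized for an arbitrarily long time, since nothing forces each element to be visited once per iteration.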
26
Synchronization Policies
- Synchronization points
27
Synchronization Order - Policies
- Fixed order policy
28
Synchronization Order - Policies
- Random order policy
29
Synchronization Order - Policies
- Purely random policy
30
Comparison
31
Resource Allocation Policies
- What can we do if the elements change at different rates and we know how often each element changes?
- Is it better to synchronize an element more often when it changes more often?
- Is it better to synchronize all elements equally often?
32
Trick Question
- Two-page database
  - $e_1$ changes daily
  - $e_2$ changes once a week
- We can visit one page per week
- How should we visit pages?
  - $e_1\ e_2\ e_1\ e_2\ e_1\ e_2\ e_1\ e_2$ ... [uniform]
  - $e_1\ e_1\ e_1\ e_1\ e_1\ e_1\ e_1\ e_2\ e_1\ e_1$ … [proportional]
33
Proportional is often not good
- Visiting the fast-changing $e_1$ gains about 1/2 day of freshness.
- Visiting the slow-changing $e_2$ gains about 1/2 week of freshness.
- Visiting $e_2$ is a better deal!
34
Uniform versus Proportional
- Intuitively, we would assume that the proportional allocation policy performs better than the uniform policy.
- For the two-element database, the uniform policy is actually better.
- To improve freshness, we should penalize the elements that change too often.
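The claim that uniform beats proportional can be checked numerically for the two-page example. The sketch below assumes Poisson changes and refreshes at regular intervals, in which case the time-averaged expected freshness of an element with change rate $\lambda$ refreshed at frequency $f$ is $(1 - e^{-\lambda/f})/(\lambda/f)$, the average of $e^{-\lambda t}$ over one refresh interval. Treating "changes daily" as a rate of 7 per week, and the budget as one visit per week in total, are modeling assumptions of this example.

```python
import math

def avg_freshness(change_rate, sync_freq):
    """Time-averaged expected freshness of one element under the Poisson model,
    refreshed at regular intervals of 1 / sync_freq."""
    if sync_freq == 0:
        return 0.0  # never refreshed: stale in the long run
    r = change_rate / sync_freq
    return (1.0 - math.exp(-r)) / r

def db_freshness(change_rates, sync_freqs):
    """Mean freshness over all elements (equal importance)."""
    pairs = list(zip(change_rates, sync_freqs))
    return sum(avg_freshness(lam, f) for lam, f in pairs) / len(pairs)

rates = [7.0, 1.0]   # e_1: about 7 changes/week, e_2: about 1 change/week
budget = 1.0         # total visits per week

uniform = [budget / 2, budget / 2]
proportional = [budget * lam / sum(rates) for lam in rates]

print("uniform:     ", round(db_freshness(rates, uniform), 4))
print("proportional:", round(db_freshness(rates, proportional), 4))
```

Under these assumptions the uniform schedule yields roughly twice the freshness of the proportional one, because visits spent on $e_1$ buy very little: it is stale again almost immediately.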
35
Weighted Freshness
- What if elements have different importance?
- How should we synchronize the elements to maximize the freshness of the database as perceived by users?
- Should we refresh one element more often than another?
36
Weighted Freshness Metrics
- To capture this concept, each element is given a weight.
- Freshness:
- Age:
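The weighted formulas are missing from the extracted slide. One natural reading, assumed here rather than taken from the slide, is a weighted average: each element's freshness or age is multiplied by its weight $w_i$ and the sum is normalized by the total weight. A sketch under that assumption, with hypothetical function names:

```python
def weighted_freshness(up_to_date, weights):
    """Weighted average of per-element freshness (1 if up-to-date, else 0).
    Normalizing by the total weight is an assumption of this sketch."""
    fresh_weight = sum(w for ok, w in zip(up_to_date, weights) if ok)
    return fresh_weight / sum(weights)

def weighted_age(ages, weights):
    """Weighted average of per-element ages, normalized the same way."""
    return sum(w * a for a, w in zip(ages, weights)) / sum(weights)

# The important page (weight 3) is fresh; the minor page (weight 1) is 4 days old.
print(weighted_freshness([True, False], [3, 1]))  # 0.75
print(weighted_age([0.0, 4.0], [3, 1]))           # 1.0
```

With all weights equal, both functions reduce to the unweighted definitions of freshness and age.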
37
Experimental Setup
- 270 sites visited: identified the 400 sites with the highest “PageRank” and contacted their administrators
- February 17 to June 24, 1999
- 3,000 pages from each site daily: start at the root, visit breadth first (get new and old pages)
- Ran only 9pm - 6am, with 10 seconds between requests to a site
38
Change interval of pages
[Figure; surviving labels: “Pages change”, “10 days”]
39
Results
- The results indicate a Poisson curve, as predicted.
- Constraint: web pages were crawled on a daily basis, so the model is not verified for pages that change very often or very infrequently.
- Given the typical crawling rate of search engines (for example, Google), the exact change behavior of such pages is of only relative importance.
40
Experiment 2: Synchronization-order
- Selected pages with an average change frequency of two weeks
- Simulated multiple crawls:
  - Once a day
  - Once every week
  - Once every month
  - Once every two months
- Assumed each page changed in the middle of the day
41
Synchronization-order policy
42
Results
- Theoretical implications
  - How can we measure how fresh a local database is?
  - How can we guarantee a certain freshness of a local database?
43
Experiment 3: Frequency of Change
- Average change interval of a page: divide the monitoring period by the number of detected changes in the page.
- Example: a page changed 4 times in a 4-month period.
- Estimated average change interval of the page: 4 months / 4 = 1 month
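The estimator on this slide is a single division. A small sketch, where the function name and the choice to return `None` when no change was detected are assumptions made for the example:

```python
def estimate_change_interval(monitoring_period, detected_changes):
    """Average change interval = monitoring period / number of detected changes.

    Returns None when no change was detected, since the estimate is then
    undefined (an assumption of this sketch).
    """
    if detected_changes == 0:
        return None
    return monitoring_period / detected_changes

print(estimate_change_interval(4, 4))  # 1.0 (months), the example from the slide
```

Because the count is of detected changes, the estimate can only be as fine-grained as the crawl interval used during monitoring.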
44
Frequency of Change
45
Results
- Pages maintained at commercial sites are updated frequently.
- The method gives a reasonable average change interval for most pages.
- The estimate may not be accurate:
  - if a page changes more than once every day
  - if a page changes several times a day but then remains static for a week
46
Experiment 4: Resource-Allocation
- How frequently should we synchronize each group?
- From the previous experiment:
  - 23% of pages change every day
  - 15% change every week
  - Pages that did not change for 4 months are assumed to change every year
- Tests for:
  - Proportional
  - Uniform
  - Optimal
47
Resource Allocation Policy
48
Results
- The proportional policy performs very poorly when pages change very often.
- The optimal policy becomes relatively more effective than the uniform policy.
- Lesson learned: the optimal policy performs better when monitoring frequently changing information.
49
Conclusion
- Proportional synchronization policy
  - Intuitively appealing
  - Does not work well
- Optimal policies
  - Improve freshness and age significantly on real web data
50
Conclusion
- Two metrics
  - “Freshness”
  - “Age”
- Synchronization policies
  - Synchronization frequency
  - Resource allocation
  - Synchronization order
  - Synchronization points
51
References
- http://oak.cs.ucla.edu/~cho/ : Interesting information: talks, experiments, publications, course material
- http://www.googleguide.com/google_works.html : How Google works
- http://www.google.com/remove.html : Google indexing tips
- http://www.seorank.com/google-pagerank.htm : Google PageRank algorithm explained