Effective Page Refresh Policies for Web Crawlers. Written By: Junghoo Cho & Hector Garcia-Molina. 791 Digital Preservation Seminar, Jan 27, 2005.


1 Effective Page Refresh Policies for Web Crawlers
791 Digital Preservation Seminar, Jan 27, 2005
Written By: Junghoo Cho & Hector Garcia-Molina
Presenter: Suchitra Manepalli

2 Searching on the Web
Information overload
Indexing
  - Google, AltaVista
Integration
  - BizRate
Focus on indexing

3 How Google Works
[Figure. Copyright © 2003 Google Inc.]

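4 Crawling
[Diagram, taken from Cho's thesis. The crawler initializes the list of URLs to visit from a set of initial URLs, then repeatedly gets the next URL, gets the page from the web, and extracts its URLs. It maintains three stores: URLs to visit, visited URLs, and web pages.]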
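The loop in the diagram can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `get_page` and `extract_urls` are hypothetical callables supplied by the caller, `max_pages` is an invented safety limit, and the toy in-memory "web" stands in for real HTTP fetching.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def crawl(initial_urls: Iterable[str],
          get_page: Callable[[str], str],
          extract_urls: Callable[[str], List[str]],
          max_pages: int = 100) -> Dict[str, str]:
    """Basic crawl loop: init, get next url, get page, extract urls."""
    to_visit = deque(initial_urls)   # "urls to visit", seeded from the initial urls
    visited = set()                  # "visited urls"
    pages: Dict[str, str] = {}       # "web pages" collected so far

    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()     # get next url
        if url in visited:
            continue
        visited.add(url)
        page = get_page(url)         # get page
        pages[url] = page
        for link in extract_urls(page):   # extract urls
            if link not in visited:
                to_visit.append(link)
    return pages


# Toy in-memory "web" (hypothetical): each page body is a space-separated list of links.
TOY_WEB = {"a": "b c", "b": "c", "c": "a", "d": ""}
crawled = crawl(["a"],
                get_page=lambda url: TOY_WEB.get(url, ""),
                extract_urls=lambda body: body.split())
```

Page "d" is never reached because nothing links to it, which is the usual limitation of link-following crawlers.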

5 Challenges
Page selection and scrape
  - What page to scrape?
Page and index update
  - How to update pages?
Page ranking
  - What page is “important” or “relevant”?
  - Determine the “canonical” copy?
Scalability
  - What is the maximum number of pages that we can afford to “index”?

6 Focusing
The same four challenges as on the previous slide: page selection and scrape, page and index update, page ranking, and scalability.

11 Presentation Outline
Introduction
Problems
Framework - Effective Solutions
Different policies
Weighted Freshness
Experiments
Conclusion

12 Introduction
Between crawls, web sites change nondeterministically.
Main issue
  - How often do we crawl?

13 Questions
How can we keep pages fresh?
  - What are “fresh” pages?
  - How often should the index be maintained?
  - What constraints are imposed?
  - What are the refresh policies?
  - How effective are the refresh policies?

14 “Freshness”
Assume each element is equally important.
Freshness of element $e_i$ at time $t$:
$$F(e_i; t) = \begin{cases} 1 & \text{if } e_i \text{ is up-to-date at time } t \\ 0 & \text{otherwise} \end{cases}$$
Freshness of the database $S$ at time $t$:
$$F(S; t) = \frac{1}{N} \sum_{i=1}^{N} F(e_i; t)$$
[Diagram: elements $e_i$ on the web and their copies in the database.]

15 “Age”
Assume equal importance of pages.
Age of element $e_i$ at time $t$:
$$A(e_i; t) = \begin{cases} 0 & \text{if } e_i \text{ is up-to-date at time } t \\ t - (\text{modification time of } e_i) & \text{otherwise} \end{cases}$$
Age of the database $S$ at time $t$:
$$A(S; t) = \frac{1}{N} \sum_{i=1}^{N} A(e_i; t)$$
[Diagram: elements $e_i$ on the web and their copies in the database.]
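The two database-level metrics are simple averages, which a small Python sketch makes concrete. The representation is an assumption made for illustration: each element maps to `None` when its local copy is up-to-date, or to the time of the real-world modification that made the copy stale. The name `stale_since` is hypothetical.

```python
from typing import Dict, Optional


def freshness(stale_since: Dict[str, Optional[float]]) -> float:
    """F(S; t): fraction of elements whose local copy is up-to-date."""
    fresh = sum(1 for modified_at in stale_since.values() if modified_at is None)
    return fresh / len(stale_since)


def age(stale_since: Dict[str, Optional[float]], t: float) -> float:
    """A(S; t): mean age; an up-to-date element contributes age 0."""
    total = 0.0
    for modified_at in stale_since.values():
        if modified_at is not None:
            total += t - modified_at     # t - (modification time of e_i)
    return total / len(stale_since)


# Four elements observed at t = 10: e2 went stale at t = 3, e4 at t = 7.
snapshot = {"e1": None, "e2": 3.0, "e3": None, "e4": 7.0}
db_freshness = freshness(snapshot)       # 2 of 4 elements are up-to-date
db_age = age(snapshot, t=10.0)           # (0 + 7 + 0 + 3) / 4
```

Freshness only counts whether a copy is stale, while age also records how long it has been stale, so two databases with equal freshness can differ widely in age.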

16 “Freshness” and “Age”
[Figure: $F(e_i)$ (between 0 and 1) and $A(e_i)$ (from 0) plotted against time, with “update” and “refresh” events marked.]

17 Poisson process
Real world
  - Elements are modified by a Poisson process
  - Changes happen randomly and independently, at a fixed rate over time
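A Poisson change process is easy to simulate, because the gaps between consecutive changes are independent exponential random variables. The sketch below is only an illustration; the rate, horizon, and seed are arbitrary assumptions, not values from the talk.

```python
import random
from typing import List


def change_times(rate: float, horizon: float, rng: random.Random) -> List[float]:
    """Sample the change times of one element up to `horizon` from a Poisson process."""
    times = []
    t = rng.expovariate(rate)        # gaps between changes are exponential(rate)
    while t <= horizon:
        times.append(t)
        t += rng.expovariate(rate)
    return times


rng = random.Random(0)               # fixed seed, only for reproducibility
changes = change_times(rate=0.5, horizon=1000.0, rng=rng)
observed_rate = len(changes) / 1000.0    # should come out close to the true rate 0.5
```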

18 “Expected” - Variables
The next event occurs in a Poisson process with change rate $\lambda$.
The probability that $e_i$ changes at least once in the time interval $(0, t]$ is $1 - e^{-\lambda t}$.

19 “Expected” - Equations
Expected Freshness
Expected Age
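A sketch of what the two expected values look like under the Poisson model, assuming (as an illustration, not as a quote from the slide) that $e_i$ was last synchronized at time 0 and changes at rate $\lambda$. The element is still fresh at time $t$ only if no change has occurred, and its age is $t - s$ when the first change happened at time $s$:

```latex
\begin{align*}
E[F(e_i; t)] &= \Pr\{\text{no change in } (0, t]\} = e^{-\lambda t} \\
E[A(e_i; t)] &= \int_0^t (t - s)\, \lambda e^{-\lambda s}\, ds
              = t - \frac{1 - e^{-\lambda t}}{\lambda}
\end{align*}
```

Expected freshness decays from 1 toward 0, while expected age starts at 0 and grows almost linearly once $t$ is large compared with $1/\lambda$.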

20 “Expected” - Graphs
[Figure]

21 Evolution Model of Database
Uniform Change Frequency Model
  - All real-world elements change at the same frequency $\lambda$
  - Each individual element changes over time
  - All elements change at the same average rate
Non-Uniform Change Frequency Model
  - Elements change at different rates

22 Histogram of Change Frequencies
[Figure]

23 Synchronization Policies
Synchronization Frequency
Resource Allocation
Synchronization Order
Synchronization Points

24 Synchronization Policies
Synchronization Frequency
  - How frequently do we synchronize the database?
  - The more often we synchronize, the fresher the database
Resource Allocation
  - How frequently should we synchronize each individual element?
  - Uniform Allocation Policy
  - Non-Uniform Allocation Policy

25 Synchronization Policies
Synchronization Order
  - In what order do we synchronize the elements?
  - Fixed order: the same order repeatedly
  - Random order: the synchronization order is different in each iteration
  - Purely random: at each synchronization point, we select a random element from the database and synchronize it
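The three orders differ only in how the schedule of elements is generated, as the following Python sketch shows. The function names and the three-element database are illustrative assumptions.

```python
import random
from typing import List


def fixed_order(elements: List[str], rounds: int) -> List[str]:
    """Fixed order: the same order repeatedly."""
    return [e for _ in range(rounds) for e in elements]


def random_order(elements: List[str], rounds: int, rng: random.Random) -> List[str]:
    """Random order: every element once per iteration, reshuffled each iteration."""
    schedule: List[str] = []
    for _ in range(rounds):
        batch = elements[:]
        rng.shuffle(batch)
        schedule.extend(batch)
    return schedule


def purely_random(elements: List[str], points: int, rng: random.Random) -> List[str]:
    """Purely random: at each synchronization point, pick any element at random."""
    return [rng.choice(elements) for _ in range(points)]


elements = ["e1", "e2", "e3"]
rng = random.Random(1)                   # fixed seed, only for reproducibility
fixed = fixed_order(elements, rounds=2)
shuffled = random_order(elements, rounds=2, rng=rng)
arbitrary = purely_random(elements, points=6, rng=rng)
```

Under the first two policies every element is synchronized exactly once per iteration. Under the purely random policy an element may be picked twice in a row or skipped for a long time.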

26 Synchronization Policies
Synchronization Points
[Figure]

27 Synchronization Order - Policies
Fixed order policy
[Figure]

28 Synchronization Order - Policies
Random order policy
[Figure]

29 Synchronization Order - Policies
Purely random policy
[Figure]

30 Comparison
[Figure]
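To make the comparison concrete, here is a sketch of the time-averaged freshness each order yields under the uniform Poisson model. The assumptions are that every element changes at rate $\lambda$ and is synchronized at average rate $f$, and that the database is large, with $r = \lambda / f$. These closed forms are derived from those assumptions and are not copied from the slide's figure.

```python
import math


def fixed_order_freshness(r: float) -> float:
    """Element refreshed at regular intervals 1/f: average of exp(-lambda*t) over one interval."""
    return (1 - math.exp(-r)) / r


def random_order_freshness(r: float) -> float:
    """Element refreshed once per iteration, at a uniformly random point within it."""
    return (1 - fixed_order_freshness(r) ** 2) / r


def purely_random_freshness(r: float) -> float:
    """Refreshes of an element arrive like a Poisson process of rate f (large database)."""
    return 1 / (1 + r)


r = 1.0   # each element changes, on average, once per synchronization interval
comparison = {
    "fixed order": fixed_order_freshness(r),
    "random order": random_order_freshness(r),
    "purely random": purely_random_freshness(r),
}
```

At $r = 1$ the fixed order gives the highest freshness and the purely random policy the lowest, because irregular gaps between refreshes leave an element stale for longer on average.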

31 Resource Allocation Policies
What can we do if the elements change at different rates and we know how often each element changes?
Is it better to synchronize an element more often when it changes more often?
Is it better to synchronize all elements equally?

32 Trick Question
Two-page database
  - $e_1$ changes daily
  - $e_2$ changes once a week
We can visit one page per week.
How should we visit pages?
  - $e_1$ $e_2$ $e_1$ $e_2$ $e_1$ $e_2$ $e_1$ $e_2$ ... [uniform]
  - $e_1$ $e_1$ $e_1$ $e_1$ $e_1$ $e_1$ $e_1$ $e_2$ $e_1$ $e_1$ … [proportional]

33 Proportional is often not good
Visit fast-changing $e_1$: get 1/2 day of freshness
Visit slow-changing $e_2$: get 1/2 week of freshness
Visiting $e_2$ is a better deal!

34 Uniform versus Proportional
Intuitively, we would assume that the proportional allocation policy performs better than the uniform policy.
For the two-element database, the uniform policy is actually better.
To improve freshness, we should penalize the elements that change too often.
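The two-page example can be checked numerically. The sketch below assumes Poisson changes (7 per week for $e_1$, 1 per week for $e_2$), regularly spaced visits, and a total budget of one visit per week. It is a model-based illustration rather than the slide's own arithmetic, and the function names are hypothetical.

```python
import math


def avg_freshness(change_rate: float, sync_rate: float) -> float:
    """Time-averaged freshness of one element refreshed at regular intervals."""
    if sync_rate == 0:
        return 0.0                       # never refreshed: stale in the long run
    r = change_rate / sync_rate
    return (1 - math.exp(-r)) / r


def database_freshness(f1: float, budget: float = 1.0) -> float:
    """Two-page database: e1 gets f1 visits per week, e2 gets the rest of the budget."""
    f2 = budget - f1
    return (avg_freshness(7.0, f1) + avg_freshness(1.0, f2)) / 2


uniform = database_freshness(0.5)            # visit e1 and e2 equally often
proportional = database_freshness(7 / 8)     # visits proportional to change rates
# Coarse grid search over the share of visits given to e1.
best_f1 = max((i / 100 for i in range(101)), key=database_freshness)
```

Under these assumptions the uniform split gives roughly twice the freshness of the proportional split, and the grid search puts the whole budget on the slow-changing page, which is the "penalize elements that change too often" effect in its most extreme form.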

35 Weighted Freshness
What if elements differ in importance?
How do we synchronize the elements to maximize the freshness of the database as perceived by the users?
Should one element be refreshed more often than another?

36 Weighted Freshness Metrics
To capture the concept, each element is given a weight.
Freshness: [equation]
Age: [equation]
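One natural way to write the weighted metrics is as weighted averages of the per-element values. The weight symbol $w_i$ is an assumption introduced here for illustration:

```latex
\begin{align*}
F(S; t) &= \frac{\sum_{i=1}^{N} w_i\, F(e_i; t)}{\sum_{i=1}^{N} w_i} &
A(S; t) &= \frac{\sum_{i=1}^{N} w_i\, A(e_i; t)}{\sum_{i=1}^{N} w_i}
\end{align*}
```

When all weights are equal, both expressions reduce to the unweighted $\frac{1}{N}\sum_{i=1}^{N}$ definitions of freshness and age.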

37 Experimental Setup
270 sites visited
  - Identified 400 sites with the highest “PageRank”
  - Contacted administrators
February 17 to June 24, 1999
3,000 pages from each site daily
  - Start at root, visit breadth first (get new & old pages)
  - Ran only 9pm - 6am, 10 seconds between site requests

38 Change interval of pages
[Figure. Annotation: pages change 10 days]

39 Results
The results indicate a Poisson curve, as predicted.
Constraint:
  - Web pages were crawled on a daily basis
  - The model is not verified for pages that change very often or very rarely
At the typical crawling rate of search engines, the exact change model is of relative importance.
For example: Google

40 Experiment 2: Synchronization-order
Selected pages with an average change interval of two weeks
Simulated multiple crawls:
  - Once a day
  - Once every week
  - Once every month
  - Once every two months
Assumed each page changed in the middle of the day

41 Synchronization-order policy
[Figure]

42 Results
Theoretical implications
  - How can we measure how fresh a local database is?
  - How can we guarantee a certain level of freshness for a local database?

43 Experiment 3: Frequency of Change
Average change interval of a page
  - Divide the monitoring period by the number of changes detected in the page
  - Example: a page changed 4 times in a 4-month period
Estimated average change interval of the page: 4 months / 4 = 1 month
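The estimate is a single division, sketched here in Python. The function name and the choice to raise an error when no change was detected are illustrative assumptions.

```python
def average_change_interval(monitoring_period: float, detected_changes: int) -> float:
    """Estimate the average change interval as monitoring period / detected changes."""
    if detected_changes <= 0:
        # With no detected change, the interval is only known to exceed the period.
        raise ValueError("no changes detected during the monitoring period")
    return monitoring_period / detected_changes


# The slide's example: 4 detected changes over a 4-month monitoring period.
interval_in_months = average_change_interval(4.0, 4)
```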

44 Frequency of Change
[Figure]

45 Results
Pages maintained at commercial sites:
  - Updated frequently
The method gives a reasonable average change interval for most pages.
The estimate may not be accurate
  - If a page changes more than once every day
  - If a page changes several times in one day but then remains static for a week
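The first caveat can be quantified with a short sketch. It assumes Poisson changes and one visit per day that compares the page with the previous snapshot, so at most one change is detected per day however many actually occurred. The function names are hypothetical.

```python
import math


def detected_changes_per_day(true_rate_per_day: float) -> float:
    """Expected detections per daily visit: probability of at least one change that day."""
    return 1 - math.exp(-true_rate_per_day)


def naive_interval_estimate(true_rate_per_day: float) -> float:
    """Long-run value of (monitoring period / detected changes), in days."""
    return 1 / detected_changes_per_day(true_rate_per_day)


fast_page = naive_interval_estimate(5.0)   # true average interval: 0.2 days
slow_page = naive_interval_estimate(0.1)   # true average interval: 10 days
```

For the slow page the estimate stays close to the true 10 days, while for the fast page it saturates near 1 day, since a daily crawl cannot see more than one change per day.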

46 Experiment 4: Resource-Allocation
How frequently should we synchronize each group?
From the previous experiment:
  - 23% of pages change every day
  - 15% change every week
  - Pages that did not change for 4 months are assumed to change once a year
Tests for:
  - Proportional
  - Uniform
  - Optimal

47 Resource Allocation Policy
[Figure]

48 Results
The proportional policy performs very poorly when pages change very often.
The optimal policy becomes relatively more effective than the uniform policy.
Lesson learned:
  - The optimal policy performs better for monitoring frequently changing information

49 Conclusion
Proportional synchronization policy
  - Intuitively appealing
  - Does not work well
Optimal policies
  - Improve freshness and age significantly on real web data

50 Conclusion
Two metrics
  - “Freshness”
  - “Age”
Synchronization policies
  - Synchronization Frequency
  - Resource Allocation
  - Synchronization Order
  - Synchronization Points

51 References
http://oak.cs.ucla.edu/~cho/
  - Interesting information: talks, experiments, publications, course material
http://www.googleguide.com/google_works.html
  - How Google works
http://www.google.com/remove.html
  - Google indexing tips
http://www.seorank.com/google-pagerank.htm
  - Google PageRank algorithm explained

