Jan 27, Digital Preservation Seminar. Effective Page Refresh Policies for Web Crawlers. Written by: Junghoo Cho & Hector Garcia-Molina. Presenter: Suchitra Manepalli
Jan 27, Searching on the Web Information Overload Indexing Google, Alta-Vista Integration BizRate Focus on Indexing
Jan 27, How Google Works? Copyright © 2003 Google Inc.
Jan 27, Crawling web init get next url get page extract urls initial urls to visit urls visited urls web pages Taken from Cho Thesis
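The crawl loop in the diagram above (initial URLs, a to-visit queue, a visited set, fetch, extract links) can be sketched as follows. This is a minimal illustration, not the crawler from the Cho thesis: `FAKE_WEB` is a hypothetical in-memory stand-in for the web, and a real crawler would fetch pages over HTTP and parse their links.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of outgoing links.
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}

def crawl(initial_urls, web):
    """Breadth-first crawl; returns URLs in the order they were fetched."""
    to_visit = deque(initial_urls)   # "to visit urls"
    visited = set()                  # "visited urls"
    order = []
    while to_visit:
        url = to_visit.popleft()     # "get next url"
        if url in visited or url not in web:
            continue                 # already fetched, or unknown page
        visited.add(url)
        links = web[url]             # "get page" and "extract urls"
        order.append(url)
        for link in links:
            if link not in visited:
                to_visit.append(link)
    return order

if __name__ == "__main__":
    print(crawl(["a"], FAKE_WEB))
```

The queue gives breadth-first order; swapping it for a priority queue is one way to express a page-selection policy.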
Jan 27, Challenges Page selection and scrape What page to scrape? Page and index update How to update pages? Page ranking What page is “important” or “relevant”? Determine “Canonical” copy? Scalability What is the maximum number of pages that we can afford to “index”?
Jan 27, Focusing Page selection and scrape What page to scrape? Page and index update How to update pages? Page ranking What page is “important” or “relevant”? Determine “Canonical” copy? Scalability What is the maximum number of pages that we can afford to “index”?
Jan 27, 2005
Jan 27, Presentation Outline Introduction Problems Framework – Effective Solutions Different policies Weighted Freshness Experiments Conclusion
Jan 27, Introduction Between crawls, web sites change nondeterministically. Main issue: how often do we crawl?
Jan 27, Questions How can we keep pages fresh? What are “fresh” pages? How often should the index be maintained? What constraints are imposed? What are the refresh policies? How effective are the refresh policies?
Jan 27, “Freshness” Assuming each element is equally important
Freshness of element e_i at time t:
\[ F(e_i; t) = \begin{cases} 1 & \text{if } e_i \text{ is up-to-date at time } t \\ 0 & \text{otherwise} \end{cases} \]
Freshness of the database S at time t:
\[ F(S; t) = \frac{1}{N} \sum_{i=1}^{N} F(e_i; t) \]
[Figure: elements e_i on the web and their copies in the database]
Jan 27, “Age” Assume equal importance of pages
Age of element e_i at time t:
\[ A(e_i; t) = \begin{cases} 0 & \text{if } e_i \text{ is up-to-date at time } t \\ t - (\text{modification time of } e_i) & \text{otherwise} \end{cases} \]
Age of the database S at time t:
\[ A(S; t) = \frac{1}{N} \sum_{i=1}^{N} A(e_i; t) \]
[Figure: elements e_i on the web and their copies in the database]
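The two metrics can be computed directly from their definitions. The sketch below assumes a hypothetical representation in which each local copy is described by `stale_since`: `None` if the copy is up-to-date, otherwise the time at which the real-world element was modified and the copy became out of date.

```python
from typing import Optional, Sequence

def element_freshness(stale_since: Optional[float]) -> int:
    """F(e_i; t): 1 if the copy is up-to-date, 0 otherwise."""
    return 1 if stale_since is None else 0

def element_age(stale_since: Optional[float], now: float) -> float:
    """A(e_i; t): 0 if up-to-date, else time elapsed since the modification."""
    return 0.0 if stale_since is None else now - stale_since

def database_freshness(stale_times: Sequence[Optional[float]]) -> float:
    """F(S; t): average freshness over all N elements."""
    return sum(element_freshness(s) for s in stale_times) / len(stale_times)

def database_age(stale_times: Sequence[Optional[float]], now: float) -> float:
    """A(S; t): average age over all N elements."""
    return sum(element_age(s, now) for s in stale_times) / len(stale_times)

if __name__ == "__main__":
    # Four copies observed at t = 10: two fresh, two stale since t = 2 and t = 7.
    copies = [None, 2.0, None, 7.0]
    print(database_freshness(copies), database_age(copies, now=10.0))
```

Freshness only records whether a copy is stale, while age also records how long it has been stale.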
Jan 27, “Freshness” and “Age” [Figure: F(e_i) and A(e_i) plotted over time, with update and refresh events marked]
Jan 27, Poisson process Real-world elements are modified by a Poisson process: changes happen randomly and independently, at a fixed rate over time
Jan 27, “Expected” - Variables The next event occurs in a Poisson process with change rate λ. The probability that e_i changes at least once in the time interval (0, t] is \( 1 - e^{-\lambda t} \)
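For a Poisson process with rate λ, the waiting time until the next change is exponentially distributed, so the probability of at least one change in (0, t] has the closed form 1 - exp(-λt). The sketch below checks that closed form against a small simulation; the function names, trial count, and seed are illustrative choices.

```python
import math
import random

def prob_changed(lam: float, t: float) -> float:
    """Closed form: probability of at least one change in (0, t]."""
    return 1.0 - math.exp(-lam * t)

def simulate_prob_changed(lam: float, t: float, trials: int = 20000, seed: int = 0) -> float:
    """Estimate the same probability by sampling the time to the first change."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if rng.expovariate(lam) <= t)
    return hits / trials

if __name__ == "__main__":
    lam, t = 0.5, 2.0  # e.g. one change every two days on average, observed for two days
    print(prob_changed(lam, t), simulate_prob_changed(lam, t))
```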
Jan 27, “Expected” - Equations Expected Freshness Expected Age
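Under the Poisson model with change rate λ, and assuming e_i is synchronized at time 0 and not again before time t, the expected values can be derived as follows. This is a derivation sketch under those assumptions and may not match the slide's own notation.

```latex
\begin{align*}
E[F(e_i; t)] &= \Pr\{\text{no change in } (0, t]\} = e^{-\lambda t} \\
E[A(e_i; t)] &= \int_0^t (t - s)\,\lambda e^{-\lambda s}\,ds
             = t - \frac{1 - e^{-\lambda t}}{\lambda}
\end{align*}
```

In the age integral, s is the time of the first change after synchronization, which has density λe^{-λs}; the copy then has age t - s at time t.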
Jan 27, “Expected” - Graphs
Jan 27, Evolution Model of Database Uniform Change Frequency Model All real-world elements change at the same frequency λ Individual element changes over time All elements change at the same average rate Non-Uniform Change Frequency Model Elements change at different rates
Jan 27, Histogram of Change Frequencies
Jan 27, Synchronization Policies Synchronization Frequency Resource Allocation Synchronization Order Synchronization Points
Jan 27, Synchronization Policies Synchronization Frequency: How frequently do we synchronize the database? The more often, the fresher. Resource Allocation: How frequently should we synchronize each individual element? Uniform Allocation Policy, Non-Uniform Allocation Policy
Jan 27, Synchronization Policies Synchronization Order: In what order should we synchronize the elements? Fixed order: the same order repeatedly. Random order: the synchronization order is different in each iteration. Purely random: at each synchronization point, we select a random element from the database and synchronize it
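The three ordering policies differ only in how the sequence of elements to synchronize is generated. A minimal sketch, with hypothetical function names and a `rounds` parameter standing for the number of passes over the database:

```python
import random

def fixed_order(elements, rounds):
    """Same order in every iteration."""
    return [e for _ in range(rounds) for e in elements]

def random_order(elements, rounds, rng):
    """Every element once per iteration, in a fresh random permutation."""
    schedule = []
    for _ in range(rounds):
        perm = list(elements)
        rng.shuffle(perm)
        schedule.extend(perm)
    return schedule

def purely_random(elements, rounds, rng):
    """At each synchronization point, pick any element at random."""
    return [rng.choice(elements) for _ in range(rounds * len(elements))]

if __name__ == "__main__":
    rng = random.Random(0)
    pages = ["e1", "e2", "e3"]
    print(fixed_order(pages, 2))
    print(random_order(pages, 2, rng))
    print(purely_random(pages, 2, rng))
```

Under the first two policies every element is synchronized exactly once per iteration; under the purely random policy an element may be picked several times in a row or skipped for a long time.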
Jan 27, Synchronization Policies Synchronization Points
Jan 27, Synchronization Order - Policies Fixed order policy
Jan 27, Synchronization Order - Policies Random order policy
Jan 27, Synchronization Order - Policies Purely Random Policy
Jan 27, Comparison
Jan 27, Resource Allocation Policies What can we do if the elements change at different rates and we know how often each element changes? Is it better to synchronize an element more often when it changes more often? Is it better to synchronize equally?
Jan 27, Trick Question Two-page database: e_1 changes daily, e_2 changes once a week. We can visit one page per week. How should we visit pages? e_1 e_2 e_1 e_2 e_1 e_2 e_1 e_2 ... [uniform] e_1 e_1 e_1 e_1 e_1 e_1 e_1 e_2 e_1 e_1 ... [proportional]
Jan 27, Proportional is often not good Visiting the fast-changing e_1 gets 1/2 day of freshness. Visiting the slow-changing e_2 gets 1/2 week of freshness. Visiting e_2 is a better deal!
Jan 27, Uniform versus Proportional Intuitively, we assume the proportional allocation policy performs better than the uniform policy. For the two-element database, the uniform policy is actually better. To improve freshness we should penalize the elements that change too often
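The two-page example can be checked numerically. The sketch below assumes both pages change as Poisson processes (7 changes per week for e_1, 1 per week for e_2) and that each page is refreshed at regular intervals. Under those assumptions the time-averaged freshness of a page with change rate λ and refresh interval I is the average of exp(-λt) over one interval, (1 - exp(-λI)) / (λI). The names `RATES` and `BUDGET` are illustrative.

```python
import math

def avg_freshness(change_rate: float, interval: float) -> float:
    """Time-averaged freshness of one page refreshed every `interval`."""
    x = change_rate * interval
    return (1.0 - math.exp(-x)) / x

RATES = {"e1": 7.0, "e2": 1.0}  # changes per week (assumed Poisson rates)
BUDGET = 1.0                    # total page visits per week

def policy_freshness(visit_rates) -> float:
    """Average freshness of the two-page database under a visit allocation."""
    return sum(avg_freshness(RATES[e], 1.0 / visit_rates[e]) for e in RATES) / len(RATES)

# Uniform: split the budget equally. Proportional: split it by change rate.
uniform = {e: BUDGET / len(RATES) for e in RATES}
total_rate = sum(RATES.values())
proportional = {e: BUDGET * RATES[e] / total_rate for e in RATES}

if __name__ == "__main__":
    print("uniform     ", round(policy_freshness(uniform), 4))
    print("proportional", round(policy_freshness(proportional), 4))
```

With these numbers the uniform policy gives roughly twice the average freshness of the proportional policy: the visits spent on e_1 buy very little freshness, because e_1 goes stale again almost immediately.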
Jan 27, Weighted Freshness What if elements have different importance? How do we synchronize the elements to maximize the freshness of the database as perceived by the users? Should we refresh one more than the other?
Jan 27, Weighted Freshness Metrics To capture the concept: weights are given Freshness: Age:
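One natural way to write the weighted metrics is as weight-normalized averages of the per-element values. The symbol w_i for the weight of element e_i is introduced here for illustration, and the form below is a sketch consistent with the unweighted definitions rather than a transcription of the slide.

```latex
\begin{align*}
F(S; t) &= \frac{\sum_{i=1}^{N} w_i \, F(e_i; t)}{\sum_{i=1}^{N} w_i} \\
A(S; t) &= \frac{\sum_{i=1}^{N} w_i \, A(e_i; t)}{\sum_{i=1}^{N} w_i}
\end{align*}
```

When all weights are equal, both expressions reduce to the plain averages over the N elements.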
Jan 27, Experimental Setup 270 sites visited: identified 400 sites with highest “PageRank” and contacted administrators. February 17 to June 24, ,000 pages from each site daily. Start at root, visit breadth first (get new & old pages). Ran only 9pm - 6am, 10 seconds between site requests
Jan 27, Change interval of pages Pages change 10 days
Jan 27, Results The results indicate a Poisson curve, as predicted. Constraint: web pages were crawled on a daily basis, so the model could not be verified for pages that change very often or very rarely. At the typical crawling rate of search engines (for example, Google), the exact change time is of relative importance
Jan 27, Experiment 2: Synchronization-order Selected pages with average change frequency : Two weeks Simulated multiple crawls: Once a day Once every week Once every month Once every two months Assumed page changed in middle of the day
Jan 27, Synchronization-order policy
Jan 27, Results Theoretical implications How can we measure how fresh a local database is? How can we guarantee certain freshness of a local database?
Jan 27, Experiment 3: Frequency of Change Average change interval of a page Dividing monitoring period by the number of detected changes in a page Page changed 4 times in 4 month period Estimate the average change interval of the page: 4 months/4 = 1 month
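The estimate described above is a single division. A minimal sketch, with a hypothetical function name and the choice (made here) to return `None` when no change was detected:

```python
from typing import Optional

def estimate_change_interval(monitoring_period: float, detected_changes: int) -> Optional[float]:
    """Average change interval = monitoring period / number of detected changes.

    Returns None when no change was detected, since the interval is then
    only known to be longer than the monitoring period.
    """
    if detected_changes <= 0:
        return None
    return monitoring_period / detected_changes

if __name__ == "__main__":
    # The slide's example: 4 changes detected over a 4-month period.
    print(estimate_change_interval(4.0, 4))
```

Because a page visited once a day can show at most one detected change per day, this estimate can never come out shorter than one day, however often the page really changes.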
Jan 27, Frequency of Change
Jan 27, Results Pages maintained at commercial sites are updated frequently. The method gives a reasonable average change interval for most pages. The estimate may not be accurate if a page changes more than once a day, or if a page changes several times in one day but then remains static for a week
Jan 27, Experiment 4: Resource-Allocation How frequently should we synchronize each group? From the previous experiment: 23% of pages change every day, 15% change every week. Pages that did not change in 4 months are assumed to change once a year. Tests for: Proportional, Uniform, Optimal
Jan 27, Resource Allocation Policy
Jan 27, Results The proportional policy performs very poorly when pages change very often. The optimal policy becomes relatively more effective than the uniform policy. Lesson learned: the optimal policy is the better choice for monitoring frequently changing information
Jan 27, Conclusion Proportional Synchronization Policy: intuitively appealing, but does not work well. Optimal policies: improve freshness and age significantly using real web data
Jan 27, Conclusion Two Metrics “Freshness” “Age” Synchronization Policies Synchronization Frequency Resource Allocation Synchronization Order Synchronization Points