
1. Looking at both the Present and the Past to Efficiently Update Replicas of Web Content
Luciano Barbosa*, Ana Carolina Salgado!, Francisco Tenorio!, Jacques Robin!, Juliana Freire*
*University of Utah   !Universidade Federal de Pernambuco

2. WIDM 2005 - Looking at both the Present and the Past to Efficiently Update Replicas of Web Content
Library Scenario
- A library has an office and shelves; staff add books, and update or remove books
- Perfect scenario: all the information is up-to-date

3. Another Library Scenario
- Billions of books
- Books change or are removed at different rates
- Not enough resources to update everything

4. Challenges
- Autonomous and independent sources
- Lots of data: billions of pages
- Dynamism: 40% of Web pages change at least once a week (Cho and Molina, 2000)
- Applications run with limited resources
  - Search engine coverage: 42% (Lawrence and Giles, 1999)
  - Average time for a search engine to update a page: 186 days (Lawrence and Giles, 1999)
- Update too often: wasted resources; update sporadically: stale content

5. Which Applications Face these Challenges?
- Proxy servers
- Web archives, e.g., http://www.archive.org
- Search engines
- Consequences of stale content: broken links, updated pages not available in the index, low quality of results

6. Current Solutions
- Goal: update replicas only when needed
- Two main approaches: push and pull
- Push: the site or user provides information about the change frequency of pages
  - E.g., Google Sitemaps; requires cooperation
- Pull: the application learns change frequencies; no cooperation required
  - Expensive to learn: requires exhaustive crawls until the frequencies are learned
- Can we do better?

7. Our Solution
- Similar to "pull" approaches: predict the change rate of pages and update them based on this prediction
- Looks at the present to reduce the cost of learning
  - Takes page content into account: a page's content gives a good indication of its dynamism (Douglis et al., 1999)
- Adapts quickly to changes in update frequencies
- More efficient: avoids unnecessary visits

8. Updating Web Content: Our Solution
[Architecture diagram: the crawler fetches a page; if the page is new (Phase 1), the static classifier produces a change prediction; otherwise (Phase 2), the historic classifier produces a change prediction from the page's change history.]
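The two-phase routing on this slide can be sketched as follows. This is a minimal illustration, not the paper's code; the function and variable names, the toy classifiers, and the 0.4 threshold in the historic rule are assumptions for the example.

```python
def predict_change_rate(page, seen_pages, static_classifier, historic_classifier):
    """Route a page to the right classifier, as in the two-phase pipeline."""
    if page["url"] not in seen_pages:
        # Phase 1: no history yet -- predict from static content features.
        return static_classifier(page)
    # Phase 2: history available -- predict from the recorded changes.
    return historic_classifier(seen_pages[page["url"]])

# Toy stand-in classifiers mapping a page / its history to a change-rate group.
static = lambda page: "one-week" if page["num_images"] > 5 else "one-month"
historic = lambda hist: "one-week" if sum(hist) / len(hist) >= 0.4 else "one-month"

seen = {"http://example.org/a": [1, 0, 1]}  # 1 = page changed on that visit
group_new = predict_change_rate(
    {"url": "http://example.org/b", "num_images": 8}, seen, static, historic)
group_old = predict_change_rate(
    {"url": "http://example.org/a", "num_images": 2}, seen, static, historic)
```

The dispatch mirrors the diagram: unseen pages get a content-based prediction immediately, so the system does not have to wait for history to accumulate.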

9. Building the Solution: Overview
1. Gathering the training set
2. Creating the change rate groups
3. Learning static features
4. Learning from history

10. Gathering the Training Set
- 100 most accessed sites of the Brazilian Web: a representative subset of the Web, interesting to Web users
- Breadth-first crawl down to depth 9: 84,699 URLs in total
  - 2/3 used to build the classifiers, 1/3 used for the experimental evaluation
- Each page visited once a day for 100 days
- Result: static attributes of the pages, plus a history of page changes used to calculate the average change rate of each page
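The average change rate from daily snapshots can be computed roughly as below. This is a sketch under the assumption that a "change" is any difference between consecutive daily versions (detected here by content hashing); the paper's exact change-detection criterion is not reproduced.

```python
import hashlib

def average_change_rate(daily_contents):
    """Fraction of day-to-day transitions in which the page changed.

    daily_contents: list of page bodies, one per daily visit.
    """
    digests = [hashlib.sha1(c.encode()).hexdigest() for c in daily_contents]
    changes = sum(1 for prev, cur in zip(digests, digests[1:]) if prev != cur)
    return changes / (len(digests) - 1)

# A page observed on 5 consecutive days, changing twice (2 changes / 4 intervals):
rate = average_change_rate(["v1", "v1", "v2", "v2", "v3"])
```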

11. Creating the Change Rate Groups
- Predicting the average interval at which a given page is modified is a regression task
- Discretizing the change rates turns it into a classification task
- An unsupervised discretization was performed
- Result: a set of discrete change-rate groups
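The slide names unsupervised discretization without specifying which method; equal-frequency binning is one common unsupervised choice, sketched below as an illustration (the function name and bin count are assumptions, not from the paper).

```python
def equal_frequency_bins(rates, n_bins):
    """Split sorted change rates into n_bins groups of (roughly) equal size
    and return the cut points between consecutive groups."""
    ordered = sorted(rates)
    size = len(ordered) // n_bins
    return [ordered[i * size] for i in range(1, n_bins)]

rates = [0.01, 0.02, 0.03, 0.1, 0.2, 0.3, 0.5, 0.9]
cuts = equal_frequency_bins(rates, 4)  # 3 cut points define 4 groups
```

Each page is then labeled with the group its average change rate falls into, which is the class the static and historic classifiers predict.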

12. Learning Static Features
- Goal: classify pages into modification groups based on static features
- Several Web page attributes are related to a page's dynamism:
  - Dynamic pages are larger and have more images (Douglis et al., 1999)
  - The absence of the HTTP LAST-MODIFIED header indicates more volatile pages (Brewington and Cybenko, 2000)
- Attributes used: presence of the LAST-MODIFIED header, file size in bytes, number of images, depth of the page in its domain...
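Extracting the listed attributes might look like this. The attribute set is the slide's; the implementation details (naive `<img` counting, byte length of the HTML) are simplifying assumptions.

```python
def extract_static_features(headers, html, depth):
    """Build the static feature vector for one page.

    headers: dict of HTTP response headers; html: page body; depth: link
    depth of the page within its domain.
    """
    return {
        "has_last_modified": "Last-Modified" in headers,  # header presence
        "size_bytes": len(html.encode()),                 # file size in bytes
        "num_images": html.lower().count("<img"),         # image tag count
        "depth": depth,                                   # depth in the domain
    }

feats = extract_static_features(
    {"Content-Type": "text/html"},
    "<html><body><img src='a.png'><img src='b.png'></body></html>",
    depth=2,
)
```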

13. Feature Selection
- Determine the relevance of the different features, to make sure they are really relevant
- Wrapper method: uses induction algorithms and chooses the subset that yields the lowest error rate
- Result: the depth of a page in its domain is not relevant; the remaining features are used in the static classifier
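A wrapper search can be sketched as an exhaustive subset evaluation against the induction algorithm's error estimate. The toy error function below, in which "depth" only adds noise, is a fabricated example to mirror the slide's finding; it is not the paper's data.

```python
from itertools import combinations

def wrapper_select(features, error_of):
    """Exhaustive wrapper search: evaluate every non-empty feature subset
    with the induction algorithm's error estimate and keep the best one."""
    best_subset, best_error = None, float("inf")
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            err = error_of(subset)
            if err < best_error:
                best_subset, best_error = set(subset), err
    return best_subset, best_error

# Toy error function: each informative feature lowers the error by 3 points,
# while including "depth" adds 2 points of noise.
def toy_error(subset):
    base = 20 - 3 * len([f for f in subset if f != "depth"])
    return base + (2 if "depth" in subset else 0)

selected, err = wrapper_select(
    ["last_modified", "size", "num_images", "depth"], toy_error)
```

In practice the error estimate would come from cross-validating the classifier on each subset, which is what makes wrapper methods accurate but expensive.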

14. Building the Static Classifier
- Goal: classify pages into modification groups
- Classification algorithms: J48 (decision tree), NaiveBayes (naive Bayes), IBk (k-nearest neighbor)
- Performance measures: test error rate and classification time
- Results:

  Algorithm             Test error rate   Classification time
  J48 without pruning        11.9              2.41 s
  J48 with postpruning       10.7              1.63 s
  NaiveBayes                 40.51            20.5 s
  IBk with k=1               11.27          4393.15 s
  IBk with k=2               11.88          6335.49 s
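To keep this sketch dependency-free, the decision-tree idea behind J48 is illustrated with a one-level tree (a decision stump) over boolean features; this stands in for Weka's J48 and is not the paper's classifier. The feature names and labels are invented for the example.

```python
def train_stump(rows, labels):
    """Train a one-level decision tree: pick the single boolean feature and
    leaf-label assignment that misclassify the fewest training rows."""
    best = None
    for feat in rows[0]:
        for true_label in set(labels):
            for false_label in set(labels):
                errs = sum(
                    1 for r, y in zip(rows, labels)
                    if (true_label if r[feat] else false_label) != y
                )
                if best is None or errs < best[0]:
                    best = (errs, feat, true_label, false_label)
    _, feat, tl, fl = best
    return lambda r: tl if r[feat] else fl

# Toy training data: pages lacking a Last-Modified header are the dynamic ones.
rows = [
    {"no_last_modified": True,  "large": True},
    {"no_last_modified": True,  "large": False},
    {"no_last_modified": False, "large": True},
    {"no_last_modified": False, "large": False},
]
labels = ["dynamic", "dynamic", "static", "static"]
clf = train_stump(rows, labels)
pred = clf({"no_last_modified": True, "large": False})
```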

15. Learning from History
- The historic classifier classifies pages into modification groups based on their change history
- Each modification class has:
  - An average update rate, e.g., 1 day, 3 days, 31 days, and 96 days
  - A window size: the number of visits after which a page is re-classified
    - Classes with lower average update rates have larger window sizes, and vice versa
  - Minimum and maximum change-average thresholds
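The re-classification rule can be sketched as below. This is a plausible reading of the slide, not the paper's code: the group ordering, the threshold values, and the "move one group at a time" behavior are assumptions.

```python
def reclassify(current_group, change_history, groups):
    """Re-classify a page once a full window of visits has been observed.

    groups: dict name -> (window_size, min_threshold, max_threshold),
    ordered from the fastest-changing group to the slowest.
    """
    window, lo, hi = groups[current_group]
    if len(change_history) < window:
        return current_group  # not enough visits yet
    avg = sum(change_history[-window:]) / window
    names = list(groups)
    idx = names.index(current_group)
    if avg < lo and idx + 1 < len(names):
        return names[idx + 1]  # changing less than expected: slower group
    if avg > hi and idx > 0:
        return names[idx - 1]  # changing more than expected: faster group
    return current_group

groups = {"one-week": (3, 0.4, 1.0), "one-month": (5, 0.0, 0.6)}
# Page in the one-week group, unchanged across its 3-visit window:
new_group = reclassify("one-week", [0, 0, 0], groups)
```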

16. Learning from History: Example
- Two change-frequency groups:
  - One week: window size = 3, minimum threshold = 0.4
  - One month
- At time T0, page P belongs to the "one week" group
- After 3 visits (3 weeks), P has not changed, so its average change rate is 0
- Since 0 is below the minimum threshold of 0.4, P is moved to the "one month" group

17. Experiment
- Compared against the Bayesian estimator of Cho and Molina:
  - First visit interval chosen randomly; refined over time via Bayesian inference
- Test set: 1/3 of the monitored data (28,233 pages)
- Performance measure: error rate
  - A lower error rate means pages are visited close to their actual change frequency
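The slide does not define the error measure precisely; one plausible reading, consistent with the percentage-style numbers on the next slide, is the fraction of pages placed in the wrong change-rate group. The sketch below follows that assumption.

```python
def error_rate(predicted_groups, actual_groups):
    """Percentage of pages assigned to the wrong change-rate group
    (an assumed reading of the slide's error measure)."""
    wrong = sum(1 for p, a in zip(predicted_groups, actual_groups) if p != a)
    return 100.0 * wrong / len(actual_groups)

err = error_rate(
    ["one-week", "one-month", "one-week", "one-week"],
    ["one-week", "one-week", "one-week", "one-month"],
)
```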

18. Results

  Classifier   Error rate
  Random          75.22
  J48             25.64

- The static classifier is more effective than making no assumptions about page behavior

  Configuration        Error rate
  Random + Bayesian       34.73
  J48 + Historic          14.95

- Combining the historic and static classifiers gives the best performance

19. Related Work
- Cho and Molina: a uniform policy is always superior to the proportional (non-uniform) approach
  - Although overall freshness is maximized, their measure penalizes the most dynamic pages, which may not be updated as frequently as they change
- Pandey and Olston: a user-centric approach to guide the update process
  - Maximizes the expected improvement in repository quality; non-uniform

20. Conclusion
- An efficient strategy for keeping replicas of Web content current:
  - Looks at page contents and adapts quickly to changes in update frequency
- The static classifier is effective: page contents are a good indication of change behavior
- Using static and historic information together leads to improved performance
- Future work:
  - Take additional features into account, e.g., PageRank and backlinks
  - Experiment with other learning techniques

