Looking at both the Present and the Past to Efficiently Update Replicas of Web Content
Luciano Barbosa*, Ana Carolina Salgado†, Francisco Tenorio†, Jacques Robin†, Juliana Freire*
*University of Utah   †Universidade Federal de Pernambuco

Library Scenario
- A library has an office and shelves
- Books are added
- Books are updated or removed
- Perfect scenario: all the information is up-to-date

Another Library Scenario
- The same library: office, shelves, books added, updated, or removed
- But now there are billions of books
- Books change or are removed at different rates
- Not enough resources to update everything

Challenges
- Autonomous and independent sources
- Lots of data: billions of pages
- Dynamism: 40% of Web pages change at least once a week (Cho and Garcia-Molina, 2000)
- Applications run on limited resources:
  - Search engine coverage: 42% (Lawrence and Giles, 1999)
  - Average time for a search engine to update a page: 186 days (Lawrence and Giles, 1999)
- Update too often: wasted resources
- Update sporadically: stale content

Which Applications Face these Challenges?
- Proxy servers
- Web archives
- E.g., search engines:
  - Stale content
  - Broken links
  - Updated pages not available in the index
  - Low quality of results

Current Solutions
- Goal: update replicas only when needed
- Two main approaches: push and pull
- Push
  - The site or user provides information about the change frequency of pages
  - E.g., Google Sitemaps; requires cooperation
- Pull
  - The application learns change frequencies; no cooperation required
  - Expensive to learn: exhaustive crawls are needed until frequencies are learned
- Can we do better?

Our Solution
- Similar to "pull" approaches:
  - Predict the change rate of pages
  - Update pages based on this prediction
- Look at the present to reduce the cost of learning:
  - Take page content into account
  - Page content gives a good indication of a page's dynamism (Douglis et al., 1999)
- Quickly adapts to changes in update frequencies
- More efficient: avoids unnecessary visits

Updating Web Content: Our Solution (architecture diagram)
- The crawler fetches a page and checks whether it is a new page
- Yes (Phase 1): the static classifier produces a change prediction from the page itself
- No (Phase 2): the historic classifier produces a change prediction from the page's change history
- Change predictions and page histories are stored and used to schedule revisits
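
Below is a minimal sketch of the two-phase dispatch shown in the diagram above: new pages go to a content-based (static) classifier, revisited pages go to a history-based classifier. The Page class and the two toy classifiers are illustrative stand-ins, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Page:
    url: str
    features: Dict[str, float]                          # static, content-based features
    history: List[bool] = field(default_factory=list)   # True = page changed on that visit

def predict_change_group(page: Page,
                         static_classifier: Callable[[Dict[str, float]], str],
                         historic_classifier: Callable[[List[bool]], str]) -> str:
    """Phase 1 for new pages (no history yet), Phase 2 once a change history exists."""
    if not page.history:                                 # new page -> static classifier
        return static_classifier(page.features)
    return historic_classifier(page.history)             # known page -> historic classifier

# Toy stand-ins for the two classifiers, only to make the sketch runnable.
static_classifier = lambda f: "one week" if f.get("num_images", 0) > 10 else "one month"
historic_classifier = lambda h: "one week" if sum(h) / len(h) > 0.5 else "one month"

page = Page("http://example.com/news", {"num_images": 25.0})
print(predict_change_group(page, static_classifier, historic_classifier))   # Phase 1
page.history = [True, False, True]
print(predict_change_group(page, static_classifier, historic_classifier))   # Phase 2
```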

Building the Solution: Overview
1. Gathering the training set
2. Creating the change-rate groups
3. Learning static features
4. Learning from history

Gathering the Training Set
- The 100 most accessed sites of the Brazilian Web
  - A representative subset of the Web, interesting to Web users
- Breadth-first crawl down to depth 9
- Total of URLs collected
- Two thirds used to build the classifiers
- One third used to run the experimental evaluation
- Each page visited once a day for 100 days
- Result:
  - Attributes of pages
  - History of page changes, used to calculate the average change rate of each page
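
This is a sketch of how a per-page change history gathered from daily visits could be reduced to an average change rate, as described above. Comparing MD5 digests of successive snapshots is an assumption about how "change" is detected; the slides do not prescribe it.

```python
import hashlib
from typing import List

def change_history(daily_snapshots: List[bytes]) -> List[bool]:
    """True at position i if the page content differs from the previous day's visit."""
    digests = [hashlib.md5(s).hexdigest() for s in daily_snapshots]
    return [digests[i] != digests[i - 1] for i in range(1, len(digests))]

def average_change_interval(history: List[bool]) -> float:
    """Average number of days between observed changes (inf if no change was seen)."""
    changes = sum(history)
    return len(history) / changes if changes else float("inf")

snapshots = [b"v1", b"v1", b"v2", b"v2", b"v2", b"v3"]   # toy daily page contents
hist = change_history(snapshots)
print(hist, average_change_interval(hist))               # [False, True, False, False, True] 2.5
```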

Creating the Change-Rate Groups
- Goal: predict the average interval of time at which a given page is modified
  - Stated this way, it is a regression task
- Discretizing the change rates turns it into a classification task
- An unsupervised discretization was performed
- Result: a set of discretized change-rate groups
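
The slide only says an unsupervised discretization was performed. As an illustration, the sketch below uses equal-frequency binning via scikit-learn's KBinsDiscretizer; the binning strategy and the number of groups are assumptions, not the paper's choices.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Average change intervals (in days) for a handful of toy pages.
intervals = np.array([[1.0], [1.5], [3.0], [4.0], [30.0], [45.0], [90.0], [100.0]])

# Equal-frequency (quantile) binning into 4 change-rate groups.
disc = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")
groups = disc.fit_transform(intervals).ravel().astype(int)

print(groups)            # [0 0 1 1 2 2 3 3]
print(disc.bin_edges_)   # interval boundaries that define each group
```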

Learning Static Features
- Classify pages into modification groups based on static features
- There is a relation between certain Web page attributes and a page's dynamism:
  - Dynamic pages are larger and have more images (Douglis et al., 1999)
  - The absence of the HTTP LAST-MODIFIED header indicates more volatile pages (Brewington and Cybenko, 2000)
- Attributes used: presence of the HTTP LAST-MODIFIED header, file size in bytes, number of images, depth of a page in its domain, ...
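
A small sketch of how the listed attributes could be extracted for a single URL with the Python standard library. The regex-based image count and the definition of depth as the number of path segments are simplifying assumptions.

```python
import re
from urllib.parse import urlparse
from urllib.request import urlopen

def static_features(url: str) -> dict:
    """Extract the content-based attributes listed on this slide for one page."""
    with urlopen(url) as resp:
        body = resp.read()
        headers = resp.headers
    return {
        "has_last_modified": int(headers.get("Last-Modified") is not None),
        "size_bytes": len(body),
        "num_images": len(re.findall(rb"<img\b", body, flags=re.IGNORECASE)),
        # Depth approximated as the number of path segments below the site root.
        "depth": len([s for s in urlparse(url).path.split("/") if s]),
    }

print(static_features("http://example.com/"))
```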

Feature Selection
- Determine the relevance of the different features: make sure they are really relevant
- Wrapper method:
  - Uses an induction algorithm
  - Chooses the feature subset that results in the lowest error rate
- Result:
  - Depth of a page in its domain is not relevant
  - The remaining features are used in the static classifier
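
For reference, a wrapper-style selection can be approximated with scikit-learn's SequentialFeatureSelector wrapped around a decision tree (standing in for J48). This is a generic sketch on synthetic data, not the exact procedure used in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the page-attribute matrix and change-rate groups.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

# Wrapper selection: repeatedly train the induction algorithm and keep the
# feature subset with the best cross-validated accuracy (lowest error rate).
selector = SequentialFeatureSelector(DecisionTreeClassifier(random_state=0),
                                     n_features_to_select=3, cv=5)
selector.fit(X, y)
print(selector.get_support())   # boolean mask over the candidate features
```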

Building the Static Classifier
- Classify pages into modification groups
- Classification algorithms: J48 (decision tree), NaiveBayes (naive Bayes), IBk (k-nearest neighbor)
- Measures of performance: error test rate and classification time
- Results (error test rate and classification time, in seconds) compared:
  - J48 without pruning
  - J48 with post-pruning
  - NaiveBayes
  - IBk with two values of k
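
The slide names Weka algorithms (J48, NaiveBayes, IBk). The sketch below runs the same kind of comparison with common scikit-learn analogs on synthetic data; the analog choices, pruning parameter, and k value are assumptions.

```python
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           n_redundant=0, n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# scikit-learn analogs of the Weka algorithms named on the slide.
models = {
    "decision tree, unpruned": DecisionTreeClassifier(random_state=0),
    "decision tree, pruned": DecisionTreeClassifier(ccp_alpha=0.01, random_state=0),
    "naive Bayes": GaussianNB(),
    "k-NN, k=3": KNeighborsClassifier(n_neighbors=3),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    start = time.perf_counter()
    error = 1.0 - model.score(X_test, y_test)     # error test rate
    elapsed = time.perf_counter() - start         # classification time
    print(f"{name}: error={error:.3f}, time={elapsed:.4f}s")
```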

Learning from History
- Historic classifier: classifies pages into modification groups based on their change history
- Each modification class has:
  - An average update rate, e.g., 1 day, 3 days, 31 days, and 96 days
  - A window size: the number of visits before a page is re-classified
    - A class with a lower average update rate has a larger window size, and vice versa
  - Minimum and maximum average-change thresholds
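
A minimal sketch of the window/threshold re-classification rule described above. The concrete group names, rates, window sizes, and thresholds below are illustrative assumptions (only the "one week" values appear in the example on the next slide).

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ChangeGroup:
    name: str
    avg_update_days: float   # average update rate of the group
    window_size: int         # visits observed before the page may be re-classified
    min_threshold: float     # move to a slower group if the observed rate falls below
    max_threshold: float     # move to a faster group if the observed rate rises above

# Groups ordered from fastest- to slowest-changing; values are illustrative.
GROUPS = [
    ChangeGroup("one day",   1,  8, 0.6, 1.0),
    ChangeGroup("one week",  7,  3, 0.4, 0.9),
    ChangeGroup("one month", 31, 2, 0.1, 0.8),
]

def reclassify(group_index: int, recent_changes: List[bool]) -> int:
    """Move a page to a faster or slower group based on its most recent visits."""
    group = GROUPS[group_index]
    if len(recent_changes) < group.window_size:
        return group_index                          # not enough visits yet
    rate = sum(recent_changes) / len(recent_changes)
    if rate < group.min_threshold and group_index + 1 < len(GROUPS):
        return group_index + 1                      # changing less often than expected
    if rate > group.max_threshold and group_index > 0:
        return group_index - 1                      # changing more often than expected
    return group_index

# The next slide's example: a "one week" page that does not change for 3 visits
# (observed rate 0, below the 0.4 threshold) is moved to the "one month" group.
print(GROUPS[reclassify(1, [False, False, False])].name)    # one month
```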

Learning from History: Example
- Two change-frequency groups:
  - One week: window size = 3, minimum threshold = 0.4
  - One month
- At time T0, page P belongs to the "one week" group
- After 3 weeks, P has not changed
  - The observed average change rate of P is 0
- P is moved to the "one month" group

Experiment
- Compare against a Bayesian estimator (Cho and Garcia-Molina):
  - First visit: frequency chosen randomly
  - Over time: refined by Bayesian inference
- Test set: one third of the monitored data (28,233 pages)
- Performance measure: error rate
  - A lower error rate means pages are visited close to their actual change frequency
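
The slides do not spell out the exact error measure; as a hedged reading, the sketch below treats the error rate as the percentage of pages assigned to the wrong change-rate group.

```python
from typing import List

def error_rate(predicted: List[str], actual: List[str]) -> float:
    """Percentage of pages whose predicted change-rate group differs from the actual one."""
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return 100.0 * wrong / len(actual)

# Toy example with four pages.
predicted = ["one week", "one month", "one day", "one week"]
actual    = ["one week", "one week",  "one day", "one week"]
print(error_rate(predicted, actual))   # 25.0
```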

Results
- Static classifier alone (error rate):
  - Random: 75.22
  - J48
  - The static classifier is more effective than assuming nothing about page behavior
- Combined configurations (error rate):
  - Random + Bayesian: 34.73
  - J48 + Historic: 14.95
  - Combining historic and static information gives the best performance

Related Work
- Cho and Garcia-Molina:
  - A uniform policy is always superior to the proportional (non-uniform) approach
  - Overall freshness is maximized, but their measure penalizes the most dynamic pages, which may not be updated as frequently as they change
- Pandey and Olston:
  - A user-centric approach to guide the update process
  - Maximizes the expected improvement in repository quality; non-uniform

Conclusion
- An efficient strategy for keeping replicas of Web content current:
  - Looks at page contents
  - Adapts quickly to changes in update frequency
- The static classifier is effective: page contents are a good indication of a page's change behavior
- Using static and historic information together leads to improved performance
- Future work:
  - Take additional features into account, e.g., PageRank and backlinks
  - Experiment with other learning techniques