Effective Page Refresh Policies for Web Crawlers
Written By: Junghoo Cho & Hector Garcia-Molina
Presenter: Suchitra Manepalli
Jan 27, 2005, 791 Digital Preservation Seminar

Presentation transcript:

Digital Preservation Seminar: Effective Page Refresh Policies for Web Crawlers
Written By: Junghoo Cho & Hector Garcia-Molina
Presenter: Suchitra Manepalli

Searching on the Web
- Information overload
- Indexing: Google, AltaVista
- Integration: BizRate
- Focus on indexing

How Google Works? (figure, copyright © 2003 Google Inc.)

Crawling (figure, taken from Cho's thesis)
- Steps: init, get next URL, get page, extract URLs
- Data: initial URLs, to-visit URLs, visited URLs, web pages, the web
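The loop in the crawling diagram can be sketched roughly as follows. This is a minimal illustration, not the crawler from the thesis: `TOY_WEB`, `crawl`, and the `fetch` callback are hypothetical names, and pages come from an in-memory dictionary instead of the network.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> HTML. A real crawler would fetch over HTTP.
TOY_WEB = {
    "http://a.example/": '<a href="http://b.example/">b</a> <a href="http://c.example/">c</a>',
    "http://b.example/": '<a href="http://a.example/">a</a>',
    "http://c.example/": '<a href="http://d.example/">d</a>',
}

HREF = re.compile(r'href="([^"]+)"')


def crawl(initial_urls, fetch):
    """Breadth-first crawl; fetch(url) returns HTML or None if unavailable."""
    to_visit = deque(initial_urls)   # "to visit urls" (init)
    visited = set()                  # "visited urls"
    pages = {}                       # "web pages"
    while to_visit:
        url = to_visit.popleft()     # get next url
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)            # get page
        if html is None:
            continue
        pages[url] = html
        for link in HREF.findall(html):  # extract urls
            if link not in visited:
                to_visit.append(link)
    return pages, visited


pages, visited = crawl(["http://a.example/"], TOY_WEB.get)
```

A queue gives breadth-first order; swapping it for a priority queue is one way a page selection policy could be plugged in.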

Challenges
- Page selection and scrape
  - What page to scrape?
- Page and index update
  - How to update pages?
- Page ranking
  - What page is “important” or “relevant”?
  - Determine “canonical” copy?
- Scalability
  - What is the maximum number of pages that we can afford to “index”?

Focusing
Of the challenges listed on the previous slide, the rest of the talk concentrates on page and index update: how to update pages?


Presentation Outline
- Introduction
- Problems
- Framework: effective solutions
- Different policies
- Weighted freshness
- Experiments
- Conclusion

Introduction
- Between crawls, web sites change nondeterministically
- Main issue: how often do we crawl?

Questions
How can we keep pages fresh?
- What are “fresh” pages?
- How often should the index be maintained?
- What constraints are imposed?
- What are the refresh policies?
- How effective are the refresh policies?

“Freshness”
Assuming each element is equally important, the freshness of element $e_i$ at time $t$ is

$$F(e_i; t) = \begin{cases} 1 & \text{if } e_i \text{ is up-to-date at time } t \\ 0 & \text{otherwise} \end{cases}$$

The freshness of the database $S$ at time $t$ is

$$F(S; t) = \frac{1}{N} \sum_{i=1}^{N} F(e_i; t)$$

(Figure: elements $e_i$ on the web and their copies in the database.)

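“Age”
Assuming equal importance of pages, the age of element $e_i$ at time $t$ is

$$A(e_i; t) = \begin{cases} 0 & \text{if } e_i \text{ is up-to-date at time } t \\ t - (\text{modification time of } e_i) & \text{otherwise} \end{cases}$$

The age of the database $S$ at time $t$ is

$$A(S; t) = \frac{1}{N} \sum_{i=1}^{N} A(e_i; t)$$

(Figure: elements $e_i$ on the web and their copies in the database.)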
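The two definitions can be evaluated on a small snapshot as in the sketch below. The representation is an assumption made for illustration: each element is described by `stale_since`, which is `None` when the local copy is up-to-date and otherwise the time at which the real-world element was modified. All function names are hypothetical.

```python
from typing import Optional, Sequence


def element_freshness(stale_since: Optional[float]) -> int:
    """F(e_i; t): 1 if the local copy is up-to-date, 0 otherwise."""
    return 1 if stale_since is None else 0


def element_age(stale_since: Optional[float], t: float) -> float:
    """A(e_i; t): 0 if up-to-date, else time elapsed since the modification."""
    return 0.0 if stale_since is None else t - stale_since


def database_freshness(db: Sequence[Optional[float]]) -> float:
    """F(S; t): average element freshness for a snapshot taken at time t."""
    return sum(element_freshness(s) for s in db) / len(db)


def database_age(db: Sequence[Optional[float]], t: float) -> float:
    """A(S; t): average element age at time t."""
    return sum(element_age(s, t) for s in db) / len(db)


# Four elements observed at t = 10: two fresh, one stale since t = 3, one since t = 8.
db = [None, 3.0, None, 8.0]
f = database_freshness(db)      # 2 of 4 elements are fresh
a = database_age(db, t=10.0)    # (0 + 7 + 0 + 2) / 4
```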

“Freshness” and “Age” (figure: $F(e_i)$ and $A(e_i)$ plotted against time, with update and refresh events marked)

Poisson process
- Real-world elements are modified by a Poisson process
- Changes happen randomly and independently, at a fixed rate over time

“Expected” - Variables
- The next event occurs in a Poisson process with change rate λ
- The probability that $e_i$ changes at least once in the time interval $(0, t]$ is $1 - e^{-\lambda t}$
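For a Poisson process with rate λ, the waiting time to the next change is exponentially distributed, so the probability of at least one change in (0, t] is 1 - e^(-λt). The sketch below compares that closed form with a simple simulation; the function names and the chosen values of λ and t are illustrative assumptions.

```python
import math
import random


def prob_changed(lam: float, t: float) -> float:
    """Closed form: probability of at least one change in (0, t]."""
    return 1.0 - math.exp(-lam * t)


def simulated_prob_changed(lam: float, t: float, trials: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate: draw the time to the next change and count draws <= t."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if rng.expovariate(lam) <= t)
    return hits / trials


exact = prob_changed(lam=0.5, t=2.0)            # 1 - e^{-1}
approx = simulated_prob_changed(lam=0.5, t=2.0)
```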

“Expected” - Equations
- Expected freshness
- Expected age
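Under the Poisson model, and assuming $e_i$ was last synchronized at time 0, the two expectations can be derived as sketched below. The notation follows the freshness and age definitions earlier in the talk and may differ from the slide's own.

```latex
% Assumption: e_i is synchronized at time 0 and changes by a Poisson process with rate \lambda.
% Freshness is 1 only if no change has occurred in (0, t].
% Age is t - s when the first change after synchronization occurs at time s <= t.
\begin{align*}
E[F(e_i; t)] &= 1 \cdot \Pr\{\text{no change in } (0, t]\} + 0 \cdot \Pr\{\text{change in } (0, t]\}
              = e^{-\lambda t}, \\
E[A(e_i; t)] &= \int_0^t (t - s)\, \lambda e^{-\lambda s}\, ds
              = t - \frac{1 - e^{-\lambda t}}{\lambda}.
\end{align*}
```

Expected freshness decays toward 0 and expected age grows roughly linearly for large $t$, which is why the element has to be refreshed periodically.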

“Expected” - Graphs (figure)

Evolution Model of Database
- Uniform change frequency model
  - All real-world elements change at the same frequency λ
  - Each individual element changes over time; all elements change at the same average rate
- Non-uniform change frequency model
  - Elements change at different rates

Histogram of Change Frequencies (figure)

Synchronization Policies
- Synchronization frequency
- Resource allocation
- Synchronization order
- Synchronization points

Synchronization Policies
- Synchronization frequency
  - How frequently do we synchronize the database?
  - The more often we synchronize, the fresher the database
- Resource allocation
  - How frequently should we synchronize each individual element?
  - Uniform allocation policy
  - Non-uniform allocation policy

Synchronization Policies
- Synchronization order: in what order do we synchronize the elements?
  - Fixed order: the same order is used repeatedly
  - Random order: the synchronization order is different in each iteration
  - Purely random: at each synchronization point, we select a random element from the database and synchronize it
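The three synchronization orders can be written as small generators, sketched here under the assumption that the database is just a list of element names. The function names are hypothetical.

```python
import random


def fixed_order(elements, iterations):
    """Fixed order: sweep the elements in the same order every iteration."""
    for _ in range(iterations):
        yield from elements


def random_order(elements, iterations, rng):
    """Random order: every element once per iteration, reshuffled each time."""
    for _ in range(iterations):
        order = list(elements)
        rng.shuffle(order)
        yield from order


def purely_random(elements, sync_points, rng):
    """Purely random: pick any element independently at each synchronization point."""
    for _ in range(sync_points):
        yield rng.choice(elements)


elements = ["e1", "e2", "e3"]
rng = random.Random(0)
fixed = list(fixed_order(elements, iterations=2))
shuffled = list(random_order(elements, iterations=2, rng=rng))
picks = list(purely_random(elements, sync_points=6, rng=rng))
```

Under the first two policies every element is synchronized exactly once per sweep; under the purely random policy an element may be picked several times or not at all over the same number of synchronization points.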

Synchronization Policies: Synchronization Points (figure)

Synchronization Order - Policies: Fixed order policy

Synchronization Order - Policies: Random order policy

Synchronization Order - Policies: Purely random policy

Comparison

Resource Allocation Policies
- What can we do if the elements change at different rates and we know how often each element changes?
- Is it better to synchronize an element more often when it changes more often?
- Is it better to synchronize all elements equally?

Trick Question
- Two-page database
  - e_1 changes daily
  - e_2 changes once a week
- We can visit one page per week
- How should we visit pages?
  - e_1 e_2 e_1 e_2 e_1 e_2 e_1 e_2 ... [uniform]
  - e_1 e_1 e_1 e_1 e_1 e_1 e_1 e_2 e_1 e_1 … [proportional]

Proportional is often not good
- Visit fast-changing e_1: get 1/2 day of freshness
- Visit slow-changing e_2: get 1/2 week of freshness
- Visiting e_2 is a better deal!
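The "1/2 day" and "1/2 week" figures can be reproduced with a small simulation. The model here is an assumption made for illustration: a page changes exactly once per period at a uniformly random moment, and the visit happens at the start of a period, so the copy stays fresh until that moment. `mean_fresh_time` is a hypothetical helper.

```python
import random


def mean_fresh_time(period_days: float, trials: int = 100_000, seed: int = 0) -> float:
    """Average time (in days) a freshly visited copy stays up-to-date.

    Assumes one change per period at a uniformly random time, visit at period start.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += rng.uniform(0.0, period_days)  # time until the page changes
    return total / trials


gain_e1 = mean_fresh_time(1.0)   # e_1 changes daily
gain_e2 = mean_fresh_time(7.0)   # e_2 changes once a week
```

Under this model one visit to e_2 buys about seven times as much fresh time as one visit to e_1.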

Uniform versus Proportional
- Intuitively, we would assume that the proportional allocation policy performs better than the uniform policy
- For the two-element database, the uniform policy is actually better
- To improve freshness, we should penalize the elements that change too often
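The comparison can be checked numerically for the two-page example. The sketch assumes Poisson changes (e_1 at 7 changes per week on average, e_2 at 1 per week) and synchronization at regular intervals, in which case the time-averaged freshness of one element is (1 - e^(-λ/f)) / (λ/f) for change rate λ and synchronization frequency f. Function and variable names are hypothetical.

```python
import math


def avg_freshness(lam: float, freq: float) -> float:
    """Time-averaged freshness of one element: Poisson change rate lam,
    synchronized at regular intervals of length 1/freq."""
    if freq == 0:
        return 0.0
    x = lam / freq
    return (1.0 - math.exp(-x)) / x


def database_freshness(rates, freqs) -> float:
    """Unweighted average of the per-element freshness values."""
    return sum(avg_freshness(l, f) for l, f in zip(rates, freqs)) / len(rates)


rates = [7.0, 1.0]   # assumed mean changes per week for e_1 and e_2
budget = 1.0         # total visits per week

uniform_freqs = [budget / len(rates)] * len(rates)
proportional_freqs = [budget * l / sum(rates) for l in rates]

uniform = database_freshness(rates, uniform_freqs)
proportional = database_freshness(rates, proportional_freqs)
```

With these numbers the uniform allocation gives a database freshness of about 0.25 and the proportional allocation about 0.125, consistent with the slide's claim.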

Weighted Freshness
- What if elements have different importance?
- How do we synchronize the elements to maximize the freshness of the database as perceived by the users?
- Should we refresh one element more often than the other?

Weighted Freshness Metrics
To capture this concept, each element is given a weight, and both metrics (freshness and age) are defined using those weights.
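One natural way to write the weighted metrics, assuming each element $e_i$ carries a positive weight $w_i$ (a symbol introduced here for illustration), is a weighted average of the per-element values:

```latex
% Assumption: w_i > 0 is the weight (importance) of element e_i.
\[
F(S; t) = \frac{\sum_{i=1}^{N} w_i \, F(e_i; t)}{\sum_{i=1}^{N} w_i},
\qquad
A(S; t) = \frac{\sum_{i=1}^{N} w_i \, A(e_i; t)}{\sum_{i=1}^{N} w_i}.
\]
```

When all weights are equal, both expressions reduce to the unweighted averages used in the definitions of freshness and age.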

Experimental Setup
- 270 sites visited
  - identified 400 sites with highest “PageRank”
  - contacted administrators
- February 17 to June 24, [...],000 pages from each site daily
  - start at root, visit breadth first (get new & old pages)
  - ran only 9pm - 6am, 10 seconds between site requests

Change interval of pages (figure)
Pages change 10 days

Results
- The results indicate a Poisson curve, as predicted
- Constraint: web pages were crawled on a daily basis
  - The model could not be verified for pages that change very often
  - Nor for pages that change less frequently
- At the typical crawling rate of search engines (for example, Google), the exact change behavior is of relative importance

Experiment 2: Synchronization-order
- Selected pages with an average change interval of two weeks
- Simulated multiple crawls
  - Once a day
  - Once every week
  - Once every month
  - Once every two months
- Assumed each page changed in the middle of the day

Synchronization-order policy (figure)

Results
- Theoretical implications
  - How can we measure how fresh a local database is?
  - How can we guarantee a certain freshness of a local database?

Experiment 3: Frequency of Change
- Average change interval of a page
  - Estimated by dividing the monitoring period by the number of detected changes in the page
  - Example: a page changed 4 times in a 4-month period
  - Estimated average change interval of the page: 4 months / 4 = 1 month
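The estimate described above can be sketched as follows, assuming the crawler keeps one snapshot (for example a checksum) per visit and counts a change whenever two consecutive snapshots differ. `estimate_change_interval` and the snapshot values are hypothetical.

```python
def estimate_change_interval(snapshots, monitoring_period):
    """Monitoring period divided by the number of detected changes.

    snapshots: page fingerprints in visit order.
    Returns None when no change was detected (the interval cannot be estimated).
    """
    changes = sum(1 for prev, cur in zip(snapshots, snapshots[1:]) if prev != cur)
    if changes == 0:
        return None
    return monitoring_period / changes


# A page whose fingerprint changed 4 times over a 4-month monitoring period.
snapshots = ["v0", "v0", "v1", "v1", "v2", "v3", "v3", "v4"]
interval_months = estimate_change_interval(snapshots, monitoring_period=4.0)
```

Because only differences between consecutive visits are counted, several changes between two visits are detected as one, which is the source of the inaccuracy for pages that change more than once a day.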

Frequency of Change (figure)

Results
- Pages maintained at commercial sites are updated frequently
- The method gives a reasonable average change interval for most pages
- The estimate may not be accurate
  - If a page changes more than once every day
  - If a page changes several times a day, but then remains static for a week

Experiment 4: Resource-Allocation
- How frequently do we synchronize each group?
- From the previous experiment
  - 23% of pages change every day
  - 15% change every week
  - Pages that did not change for 4 months are assumed to change every year
- Tests for
  - Proportional
  - Uniform
  - Optimal

Resource Allocation Policy (figure)

Results
- The proportional policy performs very poorly when pages change very often
- The optimal policy becomes relatively more effective than the uniform policy
- Lesson learned: the optimal policy performs better for monitoring frequently changing information

Conclusion
- Proportional synchronization policy
  - Intuitively appealing
  - Does not work well
- Optimal policies
  - Improve freshness and age significantly on real web data

Conclusion
- Two metrics
  - “Freshness”
  - “Age”
- Synchronization policies
  - Synchronization frequency
  - Resource allocation
  - Synchronization order
  - Synchronization points
