How to Crawl the Web Peking University 12/24/2003 Junghoo “John” Cho

How to Crawl the Web Peking University 12/24/2003 Junghoo “John” Cho UCLA

What is a Crawler? [Diagram: init loads the initial urls into the to visit urls list; the crawler then repeats get next url, get page (from the web), extract urls; data stores: to visit urls, visited urls, web pages]
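The loop in the diagram can be sketched as follows. This is a minimal illustration, not the speaker's crawler: the in-memory `FAKE_WEB` dictionary, the `fetch` and `extract_urls` helpers, and the `max_pages` cap are all assumptions made so the example runs without network access.

```python
from collections import deque
import re

# Hypothetical in-memory "web": URL -> page text. A real crawler would
# replace fetch() with an HTTP request.
FAKE_WEB = {
    "http://a.example/": '<a href="http://b.example/">b</a> <a href="http://c.example/">c</a>',
    "http://b.example/": '<a href="http://a.example/">back to a</a>',
    "http://c.example/": "no links here",
}

def fetch(url):
    """get page: return the page text, or None if the URL is unknown."""
    return FAKE_WEB.get(url)

def extract_urls(page):
    """extract urls: pull href targets out of the page text."""
    return re.findall(r'href="([^"]+)"', page)

def crawl(initial_urls, max_pages=100):
    to_visit = deque(initial_urls)   # "to visit urls"
    visited = set()                  # "visited urls"
    pages = {}                       # "web pages"
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()     # get next url
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)            # get page
        if page is None:
            continue
        pages[url] = page
        for link in extract_urls(page):   # extract urls
            if link not in visited:
                to_visit.append(link)
    return pages
```

Using a FIFO queue makes the visit order breadth first; swapping in a priority queue would be one way to prefer some pages over others.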

Applications Internet Search Engines: Google, AltaVista. Comparison Shopping Services: mySimon, BizRate. Data mining: Stanford WebBase, IBM WebFountain.

Crawling Issues (1) Load at visited web sites: space out requests to a site; limit number of requests to a site per day; limit depth of crawl.
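The three limits on this slide could be enforced with a small bookkeeping class like the one below. It is a sketch under assumptions: the class name `PolitenessPolicy`, its parameters, and the use of seconds since an arbitrary epoch for `now` are all hypothetical and not taken from the talk.

```python
SECONDS_PER_DAY = 86400

class PolitenessPolicy:
    """Hypothetical per-site limiter: delay, daily quota, and crawl depth."""

    def __init__(self, min_delay, max_per_day, max_depth):
        self.min_delay = min_delay      # seconds between requests to one site
        self.max_per_day = max_per_day  # request quota per site per day
        self.max_depth = max_depth      # deepest link level to follow
        self.last_request = {}          # site -> time of last request
        self.daily_count = {}           # site -> (day index, requests that day)

    def _count_for(self, site, now):
        day = int(now // SECONDS_PER_DAY)
        seen_day, count = self.daily_count.get(site, (day, 0))
        return day, (count if seen_day == day else 0)

    def allowed(self, site, depth, now):
        if depth > self.max_depth:
            return False                # limit depth of crawl
        last = self.last_request.get(site)
        if last is not None and now - last < self.min_delay:
            return False                # space out requests to a site
        _, count = self._count_for(site, now)
        return count < self.max_per_day # limit requests per site per day

    def record(self, site, now):
        day, count = self._count_for(site, now)
        self.daily_count[site] = (day, count + 1)
        self.last_request[site] = now
```

A crawler would call `allowed()` before each fetch and `record()` after it; URLs that are refused can simply be put back on the to-visit list.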

Crawling Issues (2) Load at crawler: parallelize. [Diagram: several crawler processes, each running init, get next url, get page, extract urls, alongside the initial urls, to visit urls, visited urls, and web pages stores]

Crawling Issues (3) Scope of crawl: not enough space for “all” pages; not enough time to visit “all” pages. Solution: visit “important” pages. [Diagram labels: visited pages, Intel]

Crawling Issues (4) Replication Pages mirrored at multiple locations

Crawling Issues (5) Incremental crawling How do we avoid crawling from scratch? How do we keep pages “fresh”?

My Research on Crawlers: Load on sites [PAWS00]; Parallel crawler [WWW01]; Page selection [WWW7]; Replicated page detection [SIGMOD00]; Page freshness [SIGMOD00, VLDB02]; Crawler architecture [VLDB00]

Outline of This Talk How can we keep pages fresh? How does the Web change? What do we mean by “fresh” pages? How should we refresh pages?

Web Evolution Experiment How often does a Web page change? How long does a page stay on the Web? How long does it take for 50% of the Web to change? How do we model Web changes?

Experimental Setup February 17 to June 24, 1999. 270 sites visited (with permission): identified 400 sites with highest “PageRank”, contacted administrators. 720,000 pages collected: 3,000 pages from each site daily; start at root, visit breadth first (get new & old pages). Ran only 9pm - 6am, 10 seconds between site requests.

Average Change Interval [Figure: fraction of pages (y-axis) vs. average change interval (x-axis)]

Change Interval – By Domain [Figure: fraction of pages (y-axis) vs. average change interval (x-axis), by domain]

Modeling Web Evolution Poisson process with rate $\lambda$; $T$ is time to next event; $f_T(t) = \lambda e^{-\lambda t}$ $(t > 0)$
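Under this model the time to the next change is exponentially distributed, so the probability that a page changes within time t is 1 - e^(-λt). The sketch below computes that probability and draws sample change intervals; the function names and the fixed seed are illustrative choices, not part of the talk.

```python
import math
import random

def prob_changed_within(rate, t):
    """P(T <= t) for an exponential waiting time with the given rate."""
    return 1.0 - math.exp(-rate * t)

def sample_change_intervals(rate, n, seed=0):
    """Draw n intervals between successive changes of a Poisson process."""
    rng = random.Random(seed)
    return [rng.expovariate(rate) for _ in range(n)]

# A page that changes every 10 days on average has rate 0.1 changes/day.
rate = 0.1
p_ten_days = prob_changed_within(rate, 10)   # equals 1 - 1/e, about 63%
intervals = sample_change_intervals(rate, 20000)
mean_interval = sum(intervals) / len(intervals)
```

With a rate of 0.1 per day, the sample mean of the intervals should land close to 10 days, which is the kind of per-page histogram the model is compared against.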

Change Interval of Pages for pages that change every 10 days on average [Figure: fraction of changes with given interval (y-axis) vs. interval in days (x-axis), compared with the Poisson model]

Change Metrics  Freshness Freshness of element ei at time t is F ( ei ; t ) = 1 if ei is up-to-date at time t 0 otherwise ei ... web database Freshness of the database S at time t is F( S ; t ) = F( ei ; t ) (Assume “equal importance” of pages)  N 1 i=1

Change Metrics: Age. Age of element $e_i$ at time $t$ is $A(e_i; t) = 0$ if $e_i$ is up-to-date at time $t$, $t - (\text{modification time of } e_i)$ otherwise. Age of the database $S$ at time $t$ is $A(S; t) = \frac{1}{N} \sum_{i=1}^{N} A(e_i; t)$ (assume “equal importance” of pages). [Diagram: elements ei on the web and in the database]
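Both metrics can be computed directly from the definitions. The sketch below assumes each element is described by a single hypothetical value, `stale_since`: `None` if the local copy is up-to-date, otherwise the time of the source modification that made the copy out of date. Reading "modification time" this way is an assumption of the example.

```python
def element_freshness(stale_since, t):
    """F(e_i; t): 1 if the copy is up-to-date at time t, else 0."""
    return 1 if stale_since is None or stale_since > t else 0

def element_age(stale_since, t):
    """A(e_i; t): 0 if up-to-date, else time elapsed since the modification."""
    if element_freshness(stale_since, t) == 1:
        return 0.0
    return t - stale_since

def database_freshness(stale_since_list, t):
    """F(S; t): average freshness over all N elements (equal importance)."""
    return sum(element_freshness(s, t) for s in stale_since_list) / len(stale_since_list)

def database_age(stale_since_list, t):
    """A(S; t): average age over all N elements (equal importance)."""
    return sum(element_age(s, t) for s in stale_since_list) / len(stale_since_list)

# Three elements at time t = 10: one fresh, one stale since t = 3, one since t = 8.
db = [None, 3.0, 8.0]
f_now = database_freshness(db, 10.0)   # 1 of 3 elements is fresh
a_now = database_age(db, 10.0)         # (0 + 7 + 2) / 3
```

Freshness only records whether a copy is stale, while age also records for how long, which is why the two metrics can favor different refresh policies.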

Change Metrics Time averages [Figure: F(ei) (between 0 and 1) and A(ei) plotted against time, with update and refresh events marked]

Refresh Order Fixed order: explicit list of URLs to visit. Random order: start from seed URLs & follow links. Purely random: refresh pages on demand, as requested by user. [Diagram: elements ei in the database and on the web]

Freshness vs. Revisit Frequency $r = \lambda / f$ = average change frequency / average visit frequency
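For intuition about how freshness depends on r, here is a short worked derivation for one case. It assumes the Poisson change model with rate $\lambda$ and a page that is refreshed at a fixed interval $I = 1/f$; the curves on the original slide are not preserved in the transcript, so this is a sketch of that single case rather than a reconstruction of the figure. A copy refreshed at time 0 is still up-to-date at time $t$ with probability $e^{-\lambda t}$, so averaging over one refresh interval gives:

```latex
\bar{F}(e_i)
  = \frac{1}{I} \int_0^{I} e^{-\lambda t} \, dt
  = \frac{1 - e^{-\lambda I}}{\lambda I}
  = \frac{1 - e^{-r}}{r},
  \qquad r = \lambda / f, \quad I = 1 / f .
```

When $r$ is small (visits are much more frequent than changes) the expression approaches 1, and when $r$ is large it decays like $1/r$.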

Age vs. Revisit Frequency [Figure: y-axis = Age / time to refresh all N elements] $r = \lambda / f$ = average change frequency / average visit frequency

Trick Question Two-page database: e1 changes daily, e2 changes once a week. Can visit one page per week. How should we visit pages? e1 e2 e1 e2 e1 e2 e1 e2 ... [uniform]; e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 ... [proportional]; e1 e1 e1 e1 e1 e1 ...; e2 e2 e2 e2 e2 e2 ...; ? [Diagram: e1 and e2 on the web and in the database]

Proportional Often Not Good! Visit fast changing e1 → get 1/2 day of freshness. Visit slow changing e2 → get 1/2 week of freshness. Visiting e2 is a better deal!
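The comparison can be checked numerically for the two-page example. The sketch assumes Poisson changes and evenly spaced visits, for which the time-averaged freshness of a page is (1 - e^(-r)) / r with r = change rate / visit rate; the helper names and the three candidate policies are illustrative. Rates are per week: e1 changes 7 times, e2 once, and the crawler has one visit in total.

```python
import math

def avg_freshness(change_rate, visit_rate):
    """Time-averaged freshness of one page (Poisson changes, regular visits)."""
    if visit_rate <= 0:
        return 0.0               # never refreshed: stale in the long run
    r = change_rate / visit_rate
    return (1.0 - math.exp(-r)) / r

def database_freshness(change_rates, visit_rates):
    """Average freshness over all pages, equal importance."""
    scores = [avg_freshness(c, v) for c, v in zip(change_rates, visit_rates)]
    return sum(scores) / len(scores)

change_rates = [7.0, 1.0]        # e1: daily, e2: weekly (changes per week)
policies = {
    "uniform":      [0.5, 0.5],  # split the one weekly visit evenly
    "proportional": [7 / 8, 1 / 8],
    "only_e2":      [0.0, 1.0],  # spend every visit on the slow page
}
results = {name: database_freshness(change_rates, rates)
           for name, rates in policies.items()}
```

Under these assumptions the scores come out at roughly 0.25 for uniform, 0.12 for proportional, and 0.32 for visiting only e2, which agrees with the slide's point that e2 is the better deal.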

Optimal Refresh Frequency Problem: given $\lambda_1, \lambda_2, \ldots, \lambda_N$ and $f$, find $f_1, f_2, \ldots, f_N$ that maximize the time-averaged freshness of the database $S$
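For two pages the problem is small enough to solve by brute force. The sketch below is not the method used in the talk: it assumes the objective is average freshness under the Poisson model with regular visits, treats the budget as a total visit rate shared by the two pages, and simply scans a grid of splits. The function `best_split` and its `steps` parameter are hypothetical.

```python
import math

def avg_freshness(change_rate, visit_rate):
    """Time-averaged freshness of one page (Poisson changes, regular visits)."""
    if visit_rate <= 0:
        return 0.0
    r = change_rate / visit_rate
    return (1.0 - math.exp(-r)) / r

def best_split(rate1, rate2, total, steps=1000):
    """Scan visit-rate splits (f1, total - f1) and keep the freshest one."""
    best = None
    for i in range(steps + 1):
        f1 = total * i / steps
        f2 = total - f1
        score = (avg_freshness(rate1, f1) + avg_freshness(rate2, f2)) / 2
        if best is None or score > best[2]:
            best = (f1, f2, score)
    return best

# e1 changes 7 times a week, e2 once a week, one visit per week in total.
f1, f2, score = best_split(7.0, 1.0, 1.0)
# Two pages with identical change rates, same budget.
g1, g2, _ = best_split(1.0, 1.0, 1.0)
```

With these numbers the scan puts the whole budget on e2, and for two identical pages it returns an even split. A grid scan does not scale to N pages; it is only meant to make the shape of the optimum visible.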

Optimal Refresh Frequency Shape of curve is the same in all cases Holds for any change frequency distribution

Optimal Refresh for Age Shape of curve is the same in all cases Holds for any change frequency distribution

Comparing Policies Based on statistics from the experiment and a revisit frequency of once every month

Not Every Page is Equal! Some pages are “more important”. e1: accessed by users 10 times/day. e2: accessed by users 20 times/day. $F(S) = 1 \cdot F(e_1) + 2 \cdot F(e_2)$. In general, each page's freshness is weighted by its importance.
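A weighted version of the freshness metric might look like the sketch below. Dividing by the sum of the weights, so that the result stays between 0 and 1, is an assumption of this example; the transcript does not preserve the slide's general formula.

```python
def weighted_freshness(weights, freshness_values):
    """Importance-weighted freshness, normalized by the total weight (assumed)."""
    total = sum(weights)
    return sum(w * f for w, f in zip(weights, freshness_values)) / total

# e2 is accessed twice as often as e1, so it gets twice the weight.
weights = [1, 2]
only_e1_fresh = weighted_freshness(weights, [1, 0])
only_e2_fresh = weighted_freshness(weights, [0, 1])
```

With these weights, keeping only e2 fresh scores twice as high as keeping only e1 fresh.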

Weighted Freshness [Figure: $f$ (y-axis) vs. $\lambda$ (x-axis), with curves for $w = 2$ and $w = 1$]

Change Frequency Estimation How to estimate change frequency? Naïve Estimator: X/T, where X: number of detected changes, T: monitoring period. 2 changes in 10 days: 0.2 times/day. Incomplete change history. [Figure: timeline in 1-day steps marking page visited, page changed, change detected]

Improved Estimator Based on the Poisson model. X: number of detected changes, N: number of accesses, f: access frequency. 3 changes in 10 days: 0.36 times/day → Accounts for “missed” changes
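The estimator's formula is not preserved in the transcript. One form that follows from the Poisson model and reproduces the slide's example is sketched here, as an assumption: with access interval 1/f, the probability of detecting no change at an access is e^(-λ/f); estimating that probability by (N - X)/N and solving for λ gives -f ln(1 - X/N). The example assumes one access per day for 10 days.

```python
import math

def naive_estimate(detected_changes, monitoring_period):
    """Naive estimator X / T."""
    return detected_changes / monitoring_period

def poisson_estimate(detected_changes, accesses, access_frequency):
    """Assumed Poisson-based estimator: -f * ln(1 - X / N)."""
    if detected_changes >= accesses:
        raise ValueError("unbounded estimate: every access detected a change")
    return -access_frequency * math.log(1.0 - detected_changes / accesses)

# 3 detected changes in 10 daily accesses.
naive = naive_estimate(3, 10)
improved = poisson_estimate(3, 10, 1.0)
```

For 3 detected changes in 10 daily accesses the naive value is 0.3 times/day, while this form gives about 0.36 times/day, matching the figure quoted on the slide. The gap exists because several changes between two accesses are detected as one.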

Improvement Significant? Application to a Web crawler: visit pages once every week for 5 weeks; estimate change frequency; adjust revisit frequency based on the estimate. Uniform: do not adjust. Naïve: based on the naïve estimator. Ours: based on our improved estimator.

Improvement from Our Estimator (9,200,000 visits in total)
Policy    Detected changes    Ratio to uniform
Uniform   2,147,589           100%
Naïve     4,145,582           193%
Ours      4,892,116           228%

Other Estimators Irregular access interval Last-modified date Categorization

Summary Web evolution experiment Change metric Refresh policy Frequency estimator

The End Thank you for your attention For more information visit http://www.cs.ucla.edu/~cho/