How to Crawl the Web
Junghoo “John” Cho, UCLA
Peking University, 12/24/2003
What is a Crawler?
[Diagram: the basic crawl loop. Initial URLs seed a “to visit” queue; the crawler repeatedly gets the next URL, fetches the page from the Web, extracts URLs from it, and records the visited URLs and downloaded web pages.]
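Below is a minimal sketch of the loop in the diagram, in Python. The queue handling, the regex-based link extraction, and the max_pages cap are illustrative assumptions, not the crawler used in the talk.

```python
# Minimal sketch of the crawl loop in the diagram, standard library only.
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def crawl(initial_urls, max_pages=100):
    to_visit = deque(initial_urls)        # "to visit urls"
    visited = set()                       # "visited urls"
    pages = {}                            # "web pages"
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()          # get next url
        if url in visited:
            continue
        visited.add(url)
        try:                              # get page
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue                      # skip unreachable or malformed URLs
        pages[url] = html
        # extract urls (a crude href regex; a real crawler parses the HTML)
        for link in re.findall(r'href="([^"]+)"', html):
            to_visit.append(urljoin(url, link))
    return pages
```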
Applications
Internet search engines: Google, AltaVista
Comparison shopping services: My Simon, BizRate
Data mining: Stanford WebBase, IBM Web Fountain
Crawling Issues (1): Load at visited web sites
Space out requests to a site
Limit the number of requests to a site per day
Limit the depth of the crawl
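A small sketch of how these per-site limits might be enforced. The 10-second spacing and the 3,000-requests-per-day cap echo the experimental setup described later in the talk; the depth limit, the function names, and the missing day rollover are illustrative assumptions.

```python
# Illustrative per-site politeness bookkeeping for the rules above.
import time
from collections import defaultdict

MIN_DELAY_SECONDS = 10        # space out requests to a site
MAX_REQUESTS_PER_DAY = 3000   # limit number of requests to a site per day
MAX_DEPTH = 5                 # limit depth of crawl (assumed value)

last_request = defaultdict(float)   # site -> time of the last request
requests_today = defaultdict(int)   # site -> requests issued so far today

def may_fetch(site, depth):
    """Return True (waiting first if needed) when fetching from `site` is allowed."""
    if depth > MAX_DEPTH or requests_today[site] >= MAX_REQUESTS_PER_DAY:
        return False
    wait = MIN_DELAY_SECONDS - (time.time() - last_request[site])
    if wait > 0:
        time.sleep(wait)                # space out consecutive requests to the site
    last_request[site] = time.time()
    requests_today[site] += 1           # day rollover omitted for brevity
    return True
```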
Crawling Issues (2): Load at the crawler
Parallelize
[Diagram: multiple crawl processes run the same loop in parallel, sharing the initial URLs, the “to visit” queue, the visited URLs, and the downloaded web pages.]
Crawling Issues (3): Scope of crawl
Not enough space for “all” pages
Not enough time to visit “all” pages
Solution: visit “important” pages
[Diagram: the visited pages form a small subset of the whole Web]
Crawling Issues (4) Replication Pages mirrored at multiple locations
Crawling Issues (5) Incremental crawling How do we avoid crawling from scratch? How do we keep pages “fresh”?
My Research on Crawlers
Load on sites [PAWS00]
Parallel crawler [WWW01]
Page selection [WWW7]
Replicated page detection [SIGMOD00]
Page freshness [SIGMOD00, VLDB02]
Crawler architecture [VLDB00]
Outline of This Talk
How can we keep pages fresh?
How does the Web change?
What do we mean by “fresh” pages?
How should we refresh pages?
Web Evolution Experiment
How often does a Web page change?
How long does a page stay on the Web?
How long does it take for 50% of the Web to change?
How do we model Web changes?
Experimental Setup
February 17 to June 24, 1999
270 sites visited (with permission): identified the 400 sites with the highest “PageRank” and contacted their administrators
720,000 pages collected: 3,000 pages from each site daily, starting at the root and visiting breadth-first (getting new and old pages)
Ran only 9pm - 6am, with 10 seconds between requests to a site
Average Change Interval
[Chart: fraction of pages vs. average change interval]
Change Interval – By Domain
[Chart: fraction of pages vs. average change interval, broken down by domain]
Modeling Web Evolution
Poisson process with rate $\lambda$
$T$ is the time to the next event (change)
$f_T(t) = \lambda e^{-\lambda t}$  for $t > 0$
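As a quick illustration of the model, the sketch below samples inter-change times for a page with rate $\lambda = 0.1$ changes/day, the 10-day average used on the next slide; the helper name is mine.

```python
# Under the Poisson change model with rate lam, the time between changes
# of a page is exponential with mean 1/lam.
import random

lam = 0.1  # example rate: the page changes 0.1 times/day on average

def next_change_interval(lam):
    """Sample T, the time until the next change, with density lam * exp(-lam * t)."""
    return random.expovariate(lam)

intervals = [next_change_interval(lam) for _ in range(100_000)]
print(sum(intervals) / len(intervals))   # close to 1/lam = 10 days
```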
Change Interval of Pages
[Chart: for pages that change every 10 days on average, fraction of changes with a given interval vs. interval in days, compared against the Poisson model]
Change Metrics: Freshness
Freshness of element ei at time t:
$F(e_i; t) = 1$ if $e_i$ is up-to-date at time $t$, and $0$ otherwise
Freshness of the database S at time t:
$F(S; t) = \frac{1}{N} \sum_{i=1}^{N} F(e_i; t)$
(Assume “equal importance” of pages)
[Diagram: copies $e_1, \dots, e_N$ in the local database mirror pages on the Web]
Change Metrics: Age
Age of element ei at time t:
$A(e_i; t) = 0$ if $e_i$ is up-to-date at time $t$, and $t - (\text{modification time of } e_i)$ otherwise
Age of the database S at time t:
$A(S; t) = \frac{1}{N} \sum_{i=1}^{N} A(e_i; t)$
(Assume “equal importance” of pages)
Change Metrics: Time Averages
$\bar{F}(e_i) = \lim_{t \to \infty} \frac{1}{t} \int_0^t F(e_i; t)\, dt$, and $\bar{A}(e_i)$ is defined analogously
[Chart: $F(e_i)$ and $A(e_i)$ plotted over time; freshness drops to 0 and age starts growing when the page is updated on the Web, and both reset when the local copy is refreshed]
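A small sketch of the two metrics for a snapshot of the database; the timestamp representation and helper names are assumptions made for illustration.

```python
# Freshness and age of a two-page database at one instant. Each element is
# (modified, refreshed): when the live page last changed and when our copy
# was last refreshed; both timestamps are assumed inputs.

def freshness(modified, refreshed):
    """F(e_i; t) = 1 if our copy is up to date, else 0."""
    return 1 if refreshed >= modified else 0

def age(modified, refreshed, now):
    """A(e_i; t) = 0 if up to date, else t minus the modification time."""
    return 0 if refreshed >= modified else now - modified

def database_metrics(elements, now):
    """Average F and A over all N elements (equal importance)."""
    n = len(elements)
    f = sum(freshness(m, r) for m, r in elements) / n
    a = sum(age(m, r, now) for m, r in elements) / n
    return f, a

# Example: the first page changed after our last refresh, so it is stale.
elements = [(5.0, 3.0), (2.0, 4.0)]      # (modified, refreshed), in days
print(database_metrics(elements, now=6.0))   # (0.5, 0.5)
```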
Refresh Order
Fixed order: explicit list of URLs to visit
Random order: start from seed URLs and follow links
Purely random: refresh pages on demand, as requested by users
Freshness vs. Revisit Frequency
[Chart: average freshness as a function of $r = \lambda / f$ = average change frequency / average visit frequency]
Age vs. Revisit Frequency
[Chart: average age, normalized by the time to refresh all N elements, as a function of $r = \lambda / f$ = average change frequency / average visit frequency]
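For reference, the freshness curve above can be reproduced from the closed form for time-average freshness under the Poisson model with periodic revisits, $\bar{F}(r) = (1 - e^{-r})/r$. This formula comes from the related papers rather than the slide itself, so treat the sketch as an assumption-laden illustration.

```python
# Sketch of the curve on the "Freshness vs. Revisit Frequency" chart,
# assuming F(r) = (1 - exp(-r)) / r with r = lambda / f (from the related
# papers, not shown on the slide).
import math

def avg_freshness(r):
    """Expected freshness of a page with change/visit frequency ratio r."""
    if r == 0:
        return 1.0          # a page that never changes is always fresh
    return (1.0 - math.exp(-r)) / r

for r in (0.1, 0.5, 1.0, 2.0, 5.0):
    print(f"r = {r:>4}: average freshness = {avg_freshness(r):.2f}")
```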
Trick Question
Two-page database: e1 changes daily, e2 changes once a week
We can visit one page per week. How should we visit pages?
e1 e2 e1 e2 e1 e2 e1 e2 ...  [uniform]
e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 ...  [proportional]
e1 e1 e1 e1 e1 e1 ...
e2 e2 e2 e2 e2 e2 ...
?
Proportional Often Not Good!
Visit fast-changing e1: get 1/2 day of freshness
Visit slow-changing e2: get 1/2 week of freshness
Visiting e2 is a better deal!
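A rough day-granularity simulation of this example, assuming Poisson changes at the stated rates; the policies come from the slide, but the simulation itself is only an illustration. It typically reports visiting only e2 ahead of uniform, with proportional last.

```python
# e1 changes about once a day, e2 about once a week, and we may refresh
# one page per week. Compare the refresh policies from the slide.
import math
import random

DAYS = 7 * 20_000                                        # long horizon, in days
DAILY_CHANGE_PROB = {"e1": 1.0 - math.exp(-1.0),         # ~once a day
                     "e2": 1.0 - math.exp(-1.0 / 7.0)}   # ~once a week

def simulate(schedule):
    """schedule: repeating list of the page to refresh each week.
    Returns the time-average freshness of the two-page database,
    measured once per day (coarse but adequate here)."""
    fresh = {"e1": True, "e2": True}
    total = 0.0
    for day in range(DAYS):
        for page, prob in DAILY_CHANGE_PROB.items():
            if random.random() < prob:          # the live page changed today
                fresh[page] = False
        if day % 7 == 0:                        # weekly budget: refresh one page
            fresh[schedule[(day // 7) % len(schedule)]] = True
        total += (fresh["e1"] + fresh["e2"]) / 2.0
    return total / DAYS

print("uniform     :", round(simulate(["e1", "e2"]), 3))
print("proportional:", round(simulate(["e1"] * 7 + ["e2"]), 3))
print("only e2     :", round(simulate(["e2"]), 3))   # best of the three policies
```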
Optimal Refresh Frequency
Problem: given $\lambda_1, \lambda_2, \dots, \lambda_N$ and $f$, find $f_1, f_2, \dots, f_N$ that maximize
$\bar{F}(S) = \frac{1}{N} \sum_{i=1}^{N} \bar{F}(e_i)$  subject to  $\frac{1}{N} \sum_{i=1}^{N} f_i = f$
Optimal Refresh Frequency Shape of curve is the same in all cases Holds for any change frequency distribution
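As an illustration of this optimization problem, the sketch below brute-forces the best split of the weekly visit budget for the two-page example, reusing the closed-form freshness assumed earlier; this is not the analytic solution behind the curve on this slide.

```python
# Brute-force search for the optimal split of the revisit budget between
# e1 and e2, using the assumed closed form F = (1 - exp(-r)) / r.
import math

def avg_freshness(lam, f):
    """Time-average freshness of a page changing at rate lam, visited at rate f."""
    if f <= 0.0:
        return 0.0                      # a page that is never refreshed goes stale
    r = lam / f
    return (1.0 - math.exp(-r)) / r

lam1, lam2 = 1.0, 1.0 / 7.0             # e1 changes daily, e2 weekly
budget = 1.0 / 7.0                      # one visit per week in total, in per-day units

best_f1, best_val = 0.0, -1.0
for k in range(1001):                   # try all splits of the budget
    f1 = budget * k / 1000.0
    val = (avg_freshness(lam1, f1) + avg_freshness(lam2, budget - f1)) / 2.0
    if val > best_val:
        best_f1, best_val = f1, val

print(f"best f1 = {best_f1:.4f} visits/day, average freshness = {best_val:.3f}")
# In this extreme example the whole budget goes to the slow-changing page e2.
```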
Optimal Refresh for Age Shape of curve is the same in all cases Holds for any change frequency distribution
Comparing Policies
Based on statistics from the experiment and a revisit frequency of once every month
Not Every Page is Equal!
Some pages are “more important”:
e1 accessed by users 10 times/day, e2 accessed by users 20 times/day
Weight freshness by importance: $F(S) = \frac{1 \cdot F(e_1) + 2 \cdot F(e_2)}{1 + 2}$
In general, $F(S; t) = \sum_{i=1}^{N} w_i\, F(e_i; t)$ for importance weights $w_i$
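A small sketch of the weighted metric, with weights proportional to the access counts in the example above; the helper name is illustrative.

```python
# Weighted freshness: each page weighted by how often users access it,
# as in the e1/e2 example above.

def weighted_freshness(freshness_values, accesses_per_day):
    """Average freshness with each page weighted by its access rate."""
    total = sum(accesses_per_day)
    return sum(a / total * f for f, a in zip(freshness_values, accesses_per_day))

# e1 accessed 10 times/day, e2 accessed 20 times/day
print(weighted_freshness([1, 0], [10, 20]))   # only e1 fresh -> 1/3
print(weighted_freshness([0, 1], [10, 20]))   # only e2 fresh -> 2/3
```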
Weighted Freshness
[Chart: revisit frequency $f$ vs. change frequency $\lambda$, for weights $w = 1$ and $w = 2$]
Change Frequency Estimation
How do we estimate the change frequency of a page?
Naïve estimator: $X / T$, where X is the number of detected changes and T is the monitoring period
Example: 2 detected changes in 10 days gives 0.2 times/day
Problem: the change history is incomplete
[Timeline: the page is visited once a day; a change is detected only when the page differs from the previous visit, so changes between visits can be missed]
Improved Estimator
Based on the Poisson model: $\hat{\lambda} = -f \log\left(1 - \frac{X}{N}\right)$, where X is the number of detected changes, N is the number of accesses, and f is the access frequency
Example: 3 detected changes in 10 daily accesses gives about 0.36 times/day
Accounts for “missed” changes
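Both estimators in code. The improved formula is written to match the slide's worked example (3 detected changes in 10 daily accesses giving about 0.36 changes/day); see the cited papers for the actual derivation.

```python
# The naive and improved change-frequency estimators from the slides.
import math

def naive_estimate(detected_changes, monitoring_days):
    """X / T: detected changes per day of monitoring."""
    return detected_changes / monitoring_days

def improved_estimate(detected_changes, accesses, access_freq):
    """-f * log(1 - X/N): corrects for changes missed between accesses."""
    return -access_freq * math.log(1.0 - detected_changes / accesses)

print(naive_estimate(2, 10))           # 0.2  changes/day (slide example)
print(improved_estimate(3, 10, 1.0))   # ~0.36 changes/day (slide example)
```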
Is the Improvement Significant?
Application to a Web crawler:
Visit pages once every week for 5 weeks and estimate each page's change frequency
Adjust the revisit frequency based on the estimate
Uniform: do not adjust
Naïve: adjust based on the naïve estimator
Ours: adjust based on our improved estimator
Improvement from Our Estimator
Policy    Detected changes   Ratio to uniform
Uniform   2,147,589          100%
Naïve     4,145,582          193%
Ours      4,892,116          228%
(9,200,000 visits in total)
Other Estimators Irregular access interval Last-modified date Categorization
Summary Web evolution experiment Change metric Refresh policy Frequency estimator
The End Thank you for your attention For more information visit http://www.cs.ucla.edu/~cho/