Download presentation
Presentation is loading. Please wait.
Published byLionel Singleton Modified over 6 years ago
1
How to Crawl the Web Peking University 12/24/2003 Junghoo “John” Cho
UCLA
2
What is a Crawler? init get next url get page web extract urls
initial urls init to visit urls get next url get page visited urls web extract urls web pages
3
Applications Internet Search Engines Comparison Shopping Services
Google, AltaVista Comparison Shopping Services My Simon, BizRate Data mining Stanford Web Base, IBM Web Fountain
4
Crawling Issues (1) Load at visited web sites
Space out requests to a site Limit number of requests to a site per day Limit depth of crawl
5
Crawling Issues (2) ? Load at crawler Parallelize init init
initial urls init to visit urls init get next url get next url get page get page extract urls extract urls visited urls web pages
6
Crawling Issues (3) Scope of crawl Not enough space for “all” pages
Not enough time to visit “all” pages Solution: Visit “important” pages visited pages Intel
7
Crawling Issues (4) Replication Pages mirrored at multiple locations
8
Crawling Issues (5) Incremental crawling
How do we avoid crawling from scratch? How do we keep pages “fresh”?
9
My Research On Crawler Load on sites [PAWS00] Parallel crawler [WWW01]
Page selection [WWW7] Replicated page detection [SIGMOD00] Page freshness [SIGMOD00, VLDB02] Crawler architecture [VLDB00]
10
Outline of This Talk How can we maintain pages fresh?
How does the Web change? What do we mean by “fresh” pages? How should we refresh pages?
11
Web Evolution Experiment
How often does a Web page change? How long does a page stay on the Web? How long does it take for 50% of the Web to change? How do we model Web changes?
12
Experimental Setup February 17 to June 24, 1999
270 sites visited (with permission) identified 400 sites with highest “PageRank” contacted administrators 720,000 pages collected 3,000 pages from each site daily start at root, visit breadth first (get new & old pages) ran only 9pm - 6am, 10 seconds between site requests
13
Average Change Interval
fraction of pages average change interval
14
Change Interval – By Domain
fraction of pages average change interval
15
Modeling Web Evolution
Poisson process with rate T is time to next event fT (t) = e- t (t > 0)
16
Change Interval of Pages
for pages that change every 10 days on average fraction of changes with given interval Poisson model interval in days
17
Change Metrics Freshness
Freshness of element ei at time t is F ( ei ; t ) = if ei is up-to-date at time t otherwise ei ... web database Freshness of the database S at time t is F( S ; t ) = F( ei ; t ) (Assume “equal importance” of pages) N 1 i=1
18
Change Metrics Age Age of element ei at time t is A( ei ; t ) = if ei is up-to-date at time t t - (modification ei time) otherwise Age of the database S at time t is A( S ; t ) = A( ei ; t ) (Assume “equal importance” of pages) N 1 i=1 ei ... web database
19
Change Metrics F(ei) Time averages: 1 time A(ei) time update refresh
20
Refresh Order Fixed order Random order Purely random
Explicit list of URLs to visit Random order Start from seed URLs & follow links Purely random Refresh pages on demand, as requested by user database web ei ei ... ...
21
Freshness vs. Revisit Frequency
r = / f = average change frequency / average visit frequency
22
Age vs. Revisit Frequency
= Age / time to refresh all N elements r = / f = average change frequency / average visit frequency
23
Trick Question Two page database e1 changes daily
e2 changes once a week Can visit one page per week How should we visit pages? e1 e2 e1 e2 e1 e2 e1 e2... [uniform] e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 … [proportional] e1 e1 e1 e1 e1 e1 ... e2 e2 e2 e2 e2 e2 ... ? e1 e1 e2 e2 web database
24
Proportional Often Not Good!
Visit fast changing e1 get 1/2 day of freshness Visit slow changing e2 get 1/2 week of freshness Visiting e2 is a better deal!
25
Optimal Refresh Frequency
Problem Given 1, 1, .., N and f , find f1, f2,.., fN that maximize
26
Optimal Refresh Frequency
Shape of curve is the same in all cases Holds for any change frequency distribution
27
Optimal Refresh for Age
Shape of curve is the same in all cases Holds for any change frequency distribution
28
Comparing Policies Based on Statistics from experiment
and revisit frequency of every month
29
Not Every Page is Equal! F (S ) = 1 F (e1) + 2 F (e2)
Some pages are “more important” e1 Accessed by users 10 times/day e2 Accessed by users 20 times/day F (S ) = 1 F (e1) + 2 F (e2) In general,
30
Weighted Freshness f w = 2 w = 1 l
31
Change Frequency Estimation
How to estimate change frequency? Naïve Estimator: X/T X: number of detected changes T: monitoring period 2 changes in 10 days: 0.2 times/day Incomplete change history 1 day Page visited Page changed Change detected
32
Improved Estimator Based on the Poisson model
X: number of detected changes N: number of accesses f : access frequency 3 changes in 10 days: times/day Accounts for “missed” changes
33
Improvement Significant?
Application to a Web crawler Visit pages once every week for 5 weeks Estimate change frequency Adjust revisit frequency based on the estimate Uniform: do not adjust Naïve: based on the naïve estimator Ours: based on our improved estimator
34
Improvement from Our Estimator
Detected changes Ratio to uniform Uniform 2,147,589 100% Naïve 4,145,582 193% Ours 4,892,116 228% (9,200,000 visits in total)
35
Other Estimators Irregular access interval Last-modified date
Categorization
36
Summary Web evolution experiment Change metric Refresh policy
Frequency estimator
37
The End Thank you for your attention For more information visit
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.