How to Crawl the Web Peking University 12/24/2003 Junghoo “John” Cho


1 How to Crawl the Web
Peking University, 12/24/2003
Junghoo “John” Cho, UCLA

2 What is a Crawler?
[Diagram: init loads the initial urls into the to visit urls; the crawler then loops over get next url, get page (from the web), and extract urls, recording visited urls and storing web pages]
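The loop in the diagram can be sketched in a few lines. This is a minimal illustration, not the crawler described in the talk: the in-memory `TOY_WEB` dictionary, `get_page`, and `crawl` are hypothetical names, and a real crawler would fetch pages over HTTP and parse links out of HTML.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of URLs linked from that page.
# A real crawler would fetch pages over HTTP and parse the links instead.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}


def get_page(url):
    """Stand-in for an HTTP fetch: return the page's links, or None if missing."""
    return TOY_WEB.get(url)


def crawl(initial_urls, max_pages=100):
    """Breadth-first crawl starting from the initial URLs."""
    to_visit = deque(initial_urls)  # "to visit urls"
    visited = set()                 # "visited urls"
    pages = {}                      # "web pages"
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()    # "get next url"
        if url in visited:
            continue
        visited.add(url)
        links = get_page(url)       # "get page"
        if links is None:
            continue
        pages[url] = links
        for link in links:          # "extract urls"
            if link not in visited:
                to_visit.append(link)
    return pages
```

Calling `crawl(["a"])` on the toy web visits all four pages in breadth-first order.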

3 Applications
Internet Search Engines
  Google, AltaVista
Comparison Shopping Services
  My Simon, BizRate
Data mining
  Stanford Web Base, IBM Web Fountain

4 Crawling Issues (1)
Load at visited web sites
  Space out requests to a site
  Limit number of requests to a site per day
  Limit depth of crawl
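The three limits on this slide could be enforced by a small bookkeeping object, sketched below. The class name `PolitenessPolicy`, its method names, and the `max_depth` default are hypothetical; the 10-second delay and 3,000-per-day defaults are borrowed from the experimental setup described later in the talk, purely as illustrative values.

```python
class PolitenessPolicy:
    """Tracks per-site request history and decides whether a fetch is allowed."""

    SECONDS_PER_DAY = 86400

    def __init__(self, min_delay_s=10.0, max_per_day=3000, max_depth=5):
        self.min_delay_s = min_delay_s  # space out requests to a site
        self.max_per_day = max_per_day  # limit number of requests per day
        self.max_depth = max_depth      # limit depth of crawl (assumed value)
        self.last_request = {}          # site -> time of last request (seconds)
        self.daily_count = {}           # site -> (day index, requests that day)

    def _count_for(self, site, now_s):
        day = int(now_s // self.SECONDS_PER_DAY)
        counted_day, count = self.daily_count.get(site, (day, 0))
        # A new day resets the counter.
        return day, (count if counted_day == day else 0)

    def allowed(self, site, depth, now_s):
        if depth > self.max_depth:
            return False
        _, count = self._count_for(site, now_s)
        if count >= self.max_per_day:
            return False
        last = self.last_request.get(site)
        return last is None or now_s - last >= self.min_delay_s

    def record(self, site, now_s):
        """Call after each request actually sent to the site."""
        day, count = self._count_for(site, now_s)
        self.daily_count[site] = (day, count + 1)
        self.last_request[site] = now_s
```

Passing the current time in explicitly (rather than reading a clock inside the class) keeps the sketch deterministic and easy to test.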

5 Crawling Issues (2)
Load at crawler
  Parallelize
[Diagram: two copies of the crawler loop (init, get next url, get page, extract urls) drawn alongside the initial urls, to visit urls, visited urls, and web pages]

6 Crawling Issues (3)
Scope of crawl
  Not enough space for “all” pages
  Not enough time to visit “all” pages
  Solution: Visit “important” pages
[Diagram labels: visited pages, Intel]

7 Crawling Issues (4)
Replication
  Pages mirrored at multiple locations

8 Crawling Issues (5)
Incremental crawling
  How do we avoid crawling from scratch?
  How do we keep pages “fresh”?

9 My Research On Crawler
Load on sites [PAWS00]
Parallel crawler [WWW01]
Page selection [WWW7]
Replicated page detection [SIGMOD00]
Page freshness [SIGMOD00, VLDB02]
Crawler architecture [VLDB00]

10 Outline of This Talk
How can we keep pages fresh?
  How does the Web change?
  What do we mean by “fresh” pages?
  How should we refresh pages?

11 Web Evolution Experiment
How often does a Web page change?
How long does a page stay on the Web?
How long does it take for 50% of the Web to change?
How do we model Web changes?

12 Experimental Setup
February 17 to June 24, 1999
270 sites visited (with permission)
  identified 400 sites with highest “PageRank”
  contacted administrators
720,000 pages collected
  3,000 pages from each site daily
  start at root, visit breadth first (get new & old pages)
  ran only 9pm - 6am, 10 seconds between site requests

13 Average Change Interval
[Figure: fraction of pages by average change interval]

14 Change Interval – By Domain
[Figure: fraction of pages by average change interval, broken down by domain]

15 Modeling Web Evolution
Poisson process with rate $\lambda$
$T$ is time to next event
$f_T(t) = \lambda e^{-\lambda t}$ $(t > 0)$
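A quick way to get a feel for the model is to sample intervals from it and check that their mean is close to $1/\lambda$. The sketch below uses Python's standard `random.expovariate`; the function names and the seed are illustrative choices, not part of the talk.

```python
import math
import random


def interval_density(rate, t):
    """Density of the time to the next change: rate * exp(-rate * t) for t > 0."""
    return rate * math.exp(-rate * t) if t > 0 else 0.0


def sample_change_intervals(rate, n, seed=0):
    """Draw n intervals between changes for a page with the given change rate."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [rng.expovariate(rate) for _ in range(n)]


# A page that changes every 10 days on average has rate 0.1 changes/day.
intervals = sample_change_intervals(rate=0.1, n=100_000)
mean_interval = sum(intervals) / len(intervals)  # close to 1 / 0.1 = 10 days
```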

16 Change Interval of Pages
[Figure: for pages that change every 10 days on average, the fraction of changes with a given interval, plotted against the interval in days and compared with the Poisson model]

17 Change Metrics: Freshness
Freshness of element $e_i$ at time $t$ is
  $F(e_i; t) = 1$ if $e_i$ is up-to-date at time $t$
  $F(e_i; t) = 0$ otherwise
Freshness of the database $S$ at time $t$ is
  $F(S; t) = \frac{1}{N} \sum_{i=1}^{N} F(e_i; t)$
(Assume “equal importance” of pages)
[Diagram: elements $e_i$, ... on the web and their copies in the database]

18 Change Metrics: Age
Age of element $e_i$ at time $t$ is
  $A(e_i; t) = 0$ if $e_i$ is up-to-date at time $t$
  $A(e_i; t) = t - (\text{modification time of } e_i)$ otherwise
Age of the database $S$ at time $t$ is
  $A(S; t) = \frac{1}{N} \sum_{i=1}^{N} A(e_i; t)$
(Assume “equal importance” of pages)
[Diagram: elements $e_i$, ... on the web and their copies in the database]
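Both metrics are simple averages, which the following sketch computes for a toy database. It assumes each element is described by the time of its first change since the local copy was last refreshed (`None` if it has not changed), and it takes "modification time" to mean that first change; the function names are hypothetical.

```python
def element_freshness(first_change, t):
    """1 if the local copy is still up-to-date at time t, else 0."""
    return 1 if first_change is None or first_change > t else 0


def element_age(first_change, t):
    """0 if up-to-date at time t, else time elapsed since the page changed."""
    if first_change is None or first_change > t:
        return 0.0
    return t - first_change


def database_freshness(first_changes, t):
    """Average freshness over all N elements (equal importance)."""
    return sum(element_freshness(c, t) for c in first_changes) / len(first_changes)


def database_age(first_changes, t):
    """Average age over all N elements (equal importance)."""
    return sum(element_age(c, t) for c in first_changes) / len(first_changes)


# Four pages observed at t = 10: one never changed, two changed at t = 3 and
# t = 8, and one will change at t = 12.
changes = [None, 3.0, 8.0, 12.0]
print(database_freshness(changes, 10.0))  # 0.5
print(database_age(changes, 10.0))        # 2.25
```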

19 Change Metrics
Time averages of $F(e_i)$ and $A(e_i)$
[Figure: $F(e_i)$ (between 0 and 1) and $A(e_i)$ plotted against time, with update and refresh events marked]

20 Refresh Order
Fixed order
  Explicit list of URLs to visit
Random order
  Start from seed URLs & follow links
Purely random
  Refresh pages on demand, as requested by user
[Diagram: elements $e_i$, ... on the web and their copies in the database]

21 Freshness vs. Revisit Frequency
$r = \lambda / f$ = average change frequency / average visit frequency

22 Age vs. Revisit Frequency
[Figure: y-axis is Age / time to refresh all N elements]
$r = \lambda / f$ = average change frequency / average visit frequency
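The plots themselves are not preserved, but for one simple case the dependence on $r$ can be worked out directly. Assuming a page changes as a Poisson process with rate $\lambda$ and is revisited at a fixed interval $1/f$, averaging over one interval gives expected freshness $(1 - e^{-r})/r$ and expected age divided by the revisit interval $1/2 - 1/r + (1 - e^{-r})/r^2$. The sketch below evaluates both; it covers only this fixed-interval case and is not necessarily what the slides plotted.

```python
import math


def fixed_order_freshness(r):
    """Time-averaged freshness for fixed-interval revisits, r = lambda / f."""
    if r == 0:
        return 1.0  # a page that never changes is always fresh
    return -math.expm1(-r) / r  # (1 - e^{-r}) / r


def fixed_order_age_ratio(r):
    """Time-averaged age divided by the revisit interval, r = lambda / f."""
    if r == 0:
        return 0.0
    return 0.5 - 1.0 / r + (-math.expm1(-r)) / (r * r)


for r in (0.5, 1.0, 2.0, 5.0):
    print(r, fixed_order_freshness(r), fixed_order_age_ratio(r))
```

As $r$ grows (pages change faster relative to visits), freshness falls toward 0 and the age ratio rises toward 1/2.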

23 Trick Question
Two page database
  e1 changes daily
  e2 changes once a week
Can visit one page per week
How should we visit pages?
  e1 e2 e1 e2 e1 e2 e1 e2 ... [uniform]
  e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 ... [proportional]
  e1 e1 e1 e1 e1 e1 ...
  e2 e2 e2 e2 e2 e2 ...
  ?
[Diagram: e1 and e2 on the web and their copies in the database]

24 Proportional Often Not Good!
Visit fast changing e1 → get 1/2 day of freshness
Visit slow changing e2 → get 1/2 week of freshness
Visiting e2 is a better deal!
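The comparison can be checked numerically under some stated assumptions: changes follow a Poisson process, each page is revisited at a fixed interval, and the expected freshness of a page with change rate $\lambda$ and visit rate $f$ is then $(1 - e^{-r})/r$ with $r = \lambda / f$. The helper names below are hypothetical, and the numbers illustrate the argument rather than reproduce figures from the talk.

```python
import math


def freshness(change_rate, visit_rate):
    """Expected freshness of one page under fixed-interval revisits."""
    if visit_rate <= 0:
        return 0.0  # never refreshed: stale in the long run
    r = change_rate / visit_rate
    if r == 0:
        return 1.0
    return -math.expm1(-r) / r


def average_freshness(change_rates, visit_rates):
    scores = [freshness(lam, f) for lam, f in zip(change_rates, visit_rates)]
    return sum(scores) / len(scores)


def uniform_allocation(change_rates, budget):
    """Every page gets the same share of the visit budget."""
    return [budget / len(change_rates)] * len(change_rates)


def proportional_allocation(change_rates, budget):
    """Each page's share is proportional to how often it changes."""
    total = sum(change_rates)
    return [budget * lam / total for lam in change_rates]


LAMBDAS = [1.0, 1.0 / 7.0]  # e1 changes daily, e2 once a week (per day)
BUDGET = 1.0 / 7.0          # one visit per week in total (per day)

uniform = average_freshness(LAMBDAS, uniform_allocation(LAMBDAS, BUDGET))
proportional = average_freshness(LAMBDAS, proportional_allocation(LAMBDAS, BUDGET))
print(uniform, proportional)
```

Under these assumptions the uniform schedule scores roughly 0.25 and the proportional one roughly 0.12, in line with the slide's point that proportional revisiting is often not a good idea.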

25 Optimal Refresh Frequency
Problem
  Given $\lambda_1, \lambda_2, \ldots, \lambda_N$ and $f$, find $f_1, f_2, \ldots, f_N$ that maximize the freshness of the database $S$
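For two pages the optimization can be explored by brute force. The sketch below is a plain grid search, not the method from the talk: it assumes Poisson changes, fixed-interval revisits (so a page's expected freshness is $(1 - e^{-r})/r$ with $r = \lambda / f$), and a total visit budget split as $f_1 + f_2$. All names are hypothetical.

```python
import math


def freshness(change_rate, visit_rate):
    """Expected freshness of one page under fixed-interval revisits."""
    if visit_rate <= 0:
        return 0.0
    r = change_rate / visit_rate
    if r == 0:
        return 1.0
    return -math.expm1(-r) / r


def best_two_page_split(lam1, lam2, budget, steps=1000):
    """Try every split of the budget on a grid; return (score, f1, f2)."""
    best = None
    for k in range(steps + 1):
        f1 = budget * k / steps
        f2 = budget - f1
        score = (freshness(lam1, f1) + freshness(lam2, f2)) / 2
        if best is None or score > best[0]:
            best = (score, f1, f2)
    return best


# e1 changes daily, e2 once a week, one visit per week in total.
score, f1, f2 = best_two_page_split(1.0, 1.0 / 7.0, 1.0 / 7.0)
print(score, f1, f2)
```

For this two-page example the search puts the whole budget on the slowly changing page, which matches the earlier remark that visiting e2 is the better deal.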

26 Optimal Refresh Frequency
Shape of curve is the same in all cases
Holds for any change frequency distribution

27 Optimal Refresh for Age
Shape of curve is the same in all cases
Holds for any change frequency distribution

28 Comparing Policies
Based on statistics from the experiment and a revisit frequency of once a month

29 Not Every Page is Equal!
Some pages are “more important”
  e1: accessed by users 10 times/day
  e2: accessed by users 20 times/day
  $F(S) = 1 \cdot F(e_1) + 2 \cdot F(e_2)$
In general, each page's freshness is weighted by its importance.

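30 Weighted Freshness
[Figure: refresh frequency $f$ plotted against change frequency $\lambda$, with one curve for $w = 2$ and one for $w = 1$]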
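A weighted version of database freshness is a one-line change to the unweighted average. The sketch below uses weights like the 1 and 2 assigned to e1 and e2 above; the function name and the optional normalization by the total weight are assumptions added for illustration.

```python
def weighted_freshness(freshness_values, weights, normalize=False):
    """Sum of w_i * F(e_i); optionally divided by the total weight."""
    if len(freshness_values) != len(weights):
        raise ValueError("need one weight per page")
    total = sum(w * f for w, f in zip(weights, freshness_values))
    if normalize:
        return total / sum(weights)
    return total


# e1 has weight 1, e2 has weight 2 (accessed twice as often).
print(weighted_freshness([1, 0], [1, 2]))  # only e1 fresh -> 1
print(weighted_freshness([0, 1], [1, 2]))  # only e2 fresh -> 2
```

Keeping e2 fresh is worth twice as much as keeping e1 fresh, which is why a refresh policy that optimizes the weighted metric shifts visits toward heavily accessed pages.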

31 Change Frequency Estimation
How to estimate change frequency?
Naïve Estimator: X/T
  X: number of detected changes
  T: monitoring period
  2 changes in 10 days: 0.2 times/day
Incomplete change history
[Figure: timeline at 1-day granularity marking page visited, page changed, and change detected]

32 Improved Estimator
Based on the Poisson model
  X: number of detected changes
  N: number of accesses
  f: access frequency
  3 changes in 10 days: times/day
Accounts for “missed” changes
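The estimator's formula is not preserved in this transcript, so the sketch below shows one estimator that follows from the Poisson model, which may differ from the one on the slide. If a page is accessed at frequency $f$, the probability that an access detects no change is $e^{-\lambda / f}$; equating that to the observed fraction $(N - X)/N$ gives $\hat{\lambda} = -f \ln(1 - X/N)$. The function names are hypothetical.

```python
import math


def naive_estimate(detected_changes, monitoring_period):
    """Naive estimator X / T: undercounts when several changes fall between visits."""
    return detected_changes / monitoring_period


def poisson_estimate(detected_changes, accesses, access_frequency):
    """Estimate lambda as -f * ln(1 - X/N), assuming Poisson changes."""
    if accesses <= 0 or not 0 <= detected_changes <= accesses:
        raise ValueError("need 0 <= X <= N and N > 0")
    if detected_changes == accesses:
        return math.inf  # every access saw a change: the rate is unbounded
    return -access_frequency * math.log1p(-detected_changes / accesses)


# Daily accesses for 10 days (N = 10, f = 1/day) with X = 3 detected changes.
print(naive_estimate(3, 10))         # 0.3 changes/day
print(poisson_estimate(3, 10, 1.0))  # about 0.357 changes/day
```

The Poisson-based value is higher than the naive one because it allows for changes that happened between accesses and were counted only once.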

33 Improvement Significant?
Application to a Web crawler
  Visit pages once every week for 5 weeks
  Estimate change frequency
  Adjust revisit frequency based on the estimate
    Uniform: do not adjust
    Naïve: based on the naïve estimator
    Ours: based on our improved estimator

34 Improvement from Our Estimator
Policy   Detected changes  Ratio to uniform
Uniform  2,147,589         100%
Naïve    4,145,582         193%
Ours     4,892,116         228%
(9,200,000 visits in total)

35 Other Estimators
Irregular access interval
Last-modified date
Categorization

36 Summary
Web evolution experiment
Change metric
Refresh policy
Frequency estimator

37 The End Thank you for your attention For more information visit

