1 Searching the Web Junghoo Cho UCLA Computer Science
2 Information Galore
Legacy database
Plain text files
Biblio server
3 Information Overload Problem
4 Solution
Indexing approach
– Google, Excite, AltaVista
Integration approach
– MySimon, BizRate
5 Indexing Approach Central Index
6 Challenges
Page selection and download
– What page to download?
Page and index update
– How to update pages?
Page ranking
– What page is “important” or “relevant”?
Scalability
7 Integration Approach
[Figure: a mediator connected to Source 1, Source 2, ..., Source n, each through its own wrapper]
8 Challenges
Heterogeneous sources
– Different data models: relational, object-oriented
– Different schemas and representations: “Keanu Reeves” or “Reeves, K.” etc.
Limited query capabilities
Mediator caching
9 Focus of the Talk
Indexing approach
How to keep pages up-to-date?
10 Outline of This Talk
How can we keep pages fresh?
– How does the Web change?
– What do we mean by “fresh” pages?
– How should we refresh pages?
11 Web Evolution Experiment
How often does a Web page change?
How long does a page stay on the Web?
How long does it take for 50% of the Web to change?
How do we model Web changes?
12 Experimental Setup
February 17 to June 24, 1999
270 sites visited (with permission)
– identified 400 sites with highest “PageRank”
– contacted administrators
720,000 pages collected
– 3,000 pages from each site daily
– start at root, visit breadth first (get new & old pages)
– ran only 9pm - 6am, 10 seconds between site requests
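The "start at root, visit breadth first" step with a per-site page cap can be sketched as follows. This is a minimal illustration, not the crawler used in the experiment: the `breadth_first_pages` function, the `site` link graph, and the small `limit` are hypothetical, and the link graph is held in memory instead of being fetched over the network.

```python
from collections import deque


def breadth_first_pages(root, links, limit):
    """Return up to `limit` pages reachable from `root`, in breadth-first order.

    `links` maps a page to the list of pages it links to (hypothetical,
    in-memory stand-in for fetching and parsing real pages).
    """
    seen = {root}
    order = []
    queue = deque([root])
    while queue and len(order) < limit:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:  # visit each page at most once
                seen.add(target)
                queue.append(target)
    return order


# Hypothetical toy site; the experiment capped each site at 3,000 pages.
site = {
    "/": ["/a", "/b"],
    "/a": ["/c", "/"],
    "/b": ["/c", "/d"],
    "/c": [],
    "/d": ["/a"],
}
crawl = breadth_first_pages("/", site, limit=4)
print(crawl)
```

Because the walk restarts from the root each day, pages that appear or disappear near the root are picked up naturally, which matches the "get new & old pages" remark.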
13 Average Change Interval
[Figure: fraction of pages vs. average change interval]
14 Change Interval – By Domain
[Figure: fraction of pages vs. average change interval, by domain]
15 Modeling Web Evolution
Poisson process with rate $\lambda$
$T$ is time to next event
$f_T(t) = \lambda e^{-\lambda t}$ $(t > 0)$
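A small sketch of what the Poisson model implies for a single page. The function names are illustrative, and the example rate of 0.1 changes per day (a page that changes every 10 days on average) is chosen only for demonstration.

```python
import math


def interval_density(rate, t):
    """Density f_T(t) = rate * exp(-rate * t) of the time to the next change."""
    return rate * math.exp(-rate * t) if t > 0 else 0.0


def prob_change_within(rate, t):
    """Probability that at least one change occurs within time t."""
    return 1.0 - math.exp(-rate * t)


def mean_interval(rate):
    """Expected time between changes under the Poisson model."""
    return 1.0 / rate


rate = 0.1  # changes per day (illustrative value)
print(mean_interval(rate))           # average days between changes
print(prob_change_within(rate, 10))  # chance of a change within 10 days
```

Under this model a page with a 10-day average change interval still has roughly a 63% chance of changing within any given 10-day window, since short intervals are the most likely ones.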
16 Change Interval of Pages
[Figure: for pages that change every 10 days on average, fraction of changes with given interval vs. interval in days, compared with the Poisson model]
17 Change Metrics
Freshness
– Freshness of element $e_i$ at time $t$ is
$F(e_i; t) = 1$ if $e_i$ is up-to-date at time $t$, $0$ otherwise
– Freshness of the database $S$ at time $t$ is
$F(S; t) = \frac{1}{N} \sum_{i=1}^{N} F(e_i; t)$
(Assume “equal importance” of pages)
[Figure: elements $e_i$ on the web and their copies in the database]
18 Change Metrics
Age
– Age of element $e_i$ at time $t$ is
$A(e_i; t) = 0$ if $e_i$ is up-to-date at time $t$, $t - (\text{modification time of } e_i)$ otherwise
– Age of the database $S$ at time $t$ is
$A(S; t) = \frac{1}{N} \sum_{i=1}^{N} A(e_i; t)$
(Assume “equal importance” of pages)
[Figure: elements $e_i$ on the web and their copies in the database]
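The two database-level metrics can be computed directly from their definitions. In this sketch the representation is an assumption made for illustration: each page is described by the time at which the web copy was modified while the local copy was not yet refreshed, or `None` if the local copy is up-to-date.

```python
def database_freshness(stale_since):
    """Fraction of pages whose local copy is up-to-date (equal importance)."""
    return sum(1 for m in stale_since if m is None) / len(stale_since)


def database_age(stale_since, t):
    """Average age at time t: 0 for fresh pages, t - modification time otherwise."""
    return sum(0.0 if m is None else t - m for m in stale_since) / len(stale_since)


# Hypothetical 4-page database observed at t = 10:
# two pages are fresh, one went stale at t = 3, one at t = 8.
pages = [None, 3.0, None, 8.0]
print(database_freshness(pages))  # 0.5
print(database_age(pages, 10.0))  # (0 + 7 + 0 + 2) / 4 = 2.25
```

Freshness only records whether a copy is stale, while age also records for how long, which is why the two metrics can favor different refresh policies.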
19 Change Metrics
[Figure: $F(e_i)$ and $A(e_i)$ plotted against time, with update and refresh events marked]
Time averages:
20 Trick Question
Two page database
$e_1$ changes daily
$e_2$ changes once a week
Can visit one page per week
How should we visit pages?
– $e_1$ $e_2$ $e_1$ $e_2$ $e_1$ $e_2$ $e_1$ $e_2$ ... [uniform]
– $e_1$ $e_1$ $e_1$ $e_1$ $e_1$ $e_1$ $e_1$ $e_2$ $e_1$ $e_1$ ... [proportional]
– $e_1$ $e_1$ $e_1$ $e_1$ $e_1$ $e_1$ ...
– $e_2$ $e_2$ $e_2$ $e_2$ $e_2$ $e_2$ ...
– ?
[Figure: $e_1$ and $e_2$ on the web and their copies in the database]
21 Proportional Often Not Good!
Visit fast changing $e_1$
– get 1/2 day of freshness
Visit slow changing $e_2$
– get 1/2 week of freshness
Visiting $e_2$ is a better deal!
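The same conclusion can be checked numerically under the Poisson model. This sketch is an assumption layered on the slide, which argues with a simpler "half a day vs. half a week" picture: here each page is refreshed at regular intervals with frequency f, its changes arrive with rate λ, and its time-averaged freshness is then (1 - e^{-λ/f}) / (λ/f). A page that is never refreshed is given freshness 0. All names are illustrative.

```python
import math


def avg_freshness(change_rate, refresh_rate):
    """Time-averaged freshness of one page refreshed at regular intervals,
    assuming Poisson changes; a never-refreshed page counts as 0."""
    if refresh_rate <= 0:
        return 0.0
    x = change_rate / refresh_rate
    return (1.0 - math.exp(-x)) / x


LAMBDA_E1 = 1.0        # e1 changes daily (per day)
LAMBDA_E2 = 1.0 / 7.0  # e2 changes once a week (per day)
BUDGET = 1.0 / 7.0     # one visit per week in total (per day)

# (refresh frequency of e1, refresh frequency of e2) for each policy
policies = {
    "uniform": (BUDGET / 2, BUDGET / 2),
    "proportional": (BUDGET * 7 / 8, BUDGET / 8),
    "only e1": (BUDGET, 0.0),
    "only e2": (0.0, BUDGET),
}

scores = {
    name: (avg_freshness(LAMBDA_E1, f1) + avg_freshness(LAMBDA_E2, f2)) / 2
    for name, (f1, f2) in policies.items()
}

for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:13s} {score:.3f}")
```

Under these assumptions the ranking is only e2, then uniform, then proportional, then only e1: spending the single weekly visit on the fast-changing page buys very little freshness.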
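22 Optimal Refresh Frequency
Problem
Given the change frequencies $\lambda_i$ and the average refresh frequency $f$, find the refresh frequencies $f_i$ that maximize the freshness of the database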
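For two pages, the optimization can be approximated by a simple grid search over how the refresh budget is split. This is a rough sketch and not the method from the talk: it assumes Poisson changes, regular refresh intervals, the per-page freshness formula (1 - e^{-λ/f}) / (λ/f), and a hypothetical `best_split` helper with a fixed grid resolution.

```python
import math


def avg_freshness(change_rate, refresh_rate):
    """Time-averaged freshness of one page under Poisson changes and
    regular refreshes; a never-refreshed page counts as 0."""
    if refresh_rate <= 0:
        return 0.0
    x = change_rate / refresh_rate
    return (1.0 - math.exp(-x)) / x


def best_split(rate1, rate2, budget, steps=100):
    """Try every split of `budget` on a grid and return
    (average freshness, f1, f2) for the best one."""
    best = None
    for i in range(steps + 1):
        f1 = budget * i / steps
        f2 = budget * (steps - i) / steps
        score = (avg_freshness(rate1, f1) + avg_freshness(rate2, f2)) / 2
        if best is None or score > best[0]:
            best = (score, f1, f2)
    return best


# Two-page example: e1 changes daily, e2 weekly, one visit per week in total.
score, f1, f2 = best_split(1.0, 1.0 / 7.0, 1.0 / 7.0)
print(f"f1={f1:.4f}  f2={f2:.4f}  freshness={score:.3f}")
```

With this tight budget the search puts every visit on the slow-changing page; when both pages change at the same rate, the same search returns an even split.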
23 Optimal Refresh Frequency
Shape of curve is the same in all cases
Holds for any change frequency distribution
24 Optimal Refresh for Age
Shape of curve is the same in all cases
Holds for any change frequency distribution
25 Comparing Policies
Based on statistics from the experiment and a revisit frequency of once every month
26 Not Every Page is Equal!
Some pages are “more important”
– $e_1$: accessed by users 20 times/day
– $e_2$: accessed by users 10 times/day
In general,
27 Weighted Freshness
[Figure: curves for w = 1 and w = 2; axis label f]
28 Change Frequency Estimation
How to estimate change frequency?
– Naïve Estimator: X/T
– X: number of detected changes
– T: monitoring period
– 2 changes in 10 days: 0.2 times/day
Incomplete change history
[Figure: timeline with page visits 1 day apart, marking when the page changed and when a change was detected]
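The effect of an incomplete change history on the naïve estimator X/T can be seen in a small simulation. This is an illustrative sketch under stated assumptions: changes follow a Poisson process, the page is visited exactly once per day, and a visit can only tell whether the page changed since the previous visit, not how many times. The function names, the true rate of 1 change per day, and the fixed random seed are all hypothetical choices.

```python
import random


def naive_estimate(detected_changes, monitoring_days):
    """Naive estimator X/T: detected changes per unit of monitoring time."""
    return detected_changes / monitoring_days


def simulate_daily_visits(true_rate, days, seed=0):
    """Simulate Poisson changes and daily visits.

    Returns (actual number of changes, number of detected changes).
    Several changes between two visits are detected as a single change.
    """
    rng = random.Random(seed)
    actual = 0
    detected = 0
    for _ in range(days):
        # Count the changes that fall inside one day by summing
        # exponentially distributed gaps between changes.
        t = rng.expovariate(true_rate)
        changes_today = 0
        while t < 1.0:
            changes_today += 1
            t += rng.expovariate(true_rate)
        actual += changes_today
        if changes_today > 0:
            detected += 1
    return actual, detected


actual, detected = simulate_daily_visits(true_rate=1.0, days=1000)
print("true rate           : 1.0 changes/day")
print("actual changes/day  :", actual / 1000)
print("naive estimate (X/T):", naive_estimate(detected, 1000))
```

Because at most one change can be detected per visit, the naïve estimate never exceeds the visit frequency and systematically underestimates pages that change about as often as they are visited.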
29 Improved Estimator
Based on the Poisson model
– X: number of detected changes
– N: number of accesses
– f: access frequency
3 changes in 10 days: 0.36 times/day
Accounts for “missed” changes
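One Poisson-based estimator that reproduces the slide's example can be derived as follows; the exact formula shown in the talk is not preserved in this transcript, so treat this as an assumption. If changes arrive with rate λ and the page is accessed with frequency f, the probability that no change occurs between two accesses is e^{-λ/f}. Equating this with the observed fraction of accesses without a detected change, (N - X)/N, gives λ ≈ -f ln((N - X)/N). The function names below are illustrative.

```python
import math


def naive_estimate(x, n, f):
    """Naive estimator X/T, with monitoring period T = N / f."""
    return x / (n / f)


def poisson_estimate(x, n, f):
    """Estimate the change rate from the fraction of accesses with no
    detected change, assuming Poisson changes: -f * ln((N - X) / N)."""
    if x >= n:
        # Every access saw a change: the data only say the rate is large.
        return math.inf
    return -f * math.log((n - x) / n)


# Slide example: 3 detected changes in 10 daily accesses.
print(round(naive_estimate(3, 10, 1.0), 2))    # 0.3 changes/day
print(round(poisson_estimate(3, 10, 1.0), 2))  # 0.36 changes/day
```

The corrected estimate is always at least as large as the naïve one, and the gap widens as X approaches N, which is exactly where missed changes are most likely.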
30 Improvement Significant?
Application to a Web crawler
– Visit pages once every week for 5 weeks
– Estimate change frequency
– Adjust revisit frequency based on the estimate
» Uniform: do not adjust
» Naïve: based on the naïve estimator
» Ours: based on our improved estimator
31 Improvement from Our Estimator
Policy     Detected changes   Ratio to uniform
Uniform    2,147,589          100%
Naïve      4,145,582          193%
Ours       4,892,116          228%
(9,200,000 visits in total)
32 Summary
Information overload problem
– Indexing approach
– Integration approach
Page update
– Web evolution experiment
– Change metric
– Refresh policy
– Frequency estimator
33 Research Opportunity
Efficient query processing?
Automatic source discovery?
Automatic data extraction?
34 Web Archive Project
Can we store the history of the Web?
– Web is ephemeral
– Study of the Evolution of the Web
Challenges
– Update policy?
– Compression?
– New storage structure?
– New index structure?
35 The End
Thank you for your attention
For more information visit http://www.cs.ucla.edu/~cho/