Published by Augustus Hodge. Modified over 8 years ago.
1
WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh, Andy Kacsmar, Sep Kamvar, Wang Lam, Larry Page, Andreas Paepcke, Sriram Raghavan, Gary Wesley
2
2 The Web A universal information resource –Model weak, strong agreement How to exploit it? web
3
3 WebBase WEB PAGE
4
4 WebBase Goals
Manage very large collections of Web pages
–Today: 1500 GB HTML, 200 M pages
Enable large-scale Web-related research
Locally provide a significant portion of the Web
Efficient wide-area Web data distribution
5
5 WebBase Architecture
6
6 WebBase Remote Users
Berkeley
Columbia
U. Washington
Harvey Mudd
Università degli Studi di Milano
U. of Arizona
California Digital Library
Cornell
U. of Houston
Learning Lab Lower Saxony (L3S)
France Telecom
U. Texas
7
7 Outline Technical Challenges WebBase Use The Future
8
8 Challenges
Scalability
–crawling
–archive distribution
–index construction
–storage
Consistency
–freshness
–versions
Dissemination
Archiving
–“units”
–coordination
IP Management
–copy access
–link access
–access control
Hidden Web
Topic-Specific Collection Building
9
9 What is a Crawler?
[Diagram: crawler loop reading from the web]
Steps: init, get next url, get page, extract urls
Data: initial urls, to visit urls, visited urls, web pages
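The loop on this slide can be sketched in a few lines of Python. This is a minimal illustration, not WebBase's crawler: the in-memory `TOY_WEB` dictionary, the `example` URLs, and the regex-based link extraction are all assumptions made so the sketch runs without network access.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> page text (stands in for real HTTP fetches).
TOY_WEB = {
    "http://a.example/": '<a href="http://b.example/">b</a> <a href="http://c.example/">c</a>',
    "http://b.example/": '<a href="http://a.example/">a</a>',
    "http://c.example/": "no links here",
}

HREF = re.compile(r'href="([^"]+)"')  # simplistic link extractor, for illustration only


def crawl(initial_urls, get_page):
    to_visit = deque(initial_urls)  # "to visit urls", seeded by init
    visited = set()                 # "visited urls"
    pages = {}                      # "web pages"
    while to_visit:
        url = to_visit.popleft()    # get next url
        if url in visited:
            continue
        visited.add(url)
        page = get_page(url)        # get page
        if page is None:            # unknown URL in the toy web
            continue
        pages[url] = page
        for link in HREF.findall(page):  # extract urls
            if link not in visited:
                to_visit.append(link)
    return pages


pages = crawl(["http://a.example/"], TOY_WEB.get)
```

A production crawler would add politeness delays, robots.txt handling, and URL normalization, none of which are shown here.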
10
10 Parallel Crawling [Diagram: crawlers C, C, C, ... all fetching from the web]
11
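11 Independent Crawlers [Diagram: two crawlers C on the web; site 1 with pages a, b, c, d, e; site 2 with pages f, g, h, i]
12
12 Partition: Firewall [Diagram: two crawlers C; site 1 (a, b, c, d, e) and site 2 (f, g, h, i) separated by a partition]
·URL hash
·Site hash
·Hierarchical
13
13 Partition: Cross-Over [Diagram: two crawlers C; site 1 (a, b, c, d, e) and site 2 (f, g, h, i) separated by a partition]
14
14 Partition: Cross-Over [Diagram: two crawlers C; site 1 (a, b, c, d, e) and site 2 (f, g, h, i) separated by a partition]
15
15 Partition: Exchange [Diagram: two crawlers C; site 1 (a, b, c, d, e) and site 2 (f, g, h, i) separated by a partition]
16
16 Partition: Exchange [Diagram: two crawlers C; site 1 (a, b, c, d, e) and site 2 (f, g, h, i) separated by a partition]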
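The URL-hash and site-hash partitioning options listed on the Firewall slide can be sketched as follows. This is an illustrative assumption rather than WebBase's actual scheme: the function names, the use of MD5 as a stable hash, and the `example` URLs are all hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def _bucket(key, n_crawlers):
    """Map a string to a crawler id in [0, n_crawlers) using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_crawlers


def url_hash(url, n_crawlers):
    # URL hash: pages of one site may be spread over several crawlers.
    return _bucket(url, n_crawlers)


def site_hash(url, n_crawlers):
    # Site hash: every page of a site goes to the same crawler.
    return _bucket(urlsplit(url).hostname or "", n_crawlers)


def split_links(links, my_id, n_crawlers, assign=site_hash):
    """Separate extracted links into those in this crawler's partition and the rest."""
    own, foreign = [], []
    for url in links:
        (own if assign(url, n_crawlers) == my_id else foreign).append(url)
    return own, foreign


links = [
    "http://site1.example/a",
    "http://site1.example/b",
    "http://site2.example/f",
]
own, foreign = split_links(links, my_id=0, n_crawlers=2)
```

On one reading of the diagrams, a firewall crawler would discard `foreign`, while an exchange crawler would forward those URLs to the crawler that owns them. The slides show the modes only as pictures, so that reading is an assumption.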
17
17 Coverage vs Overlap cross-over crawler; 5 random seeds per C-proc
18
18 WebBase Parallel Crawling [Diagram: on each computer, processes with their own site queues fetch from the web; a coordinator connects to the other computers]
19
19 WebBase Parallel Crawling [Chart: CPU utilization (axis ticks at 0%, 100%, 200%) vs. number of processes]
20
20 Challenges
Scalability
–crawling
–archive distribution
–index construction
–storage
Consistency
–freshness
–versions
Dissemination
Archiving
–“units”
–coordination
IP Management
–copy access
–link access
–access control
Hidden Web
Topic-Specific Collection Building
[Markers on slide: done, next]
21
21 How to Refresh?
[Diagram: pages a and b on the web; copies of a and b in the repository]
a changes daily
b changes once a week
can visit one page per week
How should we visit pages?
–a a a a a a a...
–b b b b b b b...
–a b a b a b a b... [uniform]
–a a a a a a b a a a... [proportional]
–?
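One way to compare the candidate visit orders is to compute the expected freshness of each page. The sketch below assumes that page changes follow a Poisson process and that each page is revisited at regular intervals. Neither assumption is stated on the slide. Under that model, a page with change rate λ that is visited f times per unit time is fresh, on average, for a fraction (1 - e^(-λ/f)) / (λ/f) of the time.

```python
import math


def freshness(change_rate, visit_rate):
    """Time-averaged probability that the local copy is up to date.

    Assumes Poisson changes and evenly spaced visits (a modeling assumption).
    """
    if visit_rate == 0:
        return 0.0  # never refreshed: treated as always stale
    if change_rate == 0:
        return 1.0  # never changes: always fresh
    x = change_rate / visit_rate  # expected number of changes between visits
    return (1.0 - math.exp(-x)) / x


# Rates in events per week: a changes daily, b changes once a week.
CHANGE_RATES = {"a": 7.0, "b": 1.0}

# Each policy splits the budget of one visit per week between the two pages.
POLICIES = {
    "only a": {"a": 1.0, "b": 0.0},
    "only b": {"a": 0.0, "b": 1.0},
    "uniform": {"a": 0.5, "b": 0.5},
    "proportional": {"a": 7 / 8, "b": 1 / 8},
}


def average_freshness(visit_rates):
    values = [freshness(CHANGE_RATES[p], visit_rates[p]) for p in CHANGE_RATES]
    return sum(values) / len(values)


scores = {name: average_freshness(rates) for name, rates in POLICIES.items()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:13s} {score:.3f}")
```

Under these assumptions the uniform policy scores higher than the proportional one, and spending the entire budget on the slowly changing page b scores higher still, which suggests why the slide leaves the last option open.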
22
22 Using WebBase Fast Page Rank Complex Queries
23
23 Structure of the Web Color the nodes by their domain red = stanford.edu green = berkeley.edu blue = mit.edu
24
24 Structure of the Web stanford.edu berkeley.edu mit.edu
25
25 Nested Block Structure of the Web [Figure labels: Berkeley, Stanford; axes: from, to]
26
26 Personalized Page Rank a b
27
27 Complex Queries
Stanford WebBase Repository
Text search
–E.g., Search for “SARS Symptoms”
Bulk/Streaming access
–Large-scale mining & indexing
–E.g., compute PageRank, extract communities
Complex queries
–Declarative analysis interface
28
28 Example of a Complex Query
Goal: find universities collaborating with Stanford on mobile networking
[Diagram: Entire Web, containing stanford.edu, containing the Mobile networking pages (S); links from S to R]
Compute S = stanford.edu pages containing phrase “Mobile networking”
Rank pages in S by PageRank
Compute R = set of all “.edu” domains pointed to by pages in S
Rank domains in R by sum (incoming ranks)
List top 10 domains in R
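The query on this slide can be traced on a toy repository. The sketch below is illustrative only: the `PAGES` data, its rank values, and the helper names are hypothetical, and a real system would express this through WebBase's declarative interface rather than plain Python.

```python
from collections import defaultdict

# Hypothetical toy repository: URL -> text, precomputed PageRank, outgoing links.
PAGES = {
    "stanford.edu/p1": {
        "text": "Mobile networking lab",
        "rank": 0.4,
        "links": ["berkeley.edu/x", "mit.edu/y"],
    },
    "stanford.edu/p2": {
        "text": "Research on mobile networking",
        "rank": 0.2,
        "links": ["berkeley.edu/z", "example.com/q"],
    },
    "stanford.edu/p3": {
        "text": "Databases group",
        "rank": 0.9,
        "links": ["cmu.edu/w"],
    },
}


def domain(url):
    """Return the part of a scheme-less URL before the first slash."""
    return url.split("/", 1)[0]


def top_domains(pages, site, phrase, k=10):
    # S = pages on `site` containing the phrase (case-insensitive match).
    s = [u for u, p in pages.items()
         if domain(u) == site and phrase.lower() in p["text"].lower()]
    # R = .edu domains pointed to by pages in S, scored by the sum of the
    # ranks of the pages in S that link to them.
    incoming = defaultdict(float)
    for u in s:
        for target in pages[u]["links"]:
            d = domain(target)
            if d.endswith(".edu"):
                incoming[d] += pages[u]["rank"]
    # List the top k domains in R.
    return sorted(incoming.items(), key=lambda kv: kv[1], reverse=True)[:k]


result = top_domains(PAGES, "stanford.edu", "mobile networking")
```

In this toy data `berkeley.edu` ranks first because both matching Stanford pages link to it, while `cmu.edu` is absent because the only page linking to it does not contain the phrase.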
29
29 Supernodes
Web graph (pages P1, P2, P3, P4, P5) = {N1, N2, N3}
Supernode graph: nodes N1, N2, N3; edges E1,2, E3,2, E1,3, E3,1
Component graphs over the pages: IntraNode1, IntraNode3, SEdgePos1,3, SEdgeNeg3,2
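A rough sketch of how a page-level graph might be folded into a supernode graph is given below. The page-to-supernode assignment and the individual page links are hypothetical, chosen only so that the resulting supernode edges match the E1,2, E3,2, E1,3, E3,1 listed on the slide. The positive versus negative superedge encoding (SEdgePos, SEdgeNeg) is not modeled.

```python
def build_supernode_graph(links, partition):
    """Group page-level links by the supernodes of their endpoints.

    links:     dict mapping a page to the set of pages it points to
    partition: dict mapping a page to its supernode id
    Returns (super_edges, intranode, superedge):
      super_edges: set of (i, j) supernode pairs with at least one link i -> j
      intranode:   supernode id -> set of page links inside that supernode
      superedge:   (i, j) -> set of page links crossing from i to j
    """
    super_edges = set()
    intranode = {}
    superedge = {}
    for src, targets in links.items():
        for dst in targets:
            i, j = partition[src], partition[dst]
            if i == j:
                intranode.setdefault(i, set()).add((src, dst))
            else:
                super_edges.add((i, j))
                superedge.setdefault((i, j), set()).add((src, dst))
    return super_edges, intranode, superedge


# Hypothetical assignment of the five pages to three supernodes.
PARTITION = {"P1": 1, "P2": 1, "P3": 2, "P4": 3, "P5": 3}

# Hypothetical page-level links.
LINKS = {
    "P1": {"P2", "P3"},
    "P2": {"P5"},
    "P4": {"P5", "P1"},
    "P5": {"P3"},
}

super_edges, intranode, superedge = build_supernode_graph(LINKS, PARTITION)
```

The supernode graph stays small because it has one node per partition, and the per-partition pieces can be loaded only when a query touches them.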
30
30 Growth of Supernode Graph [Chart: size of supernode graph (MB) vs. number of pages (millions)] 82 MB, 115M pages (830 GB of raw HTML)
31
31 Query Execution Times [Chart: time for navigation operation (secs), 0 to 600, for Query 1 through Query 6] Representations compared: S-Node representation, Relational DB, Connectivity Server, Files of adjacency lists
32
32 Query Optimization P P
33
33 Impact of cluster-based optimization 35-million page dataset 600 million links 300GB of HTML 40-45% reduction in query execution times
34
34 Conclusion (So Far)
Web is universal information resource
WebBase exploits this resource
WebBase Challenges:
–scalability, consistency, complex queries...
The Future for WebBase (and clones)??
35
35 Will WebBase Scale? [Chart over time, with today marked: web content (indexable), WebBase capacity (pessimistic), WebBase capacity (optimistic)]
36
36 Pessimistic Scenario
Specialized WebBases
–sports
–shopping
–...
[Chart over time, with today marked: web content (indexable), WebBase capacity (pessimistic)]
37
37 Optimistic Scenario
Web in a Box
–web delivered in “CD” monthly
–search engine handles updates
[Chart over time, with today marked: web content (indexable), WebBase capacity (optimistic)]
38
38 Legal Issues? Is WebBase legal? –copies –links, deep linking International regulations
39
39 Biasing Results How long will Google, Altavista, etc. resist “temptations”? Biasing Crawler Link and Content Spam
40
40 Access Data WebBase does not capture access patterns web ? WebBase
41
41 Semantic Web? Will tags be generated? By whom? Agreement? web ? WebBase semantic tags
42
42 Future Technical Challenges Incremental Updates Query Optimization Crawling Deep Web
43
43 Final Conclusion Many challenges ahead... Additional information: Google: Stanford WebBase WEB PAGE