WebBase: Building a Web Warehouse
Hector Garcia-Molina, Stanford University
Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh, Andy Kacsmar, Sep Kamvar, Wang Lam, Larry Page, Andreas Paepcke, Sriram Raghavan, Gary Wesley
2 The Web A universal information resource –Model weak, strong agreement How to exploit it? web
3 WebBase WEB PAGE
4 WebBase Goals Manage very large collections of Web pages –Today: 1500GB HTML, 200 M pages Enable large-scale Web-related research Locally provide a significant portion of the Web Efficient wide-area Web data distribution
5 WebBase Architecture
6 WebBase Remote Users Berkeley Columbia U. Washington Harvey Mudd Università degli Studi di Milano U. of Arizona California Digital Library Cornell U. of Houston Learning Lab Lower Saxony (L3S) France Telecom U. Texas
7 Outline Technical Challenges WebBase Use The Future
8 Challenges
Scalability
–crawling
–archive distribution
–index construction
–storage
Consistency
–freshness
–versions
Dissemination
Archiving
–“units”
–coordination
IP Management
–copy access
–link access
–access control
Hidden Web
Topic-Specific Collection Building
9 What is a Crawler? [Flowchart: init, then loop: get next url, get page, extract urls. Data: initial urls, to visit urls, visited urls, web pages. Pages are fetched from the web.]
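The loop on this slide can be sketched in a few lines. This is a minimal illustration, not WebBase's crawler: `TOY_WEB`, `fetch`, and `extract_urls` are hypothetical stand-ins for the web, an HTTP download, and an HTML link extractor.

```python
from collections import deque

# Hypothetical in-memory "web": url -> page text (space-separated links).
TOY_WEB = {"a": "b c", "b": "a d", "c": "", "d": "a", "e": "a"}

def crawl(initial_urls, fetch, extract_urls, max_pages=1000):
    """Basic crawler loop: get next url, get page, extract urls."""
    to_visit = deque(initial_urls)   # "to visit urls"
    visited = set()                  # "visited urls"
    pages = {}                       # "web pages"
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()     # get next url
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)            # get page
        if page is None:             # unreachable or missing page
            continue
        pages[url] = page
        for link in extract_urls(page):   # extract urls
            if link not in visited:
                to_visit.append(link)
    return pages

pages = crawl(["a"], TOY_WEB.get, str.split)
print(sorted(pages))
```

Page "e" is never downloaded because no page reachable from the seed links to it, which is why the choice of initial urls matters.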
10 Parallel Crawling CCC... web
11 Independent Crawlers [Figure: two crawlers (C) on the web; site 1 with pages a, b, c, d, e; site 2 with pages f, g, h, i]
12 Partition: Firewall [Figure: two crawlers (C); site 1 with pages a, b, c, d, e; site 2 with pages f, g, h, i; a partition separates the two sites] Partition by: ·URL hash ·Site hash ·Hierarchical
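A rough sketch of the URL-hash and site-hash partitioning functions, plus a firewall-style filter that drops links owned by another crawler. The function names, the use of SHA-1, and the example URLs are assumptions for illustration; the hierarchical scheme is not shown.

```python
import hashlib
from urllib.parse import urlsplit

def _bucket(key, n_procs):
    """Map a string to a crawler id with a hash that is stable across runs."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_procs

def url_hash_owner(url, n_procs):
    """URL hash: pages of one site may be spread over several crawlers."""
    return _bucket(url, n_procs)

def site_hash_owner(url, n_procs):
    """Site hash: all pages of one site go to the same crawler."""
    host = urlsplit(url).hostname or ""
    return _bucket(host, n_procs)

def firewall_filter(links, my_id, n_procs, owner=site_hash_owner):
    """Firewall mode: keep only the links inside this crawler's partition."""
    return [u for u in links if owner(u, n_procs) == my_id]

# Hypothetical URLs, for illustration only.
links = ["http://site1.example/a", "http://site1.example/b",
         "http://site2.example/f"]
for pid in range(2):
    print(pid, firewall_filter(links, pid, 2))
```

With a site hash, links within a site never cross the partition, so fewer links are discarded in firewall mode than with a per-URL hash.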
13 Partition: Cross-Over [Figure: two crawlers (C); site 1 with pages a, b, c, d, e; site 2 with pages f, g, h, i; a partition separates the two sites]
15 Partition: Exchange [Figure: two crawlers (C); site 1 with pages a, b, c, d, e; site 2 with pages f, g, h, i; a partition separates the two sites]
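In exchange mode, a crawler that finds a link into another partition hands the URL to the crawler that owns it instead of dropping or following it. The sketch below simulates that with shared queues; the link structure in `TOY_WEB` is invented for illustration (the slide only names the pages), and `owner` is a hypothetical site-based partition.

```python
from collections import deque

# Hypothetical links among the slide's pages: site 1 = a..e, site 2 = f..i.
TOY_WEB = {"a": ["b", "c"], "b": ["d", "f"], "c": ["e"], "d": [],
           "e": ["g"], "f": ["h"], "g": ["i"], "h": [], "i": ["a"]}

def owner(url):
    """Assumed partition: crawler 0 owns site 1, crawler 1 owns site 2."""
    return 0 if url in "abcde" else 1

def exchange_crawl(web, owner, seeds, n_procs):
    queues = [deque() for _ in range(n_procs)]
    for url in seeds:
        queues[owner(url)].append(url)
    downloaded = [set() for _ in range(n_procs)]
    while any(queues):                       # some crawler still has work
        for pid in range(n_procs):
            if not queues[pid]:
                continue
            url = queues[pid].popleft()
            if url in downloaded[pid] or url not in web:
                continue
            downloaded[pid].add(url)
            for link in web[url]:
                # Route every link to its owner; when the owner is another
                # crawler, this is the URL "exchange".
                queues[owner(link)].append(link)
    return downloaded

print(exchange_crawl(TOY_WEB, owner, ["a"], 2))
```

Each page is downloaded by exactly one crawler, so there is no overlap, and pages reachable only through inter-partition links (here f to i) are still covered.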
17 Coverage vs Overlap cross-over crawler; 5 random seeds per C-proc
18 WebBase Parallel Crawling web site queues... process site queues... process... computer other computers coordinator
19 WebBase Parallel Crawling [Plot: CPU utilization (0% to 200%) vs. number of processes]
20 Challenges Scalability –crawling –archive distribution –index construction –storage Consistency –freshness –versions Dissemination Archiving –“units” –coordination IP Management –copy access –link access –access control Hidden Web Topic-Specific Collection Building done next
21 How to Refresh? [Figure: pages a and b on the web, with copies in the repository]
–a changes daily
–b changes once a week
–can visit one page per week
How should we visit pages?
–a a a a a a a...
–b b b b b b b...
–a b a b a b a b... [uniform]
–a a a a a a b a a a... [proportional]
–?
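One way to compare these policies is to compute the expected freshness of the repository under each. The sketch below assumes pages change as a Poisson process and are revisited at fixed intervals, which the slide does not state; under that assumption a page with change rate λ revisited f times per week is fresh a fraction (1 − e^(−λ/f)) / (λ/f) of the time. The names `CHANGE_RATE`, `BUDGET`, and `POLICIES` are illustrative.

```python
import math

CHANGE_RATE = {"a": 7.0, "b": 1.0}   # changes per week: a daily, b weekly
BUDGET = 1.0                          # total page visits per week

def freshness(change_rate, visit_rate):
    """Time-averaged probability that the stored copy is up to date
    (assumes Poisson changes and evenly spaced visits)."""
    if visit_rate == 0:
        return 0.0
    x = change_rate / visit_rate
    return (1 - math.exp(-x)) / x

def average_freshness(share):
    """share[p] is the fraction of the visit budget spent on page p."""
    return sum(freshness(CHANGE_RATE[p], BUDGET * share[p])
               for p in CHANGE_RATE) / len(CHANGE_RATE)

POLICIES = {
    "only a":       {"a": 1.0, "b": 0.0},
    "only b":       {"a": 0.0, "b": 1.0},
    "uniform":      {"a": 0.5, "b": 0.5},
    "proportional": {"a": 7 / 8, "b": 1 / 8},
}

for name, share in POLICIES.items():
    print(f"{name:12s} {average_freshness(share):.3f}")
```

Under these assumptions, visiting in proportion to change rate does worse than visiting uniformly: most of the budget goes to page a, which is stale again almost immediately after each visit.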
22 Using WebBase Fast Page Rank Complex Queries
23 Structure of the Web Color the nodes by their domain red = stanford.edu green = berkeley.edu blue = mit.edu
24 Structure of the Web stanford.edu berkeley.edu mit.edu
25 Nested Block Structure of the Web Berkeley Stanford from to
26 Personalized Page Rank a b
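As a reminder of what personalization changes, here is a small power-iteration sketch in which the random jump goes to a user-specific preference vector rather than to a uniformly random page. This is the textbook formulation, not the fast method the talk refers to; the toy graph, damping factor, and iteration count are assumptions.

```python
def personalized_pagerank(links, preference, damping=0.85, iterations=100):
    """links: page -> list of out-links (every target must be a key).
    preference: page -> jump probability, summing to 1."""
    nodes = list(links)
    rank = {n: preference.get(n, 0.0) for n in nodes}
    for _ in range(iterations):
        # Random jump: land on a page according to the preference vector.
        new = {n: (1 - damping) * preference.get(n, 0.0) for n in nodes}
        for n in nodes:
            out = links[n]
            if out:
                share = damping * rank[n] / len(out)
                for m in out:
                    new[m] += share
            else:
                # Dangling page: send its rank back through the preference vector.
                for m in nodes:
                    new[m] += damping * rank[n] * preference.get(m, 0.0)
        rank = new
    return rank

# Hypothetical four-page graph.
LINKS = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
uniform = {n: 0.25 for n in LINKS}
likes_d = {"d": 1.0}

print(personalized_pagerank(LINKS, uniform))
print(personalized_pagerank(LINKS, likes_d))
```

With the uniform vector this is ordinary PageRank; concentrating the preference on one page raises the rank of that page and of the pages it links to.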
27 Complex Queries Stanford WebBase Repository Text search E.g., Search for “SARS Symptoms” Bulk/Streaming access Large-scale mining & indexing E.g., compute PageRank, extract communities Complex queries Declarative analysis interface
28 Example of a Complex Query: find universities collaborating with Stanford on mobile networking
–Compute S = stanford.edu pages containing phrase “Mobile networking”
–Rank pages in S by PageRank
–Compute R = set of all “.edu” domains pointed to by pages in S
–Rank domains in R by sum (incoming ranks)
–List top 10 domains in R
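The query can be traced on a toy repository. The sketch below is plain Python, not WebBase's declarative interface; the `PAGES` data, field names, and rank values are made up, and leaving stanford.edu itself out of R is an assumption made to match the stated goal.

```python
from collections import defaultdict

# Hypothetical repository: url -> text, precomputed PageRank, out-links.
PAGES = {
    "stanford.edu/p1": {"text": "Mobile networking lab", "rank": 0.5,
                        "links": ["berkeley.edu/x", "mit.edu/y"]},
    "stanford.edu/p2": {"text": "Notes on mobile networking", "rank": 0.25,
                        "links": ["berkeley.edu/z", "example.com/q"]},
    "stanford.edu/p3": {"text": "Databases", "rank": 0.125,
                        "links": ["mit.edu/y"]},
}

def domain(url):
    return url.split("/", 1)[0]

def top_collaborators(pages, phrase, home="stanford.edu", k=10):
    # S = pages in the home domain containing the phrase.
    s = [u for u, p in pages.items()
         if domain(u) == home and phrase.lower() in p["text"].lower()]
    # R = .edu domains pointed to by pages in S, scored by the summed
    # rank of the S pages linking in (once per link).
    score = defaultdict(float)
    for u in s:
        for target in pages[u]["links"]:
            d = domain(target)
            if d.endswith(".edu") and d != home:
                score[d] += pages[u]["rank"]
    # Top k domains in R.
    return sorted(score.items(), key=lambda kv: kv[1], reverse=True)[:k]

print(top_collaborators(PAGES, "mobile networking"))
```

Every step needs both the text index (phrase match) and the link graph (out-links, ranks), which is what distinguishes this kind of query from plain text search.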
29 Supernodes [Figure: web graph of pages P1, P2, P3, P4, P5, partitioned into supernodes {N1, N2, N3}; supernode graph over N1, N2, N3 with superedges E1,2, E3,2, E1,3, E3,1; per-supernode intranode graphs (IntraNode 1, IntraNode 3) and superedge graphs (SEdgePos 1,3, SEdgeNeg 3,2)]
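A minimal sketch of deriving a supernode graph from a page graph and a partition. The page-to-supernode assignment and the page-level links below are assumptions chosen to reproduce the four superedges named on the slide; the positive/negative superedge encoding is not modeled.

```python
from collections import defaultdict

# Assumed partition of pages into supernodes N1, N2, N3.
SUPERNODE_OF = {"P1": 1, "P2": 1, "P3": 2, "P4": 3, "P5": 3}

# Hypothetical page-level links.
PAGE_EDGES = [("P1", "P2"), ("P1", "P3"), ("P2", "P5"),
              ("P4", "P5"), ("P5", "P1"), ("P4", "P3")]

def build_supernode_graph(page_edges, supernode_of):
    superedges = set()                  # E(i,j): some page in Ni links into Nj
    intranode = defaultdict(list)       # links that stay inside one supernode
    superedge_links = defaultdict(list) # page links behind each superedge
    for src, dst in page_edges:
        i, j = supernode_of[src], supernode_of[dst]
        if i == j:
            intranode[i].append((src, dst))
        else:
            superedges.add((i, j))
            superedge_links[(i, j)].append((src, dst))
    return superedges, dict(intranode), dict(superedge_links)

superedges, intranode, superedge_links = build_supernode_graph(
    PAGE_EDGES, SUPERNODE_OF)
print(sorted(superedges))
print(intranode)
```

The top-level graph has one node per supernode and stays small, while the page-level detail is split into many small per-supernode and per-superedge pieces that can be loaded only when a query touches them.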
30 Growth of Supernode Graph [Plot: size of supernode graph (MB) vs. number of pages (millions)] 82MB, 115M pages (830 GB of raw HTML)
31 Query Execution Times [Chart: time for navigation operation (secs) for Query 1 through Query 6, comparing S-Node representation, Relational DB, Connectivity Server, and Files of adjacency lists]
32 Query Optimization P P
33 Impact of cluster-based optimization 35-million page dataset 600 million links 300GB of HTML 40-45% reduction in query execution times
34 Conclusion (So Far) Web is a universal information resource WebBase exploits this resource WebBase Challenges: –scalability, consistency, complex queries... The Future for WebBase (and clones)??
35 Will WebBase Scale? [Plot over time, with "today" marked: web content (indexable), WebBase capacity (pessimistic), WebBase capacity (optimistic)]
36 Pessimistic Scenario Specialized WebBases –sports –shopping –... [Plot over time, with "today" marked: web content (indexable), WebBase capacity (pessimistic)]
37 Optimistic Scenario Web in a Box –web delivered in “CD” monthly –search engine handles updates [Plot over time, with "today" marked: web content (indexable), WebBase capacity (optimistic)]
38 Legal Issues? Is WebBase legal? –copies –links, deep linking International regulations
39 Biasing Results How long will Google, AltaVista, etc. resist “temptations”? Biasing Crawler Link and Content Spam
40 Access Data WebBase does not capture access patterns web ? WebBase
41 Semantic Web? Will tags be generated? By whom? Agreement? web ? WebBase semantic tags
42 Future Technical Challenges Incremental Updates Query Optimization Crawling Deep Web
43 Final Conclusion Many challenges ahead... Additional information: Google: Stanford WebBase WEB PAGE