
WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,


1 WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh, Andy Kacsmar, Sep Kamvar, Wang Lam, Larry Page, Andreas Paepcke, Sriram Raghavan, Gary Wesley

2 2 The Web A universal information resource –Model weak, strong agreement How to exploit it? web

3 3 WebBase WEB PAGE

4 4 WebBase Goals Manage very large collections of Web pages –Today: 1500GB HTML, 200 M pages Enable large-scale Web-related research Locally provide a significant portion of the Web Efficient wide-area Web data distribution

5 5 WebBase Architecture

6 6 WebBase Remote Users Berkeley, Columbia, U. Washington, Harvey Mudd, Università degli Studi di Milano, U. of Arizona, California Digital Library, Cornell, U. of Houston, Learning Lab Lower Saxony (L3S), France Telecom, U. Texas

7 7 Outline Technical Challenges WebBase Use The Future

8 8 Challenges Scalability –crawling –archive distribution –index construction –storage Consistency –freshness –versions Dissemination Archiving –“units” –coordination IP Management –copy access –link access –access control Hidden Web Topic-Specific Collection Building

9 9 What is a Crawler? [Diagram: crawler loop over the web: init, get next url, get page, extract urls; data stores: initial urls, to visit urls, visited urls, web pages]
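The loop on this slide can be sketched roughly as below. This is a minimal illustration, not WebBase's crawler: `fetch` and `extract_urls` are hypothetical callables supplied by the caller, and the toy in-memory "web" stands in for real HTTP requests.

```python
from collections import deque

def crawl(initial_urls, fetch, extract_urls, max_pages=1000):
    """Basic crawler loop: init, get next url, get page, extract urls."""
    to_visit = deque(initial_urls)       # "to visit urls", seeded with the initial urls
    visited = set()                      # "visited urls"
    pages = {}                           # "web pages" collected so far
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()         # get next url
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)                # get page (None means unavailable)
        if page is None:
            continue
        pages[url] = page
        for link in extract_urls(page):  # extract urls
            if link not in visited:
                to_visit.append(link)
    return pages

# Toy in-memory "web" (hypothetical data): url -> list of outgoing links.
TOY_WEB = {"a": ["b", "c"], "b": ["a", "d"], "c": [], "d": ["e"]}

def toy_fetch(url):
    return TOY_WEB.get(url)

def toy_extract(page):
    return page

crawled = crawl(["a"], toy_fetch, toy_extract)
```

Here `crawled` holds the pages reachable from `a` that the toy web can actually serve; `e` is linked but never stored because `toy_fetch` returns `None` for it.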

10 10 Parallel Crawling [Diagram: multiple crawler processes C, C, C, ... crawling the web]

11 11 Independent Crawlers [Diagram: two crawler processes (C) crawling the web; site 1 with pages a, b, c, d, e; site 2 with pages f, g, h, i]

12 12 Partition: Firewall [Diagram: two crawler processes (C); a partition separates site 1 (pages a, b, c, d, e) from site 2 (pages f, g, h, i)] ·URL hash ·Site hash ·Hierarchical
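The three partitioning options named on this slide could look something like the following. This is only a sketch under assumptions: the function names, the use of MD5 as a stable hash, and the suffix table for the hierarchical scheme are all illustrative choices, not details from the talk.

```python
import hashlib
from urllib.parse import urlsplit

def _bucket(key, num_crawlers):
    # Stable hash: Python's built-in hash() is randomized per process,
    # so crawlers on different machines would disagree on ownership.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_crawlers

def url_hash_partition(url, num_crawlers):
    """URL hash: pages of one site may be spread over many crawlers."""
    return _bucket(url, num_crawlers)

def site_hash_partition(url, num_crawlers):
    """Site hash: every page of a host goes to the same crawler."""
    host = urlsplit(url).hostname or ""
    return _bucket(host, num_crawlers)

def hierarchical_partition(url, suffix_to_crawler, default=0):
    """Hierarchical: assign by domain suffix (hypothetical table, e.g. {'.edu': 0})."""
    host = urlsplit(url).hostname or ""
    for suffix, crawler in suffix_to_crawler.items():
        if host == suffix.lstrip(".") or host.endswith(suffix):
            return crawler
    return default
```

With a site hash, links within one site never cross the partition, which is why it tends to produce fewer inter-partition links than a plain URL hash.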

13 13 Partition: Cross-Over [Diagram: two crawler processes (C); a partition separates site 1 (pages a, b, c, d, e) from site 2 (pages f, g, h, i)]

14 14 Partition: Cross-Over [Diagram, next animation step: same two crawler processes, sites, and partition]

15 15 Partition: Exchange [Diagram: two crawler processes (C); a partition separates site 1 (pages a, b, c, d, e) from site 2 (pages f, g, h, i)]

16 16 Partition: Exchange [Diagram, next animation step: same two crawler processes, sites, and partition]
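The slides name three modes (firewall, cross-over, exchange) without spelling out their rules. The sketch below encodes one common reading of them, which is an assumption here: a firewall crawler drops links that leave its partition, a cross-over crawler keeps them as a fallback for when its own partition is exhausted, and an exchange crawler forwards them to the owning crawler. The function name, action labels, and toy ownership map are hypothetical.

```python
def route_link(link, my_id, owner_of, mode):
    """Decide what crawler `my_id` does with a discovered link.

    Returns (action, target) where action is one of:
      "enqueue": crawl it ourselves as a normal in-partition URL
      "drop":    ignore it (firewall)
      "defer":   keep it as a fallback URL (cross-over)
      "send":    forward it to the owning crawler (exchange)
    """
    owner = owner_of(link)
    if owner == my_id:
        return ("enqueue", my_id)
    if mode == "firewall":
        return ("drop", None)
    if mode == "cross-over":
        return ("defer", my_id)
    if mode == "exchange":
        return ("send", owner)
    raise ValueError(f"unknown mode: {mode}")

# Toy ownership matching the slide's picture: site 1 = pages a-e, site 2 = pages f-i.
def toy_owner(page):
    return 0 if page in "abcde" else 1

decisions = {mode: route_link("f", 0, toy_owner, mode)
             for mode in ("firewall", "cross-over", "exchange")}
```

Under this reading, firewall mode gives no overlap but can miss pages reachable only through inter-partition links, cross-over trades overlap for coverage, and exchange avoids both at the cost of communication between crawlers.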

17 17 Coverage vs Overlap cross-over crawler; 5 random seeds per C-proc
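The slide plots coverage against overlap but the transcript does not define either metric. The sketch below assumes the usual definitions from the parallel-crawling literature: overlap = (N - I) / I and coverage = I / U, where N is the total number of downloads across all crawler processes, I the number of unique pages downloaded, and U the number of pages that should have been downloaded. The function names and sample data are hypothetical.

```python
def overlap(downloads_per_crawler):
    """(N - I) / I: extra downloads per unique page (0.0 means no duplication)."""
    total = sum(len(d) for d in downloads_per_crawler)   # N
    unique = len(set().union(*downloads_per_crawler))    # I
    if unique == 0:
        return 0.0
    return (total - unique) / unique

def coverage(downloads_per_crawler, target_pages):
    """I / U: fraction of the target pages fetched by at least one crawler."""
    targets = set(target_pages)                          # U
    if not targets:
        return 0.0
    unique = set().union(*downloads_per_crawler)
    return len(unique & targets) / len(targets)

# Hypothetical run with two crawler processes; page f was fetched twice.
runs = [["a", "b", "c", "f"], ["f", "g", "h"]]
all_pages = list("abcdefghi")
run_overlap = overlap(runs)
run_coverage = coverage(runs, all_pages)
```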

18 18 WebBase Parallel Crawling [Diagram: a computer runs several crawl processes, each with its own site queues, crawling the web; a coordinator links the computer to other computers]

19 19 WebBase Parallel Crawling [Plot: cpu utilization (0% to 200%) vs. number of processes]

20 20 Challenges Scalability –crawling –archive distribution –index construction –storage Consistency –freshness –versions Dissemination Archiving –“units” –coordination IP Management –copy access –link access –access control Hidden Web Topic-Specific Collection Building done next

21 21 How to Refresh? [Diagram: pages a and b on the web; copies a and b in the repository] a changes daily; b changes once a week; can visit one page per week. How should we visit pages? –a a a a a a a... –b b b b b b b... –a b a b a b a b... [uniform] –a a a a a a b a a a... [proportional] –?
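One way to compare the candidate schedules on this slide is to put numbers on them. The sketch below makes assumptions the slide does not state: pages change as Poisson processes (a at 7 changes per week, b at 1 per week), each page is revisited at regular intervals, and "freshness" is the long-run fraction of time the stored copy matches the live page. Under those assumptions, a page with change rate λ revisited f times per unit time has expected freshness (1 - e^(-λ/f)) / (λ/f). All names are illustrative.

```python
import math

def expected_freshness(change_rate, visit_rate):
    """Time-averaged freshness of one page under the Poisson-change assumption."""
    if change_rate == 0:
        return 1.0   # a page that never changes is always fresh
    if visit_rate <= 0:
        return 0.0   # never revisited: treated as stale in the long run
    x = change_rate / visit_rate
    return (1.0 - math.exp(-x)) / x

# Change rates in changes per week (a: daily, b: weekly).
CHANGE_RATES = {"a": 7.0, "b": 1.0}

# Each policy splits the budget of one visit per week between a and b.
POLICIES = {
    "only a":       {"a": 1.0,   "b": 0.0},
    "only b":       {"a": 0.0,   "b": 1.0},
    "uniform":      {"a": 0.5,   "b": 0.5},
    "proportional": {"a": 7 / 8, "b": 1 / 8},
}

def average_freshness(policy):
    scores = [expected_freshness(CHANGE_RATES[p], policy[p]) for p in CHANGE_RATES]
    return sum(scores) / len(scores)

results = {name: average_freshness(policy) for name, policy in POLICIES.items()}
```

Under this model the uniform schedule scores higher than the proportional one: with so small a budget, visits spent on the fast-changing page a buy little, because a is stale again almost immediately.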

22 22 Using WebBase Fast Page Rank Complex Queries

23 23 Structure of the Web Color the nodes by their domain red = stanford.edu green = berkeley.edu blue = mit.edu

24 24 Structure of the Web stanford.edu berkeley.edu mit.edu

25 25 Nested Block Structure of the Web [Figure: link matrix with "from" and "to" axes; blocks labeled Berkeley and Stanford]

26 26 Personalized Page Rank a b

27 27 Complex Queries Stanford WebBase Repository. Text search: e.g., search for “SARS Symptoms”. Bulk/Streaming access: large-scale mining & indexing, e.g., compute PageRank, extract communities. Complex queries: declarative analysis interface.

28 Example of a Complex Query Goal: find universities collaborating with Stanford on mobile networking. From the entire Web, compute S = stanford.edu pages containing phrase “Mobile networking”. Rank pages in S by PageRank. Compute R = set of all “.edu” domains pointed to by pages in S. Rank domains in R by sum (incoming ranks). List top 10 domains in R.
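The query on this slide can be walked through on toy data as follows. This is a sketch, not the WebBase query interface: the page store, the precomputed `pagerank` scores, and all function names are hypothetical. Two further assumptions are made: each page in S contributes its PageRank once to every domain it links to, and the home domain is excluded from R so that Stanford does not rank itself.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def domain_of(url):
    """Last two labels of the host name, e.g. 'cs.stanford.edu' -> 'stanford.edu'."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])

def collaborating_domains(pages, pagerank, phrase, home="stanford.edu",
                          top_k=10, exclude_home=True):
    # S = home-domain pages containing the phrase, ranked by PageRank.
    s = [url for url, page in pages.items()
         if domain_of(url) == home and phrase.lower() in page["text"].lower()]
    s.sort(key=lambda url: pagerank.get(url, 0.0), reverse=True)

    # R = ".edu" domains pointed to by pages in S, scored by the summed
    # PageRank of the pages in S that point to them.
    scores = defaultdict(float)
    for url in s:
        for dom in {domain_of(link) for link in pages[url]["links"]}:
            if not dom.endswith(".edu"):
                continue
            if exclude_home and dom == home:
                continue
            scores[dom] += pagerank.get(url, 0.0)

    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]

# Hypothetical toy repository.
pages = {
    "http://cs.stanford.edu/mobile": {
        "text": "Mobile networking group",
        "links": ["http://www.berkeley.edu/net", "http://www.mit.edu/wireless"]},
    "http://ee.stanford.edu/lab": {
        "text": "Our mobile networking lab",
        "links": ["http://www.berkeley.edu/net", "http://example.com/"]},
    "http://www.stanford.edu/news": {
        "text": "Campus news",
        "links": ["http://www.cmu.edu/"]},
}
pagerank = {
    "http://cs.stanford.edu/mobile": 0.5,
    "http://ee.stanford.edu/lab": 0.3,
    "http://www.stanford.edu/news": 0.2,
}
top = collaborating_domains(pages, pagerank, "mobile networking")
```

In this toy data berkeley.edu ranks first because both matching pages link to it, mit.edu second, and cmu.edu is absent because the page linking to it does not contain the phrase.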

29 29 Supernodes Web graph: pages P1, P2, P3, P4, P5. Partition = {N1, N2, N3}. Supernode graph: supernodes N1, N2, N3 with superedges E1,2, E3,2, E1,3, E3,1. Components shown: IntraNode 1 (P1, P2), SEdgePos 1,3 (P2, P5), IntraNode 3 (P4, P5), SEdgeNeg 3,2 (P5, P3).
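A rough sketch of how a page-level graph collapses into a supernode graph is given below. It covers only part of what the slide shows: the supernode graph itself, the per-supernode intranode edge sets, and the positive superedge sets (the page-level links behind each superedge); the negative superedge encoding is left out. The partition N1 = {P1, P2}, N2 = {P3}, N3 = {P4, P5} is a guess consistent with the slide's labels, and the individual page links are invented for illustration.

```python
from collections import defaultdict

def build_supernode_graph(edges, partition):
    """Collapse a page-level graph into a supernode graph.

    edges:     iterable of (src_page, dst_page) links
    partition: dict mapping supernode name -> set of member pages
    """
    node_of = {page: name
               for name, members in partition.items()
               for page in members}
    intranode = defaultdict(set)      # links inside one supernode
    superedge_pos = defaultdict(set)  # page links behind each superedge
    for src, dst in edges:
        ns, nd = node_of[src], node_of[dst]
        if ns == nd:
            intranode[ns].add((src, dst))
        else:
            superedge_pos[(ns, nd)].add((src, dst))
    supernode_edges = set(superedge_pos)  # E(i,j) exists if any page link crosses i -> j
    return supernode_edges, dict(intranode), dict(superedge_pos)

# Assumed partition and invented page links.
partition = {"N1": {"P1", "P2"}, "N2": {"P3"}, "N3": {"P4", "P5"}}
edges = [("P1", "P2"), ("P1", "P3"), ("P2", "P5"),
         ("P4", "P5"), ("P4", "P3"), ("P5", "P1")]
supernode_edges, intranode, superedge_pos = build_supernode_graph(edges, partition)
```

With these inputs the supernode graph has the same four superedges as the slide (1→2, 3→2, 1→3, 3→1), while navigation within a supernode only needs that supernode's small intranode set.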

30 30 Growth of Supernode Graph [Plot: size of supernode graph (MB) vs. number of pages (millions)] 82MB, 115M pages (830 GB of raw HTML)

31 31 Query Execution Times [Bar chart: time for navigation operation (secs) for Query 1 through Query 6; series: S-Node representation, Relational DB, Connectivity Server, Files of adjacency lists]

32 32 Query Optimization P P

33 33 Impact of cluster-based optimization 35-million page dataset 600 million links 300GB of HTML 40-45% reduction in query execution times

34 34 Conclusion (So Far) Web is universal information resource WebBase exploits this resource WebBase Challenges: –scalability, consistency, complex queries... The Future for WebBase (and clones)??

35 35 Will WebBase Scale? [Plot over time, with "today" marked: web content (indexable), WebBase capacity (pessimistic), WebBase capacity (optimistic)]

36 36 Pessimistic Scenario Specialized WebBases –sports –shopping –... [Plot over time, with "today" marked: web content (indexable), WebBase capacity (pessimistic)]

37 37 Optimistic Scenario Web in a Box –web delivered in “CD” monthly –search engine handles updates [Plot over time, with "today" marked: web content (indexable), WebBase capacity (optimistic)]

38 38 Legal Issues? Is WebBase legal? –copies –links, deep linking International regulations

39 39 Biasing Results How long will Google, Altavista, etc. resist “temptations”? Biasing Crawler Link and Content Spam

40 40 Access Data WebBase does not capture access patterns web ? WebBase

41 41 Semantic Web? Will tags be generated? By whom? Agreement? web ? WebBase semantic tags

42 42 Future Technical Challenges Incremental Updates Query Optimization Crawling Deep Web

43 43 Final Conclusion Many challenges ahead... Additional information: Google: Stanford WebBase WEB PAGE

