

WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh, Andy Kacsmar, Sep Kamvar, Wang Lam, Larry Page, Andreas Paepcke, Sriram Raghavan, Gary Wesley

2 The Web
A universal information resource
–Model weak, strong agreement
How to exploit it?
[Figure: the web]

3 WebBase WEB PAGE

4 WebBase Goals
Manage very large collections of Web pages
–Today: 1500GB HTML, 200 M pages
Enable large-scale Web-related research
Locally provide a significant portion of the Web
Efficient wide-area Web data distribution

5 WebBase Architecture

6 WebBase Remote Users
Berkeley, Columbia, U. Washington, Harvey Mudd, Università degli Studi di Milano, U. of Arizona, California Digital Library, Cornell, U. of Houston, Learning Lab Lower Saxony (L3S), France Telecom, U. Texas

7 Outline Technical Challenges WebBase Use The Future

8 Challenges
Scalability
–crawling
–archive distribution
–index construction
–storage
Consistency
–freshness
–versions
Dissemination
Archiving
–“units”
–coordination
IP Management
–copy access
–link access
–access control
Hidden Web
Topic-Specific Collection Building

9 What is a Crawler?
[Flow diagram. Steps: init, get next url, get page, extract urls. Data: web, initial urls, to visit urls, visited urls, web pages]
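The loop on this slide can be sketched as follows. This is a minimal illustration, not the WebBase crawler: the function names, the `max_pages` limit, and the in-memory `TOY_WEB` that stands in for real HTTP fetching are all assumptions made for the example.

```python
from collections import deque


def crawl(initial_urls, get_page, extract_urls, max_pages=100):
    """Basic crawl loop: init, get next url, get page, extract urls."""
    to_visit = deque(initial_urls)   # "to visit urls", seeded by init
    visited = set()                  # "visited urls"
    pages = {}                       # "web pages"
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()             # get next url
        if url in visited:
            continue
        visited.add(url)
        page = get_page(url)                 # get page
        if page is None:                     # missing or unreachable page
            continue
        pages[url] = page
        for link in extract_urls(page):      # extract urls
            if link not in visited:
                to_visit.append(link)
    return pages, visited


# Hypothetical in-memory "web": url -> list of outgoing links.
TOY_WEB = {"a": ["b", "c"], "b": ["a", "d"], "c": [], "d": ["a"]}

# Here a "page" is just its list of links, so extraction is the identity.
pages, visited = crawl(["a"], TOY_WEB.get, lambda page: page)
```

Using a FIFO queue gives breadth-first order; a different ordering of "to visit urls" would change which pages are fetched first but not the structure of the loop.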

10 Parallel Crawling
[Diagram: crawlers C, C, C, ... and the web]

11 Independent Crawlers
[Diagram: two crawlers (C, C) and the web; site 1 with pages a, b, c, d, e; site 2 with pages f, g, h, i]

12 Partition: Firewall
[Diagram: two crawlers (C, C); site 1 with pages a, b, c, d, e; site 2 with pages f, g, h, i; partition]
Partitioning options:
–URL hash
–Site hash
–Hierarchical
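A hedged sketch of the "site hash" option combined with firewall behavior. It assumes that "firewall" means each crawler keeps only the links that fall inside its own partition and discards the rest; the function names and the choice of MD5 as the hash are illustrative and not taken from the slides.

```python
import hashlib
from urllib.parse import urlsplit


def site_of(url):
    """Return the host name of a URL (empty string if there is none)."""
    return urlsplit(url).hostname or ""


def assign_crawler(url, num_crawlers):
    """Site-hash partitioning: all pages of one site map to the same crawler."""
    digest = hashlib.md5(site_of(url).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_crawlers


def firewall_filter(my_id, num_crawlers, extracted_urls):
    """Firewall mode (as assumed here): drop links that leave my partition."""
    return [u for u in extracted_urls
            if assign_crawler(u, num_crawlers) == my_id]
```

Hashing the site rather than the full URL keeps a site's pages on one crawler, so links within a site never cross the partition. Hashing the full URL (the "URL hash" option) would spread a site across crawlers instead.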

13 Partition: Cross-Over
[Diagram: two crawlers (C, C); site 1 with pages a, b, c, d, e; site 2 with pages f, g, h, i; partition]

14 Partition: Cross-Over
[Diagram, continued: two crawlers (C, C); site 1 with pages a, b, c, d, e; site 2 with pages f, g, h, i; partition]

15 Partition: Exchange
[Diagram: two crawlers (C, C); site 1 with pages a, b, c, d, e; site 2 with pages f, g, h, i; partition]

16 Partition: Exchange
[Diagram, continued: two crawlers (C, C); site 1 with pages a, b, c, d, e; site 2 with pages f, g, h, i; partition]

17 Coverage vs Overlap
[Plot: cross-over crawler; 5 random seeds per C-proc]

18 WebBase Parallel Crawling
[Diagram: the web; a computer running several processes, each with its own site queues; other computers; a coordinator]

19 WebBase Parallel Crawling
[Plot: CPU utilization (0% to 200%) vs. number of processes]

20 Challenges
Scalability
–crawling
–archive distribution
–index construction
–storage
Consistency
–freshness
–versions
Dissemination
Archiving
–“units”
–coordination
IP Management
–copy access
–link access
–access control
Hidden Web
Topic-Specific Collection Building
[Markers on slide: done, next]

21 How to Refresh?
[Diagram: pages a and b on the web and in the repository]
a changes daily
b changes once a week
can visit one page per week
How should we visit pages?
–a a a a a a a...
–b b b b b b b...
–a b a b a b a b... [uniform]
–a a a a a a b a a a... [proportional]
–?
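One way to compare these schedules is to estimate the long-run fraction of time the repository copy of each page is up to date. The sketch below assumes that pages change as a Poisson process and are revisited at regular intervals; neither assumption is stated on the slide. Under that model, a page with change rate c visited at rate v is fresh for a fraction (1 - e^(-c/v)) / (c/v) of the time. All names are illustrative.

```python
import math


def avg_freshness(change_rate, visit_rate):
    """Time-averaged freshness of one page (Poisson changes, regular visits)."""
    if visit_rate == 0:
        return 0.0            # never refreshed: stale in the long run
    if change_rate == 0:
        return 1.0            # never changes: always fresh
    x = change_rate / visit_rate   # expected changes between two visits
    return (1.0 - math.exp(-x)) / x


def policy_freshness(change_rates, visit_rates):
    """Average freshness over all pages for one visiting policy."""
    scores = [avg_freshness(c, v) for c, v in zip(change_rates, visit_rates)]
    return sum(scores) / len(scores)


# Rates per week: a changes daily (7 per week), b changes once a week.
CHANGE = [7.0, 1.0]

# Each policy splits a budget of one visit per week between a and b.
policies = {
    "only a": [1.0, 0.0],
    "only b": [0.0, 1.0],
    "uniform": [0.5, 0.5],
    "proportional": [7 / 8, 1 / 8],
}

results = {name: policy_freshness(CHANGE, rates)
           for name, rates in policies.items()}

for name, score in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {score:.3f}")
```

Under these assumptions the uniform schedule scores higher than the proportional one, because visits spent on the fast-changing page a buy very little freshness.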

22 Using WebBase Fast Page Rank Complex Queries

23 Structure of the Web
Color the nodes by their domain: red = stanford.edu, green = berkeley.edu, blue = mit.edu

24 Structure of the Web
[Figure: node clusters labeled stanford.edu, berkeley.edu, mit.edu]

25 Nested Block Structure of the Web
[Plot with axes "from" and "to"; blocks labeled Berkeley and Stanford]

26 Personalized Page Rank a b
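The slide names personalized PageRank without giving a formulation. A common one, assumed here, replaces the uniform random jump of ordinary PageRank with a user-specific teleport distribution. The sketch below uses plain power iteration; the damping factor 0.85, the iteration count, and the toy graph are illustrative choices, not values from the talk.

```python
def personalized_pagerank(links, teleport, damping=0.85, iterations=50):
    """Power iteration for PageRank with a given teleport distribution.

    links:    dict node -> list of nodes it links to (every node is a key)
    teleport: dict node -> jump probability (must sum to 1 over all nodes)
    """
    nodes = list(links)
    rank = dict(teleport)                      # start from the teleport vector
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            out = links[n]
            if out:
                share = damping * rank[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:
                # Dangling node: spread its rank according to the teleport vector.
                for m in nodes:
                    nxt[m] += damping * rank[n] * teleport[m]
        rank = nxt
    return rank


# Hypothetical three-page graph.
GRAPH = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}

uniform = personalized_pagerank(GRAPH, {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3})
biased = personalized_pagerank(GRAPH, {"a": 0.0, "b": 0.0, "c": 1.0})
```

Concentrating the teleport vector on page c raises c's score relative to the uniform case, which is the effect personalization is meant to produce.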

27 Complex Queries
Stanford WebBase Repository
Text search
–E.g., search for “SARS Symptoms”
Bulk/Streaming access
–Large-scale mining & indexing
–E.g., compute PageRank, extract communities
Complex queries
–Declarative analysis interface

28 Example of a Complex Query
Goal: find universities collaborating with Stanford on mobile networking
–Compute S = stanford.edu pages containing the phrase “Mobile networking”
–Rank pages in S by PageRank
–Compute R = set of all “.edu” domains pointed to by pages in S
–Rank domains in R by sum (incoming ranks)
–List top 10 domains in R
[Diagram: Entire Web; stanford.edu; Mobile networking pages (S); R]
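The query steps can be sketched over a tiny in-memory repository. Everything concrete here is hypothetical: the URLs, texts, and PageRank values in `PAGES`, the helper names, and two interpretive choices. "Sum (incoming ranks)" is read as the sum of the PageRank of the pages in S that link into a domain, and stanford.edu itself is left out of R.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical toy repository: url -> (text, outgoing links, PageRank score).
PAGES = {
    "http://cs.stanford.edu/m1": (
        "mobile networking lab",
        ["http://www.berkeley.edu/x", "http://www.mit.edu/y"], 0.4),
    "http://ee.stanford.edu/m2": (
        "notes on mobile networking",
        ["http://www.berkeley.edu/z", "http://example.com/q"], 0.2),
    "http://cs.stanford.edu/db": (
        "database group",
        ["http://www.mit.edu/y"], 0.9),
}


def domain(url):
    """Last two labels of the host name, e.g. 'stanford.edu'."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def top_collaborators(pages, phrase, k=10):
    # S: stanford.edu pages containing the phrase, ranked by PageRank.
    S = [u for u, (text, _, _) in pages.items()
         if domain(u) == "stanford.edu" and phrase in text.lower()]
    S.sort(key=lambda u: pages[u][2], reverse=True)

    # R: .edu domains pointed to by pages in S, scored by summed ranks.
    score = defaultdict(float)
    for u in S:
        _, links, rank = pages[u]
        for d in {domain(v) for v in links}:   # count each domain once per page
            if d.endswith(".edu") and d != "stanford.edu":
                score[d] += rank

    return sorted(score.items(), key=lambda kv: kv[1], reverse=True)[:k]


result = top_collaborators(PAGES, "mobile networking")
```

The database page is excluded from S because it lacks the phrase, so its link to mit.edu does not contribute even though its PageRank is the highest.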

29 Supernodes
[Diagram: a web graph on pages P1, P2, P3, P4, P5 with partition {N1, N2, N3}; the supernode graph on N1, N2, N3 with edges E1,2, E3,2, E1,3, E3,1; component graphs labeled IntraNode 1, IntraNode 3, SEdgePos 1,3, SEdgeNeg 3,2]
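A rough sketch of how a supernode graph could be derived from a page-level graph and a partition. The page assignment and the edge list below are made up for illustration (chosen so that the resulting supernode edges match the E labels on the slide), and the function name is hypothetical. Only intranode graphs and positive superedge graphs are built; the SEdgeNeg variant, which presumably stores the missing edges instead, is not shown.

```python
from collections import defaultdict


def build_supernode_graph(edges, partition):
    """Split a page graph into a supernode graph plus per-part edge sets.

    edges:     iterable of (source page, target page)
    partition: dict supernode name -> set of pages it contains
    """
    node_of = {p: n for n, members in partition.items() for p in members}
    intranode = defaultdict(set)   # supernode -> page edges inside it
    superedge = defaultdict(set)   # (Ni, Nj) -> page edges going from Ni to Nj
    for src, dst in edges:
        ni, nj = node_of[src], node_of[dst]
        if ni == nj:
            intranode[ni].add((src, dst))
        else:
            superedge[(ni, nj)].add((src, dst))
    # One supernode edge for every pair joined by at least one page link.
    supernode_edges = set(superedge)
    return supernode_edges, dict(intranode), dict(superedge)


# Hypothetical assignment of pages to supernodes and hypothetical links.
PARTITION = {"N1": {"P1", "P2"}, "N2": {"P3"}, "N3": {"P4", "P5"}}
EDGES = [("P1", "P2"), ("P2", "P5"), ("P4", "P5"),
         ("P5", "P3"), ("P1", "P3"), ("P4", "P1")]

supernode_edges, intranode, superedge = build_supernode_graph(EDGES, PARTITION)
```

The supernode graph is much smaller than the page graph, which is what lets it stay in memory while the per-part edge sets are loaded only when a query touches them.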

30 Growth of Supernode Graph
[Plot: size of supernode graph (MB) vs. number of pages (millions); marked point: 82 MB at 115M pages (830 GB of raw HTML)]

31 Query Execution Times
[Chart: time for navigation operation (secs) for Query 1 through Query 6; series: S-Node representation, Relational DB, Connectivity Server, Files of adjacency lists]

32 Query Optimization
[Diagram]

33 Impact of cluster-based optimization
35-million page dataset
–600 million links
–300GB of HTML
40-45% reduction in query execution times

34 Conclusion (So Far)
Web is a universal information resource
WebBase exploits this resource
WebBase Challenges:
–scalability, consistency, complex queries...
The Future for WebBase (and clones)??

35 Will WebBase Scale?
[Plot over time, with "today" marked. Curves: web content (indexable); WebBase capacity (pessimistic); WebBase capacity (optimistic)]

36 Pessimistic Scenario
Specialized WebBases
–sports
–shopping
–...
[Plot over time, with "today" marked. Curves: web content (indexable); WebBase capacity (pessimistic)]

37 Optimistic Scenario
Web in a Box
–web delivered in “CD” monthly
–search engine handles updates
[Plot over time, with "today" marked. Curves: web content (indexable); WebBase capacity (optimistic)]

38 Legal Issues?
Is WebBase legal?
–copies
–links, deep linking
International regulations

39 Biasing Results How long will Google, Altavista, etc. resist “temptations”? Biasing Crawler Link and Content Spam

40 Access Data
WebBase does not capture access patterns
[Diagram: web, "?", WebBase]

41 Semantic Web?
Will tags be generated? By whom? Agreement?
[Diagram: web, "?", WebBase, semantic tags]

42 Future Technical Challenges Incremental Updates Query Optimization Crawling Deep Web

43 Final Conclusion Many challenges ahead... Additional information: Google: Stanford WebBase WEB PAGE