Combining Systems and Databases: A Search Engine Retrospective
Author: Eric A. Brewer
Reviewed by: Rooma Rathore, Rohini Prinja

Overview:
• Problem
  - Problem statement
  - Why is the problem important?
  - Why is the problem hard?
• Approaches
  - Contributions of the paper
  - Assumptions
• Validations
• Rewrite

Problem Statement:
• Given: current search engines and DBMSs
• Find:
  - An efficient search engine design, including
      - a schema to store data
      - a query language
      - an implementation of the query mechanism
  - Ways to leverage database principles in designing data-intensive (DI) applications without necessarily using the same semantics
• Objectives:
  - Make use of database principles
  - Efficient (fast)
  - Cost effective
  - Scalable
• Constraints:
  - Highly available

Why is the problem important?
• Search engines (SEs) are in widespread use.
• The amount of data to be searched grows every second, so the design must be scalable.
  - A glimpse of real data (2005):
      - Number of documents: 3 billion
      - Data: 10 TB
      - Queries per day: 150 million

Why is the problem hard?
• As mentioned in [2], very little research has been done in the area of search engines.
• The documents/items to be searched number in the several billions.
• "Search items" have changed over time, from plain text to multimedia today.

Contributions:
• Discusses the challenges of designing an SE:
  - Ranking documents
  - Ranking query results
  - Availability
  - Freshness of data
• Identifies which DBMS principles can (and should) be applied when designing DI applications such as search engines:
  - Top-down design
  - Data independence
  - Declarative query language

Contributions (contd.):
• Why SEs cannot be implemented as a DBMS in the true sense:

  Metric                             Databases                 Search Engines
  ---------------------------------  ------------------------  ------------------------------
  Semantics                          ACID                      ACID does not hold
  Speed                              Slow                      Needs to be fast
  Cost                               Not cost effective        Amount of data handled is huge
  High availability vs. consistency  Consistency is preferred  High availability is preferred
  Updates                            Regular updates           At-will batch updates

Key Concepts:
• Ranking and scoring of documents
  - Word vs. property matching
  - Query Q = {w_1, w_2, w_3, ..., w_k}
  - $\mathrm{Score}(Q, d) \equiv \mathrm{Quality}(d) + \sum_{i=1}^{k} \mathrm{Score}(w_i, d)$
  - Quality(d) is the quality of document d, independent of the query words.
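The scoring rule above can be sketched in a few lines of Python. This is only an illustration under assumptions: the names `score`, `rank`, `quality`, and `word_scores` are hypothetical, and the per-word and quality values are made-up inputs, not figures from the paper.

```python
from typing import Dict, Iterable, List, Tuple


def score(query_words: Iterable[str], doc_id: str,
          quality: Dict[str, float],
          word_scores: Dict[Tuple[str, str], float]) -> float:
    """Score(Q, d) = Quality(d) + sum over query words of Score(w_i, d)."""
    # Quality(d) does not depend on the query words.
    total = quality.get(doc_id, 0.0)
    # Add the per-word contribution for every word in the query.
    for word in query_words:
        total += word_scores.get((word, doc_id), 0.0)
    return total


def rank(query_words: List[str], doc_ids: Iterable[str],
         quality: Dict[str, float],
         word_scores: Dict[Tuple[str, str], float]) -> List[str]:
    """Return document ids ordered from highest to lowest score."""
    return sorted(doc_ids,
                  key=lambda d: score(query_words, d, quality, word_scores),
                  reverse=True)


# Hypothetical inputs, chosen only to exercise the formula.
quality = {"d1": 0.5, "d2": 0.25}
word_scores = {("web", "d1"): 1.0, ("search", "d1"): 2.0, ("web", "d2"): 4.0}
ranking = rank(["web", "search"], ["d1", "d2"], quality, word_scores)
```

Here `d1` scores 0.5 + 1.0 + 2.0 = 3.5 and `d2` scores 0.25 + 4.0 = 4.25, so `d2` ranks first even though it matches only one query word.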

Key Concepts: Proposed Design for SEs
• Overview of SE design: crawl, index, serve
• Query (read-only)
  - Scoring of documents and words
  - Making a query plan
  - Query implementation
      - Access methods and physical operators
      - Query optimizer: map the logical query, exploit caching, minimize the number of joins
      - Query execution (on clusters)
      - Compression and other optimizations
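The index and serve steps above can be illustrated with a small inverted index, where a multi-word query behaves like a join (here, a set intersection) over per-word posting sets. This is a minimal sketch under assumptions, not the design from the paper: `build_index` and `serve_query` are hypothetical names, the documents are made up, and intersecting the smallest posting set first is just one simple way to keep intermediate results small.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Index step: map each word to the set of document ids containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def serve_query(index: Dict[str, Set[int]], words: List[str]) -> Set[int]:
    """Serve step: return ids of documents containing every query word."""
    # .get() avoids creating new index entries for unknown words.
    postings = [index.get(w.lower(), set()) for w in words]
    if not postings:
        return set()
    # Start with the smallest posting set so intermediate results stay small.
    postings.sort(key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
        if not result:
            break  # No document can match once the intersection is empty.
    return result


# Hypothetical three-document corpus.
docs = {1: "web search engine", 2: "database search", 3: "web database engine"}
index = build_index(docs)
matches = serve_query(index, ["web", "engine"])
```

Queries only read the index, which matches the read-only query assumption on the slide.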

Key Concepts: Proposed Design for SEs (contd.)
• Updating data
  - Nodes are independent; only whole tables are updated; updates are query-atomic
  - Updates through crawling and indexing
  - Atomic updates
  - Real-time deletions and updates
  - System-wide updates
• Fault tolerance
  - Goal is high availability
  - Disk faults, follower faults, master faults
  - Graceful degradation and disaster recovery
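One possible reading of the "only whole tables are updated" and atomic-update points above is that a node swaps in a complete new table in a single step, so any query sees either the old table or the new one, never a mix. The sketch below illustrates that idea only; the `Node` class and its methods are hypothetical and are not taken from the paper.

```python
import threading
from typing import Dict, FrozenSet


class Node:
    """A single independent node holding one read-only table (word -> doc ids)."""

    def __init__(self, table: Dict[str, FrozenSet[int]]) -> None:
        self._table = table
        self._lock = threading.Lock()

    def snapshot(self) -> Dict[str, FrozenSet[int]]:
        """Return the table currently in service."""
        with self._lock:
            return self._table

    def replace_table(self, new_table: Dict[str, FrozenSet[int]]) -> None:
        """Whole-table update: swap the reference in one step."""
        with self._lock:
            self._table = new_table

    def query(self, word: str) -> FrozenSet[int]:
        """Read-only query that runs against a single snapshot of the table."""
        table = self.snapshot()
        return table.get(word, frozenset())


node = Node({"web": frozenset({1})})
old_table = node.snapshot()                    # A query already in flight holds this.
node.replace_table({"web": frozenset({1, 2})})  # A freshly built table goes live.
```

A query that took its snapshot before the swap keeps reading the old table, while later queries read the new one. The old table is never modified in place.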

Key Concepts (contd.):
• Other topics in SEs that differ from DBMSs:
  - Personalization: cookies or database
  - Logging
  - Query rewriting
  - Phrase queries

Test the concept:
Q: How does the query optimizer for search engines compare with that of a traditional DBMS?
A:
• Both use an abstract logical query plan.
• SEs use a top-down query optimizer, whereas databases use a bottom-up one.

Assumptions:
• The proximity of words is not considered in the overall score for a document. We do not agree with this.
• When scoring a document, the author assumes that the shorter the document, the higher the score it should be assigned. This is not always true.
• Search queries are read-only, which is a valid assumption.
• DI applications are essentially like SEs and hence should be no different when it comes to applying database principles. This might not always be true.

Validations:
• The author's experience using the Informix database to build an SE
• The author's experience developing the Inktomi search engine, which informed the improved search engine design
• The workings of various modern search engines such as Google, AltaVista, and Infoseek

Conclusion of the paper:
• Data-intensive systems should employ the principles of databases.
• Many systems are a good fit for DBMS principles (though they may not use the same artifacts):
  - Logging systems
  - Google File System
  - Batch-Aware Distributed File System

Revisions if rewritten today:
• More emphasis and detail on logging: companies like Google earn their revenue from advertising (on the order of billions of dollars).
• How the following factors affect the design of an SE:
  - Click attacks
  - Privacy/copyright concerns while crawling the web
  - Generic search vs. search within a particular domain, such as law, image, or multimedia search
• A comparison of the proposed design with a currently popular search engine.

References:
[1] E. A. Brewer, "Combining Systems and Databases: A Search Engine Retrospective," in Readings in Database Systems, J. M. Hellerstein and M. Stonebraker, eds. (2005).
[2] Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine" (1998).
[3] Daniela Florescu, Alon Levy, and Alberto Mendelzon, "Database Techniques for the World-Wide Web: A Survey," SIGMOD Record (1998).
[4] Charles Frankel, Michael J. Swain, and Vassilis Athitsos, "WebSeer: An Image Search Engine for the World Wide Web," ACM (1997).

Q’n’A and Thanks!!