Published by Lambert Bailey. Modified over 8 years ago.
1
Combining Systems and Databases: A Search Engine Retrospective. By: Rooma Rathore, Rohini Prinja. Author: Eric A. Brewer
2
Problem Statement: How search engines (SEs) should have been designed, and how to leverage database principles in designing data-intensive (DI) applications without necessarily using the same semantics.
3
Importance of the paper in the current context: Search engines have become an important part of life for billions of people. It is intriguing how SEs manage enormous amounts of data, on the order of 3B documents and growing every second. The paper covers the behind-the-scenes challenges of designing SEs: ranking documents, ranking query results, availability, and freshness of data. It discusses data-intensive applications in the light of SEs. Finally, it raises the question of how scalable these models can be, given that the data on the internet grows every second.
4
Contributions: Gives figures for various search engine parameters, such as the number of documents, the amount of data stored, and the number of queries. Discusses the challenges of designing an SE. Identifies the DBMS principles that can (and should) be applied when designing DI applications like search engines: top-down design, data independence, and a declarative query language.
5
Contributions (contd.): Why did SEs not use a DBMS in the first place, and why can SEs not be implemented as a DBMS in the true sense? Speed: DBMSs are too slow for the SE workload. Cost: DBMSs are not cost-effective given the magnitude of the data. High availability vs. consistency: DBMSs favor consistency, whereas SEs favor availability. Update: the model for updating data in SEs is entirely different from that of databases.
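To make the availability-versus-consistency point concrete, here is a minimal sketch of a query that keeps answering when one partition is down, returning partial results instead of failing. It is an illustration only, not taken from the paper or the slides: the names `scatter_gather`, `make_partition`, and `down_partition`, the in-memory partitions, and the "coverage" figure are all assumptions.

```python
def scatter_gather(partitions, term):
    """Ask every partition for `term`; skip partitions that are unavailable.

    Returns the merged doc ids plus the fraction of partitions that answered.
    A consistency-first system would instead fail the whole query.
    """
    results = []
    answered = 0
    for lookup in partitions:
        try:
            results.extend(lookup(term))
            answered += 1
        except ConnectionError:
            # Hypothetical failure mode: the node holding this partition is down.
            continue
    coverage = answered / len(partitions) if partitions else 0.0
    return sorted(results), coverage


def make_partition(postings):
    """Build a healthy in-memory partition (stand-in for a remote node)."""
    return lambda term: postings.get(term, [])


def down_partition(term):
    """Stand-in for a node that cannot be reached."""
    raise ConnectionError("node unavailable")


partitions = [
    make_partition({"web": [1, 4]}),
    down_partition,
    make_partition({"web": [7]}),
]
hits, coverage = scatter_gather(partitions, "web")
print(hits, round(coverage, 2))
```

The query still returns the documents held by the two healthy partitions; the coverage value tells the caller that the answer is incomplete.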
6
Contributions (contd.): New design: uses static databases and a large degree of offline work to build and rebuild them. Overview of SE design: crawl, index, serve; queries are read-only. Topics covered: scoring of documents and words; making a query plan; query implementation; access methods and physical operators; optimizing queries to maximize the throughput of the system; providing redundancy using clustering; compression and other optimizations; updating data; fault tolerance.
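The offline-build, read-only-serve split described above can be sketched with a tiny inverted index. This is a simplified illustration under assumptions, not the design from the paper: `build_index`, `intersect`, and `search` are hypothetical names, the documents are toy strings standing in for crawled pages, and the only "query plan" decision shown is intersecting the shortest posting list first.

```python
from collections import defaultdict


def build_index(docs):
    """Offline step: map each term to a sorted tuple of doc ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Freeze into immutable tuples: the serving side never modifies the index.
    return {term: tuple(sorted(ids)) for term, ids in postings.items()}


def intersect(a, b):
    """Merge-join two sorted posting lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search(index, query):
    """Read-only AND query over the static index."""
    terms = query.lower().split()
    if not terms:
        return []
    lists = [index.get(t, ()) for t in terms]
    # Simple plan: start from the shortest list so intermediate results stay small.
    lists.sort(key=len)
    result = list(lists[0])
    for plist in lists[1:]:
        result = intersect(result, plist)
        if not result:
            break
    return result


docs = {
    1: "search engines crawl the web",
    2: "databases use query plans",
    3: "search engines use query plans",
}
index = build_index(docs)  # rebuilt offline whenever the crawl changes
print(search(index, "search engines"))
```

Because the index is rebuilt wholesale rather than updated in place, serving needs no locking; an update amounts to swapping in a freshly built index.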
7
Contributions (contd.): SE challenges that differ from those of a traditional DBMS: personalization, logging, query rewriting, and phrase queries.
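Phrase queries are a useful example of why plain term lookups are not enough: the terms must appear adjacent and in order. One common way to support them is a positional index, sketched below. This is an assumption for illustration, not necessarily the approach the paper describes; `build_positional_index` and `phrase_search` are hypothetical names.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [positions where the term occurs]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_search(index, phrase):
    """Return doc ids in which the terms of `phrase` occur consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Candidates must contain every term somewhere.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        later = [set(index[t][doc_id]) for t in terms[1:]]
        for start in index[terms[0]][doc_id]:
            # The k-th following term must sit exactly k+1 positions after start.
            if all(start + k + 1 in later[k] for k in range(len(later))):
                hits.append(doc_id)
                break
    return hits


docs = {
    1: "new york city",
    2: "york is new",
    3: "a new york minute",
}
pindex = build_positional_index(docs)
print(phrase_search(pindex, "new york"))
```

Document 2 contains both words but not the phrase, so it is excluded.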
8
Validations: The author draws on his experience developing the Inktomi search engine to arrive at an improved search engine design. The author also studied the workings of various contemporary search engines such as Google, AltaVista, and Infoseek.
9
Assumptions: The author makes the following assumptions in the paper. DI applications are essentially like SEs and hence should be no different when it comes to applying database principles. When scoring a document, the shorter the document, the higher the score it should be assigned. Updates to the system can always happen offline. Documents from one site are evenly distributed across the cluster nodes for load balancing.
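Two of the assumptions above lend themselves to a small sketch: favoring shorter documents when scoring, and spreading one site's documents evenly across nodes. The formulas here are illustrative assumptions, not the paper's: `score` uses plain term frequency divided by document length, and `node_for` hashes the full URL (rather than the host name) so that pages from the same site land on different nodes. The URLs are made up.

```python
import hashlib
from collections import Counter


def score(term, doc_words):
    """Term frequency normalized by length: same hits in a shorter doc score higher."""
    if not doc_words:
        return 0.0
    return doc_words.count(term) / len(doc_words)


def node_for(url, num_nodes):
    """Pick a node from a stable hash of the full URL.

    Hashing the whole URL, not just the host, spreads one site's pages
    across the cluster instead of concentrating them on a single node.
    """
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


short_doc = ["db", "systems"]
long_doc = ["db", "systems", "and", "search"]
print(score("db", short_doc), score("db", long_doc))

# Hypothetical site with 1000 pages, distributed over 4 nodes.
urls = [f"http://example.com/page/{i}" for i in range(1000)]
load = Counter(node_for(u, 4) for u in urls)
print(sorted(load.items()))
```

The printed load per node should be roughly equal, which is what the load-balancing assumption relies on.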
10
Conclusion of the paper: Data-intensive systems should employ the principles of databases. Many systems are a good fit for DBMS principles (though they may not use the same artifacts): logging systems, the Google File System, and the batch-aware distributed file system.
11
Additional information that could be rewritten or added if the paper were written today: More emphasis and detail on logging, since companies like Google earn billions of dollars from advertising. How the following factors should affect the design of an SE: the probability of click attacks; privacy and copyright concerns while crawling the web; generic search vs. search within a particular domain, such as law or image search. A comparison of the proposed design with a currently popular search engine.
12
References: http://en.wikipedia.org/wiki/; http://en.wikipedia.org/wiki/; Sergey Brin and Lawrence Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine” (1998).
13
Thanks!!