HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. Presented by: Wen Zhang and Shawn Holbrook.


1 HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. Presented by: Wen Zhang and Shawn Holbrook

2 Roadmap: The Problem (introduction/background): MapReduce, Parallel DBMS, HadoopDB; The Approach (HadoopDB): System Architecture, Performance, Efficiency, Fault tolerance; Benchmarks; Conclusion; Questions

3 The Problem: The amount of structured data that needs to be analyzed is exploding, requiring hundreds to thousands of machines to work in parallel to perform the analysis. Two major approaches: Parallel DBMS (strengths, weaknesses) and MapReduce (strengths, weaknesses).

4 The Problem: Parallel DBMS. The amount of structured data that needs to be analyzed is exploding, requiring hundreds to thousands of machines to work in parallel to perform the analysis. Two major approaches: Parallel DBMS and MapReduce. Parallel DBMS strengths: strong emphasis on performance and efficiency; large scan operations (e.g. multidimensional aggregations and joins) are easy to parallelize across nodes in a shared-nothing network; parallel databases have been proven to scale well into the tens of nodes. Parallel DBMS weaknesses: few known deployments consist of more than one hundred nodes (none with 1000+); parallel databases tend to be designed on the assumption that failures are rare, yet failures become increasingly common as nodes are added to a system; they generally assume a homogeneous array of machines (nearly impossible to achieve).

5 The Problem: MapReduce. The amount of structured data that needs to be analyzed is exploding, requiring hundreds to thousands of machines to work in parallel to perform the analysis. Two major approaches: Parallel DBMS and MapReduce. MapReduce strengths: well suited for performing analysis at this scale (1000+ nodes in a shared-nothing architecture). MapReduce weaknesses: it was originally designed for a largely different application (unstructured text data processing), not for structured data analysis; it lacks invaluable DBMS features for structured data analysis workloads; it lacks the benefits of modeling and loading data before processing, which results in performance an order of magnitude slower than parallel databases.

6 The Solution: HadoopDB. Ideally there should exist a combined solution: the scalability of MapReduce with the performance and efficiency of a parallel DBMS. This paper presents such a hybrid system, HadoopDB. The basic idea: use MapReduce as the communication layer above multiple nodes running single-node DBMS instances; queries are expressed in SQL; queries are translated into MapReduce jobs using HiveQL; as much work as possible is pushed into the higher-performing single-node databases.
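The basic idea on this slide can be illustrated with a minimal sketch. It is not HadoopDB's implementation: SQLite stands in for the single-node DBMS, a plain Python loop stands in for the MapReduce layer, and the table, the data, and the names `make_node`, `map_task`, and `run_map_only_job` are hypothetical. It shows a selection being pushed entirely into each node's database, so the coordinating job only collects rows.

```python
import sqlite3

def make_node(rows):
    """Create one in-memory database holding a partition of the data.
    (SQLite is an assumption; it stands in for a single-node DBMS.)"""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (saleDate TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn

def map_task(conn, sql, params=()):
    """Map task: run the pushed-down SQL on the local database, emit its rows."""
    return conn.execute(sql, params).fetchall()

def run_map_only_job(nodes, sql, params=()):
    """Coordinator: one map task per node, outputs concatenated, no reduce phase."""
    output = []
    for conn in nodes:
        output.extend(map_task(conn, sql, params))
    return output

# Two hypothetical nodes, each holding its own partition of `sales`.
nodes = [
    make_node([("2008-01-15", 100.0), ("2009-03-02", 40.0)]),
    make_node([("2008-07-30", 250.0), ("2009-11-11", 10.0)]),
]

# The filter is evaluated inside each node's database, not in the job layer.
big_sales = run_map_only_job(
    nodes, "SELECT saleDate, revenue FROM sales WHERE revenue > ?", (50.0,)
)
```

In this sketch the job layer never inspects individual rows; it only schedules the SQL and gathers results, which is the division of labor the slide describes.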

7 Roadmap: The Problem (introduction/background): MapReduce, Parallel DBMS, HadoopDB; The Approach (HadoopDB): System Architecture, Performance, Efficiency, Fault tolerance; Benchmarks; Conclusion; Questions

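8 Approach [diagram labels: MapReduce job]

9 Approach [diagram labels: MapReduce job; SQL]

10 Approach [diagram labels: MapReduce job; node 1, node 2, ..., node N; MapReduce job; SQL]

11 Approach [diagram labels: node 1, node 2, ..., node N; MapReduce job; SQL]

12 Approach [diagram labels: node 1, node 2, ..., node N; MapReduce job]

13 Approach [diagram labels: node 1, node 2, ..., node N; MapReduce job]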

14 Roadmap: The Problem (introduction/background): MapReduce, Parallel DBMS, HadoopDB; The Approach (HadoopDB): System Architecture, Performance, Efficiency, Fault tolerance; Benchmarks; Conclusion; Questions

15 SMS Planner - HiveQL: extends Apache Hive. Apache Hive: converts SQL to MapReduce; creates a SQL query plan specifically for Hadoop; is not aware of the parallel DBMS. SMS (SQL to MapReduce to SQL): extends Hive to take advantage of the parallel DBMS; optimizes the Hive query plan.

16 SMS Planner – HiveQL (continued). EXAMPLE: SELECT YEAR(saleDate), SUM(revenue) FROM sales GROUP BY YEAR(saleDate);

17 SMS Planner – HiveQL (continued). EXAMPLE: SELECT YEAR(saleDate), SUM(revenue) FROM sales GROUP BY YEAR(saleDate);

18 SMS Planner – HiveQL (continued). EXAMPLE: SELECT YEAR(saleDate), SUM(revenue) FROM sales GROUP BY YEAR(saleDate);
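The query plans drawn on these slides did not survive in the transcript. As a rough illustration only, here is one way the example query could be split between the databases and a reduce step when `sales` is not partitioned by year: each node computes partial sums in SQL, and the reduce step adds them up. SQLite (with `strftime` in place of `YEAR`), the sample data, and the names `LOCAL_SQL`, `map_phase`, and `reduce_phase` are all assumptions of this sketch, not the planner's actual output.

```python
import sqlite3
from collections import defaultdict

# Pushed-down part: a per-node partial aggregate.
# SQLite has no YEAR(), so strftime('%Y', ...) is used here instead.
LOCAL_SQL = """
    SELECT CAST(strftime('%Y', saleDate) AS INTEGER), SUM(revenue)
    FROM sales
    GROUP BY strftime('%Y', saleDate)
"""

def make_node(rows):
    """One in-memory database per hypothetical node."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (saleDate TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn

def map_phase(conn):
    """Map: run the partial aggregate locally, emit (year, partial_sum) pairs."""
    return conn.execute(LOCAL_SQL).fetchall()

def reduce_phase(pairs):
    """Reduce: combine the partial sums that share the same year."""
    totals = defaultdict(float)
    for year, partial_sum in pairs:
        totals[year] += partial_sum
    return dict(totals)

nodes = [
    make_node([("2008-01-15", 100.0), ("2009-03-02", 40.0)]),
    make_node([("2008-07-30", 250.0), ("2009-11-11", 10.0)]),
]
pairs = [pair for conn in nodes for pair in map_phase(conn)]
totals = reduce_phase(pairs)
```

Because `SUM` can be computed from partial sums, only one row per year per node crosses the network in this sketch, rather than every sales record.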

19 Roadmap: The Problem (introduction/background): MapReduce, Parallel DBMS, HadoopDB; The Approach (HadoopDB): System Architecture, Performance, Efficiency, Fault tolerance; Benchmarks; Conclusion; Questions

20 Performance and Stability Benchmarks. Systems compared: Vertica (parallel database system, column-store); DBMS-X (parallel database system, row-oriented); Hadoop (MapReduce); HadoopDB (hybrid).

21 Performance and Stability Benchmarks. HadoopDB slightly outperforms Hadoop. However, both systems are outperformed by the parallel database systems. Vertica and DBMS-X compress their data, which significantly reduces I/O.

22 Performance and Stability Benchmarks. Benefit of the optimizers present in database systems: HadoopDB outperforms Hadoop. This query is well suited for column-oriented storage: Vertica significantly outperforms the other systems.

23 Performance and Stability Benchmarks. Hadoop: performance is limited by completely scanning the dataset on each node in order to evaluate the selection predicate. HadoopDB, DBMS-X, and Vertica all achieve higher performance: they take advantage of a DBMS index to accelerate the selection predicate and have native support for joins.
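The difference between a full scan and an index-assisted selection can be seen on a single node with a small sketch. SQLite is used purely as a stand-in DBMS, and the table `rankings`, its columns, the index name `idx_rank`, and the predicate are hypothetical. The helper `plan` returns the text of SQLite's `EXPLAIN QUERY PLAN` output, whose exact wording varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rankings (pageURL TEXT, pageRank INTEGER)")
conn.executemany(
    "INSERT INTO rankings VALUES (?, ?)",
    [(f"url{i}", i % 100) for i in range(1000)],
)

QUERY = "SELECT pageURL, pageRank FROM rankings WHERE pageRank > 90"

def plan(connection, sql):
    """Return the planner's description of how it would execute `sql`."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each row holds the human-readable detail text.
    return " ".join(str(row[-1]) for row in rows)

# Without an index the predicate can only be evaluated by scanning the table.
plan_without_index = plan(conn, QUERY)

# With an index on the predicate column the planner can search the index.
conn.execute("CREATE INDEX idx_rank ON rankings (pageRank)")
plan_with_index = plan(conn, QUERY)
```

Printing the two plan strings shows a table scan before the index exists and a search using `idx_rank` afterwards, which mirrors the contrast the slide draws between Hadoop and the database-backed systems.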

24 Fault Tolerance. Systems: Vertica (shared-nothing parallel DBMS); Hadoop with Hive (MapReduce only); HadoopDB (hybrid of MapReduce and parallel DBMS). Scenarios: node failure; node slowdown.

25 Fault Tolerance (node failure). Vertica: increase in total execution time; overhead for query abortion and complete restart. Hadoop (with Hive): tasks of the failed node are distributed over free nodes that contain replicas of the data. HadoopDB: tasks of the failed node are distributed over free nodes that contain replicas of the data.
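The rescheduling behavior described for Hadoop and HadoopDB can be sketched as follows. This is a simplified model, not Hadoop's scheduler: the function `reassign_failed_tasks`, the task and node names, and the least-loaded tie-breaking rule are assumptions made for illustration.

```python
def reassign_failed_tasks(assignment, replicas, failed_node):
    """Move each task of `failed_node` to the least-loaded live node
    that holds a replica of the task's data.

    assignment: {task_id: node currently running the task}
    replicas:   {task_id: [nodes holding a copy of the task's data]}
    """
    # Count the tasks already running on each surviving node.
    load = {}
    for node in assignment.values():
        if node != failed_node:
            load[node] = load.get(node, 0) + 1

    new_assignment = dict(assignment)
    for task, node in sorted(assignment.items()):
        if node != failed_node:
            continue
        candidates = [n for n in replicas[task] if n != failed_node]
        if not candidates:
            raise RuntimeError(f"no live replica for task {task}")
        # Prefer the node with the fewest tasks; break ties by name.
        target = min(candidates, key=lambda n: (load.get(n, 0), n))
        new_assignment[task] = target
        load[target] = load.get(target, 0) + 1
    return new_assignment

# Hypothetical cluster state: node2 fails while running t2 and t3.
assignment = {"t1": "node1", "t2": "node2", "t3": "node2", "t4": "node3"}
replicas = {
    "t1": ["node1", "node2"],
    "t2": ["node2", "node3"],
    "t3": ["node2", "node1"],
    "t4": ["node3", "node1"],
}
result = reassign_failed_tasks(assignment, replicas, "node2")
```

Only the failed node's tasks are rerun in this model, in contrast to the abort-and-restart behavior the slide attributes to Vertica.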

26 Fault Tolerance (node slowdown). Vertica: performance is determined by the time it takes the slowest node to complete; it waits for the straggler to finish. Hadoop (with Hive): runs redundant tasks on free nodes. HadoopDB: runs redundant tasks on free nodes.
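A back-of-the-envelope model shows why redundant tasks help with stragglers. The timings, the `threshold` after which a backup copy is launched, and both function names are made up for this sketch; real schedulers decide when to launch backups differently.

```python
def job_time_waiting(task_times):
    """No redundant tasks: the job ends when the slowest node finishes."""
    return max(task_times.values())

def job_time_speculative(task_times, backup_time, threshold):
    """Redundant tasks: a task still running at `threshold` gets a backup
    copy on a free node that needs `backup_time` more time units.
    The task counts as done when either copy finishes."""
    finish_times = []
    for elapsed in task_times.values():
        if elapsed > threshold:
            finish_times.append(min(elapsed, threshold + backup_time))
        else:
            finish_times.append(elapsed)
    return max(finish_times)

# Hypothetical per-node task durations; node3 is the straggler.
times = {"node1": 10.0, "node2": 11.0, "node3": 40.0}

waiting = job_time_waiting(times)
speculative = job_time_speculative(times, backup_time=10.0, threshold=15.0)
```

With these invented numbers the job takes 40.0 time units when it waits for the straggler and 25.0 when a backup task is launched at time 15.0.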

27 Roadmap: The Problem (introduction/background): MapReduce, Parallel DBMS, HadoopDB; The Approach (HadoopDB): System Architecture, Performance, Efficiency, Fault tolerance; Benchmarks; Conclusion; Questions

28 Conclusions. HadoopDB is able to approach the performance of parallel database systems. Caveats: PostgreSQL is not a column-store; data compression was not used in PostgreSQL; Hadoop and Hive are relatively young open-source projects, and we expect future releases to enhance performance. HadoopDB achieves fault tolerance similar to that of Hadoop (MapReduce). It achieves a hybrid of the parallel DBMS and MapReduce. HadoopDB operates successfully in heterogeneous environments. HadoopDB achieves low cost because Hadoop is open source.

29 Questions?
