Published by Lilian Ray. Modified over 9 years ago.
1
HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
Presented by: Wen Zhang and Shawn Holbrook
2
Roadmap The Problem (introduction/background) Map Reduce Parallel DBMS HadoopDB The Approach (HadoopDB) System Architecture Performance Efficiency Fault tolerance Benchmarks Conclusion Questions
3
The Problem
The amount of structured data that needs to be analyzed is exploding, requiring hundreds to thousands of machines to work in parallel to perform the analysis.
Two major approaches:
Parallel DBMS: strengths and weaknesses
Map Reduce: strengths and weaknesses
4
The Problem: Parallel DBMS
Strengths:
Strong emphasis on performance and efficiency.
Large scan operations (e.g. multidimensional aggregations and joins) are easy to parallelize across nodes in a shared-nothing network.
Parallel databases have been proven to scale well into the tens of nodes.
Weaknesses:
There are few known deployments of more than one hundred nodes, and none of 1000+ nodes.
Parallel databases tend to be designed on the assumption that failures are rare, but failures become increasingly common as nodes are added to a system.
They generally assume a homogeneous array of machines, which is nearly impossible to achieve at scale.
5
The Problem: Map Reduce
Strengths:
Well suited to performing analysis at this scale: 1000+ nodes in a shared-nothing architecture.
Weaknesses:
Map Reduce was originally designed for a largely different application (unstructured text data processing), not for structured data analysis.
It lacks invaluable DBMS features for structured data analysis workloads.
It lacks the benefits of modeling and loading data before processing, which makes it an order of magnitude slower than parallel databases.
6
The Solution: HadoopDB
Ideally there should exist a combined solution that offers the scalability of MapReduce and the performance and efficiency of a parallel DBMS.
This paper presents such a hybrid system: HadoopDB.
The basic idea:
Use MapReduce as the communication layer above multiple nodes running single-node DBMS instances.
Queries are expressed in SQL.
Using HiveQL, queries are translated into MapReduce jobs.
As much work as possible is pushed into the higher-performing single-node databases.
8
Approach
[Diagram built up over slides 8 to 13. Labels: SQL, Map Reduce Job, node 1, node 2, ..., node N.]
15
SMS Planner - HiveQL
The SMS planner extends Apache Hive.
Apache Hive:
Converts SQL to Map Reduce.
Creates a SQL query plan.
Built specifically for Hadoop.
Not aware of the parallel DBMS.
SMS:
Extends Hive to take advantage of the parallel DBMS.
Optimizes the Hive query plan.
16
SMS Planner - HiveQL (continued)
Example:
SELECT YEAR(saleDate), SUM(revenue) FROM sales GROUP BY YEAR(saleDate);
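One way to picture how this example query could be split between the databases and Map Reduce is sketched below. It is an illustration under stated assumptions, not the SMS planner itself: in-memory SQLite stands in for the single-node DBMS on each node, `strftime` stands in for `YEAR()`, and the sample rows and function names (`make_node`, `map_phase`, `reduce_phase`, `run_query`) are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: each inner list is the `sales` partition on one node.
PARTITIONS = [
    [("2008-03-01", 100.0), ("2009-07-15", 50.0)],
    [("2008-11-20", 25.0), ("2009-01-02", 75.0)],
    [("2009-05-05", 10.0)],
]

# SQL pushed down to every node. SQLite has no YEAR(), so strftime is used here.
LOCAL_SQL = """
    SELECT CAST(strftime('%Y', saleDate) AS INTEGER) AS yr, SUM(revenue)
    FROM sales
    GROUP BY yr
"""


def make_node(rows):
    """Create one single-node database holding one partition (SQLite as a stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (saleDate TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


def map_phase(node):
    """Map task: run the pushed-down SQL locally and emit (year, partial_sum) pairs."""
    return node.execute(LOCAL_SQL).fetchall()


def reduce_phase(partials):
    """Reduce task: add up the partial sums that share the same year."""
    totals = defaultdict(float)
    for year, partial_sum in partials:
        totals[year] += partial_sum
    return dict(totals)


def run_query(partitions):
    nodes = [make_node(rows) for rows in partitions]
    partials = [pair for node in nodes for pair in map_phase(node)]
    return reduce_phase(partials)


RESULT = run_query(PARTITIONS)
print(RESULT)
```

The grouping and summing of raw rows happens inside each local database; the Map Reduce layer only moves and merges the small per-year partial sums.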
20
Performance and Stability Benchmarks
Systems compared:
Vertica: parallel database system, column-store
DBMS-X: parallel database system, row-oriented
Hadoop: Map Reduce
HadoopDB: hybrid
21
Performance and Stability Benchmarks
HadoopDB slightly outperforms Hadoop.
However, both systems are outperformed by the parallel database systems.
Vertica and DBMS-X compress their data, which significantly reduces I/O.
22
Performance and Stability Benchmarks
HadoopDB benefits from the optimizers present in database systems: HadoopDB outperforms Hadoop.
This query is well suited to column-oriented storage: Vertica significantly outperforms the other systems.
23
Performance and Stability Benchmarks
Hadoop: performance is limited by having to scan the dataset completely on each node in order to evaluate the selection predicate.
HadoopDB, DBMS-X, and Vertica all achieve higher performance:
They take advantage of a DBMS index to accelerate the selection predicate.
They have native support for joins.
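The difference between a full scan and an index-accelerated selection can be seen in a query plan. The sketch below uses in-memory SQLite as a stand-in for a single-node DBMS; the `pages` table, its columns, the index name, and the predicate are hypothetical and are not the benchmark schema.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the human-readable detail column of SQLite's EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, score INTEGER)")  # hypothetical table
conn.executemany(
    "INSERT INTO pages VALUES (?, ?)",
    [(f"url{i}", i % 100) for i in range(1000)],
)

# Hypothetical selection predicate.
QUERY = "SELECT url, score FROM pages WHERE score > 90"

# Without an index, the predicate can only be evaluated by scanning every row.
plan_without_index = query_plan(conn, QUERY)

# With an index on the predicate column, the planner can seek to matching rows.
conn.execute("CREATE INDEX idx_score ON pages (score)")
plan_with_index = query_plan(conn, QUERY)

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
```

The exact plan wording depends on the SQLite version, but the first plan reports a scan of the table and the second reports a search using the index.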
24
Fault Tolerance
Systems compared:
Vertica: shared-nothing parallel DBMS
Hadoop (with Hive): Map Reduce only
HadoopDB: hybrid of Map Reduce and parallel DBMS
Scenarios: node failure, node slowdown
25
Fault Tolerance (node failure)
Vertica: total execution time increases because of the overhead of aborting the query and restarting it completely.
Hadoop (with Hive): tasks of the failed node are distributed over free nodes that contain replicas of the data.
HadoopDB: tasks of the failed node are distributed over free nodes that contain replicas of the data.
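The rescheduling behavior described for Hadoop and HadoopDB can be sketched as follows. This is a simplified illustration, not Hadoop's scheduler: the least-loaded tie-breaking rule, the task and node names, and the `reassign_tasks` function are assumptions made for the example.

```python
from collections import Counter


def reassign_tasks(assignments, replicas, failed_node):
    """Move tasks off a failed node onto surviving nodes that hold a replica.

    assignments: task -> node currently running it
    replicas:    task -> set of nodes holding a copy of that task's data
    """
    # Current load on each surviving node.
    load = Counter(n for n in assignments.values() if n != failed_node)
    new_assignments = {}
    for task, node in sorted(assignments.items()):
        if node != failed_node:
            new_assignments[task] = node
            continue
        candidates = [n for n in replicas[task] if n != failed_node]
        if not candidates:
            raise RuntimeError(f"no surviving replica for {task}")
        # Assumed policy: pick the least-loaded replica holder, name as tie-break.
        target = min(candidates, key=lambda n: (load[n], n))
        load[target] += 1
        new_assignments[task] = target
    return new_assignments


# Hypothetical cluster state.
ASSIGNMENTS = {"t1": "node1", "t2": "node2", "t3": "node2", "t4": "node3"}
REPLICAS = {
    "t1": {"node1", "node3"},
    "t2": {"node2", "node1"},
    "t3": {"node2", "node3"},
    "t4": {"node3", "node1"},
}

AFTER_FAILURE = reassign_tasks(ASSIGNMENTS, REPLICAS, "node2")
print(AFTER_FAILURE)
```

Only the failed node's tasks are rerun; work already assigned to healthy nodes is untouched, in contrast to aborting and restarting the whole query.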
26
Fault Tolerance (node slowdown)
Vertica: performance is determined by the time it takes the slowest node to complete; the system waits for the straggler.
Hadoop (with Hive): runs redundant tasks on free nodes.
HadoopDB: runs redundant tasks on free nodes.
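A small back-of-the-envelope model shows why redundant tasks help with a straggler. The timings, the rule for when a backup copy starts, and both function names are assumptions for illustration; Hadoop's actual speculative-execution heuristics are more involved.

```python
def wait_for_straggler(task_times):
    """Completion time when the job must wait for every original task."""
    return max(task_times)


def with_redundant_tasks(task_times, backup_start, backup_time):
    """Completion time when unfinished tasks get a redundant copy on a free node.

    At time `backup_start`, every task still running is also started on a free
    node, where it takes `backup_time`. Whichever copy finishes first counts.
    """
    finish_times = []
    for t in task_times:
        if t > backup_start:
            finish_times.append(min(t, backup_start + backup_time))
        else:
            finish_times.append(t)
    return max(finish_times)


# Hypothetical timings in seconds: three healthy nodes and one slow node.
TASK_TIMES = [10, 10, 10, 40]

print(wait_for_straggler(TASK_TIMES))  # 40
print(with_redundant_tasks(TASK_TIMES, backup_start=10, backup_time=10))  # 20
```

In this model the job without redundancy takes as long as the slowest node, while the redundant copy caps the job at the time a healthy node needs to redo the straggler's task.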
28
Conclusions
HadoopDB is able to approach the performance of parallel database systems. Factors limiting its performance:
PostgreSQL is not a column-store.
Data compression was not used in PostgreSQL.
Hadoop and Hive are relatively young open-source projects; future releases are expected to enhance performance.
HadoopDB achieves fault tolerance similar to that of Hadoop (Map Reduce).
It achieves a hybrid of the parallel DBMS and Map Reduce.
HadoopDB operates successfully in heterogeneous environments.
HadoopDB achieves low cost because Hadoop is open source.
29
Questions?