HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. Presented by: Wen Zhang and Shawn Holbrook.

Roadmap
The Problem (introduction/background): MapReduce, Parallel DBMS, HadoopDB
The Approach (HadoopDB): System Architecture, Performance, Efficiency, Fault Tolerance
Benchmarks
Conclusion
Questions

The Problem
The amount of STRUCTURED data that needs to be analyzed is exploding, requiring hundreds to thousands of machines to work in parallel to perform the analysis.
Two major approaches:
Parallel DBMS: strengths and weaknesses
MapReduce: strengths and weaknesses

The Problem: Parallel DBMS
Strengths: Strong emphasis on performance and efficiency. Large scan operations, multidimensional aggregations, and joins are easy to parallelize across nodes in a shared-nothing network. Parallel databases have been proven to scale well into the tens of nodes.
Weaknesses: There are few known deployments of more than one hundred nodes, and none with 1000+. Parallel databases tend to be designed on the assumption that failures are rare, yet failures become increasingly common as nodes are added to a system. They also generally assume a homogeneous array of machines, which is nearly impossible to achieve.
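The point that failures become more common as nodes are added can be checked with a short calculation. The sketch below assumes that nodes fail independently and uses a hypothetical per-node failure probability of 0.001 per query; neither assumption comes from the slides.

```python
def prob_any_failure(num_nodes: int, per_node_failure_prob: float) -> float:
    """Probability that at least one of num_nodes independent nodes fails."""
    # All nodes survive with probability (1 - p)^n; the complement is
    # the chance that the query sees at least one failure.
    return 1.0 - (1.0 - per_node_failure_prob) ** num_nodes


# Hypothetical per-node failure probability, chosen only for illustration.
P_FAIL = 0.001

for n in (10, 100, 1000):
    print(n, round(prob_any_failure(n, P_FAIL), 3))
```

Under these assumptions the chance of hitting at least one failure grows steadily with cluster size, which is why a design that restarts the whole query on any failure becomes costly at 1000+ nodes.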

The Problem: MapReduce
Strengths: Well suited to performing analysis at this scale, with 1000+ nodes in a shared-nothing architecture.
Weaknesses: MapReduce was originally designed for a largely different application (unstructured text data processing), not for structured data analysis. It lacks DBMS features that are invaluable for structured data analysis workloads, and it lacks the benefits of modeling and loading data before processing. The result is performance an order of magnitude slower than parallel databases.

The Solution: HadoopDB
Ideally there should exist a combined solution with the scalability of MapReduce and the performance and efficiency of a parallel DBMS. This paper presents such a hybrid system: HadoopDB.
The basic idea: Use MapReduce as the communication layer above multiple nodes running single-node DBMS instances. Queries are expressed in SQL and translated into MapReduce jobs using HiveQL. As much work as possible is pushed into the higher-performing single-node databases.

Approach
[Diagram, built up over several slides, with the labels: SQL; MapReduce Job; node 1, node 2, ..., node N.]

SMS Planner: HiveQL
The SMS planner extends Apache Hive.
Apache Hive: Converts SQL to MapReduce. Creates a SQL query plan specifically for Hadoop. Is not aware of the parallel DBMS.
SMS: Extends Hive to take advantage of the parallel DBMS. Optimizes the Hive query plan.

SMS Planner: HiveQL (continued)
Example: SELECT YEAR(saleDate), SUM(revenue) FROM sales GROUP BY YEAR(saleDate);
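One way to picture how this query can be split between the node-local databases and MapReduce is sketched below. This is an illustration, not the SMS planner itself: in-memory SQLite databases stand in for the single-node DBMS instances, `strftime('%Y', ...)` stands in for `YEAR()` (which SQLite lacks), and the helper names `make_node`, `map_phase`, and `reduce_phase` are hypothetical. Each map task pushes the GROUP BY into its local database, and the reduce step only merges the per-node partial sums.

```python
import sqlite3
from collections import defaultdict


def make_node(rows):
    """Create an in-memory SQLite database standing in for one node's local DBMS."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (saleDate TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


# SQL executed inside each node-local database (the pushed-down work).
LOCAL_SQL = """
    SELECT CAST(strftime('%Y', saleDate) AS INTEGER) AS yr, SUM(revenue)
    FROM sales
    GROUP BY yr
"""


def map_phase(node):
    """Map task: run the aggregation locally and emit (year, partial_sum) pairs."""
    return node.execute(LOCAL_SQL).fetchall()


def reduce_phase(partials):
    """Reduce task: add up the partial sums that share the same year."""
    totals = defaultdict(float)
    for year, partial_sum in partials:
        totals[year] += partial_sum
    return dict(totals)


# Two hypothetical nodes, each holding a different partition of the sales table.
nodes = [
    make_node([("2008-01-15", 100.0), ("2009-03-02", 50.0)]),
    make_node([("2008-07-30", 25.0), ("2009-11-11", 75.0)]),
]
partials = [pair for node in nodes for pair in map_phase(node)]
result = reduce_phase(partials)
print(result)
```

Because each node returns one row per year rather than one row per sale, far less data crosses the network than if the map tasks emitted raw rows.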

Performance and Stability Benchmarks
Systems compared:
Vertica: parallel database system, column-store
DBMS-X: parallel database system, row-oriented
Hadoop: MapReduce
HadoopDB: hybrid

Performance and Stability Benchmarks
HadoopDB slightly outperforms Hadoop. However, both systems are outperformed by the parallel database systems. Vertica and DBMS-X compress their data, which significantly reduces I/O.

Performance and Stability Benchmarks
HadoopDB benefits from the optimizers present in database systems and outperforms Hadoop. This query is well suited to column-oriented storage, and Vertica significantly outperforms the other systems.

Performance and Stability Benchmarks
Hadoop's performance is limited by completely scanning the dataset on each node in order to evaluate the selection predicate. HadoopDB, DBMS-X, and Vertica all achieve higher performance: they take advantage of a DBMS index to accelerate the selection predicate and have native support for joins.

Fault Tolerance
Systems compared: Vertica (shared-nothing parallel DBMS), Hadoop with Hive (MapReduce only), HadoopDB (hybrid of MapReduce and parallel DBMS).
Scenarios: node failure and node slowdown.

Fault Tolerance (node failure)
Vertica: Total execution time increases because of the overhead of aborting the query and restarting it completely.
Hadoop (with Hive): Tasks of the failed node are distributed over free nodes that contain replicas of the data.
HadoopDB: Tasks of the failed node are distributed over free nodes that contain replicas of the data.
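The reassignment behavior described for Hadoop and HadoopDB can be sketched as follows. This is a toy model, not Hadoop's scheduler: the function `reassign_tasks` is hypothetical, and the choice of the least-loaded surviving replica holder is an assumption made for the illustration.

```python
def reassign_tasks(assignments, replicas, failed_node):
    """Move each task of failed_node to a live node that holds a replica.

    assignments: task -> node currently running it
    replicas:    task -> list of nodes holding a copy of the task's data
    """
    # Count how many tasks each surviving node is already running.
    load = {}
    for task, node in assignments.items():
        if node != failed_node:
            load[node] = load.get(node, 0) + 1

    new_assignments = dict(assignments)
    for task, node in assignments.items():
        if node != failed_node:
            continue
        candidates = [n for n in replicas[task] if n != failed_node]
        if not candidates:
            raise RuntimeError(f"no surviving replica for {task}")
        # Assumption for this sketch: prefer the least-loaded replica holder.
        target = min(candidates, key=lambda n: load.get(n, 0))
        new_assignments[task] = target
        load[target] = load.get(target, 0) + 1
    return new_assignments


assignments = {"t1": "A", "t2": "B", "t3": "A"}
replicas = {"t1": ["A", "B"], "t2": ["B", "C"], "t3": ["A", "C"]}
print(reassign_tasks(assignments, replicas, failed_node="A"))
```

Only the failed node's tasks are rerun; work already assigned to healthy nodes is untouched, in contrast to aborting and restarting the whole query.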

Fault Tolerance (node slowdown)
Vertica: Performance is determined by the time the slowest node takes to complete; the system waits for the straggler.
Hadoop (with Hive): Runs redundant tasks on free nodes.
HadoopDB: Runs redundant tasks on free nodes.
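The difference between waiting for a straggler and running redundant tasks can be shown with a small timing model. The numbers, the `threshold` at which a backup copy is launched, and both function names are hypothetical; real schedulers decide when to launch redundant tasks in more elaborate ways.

```python
def finish_time_waiting(task_times):
    """Without redundancy, the query ends when the slowest node finishes."""
    return max(task_times.values())


def finish_time_speculative(task_times, backup_time, threshold):
    """Launch a backup copy on a free node for any task still running at `threshold`.

    The task counts as done when either copy finishes; the backup starts
    at time `threshold` and takes `backup_time` to run.
    """
    finish_times = []
    for node, t in task_times.items():
        if t > threshold:
            finish_times.append(min(t, threshold + backup_time))
        else:
            finish_times.append(t)
    return max(finish_times)


# Hypothetical per-node task durations; n3 is the slowed-down node.
task_times = {"n1": 10, "n2": 11, "n3": 40}
print(finish_time_waiting(task_times))
print(finish_time_speculative(task_times, backup_time=10, threshold=15))
```

In this model the straggler stops dictating the finish time once a backup copy can complete sooner than the original task.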

Conclusions
HadoopDB is able to approach the performance of parallel database systems. PostgreSQL is not a column-store, and data compression was not used in PostgreSQL. Hadoop and Hive are relatively young open-source projects; we expect future releases to enhance performance.
HadoopDB achieves fault tolerance similar to that of Hadoop (MapReduce).
HadoopDB achieves a hybrid of the parallel DBMS and MapReduce.
HadoopDB operates successfully in heterogeneous environments.
HadoopDB achieves low cost because Hadoop is open source.

Questions???
