HadoopDB: An Architectural Hybrid of Map Reduce and DBMS Technologies for Analytical Workloads. Presented by: Wen Zhang and Shawn Holbrook.

Roadmap
  The Problem (introduction/background)
  Map Reduce
  Parallel DBMS
  HadoopDB
  The Approach (HadoopDB)
  System Architecture
  Performance
  Efficiency
  Fault tolerance
  Benchmarks
  Conclusion
  Questions

The Problem
The amount of structured data that needs to be analyzed is exploding, requiring hundreds to thousands of machines to work in parallel to perform the analysis.
Two major approaches:
  Parallel DBMS (strengths, weaknesses)
  Map Reduce (strengths, weaknesses)

The Problem: Parallel DBMS
Strengths:
  Strong emphasis on performance and efficiency.
  Large scan operations (e.g., multidimensional aggregations and joins) are easy to parallelize across nodes in a shared-nothing network.
  Parallel databases have been proven to scale really well into the tens of nodes.
Weaknesses:
  Few known deployments consist of more than one hundred nodes (no systems).
  Parallel databases tend to be designed with the assumption that failures are rare events, yet failures become increasingly common as nodes are added to a system.
  They generally assume a homogeneous array of machines, which is nearly impossible to achieve.

The Problem: Map Reduce
Strengths:
  Well suited for performing analysis at this scale across nodes in a shared-nothing architecture.
Weaknesses:
  Originally designed for a largely different application (unstructured text data processing), not for structured data analysis.
  Lacks invaluable DBMS features for structured data analysis workloads.
  Lacks the benefits of modeling and loading data before processing, resulting in performance an order of magnitude slower than parallel databases.

The Solution: HadoopDB
Ideally there should exist a combined solution:
  Scalability of MapReduce
  Performance and efficiency of parallel DBMS
This paper presents such a hybrid system: HadoopDB.
The basic idea:
  Use MapReduce as the communication layer above multiple nodes running single-node DBMS instances.
  Queries are expressed in SQL.
  Using HiveQL, queries are translated into MapReduce jobs.
  As much work as possible is pushed into the higher-performing single-node databases.

Approach
[Diagram built up over several slides; labels: SQL, Map Reduce Job, node 1, node 2, ..., node N]

SMS Planner - HiveQL
Extends Apache Hive.
Apache Hive:
  Converts SQL to Map Reduce
  Creates an SQL query plan
  Built specifically for Hadoop
  Not aware of parallel DBMS
SMS:
  Extends Hive to take advantage of parallel DBMS
  Optimizes the Hive query plan

SMS Planner - HiveQL (continued)
Example:
  SELECT YEAR(saleDate), SUM(revenue)
  FROM sales
  GROUP BY YEAR(saleDate);
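The example query can be split so that each node's database computes a partial sum per year and a reduce step merges the partial sums. The following is a minimal Python sketch of that idea, not HadoopDB's actual planner output. It assumes SQLite as a stand-in for the single-node DBMS, uses strftime in place of YEAR() (which SQLite lacks), and the helper names make_node, map_phase, and reduce_phase are hypothetical.

```python
import sqlite3
from collections import defaultdict


def make_node(rows):
    """Create one in-memory database holding a partition of the sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (saleDate TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


# Assumption: SQLite has no YEAR() function, so strftime('%Y', ...) replaces it.
LOCAL_SQL = """
    SELECT CAST(strftime('%Y', saleDate) AS INTEGER) AS yr, SUM(revenue)
    FROM sales
    GROUP BY yr
"""


def map_phase(node):
    """Push the partial aggregation down into the node's own database."""
    return node.execute(LOCAL_SQL).fetchall()


def reduce_phase(partials):
    """Merge the per-node (year, subtotal) pairs into global totals."""
    totals = defaultdict(float)
    for year, subtotal in partials:
        totals[year] += subtotal
    return dict(totals)


# Two hypothetical nodes, each storing a different partition of the data.
nodes = [
    make_node([("2008-01-15", 100.0), ("2009-03-02", 50.0)]),
    make_node([("2008-07-30", 25.0), ("2009-11-11", 75.0), ("2009-12-24", 10.0)]),
]
partials = [pair for node in nodes for pair in map_phase(node)]
result = reduce_phase(partials)
print(result)
```

Because SUM is distributive, summing the per-node subtotals gives the same answer as running the query over the whole table, which is why the grouping and aggregation can be pushed into the databases.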

Performance and Stability Benchmarks
Systems compared:
  Vertica: parallel database system, column store
  DBMS-X: parallel database system, row-oriented
  Hadoop: Map Reduce
  HadoopDB: hybrid

Performance and Stability Benchmarks
  HadoopDB slightly outperforms Hadoop.
  However, both systems are outperformed by the parallel database systems.
  Vertica and DBMS-X compress their data, which significantly reduces I/O.

Performance and Stability Benchmarks
  Benefit of the optimizers present in database systems: HadoopDB outperforms Hadoop.
  This query is well suited for column-oriented storage, so Vertica significantly outperforms the other systems.

Performance and Stability Benchmarks
  Hadoop: performance is limited by completely scanning the dataset on each node in order to evaluate the selection predicate.
  HadoopDB, DBMS-X, and Vertica all achieve higher performance:
    They take advantage of a DBMS index to accelerate the selection predicate.
    They have native support for joins.
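The difference between a full scan and an index-assisted selection can be observed in any single-node database. The sketch below is only an illustration using SQLite, not the benchmark setup from the slides. The rankings table, its columns, the threshold, and the index name are all assumed for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: 1000 pages with ranks cycling through 0..99.
conn.execute("CREATE TABLE rankings (pageURL TEXT, pageRank INTEGER)")
conn.executemany(
    "INSERT INTO rankings VALUES (?, ?)",
    [(f"http://example.com/{i}", i % 100) for i in range(1000)],
)

QUERY = "SELECT pageURL, pageRank FROM rankings WHERE pageRank > 95"


def plan(connection, sql):
    """Return SQLite's query plan description for the statement as one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " ".join(row[-1] for row in rows)


# Without an index the predicate is evaluated by scanning every row.
plan_without_index = plan(conn, QUERY)

# With an index on the predicate column the planner can search the index.
conn.execute("CREATE INDEX idx_rankings_pagerank ON rankings (pageRank)")
plan_with_index = plan(conn, QUERY)

matches = conn.execute(QUERY).fetchall()
print(plan_without_index)
print(plan_with_index)
print(len(matches))
```

The exact plan text varies between SQLite versions, so the sketch only relies on whether the plan mentions an index.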

Fault Tolerance
Systems compared:
  Vertica: shared-nothing parallel DBMS
  Hadoop (with Hive): Map Reduce only
  HadoopDB: hybrid of Map Reduce and parallel DBMS
Experiments: node failure and node slowdown

Fault Tolerance (node failure)
  Vertica: total execution time increases because of the overhead of aborting the query and restarting it completely.
  Hadoop (with Hive): tasks of the failed node are distributed over free nodes that contain replicas of the data.
  HadoopDB: tasks of the failed node are distributed over free nodes that contain replicas of the data.
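The reassignment of a failed node's tasks to replica holders can be sketched as below. This is a simplified illustration, not Hadoop's actual scheduler: the assign_tasks function, the block and node names, and the least-loaded tie-breaking rule are all assumptions made for the example.

```python
def assign_tasks(block_replicas, live_nodes):
    """Assign each block's task to the least-loaded live node holding a replica."""
    load = {node: 0 for node in live_nodes}
    assignment = {}
    for block, replicas in sorted(block_replicas.items()):
        # Only nodes that are alive and store a copy of the block qualify.
        candidates = [node for node in replicas if node in load]
        if not candidates:
            raise RuntimeError(f"no live replica for {block}")
        # Break load ties by node name so the result is deterministic.
        chosen = min(candidates, key=lambda node: (load[node], node))
        assignment[block] = chosen
        load[chosen] += 1
    return assignment


# Hypothetical layout: every block is stored on two of the three nodes.
block_replicas = {
    "block-a": ["node1", "node2"],
    "block-b": ["node2", "node3"],
    "block-c": ["node1", "node3"],
}

before = assign_tasks(block_replicas, {"node1", "node2", "node3"})
after = assign_tasks(block_replicas, {"node1", "node3"})  # node2 has failed
print(before)
print(after)
```

Only the work that belonged to the failed node moves, so the query continues without the complete restart described for Vertica.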

Fault Tolerance (node slowdown)
  Vertica: performance is determined by the time the slowest node takes to complete; the query waits for the straggler.
  Hadoop (with Hive): runs redundant tasks on free nodes.
  HadoopDB: runs redundant tasks on free nodes.
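Running a redundant copy of a task on a free node and taking whichever copy finishes first can be sketched with threads. This is a toy model under stated assumptions, not Hadoop's implementation: the node names, the delays, and the run_with_backup helper are hypothetical, and sleep calls stand in for real work.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_task(node, delay, payload):
    """Simulate a task on a node; the delay models how slow the node is."""
    time.sleep(delay)
    return node, sum(payload)


def run_with_backup(payload, primary, backup):
    """Run the same task on two nodes and return the first result to arrive."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [
            pool.submit(run_task, name, delay, payload)
            for name, delay in (primary, backup)
        ]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        # Note: leaving the with-block still waits for the slower thread.
        # A real framework would kill the losing attempt instead.
        return next(iter(done)).result()


winner, total = run_with_backup(
    [1, 2, 3],
    primary=("slow-node", 0.3),   # the straggler
    backup=("free-node", 0.01),   # redundant copy on a free node
)
print(winner, total)
```

Both copies compute the same result, so the job's outcome does not depend on which attempt wins; only the completion time does.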

Conclusions
  HadoopDB is able to approach the performance of parallel database systems.
    PostgreSQL is not a column store, and data compression was not used in PostgreSQL.
    Hadoop and Hive are relatively young open-source projects. We expect future releases to enhance performance.
  HadoopDB achieves fault tolerance similar to that of Hadoop (Map Reduce).
  HadoopDB achieves a hybrid of the parallel DBMS and Map Reduce approaches.
  HadoopDB operates successfully in heterogeneous environments.
  HadoopDB achieves low cost because Hadoop is open source.

Questions???