PARALLEL DBMS VS MAP REDUCE “MapReduce and parallel DBMSs: friends or foes?” Stonebraker, Daniel Abadi, David J Dewitt et al.


MAP REDUCE AND PARALLEL DBMS ARE COMPLEMENTARY
- In 2010, MapReduce (MR) was being hailed as a revolutionary new platform for large-scale, massively parallel data access.
- Some proponents claimed that the extreme scalability of MR would relegate relational database management systems (DBMSs) to the status of legacy technology.
- It was later found that using MR systems for tasks best suited to DBMSs yields less than satisfactory results.
- As such, MR complements DBMS technology rather than competing with it.
- Parallel DBMSs first became available nearly two decades earlier.
- As robust, high-performing platforms, they provide a high-level programming environment that is inherently parallelizable.
- Almost any parallel processing task can be written either as a set of database queries or as a set of MR jobs.

THE SHARED-NOTHING ARCHITECTURE OF PARALLEL DBMS
- The initial parallel DBMSs used a shared-nothing architecture with horizontal partitioning of relational tables.
- Horizontal partitioning is critical to obtaining scalable performance for SQL queries.
- It leads to the concept of partitioned execution of SQL operators such as selection, aggregation, and join.

HORIZONTAL PARTITIONING
- The idea behind horizontal partitioning is to distribute the rows of a relational table across the nodes of a cluster so that they can be processed in parallel.
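A minimal sketch of the idea, assuming two common placement schemes (round-robin and hashing on a column). The function names, the toy `sales` rows, and the three-node cluster are hypothetical and chosen only for illustration; a real parallel DBMS places rows on separate machines rather than in Python lists.

```python
def round_robin_partition(rows, num_nodes):
    """Assign row i to node i mod num_nodes (even spread, ignores row content)."""
    nodes = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        nodes[i % num_nodes].append(row)
    return nodes


def hash_partition(rows, num_nodes, key):
    """Assign each row to a node by hashing one column, so equal keys co-locate."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[hash(row[key]) % num_nodes].append(row)
    return nodes


# Hypothetical toy table: (custId, amount) pairs.
sales = [
    {"custId": c, "amount": a}
    for c, a in [(1, 10.0), (2, 7.0), (1, 5.0), (3, 2.0), (2, 3.0), (1, 1.0)]
]

rr_nodes = round_robin_partition(sales, 3)
hashed_nodes = hash_partition(sales, 3, "custId")
```

Round-robin placement balances the number of rows per node but scatters a given customer's rows across nodes; hash placement keeps all rows with the same key on one node, which is what a later per-key aggregation needs.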

MAP REDUCE EXAMPLE IN PARALLEL DBMS
    SELECT custId, SUM(amount) FROM Sales WHERE date BETWEEN '12/01/2009' AND '12/25/2009' GROUP BY custId
- The Sales table is round-robin partitioned across the nodes in the cluster.
- A SELECT operator on each node scans the fragment of the Sales table stored at that node.
- Rows satisfying the date predicate are passed to a SHUFFLE operator, which dynamically repartitions them by hashing on custId so that all rows for a given customer arrive at the same node.
- Each node then aggregates the rows it receives to produce the final total for each customer.
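The three stages described above (scan and filter on each node, shuffle by hashing on custId, per-node aggregation) can be sketched in a single process as follows. This is an illustration under stated assumptions, not how any particular DBMS is implemented: the operator function names, the two-node cluster, and the sample rows are all hypothetical, and the nodes are simulated as Python lists.

```python
from collections import defaultdict
from datetime import date


def select_op(fragment, start, end):
    """Scan one node's fragment and keep rows whose date is in [start, end]."""
    return [r for r in fragment if start <= r["date"] <= end]


def shuffle_op(selected_per_node, num_nodes):
    """Repartition the selected rows by hashing on custId."""
    repartitioned = [[] for _ in range(num_nodes)]
    for rows in selected_per_node:
        for r in rows:
            repartitioned[hash(r["custId"]) % num_nodes].append(r)
    return repartitioned


def sum_op(fragment):
    """Aggregate one node's rows into a total per customer."""
    totals = defaultdict(float)
    for r in fragment:
        totals[r["custId"]] += r["amount"]
    return dict(totals)


def run_query(fragments, start, end):
    """Run select -> shuffle -> sum over all simulated nodes and merge results."""
    selected = [select_op(f, start, end) for f in fragments]
    shuffled = shuffle_op(selected, len(fragments))
    result = {}
    for f in shuffled:
        # After the shuffle, each custId lives on exactly one node,
        # so the per-node results never collide.
        result.update(sum_op(f))
    return result


# Hypothetical round-robin fragments on a two-node cluster.
fragments = [
    [
        {"custId": 1, "amount": 10.0, "date": date(2009, 12, 2)},
        {"custId": 2, "amount": 7.0, "date": date(2009, 11, 30)},
    ],
    [
        {"custId": 1, "amount": 5.0, "date": date(2009, 12, 24)},
        {"custId": 2, "amount": 3.0, "date": date(2009, 12, 10)},
    ],
]

totals = run_query(fragments, date(2009, 12, 1), date(2009, 12, 25))
```

The same structure maps loosely onto an MR job: the filter plays the role of the map function, the shuffle is the framework's partitioning of map output by key, and the per-node sum is the reduce function.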

MAP-REDUCE ADVANTAGES
- MR is advantageous for ETL and read-once data sets: a DBMS must parse and verify each datum in a tuple before loading it, while MR does not.
- The distributed infrastructure used to implement MR is cheap.
- MR scales horizontally better than parallel DBMSs.
- MR has an open source implementation with detailed documentation.
- There is no popular open source parallel DBMS; all the popular ones come from commercial vendors.

Comparison - Parallel DBMS over MapReduce
Experimental setup:
- The most popular implementations of MR and parallel DBMSs were used.
- The results presented are those achieved after best tuning.

Task name       Hadoop   DBMS-X   Vertica   Hadoop/DBMS-X   Hadoop/Vertica
MR Grep task    284s     194s     108s      1.5x            2.6x
Web log task    1146s    740s     268s      1.6x            4.3x
Join task       1158s    32s      55s       36.3x           21.0x

1. MR Grep task - Each system must scan through a data set of 100-byte records looking for a three-character pattern.
2. Web log task - A conventional SQL aggregation with a GROUP BY clause on a table of user visits in a Web server log.
3. Join task - A fairly complex join operation over two tables, requiring an additional aggregation and filtering operation.

Reasons why PDBMS outperforms MapReduce in experiment
1. Repetitive record parsing - The default configuration of Hadoop stores data in the accompanying distributed file system (HDFS) in the same textual format in which the data was generated. Map and reduce tasks must therefore parse the fields of each record again on every run, whereas a DBMS parses records once, at load time.
2. Compression - Enabling data compression in the DBMSs delivered a much more significant performance gain than it did in MR. The reason is unknown.
3. Pipelining - Writing data structures to disk gives Hadoop a convenient way to checkpoint the output of intermediate map jobs, thereby improving fault tolerance, but it adds significant performance overhead.
4. Scheduling - In a parallel DBMS, each node knows exactly what it must do and when it must do it, according to the distributed query plan. In an MR system, each task is scheduled on processing nodes one storage block at a time.
5. Column-oriented storage - In a column-store database (such as Vertica), the system reads only the attributes necessary for answering the user query.
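The column-oriented storage point can be illustrated with a toy comparison. This is only a sketch: the table contents and function names are hypothetical, and the code counts field accesses in memory as a stand-in for the disk I/O a real system would save.

```python
def to_columns(rows):
    """Convert a list of row dicts into one list per attribute (column layout)."""
    columns = {}
    for r in rows:
        for name, value in r.items():
            columns.setdefault(name, []).append(value)
    return columns


def row_scan_total(rows):
    """Row layout: every field of every row is read to reach 'amount'."""
    fields_read = 0
    total = 0.0
    for r in rows:
        fields_read += len(r)  # the whole record is fetched
        total += r["amount"]
    return total, fields_read


def column_scan_total(columns):
    """Column layout: only the 'amount' column is read."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)


# Hypothetical table with one wide attribute the query never needs.
rows = [
    {"custId": 1, "amount": 10.0, "note": "x" * 50},
    {"custId": 2, "amount": 5.0, "note": "y" * 50},
    {"custId": 1, "amount": 3.0, "note": "z" * 50},
]

row_total, row_fields = row_scan_total(rows)
col_total, col_fields = column_scan_total(to_columns(rows))
```

Both scans produce the same total, but the column scan touches one value per row instead of every attribute; the gap grows with the number and width of the unused columns.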

Conclusion
- MR has some good qualities: a good out-of-the-box experience, and the ability to work on data stored in the file system, which most database systems cannot do.
- DBMSs have some good qualities: technologies and techniques for efficient parallel query execution, and the use of higher-level languages.
- Parallel DBMSs excel at efficient querying of large data sets.
- MR-style systems excel at complex analytics and ETL tasks.
- Neither is good at what the other does well; hence, the two technologies are complementary.
- An ideal system would therefore be a "hybrid" system.
- HadoopDB, Hive, Aster, Greenplum, Cloudera, and Vertica all have commercially available products or prototypes in this "hybrid" category.