MapReduce and Parallel DMBSs: Friends or Foes? Michael Stonebraker, Daniel Abadi, David J. Dewitt, Sam Madden, Erik Paulson, Andrew Pavlo, Alexander Rasin.

Slides:



Advertisements
Similar presentations
Shark:SQL and Rich Analytics at Scale
Advertisements

MAP REDUCE PROGRAMMING Dr G Sudha Sadasivam. Map - reduce sort/merge based distributed processing Best for batch- oriented processing Sort/merge is primitive.
Andy Pavlo April 13, 2015April 13, 2015April 13, 2015 NewS QL.
1 Chapter 5 : Query Processing and Optimization Group 4: Nipun Garg, Surabhi Mithal
Parallel Databases Michael French, Spencer Steele, Jill Rochelle When Parallel Lines Meet by Ken Rudin (BYTE, May 98)
HadoopDB Inneke Ponet.  Introduction  Technologies for data analysis  HadoopDB  Desired properties  Layers of HadoopDB  HadoopDB Components.
Tutorial at WWW 2011, Distributed reasoning: because size matters Andreas Harth, Aidan Hogan, Spyros Kotoulas,
Distributed databases
Shark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Hive on Spark.
Parallel Database Systems The Future Of High Performance Database Systems David Dewitt and Jim Gray 1992 Presented By – Ajith Karimpana.
MIS 385/MBA 664 Systems Implementation with DBMS/ Database Management Dave Salisbury ( )
HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads Azza Abouzeid1, Kamil BajdaPawlikowski1, Daniel Abadi1, Avi.
HadoopDB An Architectural Hybrid of Map Reduce and DBMS Technologies for Analytical Workloads Presented By: Wen Zhang and Shawn Holbrook.
C-Store: Updates Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009.
Tutorial at ISWC 2011, Distributed reasoning: because size matters Andreas Harth, Aidan Hogan, Spyros Kotoulas,
CS347: MapReduce CS Motivation for Map-Reduce Distribution makes simple computations complex Communication Load balancing Fault tolerance … What.
Overview Distributed vs. decentralized Why distributed databases
NoSQL and NewSQL Justin DeBrabant CIS Advanced Systems - Fall 2013.
CMU SCS Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications C. Faloutsos – A. Pavlo Lecture#28: Modern Database Systems.
CS 405G: Introduction to Database Systems 24 NoSQL Reuse some slides of Jennifer Widom Chen Qian University of Kentucky.
Cloud Computing Other Mapreduce issues Keke Chen.
CMU SCS Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications C. Faloutsos – A. Pavlo Lecture#25: OldSQL vs. NoSQL vs. NewSQL.
PARALLEL DBMS VS MAP REDUCE “MapReduce and parallel DBMSs: friends or foes?” Stonebraker, Daniel Abadi, David J Dewitt et al.
1 A Comparison of Approaches to Large-Scale Data Analysis Pavlo, Paulson, Rasin, Abadi, DeWitt, Madden, Stonebraker, SIGMOD’09 Shimin Chen Big data reading.
Hadoop Team: Role of Hadoop in the IDEAL Project ●Jose Cadena ●Chengyuan Wen ●Mengsu Chen CS5604 Spring 2015 Instructor: Dr. Edward Fox.
Database Architectures and the Web
1 Distributed and Parallel Databases. 2 Distributed Databases Distributed Systems goal: –to offer local DB autonomy at geographically distributed locations.
MapReduce VS Parallel DBMSs
Map/Reduce and Hadoop performance Ioana Manolescu Senior researcher, OAK team lead Inria Saclay and Université Paris-Sud Big Data Paris, 2013.
李智宇、 林威宏、 施閔耀. + Outline Introduction Architecture of Hadoop HDFS MapReduce Comparison Why Hadoop Conclusion
H ADOOP DB: A N A RCHITECTURAL H YBRID OF M AP R EDUCE AND DBMS T ECHNOLOGIES FOR A NALYTICAL W ORKLOADS By: Muhammad Mudassar MS-IT-8 1.
MapReduce vs. Parallel DBMS Hamid Safizadeh, Otelia Buffington
Map Reduce for data-intensive computing (Some of the content is adapted from the original authors’ talk at OSDI 04)
A Comparison of Approaches to Large-Scale Data Analysis Erik Paulson Computer Sciences Department.
MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat.
Presented by CH.Anusha.  Apache Hadoop framework  HDFS and MapReduce  Hadoop distributed file system  JobTracker and TaskTracker  Apache Hadoop NextGen.
1 1 1 Berendt: Advanced databases, 2011, Advanced databases – Large-scale data storage and processing (1):
Software Engineering for Business Information Systems (sebis) Department of Informatics Technische Universität München, Germany wwwmatthes.in.tum.de Data-Parallel.
HadoopDB project An Architetural hybrid of MapReduce and DBMS Technologies for Analytical Workloads Anssi Salohalla.
Introduction to Apache Hadoop Zibo Wang. Introduction  What is Apache Hadoop?  Apache Hadoop is a software framework which provides open source libraries.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
HadoopDB Presenters: Serva rashidyan Somaie shahrokhi Aida parbale Spring 2012 azad university of sanandaj 1.
Amazon Web Services BY, RAJESH KANDEPU. Introduction  Amazon Web Services is a collection of remote computing services that together make up a cloud.
O’Reilly – Hadoop: The Definitive Guide Ch.1 Meet Hadoop May 28 th, 2010 Taewhi Lee.
H-Store: A Specialized Architecture for High-throughput OLTP Applications Evan Jones (MIT) Andrew Pavlo (Brown) 13 th Intl. Workshop on High Performance.
C-Store: Concurrency Control and Recovery Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Jun. 5, 2009.
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
A Hierarchical MapReduce Framework Yuan Luo and Beth Plale School of Informatics and Computing, Indiana University Data To Insight Center, Indiana University.
C-Store: Tuple Reconstruction Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 27, 2009.
C-Store: Data Model and Data Organization Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May 17, 2010.
C-Store: RDF Data Management Using Column Stores Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Apr. 24, 2009.
EECS 262a Advanced Topics in Computer Systems Lecture 17 Comparison of Parallel DB, CS, MR and Jockey October 24 th, 2012 John Kubiatowicz and Anthony.
Hadoop implementation of MapReduce computational model Ján Vaňo.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
Introduction.  Administration  Simple DBMS  CMPT 454 Topics John Edgar2.
A Comparison of Approaches to Large-Scale Data Analysis Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. Dewitt, Samuel Madden, Michael.
Mapping the Data Warehouse to a Multiprocessor Architecture
Information Integration 15 th Meeting Course Name: Business Intelligence Year: 2009.
1 Student Date Time Wei Li Nov 30, 2015 Monday 9:00-9:25am Shubbhi Taneja Nov 30, 2015 Monday9:25-9:50am Rodrigo Sanandan Dec 2, 2015 Wednesday9:00-9:25am.
EECS 262a Advanced Topics in Computer Systems Lecture 16 Comparison of Parallel DB, CS, MR and Jockey March 16 th, 2016 John Kubiatowicz Electrical Engineering.
Hive Big data for CSci 4707 students! Eric Atherton and Henry Hoang.
Distributed Programming in “Big Data” Systems Pramod Bhatotia wp
Hadoop MapReduce Framework
Mapping the Data Warehouse to a Multiprocessor Architecture
A Comparison of Approaches to Large-Scale Data Analysis
Overview of big data tools
H-store: A high-performance, distributed main memory transaction processing system Robert Kallman, Hideaki Kimura, Jonathan Natkins, Andrew Pavlo, Alex.
EECS 262a Advanced Topics in Computer Systems Lecture 21 Comparison of Parallel DB, CS, MR and Spark November 11th, 2018 John Kubiatowicz Electrical.
MapReduce: Simplified Data Processing on Large Clusters
Analysis of Structured or Semi-structured Data on a Hadoop Cluster
Presentation transcript:

MapReduce and Parallel DMBSs: Friends or Foes? Michael Stonebraker, Daniel Abadi, David J. Dewitt, Sam Madden, Erik Paulson, Andrew Pavlo, Alexander Rasin Communications of the ACM, vol. 53, iss. 1, pp , Presentation and slides by Elisa Tvete, Jim Avery

P ARALLEL DBMS ARCHITECTURE Multiple nodes running database software “Shared-nothing nodes” - separate CPU, memory, disks Data horizontally partitioned across all nodes Each node runs query on own data Results returned to central processing node Central node calculates final result

M AP R EDUCE ARCHITECTURE Several computing nodes used Data not pre-loaded Query has “Map” and “Reduce” components Key/value data is distributed to nodes Nodes perform “Map” step Results are returned to central processing node

P ERFORMANCE T RADE - OFFS D EMONSTRATION Three systems: – Hadoop MR Framework – Vertica, a column-store relational database – DBMS-X, a row-based database Three tasks: – Original MR Grep task SELECT * FROM Data WHERE field LIKE `%XYZ%'; – Web log task SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP; – Join task

D EMONSTRATION R ESULTS

MR C OMPLEMENTS P ARALLEL DBMS MR good at extract-transform-load queries Extract raw data, process it, load into DBMS Can perform complex analytics more easily Queries not suitable for single SQL query Can use data without strictly defined schema MR functions can enhance parallel DBMS!

C ONCLUSION Architectural Differences – Repetitive record parsing – Compression – Pipelining – Scheduling Discussion Coexistence

R ESOURCES M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin, "MapReduce and Parallel DBMSs: Friends or Foes?," Communications of the ACM, vol. 53, iss. 1, pp , A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A Comparison of Approaches to Large-Scale Data Analysis. Brown University Data Management Research Group, 26 Feb Web. 24 Aug 2011.