Map/Reduce and Hadoop performance Ioana Manolescu Senior researcher, OAK team lead Inria Saclay and Université Paris-Sud Big Data Paris, 2013.

Slides:



Advertisements
Similar presentations
Lecture 12: MapReduce: Simplified Data Processing on Large Clusters Xiaowei Yang (Duke University)
Advertisements

MAP REDUCE PROGRAMMING Dr G Sudha Sadasivam. Map - reduce sort/merge based distributed processing Best for batch- oriented processing Sort/merge is primitive.
 Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware  Created by Doug Cutting and.
HDFS & MapReduce Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer.
SkewReduce YongChul Kwon Magdalena Balazinska, Bill Howe, Jerome Rolia* University of Washington, *HP Labs Skew-Resistant Parallel Processing of Feature-Extracting.
Overview of MapReduce and Hadoop
MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.
Spark: Cluster Computing with Working Sets
Presented by Nirupam Roy Starfish: A Self-tuning System for Big Data Analytics Herodotos Herodotou, Harold Lim, Gang Luo, Nedyalko Borisov, Liang Dong,
Piccolo – Paper Discussion Big Data Reading Group 9/20/2010.
CPS216: Advanced Database Systems (Data-intensive Computing Systems) How MapReduce Works (in Hadoop) Shivnath Babu.
L22: SC Report, Map Reduce November 23, Map Reduce What is MapReduce? Example computing environment How it works Fault Tolerance Debugging Performance.
Lecture 2 – MapReduce CPE 458 – Parallel Programming, Spring 2009 Except as otherwise noted, the content of this presentation is licensed under the Creative.
MapReduce : Simplified Data Processing on Large Clusters Hongwei Wang & Sihuizi Jin & Yajing Zhang
PARALLEL DBMS VS MAP REDUCE “MapReduce and parallel DBMSs: friends or foes?” Stonebraker, Daniel Abadi, David J Dewitt et al.
Google Distributed System and Hadoop Lakshmi Thyagarajan.
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc
Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.
Hadoop Ida Mele. Parallel programming Parallel programming is used to improve performance and efficiency In a parallel program, the processing is broken.
Ch 4. The Evolution of Analytic Scalability
By: Jeffrey Dean & Sanjay Ghemawat Presented by: Warunika Ranaweera Supervised by: Dr. Nalin Ranasinghe.
MapReduce VS Parallel DBMSs
MapReduce April 2012 Extract from various presentations: Sudarshan, Chungnam, Teradata Aster, …
Süleyman Fatih GİRİŞ CONTENT 1. Introduction 2. Programming Model 2.1 Example 2.2 More Examples 3. Implementation 3.1 ExecutionOverview 3.2.
Map Reduce for data-intensive computing (Some of the content is adapted from the original authors’ talk at OSDI 04)
Parallel Programming Models Basic question: what is the “right” way to write parallel programs –And deal with the complexity of finding parallelism, coarsening.
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.
MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat.
1 The Map-Reduce Framework Compiled by Mark Silberstein, using slides from Dan Weld’s class at U. Washington, Yaniv Carmeli and some other.
MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.
Apache Hadoop MapReduce What is it ? Why use it ? How does it work Some examples Big users.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Introduction to Hadoop and HDFS
Contents HADOOP INTRODUCTION AND CONCEPTUAL OVERVIEW TERMINOLOGY QUICK TOUR OF CLOUDERA MANAGER.
MapReduce How to painlessly process terabytes of data.
CPS216: Advanced Database Systems (Data-intensive Computing Systems) Introduction to MapReduce and Hadoop Shivnath Babu.
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters Hung-chih Yang(Yahoo!), Ali Dasdan(Yahoo!), Ruey-Lung Hsiao(UCLA), D. Stott Parker(UCLA)
Mining High Utility Itemset in Big Data
Querying Large Databases Rukmini Kaushik. Purpose Research for efficient algorithms and software architectures of query engines.
The exponential growth of data –Challenges for Google,Yahoo,Amazon & Microsoft in web search and indexing The volume of data being made publicly available.
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
Hung-chih Yang 1, Ali Dasdan 1 Ruey-Lung Hsiao 2, D. Stott Parker 2
Database Applications (15-415) Part II- Hadoop Lecture 26, April 21, 2015 Mohammad Hammoud.
SLIDE 1IS 240 – Spring 2013 MapReduce, HBase, and Hive University of California, Berkeley School of Information IS 257: Database Management.
Presented by: Katie Woods and Jordan Howell. * Hadoop is a distributed computing platform written in Java. It incorporates features similar to those of.
MapReduce and Data Management Based on slides from Jimmy Lin’s lecture slides ( (licensed.
Department of Computer Science MapReduce for the Cell B. E. Architecture Marc de Kruijf University of Wisconsin−Madison Advised by Professor Sankaralingam.
MapReduce and the New Software Stack CHAPTER 2 1.
CS4432: Database Systems II Query Processing- Part 2.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
Data Engineering How MapReduce Works
 Frequent Word Combinations Mining and Indexing on HBase Hemanth Gokavarapu Santhosh Kumar Saminathan.
IBM Research ® © 2007 IBM Corporation Introduction to Map-Reduce and Join Processing.
Page 1 A Platform for Scalable One-pass Analytics using MapReduce Boduo Li, E. Mazur, Y. Diao, A. McGregor, P. Shenoy SIGMOD 2011 IDS Fall Seminar 2011.
MapReduce & Hadoop IT332 Distributed Systems. Outline  MapReduce  Hadoop  Cloudera Hadoop  Tutorial 2.
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
MapReduce: Simplified Data Processing on Large Clusters By Dinesh Dharme.
Only Aggressive Elephants are Fast Elephants Nov 11 th 2013 Database Lab. Wonseok Choi.
1 Student Date Time Wei Li Nov 30, 2015 Monday 9:00-9:25am Shubbhi Taneja Nov 30, 2015 Monday9:25-9:50am Rodrigo Sanandan Dec 2, 2015 Wednesday9:00-9:25am.
BIG DATA/ Hadoop Interview Questions.
COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn University
”Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters” Published In SIGMOD '07 By Yahoo! Senthil Nathan N IIT Bombay.
Hadoop Aakash Kag What Why How 1.
MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner
Database Applications (15-415) Hadoop Lecture 26, April 19, 2016
MapReduce Simplied Data Processing on Large Clusters
CS110: Discussion about Spark
Ch 4. The Evolution of Analytic Scalability
MapReduce: Simplified Data Processing on Large Clusters
Presentation transcript:

Map/Reduce and Hadoop performance Ioana Manolescu Senior researcher, OAK team lead Inria Saclay and Université Paris-Sud Big Data Paris, 2013

Plan The Map/Reduce model Two performance problems and ways out – Blocking steps in the Map/Reduce processing pipeline – Non-selective data access Conclusion: toward Map/Reduce optimization? 03/04/132Ioana Manolescu (Inria), Big Data Paris

The Map/Reduce model Problem: – How to compute in a distributed fashion a given processing to a very large amount of data Map/Reduce solution: – Programmer: express the processing as Splitting the data Extract (key, value pairs) from each partition (MAP) Compute partial results for each key (REDUCE) – Map/Reduce platform (e.g., Hadoop): Distributes partitions, runs one MAP task per partition Runs one or several REDUCE tasks per key Sends data across machines from MAP to REDUCE 03/04/13Ioana Manolescu (Inria), Big Data Paris3 Implicit parallelism Fault tolerance Communication across machines

Ioana Manolescu (Inria), Big Data Paris4 Map/Reduce in detail 03/04/13

Hadoop Map/Reduce 03/04/13 Ioana Manolescu (Inria), Big Data Paris5 Data Load Map() Local sort Map write Merge Reduce Final write Merge Reduce Final write Merge Reduce Final write Merge Reduce Final write Data Load Map() Local sort Map write Data Load Map() Local sort Map write Data Load Map() Local sort Map write Map Reduce

Performance problem 1: Idle CPU due to blocking steps

Hadoop resource usage 03/04/13 Ioana Manolescu (Inria), Big Data Paris7 Data Load Map() Local sort Map write Merge Reduce Final write Merge Reduce Final write Data Load Map() Local sort Map write Blocking sort Blocking sort-merge

Hadoop benchmark [Li, Mazur, Diao, McGregor, Shenoy, ACM SIGMOD Conference 2011] CPU stalls during I/O intensive Merge Reduce strictly follows Merge 03/04/13Ioana Manolescu (Inria), Big Data Paris8

Hash-based algorithms to improve Hadoop performance [Li, Mazur, Diao, McGregor, Shenoy, SIGMOD 2011] Main idea: use non-blocking hash-based algorithms to group items by keys during Map.LocalSort and Reduce.Merge Principle of hashing: Partitions can be in memory or flushed to disk If the reduce works incrementally, early send 03/04/13Ioana Manolescu (Inria), Big Data Paris9 Data Load Map() Local sort Map write Merge Reduce Final write keyvalue 09, 3, 3 11, 13, 1 22, 2, 8, 2 3, 1, 2, 8, 2, 13, 1, 2, 3, 9… h(x) e.g. x % 3 h(x) e.g. x % 3

Performance problem 2: non-selective data access

Data access in Hadoop 03/04/1311Ioana Manolescu (Inria), Big Data Paris Basic model: read all the data – If the tasks are selective, we don't really need to! Database indexes? But: – Map/Reduce works on top of a file system (e.g. Hadoop file system, HDFS) – Data is stored only once – Hard to foresee all future processing "Exploratory nature" of Hadoop Data Load Map() Local sort Map write Merge Reduce Final write

Accelerating data access in Hadoop Idea 1: Hadop++ [Jindal, Quiané-Ruiz, Dittrich, ACM SOCC, 2011] – Add header information to each data split, summarizing split attribute values – Modify the RecordReader of HDFS, used by the Map() will prune irrelevant splits 03/04/13Ioana Manolescu (Inria), Big Data Paris12 Data Load Map() Local sort Map write Merge Reduce Final write

Accelerating data access in Hadoop Idea 2: HAIL [Dittrich, Quiané-Ruiz, Richter, Schuh, Jindal, Schad, PVLDB 2012] – Each storage node builds an in-memory, clustered index of the data in its split – There are three copies of each split for reliability  Build three different indexes! – Customize RecordReader 03/04/13Ioana Manolescu (Inria), Big Data Paris13 Data Load Map() Local sort Map write Merge Reduce Final write

Hadoop, Hadoop++ and HAIL 03/04/13Ioana Manolescu (Inria), Big Data Paris14 Data Load Map() Local sort Map write Merge Reduce Final write

Conclusion Toward Map/Reduce optimization

Optimization defined as Hadoop tuning Hadoop parameters: [Herodotou, technical report, Duke Univ, 2011] 03/04/13Ioana Manolescu (Inria), Big Data Paris16

Hadoop tuning [Babu, ACM SoCC 2010] 03/04/13Ioana Manolescu (Inria), Big Data Paris17

Hadoop performance model 03/04/13Ioana Manolescu (Inria), Big Data Paris18 [Li, Mazur, Diao, McGregor, Shenoy, SIGMOD 2011]

Hadoop performance model 03/04/13Ioana Manolescu (Inria), Big Data Paris19 [Li, Mazur, Diao, McGregor, Shenoy, SIGMOD 2011] Easy

References Shivnath Babu. "Towards automatic optimization of MapReduce programs", ACM SoCC, 2010 Herodotos Herodotou. "Hadoop Performance Models". Duke University, 2011 Boduo Li, Edward Mazur, Yanlei Diao, Andrew McGregor, Prashant Shenoy. "A Platform for Scalable One-Pass Analytics using MapReduce", ACM SIGMOD 2011 Jens Dittrich, Jorge-Arnulfo Quiané-Ruiz, Stefan Richter, Stefan Schuh, Alekh Jindal, Jorg Schad. "Only Aggressive Elephants are Fast Elephants", VLDB 2012 A.Jindal,J.-A.Quiané-Ruiz,andJ.Dittrich. "Trojan Data Layouts: Right Shoes for a Running Elephant" SOCC, 2011 Harold Lim, Herodotos Herodotou, Shivnath Babu. "Stubby: A Transformation-based Optimizer for MapReduce Workflows" PVLDB /04/13Ioana Manolescu (Inria), Big Data Paris20

Merci Questions now? Questions/feedback later on: 03/04/1321Ioana Manolescu (Inria), Big Data Paris