Dynamic Slot Allocation Technique for MapReduce Clusters School of Computer Engineering Nanyang Technological University 25th Sept 2013 Shanjiang Tang,


Dynamic Slot Allocation Technique for MapReduce Clusters School of Computer Engineering Nanyang Technological University 25th Sept 2013 Shanjiang Tang, Bu-Sung Lee, Bingsheng He 1

Outline Background & Motivations DHFS Evaluation Conclusion 2

MapReduce Computation Model Input data is split and processed by map tasks in the map-phase computation, producing intermediate results; reduce tasks in the reduce-phase computation consume the intermediate results and produce output results, which together form the final result. 3

Hadoop Execution Model Hadoop is an open-source implementation of the MapReduce model. The cluster's computation resources are divided into map slots and reduce slots, which are configured in advance by the Hadoop administrator. A MapReduce job generally consists of map tasks and reduce tasks. Map tasks can only be allocated map slots, and reduce tasks can only be allocated reduce slots. 4
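To make this concrete, a minimal Python sketch (illustrative only, not Hadoop code) of the static policy: idle slots of one kind cannot serve pending tasks of the other kind.

```python
# Hypothetical model of Hadoop's static slot policy: map tasks may only
# take map slots, reduce tasks may only take reduce slots.

def assign_static(map_slots_free, reduce_slots_free,
                  pending_maps, pending_reduces):
    """Return (maps_started, reduces_started) under the static policy."""
    maps_started = min(map_slots_free, pending_maps)
    reduces_started = min(reduce_slots_free, pending_reduces)
    return maps_started, reduces_started

# Early in a job only map tasks are runnable: all reduce slots sit idle
# even though map tasks are waiting.
print(assign_static(map_slots_free=4, reduce_slots_free=4,
                    pending_maps=10, pending_reduces=0))  # (4, 0)
```

The second tuple entry staying at 0 while 6 map tasks wait is exactly the utilization problem the next slides motivate.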

Hadoop Execution Model The cluster is partitioned into map slots and reduce slots. Map tasks start before reduce tasks; map tasks can only run on map slots, and reduce tasks can only run on reduce slots. Implication: slot utilization can be poor for MapReduce workloads under the current static slot configuration and allocation policy! 5

Our Goals To maximize slot resource utilization for a Hadoop cluster without any prior knowledge of or assumptions about MapReduce jobs. In other words, at any time there should be no idle map/reduce slots while there are pending tasks, i.e., slots are kept as busy as possible. Our work focuses on the Hadoop Fair Scheduler, i.e., improving performance while guaranteeing fairness. 6

Outline Background & Motivations DHFS Evaluation Conclusion 7

Our Approach We propose a dynamic slot allocation technique by breaking the existing slot allocation constraint: 1) slots are generic and can be used by both map and reduce tasks; 2) map tasks prefer map slots, and likewise reduce tasks prefer reduce slots. Comparing pending tasks against configured slots for each phase gives four cases: Case 1: map tasks fit in map slots and reduce tasks fit in reduce slots; no slot borrowing is needed. Case 2: map tasks exceed map slots while reduce slots are sufficient; borrow reduce slots for map tasks. Case 3: reduce tasks exceed reduce slots while map slots are sufficient; borrow map slots for reduce tasks. Case 4: both phases are overloaded; no slot borrowing is needed. 8
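The four cases can be sketched as a small decision function (a hypothetical illustration; the function name and return labels are not from the DHFS source):

```python
# Hypothetical sketch of the four borrowing cases behind dynamic slot
# allocation; borrowing only helps when exactly one phase is overloaded.

def borrow_decision(pending_maps, map_slots, pending_reduces, reduce_slots):
    """Decide which phase borrows slots from the other, if any."""
    maps_overloaded = pending_maps > map_slots
    reduces_overloaded = pending_reduces > reduce_slots
    if maps_overloaded and not reduces_overloaded:
        # Case 2: lend idle reduce slots to map tasks.
        return "borrow_reduce_for_map"
    if reduces_overloaded and not maps_overloaded:
        # Case 3: lend idle map slots to reduce tasks.
        return "borrow_map_for_reduce"
    # Cases 1 and 4: neither or both phases overloaded; no borrowing helps.
    return "no_borrow"

print(borrow_decision(10, 4, 0, 4))  # borrow_reduce_for_map
```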

Dynamic Hadoop Fair Scheduler (DHFS) We provide two types of DHFS, based on different levels of fairness:  Pool-Independent DHFS (PI-DHFS)  Pool-Dependent DHFS (PD-DHFS) Each MapReduce pool consists of two sub-pools:  a map-phase pool  a reduce-phase pool 9

PI-DHFS It follows the fairness notion of the default Hadoop Fair Scheduler, i.e., fair share is computed across phase-pools within each phase. The dynamic allocation process consists of two parts:  intra-phase dynamic slot allocation  inter-phase dynamic slot allocation 10

PI-DHFS It performs intra-phase dynamic slot allocation first, and then inter-phase dynamic slot allocation. 11
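This two-step order can be sketched in toy form (illustrative pool names and a simplified max-min fair share; not the actual PI-DHFS code):

```python
# Hypothetical sketch of PI-DHFS: intra-phase fair share within each
# phase first, then inter-phase lending of leftover slots.

def intra_phase_share(demands, slots):
    """Max-min fair split of one phase's slots across pools, capped by demand."""
    alloc, remaining = {}, slots
    pools = sorted(demands, key=demands.get)  # smallest demand first
    for i, pool in enumerate(pools):
        fair = remaining // (len(pools) - i)
        alloc[pool] = min(demands[pool], fair)
        remaining -= alloc[pool]
    return alloc, remaining  # remaining = slots this phase cannot use

def pi_dhfs(map_demands, map_slots, reduce_demands, reduce_slots):
    m_alloc, m_idle = intra_phase_share(map_demands, map_slots)
    r_alloc, r_idle = intra_phase_share(reduce_demands, reduce_slots)
    # Inter-phase step: leftover slots of one phase serve the other's backlog.
    m_backlog = sum(map_demands.values()) - sum(m_alloc.values())
    r_backlog = sum(reduce_demands.values()) - sum(r_alloc.values())
    return {"map": m_alloc, "reduce": r_alloc,
            "reduce_slots_lent_to_map": min(r_idle, m_backlog),
            "map_slots_lent_to_reduce": min(m_idle, r_backlog)}

# Two pools, map-heavy workload: all 4 idle reduce slots go to map tasks.
print(pi_dhfs({"p1": 6, "p2": 2}, 4, {"p1": 0, "p2": 0}, 4))
```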

PD-DHFS Fair share is computed across pools, instead of across phases. The dynamic allocation process consists of two parts:  intra-pool dynamic slot allocation  inter-pool dynamic slot allocation 12

PD-DHFS It performs intra-pool dynamic slot allocation first, and then inter-pool dynamic slot allocation. 13

Overview of Slot Allocation Flow The slot allocation flow for each pool under PD-DHFS: (1) if there are pending map tasks and idle map slots, do map task assignment; (2) if there are pending reduce tasks and idle reduce slots, do reduce task assignment; (3) if map tasks are still pending, assign them to idle reduce slots; (4) if reduce tasks are still pending, assign them to idle map slots. 14
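The per-pool flow can be sketched in Python (a hypothetical illustration of the decision order, not the DHFS implementation):

```python
# Hypothetical sketch of the four-step per-pool allocation flow under
# PD-DHFS: own-typed slots first, then cross-phase borrowing.

def allocate_for_pool(pending_maps, idle_map_slots,
                      pending_reduces, idle_reduce_slots):
    """Return a list of (task_kind, slot_kind) assignments for one pool."""
    assignments = []
    # (1) Map tasks take the pool's own idle map slots first.
    while pending_maps and idle_map_slots:
        assignments.append(("map", "map_slot"))
        pending_maps -= 1; idle_map_slots -= 1
    # (2) Reduce tasks take the pool's own idle reduce slots.
    while pending_reduces and idle_reduce_slots:
        assignments.append(("reduce", "reduce_slot"))
        pending_reduces -= 1; idle_reduce_slots -= 1
    # (3) Remaining map tasks borrow the pool's unused reduce slots.
    while pending_maps and idle_reduce_slots:
        assignments.append(("map", "reduce_slot"))
        pending_maps -= 1; idle_reduce_slots -= 1
    # (4) Remaining reduce tasks borrow the pool's unused map slots.
    while pending_reduces and idle_map_slots:
        assignments.append(("reduce", "map_slot"))
        pending_reduces -= 1; idle_map_slots -= 1
    return assignments

# Map-only phase of a job: one map on its own slot, two on borrowed ones.
print(allocate_for_pool(3, 1, 0, 2))
```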

Outline Background & Motivations DHFS Evaluation Conclusion 15

Experimental Setup Environments  A Hadoop cluster consisting of 10 nodes, each with two Intel X5675 CPUs, 24 GB of memory, and 56 GB hard disks. Workloads  Tested workload: a mix of three representative applications, WordCount, Sort, and Grep, over a Wikipedia article-history dataset of different sizes, e.g., 10 GB, 20 GB, 30 GB, 40 GB. 16

Execution Process for DHFS 17

Performance Improvement 18

Performance Improvement Under Different Percentages of Borrowed Map and Reduce Slots 19

Outline Background & Motivations DHFS Evaluation Conclusion 20

Conclusion The current static slot configuration and allocation policy can lead to poor slot utilization. Two DHFS schemes (PI-DHFS, PD-DHFS) are proposed to address the slot utilization problem for the Hadoop Fair Scheduler. Experimental results show that DHFS significantly improves the performance of MapReduce workloads while guaranteeing fairness. The source code of DHFS is available at: 21

Acknowledgement This work is supported by the ”User and Domain driven data analytics as a Service framework” project under the A*STAR Thematic Strategic Research Programme (SERC Grant No ). Bingsheng He was partly supported by a startup grant of Nanyang Technological University, Singapore. 22
