A Two-phase Execution Engine of Reduce Tasks in Hadoop MapReduce
Xiaohong Zhang, Guowei Wang, Zijing Yang, Yang Ding
School of Computer Science and Technology


A Two-phase Execution Engine of Reduce Tasks in Hadoop MapReduce
Xiaohong Zhang, Guowei Wang, Zijing Yang, Yang Ding
School of Computer Science and Technology, Henan Polytechnic University
2012 International Conference on Systems and Informatics (ICSAI 2012)
Speaker: 張峻榕, WMC Lab

Outline
– Motivation
– Introduction
– Background
– Design and Implementation
– Experiment
– Conclusion and Future work

Motivation
Reduce tasks issue massive remote I/O operations to copy the intermediate results of map tasks, which causes long delays. This paper proposes a two-phase execution engine for reduce tasks.

Introduction
Comparison of related work:
– Condie et al.: a pushing-based method over TCP; disadvantage: occupies network bandwidth
– Seo et al.: pre-shuffling based on data dependence; disadvantage: only reduces the delay of copying intermediate results
– Su et al.: locality-aware scheduling that stores the intermediate results of each node; disadvantage: cannot benefit from parallel execution

MapReduce Data Flow (figure)

Background
Intermediate results are in the form of <key, value> pairs and are partitioned into different classes according to their keys. A reduce task executes in three steps:
– the copy step
– the sort step
– the reduce step
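The key-based partitioning above can be sketched as a toy model mirroring Hadoop's default hash partitioner (the function and variable names below are ours, not the paper's code):

```python
def partition(key, num_reduce_tasks):
    # Hadoop's default HashPartitioner: hash(key) mod R
    return hash(key) % num_reduce_tasks

# Toy intermediate results emitted by map tasks as <key, value> pairs.
intermediate = [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]
num_reducers = 2

# Group the pairs into one class per reduce task, keyed by partition id.
classes = {r: [] for r in range(num_reducers)}
for key, value in intermediate:
    classes[partition(key, num_reducers)].append((key, value))

# Each reduce task then runs copy -> sort -> reduce on its own class.
total = sum(len(pairs) for pairs in classes.values())
```

Because the partition depends only on the key, all pairs with the same key always land in the same class, which is what makes the per-key reduce correct.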

The copy step
A job includes 31 map tasks and 2 reduce tasks. Supposing the reduce tasks are processed on n4 and n9, then n4 issues 27 remote I/O operations and n9 issues 28 remote I/O operations.
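The counts on this slide follow from simple arithmetic, assuming each reduce task fetches one output partition per map task and that map outputs on the reducer's own node are read locally. The local-map counts below are our inference, not stated on the slide:

```python
def remote_fetches(total_maps, maps_local_to_reducer):
    # Each reducer must fetch one partition per map task; outputs of
    # map tasks that ran on the reducer's own node are read locally.
    return total_maps - maps_local_to_reducer

# With 31 map tasks: n4 issuing 27 remote operations implies 4 map
# tasks ran on n4, and n9 issuing 28 implies 3 ran on n9.
n4_remote = remote_fetches(31, 4)
n9_remote = remote_fetches(31, 3)
```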

The copy step
Data transmission delay
– Reduce tasks issue massive remote I/O operations, which causes massive delay and degrades system performance

Design and Implementation
The paper proposes an execution engine for reduce tasks:
– First phase: select the nodes, assign reduce tasks to them, and then order the nodes to prefetch intermediate results
– Second phase: the nodes allocate resources for the reduce tasks and run them
The remote-access delay of intermediate results can thus be hidden.

Design and Implementation
The engine is composed of an engine server and many engine clients.
– The engine server:
selects the nodes to run reduce tasks
decides the number of tasks that each selected node will run
– The engine clients:
prefetch intermediate results for the reduce tasks dispatched to their nodes
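The server/client split can be sketched as a toy model. This is a hypothetical sketch: the class names, round-robin placement policy, and method signatures are our assumptions, not the paper's implementation:

```python
class EngineServer:
    """Selects nodes for reduce tasks and decides how many each runs."""
    def __init__(self, nodes):
        self.nodes = nodes

    def assign(self, reduce_tasks):
        # Placeholder policy: spread tasks round-robin over the nodes.
        assignment = {node: [] for node in self.nodes}
        for i, task in enumerate(reduce_tasks):
            assignment[self.nodes[i % len(self.nodes)]].append(task)
        return assignment

class EngineClient:
    """Runs on one node; prefetches intermediate results (phase 1),
    then allocates resources and runs the reduce tasks (phase 2)."""
    def __init__(self, node):
        self.node = node
        self.prefetched = []

    def prefetch(self, map_outputs):
        # Phase 1: pull completed map outputs early, hiding copy delay.
        self.prefetched.extend(map_outputs)

    def run(self, tasks):
        # Phase 2: run each reduce task on the prefetched results.
        return [task(self.prefetched) for task in tasks]

server = EngineServer(["n4", "n9"])
plan = server.assign(["r0", "r1", "r2"])
```

The point of the split is that prefetching (phase 1) overlaps with the remaining map tasks, so by the time phase 2 starts the copy delay has largely been paid.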

Design and Implementation (architecture figure)

Resource competition
(Figure: map tasks and reduce tasks in the waiting state competing for a node's resources)

Problem
Resource competition in the node
― Reduce tasks do not release resources while in the waiting state
Network resource competition
― Clients periodically request information about completed map tasks from the server

Solution
Resource competition in the node
― The engine imposes restrictions on when a reduce task starts to run
Network resource competition
― The engine postpones reduce-task scheduling until the number of completed map tasks reaches a certain threshold
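The two restrictions amount to two thresholds on the fraction of completed map tasks. A minimal sketch, where the function name is ours and the concrete defaults (schedule at 5%, run at a configurable 10%-90%) follow the talk's experiment setup:

```python
def reduce_phase(completed_frac, schedule_at=0.05, run_at=0.20):
    # completed_frac: fraction of map tasks that have finished.
    if completed_frac < schedule_at:
        return "unscheduled"        # delay scheduling: avoids network competition
    if completed_frac < run_at:
        return "phase1-prefetch"    # copy intermediate results only
    return "phase2-run"             # allocate node resources and reduce
```

Stock Hadoop has a related knob (the reduce "slowstart" setting, defaulting to 0.05) for the first threshold; the second, phase-entry threshold is what this paper's engine adds.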

Experiment
Environment
– The execution engine was implemented in Hadoop
– Linux cluster of 11 nodes: the first rack included 4 nodes and the second rack included 7 nodes; all nodes were connected by a Gigabit switch

Experiment (figure)

Experiment
Criterion: mean execution time
When the number of completed map tasks reached 5% of the total, Hadoop began to schedule reduce tasks and kept them in the first phase. Reduce tasks could enter the second phase only after the percentage of completed map tasks reached a specified threshold (configured as 10%, 20%, …, 90%).

Experiment
Horizontal axis: the percentage of completed map tasks that controlled when reduce tasks entered the second phase
Vertical axis: the mean execution time of each run

Experiment (results figures)

Conclusion and Future work
Conclusion
– The results showed that the engine improved the performance of Hadoop in most cases
Future work
– Potential bottlenecks: client-server communication and network bandwidth
– How to determine which intermediate results a reduce task needs

Thank you