A Hadoop MapReduce Performance Prediction Method

Presentation transcript:

A Hadoop MapReduce Performance Prediction Method. Ge Song*+, Zide Meng*, Fabrice Huet*, Frederic Magoules+, Lei Yu# and Xuelian Lin#. * University of Nice Sophia Antipolis, CNRS, I3S, UMR 7271, France; + Ecole Centrale de Paris, France; # Beihang University, Beijing, China

Background: Hadoop MapReduce. A job's input data is split into chunks stored in HDFS; each split is processed by a Map task, which emits (Key, Value) pairs into partitions (Partition 1, Partition 2, ...); the partitions are then shuffled to the Reduce tasks.

Background: Hadoop. There are many steps within the Map stage and the Reduce stage, and different steps may consume different types of resources. A Map task, for instance, goes through READ, SORT, MERGE and OUTPUT steps.

Motivation: Problems. Scheduling: the scheduler takes no account of a job's execution time or of the different types of resources it consumes (for example, whether it is CPU-intensive). Parameter tuning: Hadoop has numerous configuration parameters, and the default values a job runs with are not optimal.

Motivation: Solution. Both problems, resource-blind scheduling and the tuning of numerous non-optimal default parameters, call for the same remedy: predict the performance of Hadoop jobs.

Related Work. Existing prediction method 1: black-box based. Job features are fed into statistical or machine-learning models, which output an execution time. Drawbacks: there is no analysis of Hadoop's internals, and the job features are hard to choose.

Related Work. Existing prediction method 2: cost-model based. Execution is decomposed into stage costs, e.g. F(map) = f(read, map, sort, spill, merge, write) and F(reduce) = f(read, write, merge, reduce, write), which are evaluated on the job features to predict the execution time. Drawbacks: Hadoop runs lots of concurrent processes, so the stages are hard to divide and accuracy is difficult to ensure.

Related Work. A brief summary of existing prediction methods. Black box. Advantages: simple and effective; high accuracy when the workload is highly isomorphic. Shortcomings: lacks job-feature extraction and analysis of Hadoop itself; the prediction is simple and does not analyze the job (jar package + data). Cost model. Advantages: detailed analysis of Hadoop processing; the division is flexible (by stage, by resource); supports multiple prediction targets. Shortcomings: each step and resource is hard to divide; the heavy concurrency is hard to model; better suited to theoretical analysis than to prediction.

Goal: Prediction System. Design a Hadoop MapReduce performance prediction system that, given a job, predicts the consumption of various types of resources (CPU, disk I/O, network) and the execution time of the Map phase and the Reduce phase. Outputs: Map execution time, Reduce execution time, CPU occupation time, disk occupation time, network occupation time.

Design - 1: Cost Model. A job is fed into the COST MODEL, which outputs the Map execution time, the Reduce execution time, and the CPU, disk, and network occupation times.

Cost Model [1]: analysis of Map. The consumption of each resource (CPU, disk, network) is modeled separately, and each stage involves only one type of resource. The Map stages are: initiation, read data, network transfer, create object, map function, sort in memory, read/write disk, merge, write, and serialization, each charged to the CPU, disk, or network budget. [1] X. Lin, Z. Meng, C. Xu, and M. Wang, "A practical performance model for hadoop mapreduce," in CLUSTER Workshops, 2012, pp. 231-239.

Cost Model [1]: analysis of the cost-function parameters, which fall into three types. Type one, constants: the Hadoop system overhead and the initialization overhead. Type two, job-related parameters: the computational complexity of the map function and the number of map input records. Type three, parameters defined by the cost model: the sorting coefficient and the complexity factor. [1] X. Lin, Z. Meng, C. Xu, and M. Wang, "A practical performance model for hadoop mapreduce," in CLUSTER Workshops, 2012, pp. 231-239.
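To illustrate how the three parameter types combine, here is a minimal Python sketch of a map-stage cost function. All names, coefficient values, and per-stage terms are invented for illustration; they are not the paper's actual model.

```python
import math

def map_cost(n_records, input_mb, complexity, conversion_rate,
             init_cost=0.8,           # Type 1: constant system/initialization overhead
             sort_coeff=1e-6,         # Type 3: sorting coefficient, fitted offline
             complexity_factor=1e-5): # Type 3: complexity factor, fitted offline
    """Estimate map-task time (seconds) from job features.

    The Type 2 (job-related) parameters are the arguments: the record
    count, the input size, the relative complexity of the map function,
    and the output/input data conversion rate.
    """
    read_cost = 0.05 * input_mb                       # reading scales with data size
    func_cost = complexity_factor * complexity * n_records
    sort_cost = sort_coeff * n_records * math.log2(max(n_records, 2))
    write_cost = 0.05 * input_mb * conversion_rate    # output scales with conversion rate
    return init_cost + read_cost + func_cost + sort_cost + write_cost
```

With zero input, only the Type 1 constant remains, which mirrors the "run empty map tasks" calibration described later in the slides.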

Parameters Collection. Types one and three: for type one, run empty map tasks and calculate the system overhead from the logs; for type three, extract the sort code from the Hadoop source and sort a known number of records. Type two: running the job and analyzing its logs would work, but it has high latency and a large overhead. Instead, the Job Analyzer samples the input data and analyzes only the behavior of the map and reduce functions, with almost no latency and very low extra overhead.

Job Analyzer - Implementation. The Job Analyzer accepts the job's jar file and input data and runs them in a Hadoop virtual execution environment built from three modules. Sampling Module: samples the input data at a fixed rate (less than 5%). MR Module: instantiates the user job's classes using Java reflection. Analyze Module: derives the job features, namely the input data (amount and number of records), the relative computational complexity, and the data conversion rate (output/input).
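The sampling idea can be sketched in a few lines. The real Job Analyzer instantiates the user's Java map/reduce classes via reflection inside a virtual execution environment; this hypothetical Python analogue simply takes the map function as a callable and derives the same kind of features from a small sample.

```python
import random

def analyze(records, map_fn, sample_rate=0.05):
    """Run map_fn over a small sample of the input and report job features:
    sample size, output/input record conversion rate, and output/input
    byte conversion rate. All names here are illustrative."""
    sample = [r for r in records if random.random() < sample_rate]
    out = []
    for rec in sample:
        out.extend(map_fn(rec))   # the user's map may emit 0..n records per input
    in_bytes = sum(len(str(r)) for r in sample)
    out_bytes = sum(len(str(r)) for r in out)
    return {
        "sampled_records": len(sample),
        "record_conversion": len(out) / max(len(sample), 1),
        "byte_conversion": out_bytes / max(in_bytes, 1),
    }
```

For a word-count-style map such as `lambda line: line.split()`, the record conversion rate is the average number of words per line, exactly the kind of output/input ratio the Analyze Module reports.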

Job Analyzer - Feasibility. Sampling works because of data similarity (log records have a uniform format) and execution similarity (every record is processed repeatedly by the same map and reduce functions), so a small sample of the splits is representative of the whole input.

Design - 2: Parameters Collection. The Job Analyzer collects the parameters of type 2, while a static parameters collection module collects those of types 1 and 3; both feed the COST MODEL. From now on we can get the occupation time of each resource: the CPU, disk, and network occupation times, as well as the Map and Reduce execution times.

Prediction Model: problem analysis. Many steps run concurrently (initiation, read data, network transfer, create object, map function, sort in memory, read/write disk, merge, write, serialization), so the total time cannot be obtained by simply adding up the time of each part.

Prediction Model: main factors for the Map stage (according to the performance model) are the amount of input data, the number of input records N, N*log(N), the complexity of the map function, and the conversion rate of the map data: Tmap = α0 + α1*MapInput + α2*N + α3*N*log(N) + α4*(complexity of the map function) + α5*(conversion rate of the map data)
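Given those factors, fitting the α coefficients is an ordinary least-squares problem. A sketch on synthetic, noise-free data (the coefficient values and feature ranges are invented for illustration):

```python
import numpy as np

def design_matrix(map_input, n, complexity, conversion):
    """Columns: 1, MapInput, N, N*log(N), map complexity, map conversion rate."""
    n = np.asarray(n, dtype=float)
    return np.column_stack([
        np.ones_like(n), map_input, n, n * np.log(n), complexity, conversion,
    ])

rng = np.random.default_rng(0)
N = rng.integers(1_000, 10_000, size=50)
X = design_matrix(rng.uniform(64, 512, 50), N,
                  rng.uniform(1, 4, 50), rng.uniform(0.5, 2, 50))
true_alpha = np.array([2.0, 0.01, 1e-4, 5e-5, 0.5, 0.3])
y = X @ true_alpha                                  # noise-free synthetic map times
alpha, *_ = np.linalg.lstsq(X, y, rcond=None)       # recovers true_alpha exactly
```

On real (noisy) measurements the recovered coefficients would only approximate the true ones, but the fitting step is the same.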

Prediction Model: experimental analysis. Four kinds of jobs were tested (0-10,000 records), the features were extracted for linear regression, and the correlation coefficient (R²) was calculated per job: Dedup 0.9982, WordCount 0.9992, Project 0.9991, Grep 0.9949; pooling all jobs together gives only 0.6157.
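The gap between the per-job and pooled R² values is easy to reproduce on synthetic data: two job types that are each perfectly linear in the record count, but with different per-record costs, fit a single pooled regression line poorly. (The cost figures below are invented, not measurements from the paper.)

```python
import numpy as np

def r2(x, y):
    """R^2 of a simple least-squares line y ~ a + b*x."""
    A = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

n = np.linspace(1_000, 10_000, 20)
t_grep = 0.5 + 0.002 * n          # a cheap map function (illustrative)
t_sort = 3.0 + 0.02 * n           # roughly 10x the per-record cost
r2_each = min(r2(n, t_grep), r2(n, t_sort))     # ~1.0 within one job type
r2_pool = r2(np.concatenate([n, n]),
             np.concatenate([t_grep, t_sort]))  # much lower when pooled
```

This is the same pattern as the slide's table: near-perfect R² per job kind, a poor R² over all kinds, which motivates predicting from similar jobs only.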

Prediction Model: execution time of Map versus number of records shows a very good linear relationship within the same kind of job, but no linear relationship across different kinds of jobs.

Find the nearest jobs! Instance-based linear regression: find the samples nearest to the job to be predicted in the history logs ("nearest" meaning similar jobs; take the top K nearest, with K = 10%-15%), do linear regression over the samples found, and calculate the predicted value. Nearness is the weighted distance of the job features (weight w): the map/reduce complexity and the map/reduce data conversion rate contribute strongly to job classification, while the data amount and the number of records contribute little.
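A hedged sketch of that instance-based step: rank the historical jobs by a feature-weighted distance, keep the top K, and fit a local linear model on just those neighbours. The feature set, weights, and data below are illustrative, not the paper's.

```python
import numpy as np

def predict_local(history_X, history_t, query, weights, k):
    """history_X: (m, d) job features; history_t: (m,) observed times.
    Returns the local linear prediction at `query` using the k jobs
    nearest under the feature-weighted Euclidean distance."""
    d = np.sqrt((((history_X - query) * weights) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]                       # top-K similar jobs
    A = np.column_stack([np.ones(k), history_X[nearest]])
    coef, *_ = np.linalg.lstsq(A, history_t[nearest], rcond=None)
    return float(np.concatenate([[1.0], query]) @ coef)

# Two job families: features are (complexity, record count); the complexity
# weight dominates, so the neighbours all come from the matching family.
rng = np.random.default_rng(1)
nA, cA = rng.uniform(1_000, 10_000, 10), rng.uniform(0.9, 1.1, 10)
nB, cB = rng.uniform(1_000, 10_000, 10), rng.uniform(4.5, 5.5, 10)
X = np.column_stack([np.concatenate([cA, cB]), np.concatenate([nA, nB])])
t = np.where(X[:, 0] < 3, 1 + 0.001 * X[:, 1], 10 + 0.01 * X[:, 1])
pred = predict_local(X, t, np.array([1.0, 5000.0]),
                     weights=np.array([10.0, 1e-4]), k=10)
```

Because the query's complexity matches family A, only family A's samples enter the local fit and the prediction follows that family's line (1 + 0.001*N), exactly the "regress only over similar jobs" idea.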

Prediction Module procedure: the job features are used to search for the nearest samples in the history; these, combined with the cost model's main factors, Tmap = α0 + α1*MapInput + α2*N + α3*N*log(N) + α4*(complexity of the map function) + α5*(conversion rate of the map data), yield the prediction function, which produces the prediction results.

Prediction Module procedure: the cost model and the find-neighbor module, applied over the training set, produce the prediction function and hence the prediction results.

Design - 3: Prediction Module. The Job Analyzer collects the parameters of type 2 and the static parameters collection module collects those of types 1 and 3; the COST MODEL and the Prediction Module then yield the Map and Reduce execution times and the CPU, disk, and network occupation times.

Experiments: task execution time (error rate) for 4 kinds of jobs with input sizes from 64 MB to 8 GB, comparing three configurations: K=12% with a different weight w per feature, K=12% with the same w for every feature, and K=25% with a different w per feature.

Conclusion. Job Analyzer: analyzes the job (jar + input file) and collects the parameters. Prediction Module: finds the main factors, proposes a linear equation, classifies jobs, and performs multiple predictions.

Thank you! Questions?

Cost Model [1]: analysis of Reduce. As for Map, the consumption of each resource (CPU, disk, network) is modeled separately, and each stage involves only one type of resource. The Reduce stages are: initiation, read data, network transfer, create object, reduce function, merge sort, read/write disk, write disk, serialization, and deserialization.

Prediction Model: main factors for the Reduce stage (according to the performance model) are the amount of input data, the number of input records N, N*log(N), the complexity of the reduce function, and the conversion rates of the map and reduce data: Treduce = β0 + β1*MapInput + β2*N + β3*N*log(N) + β4*(complexity of the reduce function) + β5*(conversion rate of the map data) + β6*(conversion rate of the reduce data)