
1 A Hadoop MapReduce Performance Prediction Method
Ge Song*+, Zide Meng*, Fabrice Huet*, Frederic Magoules+, Lei Yu# and Xuelian Lin#
* University of Nice Sophia Antipolis, CNRS, I3S, UMR 7271, France
+ Ecole Centrale de Paris, France
# Beihang University, Beijing, China

2 Background: Hadoop MapReduce
Input data stored in HDFS is divided into splits; map tasks emit (key, value) pairs, which are grouped into partitions (Partition1, Partition2, ...) and fed to reduce tasks.
[Figure: MapReduce data flow - input splits, map tasks, partitions, reduce tasks, HDFS]

3 Background: Hadoop
There are many steps within the Map stage and the Reduce stage, and different steps may consume different types of resources.
[Figure: Map task pipeline - read, map, sort, merge, output]

4 Motivation: Problems
Scheduling: schedulers take no account of execution time or of the different types of resources a job consumes (e.g. a CPU-intensive job).
Parameter tuning: Hadoop has numerous configuration parameters, the default values are not optimal, and yet jobs are usually run with the default configuration.

5 Motivation: Solution
Both problems (scheduling without knowledge of execution time and resource consumption, and tuning the numerous parameters whose defaults are not optimal) call for the same remedy: predict the performance of Hadoop jobs.

6 Related Work
Existing Prediction Method 1: Black-Box Based
Job features → statistical/learning models → execution time.
Shortcomings: no analysis of Hadoop internals; hard to choose the right job features.

7 Related Work
Existing Prediction Method 2: Cost-Model Based
F(map) = f(read, map, sort, spill, merge, write)
F(reduce) = f(read, write, merge, reduce, write)
Job features → analytic cost functions → execution time.
Shortcomings: Hadoop runs lots of concurrent processes, so the stages are hard to divide and accuracy is difficult to ensure.

8 Related Work
A Brief Summary of Existing Prediction Methods

Black Box
  Advantages: simple and effective; high accuracy when jobs are highly isomorphic.
  Shortcomings: lacks job-feature extraction and analysis of Hadoop itself; simple prediction only, with no analysis of the job (jar package + data).

Cost Model
  Advantages: detailed analysis of Hadoop processing; flexible division (by stage, by resource); multiple prediction targets.
  Shortcomings: hard to divide each step and resource; many concurrent processes are hard to model; better suited to theoretical analysis than to prediction.

9 Goal
Design a Hadoop MapReduce performance prediction system that, given a job, predicts:
- the execution time of the Map phase and the Reduce phase
- the job's consumption of each type of resource: CPU occupation time, disk occupation time, network occupation time

10 Design - 1: Cost Model
A job is fed into the cost model, which outputs the five predicted quantities: Map execution time, Reduce execution time, CPU occupation time, disk occupation time, network occupation time.
[Figure: Job → Cost Model → predictions]

11 Cost Model [1]
Analysis of the Map task:
- Model the consumption of each resource (CPU, disk, network)
- Each stage involves only one type of resource
[Figure: Map-task timeline per resource (CPU, disk, network). Stages: initiation, read data, network transfer, create object, map function, sort in memory, read/write disk, merge, write, serialization]
[1] X. Lin, Z. Meng, C. Xu, and M. Wang, "A practical performance model for hadoop mapreduce," in CLUSTER Workshops, 2012, pp. 231–239.

12 Cost Model [1]
Analysis of the cost-function parameters:
Type One: constants. Hadoop system overhead, initialization overhead.
Type Two: job-related parameters. Computational complexity of the map function, number of map input records.
Type Three: parameters defined by the cost model. Sorting coefficient, complexity factor.
[1] X. Lin, Z. Meng, C. Xu, and M. Wang, "A practical performance model for hadoop mapreduce," in CLUSTER Workshops, 2012, pp. 231–239.

13 Parameters Collection
Type One and Type Three (static):
- Type One: run empty map tasks and compute the system overhead from the logs.
- Type Three: extract the sort code from the Hadoop source and sort a fixed number of records (see the sketch below).
Type Two (job-dependent):
- Naive approach: run the whole job and analyze its logs. High latency, large overhead.
- Our approach: sample the input data and analyze only the behavior of the map and reduce functions. Almost no latency, very low extra overhead. This is the Job Analyzer.
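A minimal sketch (not from the paper) of how the Type Three sorting coefficient could be measured: time an in-memory sort of N sample records and normalize by N*log2(N), mirroring the idea of extracting the sort step from the Hadoop source. The record count and the use of Arrays.sort as a stand-in for Hadoop's map-side sort are assumptions.

import java.util.Arrays;
import java.util.Random;

public class SortCoefficientProbe {
    public static void main(String[] args) {
        int n = 1_000_000;                        // sample size (assumption)
        long[] records = new Random(42).longs(n).toArray();

        long start = System.nanoTime();
        Arrays.sort(records);                     // stand-in for Hadoop's map-side sort
        long elapsedNs = System.nanoTime() - start;

        // Sorting coefficient: seconds per unit of N * log2(N)
        double coeff = (elapsedNs / 1e9) / (n * (Math.log(n) / Math.log(2)));
        System.out.printf("sorting coefficient ~ %.3e s%n", coeff);
    }
}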

14 Job Analyzer - Implementation
A Hadoop virtual execution environment that accepts the job's jar file and input data.
- Sampling Module: samples the input data at a fixed rate (less than 5%).
- MR Module: instantiates the user job's classes using Java reflection and runs the sampled records through them (see the sketch below).
- Analyze Module: measures the input data (amount and number of records), the relative computational complexity, and the data conversion rate (output/input).
[Figure: jar file + input data → virtual execution environment (Sampling, MR, Analyze modules) → job features]
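A minimal sketch of the MR Module's reflection step, assuming the user's classes are loaded from the submitted jar; the method and parameter names here are hypothetical placeholders, not the paper's actual implementation.

import java.lang.reflect.Constructor;
import java.net.URL;
import java.net.URLClassLoader;

public class MRModuleSketch {
    // jarUrl and mapperClassName would be read from the submitted job;
    // both are hypothetical placeholders here.
    public static Object instantiateUserClass(URL jarUrl, String mapperClassName)
            throws Exception {
        ClassLoader loader = new URLClassLoader(new URL[]{jarUrl},
                MRModuleSketch.class.getClassLoader());
        Class<?> mapperClass = Class.forName(mapperClassName, true, loader);
        Constructor<?> ctor = mapperClass.getDeclaredConstructor();
        ctor.setAccessible(true);                 // user mappers may not be public
        return ctor.newInstance();                // sampled records can now be mapped
    }
}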

15 Job Analyzer - Feasibility
- Data similarity: log records have a uniform format, so a small sample is representative.
- Execution similarity: every record is processed repeatedly by the same map and reduce functions, so the sampled behavior matches the full run.
[Figure: input data → splits → map → reduce]

16 Design - 2: Parameters Collection
- Job Analyzer: collects parameters of Type Two.
- Static Parameters Collection Module: collects parameters of Type One and Type Three.
Both feed the cost model; from now on, we can get the occupation time of each resource (CPU, disk, network) as well as the Map and Reduce execution times.
[Figure: Job Analyzer + Static Parameters Collection Module → Cost Model → predictions]

17 Prediction Model
Problem analysis: a Map task runs many steps concurrently, so the total time cannot be obtained by simply adding up the time of each part.
[Figure: Map-task timeline per resource (CPU, disk, network), showing overlapping stages]

18 Prediction Model
Main factors (according to the performance model) - Map stage:
Tmap = α0 + α1*MapInput + α2*N + α3*N*log(N) + α4*Cmap + α5*Rmap
where MapInput is the amount of input data, N is the number of input records, Cmap is the complexity of the map function, and Rmap is the conversion rate of the map data.
[Figure: Map stages annotated with the model terms they contribute to]
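Once the coefficients are fitted, evaluating the model is just a weighted sum of the factors. A sketch, with placeholder coefficient values rather than numbers from the paper:

public class MapTimeModel {
    // alpha[0..5] come from the regression; the values in main() are placeholders.
    public static double predictMapTime(double[] alpha, double mapInput, double n,
                                        double mapComplexity, double mapConvRate) {
        return alpha[0]
             + alpha[1] * mapInput                // amount of input data
             + alpha[2] * n                       // number of input records
             + alpha[3] * n * Math.log(n)         // N log N sort term
             + alpha[4] * mapComplexity           // relative complexity of map()
             + alpha[5] * mapConvRate;            // map data conversion rate
    }

    public static void main(String[] args) {
        double[] alpha = {0.5, 1e-8, 2e-6, 1e-7, 0.3, 0.1};  // placeholder fit
        System.out.println(predictMapTime(alpha, 64e6, 1e6, 1.2, 0.8) + " s");
    }
}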

19 Prediction Model
Experimental analysis: test 4 kinds of jobs ( records), extract the features for linear regression, and calculate the correlation coefficient (R²).

Jobs:  Dedup   WordCount  Project  Grep    Total
R²:    0.9982  0.9992     0.9991   0.9949  0.6157
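For reference, R² here is the usual coefficient of determination, 1 - SSres/SStot; a small helper makes the metric concrete:

import java.util.Arrays;

public class RSquared {
    public static double rSquared(double[] actual, double[] predicted) {
        double mean = Arrays.stream(actual).average().orElse(0);
        double ssTot = 0, ssRes = 0;
        for (int i = 0; i < actual.length; i++) {
            ssTot += (actual[i] - mean) * (actual[i] - mean);
            ssRes += (actual[i] - predicted[i]) * (actual[i] - predicted[i]);
        }
        return 1 - ssRes / ssTot;                 // 1 = perfect fit, 0 = no fit
    }
}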

20 Prediction Model
Execution time of Map: a very good linear relationship within the same kind of job, but no linear relationship across different kinds of jobs.
[Figure: Map execution time vs. number of records, one line per job type]

21 Instance-Based Linear Regression
Find the nearest jobs!
- Find the samples in the history logs nearest to the job to be predicted; "nearest" means similar jobs (top K nearest, with K = 10%-15% of the samples).
- Do a linear regression over the samples found.
- Calculate the predicted value.
Nearest: the weighted distance over job features (weight w).
- High contribution to job classification: map/reduce complexity, map/reduce data conversion rate.
- Low contribution to job classification: data amount, number of records.
A sketch of the neighbor search follows.
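A minimal sketch under stated assumptions: features are ordered (map complexity, reduce complexity, map conversion rate, reduce conversion rate, data amount, record count), and the weights simply favor the high-contribution features. The actual weights w are tuned in the paper, not fixed like this.

import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class NeighborFinder {
    // Heavier weights on the high-contribution features named on the slide
    // (placeholder values, not the paper's).
    static final double[] W = {4.0, 4.0, 4.0, 4.0, 1.0, 1.0};

    static double distance(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            d += W[i] * diff * diff;              // weighted squared difference
        }
        return Math.sqrt(d);
    }

    // Keep the nearest K = 12% of the historical samples (mid-range of 10%-15%).
    static List<double[]> nearest(double[] query, List<double[]> history) {
        int k = Math.max(1, (int) (0.12 * history.size()));
        return history.stream()
                .sorted(Comparator.comparingDouble((double[] h) -> distance(query, h)))
                .limit(k)
                .collect(Collectors.toList());
    }
}

The selected neighbors are then used to fit the linear regression above.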

22 Prediction Module - Procedure
Job features → search for the nearest samples in the history logs → the cost model supplies the main factors (Tmap = α0 + α1*MapInput + α2*N + α3*N*log(N) + α4*Cmap + α5*Rmap) → fit the prediction function → prediction results.
[Figure: numbered data flow (steps 1-7) through the prediction module]

23 Prediction Module - Procedure
Cost Model + Find-Neighbor Module + Training Set → Prediction Function → Prediction Results
[Figure: module diagram]

24 Design - 3: Prediction Module
- Job Analyzer: collects parameters of Type Two.
- Static Parameters Collection Module: collects parameters of Type One and Type Three.
- Cost Model + Prediction Module: output the Map execution time, the Reduce execution time, and the occupation time of each resource (CPU, disk, network).
[Figure: full system; the Job Analyzer and the static collection module feed the cost model, whose output drives the prediction module]

25 Experiments
Task execution time (error rate), for 4 kinds of jobs with input sizes from 64 MB to 8 GB, comparing three settings:
- K = 12%, with a different weight w for each feature
- K = 12%, with the same weight w for each feature
- K = 25%, with a different weight w for each feature
[Figure: error rate per job ID for the three settings]

26 Conclusion
Job Analyzer:
- Analyzes the job (jar + input file)
- Collects parameters
Prediction Module:
- Finds the main factors
- Proposes a linear equation
- Classifies jobs
- Predicts multiple targets (time and resources)

27 Thank you! Questions?

28 Cost Model [1]
Analysis of the Reduce task:
- Model the consumption of each resource (CPU, disk, network)
- Each stage involves only one type of resource
[Figure: Reduce-task timeline per resource (CPU, disk, network). Stages: initiation, read data, network transfer, create object, reduce function, merge sort, read/write disk, write disk, serialization, deserialization]

29 Prediction Model
Main factors (according to the performance model) - Reduce stage:
Treduce = β0 + β1*MapInput + β2*N + β3*N*log(N) + β4*Creduce + β5*Rmap + β6*Rreduce
where MapInput is the amount of input data, N is the number of input records, Creduce is the complexity of the reduce function, Rmap is the conversion rate of the map data, and Rreduce is the conversion rate of the reduce data.
[Figure: Reduce stages annotated with the model terms they contribute to]

