A Hadoop MapReduce Performance Prediction Method
Ge Song*+, Zide Meng*, Fabrice Huet*, Frederic Magoules+, Lei Yu# and Xuelian Lin#
* University of Nice Sophia Antipolis, CNRS, I3S, UMR 7271, France
+ Ecole Centrale de Paris, France
# Beihang University, Beijing, China
Background - Hadoop MapReduce
[Diagram: the input data in HDFS is split among map tasks; each map task emits (key, value) pairs, which are partitioned (Partition1, Partition2, ...) and shuffled to the reduce tasks.]
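The slides assume familiarity with this (key, value) flow; for concreteness, here is the canonical WordCount job in the standard org.apache.hadoop.mapreduce API (an illustration, not taken from the presentation):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// WordCount: the map function emits (word, 1) for every token in its split;
// the framework groups pairs by key, and the reduce function sums the counts.
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);      // emit one (key, value) pair per word
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }
}
```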
Background - Hadoop
There are many steps within the Map stage and the Reduce stage, and different steps consume different types of resources.
[Diagram: a map task proceeds through read, sort, merge, and output steps.]
Motivation - Problems
- Scheduling: the scheduler takes no account of execution time or of the different types of resources a job consumes (e.g., whether it is CPU-intensive).
- Parameter tuning: Hadoop exposes numerous configuration parameters and the default values are not optimal, yet jobs typically run with the default configuration.
Motivation - Solution
Both problems call for the same remedy: predict the performance of Hadoop jobs. A scheduler can then account for execution time and resource consumption, and predictions can guide tuning away from the non-optimal defaults.
Related Work - Existing Prediction Method 1: Black-Box Based
Job features are fed into statistical or machine-learning models that output an execution time. Drawbacks: no analysis of Hadoop's internals, and the job features are hard to choose.
Related Work - Existing Prediction Method 2: Cost-Model Based
Job features are plugged into analytical cost functions that output an execution time:
F(map) = f(read, map, sort, spill, merge, write)
F(reduce) = f(read, write, merge, reduce, write)
Drawbacks: Hadoop runs lots of concurrent processes, so the stages are hard to divide cleanly and accuracy is difficult to ensure.
Related Work - A Brief Summary of Existing Prediction Methods
Black Box:
- Advantages: simple and effective; high accuracy; high isomorphism.
- Shortcomings: lacks job-feature extraction; lacks analysis of Hadoop; offers only a simple prediction with no analysis of the job itself (jar package + input data).
Cost Model:
- Advantages: detailed analysis of Hadoop processing; the division is flexible (by stage or by resource); supports multiple predictions.
- Shortcomings: each step and resource is hard to separate; the many concurrent processes are hard to model; better suited to theoretical analysis than to prediction.
Goal - Prediction System
Design a Hadoop MapReduce performance prediction system to:
- Predict a job's consumption of various types of resources (CPU, disk I/O, network)
- Predict the execution time of the Map phase and the Reduce phase
[Diagram: a job enters the prediction system, which outputs the Map execution time, the Reduce execution time, and the CPU, disk, and network occupation times.]
Design - 1: Cost Model
[Diagram: a job enters the cost model, which outputs the Map execution time, the Reduce execution time, and the CPU, disk, and network occupation times.]
Cost Model [1] - Analysis of the Map Task
- Model the consumption of each resource (CPU, disk, network).
- Each stage involves only one type of resource.
[Diagram: Map steps assigned to resources. CPU: initiation, create object, map function, sort in memory, serialization. Disk: read data, read/write disk, merge, write. Network: network transfer.]
[1] X. Lin, Z. Meng, C. Xu, and M. Wang, "A practical performance model for hadoop mapreduce," in CLUSTER Workshops, 2012, pp. 231-239.
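Reading the diagram's per-resource assignment schematically (a sketch of the model's shape, not the exact formulas of [1]):

$$T_{map}^{CPU} = T_{init} + T_{create} + T_{func} + T_{sort} + T_{ser}$$
$$T_{map}^{Disk} = T_{read} + T_{r/w} + T_{merge} + T_{write}$$
$$T_{map}^{Net} = T_{transfer}$$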
Cost Model [1] - Cost-Function Parameter Analysis
- Type One: constants - the Hadoop system overhead and the initialization overhead.
- Type Two: job-related parameters - the computational complexity of the map function and the number of map input records.
- Type Three: parameters defined by the cost model - the sorting coefficient and the complexity factor.
[1] X. Lin, Z. Meng, C. Xu, and M. Wang, "A practical performance model for hadoop mapreduce," in CLUSTER Workshops, 2012, pp. 231-239.
Parameter Collection
Type One and Type Three (collected statically):
- Type One: run empty map tasks and compute the system overhead from the logs.
- Type Three: extract the sort code from the Hadoop source and time it sorting a known number of records (see the calibration sketch below).
Type Two (job-related), two alternatives:
- Run the job and analyze its logs: high latency and a large overhead.
- Sample the input data and analyze only the behavior of the map and reduce functions: almost no latency and very low extra overhead. This is the role of the Job Analyzer.
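A minimal sketch of the Type Three calibration idea, with hypothetical names throughout; Arrays.sort stands in for the sort code extracted from the Hadoop source, and the coefficient c is fitted so that t ≈ c · n · log(n):

```java
import java.util.Arrays;
import java.util.Random;

public class SortCalibration {
    // Time the sorting of n synthetic records and return the coefficient c
    // such that elapsed ≈ c * n * log(n).
    public static double sortingCoefficient(int n) {
        String[] records = new String[n];
        Random rnd = new Random(42);
        for (int i = 0; i < n; i++) records[i] = Long.toHexString(rnd.nextLong());

        long start = System.nanoTime();
        Arrays.sort(records);                         // stand-in for Hadoop's sort step
        double elapsed = (System.nanoTime() - start) / 1e9;

        return elapsed / (n * Math.log(n));
    }

    public static void main(String[] args) {
        System.out.printf("sorting coefficient: %.3e s%n", sortingCoefficient(1_000_000));
    }
}
```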
Job Analyzer - Implementation
A Hadoop virtual execution environment accepts the job (jar file + input data) and produces the job features:
- Sampling Module: samples the input data at a fixed percentage (less than 5%).
- MR Module: instantiates the user job's classes using Java reflection and runs them on the sample.
- Analyze Module: derives the input data volume (amount and number of records), the relative computational complexity, and the data conversion rate (output/input).
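A minimal sketch of the MR Module idea, with hypothetical class and method names: load the user's jar, instantiate its mapper class via reflection, and time a proxy pass over the sampled records. A real implementation would invoke mapper.map(key, value, context) with a mock Context; that plumbing is omitted here.

```java
import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;

public class MRModuleSketch {
    public static long profile(String jarPath, String mapperClassName,
                               Iterable<String> sampledRecords) throws Exception {
        try (URLClassLoader loader = new URLClassLoader(
                new URL[] { new File(jarPath).toURI().toURL() },
                MRModuleSketch.class.getClassLoader())) {
            // Instantiate the user's class from the submitted jar via reflection.
            Class<?> mapperClass = Class.forName(mapperClassName, true, loader);
            Object mapper = mapperClass.getDeclaredConstructor().newInstance();
            System.out.println("instantiated " + mapper.getClass().getName());

            long start = System.nanoTime();
            long bytes = 0;
            for (String record : sampledRecords) {
                bytes += record.length();   // placeholder for calling map() per record
            }
            long elapsed = System.nanoTime() - start;
            System.out.printf("processed %d sampled bytes in %d ns%n", bytes, elapsed);
            return elapsed;                 // feeds the relative-complexity estimate
        }
    }
}
```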
Job Analyzer - Feasibility
- Data similarity: log records have a uniform format, so a small sample is representative of the whole input.
- Execution similarity: every record is processed by the same map and reduce functions, so per-record behavior measured on the sample carries over to the full input.
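This is what makes a simple record-level sample sufficient. A minimal sketch of the Sampling Module, assuming line-oriented input and the under-5% rate from the slides (class and method names are hypothetical):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class SamplingModule {
    // Keep each input line with probability p (e.g. p = 0.05 for a 5% sample).
    public static List<String> sample(String path, double p) throws IOException {
        List<String> sample = new ArrayList<>();
        Random rnd = new Random();
        try (BufferedReader r = Files.newBufferedReader(Paths.get(path))) {
            String line;
            while ((line = r.readLine()) != null) {
                if (rnd.nextDouble() < p) sample.add(line);
            }
        }
        return sample;
    }
}
```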
Design - 2: Parameter Collection
- Job Analyzer: collects the Type Two parameters.
- Static Parameter Collection Module: collects the Type One and Type Three parameters.
Both feed the cost model; from now on we can obtain the occupation time of each resource (CPU, disk, network) as well as the Map and Reduce execution times.
Prediction Model - Problem Analysis
Many steps run concurrently, so the total time cannot be obtained by simply adding up the time of each part.
[Diagram: the CPU, disk, and network timelines of a map task (initiation, read data, network transfer, create object, map function, sort in memory, read/write disk, merge, write, serialization) overlap in time.]
Prediction Model - Main Factors (according to the performance model), Map Stage:
Tmap = α0 + α1·MapInput + α2·N + α3·N·log(N) + α4·Cmap + α5·Rmap
where MapInput is the amount of input data, N is the number of input records, Cmap is the complexity of the map function, and Rmap is the conversion rate of the map data.
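The slides do not say how the α coefficients are obtained; a minimal sketch, assuming they are fitted by ordinary least squares over profiled runs via the normal equations (XᵀX)α = Xᵀy:

```java
public class MapTimeRegression {

    // Build one design row [1, MapInput, N, N*log(N), complexity, conversionRate].
    static double[] row(double mapInput, double n, double complexity, double convRate) {
        return new double[] { 1.0, mapInput, n, n * Math.log(n), complexity, convRate };
    }

    // Fit alpha_0..alpha_5 from profiled runs X and observed map times y by
    // solving (X^T X) a = X^T y with Gauss-Jordan elimination and pivoting.
    static double[] fit(double[][] X, double[] y) {
        int p = X[0].length;
        double[][] A = new double[p][p + 1];          // augmented [X^T X | X^T y]
        for (int r = 0; r < X.length; r++)
            for (int i = 0; i < p; i++) {
                for (int j = 0; j < p; j++) A[i][j] += X[r][i] * X[r][j];
                A[i][p] += X[r][i] * y[r];
            }
        for (int col = 0; col < p; col++) {
            int piv = col;                            // partial pivoting
            for (int r = col + 1; r < p; r++)
                if (Math.abs(A[r][col]) > Math.abs(A[piv][col])) piv = r;
            double[] t = A[col]; A[col] = A[piv]; A[piv] = t;
            for (int r = 0; r < p; r++) {             // eliminate the column elsewhere
                if (r == col || A[col][col] == 0.0) continue;
                double f = A[r][col] / A[col][col];
                for (int c = col; c <= p; c++) A[r][c] -= f * A[col][c];
            }
        }
        double[] a = new double[p];
        for (int i = 0; i < p; i++) a[i] = A[i][p] / A[i][i];
        return a;
    }
}
```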
Prediction Model - Experimental Analysis
Test four kinds of jobs, extract the features, run a linear regression, and compute the coefficient of determination (R²):

Job | Dedup  | WordCount | Project | Grep   | Total
R²  | 0.9982 | 0.9992    | 0.9991  | 0.9949 | 0.6157
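For reference, the R² values in the table follow the standard definition; a short sketch, assuming vectors of actual and predicted map times per job type:

```java
public class RSquared {
    // R^2 = 1 - SS_res / SS_tot, the fraction of variance explained by the fit.
    static double rSquared(double[] actual, double[] predicted) {
        double mean = 0;
        for (double v : actual) mean += v;
        mean /= actual.length;

        double ssRes = 0, ssTot = 0;
        for (int i = 0; i < actual.length; i++) {
            ssRes += (actual[i] - predicted[i]) * (actual[i] - predicted[i]);
            ssTot += (actual[i] - mean) * (actual[i] - mean);
        }
        return 1.0 - ssRes / ssTot;
    }
}
```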
Prediction Model - Execution Time of Map
Map execution time has a very good linear relationship with the number of records within the same kind of job, but no linear relationship across different kinds of jobs (hence the low total R² above).
Find the nearest jobs! Instance-Based Linear Regression
- Find the samples nearest to the job to be predicted in the history logs, where "nearest" means similar jobs (the top K nearest, with K = 10%-15%).
- Run a linear regression over the samples found.
- Compute the predicted value.
"Nearest" is defined by a weighted distance over the job features (weights w). The map/reduce complexity and the map/reduce data conversion rate contribute strongly to job classification; the data amount and the number of records contribute little. A sketch follows this list.
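A minimal sketch of the Find-Neighbor step under these assumptions; the feature order and the weight values are illustrative, chosen only to reflect the high/low contributions named above:

```java
import java.util.Arrays;
import java.util.Comparator;

public class FindNeighbor {
    // Features: [data amount, record count, map complexity, conversion rate].
    // Assumed weights: complexity and conversion rate dominate the distance.
    static final double[] WEIGHTS = { 0.1, 0.1, 0.4, 0.4 };

    static double distance(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            d += WEIGHTS[i] * diff * diff;
        }
        return Math.sqrt(d);
    }

    // Keep the top fraction (e.g. 0.12 for K = 12%) of history jobs nearest
    // to the query job; a linear regression is then fitted on these only.
    static double[][] nearest(double[][] history, double[] query, double fraction) {
        int k = Math.max(1, (int) (history.length * fraction));
        return Arrays.stream(history)
                .sorted(Comparator.comparingDouble(h -> distance(h, query)))
                .limit(k)
                .toArray(double[][]::new);
    }
}
```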
Prediction Module - Procedure
[Diagram: the job features are used to search for the nearest samples in the history logs; the cost model supplies the main factors (Tmap = α0 + α1·MapInput + α2·N + α3·N·log(N) + α4·Cmap + α5·Rmap); a prediction function is fitted and the prediction results are produced.]
Prediction Module - Procedure (cont.)
[Diagram: the cost model and the Find-Neighbor module build a training set; the prediction function is fitted on it and emits the prediction results.]
Design - 3: Prediction Module
- Job Analyzer: collects the Type Two parameters.
- Static Parameter Collection Module: collects the Type One and Type Three parameters.
Both feed the cost model, and the prediction module outputs the Map execution time, the Reduce execution time, and the CPU, disk, and network occupation times.
Experiments - Task Execution Time (Error Rate)
Four kinds of jobs with input sizes from 64 MB to 8 GB, comparing three settings:
- K = 12%, with a different weight w for each feature
- K = 12%, with the same weight w for every feature
- K = 25%, with a different weight w for each feature
[Plot: error rate per job ID for each setting.]
Conclusion
- Job Analyzer: analyzes the job (jar + input file) and collects the parameters.
- Prediction Module: finds the main factors, proposes a linear equation, classifies jobs, and produces multiple predictions.
Thank you! Questions?
Cost Model [1] - Analysis of the Reduce Task
- Model the consumption of each resource (CPU, disk, network).
- Each stage involves only one type of resource.
[Diagram: Reduce steps assigned to resources. CPU: initiation, create object, reduce function, merge sort, serialization, deserialization. Disk: read data, read/write disk, write disk. Network: network transfer.]
Prediction Model - Main Factors (according to the performance model), Reduce Stage:
Treduce = β0 + β1·MapInput + β2·N + β3·N·log(N) + β4·Creduce + β5·Rmap + β6·Rreduce
where MapInput is the amount of input data, N is the number of input records, Creduce is the complexity of the reduce function, Rmap is the conversion rate of the map data, and Rreduce is the conversion rate of the reduce data.