Achieving Elasticity for Cloud MapReduce Jobs Khaled Salah IEEE CloudNet 2013 – San Francisco November 13, 2013
p2 Outline r Background and motivation r Uses cases of our analytical model r Analytical model r Derived performance metrics r Numerical results r Conclusions and future work
p3 Background and Motivation r MapReduce is a popular paradigm that can parallelize large data processing on cloud clusters. r MR paradigm is a key enabler for Big Data analytics r MR Jobs – e.g. web search engine requests r In cloud computing, a critical research problem is how to achieve elasticity for MR jobs as the workload conditions change over time.
p4 Elasticity r Elasticity is how fast the cloud responds (or autoscales) to a given workload to reach perfect capacity. Overprovisioning D(t) < R(t) Underprovisioning D(t) > R(t) Perfect Provisioning D(t) = R(t)
p5 MapReduce Jobs
p6 Usefulness of our model (1/2) r In elasticity and autoscaling: given workload conditions, we can estimate the required number of VMs to meet the SLO delay requirements And not by trial and error CPU utilization can be misleading r Determine the required slave nodes required to execute MR jobs
p7 Usefulness of our model (2/2) r In call admission To accept or deny cloud requests based on meeting the SLO delay Available compute resources are not enough r Estimating the end-to-end delay for elastic MR jobs
p8 Typical Cloud Datacenter Architecture
p9 M/G/1/K Queueing Model
p10 M/G/1/K
p11 Analysis Approach r The challenge in analyzing such a queueing system is to compute or the PDF of the generally distributed random variable X representing the service times r The mean service time E[X] r Then, the second stage random service time B for these N parallel workers can be expressed as r E[B] can be expressed as
p12 Analysis Approach r For the Reducer stage, r E[R] can be expressed as r Therefore, the mean service time E[X]
p13 Performance r Given: Incoming load JS and service rates for each mapper & reducer Queue size r Formulas for: Response time Throughput Loss probability
p14 Numerical Example r We fix the system size K to 100 requests. We fix r depends on two factors: (1) m-- the number of mapper per node, and (2) the execution speed of each node. If we assume a reducer takes 500 ms to be executed on a single node, and with homogenous splitting, then ms.
p15 Numerical Example r Similarly, depends on two factors: (1) n-- the number of mapper per node, and (2) the execution speed of each node. If we assume a reducer takes 100 ms to be executed on a single node, and with homogenous splitting, then ms. r For autoscaling, we assume that the mappers and reducers always autoscale with a ratio of 2:1. That is, one reducer is needed for two mappers, or
p16 Service Delay vs. Workload
p17
p18
p19
p20 Concluding Remarks r We presented analytical model to estimate the minimum number of cloud resources required for executing MapReduce jobs on the cloud r Closed-form solutions were derived for key SLO performance metrics such as response time, blocking probability, and throughput. r Simulation results show that our analytical model is correct. r Future work will be on implementation
p21 Thank you!
p22 Q&A