Profiling, What-if Analysis and Cost-based Optimization of MapReduce Programs, Oct 7th 2013, Database Lab, Wonseok Choi


1 Profiling, What-if Analysis and Cost-based Optimization of MapReduce Programs, Oct 7th 2013, Database Lab, Wonseok Choi

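2 The day before the presentation

3 If I can't give this presentation, it's over! Getting the credit will be impossible! Plus the graduation exam!

4 Can I finish preparing the presentation in time without dying?

5 Table of Contents
1. Introduction
2. Profiler
3. What-if engine
4. Cost-based optimizer
5. Experimental evaluation
6. Conclusion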

6 Introduction
MapReduce has emerged as a viable competitor to database systems in big data analytics.
Three components: Profiler, What-if Engine, Cost-based Optimizer
- Profiler: collects detailed statistical information from unmodified MapReduce programs.
- What-if Engine: performs fine-grained cost estimation.
- Cost-based Optimizer: optimizes configuration parameter settings.

7 Introduction
MapReduce job J
- J = <p, d, r, c>
- p: MapReduce program
- d: input data, processed by the two functions map(k1, v1) and reduce(k2, list(v2))
- r: cluster resources
- c: configuration parameter settings

8 Introduction
Configuration parameter settings include:
- The number of map tasks
- The number of reduce tasks
- The amount of memory
- The settings for multiphase external sorting
- Whether the output data from the map (reduce) tasks should be compressed before being written to disk
- Whether a program-specified Combiner function should be used to preaggregate map outputs before their transfer to reduce tasks

9 Introduction

10 Introduction

11 Introduction
Cost-based Optimization to Select Configuration Parameter Settings Automatically
- perf = F(p, d, r, c)
- perf is some performance metric of interest for jobs.
- Optimizing the performance of program p for given input data d and cluster resources r requires finding configuration parameter settings c that give near-optimal values of perf.

12 Introduction
MapReduce program optimization poses new challenges compared to conventional database query optimization:
- Black-box map and reduce functions
- Lack of schema and statistics about the input data
- Differences in plan spaces
Cost-based optimization relies on three components:
- Profiler
- What-if Engine
- Cost-based Optimizer

13 Profiler
Phases of Map Task Execution
- Read, Map, Collect, Spill, Merge
Phases of Reduce Task Execution
- Shuffle, Merge, Reduce, Write

14 Profiler
Job Profiler
- A MapReduce job profile is a vector in which each field captures some unique aspect of dataflow or cost during job execution, at the task level or at the phase level within tasks.
- Dataflow fields
- Cost fields
- Dataflow Statistics fields
- Cost Statistics fields
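The profile structure described on this slide can be pictured as a record with four groups of numeric fields. The following is only a minimal sketch: the class name `JobProfile` and every field name in it (`map_input_bytes`, `map_size_selectivity`, and so on) are hypothetical illustrations, not the field names used by the actual Profiler.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class JobProfile:
    """Toy job profile with the four field groups named on the slide.

    All field names stored in these dictionaries are made up for illustration.
    """
    dataflow: Dict[str, float] = field(default_factory=dict)        # amounts of data moved
    cost: Dict[str, float] = field(default_factory=dict)            # time spent per phase
    dataflow_stats: Dict[str, float] = field(default_factory=dict)  # e.g. selectivities
    cost_stats: Dict[str, float] = field(default_factory=dict)      # e.g. per-byte costs

    def as_vector(self) -> List[Tuple[str, float]]:
        """Flatten the profile into (qualified_name, value) pairs in a stable order."""
        out = []
        for group in ("dataflow", "cost", "dataflow_stats", "cost_stats"):
            for name, value in sorted(getattr(self, group).items()):
                out.append((f"{group}.{name}", value))
        return out


profile = JobProfile(
    dataflow={"map_input_bytes": 1e9, "map_output_bytes": 2e8},
    cost={"map_phase_sec": 12.0},
    dataflow_stats={"map_size_selectivity": 0.2},
    cost_stats={"read_cost_sec_per_byte": 1e-8},
)
for name, value in profile.as_vector():
    print(name, value)
```

Keeping the four groups separate mirrors the distinction the later slides rely on: the What-if Engine estimates each group of the virtual profile with a different technique.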

15 Profiler
Using Profiles to Analyze Job Behavior

16 Profiler
Generating Profiles via Measurement
- Job profiles are generated in two distinct ways: measured by the Profiler, or estimated by the What-if Engine.
- Monitoring through dynamic instrumentation
- From raw monitoring data to profile fields
- Task-level sampling to generate approximate profiles

17 What-if Engine
A what-if question has the following form:
- Given the profile of a job j = <p, d1, r1, c1> that runs a MapReduce program p over input data d1 and cluster resources r1 using configuration c1, what will the performance of program p be if p is run over input data d2 and cluster resources r2 using configuration c2? That is, how will job j' = <p, d2, r2, c2> perform?
The What-if Engine executes the following two steps to answer a what-if question:
- Estimating a virtual job profile for the hypothetical job j'.
- Using the virtual profile to simulate how j' will execute.
We will discuss these steps in turn.

18 What-if Engine
Estimating the Virtual Profile
- Estimating Dataflow and Cost fields
- Estimating Dataflow Statistics fields
- Estimating Cost Statistics fields

19 What-if Engine
Estimating Dataflow and Cost fields
- A detailed set of analytical (white-box) models for estimating the Dataflow and Cost fields in the virtual job profile for j'.
Estimating Dataflow Statistics fields
- Dataflow proportionality assumption
Estimating Cost Statistics fields
- Cluster node homogeneity assumption
Simulating the Job Execution
- Task Scheduler Simulator
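To make the two ideas on this slide concrete, here is a rough sketch under simplifying assumptions. The first function reads the dataflow proportionality assumption as "the ratio of map output size to map input size measured on d1 carries over to d2". The second is a toy stand-in for a task scheduler simulator that assigns tasks greedily, in order, to whichever slot frees up first. Both function names and the greedy policy are assumptions made for illustration, not the What-if Engine's actual models.

```python
import heapq


def estimate_map_output_bytes(measured_in, measured_out, new_in):
    """Scale the measured map output size in proportion to the new input size."""
    return new_in * measured_out / measured_in


def simulate_makespan(task_durations, num_slots):
    """Return the finish time when tasks run in order on num_slots parallel slots.

    Each task is given to the slot that becomes free earliest (greedy policy).
    """
    slots = [0.0] * num_slots          # time at which each slot becomes free
    heapq.heapify(slots)
    for duration in task_durations:
        start = heapq.heappop(slots)   # earliest available slot
        heapq.heappush(slots, start + duration)
    return max(slots)


# Measured on d1: 100 bytes in, 20 bytes out. Hypothetical d2 has 500 bytes in.
print(estimate_map_output_bytes(100, 20, 500))
# Five 10-second tasks on two slots run in three waves.
print(simulate_makespan([10, 10, 10, 10, 10], 2))
```

The second call illustrates why the number of tasks relative to the number of slots matters: five equal tasks on two slots need three waves, so the last wave leaves one slot idle.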

20 Cost-based Optimizer (CBO)
MapReduce program optimization can be defined as:
- Given a MapReduce program p to be run on input data d and cluster resources r, find the setting of configuration parameters c_opt = argmin_{c ∈ S} F(p, d, r, c) for the cost model F represented by the What-if Engine over the full space S of configuration parameter settings.
The CBO addresses this problem by making what-if calls with settings c of the configuration parameters selected through an enumeration and search over S.
Once a job profile to input to the What-if Engine is available, the CBO uses a two-step process, discussed next.
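The optimization problem on this slide amounts to minimizing a cost function over a space of settings, where each evaluation is one what-if call. A minimal sketch follows, assuming a tiny discrete space and a made-up cost function. `what_if_cost`, the parameter names `num_reducers` and `sort_buffer_mb`, and the cost formula are all hypothetical; the real F is the What-if Engine's estimate, not a closed-form expression.

```python
from itertools import product


def what_if_cost(config):
    """Hypothetical stand-in for F(p, d, r, c) with p, d, r held fixed."""
    reducers = config["num_reducers"]
    sort_mb = config["sort_buffer_mb"]
    # Made-up shape: too few reducers is slow, too many adds overhead.
    return 1000.0 / reducers + 2.0 * reducers + abs(sort_mb - 200) * 0.05


def optimize(space, cost_fn):
    """Exhaustively enumerate the space and keep the cheapest setting."""
    names = sorted(space)
    best_config, best_cost = None, float("inf")
    for values in product(*(space[n] for n in names)):
        config = dict(zip(names, values))
        cost = cost_fn(config)            # one what-if call per candidate
        if cost < best_cost:
            best_config, best_cost = config, cost
    return best_config, best_cost


space = {"num_reducers": [1, 5, 10, 20, 40], "sort_buffer_mb": [100, 200, 400]}
best_config, best_cost = optimize(space, what_if_cost)
print(best_config, best_cost)
```

Exhaustive enumeration is only workable for toy spaces like this one, since the number of candidates grows multiplicatively with each added parameter.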

21 Cost-based Optimizer (CBO)
Subspace Enumeration
- A straightforward approach the CBO can take is to apply enumeration and search techniques to the full space of parameter settings S.
- More efficient search techniques can be developed if the individual parameters in c can be grouped into clusters.
- Equation 2 states that the globally optimal setting c_opt can be found using a divide-and-conquer approach by:
  - breaking the higher-dimensional space S into the lower-dimensional subspaces S^(i)
  - considering an independent optimization problem in each smaller subspace
  - composing the optimal parameter settings found per subspace to give the setting c_opt
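The divide-and-conquer idea can be sketched as follows, under the loudly stated assumption that the total cost is the sum of a map-side cost and a reduce-side cost that share no parameters. The cost functions and parameter names below are invented for illustration; the slide's Equation 2 is not reproduced in this transcript and is not what this code encodes.

```python
from itertools import product


def grid_argmin(space, cost_fn):
    """Return (best setting, number of candidates evaluated) for a discrete space."""
    names = sorted(space)
    candidates = [dict(zip(names, values))
                  for values in product(*(space[n] for n in names))]
    return min(candidates, key=cost_fn), len(candidates)


def map_cost(c):      # hypothetical map-side cost
    return abs(c["sort_buffer_mb"] - 200) * 0.1 + (0.0 if c["compress_map_output"] else 5.0)


def reduce_cost(c):   # hypothetical reduce-side cost
    return 1000.0 / c["num_reducers"] + 2.0 * c["num_reducers"]


map_space = {"sort_buffer_mb": [100, 200, 400], "compress_map_output": [False, True]}
reduce_space = {"num_reducers": [5, 10, 20, 40]}

# Clustered: solve each subspace independently, then compose the results.
c_opt, clustered_calls = {}, 0
for space, cost_fn in [(map_space, map_cost), (reduce_space, reduce_cost)]:
    best, calls = grid_argmin(space, cost_fn)
    c_opt.update(best)
    clustered_calls += calls

# Full: search the cross product of all parameters at once.
full_space = {**map_space, **reduce_space}
c_full, full_calls = grid_argmin(full_space, lambda c: map_cost(c) + reduce_cost(c))

print(c_opt, clustered_calls)
print(c_full, full_calls)
```

In this toy setup both approaches arrive at the same setting, but the clustered search evaluates 6 + 4 candidates while the full search evaluates 6 × 4.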

22 Cost-based Optimizer (CBO)
Search Strategy within a Subspace
- Search within each enumerated subspace to find the optimal configuration in the subspace.
- Gridding (Equispaced or Random)
- Recursive Random Search (RRS)
  - RRS provides probabilistic guarantees on how close the setting it finds is to the optimal setting.
  - RRS is fairly robust to deviations of estimated costs from actual performance.
  - RRS scales to a large number of dimensions.
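The general flavor of a random search that alternates exploration and exploitation can be sketched as below. This is a deliberately simplified variant written for illustration: it samples the space uniformly, then repeatedly samples a shrinking box around the best point found. It has no restarts and does not carry the probabilistic guarantees attributed to RRS on the slide. The function name, its parameters, and `toy_cost` are all assumptions.

```python
import random


def simplified_random_search(cost_fn, bounds, n_explore=30, max_fail=5,
                             shrink=0.5, tol=1e-3, max_evals=5000, seed=0):
    """Explore uniformly, then exploit in a shrinking box around the best point."""
    rng = random.Random(seed)

    def sample(center=None, radius_frac=None):
        point = []
        for i, (lo, hi) in enumerate(bounds):
            if center is None:
                point.append(rng.uniform(lo, hi))
            else:
                half = radius_frac * (hi - lo)
                # Clip the local box to the overall bounds.
                point.append(rng.uniform(max(lo, center[i] - half),
                                         min(hi, center[i] + half)))
        return point

    # Exploration: uniform samples over the whole space.
    best = min((sample() for _ in range(n_explore)), key=cost_fn)
    best_cost = cost_fn(best)

    # Exploitation: shrink the box after max_fail consecutive non-improvements.
    radius_frac, fails, evals = 0.25, 0, 0
    while radius_frac > tol and evals < max_evals:
        cand = sample(best, radius_frac)
        cand_cost = cost_fn(cand)
        evals += 1
        if cand_cost < best_cost:
            best, best_cost, fails = cand, cand_cost, 0
        else:
            fails += 1
            if fails >= max_fail:
                radius_frac *= shrink
                fails = 0
    return best, best_cost


def toy_cost(p):
    """Made-up smooth cost with its minimum at (3, -1)."""
    return (p[0] - 3) ** 2 + (p[1] + 1) ** 2


bounds = [(-10.0, 10.0), (-10.0, 10.0)]
best_point, best_value = simplified_random_search(toy_cost, bounds)
print(best_point, best_value)
```

Unlike gridding, the number of cost evaluations here does not grow with the grid resolution per dimension, which is one intuition for why sampling-based search is attractive in higher-dimensional spaces.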

23 Cost-based Optimizer (CBO)
- There are two choices for subspace enumeration, Full or Clustered, which deal with the full space or with smaller subspaces for map and reduce tasks, respectively.
- There are three choices for search within a subspace: Equispaced Gridding, Random Gridding, and RRS.

24 Experimental Evaluation

25 Experimental Evaluation

26 Experimental Evaluation

27 Experimental Evaluation

28 Experimental Evaluation

29 Experimental Evaluation

30 Discussion and Future Work
- A Cost-based Optimizer for simple to arbitrarily complex MapReduce programs.
- Several new research challenges arise when we consider the full space of optimization opportunities provided by these higher-level systems.
- Proposed a lightweight Profiler to collect detailed statistical information from unmodified MapReduce programs.
- Proposed a What-if Engine for the fine-grained cost estimation needed by the Cost-based Optimizer.

31 Q & A

32 Good! That went reasonably well …

