CherryPick: Adaptively Unearthing the Best



Presentation on theme: "CherryPick: Adaptively Unearthing the Best"— Presentation transcript:

1 CherryPick: Adaptively Unearthing the Best
Cloud Configurations for Big Data Analytics
(How to adaptively recommend the best cloud configuration for big-data analytics jobs)
Omid Alipourfard, Hongqiang Harry Liu, Jianshu Chen, Shivaram Venkataraman, Minlan Yu, Ming Zhang

2 Recurring jobs are popular
Apps & Data, Frameworks
Recurring big-data jobs are now common in the cloud: Microsoft reports that 40% of key jobs at Bing are rerun periodically.
CherryPick targets recurring jobs and recommends a suitable cloud configuration for them. Because CherryPick converges on a near-optimal configuration over several iterations, the job itself must recur; since about 40% of jobs do, handling only recurring jobs is still valuable.

3 n1-standard-4, n1-highmem-2, n1-highcpu-4, f1-micro, ...
Providers × machine type × cluster size: hundreds of instance types and instance-count combinations.
Machine types: r3.8xlarge, i2.8xlarge, m4.8xlarge, c4.8xlarge, r4.8xlarge, m3.8xlarge, …; A0, A1, A2, A3, A4, A11, A12, D1, D2, D3, L4s, …; n1-standard-4, n1-highmem-2, n1-highcpu-4, f1-micro, … (+ configurable VMs)
Cluster sizes: 10s of options

4 Choosing a good configuration is important
Better performance: for the same cost, best/worst running time differs by up to 3x; the worst configuration has good CPUs but is bottlenecked on memory.
Lower cost: for the same performance, best/worst cost differs by up to 12x; the worst configuration pays for expensive dedicated disks it does not need. $2 vs. $24 per job, with hundreds of monthly runs.

5 How to find the best cloud configuration
for a recurring job, given its representative workload?
One that minimizes the cost given a performance constraint.
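If the running time on every candidate configuration were already known, the selection itself would be a simple constrained minimization; a minimal sketch, with candidate names, prices, and runtimes that are purely illustrative (estimating those runtimes cheaply is exactly the hard part CherryPick addresses):

```python
def best_config(candidates, max_runtime_hours):
    """Pick the config minimizing cost subject to a performance constraint.
    Each candidate is (name, price_per_hour, runtime_hours); assumes all
    runtimes are known, which in practice they are not."""
    feasible = [c for c in candidates if c[2] <= max_runtime_hours]
    return min(feasible, key=lambda c: c[1] * c[2]) if feasible else None

# Illustrative candidates only.
candidates = [("c4-small", 0.8, 3.0), ("c4-large", 2.0, 1.0), ("r3-large", 2.6, 0.9)]
best = best_config(candidates, max_runtime_hours=2.0)  # -> ("c4-large", 2.0, 1.0)
```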

6 Key challenges
High accuracy: close to the optimal configuration
Adaptivity: works across all big-data apps
Low overhead: only runs a few configurations

7 Existing solution: modeling
Modeling the resource-performance/cost tradeoffs. Ernest [NSDI '16] models machine-learning apps for each machine type.
Problem: not adaptive across frameworks (e.g., MapReduce or Spark), applications (e.g., machine learning or databases), or machine types (e.g., memory- vs. CPU-intensive).

8 Existing solution: searching
Exhaustively search for the best cloud configuration without relying on an accurate performance model.
Problem: high overhead. If samples are not chosen carefully, it may take tens if not hundreds of runs to identify the best configuration.

9 Comparing existing approaches
Exhaustive search: high overhead
Modeling (Ernest): not general
Searching (coordinate descent): not accurate
CherryPick: high accuracy, adaptivity, and low overhead

10 Key idea of CherryPick
Adaptivity: black-box modeling, without knowing the structure of each application
Accuracy: modeling for ranking configurations; no need to be accurate everywhere
Low overhead: interactive searching; smartly select the next run based on existing runs

11 Workflow of CherryPick
Start with any config → Run the config → Black-box modeling → Rank and choose next config → Return config (interactive search)

16 Workflow of CherryPick
Bayesian Optimization: Run the config → Black-box modeling → Rank and choose next config → Return config (interactive search)
C(X) = P(X) × T(X), where P(X) is the price per unit time of configuration X and T(X) is the job's running time on configuration X.
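The cost model above translates directly into code; the VM count and hourly price below are illustrative, not figures from the talk:

```python
def job_cost(price_per_hour: float, runtime_hours: float) -> float:
    """C(X) = P(X) * T(X): total cost of one run on configuration X."""
    return price_per_hour * runtime_hours

# Illustrative numbers: a 4-VM cluster at $0.199/hour per VM that
# finishes the job in 1.5 hours costs 4 * 0.199 * 1.5 = $1.194.
cost = job_cost(price_per_hour=4 * 0.199, runtime_hours=1.5)
```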

17 Noise in the cloud
Noise is common in the cloud: a shared environment means inherent noise from other tenants, and even more under failures and stragglers.
Strawman solution: run each configuration multiple times, but that has high overhead.
Bayesian optimization is good at handling additive noise: merge the noise into the confidence interval (learned by monitoring or from historical data).

18 Workflow of CherryPick
Bayesian Optimization: Run the config → Black-box modeling (prior function) → Rank and choose next config → Return config (interactive search)

19 Workflow of CherryPick
Bayesian Optimization: Run the config → Black-box modeling (prior function) → Rank and choose next config (acquisition function) → Return config (interactive search)

21 Prior function for black box modeling
20 Prior function for black box modeling Feature vector: CPU/RAM/Machine count/… This is the actual cost function curve across all configurations.

22 Prior function for black box modeling
21 Prior function for black box modeling Challenge: what can we infer about configurations with two runs?

23 Prior function for black box modeling
22 Prior function for black box modeling Challenge: what can we infer about configurations with two runs? There are many valid functions passing through two points

24 Prior function for black box modeling
23 Prior function for black box modeling Challenge: what can we infer about configurations with two runs? Solution: confidence intervals capture where the function likely lies
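One common way to get such confidence intervals is a Gaussian-process prior conditioned on the runs so far; a minimal numpy sketch (the RBF kernel, length scale, and observed costs here are illustrative assumptions, not the paper's exact model):

```python
import numpy as np

def rbf(a, b, length=1.0):
    """Squared-exponential kernel between two 1-D arrays of configs."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_obs, y_obs, x_new, noise=1e-6, length=1.0):
    """Posterior mean and std dev of a zero-mean GP at x_new,
    conditioned on the observed (x_obs, y_obs) runs."""
    K = rbf(x_obs, x_obs, length) + noise * np.eye(len(x_obs))
    K_s = rbf(x_new, x_obs, length)
    K_ss = rbf(x_new, x_new, length)
    mu = K_s @ np.linalg.solve(K, y_obs)
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
    return mu, std

# Two measured configurations pin the curve down near them;
# uncertainty (std) grows as we move away from the observations.
x_obs = np.array([1.0, 4.0])
y_obs = np.array([10.0, 6.0])
mu, std = gp_posterior(x_obs, y_obs, np.array([1.0, 2.5, 4.0]))
```

The returned `std` is exactly the confidence band: near zero at the two measured points, wide in between, which is what lets CherryPick reason about configurations it has never run.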

25 Workflow of CherryPick
24 Workflow of CherryPick Run the config Black-box Modeling Rank and choose next config Return config Interactive search Prior function Acquisition function

26 Acquisition function for choosing the next config
Build an acquisition function based on the prior function. Calculate the expected improvement compared with the current best configuration.

27 Acquisition function for choosing the next config
Pick the configuration with the highest expected improvement.
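Expected improvement has a closed form under a Gaussian posterior; a small sketch for cost minimization (the optional exploration margin `xi` is an assumption of this sketch, not a knob named in the talk):

```python
import math

def expected_improvement(mu: float, sigma: float, best: float, xi: float = 0.0) -> float:
    """Expected amount by which a candidate config's cost beats the current
    best observed cost, under a Gaussian posterior with mean mu, std sigma."""
    if sigma <= 0.0:
        return max(best - mu - xi, 0.0)
    z = (best - mu - xi) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal PDF
    return (best - mu - xi) * cdf + sigma * pdf
```

Note how it rewards both low predicted cost (low `mu`) and high uncertainty (high `sigma`), which is the exploit/explore balance the search relies on.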

28 Acquisition function for choosing the next config
CherryPick then runs the chosen configuration.

29 Iterative searching
…and updates the model with the new result.
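The whole loop, run a config, refit the model, pick the next config by expected improvement, stop when no candidate promises enough improvement, can be sketched end to end. This is a toy 1-D stand-in: the zero-mean GP, kernel, grid, and synthetic `run_job` cost curve are all illustrative assumptions, not CherryPick's actual model:

```python
import math
import numpy as np

def rbf(a, b, length=1.0):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

def posterior(x_obs, y_obs, x_all, noise=1e-4):
    """Zero-mean GP posterior mean and std at every candidate config."""
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = rbf(x_all, x_obs)
    mu = Ks @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mu, np.sqrt(np.clip(var, 0.0, None))

def expected_improvement(mu, sd, best):
    z = np.where(sd > 0, (best - mu) / np.maximum(sd, 1e-12), 0.0)
    cdf = 0.5 * (1.0 + np.array([math.erf(v / math.sqrt(2.0)) for v in z]))
    pdf = np.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return np.where(sd > 0, (best - mu) * cdf + sd * pdf, 0.0)

def run_job(x):
    # Hypothetical stand-in for actually deploying and timing the job;
    # the true (unknown) cost curve has its minimum near x = 3.2.
    return (x - 3.2) ** 2 + 1.0

configs = np.linspace(0.0, 6.0, 25)      # discretized 1-D config space
x_obs = np.array([0.0, 6.0])             # starting configurations
y_obs = run_job(x_obs)
for _ in range(15):                      # interactive search loop
    mu, sd = posterior(x_obs, y_obs, configs)
    acq = expected_improvement(mu, sd, y_obs.min())
    if acq.max() < 1e-3:                 # stopping condition on EI
        break
    x_next = configs[int(np.argmax(acq))]
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, run_job(x_next))

best_config = x_obs[int(np.argmin(y_obs))]
```

Each iteration spends one real run exactly where the model is most hopeful, which is why only a handful of runs are needed.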

30 CherryPick searches in the area that matters
29 CherryPick searches in the area that matters A database application with 66 configurations


33 Further customizations
Discretize features: to handle infeasible configs and a large search space, discretize the feature space.
Stopping condition: trade off accuracy against search cost via the acquisition-function knob.
Starting condition: starting points should cover the whole space; use quasi-random search.
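The starting condition can be sketched with a simple stratified shuffle over a discretized space; the machine types, cluster sizes, and stratification scheme below are illustrative assumptions, not the talk's exact quasi-random sequence:

```python
import itertools
import random

# Hypothetical discretized feature space: (machine type, cluster size).
machine_types = ["c4.large", "m4.large", "r3.large", "i2.large"]
cluster_sizes = [4, 8, 16, 32, 48]
space = list(itertools.product(machine_types, cluster_sizes))

def quasi_random_start(space, n=3, seed=0):
    """Pick n starting configs spread across the space: one random draw
    per equal-sized stratum, so samples cover the range, not one corner."""
    rng = random.Random(seed)
    stride = len(space) // n
    return [space[i * stride + rng.randrange(stride)] for i in range(n)]

starts = quasi_random_start(space, n=3)
```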

34 Evaluation settings 5 big-data benchmarks 66 cloud configurations
Database: TPC-DS, TPC-H. MapReduce: TeraSort. Machine learning: SparkML regression, SparkML k-means.
66 cloud configurations spanning 30 GB to 854 GB of RAM, varying core counts, and 5 machine types.

35 Evaluation Settings Metrics
Metrics. Accuracy: running cost compared to the optimal configuration. Overhead: searching cost of suggesting a configuration.
Baselines. Searching: random search, coordinate descent. Modeling: Ernest [NSDI '16].

36 CherryPick has high accuracy with low overhead
[Figure: running cost normalized by the optimal configuration (%) vs. searching cost normalized by CherryPick's median (%), on TPC-DS.]
For TPC-DS, coordinate search has 20% more searching cost at the median and 78% higher running cost at the tail, so it is not accurate; random search needs 3.7x CherryPick's searching cost; Ernest is not adaptable.

37 CherryPick corrects Amazon guidelines
Machine type: AWS suggests R3, I2, and M4 as good options for running TPC-DS; CherryPick found that, at our scale, C4 is the best.
Cluster size: AWS offers no suggestion, yet a bad cluster size can cost 3x more than the right one. Combining the two choices is even harder.

38 Conclusions
Adaptivity: black-box modeling, which only requires being able to run the cloud configuration.
Low overhead: only a restricted amount of information is needed, from just a few runs.
High accuracy: "... do not solve a more general problem ... try to get the answer that you really need" (Vladimir Vapnik)

39 Thanks! Please try our tool at: https://github.com/yale-cns/cherrypick

