CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics
Omid Alipourfard, Hongqiang Harry Liu, Jianshu Chen, Shivaram Venkataraman, Minlan Yu, Ming Zhang
Recurring jobs are popular
Recurring big-data jobs are common in the cloud: Microsoft reports that 40% of key jobs at Bing are rerun periodically.
CherryPick targets recurring jobs. Because it iterates over multiple runs to converge on a near-optimal configuration, the job itself must recur; since about 40% of jobs do, focusing on recurring jobs is already valuable.
Hundreds of instance-type and instance-count combinations
Machine types across providers:
- r3.8xlarge, i2.8xlarge, m4.8xlarge, c4.8xlarge, r4.8xlarge, m3.8xlarge, ...
- A0, A1, A2, A3, A4, A11, A12, D1, D2, D3, L4s, ...
- n1-standard-4, n1-highmem-2, n1-highcpu-4, f1-micro, ... (+ configurable VMs)
Each machine type can be combined with tens of cluster-size options.
Choosing a good configuration is important
Better performance: for the same cost, the best/worst running time differs by up to 3x; the worst configuration has good CPUs but is bottlenecked on memory.
Lower cost: for the same performance, the best/worst cost differs by up to 12x; the worst configuration pays for expensive dedicated disks it does not need. That is $2 vs. $24 per job, with hundreds of runs per month.
How to find the best cloud configuration for a recurring job, given its representative workload?
Best: the one that minimizes cost subject to a performance constraint.
Key challenges
- High accuracy: get close to the optimal configuration
- Adaptivity: work across all big-data apps
- Low overhead: run only a few configurations
Existing solution: modeling
Model the resource-performance/cost tradeoffs. Ernest [NSDI '16] models machine learning apps for each machine type.
Problem: not adaptive across frameworks (e.g., MapReduce or Spark), applications (e.g., machine learning or databases), and machine types (e.g., memory- vs. CPU-intensive).
Existing solution: searching
Exhaustively search for the best cloud configuration, without relying on an accurate performance model.
Problem: high overhead. If samples are not chosen carefully, it may take tens if not hundreds of runs to identify the best configuration.
Where CherryPick fits
- Exhaustive search: high overhead
- Modeling (Ernest): not general
- Searching (coordinate descent): not accurate
- CherryPick: high accuracy, adaptivity, and low overhead
Key idea of CherryPick
- Adaptivity: black-box modeling, without knowing the structure of each application
- Accuracy: modeling for ranking configurations; the model need not be accurate everywhere
- Low overhead: interactive searching; smartly select the next run based on existing runs
Workflow of CherryPick
Start with any config -> run the config -> black-box modeling -> rank and choose the next config (interactive search) -> return the best config
The black-box model and interactive search are driven by Bayesian Optimization. The objective is the cost of a configuration X:
C(X) = P(X) x T(X)
where P(X) is the price per unit time of configuration X, and T(X) is the time to run the job on a cloud with configuration X.
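As a minimal sketch of this objective (the prices and running times below are made-up illustrative values, not measurements from the paper):

```python
# Cost of running a job on configuration X:
#   C(X) = P(X) * T(X)
# where P(X) is the price per unit time and T(X) the job's running time.

def cost(price_per_hour: float, running_time_hours: float) -> float:
    """C(X) = P(X) * T(X)."""
    return price_per_hour * running_time_hours

# A cheap-but-slow config can still beat a fast-but-expensive one:
slow_cheap = cost(price_per_hour=0.5, running_time_hours=4.0)   # 2.0
fast_costly = cost(price_per_hour=8.0, running_time_hours=1.0)  # 8.0
```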
Noise in the cloud
- Noise is common in the cloud: the shared environment means inherent noise from other tenants, and even more under failures and stragglers.
- Strawman solution: run each configuration multiple times; but this has high overhead.
- Bayesian optimization is good at handling additive noise: merge the noise into the confidence interval (learned by monitoring or from historical data).
Bayesian Optimization has two components: a prior function (the black-box model) and an acquisition function (to rank and choose the next config).
Prior function for black-box modeling
Feature vector: CPU / RAM / machine count / ...
The goal is to model the actual cost function curve across all configurations.
Challenge: what can we infer about configurations from only two runs? Many valid functions pass through two points.
Solution: confidence intervals capture where the function likely lies.
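A Gaussian-process prior makes this concrete: conditioned on a handful of observed runs, the posterior mean and variance give a confidence band over the whole configuration space. The sketch below uses a 1-D feature, an RBF kernel, and arbitrary hyperparameters for illustration; it is not CherryPick's exact kernel or settings.

```python
# Minimal GP posterior sketch (pure NumPy; illustrative hyperparameters).
import numpy as np

def rbf(a, b, length=1.0, var=1.0):
    """Squared-exponential kernel between two 1-D point sets."""
    d = a[:, None] - b[None, :]
    return var * np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_obs, y_obs, x_query, noise=1e-6):
    """Posterior mean and std-dev at x_query given observed runs."""
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = rbf(x_obs, x_query)
    Kss = rbf(x_query, x_query)
    K_inv = np.linalg.inv(K)
    mean = Ks.T @ K_inv @ y_obs
    cov = Kss - Ks.T @ K_inv @ Ks
    std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
    return mean, std  # mean +/- 2*std is a ~95% confidence band

# Two runs at configurations x=1 and x=4:
x_obs = np.array([1.0, 4.0])
y_obs = np.array([3.0, 2.0])
x_query = np.linspace(0.0, 5.0, 6)
mean, std = gp_posterior(x_obs, y_obs, x_query)
# Uncertainty collapses near the observed points and grows away from them.
```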
Acquisition function for choosing the next config
Build an acquisition function on top of the prior function: for each candidate, calculate the expected improvement over the current best configuration.
Pick the configuration with the highest expected improvement and run it.
Iterative searching: update the model with the new result and repeat.
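The selection step above can be sketched with the standard expected-improvement (EI) formula for minimization; the `posterior` callback and candidate configs are hypothetical placeholders for the model above.

```python
# Expected improvement for minimization, plus the "choose next config" step.
import math

def expected_improvement(mu, sigma, best_cost):
    """E[max(best_cost - f, 0)] when f ~ N(mu, sigma^2)."""
    if sigma == 0.0:
        return max(best_cost - mu, 0.0)
    z = (best_cost - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best_cost - mu) * cdf + sigma * pdf

def choose_next(candidates, posterior, best_cost):
    """Pick the candidate config with the highest expected improvement.

    posterior: a callable mapping a config to its (mu, sigma) estimate.
    """
    return max(candidates,
               key=lambda c: expected_improvement(*posterior(c), best_cost))
```

The search loop then runs `choose_next`, measures the chosen configuration, updates the model, and repeats until the stopping condition fires.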
CherryPick searches in the area that matters
A database application with 66 configurations: CherryPick concentrates its runs near the low-cost region instead of sampling the space uniformly.
Further customizations
- Discretize features: to deal with infeasible configs and a large search space, discretize the feature space.
- Stopping condition: to trade off accuracy against searching cost, use the acquisition function as the knob.
- Starting condition: the initial samples should fully cover the whole space; use quasi-random search.
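One way to use the acquisition function as a stopping knob is to stop once the best remaining expected improvement is small relative to the current best cost, after a minimum number of runs. The 10% threshold and minimum of 6 runs below are assumptions for this sketch, not normative values from the slides.

```python
# Illustrative stopping rule driven by the acquisition function.
def should_stop(max_ei, best_cost, runs_done,
                ei_fraction=0.1, min_runs=6):
    """Stop when the best expected improvement (max_ei) is below a
    fraction of the current best cost and enough runs have been made."""
    return runs_done >= min_runs and max_ei < ei_fraction * best_cost
```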
Evaluation settings
5 big-data benchmarks on 66 cloud configurations:
- Database: TPC-DS, TPC-H
- MapReduce: TeraSort
- Machine learning: SparkML regression, SparkML k-means
Configurations span 30 GB-854 GB of RAM, 12-112 cores, and 5 machine types.
Evaluation metrics
- Accuracy: running cost compared to the optimal configuration
- Overhead: searching cost of suggesting a configuration
Compared with:
- Searching: random search, coordinate descent
- Modeling: Ernest [NSDI '16]
CherryPick has high accuracy with low overhead
On TPC-DS, coordinate search has 20% more searching cost at the median and 78% higher running cost at the tail; Ernest is not adaptable and incurs 3.7x the searching cost; random search is not accurate.
[Plot: running cost normalized by the optimal configuration (%) vs. searching cost normalized by CherryPick's median (%), for TPC-DS]
CherryPick corrects Amazon guidelines
Machine type: AWS suggests R3, I2, and M4 as good options for running TPC-DS; CherryPick found that at our scale C4 is the best.
Cluster size: AWS has no suggestion, yet choosing a bad cluster size can cost 3x more than choosing the right one.
Combining the two choices is even harder.
Conclusions
- Adaptivity: black-box modeling only requires running the cloud configuration.
- Low overhead and high accuracy: work from a restricted amount of information, with only a few runs available.
"... do not solve a more general problem ... try to get the answer that you really need" - Vladimir Vapnik
Thanks! Please try our tool at: https://github.com/yale-cns/cherrypick