CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics
Omid Alipourfard, Hongqiang Harry Liu, Jianshu Chen, Shivaram Venkataraman, Minlan Yu, Ming Zhang
Recurring jobs are popular
Recurring big-data jobs are common in the cloud: Microsoft reports that 40% of key jobs at Bing are rerun periodically.
CherryPick targets recurring jobs. Because it iterates over multiple runs to converge on a near-optimal configuration, the job itself must recur; since about 40% of jobs do, focusing on recurring jobs is already valuable.
Hundreds of instance-type and instance-count combinations
Machine types across providers:
- r3.8xlarge, i2.8xlarge, m4.8xlarge, c4.8xlarge, r4.8xlarge, m3.8xlarge, ...
- A0, A1, A2, A3, A4, A11, A12, D1, D2, D3, L4s, ...
- n1-standard-4, n1-highmem-2, n1-highcpu-4, f1-micro, ... (+ configurable VMs)
Each machine type can be combined with tens of cluster-size options.
Choosing a good configuration is important
Better performance: for the same cost, the best/worst running time differs by up to 3x; the worst configuration has good CPUs but is bottlenecked on memory.
Lower cost: for the same performance, the best/worst cost differs by up to 12x; the worst configuration pays for expensive dedicated disks it does not need. That is $2 vs. $24 per job, with hundreds of runs per month.
How to find the best cloud configuration for a recurring job, given its representative workload?
Best: the one that minimizes cost subject to a performance constraint.
Key challenges
- High accuracy: get close to the optimal configuration
- Adaptivity: work across all big-data apps
- Low overhead: run only a few configurations
Existing solution: modeling
Model the resource-performance/cost tradeoffs. Ernest [NSDI '16] models machine learning apps for each machine type.
Problem: not adaptive across frameworks (e.g., MapReduce or Spark), applications (e.g., machine learning or databases), and machine types (e.g., memory- vs. CPU-intensive).
Existing solution: searching
Exhaustively search for the best cloud configuration, without relying on an accurate performance model.
Problem: high overhead. If samples are not chosen carefully, it may take tens if not hundreds of runs to identify the best configuration.
Where CherryPick fits
- Exhaustive search: high overhead
- Modeling (Ernest): not general
- Searching (coordinate descent): not accurate
- CherryPick: high accuracy, adaptivity, and low overhead
Key idea of CherryPick
- Adaptivity: black-box modeling, without knowing the structure of each application
- Accuracy: modeling for ranking configurations; the model need not be accurate everywhere
- Low overhead: interactive searching; smartly select the next run based on existing runs
Workflow of CherryPick
Start with any config -> run the config -> black-box modeling -> rank and choose the next config (interactive search) -> return the best config
The black-box model and interactive search are driven by Bayesian Optimization. The objective is the cost of a configuration X:
C(X) = P(X) x T(X)
where P(X) is the price per unit time of configuration X, and T(X) is the time to run the job on a cloud with configuration X.
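As a minimal sketch of this objective (the prices and running times below are made-up illustrative values, not measurements from the paper):

```python
# Cost of running a job on configuration X:
#   C(X) = P(X) * T(X)
# where P(X) is the price per unit time and T(X) the job's running time.

def cost(price_per_hour: float, running_time_hours: float) -> float:
    """C(X) = P(X) * T(X)."""
    return price_per_hour * running_time_hours

# A cheap-but-slow config can still beat a fast-but-expensive one:
slow_cheap = cost(price_per_hour=0.5, running_time_hours=4.0)   # 2.0
fast_costly = cost(price_per_hour=8.0, running_time_hours=1.0)  # 8.0
```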
Noise in the cloud
- Noise is common in the cloud: the shared environment means inherent noise from other tenants, and even more under failures and stragglers.
- Strawman solution: run each configuration multiple times; but this has high overhead.
- Bayesian optimization is good at handling additive noise: merge the noise into the confidence interval (learned by monitoring or from historical data).
Bayesian Optimization has two components: a prior function (the black-box model) and an acquisition function (to rank and choose the next config).
Prior function for black-box modeling
Feature vector: CPU / RAM / machine count / ...
The goal is to model the actual cost function curve across all configurations.
Challenge: what can we infer about configurations from only two runs? Many valid functions pass through two points.
Solution: confidence intervals capture where the function likely lies.
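A Gaussian-process prior makes this concrete: conditioned on a handful of observed runs, the posterior mean and variance give a confidence band over the whole configuration space. The sketch below uses a 1-D feature, an RBF kernel, and arbitrary hyperparameters for illustration; it is not CherryPick's exact kernel or settings.

```python
# Minimal GP posterior sketch (pure NumPy; illustrative hyperparameters).
import numpy as np

def rbf(a, b, length=1.0, var=1.0):
    """Squared-exponential kernel between two 1-D point sets."""
    d = a[:, None] - b[None, :]
    return var * np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_obs, y_obs, x_query, noise=1e-6):
    """Posterior mean and std-dev at x_query given observed runs."""
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = rbf(x_obs, x_query)
    Kss = rbf(x_query, x_query)
    K_inv = np.linalg.inv(K)
    mean = Ks.T @ K_inv @ y_obs
    cov = Kss - Ks.T @ K_inv @ Ks
    std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
    return mean, std  # mean +/- 2*std is a ~95% confidence band

# Two runs at configurations x=1 and x=4:
x_obs = np.array([1.0, 4.0])
y_obs = np.array([3.0, 2.0])
x_query = np.linspace(0.0, 5.0, 6)
mean, std = gp_posterior(x_obs, y_obs, x_query)
# Uncertainty collapses near the observed points and grows away from them.
```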
Acquisition function for choosing the next config
Build an acquisition function on top of the prior function: for each candidate, calculate the expected improvement over the current best configuration.
Pick the configuration with the highest expected improvement and run it.
Iterative searching: update the model with the new result and repeat.
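The selection step above can be sketched with the standard expected-improvement (EI) formula for minimization; the `posterior` callback and candidate configs are hypothetical placeholders for the model above.

```python
# Expected improvement for minimization, plus the "choose next config" step.
import math

def expected_improvement(mu, sigma, best_cost):
    """E[max(best_cost - f, 0)] when f ~ N(mu, sigma^2)."""
    if sigma == 0.0:
        return max(best_cost - mu, 0.0)
    z = (best_cost - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best_cost - mu) * cdf + sigma * pdf

def choose_next(candidates, posterior, best_cost):
    """Pick the candidate config with the highest expected improvement.

    posterior: a callable mapping a config to its (mu, sigma) estimate.
    """
    return max(candidates,
               key=lambda c: expected_improvement(*posterior(c), best_cost))
```

The search loop then runs `choose_next`, measures the chosen configuration, updates the model, and repeats until the stopping condition fires.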
CherryPick searches in the area that matters
A database application with 66 configurations: CherryPick concentrates its runs near the low-cost region instead of sampling the space uniformly.
Further customizations
- Discretize features: to deal with infeasible configs and a large search space, discretize the feature space.
- Stopping condition: to trade off accuracy against searching cost, use the acquisition function as the knob.
- Starting condition: the initial samples should fully cover the whole space; use quasi-random search.
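One way to use the acquisition function as a stopping knob is to stop once the best remaining expected improvement is small relative to the current best cost, after a minimum number of runs. The 10% threshold and minimum of 6 runs below are assumptions for this sketch, not normative values from the slides.

```python
# Illustrative stopping rule driven by the acquisition function.
def should_stop(max_ei, best_cost, runs_done,
                ei_fraction=0.1, min_runs=6):
    """Stop when the best expected improvement (max_ei) is below a
    fraction of the current best cost and enough runs have been made."""
    return runs_done >= min_runs and max_ei < ei_fraction * best_cost
```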
Evaluation settings
5 big-data benchmarks on 66 cloud configurations:
- Database: TPC-DS, TPC-H
- MapReduce: TeraSort
- Machine learning: SparkML regression, SparkML k-means
Configurations span 30 GB-854 GB of RAM, 12-112 cores, and 5 machine types.
Evaluation metrics
- Accuracy: running cost compared to the optimal configuration
- Overhead: searching cost of suggesting a configuration
Compared with:
- Searching: random search, coordinate descent
- Modeling: Ernest [NSDI '16]
CherryPick has high accuracy with low overhead
On TPC-DS, coordinate search has 20% more searching cost at the median and 78% higher running cost at the tail; Ernest is not adaptable and incurs 3.7x the searching cost; random search is not accurate.
[Plot: running cost normalized by the optimal configuration (%) vs. searching cost normalized by CherryPick's median (%), for TPC-DS]
CherryPick corrects Amazon guidelines
Machine type: AWS suggests R3, I2, and M4 as good options for running TPC-DS; CherryPick found that at our scale C4 is the best.
Cluster size: AWS has no suggestion, yet choosing a bad cluster size can cost 3x more than choosing the right one.
Combining the two choices is even harder.
Conclusions
- Adaptivity: black-box modeling only requires running the cloud configuration.
- Low overhead and high accuracy: work from a restricted amount of information, with only a few runs available.
"... do not solve a more general problem ... try to get the answer that you really need" - Vladimir Vapnik
Thanks! Please try our tool at: https://github.com/yale-cns/cherrypick