Aaron Harlap, Alexey Tumanov*, Andrew Chung, Greg Ganger, Phil Gibbons

Presentation transcript:

Proteus: agile ML elasticity through tiered reliability in dynamic resource markets
Today I am going to talk about agile elasticity in ML: how to achieve it and how to take advantage of it.

Overview. Motivation for elasticity in ML. How to make parameter servers elastic: AgileML. How to take advantage of elasticity: BidBrain. Evaluations of the cost benefits and runtime benefits of elasticity for ML.
I want to start this talk with a quick overview. At the beginning I am going to talk about some motivations for why we want elasticity. Then I am going to describe a component of our system called AgileML, which modifies the traditional parameter server architecture to make it efficiently elastic. Then I am going to describe a system called BidBrain, which is designed to take advantage of elasticity by bidding intelligently on the Amazon EC2 spot market. Lastly, I will go over some evaluation results showing how valuable elasticity is.

Dynamic Resource Availability. Revocable resources are common in clusters: best-effort resources that can be preempted (YARN, Borg, Mesos, etc.). Clouds add the element of cost savings: Preemptible Instances in Google Compute Engine, Spot Instances in Amazon EC2.
First I want to provide some motivation for why we need elasticity in our systems. Many popular schedulers, such as YARN and Mesos, have shown that they significantly improve their cluster utilization and their ability to meet SLAs if they are able to revoke resources. The example that I am going to focus on in this talk is the Amazon EC2 spot market, where users can use cheaper resources at the risk of revocation, which we call eviction.

Big $$$ Saving. Often 75-85% cheaper to use Spot Instances. [Figure labels: Low Cost.]
Presenter notes: the reference is what Amazon calls the on-demand price. After red: describe how cheap it is and explain the spot market rules. After blue: the markets are uncorrelated and move independently. Nickels on the dollar.

How do you Save $$$? Support agile elasticity: scale in and out efficiently and quickly. Handle bulk revocations/evictions efficiently. Don't lose progress.
So now that we have seen just how much we can save, the question becomes how we do this. The answer is that we need systems that can scale elastically and efficiently, handle losing a large percentage of their resources, not lose forward progress, and preferably do all this with no overhead. (Presenter note: mention this is for ML.)

Parameter Servers are Great for Iterative ML. Parameter servers shard solution state across machines. The traditional architecture has servers and workers on all machines. Efficient. Used by IterStore, MXNet, Bosen, ...
This diagram shows the high-level architecture of the traditional parameter server (PS) framework. The workers read in the data, and then concurrently read and update the solution state stored in the parameter server. Current architectures traditionally run workers on all the machines and also shard the parameter server state across all the machines. (IterStore, Petuum, MXNet.)
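The sharding and the worker read/update cycle described on this slide can be pictured with a toy, single-process sketch. It is not the API of IterStore, MXNet, or Bosen; the class name, the modulo placement of integer keys, and the `worker_step` helper are all assumptions made for illustration.

```python
from collections import defaultdict


class ShardedParameterServer:
    """Toy in-process stand-in for a parameter server sharded across machines."""

    def __init__(self, num_shards):
        # One dict per shard; in a real system each shard lives on its own machine.
        self.shards = [defaultdict(float) for _ in range(num_shards)]

    def _shard_for(self, key):
        # Assumption: integer keys are placed on shards by simple modulo.
        return self.shards[key % len(self.shards)]

    def read(self, key):
        return self._shard_for(key)[key]

    def update(self, key, delta):
        self._shard_for(key)[key] += delta


def worker_step(ps, gradients, lr=0.1):
    """One worker pass: push a scaled update for each (key, gradient) pair."""
    for key, gradient in gradients:
        ps.update(key, -lr * gradient)
```

In the traditional layout every machine would host both one shard and one worker running `worker_step`, which is why losing any machine also loses part of the solution state.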

AgileML: New Approach to Elasticity. Use tiers of reliable and unreliable resources. Revocable resources are unreliable (transient). Maintain all state on reliable resources, e.g. parameter servers only on On-Demand Instances. Spot Instances run workers only (initially). Three architecture stages, based on the ratio of transient to reliable resources.
So we propose a new approach to achieving elasticity, called AgileML. AgileML is a system designed to take advantage of tiers of reliability among its resources in order to achieve agile elasticity. It does so by running stateful processes only on reliable on-demand machines, and using unreliable spot machines only to run workers, which are stateless.
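A minimal sketch of the initial tiered placement described here, assuming each machine is tagged with a tier. The function names and role strings are hypothetical, and running a worker alongside the parameter server on on-demand machines is an assumption of the sketch rather than something the slide states.

```python
def assign_roles(machines):
    """Map each machine name to the processes it runs.

    machines: iterable of (name, tier) pairs, tier is "on-demand" or "spot".
    """
    roles = {}
    for name, tier in machines:
        if tier == "on-demand":
            # Reliable tier holds all solution state. Co-locating a worker
            # here is an assumption of this sketch.
            roles[name] = ["param_server", "worker"]
        elif tier == "spot":
            # Transient tier runs only stateless workers.
            roles[name] = ["worker"]
        else:
            raise ValueError(f"unknown tier: {tier}")
    return roles


def state_survives(roles, evicted):
    """True if no evicted machine was hosting a parameter server."""
    return all("param_server" not in procs
               for name, procs in roles.items() if name in evicted)
```

With this placement, evicting every spot machine at once removes workers but no solution state, which is the property that lets the system keep its progress through a bulk eviction.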

Building the Stages of Reliability. Transition between stages at run-time. Little/no overhead for transitions. Transitions based on ratios.
[Diagram labels: Elasticity Controller; Spot Instances (Cheap); On-Demand Instances (Reliable); ParamServ; Worker; Stage #2; Stage #3.]
Presenter note, at the end: I have shown you how to design an elastic PS ML system with low overhead; now I want to show you just how low that overhead is.

So now we have Agile Elasticity. How do we take advantage of it?
So now that we have built AgileML, a system that performs agile elasticity, is able to use unreliable resources, and does these things with very little overhead, how do we take advantage of it?

Proteus Implementation. [Diagram components: Spot Market Historical Data, BidBrain, AgileML, Amazon EC2.]
1) Application characteristics
2) Feed historic spot market data into BidBrain
3) Feed spot market price into BidBrain
4) BidBrain makes allocation request
5) AWS provides resources to BidBrain
6) BidBrain provides AgileML with resources
Presenter note: slow down for 4, 5, 6.

Goal is to Minimize Cost Per Work. BidBrain computes the expected cost of a set of resources, computes the expected work produced by that set of resources, and minimizes expected cost per work.
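The selection rule on this slide can be sketched as a simple minimization, assuming the expected cost and expected work of each candidate set of resources have already been estimated. The function names and the candidate dictionary are hypothetical; how BidBrain enumerates candidates is not described in the talk.

```python
def cost_per_work(expected_cost, expected_work):
    """Expected dollars per unit of work; infinite if no work is expected."""
    if expected_work <= 0:
        return float("inf")
    return expected_cost / expected_work


def pick_allocation(candidates):
    """Return the name of the candidate with the lowest expected cost per work.

    candidates: dict mapping a candidate name to (expected_cost, expected_work).
    """
    return min(candidates, key=lambda name: cost_per_work(*candidates[name]))
```

Note that the cheapest candidate is not necessarily the winner: a more expensive set of resources wins if it is expected to produce proportionally more work.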

Compute Expected Cost. Expected Time to Eviction. Inputs: current market price, historic market price. Bid Delta = Bid Price - Market Price.
c4.2xlarge instance type in zone us-east-1a:
Bid Delta    Evicted within Hour    Expected Time to Eviction
$0.0005      55%                    42 min
$0.01        5.5%                   738 min
This historic data is designed to give BidBrain an estimate of the reliability of resources depending on how far above the market price it decides to bid. (Presenter note: mention the ideal case.) If you want to take advantage of this free computing, you need a system like AgileML that can handle evictions. Now I will describe how BidBrain decides how well the application will handle aggressive bidding.
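As a rough illustration of how the eviction table could feed an expected-cost estimate, the sketch below uses the two rows from the slide. It assumes, in line with the speaker's mention of "free computing", that an instance evicted within its billing hour is not charged for that hour; the lookup rule, the function names, and the treatment of deltas below the smallest tabulated value are assumptions, not BidBrain's actual model.

```python
# (bid delta in $, probability of eviction within the hour), from the slide's
# table for c4.2xlarge in us-east-1a.
EVICTION_TABLE = [(0.0005, 0.55), (0.01, 0.055)]


def eviction_probability(bid_delta):
    """Use the row with the largest tabulated delta not exceeding bid_delta."""
    # Assumption: below the smallest tabulated delta, treat eviction as certain.
    prob = 1.0
    for delta, p in EVICTION_TABLE:
        if bid_delta >= delta:
            prob = p
    return prob


def expected_hourly_cost(market_price, bid_delta):
    """Expected charge for one hour, assuming an evicted hour is free."""
    return (1.0 - eviction_probability(bid_delta)) * market_price
```

Under these assumptions a smaller bid delta lowers the expected charge but raises the chance of eviction, so the cost estimate only makes sense alongside an estimate of how much work an eviction destroys.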

Compute Expected Work. AgileML provides this information to BidBrain: how long after startup resources become productive; scalability; scale in/out overhead; eviction overhead.
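One hedged way to combine the factors listed on this slide into a single number is sketched below. The linear model, the parameter names, and the units (instance-hours of useful work) are all assumptions chosen for illustration; the talk does not give the formula BidBrain uses.

```python
def expected_work(num_instances, hours, startup_hours,
                  scaling_efficiency, eviction_prob, eviction_overhead_hours):
    """Rough expected useful instance-hours from a set of identical instances.

    startup_hours: delay before new resources become productive.
    scaling_efficiency: fraction of ideal speedup retained at this scale (0..1).
    eviction_prob: probability the instances are evicted during the period.
    eviction_overhead_hours: time lost recovering if an eviction happens.
    """
    # Hypothetical model: subtract the startup delay and the expected
    # eviction overhead from the wall-clock time, then discount for scaling.
    productive = hours - startup_hours - eviction_prob * eviction_overhead_hours
    productive = max(0.0, productive)
    return num_instances * scaling_efficiency * productive
```

For example, `expected_work(4, 1.0, 0.1, 0.9, 0.5, 0.2)` models four instances for one hour with a six-minute startup delay, 90% scaling efficiency, and a 50% chance of a twelve-minute eviction penalty.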

Proteus Evaluation. AgileML vs. checkpointing. BidBrain vs. the Bid On-Demand policy. Bid On-Demand policy (standard): choose the cheapest resource; bid the on-demand price (user bid = on-demand price); on eviction, repeat.
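The baseline policy is simple enough to sketch directly. The function name and the price dictionaries are hypothetical, and "cheapest" is assumed here to mean the lowest current spot price per hour.

```python
def bid_on_demand_request(spot_prices, on_demand_prices):
    """Baseline policy: pick the cheapest spot resource, bid its on-demand price.

    spot_prices, on_demand_prices: dicts mapping instance type to $/hour.
    Returns (instance_type, bid). On eviction, call again with fresh prices.
    """
    # Assumption: "cheapest" means the lowest current spot price.
    cheapest = min(spot_prices, key=spot_prices.get)
    return cheapest, on_demand_prices[cheapest]
```

Unlike the cost-per-work rule, this policy ignores eviction risk and application behavior entirely, which is what makes it a useful point of comparison.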

Proteus Saves Money and Time. [Figure comparing Bid-on-demand + CKPts with Proteus (BidBrain + AgileML).]

Need Elasticity and Smart Resource Manager. [Figure labels: Standard + CKPts; Standard + AgileML; BidBrain; Proteus (BidBrain + AgileML).]

Summary. Proteus uses an agile elastic ML system (AgileML) plus smart bidding (BidBrain) to take advantage of dynamic resource availability. ~85% cost savings compared to on-demand resources!