Learning Application Models for Utility Resource Planning. Piyush Shivam, Shivnath Babu, Jeff Chase, Duke University. IEEE International Conference on Autonomic Computing (ICAC), June 2006.

Presentation transcript:

Learning Application Models for Utility Resource Planning Piyush Shivam, Shivnath Babu, Jeff Chase Duke University IEEE International Conference on Autonomic Computing (ICAC), June 2006

Networked Computing Utility. [Figure: a task scheduler maps a task workflow onto clusters C1, C2, and C3 at Sites A, B, and C.] A network of clusters or grid sites. Each site is a pool of heterogeneous resources (e.g., CPU, memory, storage, network) managed as a shared utility. Jobs are task/data workflows. Challenge: choose the 'best' resource mapping/schedule for the job mix, an instance of "utility resource planning". Solution under construction: NIMO.

Self-Managing Systems. Monitor, Analyze, Plan, & Execute ("MAPE"): decentralized, interacting control loops. [Figure: sensors feed workload data into analysis and learning, a self-optimization policy plans actions, and actuators apply them to the infrastructure; "You are here!"] Pervasive instrumentation. What control points to expose?
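
A minimal, illustrative MAPE loop in Python (not from the talk): the sensor, policy, and actuator functions and the thresholds are placeholders invented for illustration; the sketch only shows how the monitor/analyze/plan/execute stages fit together around a learned model.

```python
# Minimal MAPE (Monitor, Analyze, Plan, Execute) control-loop sketch.
# The sensor/actuator hooks and thresholds below are placeholders for the
# pervasive instrumentation and control points a real utility would expose.
import time

def monitor():
    """Sensor: return current workload/utilization metrics (placeholder)."""
    return {"cpu_util": 0.85, "queue_len": 12}

def analyze(metrics, model):
    """Compare the learned model's prediction against a performance target."""
    return model(metrics) > 1.0          # True if the target is at risk

def plan(metrics):
    """Self-optimization policy: choose a corrective action (placeholder)."""
    return {"add_nodes": 1} if metrics["queue_len"] > 10 else {}

def execute(action):
    """Actuator: apply the action to the infrastructure (placeholder)."""
    print("applying", action)

def mape_loop(model, iterations=3, period_s=0.0):
    for _ in range(iterations):
        metrics = monitor()
        if analyze(metrics, model):
            execute(plan(metrics))
        time.sleep(period_s)

if __name__ == "__main__":
    # Toy "learned model": normalized completion time as a function of load.
    mape_loop(lambda m: m["cpu_util"] + 0.02 * m["queue_len"])
```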

It's a Hard Problem
Diverse applications:
– Applications range from online web services to batch compute tasks.
– Each type of application has different resource demands.
Heterogeneous resources:
– Performance can vary significantly across candidate resource assignments.
– Network I/O: compute power vs. locality/access.
Multiple objectives:
– Job performance vs. overall performance vs. profit.
– "Urgent computing" or "on-time computing".
We deal with hard problems by constraining them.

Example: task and data placement. [Figure: three candidate assignments across clusters C1, C2 (Sites A, B) and C3 (Site C) and the home file server; option 2 uses remote access, option 3 uses data staging.] What is the application performance on each of the alternative candidate assignments? Which has the fastest completion time for this application? Which is best overall?

Premises (Limitations)
Important batch applications are run repeatedly.
– Most resources are consumed by applications we have seen in the past.
Behavior is predictable across data sets…
– …given some attributes associated with the data set.
– Stable behavior per unit of data processed (D).
– D is predictable from data set attributes.
Behavior depends only on resource attributes.
– CPU type and clock, seek time, spindle count.
The utility controls the resources assigned to each job.
– Virtualization enables precise control.
Your mileage may vary.

NIMO (NonInvasive Modeling for Optimization)
NIMO learns end-to-end performance models.
– Models predict performance as a function of (a) the application profile, (b) the data set profile, and (c) the resource profile of a candidate resource assignment.
NIMO is active.
– NIMO collects training data for learning models by conducting proactive experiments on a 'workbench'.
NIMO is noninvasive.
[Figure: the model answers "what if…" questions, mapping the application/data profiles and candidate resource profiles to (target) performance.]

The Big Picture. [Figure: an application profiler, a resource profiler, a training-set database, active learning, and a scheduler spanning clusters C1, C2, C3 at Sites A, B, C; jobs and benchmarks run under pervasive instrumentation, and metrics are correlated with job logs.]

What are the right predictive models?
– Easy to learn, efficient, and sufficiently accurate.
A spectrum of approaches:
– Pure statistical inference: fully general, automated learning, good for interpolation, good data yields good results.
– Queuing models: a priori knowledge, harder to learn, good for interpolation and extrapolation, might be wrong.
NIMO's models sit between these two extremes ("you are here").

Generic End-to-End Model. A run alternates between compute phases (compute resource busy) and stall phases (compute resource stalled on I/O). Occupancy is the average time consumed per unit of data: O_a (compute occupancy), O_s (stall occupancy), O_n (network occupancy), O_d (storage occupancy). Completion time: T = D * (O_a + O_n + O_d), where D is the total data.
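
A tiny worked example of the end-to-end model; the occupancy values below are illustrative, not measurements from the paper.

```python
# Worked example of the end-to-end model T = D * (O_a + O_n + O_d).
# All numbers are illustrative.
D   = 4096.0     # total data processed, in MB
O_a = 0.010      # compute occupancy: seconds of CPU time per MB
O_n = 0.004      # network occupancy: seconds of network stall per MB
O_d = 0.006      # storage occupancy: seconds of storage stall per MB

T = D * (O_a + O_n + O_d)   # predicted completion time in seconds
print(f"predicted completion time: {T:.1f} s")   # -> 81.9 s
```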

Occupancy Predictors
NIMO learns predictor functions for D, O_a, O_n, O_d.
Limited assumption of separability/independence:
– Each predictor function takes the complete profiles as inputs.
[Figure: the predictor function f_a maps the resource profile and data profile to O_a.]

Goal: Learn Those Functions
For each deployed run on a resource assignment:
– Obtain CPU utilizations of the compute resource.
– Gather time-stamped network I/O traces.
Tuple/Sample: ⟨resource profile, data profile, D, O_a, O_n, O_d⟩
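
A hedged sketch of how one such sample might be derived from the passive traces; the per-interval trace format and the rule for splitting stall time between network and storage are simplifying assumptions for illustration, not NIMO's actual procedure.

```python
# Sketch: turning passive instrumentation from one run into a training sample.
# The trace format (per-interval CPU utilization and byte counts) and the
# proportional split of stall time are assumptions made for illustration.

def occupancies(cpu_util, net_bytes, disk_bytes, interval_s, data_mb):
    """cpu_util, net_bytes, disk_bytes are per-interval samples for one run."""
    run_time = len(cpu_util) * interval_s
    busy     = sum(u * interval_s for u in cpu_util)   # CPU busy time
    stall    = run_time - busy                         # total stall time
    net_io, disk_io = sum(net_bytes), sum(disk_bytes)
    # Apportion stall time between network and storage by I/O volume.
    net_share = net_io / (net_io + disk_io) if (net_io + disk_io) else 0.0
    return {
        "D":   data_mb,
        "O_a": busy / data_mb,
        "O_n": stall * net_share / data_mb,
        "O_d": stall * (1 - net_share) / data_mb,
    }

sample = occupancies(cpu_util=[0.9, 0.5, 0.7],
                     net_bytes=[10e6, 40e6, 5e6],
                     disk_bytes=[2e6, 1e6, 30e6],
                     interval_s=10.0, data_mb=512)
print(sample)   # stored alongside the resource profile and data profile
```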

Statistical Learning. [Figure: the independent variables are the resource profile and the data profile; the dependent variables are D and the occupancies.] Complexity (e.g., latency hiding, concurrency, arm contention) is captured implicitly in the training data rather than in the structure of the model.
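
As a sketch of this statistical-learning step, the snippet below fits one occupancy predictor by least-squares regression over a handful of training runs. The profile attributes (cpu_ghz, net_latency_ms, dataset_gb) and all numbers are hypothetical; NIMO's real profiles and learning procedure are richer.

```python
# Sketch: learning an occupancy predictor by regression over training runs.
# Feature names and values are hypothetical profile attributes.
import numpy as np

# Each row: [cpu_ghz, net_latency_ms, dataset_gb]; target: measured O_a.
X = np.array([[0.45,  0.0, 1.0],
              [0.45, 18.0, 1.0],
              [1.00,  5.0, 2.0],
              [1.40,  9.0, 2.0],
              [1.40, 18.0, 4.0]])
y = np.array([0.022, 0.024, 0.011, 0.008, 0.009])   # seconds per MB

# Least-squares fit with an intercept term.
A = np.hstack([X, np.ones((X.shape[0], 1))])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_O_a(cpu_ghz, net_latency_ms, dataset_gb):
    """Predict compute occupancy for a candidate resource/data profile."""
    return float(np.dot([cpu_ghz, net_latency_ms, dataset_gb, 1.0], coeffs))

print(predict_O_a(1.2, 10.0, 2.0))
```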

NIMO Framework. [Figure: the NIMO workbench, with an application profiler, resource profiler, and data profiler, a WAN emulator (nistnet), standard benchmarks, data attributes, and a training-set database that drives active learning.] Workbench loop: 1. set up resource assignments; 2. run the application; 3. collect instrumentation data; 4. learn predictors.

Methodology
4 scientific applications used in biomedical research.
6 synthetic applications to explore more behaviors.
50 resource assignments:
– 5 different CPU speeds (450 MHz – 1.4 GHz).
– 10 network latencies (0 ms – 18 ms) between compute and storage.
Learned the predictors using around 14 assignments; predicted the completion time for the remaining assignments.
Applications are sequential tasks that scan/transform their inputs.

Validation
50-way cross-validation and three metrics to test the accuracy of model-predicted completion times:
– Absolute and relative error in the model-predicted completion time.
– Ranking of all the assignments.
Mean, standard deviation, 90th-percentile, and worst-case error statistics.
Also, rank assignments for utility planning scenarios:
– Task placement choices.
– Storage outsourcing and data staging decisions.
– Predicting resources to meet a deadline.
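
A small sketch of the error and ranking metrics over a few candidate assignments; the completion times are illustrative (loosely echoing the Site C numbers quoted on the task-placement slide), not the paper's measurements.

```python
# Sketch of the validation metrics: relative error of predicted completion
# times and a ranking check across candidate assignments. Numbers are
# illustrative, not the paper's results.

def relative_error(predicted, actual):
    return abs(predicted - actual) / actual

# candidate -> (predicted, actual) completion time in minutes.
candidates = {"Site A": (25.0, 26.1), "Site B": (21.3, 20.4), "Site C": (19.5, 18.8)}

errors = {k: relative_error(p, a) for k, (p, a) in candidates.items()}
print("mean relative error:", sum(errors.values()) / len(errors))

# Ranking metric: does the model order the assignments the same way
# as the actual runs do?
by_predicted = sorted(candidates, key=lambda k: candidates[k][0])
by_actual    = sorted(candidates, key=lambda k: candidates[k][1])
print("ranking preserved:", by_predicted == by_actual)
```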

Results Summary
Key results are in Table II. Mean accuracy (1 − PE) is between 90% and 100%.
– Most are 95–99%.
– Trained with 20–30% of the candidate assignments.
– Ranking error is minimal.
Accurately captures the conclusions of a 2002 empirical study of storage outsourcing [Ng].
And yet: some workloads show varying accuracy.
– Worst-case error is 30% for synthetic sequential I/O (though accurate most of the time).
– Stems from nonlinear latency-hiding behavior.

Latency hiding

Planning: task placement. fMRI performance depends on the combination of compute and I/O resources. Model-predicted best: Site C (19.5 mins). Actual best: Site C (18.8 mins). The fastest CPU or network is not always the best.

Active Learning in NIMO
How to learn accurate models quickly?
[Figure: model accuracy vs. number of training samples for passive vs. active sampling.]
Passive sampling might not expose the system operating range.
Active sampling using "design of experiments" collects the most relevant training data; automatic and quick.
See: Active Sampling for Accelerated Learning of Performance Models, by P. Shivam, S. Babu, J. Chase. First Workshop on Tackling Computer Systems Problems with Machine Learning (SysML), June 2006, and VLDB 2006.

Active Learning

Conclusions
Active learning of performance models from noninvasive instrumentation data.
Simple regression is surprisingly good.
– Good enough to rank an application's candidate assignments by runtime.
But:
– Sensitive to sample selection in the corner cases; active learning is crucial [VLDB06].
– Can yield significant errors for nonlinear behavior related to concurrency and latency hiding.
– More sophisticated learning algorithms may help.

"Future Work"
– Applications with data-dependent behavior; data profiling.
– Wider range of applications, including parallel applications.
– Incorporate a priori models (e.g., queuing models). Necessary?
– Explore more sophisticated learning algorithms for the interesting (nonlinear) cases.
– Integrate NIMO as policies for resource leasing infrastructure (e.g., Shirako [USENIX06]).

Saturation

We have data: "I'm sure the information is here… somewhere." Much of it is redundant or irrelevant.

Planning: task and data placement. With data staging, the total time is the data staging task plus the actual task: T = t_1 + t_2. Model-predicted best: Site C (19.5 mins). Actual best: Site C (18.8 mins).
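
A small sketch comparing a data-staging plan (T = t_1 + t_2) with a remote-access plan under the end-to-end model; the copy bandwidth and occupancy values are invented for illustration.

```python
# Sketch: staging vs. remote access under the end-to-end model.
# All numbers are illustrative, not measurements.
D = 2048.0                      # MB of input data

# Plan 1: stage the data to the compute site, then run on local storage.
t1 = D / 50.0                   # copy time at an assumed ~50 MB/s
t2 = D * (0.010 + 0.004)        # run time with low storage stall
T_staging = t1 + t2             # T = t1 + t2

# Plan 2: run in place, accessing the home file server over the WAN.
T_remote = D * (0.010 + 0.012)  # higher network occupancy, no copy step

print(f"staging: {T_staging:.0f} s, remote access: {T_remote:.0f} s")
print("better plan:", "staging" if T_staging < T_remote else "remote access")
```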

Experimental Evaluation
Applied NIMO to do accurate resource planning:
– Several real and synthetic batch applications.
– Heterogeneous compute and network resources.
Planning scenarios:
– Task placement choices.
– Storage outsourcing and data staging decisions.
– Predicting resources that meet a target performance.
NIMO learned accurate models using only 10–25% of the total training samples.

Future Directions
– Integrate NIMO within an on-demand resource brokering architecture, e.g., Cereus [USENIX 2006].
– Investigate the continuum of models.
– Apply NIMO to a wider variety of applications and resources.

Planning: task and data placement. Goal: best performance. [Figure: candidate assignments 1–3 across clusters C1, C2 (Sites A, B) and C3 (Site C) with the home file server; option 3 copies the data.]

Other Scenarios
– Viability of storage outsourcing.
– Candidates that meet a target completion time.
Details in ICAC 2006.

Goal: placing an application and its data on utility resources. [Figure: clusters C1, C2, C3 at Sites A, B, C with the home file server and plans P1–P3.] Candidate resource plans:
1. Plan P1: place the compute task at Site A, next to where the data is.
2. Plan P2: run the task at a remote Site B with faster compute resources and access the data remotely.
3. Plan P3: run the task at Site B after copying the data from Site A to Site B.
4. Plans P2 and P3 for Site C.

NonInvasive Modeling for Optimization
NIMO learns end-to-end performance models for applications.
NIMO collects training data with active "experiments".
NIMO uses noninvasive instrumentation to collect training data.
[Figure: the model maps an application profile and candidate resource profiles to (target) performance.]

[Figure: two uses of the model, one predicting performance from an application profile and candidate resource profiles, the other relating candidate resource profiles and the application profile to a target performance.]

Elements of NIMO
Active learning of predictive models using noninvasive instrumentation:
– End-to-end: predict performance as a function of the compute, network, and storage resources assigned to an application.
– Active: deploy and monitor applications in a 'workbench' to collect training data.
– Noninvasive: training data gathered from passive instrumentation streams; no change to application software or operating systems.

Application profiler
Learns functions that predict an application's performance on a given assignment of resources.
The functions are learned by applying statistical learning techniques to the application's performance history.
The performance history is collected proactively by planning runs on a workbench with varying assignments of compute, network, and storage resources.

Outline Model-guided approach Active learning of models Model validation Model-guided planning Conclusions and future work

Completion Time. Formulate completion time in terms of the predictor functions: T = D * (O_a + O_n + O_d).

Model-guided Approach. [Figure: the model takes application characteristics, the input data set, and a candidate resource assignment, and predicts completion time.]

Active and Noninvasive
For each deployed run on a resource assignment:
– Passive instrumentation data: CPU utilization of the compute resource, and time-stamped network I/O traces.
– Use the CPU utilization and network I/O traces to obtain D, O_a, O_n, O_d.
– Combine these with the resource profile and data profile.
Tuple/Sample: ⟨resource profile, data profile, D, O_a, O_n, O_d⟩

Capturing Complexity
Complexity is captured implicitly in the training data rather than in the structure of the model.
– Simple model structure: T = D * (O_a + O_n + O_d).
– Complexity: latency hiding, queuing, concurrency.
– Advantage: easy to learn model parameters.
– Challenge: need to get the "right" training data.

Active Learning
Need the "right training data" to learn the "right model":
– Cover the system operating range.
– Capture the main factors and interactions.
Theory of design of experiments:
– Exposes the system operating range, factors, and interactions.
Active sampling from machine learning:
– Pick the next sample to maximize the accuracy of the current model [SysML 2006].
Minimize the time/samples taken to learn an accurate model.
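
A minimal active-sampling sketch, assuming a toy linear system and hypothetical profile features: it picks the next experiment where two bootstrap fits of the current model disagree most. This is an illustrative stand-in for the design-of-experiments and active-sampling policies described above, not NIMO's actual algorithm.

```python
# Active-sampling sketch: repeatedly run the experiment whose outcome the
# current model is least certain about, approximated here by the
# disagreement between two fits on bootstrap resamples of the data so far.
import numpy as np

rng = np.random.default_rng(0)

def fit(X, y):
    """Least-squares fit with an intercept; returns a prediction function."""
    A = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda x: np.dot(np.append(x, 1.0), w)

def run_experiment(x):
    """Stand-in for actually deploying the application on assignment x."""
    return 0.03 * x[0] + 0.01 * x[1] + rng.normal(0, 0.001)

# Candidate assignments: [cpu_ghz, net_latency_ms] (hypothetical features).
pool = [np.array(p, float) for p in
        [[0.45, 0], [0.45, 18], [1.0, 5], [1.0, 18], [1.4, 0], [1.4, 9]]]

X = [pool.pop(0), pool.pop(0)]           # seed with two measured runs
y = [run_experiment(x) for x in X]

while pool:
    # Fit two models on bootstrap resamples of the data collected so far.
    idx = rng.integers(0, len(X), size=(2, len(X)))
    models = [fit(np.array([X[i] for i in row]),
                  np.array([y[i] for i in row])) for row in idx]
    # Sample next where the two fits disagree most (highest uncertainty).
    nxt = max(range(len(pool)),
              key=lambda j: abs(models[0](pool[j]) - models[1](pool[j])))
    x = pool.pop(nxt)
    X.append(x)
    y.append(run_experiment(x))
    print("sampled assignment", x.tolist())
```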

Outline Model-guided approach Active learning of models Model validation Model-guided planning Conclusions and future work

Planning: task placement. Goal: best performance. [Figure: candidate placements 1–3 across clusters C1, C2 (Sites A, B) and C3 (Site C) with the home file server.]

Summary
Holistic view of network and edge resources for networked application services.
– NSF GENI initiative.
"Autonomic" management:
– Sense and respond.
– Optimizing control loops based on learned models.
Large-scale systems involve a factoring of control functions and policies across multiple actors.
– Emergent behavior.
– Incentives and secure, accountable control.