1 https://sites.google.com/site/acceleratingexascale.

2 Goal
 "Make it easier for scientists to conduct large-scale computational tasks that use the power of computing resources they do not own to process data they did not collect with applications they did not develop"
 In practice: monitoring, modeling, and resource provisioning

3 Overview of the Resource Provisioning Loop
(Diagram of the dV/dt resource provisioning loop: Workload Characterization, Workload Estimation, Resource Allocation, Execution, Monitoring, Execution Traces, Workload Archive.)

Monitoring Resource Usage

5 HTC Monitoring (USC and ND)
 Job wrappers that collect information about processes
–Runtime, peak disk usage, peak memory usage, CPU usage, etc.
 Mechanisms
–Polling (not accurate, low overhead)
–ptrace() system call interposition (accurate, high overhead)
–LD_PRELOAD library call interposition (accurate, low overhead)
 Kickstart (Pegasus) and resource-monitor (Makeflow)

Gideon Juve, et al., Practical Resource Monitoring for Robust High Throughput Computing, University of Southern California, Technical Report.

Error (accuracy) by mechanism (Polling / LD_PRELOAD / Ptrace):
 CPU: 0.5% - 12% / 0.5% - 5% / < 0.2%
 Memory: 2% - 14% / < 0.1% / ~ 0%
 I/O: 2% - 20% / 0%

Overhead by mechanism (Polling / LD_PRELOAD / Ptrace fork-exit / Ptrace syscalls):
 CPU: low
 Memory: low / medium / low / medium
 I/O: low / high
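The accuracy gap between polling and interposition comes from short-lived peaks that a poller can miss between samples. A minimal sketch of that effect, with a hypothetical usage curve and sampling periods (this is not the Kickstart or resource-monitor implementation):

```python
def usage_mb(t):
    """Hypothetical memory-usage curve: 50 MB baseline with a short
    ~200 ms spike to 400 MB starting at t = 1.02 s."""
    return 400.0 if 1.02 <= t < 1.23 else 50.0

def polled_peak(period, duration=10.0):
    """Peak observed when sampling every `period` seconds."""
    n = int(duration / period)
    return max(usage_mb(i * period) for i in range(n))

true_peak = max(usage_mb(t / 1000.0) for t in range(10000))  # 1 ms grid
coarse = polled_peak(0.5)    # 500 ms polling misses the spike entirely
fine = polled_peak(0.05)     # 50 ms polling catches it

print(true_peak, coarse, fine)
```

Shrinking the polling period recovers accuracy but raises overhead, which is exactly the trade-off in the tables above.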

6 HPC Monitoring (ALCF)
 Job information from the scheduler (Cobalt)
–Use scheduler data for both scheduler and individual task data
–Job runtime, number of cores, user estimates, etc.
 I/O using Darshan
–Instrumentation automatically linked into codes at compile time
–Captures POSIX I/O, MPI I/O, and some HDF5 and NetCDF functions
–Amount read/written, time in I/O, files accessed, etc.
–Very low overhead in both time and memory
 Performance counters using AutoPerf
–Uses built-in hardware performance counters
–Also enabled at compile time
–Counters are zeroed in MPI_Init and reported in MPI_Finalize
–FLOPs, cache misses, etc.
–Users can take control of the performance counters, which prevents this from working

Workload Modeling and Characterization

8 CMS Workload Characteristics (USC, UW-M)

Characteristics of the CMS workload for a period of one month (Aug 2014):

General workload
 Total number of jobs: 1,435,280
 Total number of users: 392
 Total number of execution sites: 75
 Total number of execution nodes: 15,484
Job statistics
 Completed jobs: 792,603
 Preempted jobs: 257,230
 Jobs with non-zero exit code: 385,447
 Average job runtime: 9,444.6 s (std. dev. 14,988.8 s)
 Average disk usage: 55.3 MB (std. dev. 219.1 MB)
 Average memory usage: 217.1 MB (std. dev. 659.6 MB)

9 Workload Characterization: Correlation Statistics
 Weak correlations suggest that none of the properties can be directly used to predict future workload behavior
 In the correlation plots, two variables are strongly correlated when their ellipse narrows toward a line
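The ellipse plots summarize pairwise correlation; Pearson's r is the usual numeric form of the same idea. A small sketch with invented job data (not the CMS trace) contrasting a weak correlation against a perfectly linear one:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented per-job data: runtimes in seconds and roughly flat memory in MB.
runtime = [120, 3600, 45, 900, 7200, 300, 60, 1800]
memory = [210, 230, 195, 220, 205, 190, 225, 200]

r_flat = pearson_r(runtime, memory)          # near 0: no predictive value
r_line = pearson_r(runtime, [2 * r + 5 for r in runtime])  # exactly linear

print(round(r_flat, 2), round(r_line, 2))
```

A near-zero r corresponds to a round ellipse; r near 1 corresponds to an ellipse collapsed to a line.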

10 Workload Characterization (2)
 Correlation measures are sensitive to the data distribution
 The probability density functions do not fit any of the most common families of densities (e.g., Normal or Gamma)
 Our approach: a recursive partitioning method that combines properties of the workload to build regression trees

11 Regression Trees
 The recursive algorithm looks for PDFs that fit a known family of densities
 In this work, we consider the Normal and Gamma distributions
 Goodness of fit is measured with the Kolmogorov-Smirnov test (K-S test)
 Example: the PDF for the tree node (in blue) fits a Gamma distribution (in grey) with the following parameters:
–Shape parameter = 12
–Rate parameter = 5×10⁻⁴
–Mean =
–p-value = 0.17
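The K-S test compares the empirical CDF of a sample against a candidate distribution's CDF; the statistic D is the largest gap between the two. A self-contained sketch of the statistic against a Normal candidate (the real analysis would use a standard statistics package such as scipy.stats.kstest; the sample here is synthetic):

```python
import math
import random

def normal_cdf(x, mu, sigma):
    """CDF of a Normal(mu, sigma) distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov statistic D: the largest gap
    between the empirical CDF and the candidate distribution's CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # Check the gap just before and just after each step of the ECDF.
        d = max(d, abs((i + 1) / n - f), abs(i / n - f))
    return d

random.seed(42)
normal_sample = [random.gauss(100, 15) for _ in range(500)]
good = ks_statistic(normal_sample, lambda x: normal_cdf(x, 100, 15))
bad = ks_statistic(normal_sample, lambda x: normal_cdf(x, 50, 15))

print(round(good, 3), round(bad, 3))  # a well-fit candidate has small D
```

A small D (large p-value, as in the 0.17 of the example node) means the fitted family is accepted for that tree node.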

12 Job Estimation: Experimental Results
(Plots: estimation accuracy for job runtime, disk usage, and memory usage.)
 Estimates are based on the regression trees: we built a regression tree per user
 Estimates are generated according to a fitted distribution (Normal or Gamma) or a uniform distribution
 Average accuracy is measured over the workload dataset; the training set is defined as a portion of the entire dataset
 The median accuracy increases as more data is used for the training set
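The recursive-partitioning step can be illustrated with a single split (a depth-one tree); the real method recurses and fits Normal or Gamma distributions at the leaves rather than using leaf means. The per-user history below is hypothetical:

```python
# Minimal sketch of recursive partitioning: pick the threshold on one
# job property that best separates runtimes into homogeneous groups.

def best_split(xs, ys):
    """Threshold on x minimizing the summed squared error of
    predicting y by the leaf mean on each side of the split."""
    def sse(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    best_t, best_err = None, sse(ys)  # None = no split beats the root
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        err = sse(left) + sse(right)
        if right and err < best_err:
            best_t, best_err = t, err
    return best_t

# Hypothetical history for one user: input size (GB) vs. runtime (s).
sizes = [1, 1, 2, 2, 8, 8, 9, 10]
runtimes = [100, 110, 105, 95, 900, 950, 920, 930]

t = best_split(sizes, runtimes)
small = [r for s, r in zip(sizes, runtimes) if s <= t]
print(t, sum(small) / len(small))  # splits small vs. large inputs
```

Each leaf then gets its own fitted distribution, from which runtime, disk, and memory estimates are drawn.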

Provisioning and Resource Allocation

14 Resource Allocation (ND)
 Tasks have different sizes (known only at runtime), while computation nodes have fixed sizes
 Resource allocation strategies:
–One task per node: resources are underutilized and throughput is reduced
–Many tasks per node: resources are exhausted, jobs fail, and throughput is reduced

15 General Approach
 What do we know when setting task sizes?
–Maximum size? Size probability distribution? Empirical distribution? Perfect information?
 Our approach: set task sizes to reduce resource waste
–Models resource sizes (e.g., memory, disk, or network bandwidth)
–Assumes the task size distribution is known
–Adapts to empirical distributions
 For a task of unknown size: compute a candidate size, run the task in a node with that much space available, and monitor it, killing it if it exceeds its allocation; on success, record the result; on failure, retry with a larger size, and record a failure once the maximum size has already been tried
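The retry loop above can be sketched as follows (the helper name and the success criterion are illustrative, not the actual implementation; the real system kills a monitored task when it exceeds its allocation):

```python
def run_task_with_retries(true_size, size_sequence, max_size):
    """Try each allocation in `size_sequence` in order. A try succeeds
    when the allocation covers the task's true peak usage. Returns the
    allocations tried and whether the task eventually succeeded."""
    tried = []
    for alloc in size_sequence:
        tried.append(alloc)
        if alloc >= true_size:
            return tried, True      # record result
        if alloc >= max_size:
            break                   # already at max size: give up
    return tried, False             # record failure

# Two-step policy: guess a "typical" size first, then fall back to max.
tried, ok = run_task_with_retries(true_size=60,
                                  size_sequence=[25, 100],
                                  max_size=100)
print(tried, ok)  # first attempt is killed, retry at the max succeeds
```

The choice of the intermediate sizes in `size_sequence` is exactly what the waste model on the next slide optimizes.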

16 Resource Waste Modeling
 Model the task resource usage as a function of time; total usage is resource × time (the area below the curve)
 Overestimating the size wastes the area above the curve
 Underestimating the size wastes resource × time up to the point of resource exhaustion
 Single Peaks Model: simplifying assumption that resource exhaustion can only happen at the time of the maximum peak (i.e., resource usage looks like a step function)
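Under the single-peaks assumption, the expected cost of a step sequence can be computed directly from a task-size distribution. A sketch comparing the one-step policy (always allocate the maximum) against the best two-step policy, with an invented size distribution and unit task runtime (so cost is just allocated resource):

```python
# Illustrative task-size distribution, not the paper's measurements.
sizes = [10, 20, 40, 100]        # possible peak memory (MB)
probs = [0.5, 0.3, 0.15, 0.05]   # their probabilities
MAX_SIZE = 100

def expected_cost(first_alloc):
    """Two-step policy: try `first_alloc`, retry at MAX_SIZE on failure.
    A failed attempt wastes its whole allocation (single-peaks model).
    first_alloc == MAX_SIZE degenerates to the one-step policy."""
    cost = 0.0
    for s, p in zip(sizes, probs):
        if s <= first_alloc:
            cost += p * first_alloc               # fits on the first try
        else:
            cost += p * (first_alloc + MAX_SIZE)  # wasted try + retry
    return cost

one_step = expected_cost(MAX_SIZE)
best_first = min(range(1, MAX_SIZE + 1), key=expected_cost)

print(one_step, best_first, expected_cost(best_first))
```

Even with occasional retries, a well-chosen first step allocates far less resource per task on average than always reserving the maximum.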

17 Synthetic Workload Experiment: Exponential Distribution
 5,000 tasks
 Memory drawn from an exponential distribution: shifted minimum of 10 MB, truncated maximum of 100 MB, average of 20 MB
 Tasks run anywhere from 10 to 20 seconds
 100 computation nodes available from the ND Condor pool, each with 4 cores and a limit of 100 MB of memory
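One way such a workload could be generated (the shift and truncation follow the slide; using a base exponential mean of 10 MB to hit the ~20 MB average, and rejection sampling for the truncation, are our choices):

```python
import random

random.seed(7)

def task_memory_mb():
    """Draw one task's peak memory: exponential with mean 10 MB,
    shifted to a 10 MB minimum, truncated at 100 MB by resampling."""
    while True:
        x = 10.0 + random.expovariate(1.0 / 10.0)
        if x <= 100.0:
            return x

sample = [task_memory_mb() for _ in range(5000)]
avg = sum(sample) / len(sample)
print(round(avg, 1))  # close to the 20 MB target average
```

The resulting sizes feed the allocation policies above: most tasks are small, but the rare large task is what makes a naive first-step allocation fail.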

18 Example: One-, Two-, and Multi-Step Sequences with "Slow Peaks"
(Plot: normalized resource units per task, less is better, comparing one-step (always max), two-step (the optimal choice from the previous table), and multi-step (as on the previous slide, but in one column) allocations.)

19 Next Steps
 Improve monitoring and modeling
–Investigate network I/O and energy
–Extend modeling to parallel, HPC applications
 Close the loop
–Turn on detailed monitoring in workflows
–Use resource predictions for provisioning and scheduling
 Productize tools
–Deploy monitoring capabilities in production environments
–Turn modeling software into a service