USING SIMPLE PID CONTROLLERS TO PREVENT AND MITIGATE FAULTS IN SCIENTIFIC WORKFLOWS Rafael Ferreira da Silva1, Rosa Filgueira2, Ewa Deelman1, Erola Pairo-Castineira3,

Slides:



Advertisements
Similar presentations
Tuning PID Controller Institute of Industrial Control,
Advertisements

10.1 Introduction Chapter 10 PID Controls
CHE 185 – PROCESS CONTROL AND DYNAMICS
PID Controllers and PID tuning
Modern Control Systems (MCS)
INDUSTRIAL AUTOMATION (Getting Started week -1). Contents PID Controller. Implementation of PID Controller. Response under actuator Saturation. PID with.
MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.
Lecture 11: Operating System Services. What is an Operating System? An operating system is an event driven program which acts as an interface between.
The ADAMANT Project: Linking Scientific Workflows and Networks “Adaptive Data-Aware Multi-Domain Application Network Topologies” Ilia Baldine, Charles.
PID Control -1 + Professor Walter W. Olson
Restricted Slow-Start for TCP William Allcock 1,2, Sanjay Hegde 3 and Rajkumar Kettimuthu 1,2 1 Argonne National Laboratory 2 The University of Chicago.
Spark: Cluster Computing with Working Sets
Meeting Service Level Objectives of Pig Programs Zhuoyao Zhang, Ludmila Cherkasova, Abhishek Verma, Boon Thau Loo University of Pennsylvania Hewlett-Packard.
Software Quality Ranking: Bringing Order to Software Modules in Testing Fei Xing Michael R. Lyu Ping Guo.
Process Control Instrumentation II
CPS216: Advanced Database Systems (Data-intensive Computing Systems) How MapReduce Works (in Hadoop) Shivnath Babu.
PID Control & ACS550 and ACH550
Proportional/Integral/Derivative Control
Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.
Authors: Weiwei Chen, Ewa Deelman 9th International Conference on Parallel Processing and Applied Mathmatics 1.
Scalable Analysis of Distributed Workflow Traces Daniel K. Gunter and Brian Tierney Distributed Systems Department Lawrence Berkeley National Laboratory.
What are the main differences and commonalities between the IS and DA systems? How information is transferred between tasks: (i) IS it may be often achieved.
CHEP'07 September D0 data reprocessing on OSG Authors Andrew Baranovski (Fermilab) for B. Abbot, M. Diesburg, G. Garzoglio, T. Kurca, P. Mhashilkar.
A Framework for Elastic Execution of Existing MPI Programs Aarthi Raveendran Graduate Student Department Of CSE 1.
Your university or experiment logo here Caitriana Nicholson University of Glasgow Dynamic Data Replication in LCG 2008.
A Survey of Distributed Task Schedulers Kei Takahashi (M1)
Introduction to ROBOTICS
Computer Science and Engineering Predicting Performance for Grid-Based P. 1 IPDPS’07 A Performance Prediction Framework.
PID Controller Design and
A Passive Approach to Sensor Network Localization Rahul Biswas and Sebastian Thrun International Conference on Intelligent Robots and Systems 2004 Presented.
PID. The proportional term produces an output value that is proportional to the current error value. Kp, called the proportional gain constant.
Chapter 7 Adjusting Controller Parameters Professor Shi-Shang Jang Chemical Engineering Department National Tsing-Hua University Hsin Chu, Taiwan.
PID CONTROLLERS By Harshal Inamdar.
CSCI1600: Embedded and Real Time Software Lecture 12: Modeling V: Control Systems and Feedback Steven Reiss, Fall 2015.
Copyright © Clifford Neuman - UNIVERSITY OF SOUTHERN CALIFORNIA - INFORMATION SCIENCES INSTITUTE Advanced Operating Systems Lecture notes Dr.
HPC HPC-5 Systems Integration High Performance Computing 1 Application Resilience: Making Progress in Spite of Failure Nathan A. DeBardeleben and John.
Euro-Par, HASTE: An Adaptive Middleware for Supporting Time-Critical Event Handling in Distributed Environments ICAC 2008 Conference June 2 nd,
CSE 5810 Biomedical Informatics and Cloud Computing Zhitong Fei Computer Science & Engineering Department The University of Connecticut CSE5810: Introduction.
1 Performance Impact of Resource Provisioning on Workflows Gurmeet Singh, Carl Kesselman and Ewa Deelman Information Science Institute University of Southern.
1
CIS 540 Principles of Embedded Computation Spring Instructor: Rajeev Alur
ANALYSIS TRAIN ON THE GRID Mihaela Gheata. AOD production train ◦ AOD production will be organized in a ‘train’ of tasks ◦ To maximize efficiency of full.
Lecture 3 – MapReduce: Implementation CSE 490h – Introduction to Distributed Computing, Spring 2009 Except as otherwise noted, the content of this presentation.
EEN-E1040 Measurement and Control of Energy Systems Control I: Control, processes, PID controllers and PID tuning Nov 3rd 2016 If not marked otherwise,
PID Control for Embedded Systems
Lecture 1: Operating System Services
Chapter 7. Classification and Prediction
Salman Bin Abdulaziz University
Chapter 7 The Root Locus Method The root-locus method is a powerful tool for designing and analyzing feedback control systems The Root Locus Concept The.
Control engineering and signal processing
Lec 14. PID Controller Design
CSCI1600: Embedded and Real Time Software
PID controller for improving Max power point tracking system
Supporting Fault-Tolerance in Streaming Grid Applications
6: Processor-based Control Systems
Basic Design of PID Controller
湖南大学-信息科学与工程学院-计算机与科学系
On Spatial Joins in MapReduce
University of Southern California
Soft Error Detection for Iterative Applications Using Offline Training
Brief Review of Control Theory
An Adaptive Middleware for Supporting Time-Critical Event Response
ANALYSIS OF USER SUBMISSION BEHAVIOR ON HPC AND HTC
Laura Bright David Maier Portland State University
Dynamical Systems Basics
Advanced LabViEW
A General Approach to Real-time Workflow Monitoring
PID Controller Design and
Outline Control structure design (plantwide control)
Jetson-Enabled Autonomous Vehicle
Presentation transcript:

USING SIMPLE PID CONTROLLERS TO PREVENT AND MITIGATE FAULTS IN SCIENTIFIC WORKFLOWS Rafael Ferreira da Silva1, Rosa Filgueira2, Ewa Deelman1, Erola Pairo-Castineira3, Ian Michael Overton4, Malcolm Atkinson5 1USC Information Sciences Institute 2British Geological Survey, Lyell Centre 3MRC Institute of Genetics and Molecular Medicine, University of Edinburgh 4Usher Institute of Population Health Sciences and Informatics, University of Edinburgh 5School of Informatics, University of Edinburgh 11th Workflows in Support of Large-Scale Science (WORKS’16) Salt Lake City, UT – November 14th, 2016

OUTLINE Introduction PID Controllers Defining Controllers Scientific Workflows Motivation Related Work Definition Control System Loop Data Management Memory Management Experimental Evaluation Tuning PID Controllers Summary Workflow Application Experiments Conditions Results and Discussion Ziegler-Nichols Method Experimental Evaluation Conclusions Future Research Directions R. Ferreira da Silva, R. Filgueira, E. Deelman, E. Pairo-Castineira, I. M. Overton, M. Atkinson Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Workflows

WHY SCIENTIFIC WORKFLOWS? Automate Recover Debug > > > Automation Enables parallel, distributed computations Automatically executes data transfers Recover & Debug Handles failures with to provide reliability Keeps track of data and files Reproducibility Reusable, aids reproducibility Records how data was produced (provenance) R. Ferreira da Silva, R. Filgueira, E. Deelman, E. Pairo-Castineira, I. M. Overton, M. Atkinson Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Workflows

over 14k failed jobs out of 80k MOTIVATION Grid computing 180k failed tasks out of 340k CMS (Aug 2014) 385k failed tasks out of 790k The Failure Trace Archive 26 datasets from 2006-2014 Mira (2014) over 14k failed jobs out of 80k R. Ferreira da Silva, R. Filgueira, E. Deelman, E. Pairo-Castineira, I. M. Overton, M. Atkinson Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Workflows

SOME APPROACHES TO HANDLE FAULTS Typical Approaches Statistical and Machine Learning Task Retries Task Resubmission Task Clustering Checkpointing Provenance … Linear Regression Neural Networks Classification Algorithms Tree-based Methods Support Vector Machines … Analytical Solutions Others Failure Modeling Markov Chains Principal Component Analysis Histograms … Exception Handling Game Theory … R. Ferreira da Silva, R. Filgueira, E. Deelman, E. Pairo-Castineira, I. M. Overton, M. Atkinson Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Workflows

… AND SOME OF THEIR LIMITATIONS Most of them make strong assumptions about resource and application characteristics Accurate estimates of such requirements are still a steep challenge Some approaches may overload the execution platform Most of the systems do not prevent faults, but mitigate them We seek for an approach to predict, prevent, and mitigate failures in end-to-end workflow executions across distributed systems under online and unknown conditions Some approaches are tied to a small set of applications <<< R. Ferreira da Silva, R. Filgueira, E. Deelman, E. Pairo-Castineira, I. M. Overton, M. Atkinson Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Workflows

< PID CONTROLLERS Proportional-Integral-Derivative Controller P Σ I + Setpoint + + Input Output Σ I Σ Control loop mechanism Widely used in industrial control systems - Temperature - Pressure - Flow rate - etc. PID controller aims at detecting the possibility of a fault far enough in advance so that an action can be performed to prevent it from happening Process - + D z 0.0 0.5 1.0 1.5 Time Dead time Raise time Percent overshoot Settling time Steady-state error < R. Ferreira da Silva, R. Filgueira, E. Deelman, E. Pairo-Castineira, I. M. Overton, M. Atkinson Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Workflows

PID PROCESS VARIABLES Proportional-Integral-Derivative Controller Kp: Proportional gain constant Ki: Integral gain constant Kd: Derivative gain constant e: error defined as the difference between the setpoint and the process variable value Proportional-Integral-Derivative Controller proportional Present error derivative Prediction of future errors based on current rate of change integral Accumulation of past errors R. Ferreira da Silva, R. Filgueira, E. Deelman, E. Pairo-Castineira, I. M. Overton, M. Atkinson Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Workflows

DATA FOOTPRINT AND MANAGEMENT PID Controller P: the error between the setpoint, and the actual used disk space I: cumulative value of the proportional responses D: the difference between the current and the previous disk overflow (or underutilization) error values A run of scientific workflows that manipulate large data sets may lead the system to an out of disk space fault Actions u(t) < 0: data cleanup is used to remove unused data; or tasks are preempted u(t) > 0: the number of concurrent task executions may be increased R. Ferreira da Silva, R. Filgueira, E. Deelman, E. Pairo-Castineira, I. M. Overton, M. Atkinson Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Workflows

MEMORY USAGE AND MANAGEMENT PID Controller P: error between the setpoint value, and the actual memory usage I: cumulative value of previous memory usage errors D: difference between the current and the previous memory overflow (or underutilization) error values Actions u(t) < 0: tasks are preempted to prevent the system to run out of memory u(t) > 0: the WMS may spawn additional tasks for concurrent execution The performance of memory-intensive operations are often limited by the memory capacity of the resource where the application is being executed. R. Ferreira da Silva, R. Filgueira, E. Deelman, E. Pairo-Castineira, I. M. Overton, M. Atkinson Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Workflows

https://github.com/pegasus-isi/1000genome-workflow WORKFLOW APPLICATION 1000 Genome Sequencing Analysis Workflow Identifies mutational overlaps using data from the 1000 genomes project 22 Individual tasks, 7 Population tasks, 22 Sifting tasks, 154 Pair Overlap Mutations tasks, and 154 Frequency Overlap Mutations tasks (Total 359 tasks) Handle failures with to provide reliability The workflow consumes/produces over 4.4TB of data, and requires over 24TB of memory > https://github.com/pegasus-isi/1000genome-workflow R. Ferreira da Silva, R. Filgueira, E. Deelman, E. Pairo-Castineira, I. M. Overton, M. Atkinson Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Workflows

EXPERIMENT SETUP Memory Disk Memory Compute Node 1 Shared filesystem shared memory Compute Node 1 Memory capacity 500GB Shared filesystem Disk Workflow Management System Compute Node 2 Memory PID Control Loop R. Ferreira da Silva, R. Filgueira, E. Deelman, E. Pairo-Castineira, I. M. Overton, M. Atkinson Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Workflows

EXPERIMENT CONDITIONS we arbitrarily define our setpoint as 80% of the maximum total capacity (for both storage and memory usage, and a steady-state error of 5% Execution are performed under online and unknown conditions We assume Kp = Ki = Kd = 1 (no tuning) Reference Workflow Execution the decision on the number of tasks to be scheduled or preempted is computed as the min between the response value of the unique disk usage PID controller, and the memory PID controller per resource Computed offline under known conditions Averaged Makespan: ~106h (standard deviation < 5%) R. Ferreira da Silva, R. Filgueira, E. Deelman, E. Pairo-Castineira, I. M. Overton, M. Atkinson Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Workflows

OVERALL MAKESPAN EVALUATION Average workflow makespan for different configurations of the controllers Proportional Kp = 1, Ki = Kd = 0 Makespan: 138.76h Slowdown: 1.30 Proportional-Integral Kp = Ki = 1, Kd = 0 Makespan: 126.69h Slowdown: 1.19 only mitigates faults Proportional-Integral-Derivative Kp = Ki = Kd = 1 Makespan: 114.96h Slowdown: 1.08 prevents faults R. Ferreira da Silva, R. Filgueira, E. Deelman, E. Pairo-Castineira, I. M. Overton, M. Atkinson Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Workflows

EXPERIMENTS: DATA FOOTPRINT PROPORTIONAL PROPORTIONAL INTEGRAL PROPORTIONAL INTEGRAL DERIVATIVE R. Ferreira da Silva, R. Filgueira, E. Deelman, E. Pairo-Castineira, I. M. Overton, M. Atkinson Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Workflows

EXPERIMENTS: DATA FOOTPRINT PROPORTIONAL This process occurs at about 4h, and performs more than 6,000 preemptions R. Ferreira da Silva, R. Filgueira, E. Deelman, E. Pairo-Castineira, I. M. Overton, M. Atkinson Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Workflows

EXPERIMENTS: MEMORY USAGE PROPORTIONAL PROPORTIONAL INTEGRAL PROPORTIONAL INTEGRAL DERIVATIVE R. Ferreira da Silva, R. Filgueira, E. Deelman, E. Pairo-Castineira, I. M. Overton, M. Atkinson Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Workflows

EXPERIMENTS: MEMORY USAGE PROPORTIONAL only a few tasks (on average less than 5) are preempted due to memory overflow R. Ferreira da Silva, R. Filgueira, E. Deelman, E. Pairo-Castineira, I. M. Overton, M. Atkinson Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Workflows

OVERALL RESULTS MEMORY USAGE DATA FOOTPRINT R. Ferreira da Silva, R. Filgueira, E. Deelman, E. Pairo-Castineira, I. M. Overton, M. Atkinson Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Workflows

TUNING PID CONTROLLERS Execution Environment The goal of tuning a PID loop is to make it stable, responsive, and to minimize overshooting Ziegler-Nichols Method Ziegler-Nichols tuning, using the oscillation method Turn the PID controller into a P controller by setting Ki = Kd = 0. Initially, Kp is also set to zero Increase Kp until there are sustained oscillations in the signal. This Kp value is the ultimate gain, Ku Measure the ultimate (or critical) period Tu of the sustained oscillations > Tuned gain parameters R. Ferreira da Silva, R. Filgueira, E. Deelman, E. Pairo-Castineira, I. M. Overton, M. Atkinson Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Workflows

TUNED GAIN PARAMETERS Avg. Makespan: 107.37h Avg. Slowdown: 1.01 Preempted Tasks: 18 Cleanup Tasks: 1 The key factor of its success is due to the specialization of the controllers to a single application DATA FOOTPRINT MEMORY USAGE R. Ferreira da Silva, R. Filgueira, E. Deelman, E. Pairo-Castineira, I. M. Overton, M. Atkinson Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Workflows

> SUMMARY Summary Conclusions Future Research Directions Conclusion Experimental results show that faults are detected and prevented before their occur, leading workflow execution to its completion with acceptable performance PID controllers should be used sparingly, and metrics (and actions) should be defined in a way that they do not lead the system to an inconsistent state We will investigate the simultaneous use of multiple control loops at the application and infrastructure levels, to determine to which extent this approach may negatively impact the system Future Research Directions R. Ferreira da Silva, R. Filgueira, E. Deelman, E. Pairo-Castineira, I. M. Overton, M. Atkinson Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Workflows

USING SIMPLE PID CONTROLLERS TO PREVENT AND MITIGATE FAULTS IN SCIENTIFIC WORKFLOWS Thank You Questions? Rafael Ferreira da Silva, Ph.D. Research Assistant Professor Department of Computer Science University of Southern California rafsilva@isi.edu – http://rafaelsilva.com http://pegasus.isi.edu