Fault Detection
Sathish S. Vadhiyar
Source/Credits: From referenced papers

Introduction
- Fine-Grain Cycle Sharing (FGCS): host computers allow guest jobs to utilize idle CPU cycles
- The availability of host computers varies, so guest jobs may incur resource failures
- The availability of host computers therefore needs to be predicted
- A scheduling system can then allocate guest jobs based on the predicted availability of host computers

Kinds of Non-Availability
- FRC (Failures Caused by Resource Contention)
  - A guest job may significantly impact host processes
  - Hence the guest job can be removed
- FRR (Failures Caused by Resource Revocation)
  - A machine owner suspends resource contribution without notice
  - Hardware/software failures occur

Resource Failure Prediction
- A multi-state failure model and a semi-Markov Process (SMP) are used to predict temporal reliability
- Temporal reliability: the probability that no resource failure will occur on a machine in a future time window
- Host resource usage values are observed over a time window; the parameters of the SMP are calculated from these observations

Multi-State Resource Failure Model
- FRR – 2 states
  - A machine is either available or unavailable
- FRC
  - Failures occur when host processes incur noticeable slowdown due to contention from guest processes
  - The host can first decrease the priority of the guest processes; if this does not help, the guest process is terminated
  - Measured host resource usage serves as an indicator of noticeable slowdown

Initial Experiments
- To study the relation between host resource usage and FRC, experiments were conducted that simulate resource contention between a guest process and host processes
- Host-group: an aggregated set of host processes with various resource usages
- Slowdown of a host-group: the reduction of its CPU utilization due to a contending guest process
- Host programs are run with isolated CPU usages between 10% and 100%
- Guest process: a CPU-bound program

Experiments on CPU Contention
- Measured the reduction rate of host CPU usage for a host-group
- Experiments repeated with different host-groups at host priority 0, and guest priority 0 and 19 (renice)
- The measured reduction rate is plotted as a function of the isolated host CPU usage, L_H
- Two thresholds for L_H were found:
  - Th1 – the highest L_H up to which the slowdown stays below 5% without renicing the guest process; above Th1 the guest must be reniced
  - Th2 – the highest L_H up to which renicing keeps the slowdown below 5%; above Th2 the guest must be suspended
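The thresholds can be read directly off such measurements. Below is a minimal Python sketch (the function name and all numbers are illustrative, not from the paper) that picks Th1 and Th2 as the highest isolated host CPU usage for which the measured reduction rate stays below 5%, without and with the guest reniced:

    # Hypothetical sketch: derive Th1 and Th2 from measured reduction rates.
    # reduction_normal / reduction_reniced map isolated host CPU usage L_H (in
    # percent) to the measured host-group reduction rate when the guest runs at
    # normal priority (nice 0) and reniced (nice 19), respectively.

    def find_thresholds(reduction_normal, reduction_reniced, limit=0.05):
        """Return (Th1, Th2): highest L_H with slowdown below `limit`
        without renicing (Th1) and with the guest reniced (Th2)."""
        th1 = max((lh for lh, r in reduction_normal.items() if r < limit), default=None)
        th2 = max((lh for lh, r in reduction_reniced.items() if r < limit), default=None)
        return th1, th2

    # Example with made-up measurements (L_H in percent, reduction rate as a fraction).
    reduction_normal  = {10: 0.01, 20: 0.02, 30: 0.06, 50: 0.12, 80: 0.30}
    reduction_reniced = {10: 0.00, 20: 0.01, 30: 0.02, 50: 0.04, 80: 0.09}
    Th1, Th2 = find_thresholds(reduction_normal, reduction_reniced)
    print(Th1, Th2)   # 20, 50 for these made-up numbers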

State Model for FRC
- 3 states:
  - S1: L_H < Th1 – ignore resource contention due to guest processes; the slowdown is already below 5%
  - S2: Th1 < L_H < Th2 – renice the guest processes so that the slowdown stays below 5%
  - S3: L_H > Th2 – terminate the guest process
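A minimal sketch of this three-state rule as code (the function name and the example thresholds are hypothetical; Th1 and Th2 come from the contention experiments above):

    def frc_state(l_h, th1, th2):
        """Map the isolated host CPU usage L_H to a contention state."""
        if l_h < th1:
            return "S1"   # slowdown already < 5%: leave the guest alone
        elif l_h < th2:
            return "S2"   # renice the guest process (e.g. nice 19)
        else:
            return "S3"   # even a reniced guest hurts the host: terminate it

    print(frc_state(15, th1=20, th2=50))  # S1
    print(frc_state(35, th1=20, th2=50))  # S2
    print(frc_state(90, th1=20, th2=50))  # S3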

Experiments on CPU and Memory Contention
- Memory thrashing occurs when the total memory of the guest and host processes exceeds the available memory size
- Experiments were conducted to verify that memory thrashing does not depend on guest priority
- S4 – state for failure due to memory thrashing

Multi-State Failure Model
- The proposed prediction algorithm predicts the probability that a machine will never transfer to S3, S4, or S5 within a future time window
- Transitions:
  - Between S1, S2, S3 – decided by the measured host CPU usage
  - To S4 – when memory is limited
  - To S5 – when the machine becomes unavailable (resource revocation)
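Putting the pieces together, a monitor could classify the current state roughly as shown below. This is an illustrative sketch, not the paper's implementation; it treats S5 as the machine-unavailable (resource-revocation) state and uses the memory rule from the thrashing experiments, and all parameter names are assumptions:

    def observe_state(host_available, l_h, guest_mem_mb, host_mem_mb,
                      total_mem_mb, th1, th2):
        """Return the current state of the multi-state failure model (sketch)."""
        if not host_available:
            return "S5"     # resource revoked / machine unavailable (FRR)
        if guest_mem_mb + host_mem_mb > total_mem_mb:
            return "S4"     # memory thrashing: combined working sets exceed available memory
        if l_h < th1:
            return "S1"     # negligible contention
        if l_h < th2:
            return "S2"     # renice the guest process
        return "S3"         # terminate the guest process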

Semi-Markov Process (SMP) Model
- Applicable when the next transition depends only on:
  - the current state
  - how long the system has been in the current state
- Transition probabilities depend on the amount of time elapsed since the last change of state
- An SMP is defined by a 3-tuple (S, Q, H):
  - S – finite set of states
  - Q – state transition probability matrix
  - H – holding-time mass function matrix
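A minimal container for this 3-tuple, assuming the discrete-time formulation used later (the class and field names are illustrative, not from the paper):

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class SemiMarkovProcess:
        states: list        # S: the finite set of states, e.g. ["S1", "S2", "S3", "S4", "S5"]
        Q: np.ndarray       # Q[i, k]: probability that the next transition out of state i goes to k
        H: np.ndarray       # H[i, k, m]: probability that this i -> k transition occurs after
                            # exactly m discretization intervals spent in state i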

SMP (Contd.)
- The most important statistics of an SMP are the interval transition probabilities, P
- Calculating P for a continuous-time SMP is expensive
- Hence this work develops a discrete-time SMP model
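For reference, one standard discrete-time form of the interval transition probabilities, written from the generic S, Q, H definitions above (the paper's notation may differ):

    P_{ij}(n) = \delta_{ij}\Big[1 - \sum_{k}\sum_{m=1}^{n} Q_{ik}\,H_{ik}(m)\Big]
                + \sum_{k}\sum_{m=1}^{n} Q_{ik}\,H_{ik}(m)\,P_{kj}(n-m),
    \qquad P_{ij}(0) = \delta_{ij}

The first term is the probability that the process has not yet left state i after n intervals (it contributes only when j = i); the second term conditions on the first jump going to state k after m intervals and the process then moving from k to j in the remaining n - m intervals.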

SMP for Resource Availability
- TR – the probability of never transferring to S3, S4, or S5 within an arbitrary time window W
- S_init – the initial system state
- W – the time window from W_init to W_init + T
- Q and H are calculated from statistics in history logs obtained by monitoring host resource usage
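A sketch of how Q and H could be estimated from such logs, assuming the log has already been reduced to a sequence of (state, holding time in intervals) pairs; this log format and the function are assumptions, not the paper's code:

    import numpy as np

    def estimate_Q_H(log, states, max_hold):
        """Estimate Q[i, k] and H[i, k, m] from a list of (state, holding_time)
        pairs, where holding_time is in discretization intervals of length d."""
        idx = {s: i for i, s in enumerate(states)}
        n = len(states)
        counts = np.zeros((n, n))              # transition counts i -> k
        hold = np.zeros((n, n, max_hold + 1))  # holding-time counts for i -> k
        for (s, t), (s_next, _) in zip(log, log[1:]):
            i, k = idx[s], idx[s_next]
            counts[i, k] += 1
            hold[i, k, min(t, max_hold)] += 1
        Q = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
        H = hold / np.maximum(hold.sum(axis=2, keepdims=True), 1)
        return Q, H

    # Example: a short, made-up log (state, intervals spent in that state).
    log = [("S1", 5), ("S2", 2), ("S1", 7), ("S2", 1), ("S1", 4)]
    Q, H = estimate_Q_H(log, ["S1", "S2", "S3", "S4", "S5"], max_hold=10)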

SMP for Resource Availability (Contd.)
- P_{i,j}(m) = P_{i,j}(W_init, W_init + m)
- P^1_{i,k}(l) – interval transition probabilities for a one-step transition
- d – the time unit of a discretization interval
- Q and H are calculated from statistics in history logs obtained by monitoring host resource usage
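A Python sketch of the whole computation, using Q and H arrays as defined above and the standard discrete-time recursion; the failure states are assumed to be absorbing, so "never entering a failure state within T intervals" equals "not being in one at step T":

    import numpy as np

    def interval_transition_probs(Q, H, T):
        """P[n, i, j]: probability of being in state j, n intervals after
        entering state i (standard discrete-time SMP recursion; a sketch,
        not the paper's code)."""
        n_states = Q.shape[0]
        max_hold = H.shape[2] - 1
        W = Q[:, :, None] * H                # W[i, k, m]: first jump i -> k after m intervals
        P = np.zeros((T + 1, n_states, n_states))
        P[0] = np.eye(n_states)
        for n in range(1, T + 1):
            m_max = min(n, max_hold)
            stay = 1.0 - W[:, :, 1:m_max + 1].sum(axis=(1, 2))  # not yet left state i
            P[n] = np.diag(np.clip(stay, 0.0, 1.0))
            for m in range(1, m_max + 1):
                P[n] += W[:, :, m] @ P[n - m]  # jump to some k after m intervals, then k -> j
        return P

    def temporal_reliability(Q, H, init_state, fail_states, T):
        """TR: probability of never entering a failure state within T intervals,
        assuming the failure states are absorbing."""
        P = interval_transition_probs(Q, H, T)
        ok = [j for j in range(Q.shape[0]) if j not in fail_states]
        return P[T, init_state, ok].sum()

    # Example call (Q and H as in the earlier sketch, states S1..S5 indexed 0..4):
    # tr = temporal_reliability(Q, H, init_state=0, fail_states={2, 3, 4}, T=30)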

System Design and Implementation
- A client requests job submission
- The client's job scheduler queries the gateways of available machines for their temporal availabilities
- It chooses a machine and spawns a guest job there
- During job execution, the resource monitor detects state transitions and notifies the gateway
- The gateway renices or kills the guest processes accordingly
- The resource monitor uses simple commands such as `top` to calculate CPU usage
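An illustrative monitor loop (not the paper's implementation): it samples the aggregate CPU usage of host processes with `ps` rather than `top` for easier parsing, classifies the FRC state, and calls a placeholder notification function when the state changes. The user name, thresholds, and sampling interval are assumptions:

    import subprocess, time

    GUEST_USER = "guestjob"          # assumption: guest processes run under this user

    def host_cpu_usage():
        """Sum the %CPU of all processes not owned by the guest user."""
        out = subprocess.run(["ps", "-eo", "user,pcpu"], capture_output=True, text=True)
        total = 0.0
        for line in out.stdout.splitlines()[1:]:   # skip the header line
            user, pcpu = line.split(None, 1)
            if user != GUEST_USER:
                total += float(pcpu)
        return total

    def notify_gateway(new_state):
        print("state changed to", new_state)       # placeholder for the real notification

    def monitor(th1, th2, interval=5):
        state = None
        while True:
            l_h = host_cpu_usage()
            new_state = "S1" if l_h < th1 else ("S2" if l_h < th2 else "S3")
            if new_state != state:
                notify_gateway(new_state)
                state = new_state
            time.sleep(interval)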

Computation in Solving the SMP
- Matrix sparsity in the SMP is exploited to reduce computation
- The sparse matrix is constructed based on two facts:
  - It takes a finite amount of time to transition from one state to another
  - S3, S4, and S5 are unrecoverable failure states
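A small illustration of that sparsity, with made-up probabilities and using scipy.sparse (an assumption; the paper does not say which library, if any, it uses): the rows for S3, S4, and S5 reduce to self-loops, and H(i, k, 0) = 0 for every pair because a transition always takes at least one discretization interval.

    import numpy as np
    from scipy.sparse import csr_matrix

    # Illustrative 5-state transition matrix; the numbers are made up.
    Q_dense = np.array([
        [0.0, 0.6, 0.2, 0.1, 0.1],   # S1
        [0.5, 0.0, 0.3, 0.1, 0.1],   # S2
        [0.0, 0.0, 1.0, 0.0, 0.0],   # S3 (absorbing failure state)
        [0.0, 0.0, 0.0, 1.0, 0.0],   # S4 (absorbing failure state)
        [0.0, 0.0, 0.0, 0.0, 1.0],   # S5 (absorbing failure state)
    ])
    Q = csr_matrix(Q_dense)          # only 11 of the 25 entries are non-zero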

Prediction Accuracy
- TR gets close to 0 for large time windows

Appropriate Training Size

Comparison with Linear Regression Techniques

Injecting Noise

References
- X. Ren, S. Lee, R. Eigenmann, and S. Bagchi. Resource Failure Prediction in Fine-Grained Cycle Sharing Systems. HPDC 2006.