By: Swetha Kendyala Software Rejuvenation.

Introduction: When software applications execute continuously for long periods, their processes age: they slowly degrade in how effectively they use system resources. Process aging hurts performance and eventually causes the application to fail.

What is Software Rejuvenation? The act of gracefully terminating an application and immediately restarting it. Goal: prevent unexpected error termination by stopping the program before it suffers an error.

Intended Use: Software rejuvenation is primarily intended for servers, where applications are expected to run indefinitely without failure.

Why do applications fail? Process aging: the gradual degradation of application performance over time, which may lead to premature program termination.

Causes: memory leaks; unreleased file locks; file descriptor leaks; etc.

Software Rejuvenation: periodic, preemptive rollback of continuously running applications to prevent future failures.

[Figures: transition model for software without rejuvenation; transition model for software with rejuvenation]

Downtime and cost without rejuvenation, over an interval L:
P_f = [expression missing from transcript]
Downtime_w/o_r(L) = P_f * L
Cost_w/o_r(L) = P_f * L * c_f
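The expression for P_f did not survive in the transcript. As a minimal sketch, assume the three-state cycle often used for this kind of analysis: a robust state S_0, a failure-probable state S_P, and a failure state S_F, with rate r_2 from S_0 to S_P, rate λ from S_P to S_F, and rate r_1 from S_F back to S_0. Under that assumption, the steady-state balance equations give P_f = 1/(1 + r_1/λ + r_1/r_2). The function names and the interval L = 8640 hours are illustrative, not taken from the slides.

```python
def failure_probability(lam: float, r1: float, r2: float) -> float:
    """Steady-state probability of the failure state without rejuvenation.

    Assumes the cycle S_0 -(r2)-> S_P -(lam)-> S_F -(r1)-> S_0. This model is
    an assumption; the transition diagram is not preserved in the transcript.
    """
    return 1.0 / (1.0 + r1 / lam + r1 / r2)


def downtime_and_cost(lam, r1, r2, c_f, interval_hours):
    """Expected downtime (hours) and cost over an interval of L hours."""
    p_f = failure_probability(lam, r1, r2)
    downtime = p_f * interval_hours   # Downtime(L) = P_f * L
    cost = downtime * c_f             # Cost(L) = P_f * L * c_f
    return downtime, cost


if __name__ == "__main__":
    # Hourly rates from Example 1; the 8640-hour interval is an assumed value.
    downtime, cost = downtime_and_cost(
        lam=1 / (12 * 30 * 24), r1=2, r2=1 / (7 * 24), c_f=1000, interval_hours=8640
    )
    print(f"downtime = {downtime:.3f} h, cost = ${cost:.2f}")
```

All rates are per hour, so the interval must also be given in hours for the units to agree.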

Downtime and cost with rejuvenation, over an interval L:
P_p, P_f, P_r, P_0 = [expressions missing from transcript]
Downtime_w_r(L) = (P_f + P_r) * L
Cost_w_r(L) = (P_f * c_f + P_r * c_r) * L
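The four state probabilities are also missing from the transcript. The sketch below assumes the usual four-state extension of the model: from the failure-probable state S_P the system either fails at rate λ or is rejuvenated at rate r_4, and it returns from the rejuvenation state S_R to S_0 at rate r_3. Balancing flows in and out of each state then gives P_p = 1/(1 + λ/r_1 + r_4/r_3 + (λ + r_4)/r_2), with the other probabilities as multiples of P_p. The class and function names are hypothetical.

```python
from typing import NamedTuple


class SteadyState(NamedTuple):
    p_0: float  # robust state S_0
    p_p: float  # failure-probable state S_P
    p_f: float  # failure state S_F
    p_r: float  # rejuvenation state S_R


def steady_state(lam, r1, r2, r3, r4) -> SteadyState:
    """Steady-state probabilities of the assumed four-state model (rates per hour)."""
    p_p = 1.0 / (1.0 + lam / r1 + r4 / r3 + (lam + r4) / r2)
    return SteadyState(
        p_0=(lam + r4) / r2 * p_p,  # balance at S_0: P_0 * r2 = P_p * (lam + r4)
        p_p=p_p,
        p_f=lam / r1 * p_p,         # balance at S_F: P_f * r1 = P_p * lam
        p_r=r4 / r3 * p_p,          # balance at S_R: P_r * r3 = P_p * r4
    )


def downtime_and_cost(state: SteadyState, c_f, c_r, interval_hours):
    """Downtime(L) = (P_f + P_r) * L and Cost(L) = (P_f*c_f + P_r*c_r) * L."""
    downtime = (state.p_f + state.p_r) * interval_hours
    cost = (state.p_f * c_f + state.p_r * c_r) * interval_hours
    return downtime, cost
```

Setting r4 to 0 makes P_r vanish and reduces P_f to the no-rejuvenation case, which is a quick sanity check on the assumed model.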

Thresholds - Goal: stay in S_0 for the longest possible amount of time.

Thresholds cont. To see how r_4 affects downtime and cost, let's differentiate the previous equations with respect to r_4.

Thresholds cont. Downtime: If r_3 is dominant, the derivative is negative and downtime decreases as r_4 increases, so rejuvenate as soon as the system enters state S_P. If r_3 is small (slow recovery from S_R), downtime increases as r_4 increases.
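The derivative itself is not preserved in the transcript. As a hedged reconstruction, assume the four-state model in which the system leaves S_P at rate λ (to failure) or r_4 (to rejuvenation) and returns to S_0 at rates r_1 and r_3 respectively. Differentiating the downtime formula under that assumption gives:

```latex
\begin{aligned}
D &= 1 + \frac{\lambda}{r_1} + \frac{r_4}{r_3} + \frac{\lambda + r_4}{r_2},
\qquad
\frac{\mathrm{Downtime}(L)}{L} = P_f + P_r = \frac{\lambda/r_1 + r_4/r_3}{D}, \\[4pt]
\frac{d}{dr_4}\,\mathrm{Downtime}(L)
  &= \frac{L}{D^2}\left[\frac{1}{r_3}\left(1 + \frac{\lambda}{r_2}\right) - \frac{\lambda}{r_1 r_2}\right]
   = \frac{L}{D^2}\cdot\frac{r_1\,(r_2 + \lambda) - \lambda\, r_3}{r_1 r_2 r_3}.
\end{aligned}
```

The sign does not depend on r_4: it is negative exactly when λ r_3 > r_1 (r_2 + λ), that is, when r_3 is large relative to r_1, which matches the statement that a dominant r_3 makes downtime fall as r_4 grows.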

Thresholds cont. Cost: [derivative expression missing from transcript] When c_r is dominant, cost increases as r_4 increases, which implies that rejuvenation brings no benefit. When c_r is small, cost decreases as r_4 increases.
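The cost derivative is likewise missing from the transcript. Under the same assumed four-state model (failure at rate λ and rejuvenation at rate r_4 out of S_P, recovery at rates r_1 and r_3), differentiating the cost formula gives:

```latex
\begin{aligned}
D &= 1 + \frac{\lambda}{r_1} + \frac{r_4}{r_3} + \frac{\lambda + r_4}{r_2},
\qquad
\frac{\mathrm{Cost}(L)}{L} = P_f\,c_f + P_r\,c_r = \frac{c_f\,\lambda/r_1 + c_r\,r_4/r_3}{D}, \\[4pt]
\frac{d}{dr_4}\,\mathrm{Cost}(L)
  &= \frac{L}{D^2}\left[\frac{c_r}{r_3}\left(1 + \frac{\lambda}{r_1} + \frac{\lambda}{r_2}\right)
     - c_f\,\frac{\lambda}{r_1}\left(\frac{1}{r_2} + \frac{1}{r_3}\right)\right].
\end{aligned}
```

The bracket contains no r_4, so cost is monotonic in r_4: a large c_r makes the first term win and cost rises with r_4, while a small c_r lets the c_f term win and cost falls.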

Thresholds cont. Overall, costs need to be calculated for each program individually. For best results, either rejuvenate as soon as the system reaches state S_P (r_4 = ∞) or do not rejuvenate at all (r_4 = 0).

Example 1: MTBF = 12 months, so λ = 1/(12*30*24) per hour. Recovery from an unexpected error takes 30 minutes, so r_1 = 2. Base longevity is seven days, so r_2 = 1/(7*24). If rejuvenation is performed, the mean repair time after rejuvenation is 20 minutes, so r_3 = 3. Average cost of unscheduled downtime due to failure, c_f, is $1,000/hour. Average cost of scheduled downtime during rejuvenation, c_r, is $40/hour.
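Every rate in the example is the reciprocal of a mean duration expressed in hours. The helpers below are a small illustrative sketch of that conversion; their names are hypothetical, and the 30-day month follows the arithmetic used on the slide.

```python
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30  # the slide's arithmetic uses 30-day months


def rate_from_minutes(minutes: float) -> float:
    """Hourly rate for an event whose mean duration is given in minutes."""
    return 60.0 / minutes


def rate_from_days(days: float) -> float:
    """Hourly rate for an event whose mean duration is given in days."""
    return 1.0 / (days * HOURS_PER_DAY)


def rate_from_months(months: float) -> float:
    """Hourly rate for an event whose mean duration is given in months."""
    return 1.0 / (months * DAYS_PER_MONTH * HOURS_PER_DAY)


if __name__ == "__main__":
    print("failure rate :", rate_from_months(12))   # MTBF of 12 months
    print("r1           :", rate_from_minutes(30))  # 30-minute recovery
    print("r2           :", rate_from_days(7))      # 7-day base longevity
    print("r3           :", rate_from_minutes(20))  # 20-minute rejuvenation repair
```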

Software Rejuvenation: table of hours of downtime and cost of downtime for three schedules: no rejuvenation (r_4 = 0), once every three weeks (r_4 = 1/(2*7*24)), and once every two weeks (r_4 = 1/(1*7*24)). [Table values missing from transcript]

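Software Rejuvenation: table of hours of downtime and cost of downtime for three schedules: no rejuvenation (r_4 = 0), once every month (r_4 = 1/(20*24)), and once every two weeks (r_4 = 1/(4*24)). Cost of downtime: 3.6k, 2.48k, and 1.11k respectively. [Hours-of-downtime values missing from transcript]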

Example 2: MTBF = 3 months, so λ = 1/(3*30*24) per hour. Recovery from an unexpected error takes 30 minutes, so r_1 = 2. Base longevity is three days, so r_2 = 1/(3*24). If rejuvenation is performed, the mean repair time after rejuvenation is 10 minutes, so r_3 = 6. Average cost of unscheduled downtime due to failure, c_f, is $5,000/hour. Average cost of scheduled downtime during rejuvenation, c_r, is $5/hour.

Software Rejuvenation: table of hours of downtime and cost of downtime for three schedules: no rejuvenation (r_4 = 0), once every three weeks (r_4 = 1/(11*24)), and once every two weeks (r_4 = 1/(4*24)). [Table values missing from transcript]

Example 3: MTBF = 3 months, so λ = 1/(3*30*24) per hour. Recovery from an unexpected error takes 2 hours, so r_1 = 0.5. Base longevity is 10 days, so r_2 = 1/(10*24). If rejuvenation is performed, the mean repair time after rejuvenation is 10 minutes, so r_3 = 6. Average cost of unscheduled downtime due to failure, c_f, is $5,000/hour. Average cost of scheduled downtime during rejuvenation, c_r, is $5/hour.
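The three examples can be checked against the threshold rule without fixing a value for r_4, because under the assumed four-state model (failure at rate λ and rejuvenation at rate r_4 out of S_P, recovery at rates r_1 and r_3) the sign of each derivative does not depend on r_4. The sketch below evaluates those sign-determining factors; the function names are hypothetical and the factors come from differentiating the downtime and cost formulas under that assumption.

```python
def downtime_slope(lam, r1, r2, r3):
    """Sign-determining factor of d(Downtime)/d(r4) in the assumed model.

    Negative means more frequent rejuvenation reduces downtime.
    """
    return r1 * (r2 + lam) - lam * r3


def cost_slope(lam, r1, r2, r3, c_f, c_r):
    """Sign-determining factor of d(Cost)/d(r4) in the assumed model."""
    return (c_r / r3) * (1 + lam / r1 + lam / r2) - c_f * (lam / r1) * (1 / r2 + 1 / r3)


# Hourly rates and hourly costs as stated in the three examples.
EXAMPLES = {
    "Example 1": dict(lam=1 / (12 * 30 * 24), r1=2, r2=1 / (7 * 24), r3=3, c_f=1000, c_r=40),
    "Example 2": dict(lam=1 / (3 * 30 * 24), r1=2, r2=1 / (3 * 24), r3=6, c_f=5000, c_r=5),
    "Example 3": dict(lam=1 / (3 * 30 * 24), r1=0.5, r2=1 / (10 * 24), r3=6, c_f=5000, c_r=5),
}


def verdict(params):
    """Return how (downtime, cost) move as r4 increases: 'rises' or 'falls'."""
    d = downtime_slope(params["lam"], params["r1"], params["r2"], params["r3"])
    c = cost_slope(**params)
    return ("falls" if d < 0 else "rises", "falls" if c < 0 else "rises")


if __name__ == "__main__":
    for name, params in EXAMPLES.items():
        downtime_trend, cost_trend = verdict(params)
        print(f"{name}: downtime {downtime_trend}, cost {cost_trend} as r4 increases")
```

Under these assumptions, rejuvenation raises both downtime and cost in Example 1, lowers cost at the price of more downtime in Example 2, and lowers both in Example 3.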

Implementation: Implementing software rejuvenation is fairly easy. Cron jobs can be set to restart the application at chosen intervals, and watchd can be used to detect failed applications and restart them.
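The two mechanisms, a timed restart and a restart on failure, can be combined in one small supervisor. The sketch below is an illustrative stand-in, not cron or watchd: the function name, its parameters, and the event labels are all hypothetical, and it only uses Python's standard subprocess module.

```python
import subprocess
import sys


def supervise(cmd, rejuvenate_after_s, max_cycles, grace_s=5.0):
    """Run cmd repeatedly, restarting it after a failure or a timed rejuvenation.

    Each cycle starts the process and waits up to rejuvenate_after_s seconds.
    If the process exits by itself, the next cycle restarts it. If it is still
    running when the interval ends, it is stopped gracefully and restarted.
    """
    events = []
    for _ in range(max_cycles):
        proc = subprocess.Popen(cmd)
        try:
            proc.wait(timeout=rejuvenate_after_s)
            events.append("exited")           # unexpected exit: restart on next cycle
        except subprocess.TimeoutExpired:
            proc.terminate()                  # graceful stop request (SIGTERM on POSIX)
            try:
                proc.wait(timeout=grace_s)
            except subprocess.TimeoutExpired:
                proc.kill()                   # forced stop if the process ignores the request
                proc.wait()
            events.append("rejuvenated")
    return events


if __name__ == "__main__":
    # Stand-in for a long-running server: a child process that just sleeps.
    long_running = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(supervise(long_running, rejuvenate_after_s=0.5, max_cycles=2))
```

A production setup would run indefinitely rather than for a fixed number of cycles, and would choose the interval from the downtime and cost analysis rather than hard-coding it.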

Real World Examples: BILL-DATS II Collector. Billing collection system used by the AT&T long-distance network. Set to rejuvenate after 1 week. Has not failed prematurely in several years.

“S” scientific speech synthesis system: a long-running scientific application used to process several hundred sentences over the course of many days. Found to fail after 100 sentences. Rejuvenates after every 15.
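Rejuvenating after a fixed amount of work, rather than a fixed amount of time, can be sketched by handing each batch of 15 sentences to a fresh worker process. Everything here is a hypothetical stand-in: the worker simply upper-cases its input in place of real speech synthesis, and the function name and batch mechanics are illustrative rather than a description of how “S” was actually rejuvenated.

```python
import subprocess
import sys

# Hypothetical stand-in for the long-running application: reads one sentence
# per line on stdin and writes one processed line per sentence.
WORKER_SOURCE = "import sys\nfor line in sys.stdin:\n    print(line.strip().upper())\n"


def process_with_rejuvenation(sentences, batch_size=15):
    """Process sentences in batches, using a fresh worker process per batch.

    Returns the processed sentences and the number of workers started.
    """
    results = []
    workers_started = 0
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        completed = subprocess.run(
            [sys.executable, "-c", WORKER_SOURCE],
            input="\n".join(batch) + "\n",
            capture_output=True,
            text=True,
            check=True,  # raise if the worker fails instead of silently continuing
        )
        workers_started += 1
        results.extend(completed.stdout.splitlines())
    return results, workers_started


if __name__ == "__main__":
    sentences = [f"sentence {i}" for i in range(40)]
    processed, workers = process_with_rejuvenation(sentences)
    print(f"{len(processed)} sentences processed by {workers} workers")
```

Because no worker ever handles more than 15 sentences, any state that degrades with the number of sentences processed is discarded well before the observed failure point of 100.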

Conclusions: The decision to use software rejuvenation depends on predetermined failure rates and associated costs. r_4 = 0: no rejuvenation. r_4 = ∞: rejuvenation.

Questions???