Practical Reports on Dependability

Manifestation of System Failure
– Site unavailability
– System exception / access violation
– Incorrect result
– Data loss / corruption
– Slowdown

PAGE UNAVAILABLE

System Exception

Performance Slowdown

DOWNTIME 15% contribution

DOWNTIME
– Unplanned: 20%
– Planned: 80%
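
To make the split concrete, the sketch below applies the 20% / 80% proportions from the slide to a yearly downtime total. The 40-hour total and the names `split_downtime` and `availability` are assumptions chosen for illustration; they do not come from the reports.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def split_downtime(total_hours, unplanned_share=0.20):
    """Split total downtime into (unplanned, planned) hours."""
    unplanned = total_hours * unplanned_share
    planned = total_hours - unplanned
    return unplanned, planned


def availability(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Fraction of the period during which the system was up."""
    return 1.0 - downtime_hours / period_hours


# Hypothetical total: 40 hours of downtime in one year.
unplanned, planned = split_downtime(40.0)
print(f"unplanned: {unplanned:.1f} h, planned: {planned:.1f} h")
print(f"availability, all downtime counted:  {availability(40.0):.4%}")
print(f"availability, unplanned only:        {availability(unplanned):.4%}")
```

Reporting both availability figures shows how much the result depends on whether planned maintenance windows are counted as downtime.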

DOWNTIME

UNPLANNED DOWNTIME

Software Errors
Triggers
– Resource exhaustion
– Logical errors
– System overload
– Recovery code
– Failed upgrade

Logical Error

SYSTEM OVERLOAD

Operator Errors
Triggers
– Configuration
  – Incorrect parameter setting
– Procedural
  – Omitted/incorrect maintenance action
– Miscellaneous

FAILURE
DURATION
– Short (minutes)
– Long (weeks)
  – Implies large fault chains
FREQUENCY
– Permanent (down until problem fixed)
– Transient (resolves without intervention)
– Intermittent (transient + occasional)
SCOPE
– Entire system
– Parts of the system
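
The three dimensions above (duration, frequency, scope) can be captured as a small record type for tagging incident reports. This is only a sketch: the class names, the `stays_down_until_fixed` helper, and the example incident are hypothetical and not taken from the reports.

```python
from dataclasses import dataclass
from enum import Enum


class Duration(Enum):
    SHORT = "minutes"
    LONG = "weeks"


class Frequency(Enum):
    PERMANENT = "down until problem fixed"
    TRANSIENT = "resolves without intervention"
    INTERMITTENT = "transient + occasional"


class Scope(Enum):
    ENTIRE_SYSTEM = "entire system"
    PARTIAL = "parts of the system"


@dataclass(frozen=True)
class FailureReport:
    description: str
    duration: Duration
    frequency: Frequency
    scope: Scope

    def stays_down_until_fixed(self) -> bool:
        # Per the slide's definition, only permanent failures
        # persist until the underlying problem is repaired.
        return self.frequency is Frequency.PERMANENT


# Hypothetical incident, for illustration only.
report = FailureReport(
    description="login page times out under load",
    duration=Duration.SHORT,
    frequency=Frequency.INTERMITTENT,
    scope=Scope.PARTIAL,
)
print(report.frequency.value, "|", report.stays_down_until_fixed())
```

Using enums rather than free-text fields keeps reports comparable, so failures can later be counted per category.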

Fault Chains
"The series of component failures that led up to a user-visible failure"
– Uncoupled
  – Independent failures
– Tightly coupled
  – Cascading/correlated failures
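
The difference between uncoupled and tightly coupled chains can be illustrated with basic probability: when two failures are independent, the chance that both occur is the product of their individual probabilities, while in a cascading chain the second failure is conditioned on the first. The sketch below uses hypothetical probabilities and function names; none of the numbers come from the reports.

```python
def joint_failure_uncoupled(p_a, p_b):
    """P(A and B both fail) when the two failures are independent."""
    return p_a * p_b


def joint_failure_coupled(p_a, p_b_given_a):
    """P(A and B both fail) when B's failure depends on A having failed."""
    return p_a * p_b_given_a


# Hypothetical values: each component fails 1% of the time on its own,
# but B fails half of the time once A has already failed.
p_a = 0.01
p_b = 0.01
p_b_given_a = 0.5

uncoupled = joint_failure_uncoupled(p_a, p_b)
coupled = joint_failure_coupled(p_a, p_b_given_a)
print(f"uncoupled chain: {uncoupled:.6f}")
print(f"coupled chain:   {coupled:.6f}")
print(f"coupling makes the double failure {coupled / uncoupled:.0f}x more likely")
```

Under these assumed numbers, redundancy that looks safe when failures are treated as independent is far less effective once the failures are correlated.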

Non-Malicious Software Failure
Most Common Causes
– Routine maintenance
– Software upgrade
– System integration
Other Causes
– System overload
– Resource exhaustion
– Complex fault-tolerant routines

"ROUTINE" MAINTENANCE
Danske Bank 2003
– March 11: routine operation to replace a defective electrical unit in the IBM DB2 disk system
– System failure: disks become inaccessible
– 6 hours later: system restarted
– March 12: batch systems running incorrectly
– Three more errors discovered:
  1. Recovery process on several tables won't start
  2. Recovery jobs won't run simultaneously
  3. Recovery jobs can't re-establish data in tables
– March 14: all data recovered and system functional