Reliability and Dependability in Computer Networks. CS 552 Computer Networks. Slide credits: A. Tjang, W. Sanders.

Outline Overall dependability definitions and concepts Measuring Site dependability Stochastic Models Overall Network dependability

Fundamental Concepts Dependable systems must define: –What is the service? Observed behavior by users –Who/what is the user? –What is the service interface? How do users view the system? –What is the function? (intended use)

Concepts (2) Failure: –incorrect service to users Outage: – time interval of incorrect service Error: –system state causing failure Fault: –Cause of error –Active (produces error) and latent

Causal Chain Fault –(activation)→ Error –(propagation)→ Failure –(causation)→ Fault

Dependability Definitions

Attributes Availability –% of time delivering correct service Reliability –Expected time until incorrect service Safety –Absence of catastrophic consequences Confidentiality –Absence of unauthorized disclosure

Attributes (2) Integrity –Absence of improper states Maintainability –Ability to undergo repairs Security –Availability to authorized users –Confidentiality –Integrity

Means to Dependable Systems Fault prevention Fault tolerance Fault removal Fault forecasting

MTTF and MTTR Mean Time To Failure (MTTF) –Average time to a failure Mean Time To Repair (MTTR) –Average time under repair Availability = % time correct = (time of correct service) / (total time) –Simple model: Availability = MTTF/(MTTF+MTTR)
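
The simple model above can be sketched in a few lines of Python. The function name and the sample numbers are illustrative assumptions, not values from the slides.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability under the simple model MTTF / (MTTF + MTTR)."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical system: fails on average every 999 hours, takes 1 hour to repair.
example = availability(999.0, 1.0)
print(f"availability = {example:.3%}")
```

Note that only the ratio matters: halving MTTR improves availability exactly as much as doubling MTTF.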

Availability classes and nines
System Type               Unavailability (min/year)   Availability (%)   Class (9's)
Unmanaged                 52,560                      90                 1
Managed                   5,256                       99                 2
Well-Managed              526                         99.9               3
Fault-tolerant            53                          99.99              4
High-availability         5                           99.999             5
Very-highly available     0.5                         99.9999            6
Ultra-highly available    0.05                        99.99999           7
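
The relationship between the class (number of nines) and the yearly downtime is plain arithmetic. The sketch below assumes a 365-day year (525,600 minutes); the function names are chosen here for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, assuming a 365-day year


def availability_for_class(nines: int) -> float:
    """Availability for a class with the given number of nines (class 3 means 99.9%)."""
    return 1.0 - 10.0 ** (-nines)


def downtime_minutes_per_year(nines: int) -> float:
    """Expected unavailable minutes per year for a class."""
    return MINUTES_PER_YEAR * 10.0 ** (-nines)


for nines in range(1, 8):
    print(nines,
          f"{availability_for_class(nines) * 100:.5f}%",
          f"{downtime_minutes_per_year(nines):.2f} min/year")
```

Each additional nine divides the allowed downtime by ten.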

Latency What about “how long” to perform the service? Strict bound: –Correct service within time T counts; a response slower than T is treated as a fault Statistical bound: –N% of requests complete within time T
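
A minimal sketch of the two kinds of bound, assuming latencies are recorded in seconds. The function names and sample data are hypothetical.

```python
def meets_strict_bound(latencies, t):
    """Strict bound: every request must complete within t."""
    return all(x <= t for x in latencies)


def fraction_within(latencies, t):
    """Fraction of requests that complete within t."""
    return sum(1 for x in latencies if x <= t) / len(latencies)


def meets_statistical_bound(latencies, t, n_percent):
    """Statistical bound: at least n_percent of requests complete within t."""
    return 100.0 * fraction_within(latencies, t) >= n_percent


samples = [0.1, 0.2, 0.3, 2.0]  # made-up latencies in seconds
print(meets_strict_bound(samples, 1.0))             # one slow request breaks it
print(meets_statistical_bound(samples, 1.0, 75.0))  # 3 of 4 are fast enough
```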

Volume Correctness depends on quality of the result These systems tend to perform actions on large data sets: Search engines Auctions/pricing For a given request, parameterize correctness by % of answers returned as if we used the entire data set

Measuring End-User Availability on the Web: Practical Experience Matthew Merzbacher Dan Patterson UC Berkeley Presented by Andrew Tjang

Introduction Availability, performance, QoS important in Web Svcs. End user experience -> meaningful benchmark Long term experiments attempted to duplicate end user experience Find out what the main causes are for downtime as seen by end user.

Driving forces Availability/uptime in “9”s not accurate –Optimal conditions, not real-world Actual uptime to end users includes many factors –Network, multiple sw layers, client sw/hw Need a meaningful measure of availability rather than one number characterizing an unrealistic operating environment

The Experiment Mills College/UC Berkeley devised experiment over several months Made hourly contact on a list of several prominent/not-so-prominent sites Characterized availability using measures of success, speed, size Attempted to pinpoint area of failures

Experiment (cont’d) Coded in Java Tested local machines as well (to determine baseline and determine local problems) Random minutes each hour Results from 3 types of sites –Retailer –Search engine –Directory service
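
The "random minutes each hour" schedule can be sketched as follows. The original experiment was coded in Java; this Python version, including the function name and the seeded generator, is only an illustration of the idea and not the authors' code.

```python
import random
from datetime import datetime, timedelta


def probe_times(start: datetime, hours: int, rng: random.Random) -> list:
    """Return one probe time per hour, each at a random minute within that hour."""
    base = start.replace(minute=0, second=0, microsecond=0)
    return [base + timedelta(hours=h, minutes=rng.randrange(60))
            for h in range(hours)]


# Hypothetical one-day schedule; the seed only makes the example repeatable.
schedule = probe_times(datetime(2024, 1, 1), 24, random.Random(0))
for when in schedule[:3]:
    print(when.isoformat())
```

Randomizing the minute avoids always sampling a site at the same point in the hour, where periodic jobs could bias the result.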

Results Availability broken up into sections –Raw, local, network, transient Kinds of errors broken up into –Local, Severe network, Corporate, Medium Network, Server Was response upon success partial? How long?

Different Tiers of Availability (table)
Columns: All, Retailer, Search, Directory
Rows: Raw (Overall); Ignoring local problems; Ignoring local and network problems; Ignoring local, network, and transient problems

Types of Errors Local (82%) Network: Medium (11%) Severe (4%) Server (2%) Corporate (1%)

Local Problems Most common problem Caused by –System crashes, sysadmin problems, config problems, attacks, power outages, etc… All had component of human error, but no clear way to solve via preventative measures “Local availability dominates the end-user experience”

Lost Data and Corporate Failure Just because a response was received doesn’t mean the service was available Experiment treated pages that appeared to be of a drastically different size (smaller) as unavailable (i.e. 404) If international versions failed -> corporate failure
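
One way to express the size check is shown below. The 50% cutoff (`min_ratio`) and the rule that only HTTP 200 counts as success are assumptions made for this sketch; the slides do not state the thresholds that were used.

```python
def response_available(status_code: int, body_bytes: int,
                       typical_bytes: int, min_ratio: float = 0.5) -> bool:
    """Treat a response as available only if it succeeded and is not drastically small.

    min_ratio is a hypothetical cutoff: bodies smaller than
    min_ratio * typical_bytes are counted as unavailable.
    """
    if status_code != 200:  # assumption: anything but 200 is a failure
        return False
    return body_bytes >= min_ratio * typical_bytes


print(response_available(200, 40_000, 42_000))  # normal page
print(response_available(200, 900, 42_000))     # tiny error page served with 200
print(response_available(404, 42_000, 42_000))  # explicit error status
```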

Response time Wanted to define what “too slow” is –Chart availability vs. time –Asymptotic towards availability of 1 –Choose a threshold; all response times above it are considered unavailable –Client errors most frequent type of error, then transient network
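
The availability-versus-time chart and the threshold choice can be sketched like this. The response times are made up, and picking the smallest threshold that reaches a target fraction is one possible selection rule, not necessarily the one the authors used.

```python
import math


def availability_curve(times, thresholds):
    """For each threshold, the fraction of responses that arrived within it."""
    return [(t, sum(1 for x in times if x <= t) / len(times)) for t in thresholds]


def smallest_threshold(times, target_fraction):
    """Smallest observed response time that covers target_fraction of responses."""
    ordered = sorted(times)
    k = max(1, math.ceil(target_fraction * len(ordered)))
    return ordered[k - 1]


times = [0.3, 0.4, 0.7, 0.9, 1.2, 1.8, 2.5, 6.0]  # hypothetical seconds
for threshold, fraction in availability_curve(times, [1, 2, 5, 10]):
    print(threshold, fraction)
print(smallest_threshold(times, 0.75))
```

The curve climbs toward 1 as the threshold grows, which matches the asymptotic shape described on the slide.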

How long should we wait?

Retrying To users, unavailability leads to retry at least once How effective is a retry? –Need to test for persistence of failures –Consistent failures indicate server Persistent, non-local failures –Domain dependent

Retrying (cont’d) Retry period of 1 hour unrealistic As in brick & mortar, clients have choice #retries, time btwn retries, etc based on domain/user dependent factors –Uniqueness, import, loyalty, transience
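
A rough way to measure how effective a retry is from a log of periodic probes: treat the next probe after a failure as the retry and count how often it succeeds. The function name and the sample log are hypothetical.

```python
def retry_success_fraction(outcomes):
    """Fraction of failed probes whose next probe succeeded.

    outcomes is a chronological list of booleans (True = success).
    Returns None when no failure is followed by another probe.
    """
    failures_with_retry = 0
    recovered = 0
    for current, following in zip(outcomes, outcomes[1:]):
        if not current:
            failures_with_retry += 1
            if following:
                recovered += 1
    if failures_with_retry == 0:
        return None
    return recovered / failures_with_retry


log = [True, False, True, False, False, True]  # made-up probe results
print(retry_success_fraction(log))
```

With hourly probes the implied retry period is one hour, which the slide above flags as unrealistic; a shorter probe interval would give a more user-like estimate.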

Effect of retry (table)
Columns: Error Type, All, Retailer, Search, Directory
Rows: Client; Medium Network; Severe Network; Server; Corporate (n/a)
Color key: Green > 80%, Red < 50%

Conclusion Successful in modeling user experience 93% Raw, 99.9% removing local/short-term errors Retry produced better availability, reduced error 27% in local, 83% non-local Factoring in retries produces 3 “9s” of availability. Retry doesn’t help for local errors –User may be aware of the problem and therefore less frustrated by it

Future Work Continue experiment, refine availability stats Distribute experiment across distant sites to analyze source of errors Better experiments to determine better the effects of retry With the above, we can pinpoint source of failures and make more reliable systems.

Stochastic Analysis of Computer Networks Capture probabilistic behavior as a function of time Formalisms: –Markov Chains Discrete Continuous –Petri Nets

Markov Chains States Transitions Transition probabilities Evolution over time Compute: –Average time spent in a state Fraction of time –Expected time to reach a state –“Reward” for each state

Discrete Markov chains States of the system Time modeled in discrete, uniform steps –(e.g. every minute) Each state has a set of transition probabilities to other states for each time step –Sum of probabilities == 1

Discrete Markov Chain Graphical representation Probability transition matrix representation
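
A small sketch of the transition-matrix representation, using a hypothetical two-state chain (state 0 = up, state 1 = down) with made-up probabilities. Repeatedly applying the matrix to a distribution gives the long-run fraction of time spent in each state.

```python
def step(dist, P):
    """Advance a state distribution by one time step: new[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def stationary(P, iterations=500):
    """Approximate the stationary distribution by repeated stepping."""
    dist = [1.0 / len(P)] * len(P)
    for _ in range(iterations):
        dist = step(dist, P)
    return dist


# Hypothetical per-minute probabilities: row i holds the transitions out of state i.
P = [[0.9, 0.1],   # up   -> up, down
     [0.5, 0.5]]   # down -> up, down
assert all(abs(sum(row) - 1.0) < 1e-12 for row in P)  # each row sums to 1

after_one = step([1.0, 0.0], P)  # start in "up", take one step
pi = stationary(P)
print(after_one, pi)
```

For this matrix the chain spends 5/6 of the time up and 1/6 down in the long run, which is the steady-state availability of the modeled component.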

Continuous Time Markov Chains States of the system (as before) Transitions are rates with exponential distributions –Some event arrives at rate λ with an exponential interarrival time

Continuous Time Markov Chain Rows sum to zero for steady state behavior
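
As an illustration, consider the simplest CTMC: one component that fails at rate `lam` and is repaired at rate `mu` (both names are chosen here). Each row of its rate (generator) matrix sums to zero, and the steady-state availability is mu / (lam + mu), which equals MTTF / (MTTF + MTTR) when lam = 1/MTTF and mu = 1/MTTR.

```python
import math


def generator(lam, mu):
    """Rate matrix for states [up, down]; the diagonal makes each row sum to zero."""
    return [[-lam, lam],
            [mu, -mu]]


def steady_state_availability(lam, mu):
    """Long-run probability of being in the up state."""
    return mu / (lam + mu)


def availability_at(t, lam, mu):
    """Probability of being up at time t, given the component starts up at time 0."""
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)


# Hypothetical rates: MTTF = 1000 hours, MTTR = 10 hours.
lam, mu = 1.0 / 1000.0, 1.0 / 10.0
Q = generator(lam, mu)
print([sum(row) for row in Q])
print(availability_at(0.0, lam, mu), availability_at(5.0, lam, mu),
      steady_state_availability(lam, mu))
```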

Answers from the CTMC model Availability at time t? Steady state availability? Expected time to failure? Expected number of jobs lost due to failure over [0,t]? Expected number of jobs served before failure? Expected throughput given failures and repairs
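
To show how one of these questions is answered, here is a sketch for "expected time to failure" in a hypothetical system that is not from the slides: two identical components, each failing at rate `lam`, one repaired at a time at rate `mu`, with the system failing only when both are down. Writing T2 and T1 for the expected time to failure starting with two and one working components gives the first-step equations in the comments.

```python
def mttf_duplex(lam, mu):
    """Expected time until both components are down, starting with both up.

    First-step equations:
        T2 = 1 / (2 * lam) + T1
        T1 = 1 / (lam + mu) + (mu / (lam + mu)) * T2
    Substituting the first into the second and solving for T1 gives
        T1 = (2 * lam + mu) / (2 * lam ** 2)
    """
    t1 = (2 * lam + mu) / (2 * lam ** 2)
    t2 = 1 / (2 * lam) + t1
    return t2


# Hypothetical rates per hour.
print(mttf_duplex(0.01, 1.0))  # with repair
print(mttf_duplex(0.01, 0.0))  # without repair
```

Comparing the two printed values shows how strongly repair extends the time to system failure when repair is much faster than failure.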

Petri Nets Higher level formalism Components: –Places –Transitions –Arcs –Weights –Markings
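
The components listed above map onto a small data structure. This sketch models an untimed net with a hypothetical fail/repair example; the class and function names are chosen here and do not come from any Petri net library.

```python
from dataclasses import dataclass, field


@dataclass
class Transition:
    name: str
    inputs: dict = field(default_factory=dict)   # place -> arc weight consumed
    outputs: dict = field(default_factory=dict)  # place -> arc weight produced


def enabled(marking, transition):
    """A transition is enabled if every input place holds at least the arc weight."""
    return all(marking.get(p, 0) >= w for p, w in transition.inputs.items())


def fire(marking, transition):
    """Return the new marking after firing; the old marking is left unchanged."""
    if not enabled(marking, transition):
        raise ValueError(f"transition {transition.name} is not enabled")
    new_marking = dict(marking)
    for place, weight in transition.inputs.items():
        new_marking[place] -= weight
    for place, weight in transition.outputs.items():
        new_marking[place] = new_marking.get(place, 0) + weight
    return new_marking


# Places "up" and "down"; one token marks the component's current condition.
fail = Transition("fail", inputs={"up": 1}, outputs={"down": 1})
repair = Transition("repair", inputs={"down": 1}, outputs={"up": 1})

marking = {"up": 1, "down": 0}
after_fail = fire(marking, fail)
print(after_fail, enabled(after_fail, repair), enabled(after_fail, fail))
```

Attaching an exponential rate to each transition would turn this into a stochastic Petri net whose reachable markings form the states of a CTMC.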

Petri Nets

General Stochastic PN Either exponential or instant departures

ATM Example