1
Lecture 4: State-Based Methods CS 7040 Trustworthy System Design, Implementation, and Analysis Spring 2015, Dr. Rozier Adapted from slides by WHS at UIUC
2
Introduction
3
Availability: A Motivation for State-Based Methods Recall that availability quantifies the alternation between proper and improper service.
– A(t) is 1 if service is proper, 0 otherwise.
– E[A(t)] is the probability that the service is proper at time t.
– A(0,t) is the fraction of time the system delivers proper service during [0,t].
4
Availability For many systems, availability is a more “user-oriented” measure than reliability. – It is often more difficult to compute, since it must account for repair and/or replacement.
5
Availability: Example A radio transmitter has an expected time to failure of 500 days. Replacement takes an average of 48 hours. A cheaper, and less reliable, transmitter has an expected time to failure of 150 days. Due to cost, a replacement is readily available and replacement takes 8 hours on average.
6
Availability: Example A radio transmitter has an expected time to failure of 500 days. Replacement takes an average of 48 hours. A cheaper, and less reliable, transmitter has an expected time to failure of 150 days. Due to cost, a replacement is readily available and replacement takes 8 hours on average. Reliability:
– for the first transmitter, MTTF = 500 days
– for the second transmitter, MTTF = 150 days
The first transmitter is 3.33x more reliable.
7
Availability: Example A radio transmitter has an expected time to failure of 500 days. Replacement takes an average of 48 hours. A cheaper, and less reliable, transmitter has an expected time to failure of 150 days. Due to cost, a replacement is readily available and replacement takes 8 hours on average. With A = MTTF / (MTTF + MTTR): A = 500 / (500 + 2) ≈ 0.996 for the more reliable transmitter, and A = 150 / (150 + 1/3) ≈ 0.998 for the less reliable transmitter.
8
Availability: Example A radio transmitter has an expected time to failure of 500 days. Replacement takes an average of 48 hours. A cheaper, and less reliable, transmitter has an expected time to failure of 150 days. Due to cost, a replacement is readily available and replacement takes 8 hours on average. With A = MTTF / (MTTF + MTTR): A = 500 / (500 + 2) ≈ 0.996 for the more reliable transmitter, and A = 150 / (150 + 1/3) ≈ 0.998 for the less reliable transmitter. Higher reliability doesn’t necessarily mean higher availability!
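A quick numeric sketch of that comparison, using the steady-state formula A = MTTF / (MTTF + MTTR) (the On/Off result of the following slides):

```python
def availability(mttf_days: float, mttr_days: float) -> float:
    """Steady-state availability: fraction of time the unit is up."""
    return mttf_days / (mttf_days + mttr_days)

# Transmitter 1: MTTF = 500 days, replacement averages 48 hours = 2 days.
a1 = availability(500, 48 / 24)
# Transmitter 2: MTTF = 150 days, replacement averages 8 hours = 1/3 day.
a2 = availability(150, 8 / 24)

print(f"A1 = {a1:.4f}, A2 = {a2:.4f}")  # A1 ~ 0.9960, A2 ~ 0.9978
# The cheaper, less reliable transmitter has the higher availability.
```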
10
Availability Modeling using Combinatorial Methods Availability modeling can be done with combinatorial methods, but only under independent repair assumptions and exponential lifetime distributions. This uses the theory of On/Off processes. The “On”-time distribution is the reliability distribution, obtained using combinatorial methods; we denote its mean E[On]. The “Off”-time distribution is the repair-time distribution, with mean E[Off].
11
Availability Modeling using Combinatorial Methods Availability is the fraction of the time in the On state. Asymptotically, if instances of On periods are independent and identically distributed (i.i.d.) and instances of Off periods are i.i.d., then: A = E[On] / (E[On] + E[Off]).
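A minimal simulation sketch of that limit, assuming (purely for illustration) exponentially distributed On and Off periods; only the means E[On] and E[Off] matter for the asymptotic fraction:

```python
import random

def simulated_availability(e_on, e_off, cycles=100_000, seed=42):
    """Alternate i.i.d. On and Off periods and return the fraction of
    total elapsed time spent in the On state."""
    rng = random.Random(seed)
    on_total = sum(rng.expovariate(1 / e_on) for _ in range(cycles))
    off_total = sum(rng.expovariate(1 / e_off) for _ in range(cycles))
    return on_total / (on_total + off_total)

e_on, e_off = 500.0, 2.0                    # days, as in the transmitter example
print(simulated_availability(e_on, e_off))  # close to...
print(e_on / (e_on + e_off))                # ...the formula, ~0.996
```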
12
Availability Modeling using Combinatorial Methods Lots of assumptions that may not hold!
13
State-Based Methods More accurate modeling can be achieved with state-based methods.
– State-based methods relax the independence assumptions needed for combinatorial modeling!
– Failures need not be independent. Failure of one component may make another component more or less likely to fail! (See the sketch after this list.)
– Repairs need not be independent. Repair and replacement strategies are an important component that must be modeled in high-availability systems.
– High availability systems may operate in a degraded mode. In a degraded mode, the system may deliver only a fraction of its services. The repair process may start only when a system is sufficiently degraded.
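As one hedged illustration of how “state” captures dependence (the probabilities below are invented, not from the slides), consider a two-component system in which the surviving component is more likely to fail once its partner is down. A combinatorial model with independent components cannot express the changed failure probability, but a three-state chain can:

```python
# States: 2 = both components up, 1 = one up (survivor carries extra load),
#         0 = both down.  Per-time-step probabilities are hypothetical.
P_FAIL_NORMAL = 0.01    # failure prob. per step while both are up
P_FAIL_LOADED = 0.05    # failure prob. per step for the lone survivor
                        # (dependence: failure of one stresses the other)

# One-step transition probabilities between the three states.
P = {
    2: {2: (1 - P_FAIL_NORMAL) ** 2,
        1: 2 * P_FAIL_NORMAL * (1 - P_FAIL_NORMAL),
        0: P_FAIL_NORMAL ** 2},
    1: {1: 1 - P_FAIL_LOADED, 0: P_FAIL_LOADED},
    0: {0: 1.0},                      # no repair in this small sketch
}

for state, row in P.items():
    assert abs(sum(row.values()) - 1.0) < 1e-12   # each row is a distribution
    print(state, row)
```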
14
Repairs and Independence
17
NCSA’s Blue Waters 288 cabinets 26.4 PB of usable storage
18
NCSA’s Blue Waters 288 cabinets 26.4 PB of usable storage Let’s say the MTTF of a disk is 3 years.
19
NCSA’s Blue Waters 26.4 PB of usable storage Let’s say the MTTF of a disk is 3 years. 2TB per disk
20
NCSA’s Blue Waters 26.4 PB of usable storage Let’s say the MTTF of a disk is 3 years. 2TB per disk > 27,000 TB of usable storage ~ 34,000 TB of actual storage (RAID)
21
NCSA’s Blue Waters 26.4 PB of usable storage Let’s say the MTTF of a disk is 3 years. 2TB per disk > 27,000 TB of usable storage ~ 34,000 TB of actual storage (RAID) … 17,000 disks
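The arithmetic behind those round numbers, assuming “PB” is meant in the binary sense (1 PB = 1024 TB) and the 8+2 RAID layout mentioned a few slides later; with decimal units the totals come out slightly lower but the headline figures are the same:

```python
usable_tb = 26.4 * 1024          # 26.4 PB of usable storage, in TB
raw_tb = usable_tb * 10 / 8      # 8+2 RAID: 10 drives hold 8 drives' worth of data
disks = raw_tb / 2               # 2 TB per disk

print(f"usable ~ {usable_tb:,.0f} TB")   # > 27,000 TB
print(f"raw    ~ {raw_tb:,.0f} TB")      # ~ 34,000 TB
print(f"disks  ~ {disks:,.0f}")          # ~ 17,000 disks
```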
24
NCSA’s Blue Waters 26.4 PB of usable storage Let’s say the MTTF of a disk is 3 years. 2TB per disk > 27,000 TB of usable storage ~ 34,000 TB of actual storage (RAID) … 17,000 disks Rate of failure for ANY disk: with independent disks the rates add, so roughly 17,000 / (3 × 365 days) ≈ 15.5 failures per day.
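A back-of-the-envelope sketch of that aggregate rate, assuming disk lifetimes are independent and exponential with the stated 3-year MTTF (so the individual failure rates simply add):

```python
n_disks = 17_000
mttf_days = 3 * 365                     # per-disk MTTF of 3 years, in days

failures_per_day = n_disks / mttf_days  # rates of independent exponentials add
print(f"{failures_per_day:.1f} expected disk failures per day")           # ~15.5
print(f"roughly one failure every {24 * 60 / failures_per_day:.0f} minutes")
```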
25
NCSA’s Blue Waters We are losing 15+ drives every day. But loss of a drive isn’t necessarily loss of data, right?
– 17,000 drives in 8+2 means 1,700 RAID groups.
– Chance of RAID failure on day 1 is only 6%.
– 24% on day 2.
Rather than replace every drive immediately when it fails, let’s replace many drives all at once.
26
State-Based Methods We use random processes to model these systems. We use “state” to “remember” the conditions leading to dependencies.
27
Random Processes A random process is a collection of random variables indexed by time. – Useful for characterizing the behavior of real systems. X(t) is a random process. Let X(1) be the result of tossing a die. Let X(2) be the result of tossing a die plus X(1), and so on. Notice that time (T) = {1, 2, 3, …}
28
Random Processes X(t) is a random process. Let X(1) be the result of tossing a die. Let X(2) be the result of tossing a die plus X(1), and so on. Notice that time (T) = {1, 2, 3, …}
P[X(2) = 12] = 1/36
P[X(3) = 14 | X(1) = 2] = 1/36
E[X(n)] = 3.5n
Etc.
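A short simulation sketch that checks these numbers (the function and variable names are just illustrative):

```python
import random
from statistics import mean

def sample_path(t_max, rng):
    """One sample path of X: X(t) is the running sum of the first t rolls."""
    total, path = 0, []
    for _ in range(t_max):
        total += rng.randint(1, 6)
        path.append(total)
    return path

rng = random.Random(7)
paths = [sample_path(3, rng) for _ in range(200_000)]

print(mean(p[1] == 12 for p in paths))   # P[X(2) = 12], ~1/36 ~ 0.028
print(mean(p[2] for p in paths))         # E[X(3)] = 3.5 * 3 = 10.5
```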
29
Random Processes If X is a random process, X(t) is a random variable. A random variable Y is a function that maps the sample space Ω to the reals, Y : Ω → ℝ. A random process X maps elements of the two-dimensional space Ω × T to elements of ℝ, X : Ω × T → ℝ.
30
Random Processes A sample path of X is the history of sample-space values X adopts as a function of time. When we fix t, X(·, t) becomes a function of ω, i.e., a random variable. When we fix ω, X(ω, ·) becomes a function of t, i.e., a sample path. By fixing ω (e.g., the system is available) and observing X as a function of t, we see one trajectory of the process.
31
Describing a Random Process Recall: for a random variable X, we can use the cumulative distribution to describe the random variable. In general, no such simple description exists for a random process.
32
Describing a Random Process A random process can often be described succinctly in various ways. For example, say Y is a random variable representing the roll of a die, and X(t) is the sum after t rolls. We can describe X(t) as:
X(t) – X(t-1) = Y,
P[X(t) = i | X(t-1) = j] = P[Y = i – j],
or X(t) = Y_1 + Y_2 + … + Y_t, where each Y_k is an independent copy of Y.
33
Classifying a Random Process: Characteristics of T If the number of time points at which a random process is defined, i.e. |T|, is finite or countable, then the random process is said to be a discrete-time random process. If |T| is uncountable, then the random process is said to be a continuous-time random process. Let X(t) be the number of fault arrivals in a system up to time t. What type of process is X(t)?
34
Classifying a Random Process: State Space Type Let X be a random process. The state space of a random process is the set of all possible values that the process can take on, i.e. S = {s : X(t) = s for some t ∈ T}. If X is a random process that models a system, then the state space of X can represent the set of all possible configurations that the system could be in.
35
Classifying a Random Process: State Space Type If the state space, S, of a random process, X, is finite or countable, then X is said to be a discrete-state random process. If the state space S of a random process X is infinite and uncountable then X is said to be a continuous-state random process.
36
Classifying a Random Process: State Space Type Let X be a random process that represents the number of bad packets received over a network. – Classify the state space.
37
Classifying a Random Process: State Space Type Let X be a random process that represents the voltage on a telephone line. – Classify the state space.
38
Classifying a Random Process: State Space Type We will concern ourselves primarily with discrete-state processes.
39
Random Process Classification Examples
40
Markov Process A special type of random process that we will examine is called a Markov process. A Markov process can be defined, informally, as follows: Given the state of a Markov process X at time t, the future behavior of X can be described completely in terms of X(t). Markov processes have the very useful property that their future behavior is independent of past values, given the current state.
41
Markov Chains A Markov chain is a Markov process with a discrete state space. We will always make the assumption that a Markov chain has a state space in {1, 2, …} and that it is time-homogeneous. A Markov chain is time-homogeneous if its future behavior does not depend on what time it is, only on the current state. We formalize this property by looking at a discrete-time Markov chain (DTMC). A DTMC X has the following property: P[X(n+1) = j | X(n) = i, X(n-1) = i_{n-1}, …, X(0) = i_0] = P[X(n+1) = j | X(n) = i] for all n and all states.
42
DTMCs Given i, j, and k, p_ij^(k) = P[X(n+k) = j | X(n) = i] is a number. It can be interpreted as the probability that, if X has value i, then after k time-steps X will have value j. Frequently we write p_ij to mean p_ij^(1).
43
DTMC Suppose we have a processor that can be either working or failed at any time step t.
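A minimal sketch of that two-state chain; the per-step failure and repair probabilities below are hypothetical, not taken from the slides:

```python
# States: 0 = working, 1 = failed.  Hypothetical per-step probabilities.
p_fail = 0.01     # P[failed at t+1 | working at t]
p_repair = 0.25   # P[working at t+1 | failed at t]

# One-step transition probability matrix; each row sums to 1.
P = [
    [1 - p_fail, p_fail],       # from working
    [p_repair, 1 - p_repair],   # from failed
]

for row in P:
    assert abs(sum(row) - 1.0) < 1e-12
print(P)
```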
44
DTMC Suppose we have two processors, each of which can be working or failed, independently, at any time step t.
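A sketch of the corresponding four-state chain, built from two independent copies of the hypothetical single-processor chain above (independence means each joint transition probability is a product of per-processor probabilities):

```python
# Per-processor chain: 0 = working, 1 = failed (same hypothetical numbers).
p_fail, p_repair = 0.01, 0.25
P1 = [[1 - p_fail, p_fail],
      [p_repair, 1 - p_repair]]

# Joint states (a, b) for processors A and B: WW, WF, FW, FF.
states = [(a, b) for a in (0, 1) for b in (0, 1)]

# Independence => joint transition probability is the product of the marginals.
P2 = [[P1[a][a2] * P1[b][b2] for (a2, b2) in states] for (a, b) in states]

for s, row in zip(states, P2):
    print(s, [round(x, 4) for x in row])   # every row sums to 1
```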
45
State Occupancy Probability Vector Let π(k) be a row vector, and denote by π_i(k) the i-th element of the vector. If π(k) is a state occupancy probability vector, then π_i(k) is the probability that the DTMC is in state i at time-step k. Assume that a DTMC X has a state space of size n, i.e. S = {1, 2, …, n}. We say formally: π_i(k) = P[X(k) = i]. Note that π_1(k) + π_2(k) + … + π_n(k) = 1.
46
State Occupancy Probability Vector Computing a single step forward in time: given the initial probability vector π(0) (where the chain is likely to be at t = 0) and the transition probabilities p_ij, how do we compute π(1)? Recall the definition of p_ij = P[X(n+1) = j | X(n) = i].
47
State Occupancy Probability Vector Since P[X(1) = j] = Σ_i P[X(1) = j | X(0) = i] · P[X(0) = i] by the law of total probability, we get π_j(1) = Σ_i π_i(0) p_ij.
48
State Occupancy Probability Vector We have π_j(1) = Σ_i π_i(0) p_ij, which holds for all j. This resembles vector-matrix multiplication. If we arrange the probabilities p_ij into the matrix P = [p_ij], then π(1) = π(0) P.
49
State Occupancy Probability Vector The important consequence of this is that we can easily specify a DTMC in terms of an initial occupancy probability vector π(0) and a transition probability matrix P.
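A sketch of that update rule, π(k+1) = π(k) P, in plain Python; the two-state chain and its numbers are the hypothetical working/failed processor from earlier, not values from the slides:

```python
def step(pi, P):
    """One application of pi(k+1) = pi(k) P (row vector times matrix)."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

# Hypothetical two-state chain: 0 = working, 1 = failed.
P = [[0.99, 0.01],
     [0.25, 0.75]]

pi = [1.0, 0.0]            # pi(0): start in the working state
for k in range(1, 6):
    pi = step(pi, P)
    print(f"pi({k}) =", [round(x, 4) for x in pi])
```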
50
Transient Behavior of DTMCs
51
A Simple Example Suppose the weather in Cincinnati can be modeled the following way:
52
Simple Example, cont.
53
Simple Example, Solution
54
Solution, cont.
55
Graphical Representation
56
Simple Computer Example
57
Limiting Behavior of DTMCs
58
For next time Apply for a Mobius account. Use your uc.edu e-mail address and specifically mention this class and instructor.