1
Probability and Statistics with Reliability, Queuing and Computer Science Applications: Chapter 8 on Continuous-Time Markov Chains Kishor Trivedi
2
Non-State Space Models
Recall that non-state-space models such as RBDs and FTs can easily be formulated and (assuming statistical independence) solved for system reliability, system availability, and system MTTF. Each component can have attached to it:
- a probability of failure
- a failure rate
- a distribution of time to failure
- steady-state and instantaneous unavailability
3
Markov chain To model complex interactions between components, use other kinds of models such as Markov chains or, more generally, state-space models. Many examples of dependencies among system components have been observed in practice and captured by Markov models.
4
MARKOV CHAINS State-space based model
States represent various conditions of the system; transitions between states indicate occurrences of events.
5
State-Space-Based Models
States and labeled state transitions. A state can keep track of:
- the number of functioning resources of each type
- the state of recovery for each failed resource
- the number of tasks of each type waiting at each resource
- the allocation of resources to tasks
A transition:
- can occur from any state to any other state
- can represent a simple or a compound event
6
State-Space-Based Model (Continued)
Drawn as a directed graph. The transition label determines the model type:
- probability: homogeneous discrete-time Markov chain (DTMC)
- rate: homogeneous continuous-time Markov chain (CTMC)
- time-dependent rate: non-homogeneous CTMC
- distribution function: semi-Markov process (SMP)
- two distribution functions: Markov regenerative process (MRGP)
7
MARKOV CHAINS (Continued)
For continuous-time Markov chains (CTMCs), the time variable associated with the system evolution is continuous. In what follows, whenever we speak of a Markov model (chain) we mean a CTMC.
8
Continuous Time Markov Chains
Chapter 8 Continuous Time Markov Chains
9
Formal Definition A discrete-state, continuous-time stochastic process {X(t), t ≥ 0} is called a Markov chain if, for t0 < t1 < t2 < … < tn < t, the conditional pmf satisfies the Markov property:
P[X(t) = x | X(tn) = xn, X(tn-1) = xn-1, …, X(t0) = x0] = P[X(t) = x | X(tn) = xn].
A CTMC is characterized by state changes that can occur at any arbitrary time: the index (time) space is continuous, while the state space is discrete valued.
10
Continuous Time Markov Chain (CTMC)
A CTMC can be completely described by:
- the initial state probability vector for X(t0): pj(t0) = P[X(t0) = j]
- the transition probability functions (over an interval): pij(v, t) = P[X(t) = j | X(v) = i]
11
pmf of X(t) Using the theorem of total probability, pj(t) = Σi pi(v) pij(v, t).
If v = 0 in the above equation, we get pj(t) = Σi pi(0) pij(0, t).
12
Homogeneous CTMCs {X(t), t ≥ 0} is a (time-)homogeneous CTMC iff pij(v, t) = pij(0, t - v) = pij(t - v), i.e., the transition probabilities depend only on the elapsed time t - v.
Or, the conditional pmf satisfies P[X(t) = j | X(v) = i] = P[X(t - v) = j | X(0) = i]. A CTMC is said to be irreducible if every state can be reached from every other state with non-zero probability. A state is said to be absorbing if no other state can be reached from it with non-zero probability. The notions of transient, recurrent non-null, and recurrent null states are the same as in a DTMC. There is, however, no notion of periodicity in a CTMC.
13
CTMC Dynamics The Chapman-Kolmogorov equation: pij(v, t) = Σk pik(v, u) pkj(u, t), for v ≤ u ≤ t.
Note that these transition probabilities are functions of elapsed time and not of the number of elapsed steps. Direct use of this equation is difficult, unlike the DTMC case where we could anchor on the one-step transition probabilities; hence the notion of transition rates, which follows next.
14
Transition Rates Define the rates (probabilities per unit time):
net rate out of state j at time t: qj(t) = lim h→0 [1 - pjj(t, t + h)] / h
rate from state i to state j at time t: qij(t) = lim h→0 pij(t, t + h) / h
15
Kolmogorov Differential Equation
The transition probabilities and transition rates are related through the Chapman-Kolmogorov equation over (v, t + h): pij(v, t + h) = Σk pik(v, t) pkj(t, t + h). Dividing both sides by h and taking the limit as h → 0 gives the Kolmogorov forward equation: ∂pij(v, t)/∂t = Σ(k≠j) pik(v, t) qkj(t) - pij(v, t) qj(t).
16
Kolmogorov Differential Equation (contd.)
Kolmogorov's backward equation: ∂pij(v, t)/∂v = qi(v) pij(v, t) - Σ(k≠i) qik(v) pkj(v, t). Writing these equations in matrix form, with P(v, t) = [pij(v, t)] and Q(t) = [qij(t)]: forward, ∂P(v, t)/∂t = P(v, t) Q(t); backward, ∂P(v, t)/∂v = -Q(v) P(v, t).
17
Homogeneous CTMC Specialize to the HCTMC case, where the rates are time-independent; the Kolmogorov differential equation becomes dpij(t)/dt = Σ(k≠j) pik(t) qkj - pij(t) qj.
In matrix form, dP(t)/dt = P(t) Q. (The matrix Q is called the infinitesimal generator matrix, or simply the generator matrix.)
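To make the matrix form concrete, here is a minimal Python sketch; the 2-state up/down generator and the rates lam and mu are assumed for illustration, not taken from the slides. It builds a generator matrix and evaluates P(t) = e^(Qt):

```python
import numpy as np
from scipy.linalg import expm

# Assumed example: a 2-state homogeneous CTMC (up/down) with
# failure rate lam and repair rate mu (illustrative values, per hour).
lam, mu = 0.001, 0.1

# Infinitesimal generator: off-diagonal entries are transition rates,
# each diagonal entry makes its row sum to zero.
Q = np.array([[-lam,  lam],
              [  mu,  -mu]])

# For a finite homogeneous CTMC, P(t) = e^{Qt}; row i gives the state
# distribution at time t given that the chain started in state i.
t = 10.0
P_t = expm(Q * t)
print(P_t)
```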
18
CTMC Steady-state Solution
The steady-state solution of a CTMC is obtained by solving the balance equations π Q = 0 together with the normalization condition Σj πj = 1. Irreducible CTMCs with all states recurrent non-null have positive steady-state values {πj} that are unique and independent of the initial probability vector. All states of a finite irreducible CTMC are recurrent non-null. Measures of interest may be computed by assigning reward rates rj to the states and computing the expected steady-state reward rate Σj rj πj. Transient solutions, in general, are rather difficult to obtain.
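A minimal numerical sketch of solving the balance equations π Q = 0 with the normalization condition; the 3-state generator Q and the reward rates r below are assumed purely for illustration:

```python
import numpy as np

# Assumed 3-state generator (rows sum to zero); the rates are illustrative.
Q = np.array([[-2.0,  2.0,  0.0],
              [ 1.0, -3.0,  2.0],
              [ 0.0,  1.0, -1.0]])

# Solve pi Q = 0 with sum(pi) = 1 by replacing one (redundant) balance
# equation with the normalization condition.
A = np.vstack([Q.T[:-1], np.ones(len(Q))])
b = np.zeros(len(Q)); b[-1] = 1.0
pi = np.linalg.solve(A, b)

# Expected steady-state reward rate: sum_j r_j * pi_j
r = np.array([1.0, 1.0, 0.0])      # assumed reward rates (e.g., 1 = system up)
print(pi, r @ pi)
```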
19
CTMC Measures Measures of interest may be computed by assigning reward rates rj to the states and computing the expected reward rate at time t, Σj rj pj(t), and the expected accumulated reward over an interval, Σj rj Lj(t), where Lj(t) is the expected time spent in state j during (0, t). The vector L(t) satisfies the LTODE dL(t)/dt = L(t) Q + p(0), with L(0) = 0. Transient solutions, in general, are rather difficult to obtain.
20
Markov Availability Model
21
2-State Markov Availability Model
States: 1 (UP) and 0 (DOWN), with failure rate λ (1 → 0) and repair rate μ (0 → 1).
1) Steady-state balance equations for each state (rate of flow IN = rate of flow OUT):
State 1: μ π0 = λ π1
State 0: λ π1 = μ π0
Two unknowns, two equations, but there is only one independent equation.
22
2-State Markov Availability Model (Continued)
Need an additional equation: the normalization condition π0 + π1 = 1. Downtime in minutes per year = π0 * 8760 * 60.
23
2-State Markov Availability Model (Continued)
2) Transient availability for each state (rate of buildup = rate of flow IN - rate of flow OUT): dp1(t)/dt = μ p0(t) - λ p1(t), with p0(t) + p1(t) = 1. This equation can be solved, assuming p1(0) = 1, to obtain p1(t) = μ/(λ + μ) + [λ/(λ + μ)] e^(-(λ+μ)t).
24
2-State Markov Availability Model (Continued)
3) Instantaneous availability: A(t) = p1(t). 4) Steady-state availability: A = lim t→∞ A(t) = μ/(λ + μ).
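As a numeric illustration of these formulas, here is a small sketch; the failure and repair rates lam and mu below are assumed values, not from the slides:

```python
import numpy as np

# 2-state availability model: state 1 = UP, state 0 = DOWN.
# Assumed rates (illustrative, per hour): lam = failure rate, mu = repair rate.
lam, mu = 1/1000.0, 1/4.0

A_ss = mu / (lam + mu)                         # steady-state availability
downtime_min_per_year = (1 - A_ss) * 8760 * 60

# Transient availability starting from the UP state, p1(0) = 1:
# A(t) = mu/(lam+mu) + lam/(lam+mu) * exp(-(lam+mu) t)
t = 24.0
A_t = mu/(lam+mu) + lam/(lam+mu) * np.exp(-(lam+mu)*t)
print(A_ss, downtime_min_per_year, A_t)
```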
25
Markov availability model
Assume we have a two-component parallel redundant system with repair rate μ. Assume that the failure rate of each component is λ. When both components have failed, the system is considered to have failed.
26
Markov availability model (Continued)
Let the number of properly functioning components be the state of the system. The state space is {0,1,2} where 0 is the system down state. We wish to examine effects of shared vs. non-shared repair.
27
Markov availability model (Continued)
State diagrams for the two cases (states 2, 1, 0): failures take the system from state 2 to state 1 at rate 2λ and from state 1 to state 0 at rate λ; repair returns it from 1 to 2 at rate μ, and from 0 to 1 at rate 2μ for non-shared (independent) repair or μ for shared repair.
28
Markov availability model (Continued)
Note: the non-shared case can be modeled and solved using an RBD or a fault tree, but the shared case needs Markov chains.
29
Steady-state balance equations
For any state: rate of flow in = rate of flow out. Consider the shared case, with πi the steady-state probability that the system is in state i. The balance equations are:
State 2: 2λ π2 = μ π1
State 1: (λ + μ) π1 = 2λ π2 + μ π0
State 0: μ π0 = λ π1
30
Steady-state balance equations (Continued)
Hence π1 = (2λ/μ) π2 and π0 = (λ/μ) π1 = (2λ²/μ²) π2. Since π0 + π1 + π2 = 1, we have π2 = [1 + 2λ/μ + 2λ²/μ²]^(-1), or A_shared = π2 + π1 = (1 + 2λ/μ) π2.
31
Steady-state balance equations (Continued)
Steady-state unavailability = π0 = 1 - A_shared. Similarly, for the non-shared case, steady-state unavailability = 1 - A_non-shared. Downtime in minutes per year = (1 - A) * 8760 * 60.
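A small sketch comparing the shared and non-shared cases numerically; the rates lam and mu and the helper steady_state are illustrative assumptions, not slide-provided values:

```python
import numpy as np

def steady_state(Q):
    """Solve pi Q = 0 with sum(pi) = 1 for a small irreducible CTMC."""
    n = len(Q)
    A = np.vstack([Q.T[:-1], np.ones(n)])
    b = np.zeros(n); b[-1] = 1.0
    return np.linalg.solve(A, b)

# Assumed illustrative rates (per hour): lam = failure rate, mu = repair rate.
lam, mu = 0.001, 0.1

# State order: 2 (both up), 1 (one up), 0 (system down).
Q_shared = np.array([[-2*lam,      2*lam,  0.0],
                     [    mu,  -(lam+mu),  lam],
                     [   0.0,         mu,  -mu]])   # single shared repair facility

Q_nonshared = Q_shared.copy()
Q_nonshared[2] = [0.0, 2*mu, -2*mu]                 # both units repaired in parallel

for name, Q in (("shared", Q_shared), ("non-shared", Q_nonshared)):
    pi = steady_state(Q)
    A_ss = 1.0 - pi[2]                              # unavailability is pi_0 = pi[2]
    print(name, A_ss, (1.0 - A_ss) * 8760 * 60)     # availability, downtime (min/yr)
```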
32
Steady-state balance equations
33
A larger example Return to the example with 2 control channels and 3 voice channels, and assume that the control-channel failure rate is λc and the voice-channel failure rate is λv. The repair rates are μc and μv, respectively. Assuming a single shared repair facility, with the control channel having preemptive repair priority over voice channels, draw the state diagram of a Markov availability model. Using the SHARPE GUI, solve the Markov chain for steady-state and instantaneous availability.
35
WFS Example
36
A Workstations-Fileserver Example
A computing system consisting of:
- a file-server
- two workstations
- a computing network connecting them
The system is operational as long as one of the workstations and the file-server are operational. The computer network is assumed to be fault-free.
37
The WFS Example
38
Markov Chain for WFS Example
Assuming exponentially distributed times to failure:
- λw: failure rate of a workstation
- λf: failure rate of the file-server
Assume that components are repairable:
- μw: repair rate of a workstation
- μf: repair rate of the file-server
The file-server has (preemptive) priority for repair over workstations (such repair priority cannot be captured by non-state-space models).
39
Markov Availability Model for WFS
State diagram over the states (2,1), (2,0), (1,1), (1,0), (0,1), (0,0), with workstation failure rates 2λw and λw, file-server failure rate λf, and repair rates μw and μf. Since all states are reachable from every other state, the CTMC is irreducible. Furthermore, all states are positive recurrent.
40
Markov Availability Model for WFS (Continued)
In this figure, the label (i,j) of each state is interpreted as follows: i represents the number of workstations that are still functioning and j is 1 or 0 depending on whether the file-server is up or down respectively.
41
Markov Model Let {X(t), t ≥ 0} represent a finite-state continuous-time Markov chain (CTMC) with state space Ω. Infinitesimal generator matrix Q = [qij]: for i ≠ j, qij is the transition rate from state i to state j; the diagonal element is qii = -qi = -Σ(j≠i) qij.
42
Markov Availability Model for WFS (Continued)
For the example problem, with the states ordered as (2,1), (2,0), (1,1), (1,0), (0,1), (0,0), the off-diagonal entries of Q are the failure and repair rates on the state diagram and each diagonal entry makes its row sum to zero; a sketch that constructs Q appears below.
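The following Python sketch constructs the generator under the failure/repair structure described above (file-server repaired first); the numeric rates and the helper rate are assumptions for illustration, not values from the slides:

```python
import numpy as np

# Assumed rates (illustrative, per hour): lam_w, lam_f = failure rates of a
# workstation and the file-server; mu_w, mu_f = their repair rates.
lam_w, lam_f, mu_w, mu_f = 1/5000, 1/20000, 1.0, 0.5

# State order as on the slide: (2,1), (2,0), (1,1), (1,0), (0,1), (0,0);
# the file-server has preemptive repair priority over the workstations.
S = ["21", "20", "11", "10", "01", "00"]
idx = {s: i for i, s in enumerate(S)}
Q = np.zeros((6, 6))

def rate(frm, to, r):
    Q[idx[frm], idx[to]] += r

rate("21", "11", 2*lam_w); rate("21", "20", lam_f)
rate("20", "21", mu_f);    rate("20", "10", 2*lam_w)
rate("11", "01", lam_w);   rate("11", "10", lam_f); rate("11", "21", mu_w)
rate("10", "11", mu_f);    rate("10", "00", lam_w)   # file-server repaired first
rate("01", "00", lam_f);   rate("01", "11", mu_w)
rate("00", "01", mu_f)                               # file-server repaired first

np.fill_diagonal(Q, -Q.sum(axis=1))                  # rows of a generator sum to 0
print(Q)
```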
44
Markov Model (steady-state)
π: the steady-state probability vector, obtained by solving the steady-state balance equations π Q = 0 (rate of flow in = rate of flow out) together with Σi πi = 1. After solving for π we obtain the steady-state availability A_ss = π(2,1) + π(1,1).
45
Markov Model (transient)
p(t): transient state probability vector; p(0): initial probability vector of the CTMC. The transient behavior is described by the Kolmogorov differential equation (KDE): dp(t)/dt = p(t) Q, given p(0).
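A minimal sketch of solving the KDE numerically with SciPy; the generator, the rates, and the helper name transient are assumptions for illustration:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Transient state probabilities from the Kolmogorov differential equation
# dp/dt = p(t) Q, given p(0).  Q can be any generator matrix, e.g. the
# assumed WFS generator constructed earlier.
def transient(Q, p0, t_end, n_points=100):
    ts = np.linspace(0.0, t_end, n_points)
    sol = solve_ivp(lambda t, p: p @ Q, (0.0, t_end), p0,
                    t_eval=ts, method="LSODA", rtol=1e-8, atol=1e-10)
    return sol.t, sol.y.T          # sol.y.T[k] is the probability vector p(t_k)

# Assumed 2-state example (state 0 = UP, state 1 = DOWN):
Q = np.array([[-0.001, 0.001],
              [ 0.1,  -0.1 ]])
p0 = np.array([1.0, 0.0])          # start in the UP state
ts, ps = transient(Q, p0, 100.0)
print(ps[-1])                      # approaches the steady-state vector
```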
46
Markov Availability Model
We compute the availability of the system: the system is available as long as it is in state (2,1) or (1,1). Instantaneous availability of the system: A(t) = p(2,1)(t) + p(1,1)(t), and A_ss = lim t→∞ A(t) = π(2,1) + π(1,1).
47
Availability (Continued)
Interval availability: A_I(t) = (1/t) ∫0^t A(u) du. Steady-state availability: A_ss = lim t→∞ A(t). There are three kinds of availability: instantaneous, interval, and steady-state.
48
Markov Availability Model (Continued)
L(i,j)(t): expected total time spent in state (i,j) during (0, t). Integrating the KDE, we get the LTODE dL(t)/dt = L(t) Q + p(0), with L(0) = 0. Interval availability: A_I(t) = [L(2,1)(t) + L(1,1)(t)] / t.
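A sketch of computing interval availability by integrating the LTODE directly; the 2-state generator and the helper name interval_availability are assumptions for illustration:

```python
import numpy as np
from scipy.integrate import solve_ivp

# LTODE: dL/dt = L(t) Q + p(0), L(0) = 0; L_j(t) is the expected time
# spent in state j during (0, t).  Interval availability is the sum of
# L_j(t) over the up states divided by t.
def interval_availability(Q, p0, up_states, t):
    sol = solve_ivp(lambda s, L: L @ Q + p0, (0.0, t), np.zeros(len(p0)),
                    rtol=1e-8, atol=1e-10)
    L_t = sol.y[:, -1]
    return L_t[up_states].sum() / t

# Assumed 2-state example: state 0 = UP, state 1 = DOWN.
Q = np.array([[-0.001, 0.001],
              [ 0.1,  -0.1 ]])
print(interval_availability(Q, np.array([1.0, 0.0]), [0], 1000.0))
```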
49
Markov Availability Model Results
50
2-component Availability model with finite Detection delay
With immediate failure detection, the steady-state availability is A_ss = 1 - π0. When the failure-detection stage takes a random time, EXP(δ), the down states are '0' and '1D', so A_ss = 1 - π0 - π1D. Therefore the steady-state unavailability is U(δ) = π0 + π1D.
51
Redundant System with Finite Detection Switchover Time
After solving the Markov model, we obtain the steady-state probabilities; they can be obtained in closed form or using SHARPE.
52
Closed-form
53
2-component availability model with imperfect coverage
Coverage factor = c (conditional probability that the fault is correctly handled) ‘1C’ state is a reboot (down) state.
54
2-component availability model: delay + imperfect coverage
The model has both detection delay and imperfect coverage. The down states are '0', '1C', and '1D'.
55
Modeling Software Faults Operating System Failure
Availability model with hardware and software (OS) redundancy; operational phase; Heisenbugs. Reference: Probability & Statistics with Reliability, Queuing and Computer Science Applications (2nd ed.), K. S. Trivedi, John Wiley, 2001. Assumptions: hardware failures are permanent and require a repair or replacement action, while OS failures are cleared by a reboot; repair and reboot take place at separate rates for the hardware and the OS, respectively.
56
Webserver Availability Model with warm Replication
- Two nodes for hardware redundancy
- Each node has a copy of the webserver (software redundancy via replication)
- The primary node can fail
- The secondary node can fail
- The primary process can fail
- The secondary process can fail
- Failures may have imperfect coverage
- There is a time delay for fault detection
- Model of a real system developed at Avaya Labs
57
Modeling Software Faults Application Failure
Availability model with passive redundancy (warm replication) of the application; operational phase; Heisenbugs or hardware transients. Assumptions: web-server software that fails at rate λp, running on a machine that fails at rate λm; mean time to detect a server-process failure and mean time to detect a machine failure; mean restart time of a machine; mean restart time of a server. Reference: Performance and Reliability Evaluation of Passive Replication Schemes in Application Level Fault-Tolerance, S. Garg, Y. Huang, C. Kintala, K. S. Trivedi and S. Yagnik, Proc. of the 29th Intl. Symp. on Fault-Tolerant Computing, FTCS-29, June 1999.
58
Parameters:
- Process MTTF = 10 days
- Node MTTF = 20 days
- Process polling interval = 2 seconds
- Mean process restart time = 30 seconds
- Mean process failover time = 2 minutes
- Exponentially distributed switching time
- Coverage C = 0.95
59
Solution for warm replication
60
Modeling an N+1 Protection System
61
Outline
- Description of the system
- Using a rate approximation
- Using a 3-stage Erlang approximation to a uniform distribution
- Using a semi-Markov model: an approximation method using a 3-stage Erlang distribution
- Using equations of the underlying semi-Markov process
- Solutions for the models
62
Description of the system
- N = number of protected units (we use N = 1)
- λ = unit failure rate
- μ = unit restoration rate
- T = deterministic time between routine diagnostics
- c = probability that a protection switch successfully restores service
- d = probability that a failure in the standby unit is detected
63
Outline
- Description of the system
- Using a rate approximation
- Using a 3-stage Erlang approximation to a uniform distribution
- Using a semi-Markov model: an approximation method using a 3-stage Erlang distribution
- Using equations of the underlying semi-Markov process
- Solutions for the models
64
Hot Standby with different coverages
State diagram for hot standby with different coverages (N = 1). States: 1 Normal (1+1), 2 Protection Switch Failure, 3 Simplex (1), 4 Failure to Detect Protection Fault, 5 Failed (0). Transitions out of Normal (total rate 2λ) branch according to the terms (c+d), (1-c), and (1-d).
65
Diagnostics; Using a rate approximation
Same state diagram (N = 1), with the routine diagnostic modeled as a rate 2/T: the time to the next diagnostic is approximated as exponentially distributed with mean T/2. States: 1 Normal (1+1), 2 Protection Switch Failure, 3 Simplex (1), 4 Failure to Detect Protection Fault, 5 Failed (0).
67
Outline
- Description of the system
- Using a rate approximation
- Using a 3-stage Erlang approximation to a uniform distribution
- Using a semi-Markov model: an approximation method using a 3-stage Erlang distribution
- Using equations of the underlying semi-Markov process
- Solutions for the models
68
Comparison of probability density functions (pdf)
69
Comparison of cumulative distribution functions (cdf)
70
Using a 3-stage Erlang approximation to a uniform distribution
Same state diagram, with the diagnostic delay expanded into three stages (s1, s2, and a final stage), each with rate 6/T: the time to diagnostic is uniformly distributed over (0, T) and is approximated by a 3-stage Erlang with mean T/2.
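A quick numerical check of this approximation (a sketch; T = 1 is an assumed value): a 3-stage Erlang with per-stage rate 6/T has mean 3(T/6) = T/2, the same mean as the Uniform(0, T) diagnostic time.

```python
import numpy as np
from scipy.stats import erlang, uniform

# Time to the next diagnostic: Uniform(0, T), mean T/2.
# 3-stage Erlang with per-stage rate 6/T: mean 3 * (T/6) = T/2.
T = 1.0
x = np.linspace(0.0, 2*T, 200)

cdf_uniform = uniform.cdf(x, loc=0.0, scale=T)
cdf_erlang3 = erlang.cdf(x, a=3, scale=T/6)   # shape 3, scale = 1/rate = T/6

# Same mean, different shapes; print the maximum cdf difference.
print(np.max(np.abs(cdf_uniform - cdf_erlang3)))
```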
72
Outline
- Description of the system
- Using a rate approximation
- Using a 3-stage Erlang approximation to a uniform distribution
- Using a semi-Markov model: an approximation method using a 3-stage Erlang distribution
- Using equations of the underlying semi-Markov process
- Solutions for the models
73
Using a Semi-Markov model - approximation method using an Erlang distribution (N=1)
E(t) denotes the 3-stage Erlang distribution E(t) = 1 - e^(-(6/T)t) [1 + (6/T)t + ((6/T)t)²/2], whose mean is T/2. The semi-Markov model uses the same state diagram, but the diagnostic transition is labeled by the distribution E(t) rather than a rate: the time to diagnostic, uniformly distributed over (0, T), is approximated by this 3-stage Erlang distribution.
74
Outline
- Description of the system
- Using a rate approximation
- Using a 3-stage Erlang approximation to a uniform distribution
- Using a semi-Markov model: an approximation method using a 3-stage Erlang distribution
- Using equations of the underlying semi-Markov process
- Solutions for the models
75
Using Equations of the underlying Semi-Markov Process
Steady-state solution: form the one-step transition probability matrix P of the embedded DTMC and solve v = v P; the SMP steady-state probabilities are then πi = vi hi / Σj vj hj, where hi is the mean sojourn time in state i.
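A generic sketch of this computation; the embedded transition matrix P and the mean sojourn times h below are assumed illustrations, not the actual five-state protection-switching model:

```python
import numpy as np

def smp_steady_state(P, h):
    """Steady-state probabilities of a semi-Markov process:
    solve v = v P for the embedded DTMC, then weight by the mean
    sojourn times h_i:  pi_i = v_i h_i / sum_j v_j h_j."""
    n = len(P)
    A = np.vstack([(P.T - np.eye(n))[:-1], np.ones(n)])
    b = np.zeros(n); b[-1] = 1.0
    v = np.linalg.solve(A, b)          # embedded-DTMC stationary vector
    w = v * h
    return w / w.sum()

# Assumed illustrative embedded transition matrix and mean sojourn times.
P = np.array([[0.0, 0.3, 0.7],
              [1.0, 0.0, 0.0],
              [0.5, 0.0, 0.5]])
h = np.array([2.0, 0.5, 1.0])
print(smp_steady_state(P, h))
```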
76
Using Equations of the underlying Semi-Markov Process (Continued)
77
Using Equations of the underlying Semi-Markov Process (Continued)
Time to the next diagnostic is uniformly distributed over (0,T)
78
Using Equations of the underlying Semi-Markov Process (Continued)
79
Outline
- Description of the system
- Using a rate approximation
- Using a 3-stage Erlang approximation to a uniform distribution
- Using a semi-Markov model: an approximation method using a 3-stage Erlang distribution
- Using equations of the underlying semi-Markov process
- Solutions for the models
80
Solutions for the models
Parameter values assumed: N = 1, c = 0.9, d = 0.9, failure rate λ (per hour), restoration rate μ = 1 per hour, T = 1 hour.
81
Results obtained:
- Steady-state availability: probability of being in states "Normal", "Simplex", or "Failure to Detect Protection Fault"
- Steady-state unavailability: probability of being in states "Protection Switch Failure" or "Failed (0)"
- Average downtime in steady state: steady-state unavailability * number of minutes in a year
- Average number of units available: 2*P(Normal) + 1*P(Simplex) + 1*P(Failure to Detect Protection Fault)
83
Markov Reliability Model
84
Markov reliability model with repair
Consider the 2-component parallel system (no detection delay, perfect coverage), but disallow repair from the system-down state. Note that state 0 is now an absorbing state; the state diagram is given in the following figure. This reliability model with repair cannot be modeled using a reliability block diagram or a fault tree; we need to resort to Markov chains. (This is a form of dependency, since in order to repair a component you need to know the status of the other component.)
85
Markov reliability model with repair (Continued)
The Markov chain has an absorbing state. In the steady state, the system will be in state 0 with probability 1; hence transient analysis is of interest. States 1 and 2 are transient states.
86
Markov reliability model with repair (Continued)
Assume that the initial state of the Markov chain is 2, that is, p2(0) = 1 and pk(0) = 0 for k = 0, 1. Then the system of differential equations is written based on: rate of buildup = rate of flow in - rate of flow out, for each state:
dp2(t)/dt = -2λ p2(t) + μ p1(t)
dp1(t)/dt = 2λ p2(t) - (λ + μ) p1(t)
dp0(t)/dt = λ p1(t)
87
Markov reliability model with repair (Continued)
88
Markov reliability model with repair (Continued)
After solving these equations, we get R(t) = p2(t) + p1(t). Recalling that MTTF = ∫0^∞ R(t) dt, we get MTTF = 3/(2λ) + μ/(2λ²).
89
Markov reliability model with repair (Continued)
Note that the MTTF of the two-component parallel redundant system, in the absence of a repair facility (i.e., μ = 0), would have been equal to the first term, 3/(2λ), in the above expression. Therefore, the effect of a repair facility is to increase the mean life by μ/(2λ²), or by a factor of 1 + μ/(3λ).
90
Markov Reliability Model with Repair ( WFS Example)
Assume that the computer system does not recover if both workstations fail, or if the file-server fails
91
Markov Reliability Model with Repair
States (0,1), (1,0) and (2,0) become absorbing states while (2,1) and (1,1) are transient states. Note: we have made a simplification that, once the CTMC reaches a system failure state, we do not allow any more transitions.
93
Markov Model with Absorbing States
If we solve for p(2,1)(t) and p(1,1)(t), then R(t) = p(2,1)(t) + p(1,1)(t). For a Markov chain with absorbing states: A is the set of absorbing states, B = Ω - A is the set of remaining (transient) states, and τ(i,j) is the mean time spent in state (i,j) until absorption.
94
Markov Model with Absorbing States (Continued)
QB is derived from Q by restricting it to only the states in B. The mean times τ(i,j) satisfy τ QB = -pB(0), and the mean time to absorption is MTTA = Σ over (i,j) in B of τ(i,j).
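A minimal numerical sketch of this computation, using the earlier two-component parallel model with repair as the absorbing-chain example; the rates lam and mu and the helper mtta are assumptions for illustration:

```python
import numpy as np

def mtta(Q_B, pB0):
    """Mean time to absorption of a CTMC: solve tau Q_B = -p_B(0),
    where tau_i is the expected time spent in transient state i."""
    tau = np.linalg.solve(Q_B.T, -pB0)
    return tau, tau.sum()

# Assumed example: 2-component parallel system with repair rate mu and an
# absorbing failure state; transient states ordered as (2, 1).
lam, mu = 0.001, 0.1
Q_B = np.array([[-2*lam,      2*lam],
                [    mu, -(lam+mu)]])
tau, MTTF = mtta(Q_B, np.array([1.0, 0.0]))
print(MTTF, 3/(2*lam) + mu/(2*lam**2))   # matches the closed form given earlier
```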
95
Markov Reliability Model with Repair (Continued)
First solve τ QB = -pB(0), where QB is the 2×2 submatrix of Q restricted to the transient states (2,1) and (1,1).
96
Markov Reliability Model with Repair (Continued)
Then, with τ(2,1) and τ(1,1) obtained, the mean time to failure is MTTF = τ(2,1) + τ(1,1) (in hours).
97
Markov Reliability Model without Repair
Assume that neither the workstations nor the file-server is repairable.
98
Markov Reliability Model without Repair (Continued)
States (0,1), (1,0) and (2,0) become absorbing states
101
Markov Reliability Model without Repair (Continued)
Mean time to failure is 9333 hours.
102
Markov Reliability Model with Imperfect Coverage
103
Markov model with imperfect coverage
Next consider a modification of the above example proposed by Arnold as a model of duplex processors of an electronic switching system. We assume that not all faults are recoverable and that c is the coverage factor which denotes the conditional probability that the system recovers given that a fault has occurred. The state diagram is now given by the following picture:
104
Now allow for Imperfect coverage
105
Markov model with imperfect coverage (Continued)
Assume that the initial state is 2, so that p2(0) = 1 and p1(0) = p0(0) = 0. Then the system of differential equations is:
dp2(t)/dt = -2λ p2(t) + μ p1(t)
dp1(t)/dt = 2λc p2(t) - (λ + μ) p1(t)
dp0(t)/dt = 2λ(1 - c) p2(t) + λ p1(t)
106
Markov model with imperfect coverage (Continued)
After solving the differential equations, we obtain R(t) = p2(t) + p1(t). From R(t), we can compute the system MTTF. It should be clear that the system MTTF and system reliability are critically dependent on the coverage factor.
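Under the duplex model as reconstructed above (covered failures at rate 2λc, uncovered failures at rate 2λ(1 - c) going directly to the absorbing state), here is a symbolic sketch of the MTTF computation; it illustrates the method and is not a slide-provided result:

```python
import sympy as sp

# Symbolic MTTF for the duplex model with imperfect coverage, assuming the
# structure sketched above: state 2 (both up), state 1 (one up), state 0
# (failed, absorbing).  QB is the generator restricted to states 2 and 1.
lam, mu, c = sp.symbols("lambda mu c", positive=True)

Q_B = sp.Matrix([[-2*lam,      2*lam*c],
                 [    mu,  -(lam + mu)]])
pB0 = sp.Matrix([[1, 0]])                 # start in state 2

tau = -pB0 * Q_B.inv()                    # expected time in each transient state
MTTF = sp.simplify(tau[0] + tau[1])
print(MTTF)   # equal to ((2c + 1)*lambda + mu) / (2*lambda*(lambda + (1 - c)*mu))
```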