Copyright 2004 Koren & Krishna ECE655/Ckpt Part.11.1 Fall 2006 UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Fault Tolerant Computing.

1 ECE 655 Checkpointing II

2 Checkpoint Placement - Notations
- Checkpoint placement is a tradeoff between cost and benefit, aimed at minimizing the expected execution time of a long job
- Cost: the time to store a checkpoint, which can be large
- $t_x$: execution time without checkpoints
- $t_c$: average time to take a checkpoint
- $N$: decision variable, the number of checkpoints placed uniformly in the job, chosen to minimize the total execution time $T_{tot}(N)$
- $\tau_x = t_x / N$: time between consecutive checkpoints
- Failures occur with rate $\lambda$
- Failures are transient and go away after a mean lifetime $t_f$
- The system then rolls back to the latest checkpoint
- Checkpoints are kept in secure memory, uncorrupted by failures

3 Checkpoint Placement - Analytical Model
- $t_l$: total time lost for every transient failure
- $t_f$: time the system is down
- If the failure occurs during checkpointing: probability $p_c = t_c / (t_c + \tau_x)$, lost time $\tau_x + t_c/2$
- If the failure occurs during execution: probability $p_x = \tau_x / (t_c + \tau_x)$, lost time $\tau_x/2$
- $t_l = t_f + p_c (\tau_x + t_c/2) + p_x \tau_x/2 = t_f + (t_c + \tau_x)/2$
- This result is intuitive: $(t_c + \tau_x)/2$ is half the interval $t_c + \tau_x$
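
The two-case average on this slide can be checked numerically. The following is only an illustrative sketch: the function name `expected_lost_time` and the sample values are assumptions, not part of the original slides.

```python
def expected_lost_time(t_f, t_c, tau_x):
    """Expected time lost per transient failure (t_l), via the two-case average."""
    cycle = t_c + tau_x                  # one checkpoint plus one execution interval
    p_c = t_c / cycle                    # failure lands in the checkpointing phase
    p_x = tau_x / cycle                  # failure lands in the execution phase
    return t_f + p_c * (tau_x + t_c / 2) + p_x * (tau_x / 2)


# Hypothetical values in arbitrary time units, chosen only for illustration.
t_f, t_c, tau_x = 2.0, 1.0, 9.0
closed_form = t_f + (t_c + tau_x) / 2    # the simplified expression from the slide
assert abs(expected_lost_time(t_f, t_c, tau_x) - closed_form) < 1e-12
```

The assertion holds because the weighted sum collapses to $(t_c + \tau_x)^2 / (2 (t_c + \tau_x))$.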

4 Optimal Checkpoint Placement
- Assume $\lambda$ is sufficiently small that the probability of a failure during rollback is negligible
- The expected number of failures during the total execution time $t_x + N t_c$ is $\lambda (t_x + N t_c)$
- Total time taken: $T_{tot}(N) = t_x + N t_c + \lambda (t_x + N t_c) [t_f + (t_c + t_x/N)/2]$
- Select $N$ so as to minimize $T_{tot}(N)$
- $\partial T_{tot}(N) / \partial N = t_c + \lambda t_c (t_c/2 + t_f) - \lambda t_x^2 / (2 N^2)$
- Setting the derivative to zero, we obtain $N_{opt} = t_x \sqrt{\lambda / [t_c (2 + \lambda t_c + 2 \lambda t_f)]}$

5 Optimal Value of N
- $N$ must be a whole number: whichever of the floor or the ceiling of $N_{opt}$ minimizes $T_{tot}(N)$
- Optimal inter-checkpoint interval: $\tau_x^{opt} = t_x / N_{opt}$
- Exercise: relax the assumption that the probability of additional failures during the recovery process is negligible
- Uniform placement is optimal if the checkpointing cost is constant throughout the execution
- If the checkpoint size, and hence the checkpointing time, varies greatly from one part of the execution to another, the optimal time between checkpoints is not constant
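
A minimal sketch of the floor-versus-ceiling selection, assuming the total-time model from the preceding slides. The function names and the sample job parameters are hypothetical and chosen only for illustration.

```python
import math


def total_time(n, t_x, t_c, t_f, lam):
    """T_tot(N): expected total execution time with n uniformly placed checkpoints."""
    base = t_x + n * t_c                          # useful work plus checkpoint overhead
    lost_per_failure = t_f + (t_c + t_x / n) / 2  # t_l with tau_x = t_x / n
    return base + lam * base * lost_per_failure


def optimal_checkpoints(t_x, t_c, t_f, lam):
    """Integer N minimizing T_tot: the better of floor and ceiling of the real optimum."""
    n_real = t_x * math.sqrt(lam / (t_c * (2 + lam * t_c + 2 * lam * t_f)))
    candidates = [max(1, math.floor(n_real)), max(1, math.ceil(n_real))]
    return min(candidates, key=lambda n: total_time(n, t_x, t_c, t_f, lam))


# Hypothetical job: 1000 time units of work, unit checkpoint cost.
n_opt = optimal_checkpoints(t_x=1000.0, t_c=1.0, t_f=2.0, lam=0.01)
tau_opt = 1000.0 / n_opt                          # optimal inter-checkpoint interval
```

Checking only the two neighbouring integers is enough here because $T_{tot}(N)$ has the form $a + bN + c/N$ with $b, c > 0$, which is convex for $N > 0$.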

6 Optimal Checkpoint Placement - An Instruction Level Model
- The probability of a fault during an instruction execution depends on the functional units used and on its execution time
- Decision variable $M$: number of instructions between consecutive checkpoints
- Objective: minimize $W$, the time spent per instruction
- The instruction set is partitioned into $N$ subsets of similar instructions
- For a type $i$ instruction: execution time $T_i$, failure rate $\lambda_i$, frequency $f_i$ (with $\sum_i f_i = 1$)
- $s$ ($1 - s$): fraction of permanent (transient) faults
- $\mu_i$: "repair" rate of a transient failure in a type $i$ instruction

7 Instruction Level Model - Notations
- Possible events during an instruction execution:
- $H_c$: the instruction is completed successfully when first executed, with probability $P_c$
- $H_{RB}$: the instruction fails, the failure is identified, the program is rolled back to the last checkpoint, and the instruction is completed, with probability $P_{RB}$
- $H_{PF}$: the program rollback fails, the program fails, and the program is reloaded and restarted, with probability $P_{PF}$
- $P_c^i$, $P_{RB}^i$, $P_{PF}^i$: conditional probabilities for a type $i$ instruction
- These conditional probabilities will be calculated and then averaged: [equations not shown]

8 Instruction Level Model - Further Notations
- For a system with failure rate $\lambda$ and repair rate $\mu$, the model uses:
- The probability of no faults in the time interval $(0, t)$
- The probability of a transition from the fault-free state at time 0 to the fault-free state at time $t$
- For $0 \le t_1 \le t$, the probability of a transition from the fault-free state at time 0 to the fault-free state at time $t$ with at least one fault during $(0, t_1)$
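
The formulas for these three probabilities are not present in this transcript. As a hedged sketch, assuming exponentially distributed times to failure (rate `lam`) and to repair (rate `mu`), i.e. a two-state Markov model, the standard expressions can be written as below. The function names are hypothetical, and the slides may use different symbols.

```python
import math


def p_no_fault(lam, t):
    """Probability of no fault at all during (0, t)."""
    return math.exp(-lam * t)


def p_fault_free(lam, mu, t):
    """Probability of being fault-free at time t, given fault-free at time 0."""
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)


def p_fault_free_after_fault(lam, mu, t1, t):
    """Fault-free at 0 and at t, with at least one fault during (0, t1), 0 <= t1 <= t."""
    # Subtract the paths with no fault in (0, t1); by the Markov property the
    # remainder of such a path restarts fault-free at time t1.
    return p_fault_free(lam, mu, t) - p_no_fault(lam, t1) * p_fault_free(lam, mu, t - t1)
```

With `mu = 0` (no repair) the third probability is zero, since the system can never return to the fault-free state after a fault.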

9 Instruction Level Model - Probabilities
- $M$: number of instructions between checkpoints
- $m$: number of instructions between the failing instruction and the last checkpoint; $m = 1, \ldots, M$, each with probability $1/M$
- $P_{RB}^{i,m}$: conditional probability of a successful rollback, given a type $i$ instruction and $m$ instructions to the last checkpoint
- $\tau_1$: setup time needed to initiate a program rollback, including the time needed to load the information saved in the last checkpoint

10 Instruction Level Model - Calculating W
- $\tau$: mean time to successfully execute an instruction
- $T_s$: time taken for checkpointing
- Time spent per instruction: $W = \tau + T_s / M$
- As $M$ increases, the first term increases and the second decreases
- $T = \sum_i f_i T_i$: mean execution time of a fault-free instruction
- $\tau_2$: average time required for diagnosis and repair
- $L$: average number of instructions per program
- $\tau$ includes $W$ as a term
- $\tau$ is substituted into $W$, and the result is solved for $W$

11 Optimal value of M
- Solving for $W$: [equation not shown]
- The optimal value of $M$, which minimizes $W$, is found iteratively
- The initial value is obtained by substituting 1 for the denominator and 0 for $\tau_2$, taking the derivative with respect to $M$, and setting it equal to 0
- Initial value of $M$: [equation not shown]

12 CARER: Cache-Aided Rollback Error Recovery
- Reducing the checkpointing time allows more frequent checkpoints, which reduces the penalty of a rollback upon failure
- The CARER scheme reduces the time required to take a checkpoint by marking the process footprint in main memory and cache as part of the checkpointed state
- This assumes that the memory and cache are far less prone to failure than the processor
- Checkpointing consists of storing the processor's registers in main memory; the checkpoint also includes the process's footprint in main memory plus any cache lines marked as being part of the checkpoint

13 Checkpoint Bit For Each Cache Line
- This requires a hardware modification: an extra checkpoint bit associated with each cache line
- When this bit is 1, the corresponding line is unmodifiable, i.e., the line is part of the latest checkpoint and may not be updated without forcing a checkpoint to be taken immediately
- If the bit is 0, the processor is free to modify the word
- The process's footprint in main memory and the marked lines in the cache do double duty as both memory and part of the checkpoint, which leaves less freedom in deciding when checkpoints have to be taken

14 Forced Checkpointing
- Checkpointing is forced when:
- A line marked unmodifiable is to be updated
- Anything in the main memory is to be updated
- An I/O instruction is executed or an external interrupt occurs
- Taking a checkpoint involves:
- (a) saving the processor registers in memory
- (b) setting to 1 the checkpoint bit associated with each valid cache line

15 Roll Back - Cost
- Rolling back to the previous checkpoint is very simple: restore the registers, and mark as invalid all the lines in the cache with checkpoint bit = 0
- The cost:
- A checkpoint bit for every cache line
- Every write-back of a cache line into main memory involves taking a checkpoint
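
The checkpoint and rollback steps described for CARER can be sketched as a toy model. This is an assumption-laden illustration, not the actual hardware design: the class `CarerCache`, its `fill` method (modelling a line brought into the cache after the last checkpoint), and the register names are all hypothetical, and the sketch does not model what the hardware does with a line after a forced checkpoint.

```python
class CarerCache:
    """Toy model: every cache line carries a checkpoint bit ("ckpt")."""

    def __init__(self, registers):
        self.registers = dict(registers)        # live processor registers
        self.saved_registers = dict(registers)  # copy held in main memory
        self.lines = {}                         # address -> line record

    def fill(self, address, value):
        """Bring a line into the cache; it starts modifiable (bit = 0)."""
        self.lines[address] = {"value": value, "valid": True, "ckpt": 0}

    def write_forces_checkpoint(self, address):
        """True if updating this line would force an immediate checkpoint."""
        line = self.lines.get(address)
        return line is not None and line["valid"] and line["ckpt"] == 1

    def take_checkpoint(self):
        """(a) Save the registers; (b) set the bit on every valid line."""
        self.saved_registers = dict(self.registers)
        for line in self.lines.values():
            if line["valid"]:
                line["ckpt"] = 1

    def roll_back(self):
        """Restore the registers and invalidate every line with bit = 0."""
        self.registers = dict(self.saved_registers)
        for line in self.lines.values():
            if line["ckpt"] == 0:
                line["valid"] = False


cache = CarerCache({"pc": 0})
cache.fill(0x10, 1)
cache.take_checkpoint()       # line 0x10 is now part of the checkpoint
cache.registers["pc"] = 8     # execution continues past the checkpoint
cache.fill(0x20, 5)           # brought in after the checkpoint, bit = 0
cache.roll_back()             # pc is restored, line 0x20 is invalidated
```

Keeping the bit per line is what makes rollback cheap in this model: no data is copied, only registers are restored and unmarked lines are dropped.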

16 Checkpointing in Distributed Systems
- Distributed system: processors and associated memories connected by a network
- Each processor may have local disks
- There can also be a network file system accessible by all processors
- Processes are connected by directional channels: point-to-point connections from one process to another
- Assume channels are error-free and deliver messages in the order in which they were received by the channel

17 Deterministic and Non-deterministic Events
- A non-deterministic event: its occurrence cannot be predicted based on prior state(s) of the system
- A deterministic event can be predicted
- Process execution is a sequence of deterministic events, interrupted now and then by some non-deterministic events
- Example: a program controlling a pressure valve of a chemical reactor, running an endless loop with frequent inputs from pressure sensors and then making control decisions
- The value of an input is a non-deterministic event: it cannot be predicted based on the results of prior processing

18 Piecewise Deterministic Process
- However, once the input is available, the rest is predictable (assuming no failures)
- A process execution can be regarded as piecewise deterministic:
- It consists of time-slices, each of which begins with some non-deterministic event
- Given information about the non-deterministic event and the state of the process at the beginning of that time-slice, we can predict every event that happens during the time-slice
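
One consequence of piecewise determinism is that logging the non-deterministic inputs is enough to reproduce a time-slice from a checkpoint. The sketch below illustrates this with the pressure-valve example; the threshold of 100.0, the state fields, and the logged readings are hypothetical.

```python
def step(state, pressure):
    """Deterministic control decision: depends only on the state and the input."""
    return {
        "readings": state["readings"] + 1,
        "valve_open": pressure > 100.0,   # hypothetical control rule
    }


def run(initial_state, inputs):
    """Execute one deterministic step per logged non-deterministic input."""
    state = dict(initial_state)
    for pressure in inputs:
        state = step(state, pressure)
    return state


checkpoint = {"readings": 0, "valve_open": False}
input_log = [95.0, 103.5, 99.0]           # logged sensor values (hypothetical)

original = run(checkpoint, input_log)     # the execution before a failure
replayed = run(checkpoint, input_log)     # recovery: replay from the checkpoint
assert original == replayed
```

The replay matches because `step` uses nothing but its arguments; any hidden source of non-determinism (clock reads, unlogged inputs) would break this property.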

19 Process/Channel/System State
- The state of a process has the obvious meaning
- The state of a channel at time t is the set of messages carried by it up to time t (and the order of receipt)
- The state of the distributed system is the aggregate of the states of the individual processes and channels
- The state is said to be consistent if, for every message delivery, there is a corresponding message-sending event
- A state violating this, in which a message is delivered that had not yet been sent, violates causality; such a message is called an orphan
- The converse is consistent: a system state may reflect the sending of a message but not its receipt

20 Consistent and Inconsistent States
- Two processes, P and Q, each have two checkpoints taken; message m is sent by P to Q
- Sets of checkpoints representing consistent system states:
- {P_1, Q_1}: neither checkpoint has any information about m
- {P_2, Q_2}: P_2 indicates that m was sent; Q_2 indicates that it was received
- {P_2, Q_1}: P_2 indicates that m was sent; Q_1 has no record of receiving m

21 Inconsistent States
- In contrast, the set {P_1, Q_2} is an inconsistent state: P_1 has no record of m being sent, while Q_2 records that m was received, i.e., m is an orphan message
- The sets of checkpoints that represent a consistent system state are said to form a recovery line: we can roll the system back to them and restart from there
- {P_1, Q_1}: rolling back P to P_1 undoes the sending of m, and rolling back Q to Q_1 means that Q does not have any record of m
- Restarting from these checkpoints, P will again send out m, which Q will receive
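
The orphan test for the two-process example can be sketched as follows. The tuple encoding of a message and the function name `is_consistent` are assumptions made for illustration; the placement of m (sent between P_1 and P_2, received between Q_1 and Q_2) follows from what each checkpoint is said to record.

```python
def is_consistent(cut, messages):
    """A set of checkpoints is consistent if it contains no orphan message.

    cut maps each process to the number of its chosen checkpoint.
    Each message is (sender, sent_after, receiver, received_after): it is
    sent after the sender's checkpoint sent_after and received after the
    receiver's checkpoint received_after.
    """
    for sender, sent_after, receiver, received_after in messages:
        send_recorded = cut[sender] > sent_after
        receipt_recorded = cut[receiver] > received_after
        if receipt_recorded and not send_recorded:
            return False                  # orphan: received but never sent
    return True


# m is sent between P_1 and P_2 and received between Q_1 and Q_2.
messages = [("P", 1, "Q", 1)]

assert is_consistent({"P": 1, "Q": 1}, messages)      # no record of m anywhere
assert is_consistent({"P": 2, "Q": 2}, messages)      # sent and received
assert is_consistent({"P": 2, "Q": 1}, messages)      # sent, not yet received
assert not is_consistent({"P": 1, "Q": 2}, messages)  # m would be an orphan
```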

22 Inconsistent States - Cont.
- {P_2, Q_1}: rolling back P to P_2 means that it will not retransmit m; however, rolling back Q to Q_1 means that Q has no record of ever having received m
- The recovery process has to be able to play back m to Q; this can be done by adding it to the checkpoint of P or by having a separate message log recording everything received by Q
- Sometimes checkpoints can be useless: they will never form part of a recovery line, so taking them is a waste of time
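
Which messages must be played back for a given recovery line can be sketched as below. The encoding (message name, sender, checkpoint after which it was sent, receiver, checkpoint after which it was received) and the function name `messages_to_replay` are hypothetical.

```python
def messages_to_replay(cut, messages):
    """Messages whose sending is recorded in the cut but whose receipt is not."""
    return [
        name
        for name, sender, sent_after, receiver, received_after in messages
        if cut[sender] > sent_after and cut[receiver] <= received_after
    ]


# m is sent by P between P_1 and P_2 and received by Q between Q_1 and Q_2.
log = [("m", "P", 1, "Q", 1)]

assert messages_to_replay({"P": 2, "Q": 1}, log) == ["m"]  # must be played back
assert messages_to_replay({"P": 1, "Q": 1}, log) == []     # P resends m itself
assert messages_to_replay({"P": 2, "Q": 2}, log) == []     # already received
```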

23 Useless Checkpoints
- Q_2 is a useless checkpoint
- Q_2 records the receipt of m1, but not the sending of m2
- {P_1, Q_2} cannot be consistent (otherwise m1 would become an orphan); similarly, {P_2, Q_2} cannot be consistent (otherwise m2 would become an orphan)

24 The Domino Effect
- If checkpoints are not coordinated (directly, by message passing, or indirectly, by synchronized clocks), a single failure can cause a domino effect
- When P suffers a transient failure, it rolls back to checkpoint P_3
- Since message f was sent after P_3, Q has to roll back (otherwise Q would have a message that was never sent: an orphan message)
- P will then roll back to P_2, since Q sent a message e to P
- This continues until all processes have rolled back to their starting positions
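
Rollback propagation can be sketched as repeatedly removing orphan messages until none remain. The figure for this slide is not part of the transcript, so the message pattern below is a hypothetical staircase chosen to reproduce the described behaviour; the function name `recovery_line`, the tuple encoding, and the use of checkpoint 0 for the starting state are assumptions.

```python
import math


def recovery_line(cut, messages):
    """Roll processes back until no orphan message remains.

    cut maps each process to its current checkpoint number (math.inf stands
    for the live, not yet rolled back state). Each message is
    (sender, sent_after, receiver, received_after).
    """
    cut = dict(cut)
    changed = True
    while changed:
        changed = False
        for sender, sent_after, receiver, received_after in messages:
            send_recorded = cut[sender] > sent_after
            receipt_recorded = cut[receiver] > received_after
            if receipt_recorded and not send_recorded:
                # Orphan: move the receiver to its last checkpoint before receipt.
                cut[receiver] = received_after
                changed = True
    return cut


# Hypothetical message pattern; checkpoint 0 is the starting state.
domino_messages = [
    ("Q", 0, "P", 0),   # a
    ("P", 1, "Q", 0),   # b
    ("Q", 1, "P", 1),   # c
    ("P", 2, "Q", 1),   # d
    ("Q", 2, "P", 2),   # e: sent after Q_2, received by P before P_3
    ("P", 3, "Q", 2),   # f: sent after P_3
]

# P fails and restarts from P_3; Q is still in its live state.
line = recovery_line({"P": 3, "Q": math.inf}, domino_messages)
assert line == {"P": 0, "Q": 0}   # both processes end up at their starting states
```

The loop terminates because every change strictly lowers one process's checkpoint number, and there are only finitely many checkpoints.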

25 Lost Message
- Messages can be lost due to rollback:
- Suppose Q rolls back to Q_1 after receiving message x from P
- The record of having received x is lost
- If P does not roll back to P_2, it is as if P had sent a message that was never received by Q
- Lost messages do not violate causality; they resemble messages lost due to network problems and can be handled by retransmission
- However, if Q sent an ACK of x to P before rolling back, then that ACK will be an orphan message unless P rolls back to P_2

26 Example of Livelock
- Livelock is another problem that can arise in distributed checkpointed systems
- Q sends P a message q1; P sends Q a message p1
- Then P fails at the point shown, before receiving q1. To prevent p1 from being orphaned, Q must roll back to Q_1
- In the meantime, P recovers, rolls back to P_2, sends another copy of p1, and then receives the copy of q1 that was sent before all the rollbacks began
- However, since Q has rolled back, this copy of q1 is now orphaned, and so P has to repeat its rollback
- This in turn orphans the second copy of p1 as well, forcing Q to repeat its rollback too
- This dance of the rollbacks can continue indefinitely

