1
7. Fault Tolerance Through Dynamic (or Standby) Redundancy
The lowest-cost fault-tolerance technique in multiprocessors.
Steps performed:
– When a fault is detected, a fault location (diagnosis) procedure is triggered.
– The faulty processor is then replaced, through reconfiguration, by a spare processor or by spare processing capability.
– Finally, error recovery is performed: the spare processor, typically using checkpointed information, takes over the computations of the faulty processor from where it left off.
2
In summary, dynamic redundancy is performed in 3 steps:
I. Fault detection and location
II. Reconfiguration of the system around the faulty processor
III. Error recovery
3
Fault detection
Several approaches perform fault detection in multiprocessors:
– Scheduled off-line testing for permanent faults
– Duplication and comparison *
– Diagnostics and coding techniques *
* Described next.
4
7.1 Fault Detection in Multiprocessors
7.1.1 Fault Detection Through Duplication and Comparison
A) Each processor of the multiprocessor can be duplicated, and the results of the two copies compared before they are communicated to other processor pairs.
5
B) Another approach is to divide the P processors of a multiprocessor into P/2 pairs. The global memory, which consists of M memory modules, can likewise be divided into M/2 pairs. Comparators can be kept inside each processor and memory module, and the results of both computations must match for an operation to be executed. If an error is detected by a processor pair, both processors of the pair are powered off, and the computations proceed on the P-2 remaining processors, configured as (P-2)/2 pairs of processors.
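The pairing scheme in B) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation behind the slides: `PairedSystem`, `run`, and the idea of passing the processor id to the task (so that a fault can be injected on one processor) are hypothetical names and choices made for the example.

```python
from typing import Callable, List, Tuple


class PairedSystem:
    """P processors grouped into P/2 duplicate-and-compare pairs (hypothetical model)."""

    def __init__(self, num_processors: int) -> None:
        if num_processors % 2 != 0:
            raise ValueError("P must be even to form P/2 pairs")
        # Each pair is (first processor id, second processor id).
        self.pairs: List[Tuple[int, int]] = [
            (i, i + 1) for i in range(0, num_processors, 2)
        ]

    def run(self, pair_index: int, task: Callable[[int], int]) -> int:
        """Run `task` on both processors of a pair and compare the results.

        `task` receives the processor id, which lets a caller simulate a
        fault on a single processor. On a mismatch, the whole pair is
        powered off (removed from the configuration) and an error is raised.
        """
        first, second = self.pairs[pair_index]
        result_a = task(first)
        result_b = task(second)
        if result_a != result_b:
            del self.pairs[pair_index]
            raise RuntimeError(f"mismatch in pair ({first}, {second}); pair powered off")
        return result_a


system = PairedSystem(8)                       # P = 8, so 4 pairs
value = system.run(0, lambda pid: 42)          # healthy pair: results agree
try:
    # Processor 3 (in pair index 1) returns a wrong result.
    system.run(1, lambda pid: 41 if pid == 3 else 42)
except RuntimeError:
    pass
remaining_pairs = len(system.pairs)            # (P - 2) / 2 pairs remain
```

Note that comparison only detects the disagreement; it cannot tell which processor of the pair is faulty, which is why the sketch, like the slide, retires both.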
6
C) Alternatively, the comparison operation can be performed in software, by means of checkpoint comparison techniques.
7
D) Finally, the duplication and comparison operation can be performed by means of time redundancy. This is useful when the hardware redundancy of duplication cannot be afforded because of cost, weight, power, or space constraints (e.g., on-board, battery-powered electronics).
In the presence of task dependencies (see example), processors are often idle because no task is ready. In such situations, one can map the original task graph onto P/2 processors, obtaining better processor utilization, and use the remaining P/2 processors to perform the duplicate computation of the task graph. Hence, for real task graphs, the time overhead can be less than 100%.
8
[Figure: a task graph with tasks 1 to 7 and its mapping onto processors (axes: tasks, processors).
a) Original task graph mapping: 1 | 2 | 3 | 4 | 5 | 6 | 7 | -
b) Example of mapping duplicated task graphs on disjoint sets of processors: 1 | 2,5 | 3,7 | 4,6 | 1d | 2d,5d | 3d,7d | 4d,6d]
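The time-redundancy argument in D) can be checked on a small example. The sketch below assumes unit-time tasks and a simple greedy list scheduler; the dependency edges in `DEPS` are hypothetical (the slide's figure does not survive in enough detail to recover them), and `makespan` and `time_overhead` are names invented for the illustration. Because the duplicate graph runs on the other P/2 processors in parallel, the total time with duplication is the makespan of one copy on P/2 processors.

```python
from typing import Dict, List

# Hypothetical task graph: task -> list of predecessor tasks.
DEPS: Dict[int, List[int]] = {1: [], 2: [1], 3: [1], 4: [1], 5: [2], 6: [4], 7: [3]}


def makespan(deps: Dict[int, List[int]], num_processors: int) -> int:
    """Number of unit-time steps needed to run all tasks on `num_processors`."""
    done: set = set()
    steps = 0
    while len(done) < len(deps):
        # Tasks whose predecessors all finished in earlier steps.
        ready = sorted(
            t for t in deps if t not in done and all(p in done for p in deps[t])
        )
        if not ready:
            raise ValueError("cyclic dependencies")
        done.update(ready[:num_processors])  # run as many as there are processors
        steps += 1
    return steps


def time_overhead(deps: Dict[int, List[int]], num_processors: int) -> float:
    """Relative slowdown when each copy of the graph gets only P/2 processors."""
    base = makespan(deps, num_processors)
    duplicated = makespan(deps, num_processors // 2)
    return (duplicated - base) / base


overhead = time_overhead(DEPS, 8)
```

For this particular graph the dependencies never allow more than three tasks to run at once, so halving the processors from 8 to 4 costs nothing and the duplicate computation comes at no extra time. Other graphs will show a nonzero overhead.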
9
7.1.2 Fault Detection Using Diagnostics and Coding Techniques
See “2.2 Information Redundancy”
10
7.2 Recovery Strategies for Multiprocessor Systems
Since most faults are transient or intermittent, a simple recovery procedure may be merely to re-execute the computation.
Recovery issues are more complex in distributed systems (communicating processes): one has to ensure that the correct execution of one process is not affected by the faulty execution of a communicating process.
Recovery techniques differ for distributed-memory and shared-memory multiprocessors: multiple processes can access memory and hold different or erroneous copies of the same variables, creating an inconsistent state when the error is detected.
Therefore, some scheme must be devised that stores enough error-free processor state information in a reliable place, from where it can be retrieved and used to restart the program from a consistent state (rollback recovery) in the event of a transient failure in one or more processors during program execution.
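The simplest strategy mentioned above, re-executing the computation, can be sketched in a few lines. This is an illustration only: `TransientFault`, `run_with_reexecution`, and the retry limit are assumptions made for the example, and real systems detect faults by the mechanisms of Section 7.1 rather than by a Python exception.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientFault(Exception):
    """Stands in for a detected error (hypothetical; e.g., a comparison mismatch)."""


def run_with_reexecution(computation: Callable[[], T], max_attempts: int = 3) -> T:
    """Re-execute `computation` until it completes without a detected fault."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return computation()
        except TransientFault as err:
            last_error = err  # assume the fault was transient and try again
    # The fault persisted across all attempts, so it is probably not transient.
    raise RuntimeError("fault persisted; treat as permanent") from last_error


calls = {"count": 0}


def flaky_sum() -> int:
    """Fails on its first two invocations, then succeeds."""
    calls["count"] += 1
    if calls["count"] <= 2:
        raise TransientFault("bit flip detected")
    return sum(range(10))


result = run_with_reexecution(flaky_sum)
```

Re-execution from the start is only attractive for short computations with no side effects on other processes, which is what motivates the checkpointing schemes that follow.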
11
Checkpointing
The most popular scheme is checkpointing. It involves storing as much information about the processor state as necessary at discrete points in the program (checkpoints, or rollback points), so that in the event of a node failure the program can be rolled back to those points and restarted from there, as though no fault had occurred.
Processor state: this varies from one system to another. Generally it comprises the register set of the processor, the program counter, the state of the cache, and possibly memory as well, or at least those parts of it that have been altered by the processor since the last checkpoint.
This information is stored in reliable storage, that is, memory assumed not to fail. Such a memory could be a disk, memory protected by error-correcting codes, or duplicated memory and/or registers.
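A minimal sketch of checkpointing with rollback, under simplifying assumptions: the "processor state" is a Python dictionary plus a step counter standing in for the program counter, the `stable` variable stands in for reliable storage, and `TransientFault`, `run_with_checkpoints`, and the parameters `checkpoint_every` and `max_rollbacks` are hypothetical names chosen for the example.

```python
import copy
from typing import Any, Callable, Dict, List

State = Dict[str, Any]


class TransientFault(Exception):
    """Stands in for a detected error during a step (hypothetical)."""


def run_with_checkpoints(
    steps: List[Callable[[State], None]],
    state: State,
    checkpoint_every: int = 2,
    max_rollbacks: int = 10,
) -> State:
    """Run `steps` in order, checkpointing periodically and rolling back on faults."""
    # Initial checkpoint: the state before any step has run.
    stable = {"pc": 0, "state": copy.deepcopy(state)}
    pc = 0
    rollbacks = 0
    while pc < len(steps):
        try:
            steps[pc](state)
            pc += 1
            if pc % checkpoint_every == 0:
                # Establish a new rollback point in (simulated) reliable storage.
                stable = {"pc": pc, "state": copy.deepcopy(state)}
        except TransientFault:
            rollbacks += 1
            if rollbacks > max_rollbacks:
                raise
            # Roll back: discard the possibly corrupted state and resume
            # from the last checkpoint as though no fault had occurred.
            pc = stable["pc"]
            state = copy.deepcopy(stable["state"])
    return state


def add_one(s: State) -> None:
    s["x"] += 1


faults = {"remaining": 1}


def flaky_add_ten(s: State) -> None:
    if faults["remaining"] > 0:
        faults["remaining"] -= 1
        s["x"] = -999  # corrupt the active state before the fault is detected
        raise TransientFault("error detected")
    s["x"] += 10


final_state = run_with_checkpoints([add_one, add_one, flaky_add_ten, add_one], {"x": 0})
```

The corrupted value written by the faulty step never reaches the result, because the rollback restores the copy taken at the last checkpoint.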
12
7.3 Rollback Recovery Using Checkpoints
Rollback recovery using checkpoints is a very cost-effective method of providing fault tolerance against transient and intermittent faults. Various implementations and overhead issues are illustrated in the following.
13
7.3.1 Processor Cache-Based Checkpoints
[Figure: Processor-based checkpoint and rollback recovery. Active state: CPU registers, cache. Checkpoint state: main memory, CPU register save area.]
[Figure: Fault-tolerant technique to flush the cache. Bank A holds Ta1, data, Ta2; Bank B holds Tb1, data, Tb2. Sequence: 1: Ta1, 2: flush, 3: Ta2 (Bank A); 4: Tb1, 5: flush, 6: Tb2 (Bank B).]
14
Condition                  Failure    Action
Ta1 = Ta2 = Tb1 = Tb2      None
Ta1 > Ta2 = Tb1 = Tb2      Flush A    Copy Bank B to A
Ta1 = Ta2 > Tb1 = Tb2      Between    Copy Bank A to B
Ta1 = Ta2 = Tb1 > Tb2      Flush B    Copy Bank A to B
Failure conditions.
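The failure-condition table can be read as a decision procedure over the four timestamps. The sketch below encodes it directly; the function name `recovery_action`, the returned strings, and the use of integers for timestamps are assumptions made for the illustration, and the first row is taken to mean that no action is needed.

```python
def recovery_action(ta1: int, ta2: int, tb1: int, tb2: int) -> str:
    """Map the bank timestamps (Ta1, Ta2, Tb1, Tb2) to the table's recovery action."""
    if ta1 == ta2 == tb1 == tb2:
        return "none"                 # no failure: both banks are complete
    if ta1 > ta2 and ta2 == tb1 == tb2:
        return "copy Bank B to A"     # failed while flushing into Bank A
    if ta1 == ta2 and ta2 > tb1 and tb1 == tb2:
        return "copy Bank A to B"     # failed between the two flushes
    if ta1 == ta2 == tb1 and tb1 > tb2:
        return "copy Bank A to B"     # failed while flushing into Bank B
    raise ValueError("timestamps match no row of the table")


# Bank A's flush was interrupted: Ta1 was advanced to 6, but Ta2 is still 5.
action = recovery_action(6, 5, 5, 5)
```

In every failure row, the bank whose two timestamps agree holds a complete checkpoint and is the source of the copy.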
15
7.3.2 Virtual Checkpoints
[Figure: Basic concept. Virtual memory pages map to checkpoint copies (v < V) and active copies (v = V) held in real memory and on the paging disk.]
[Figure: Overview of single page mapping. Labels: r0, r1, d0, d1, m0, m1, lv.]
16
[Figure: Case 1: first reference after checkpoint. Timeline with checkpoints tc1, tc2 (V = 0, 1, 2) and references tr0, tr1, tr2; m0 is the checkpoint copy and m1 the active copy (v = 1).]
[Figure: Case 2: page previously referenced. Timeline with checkpoint tc2 (V = 1, 2) and references tr1, tr2, tr3; m1 is the checkpoint copy and m2 the active copy (v = 2).]
17
[Figure: Primary process and backup process. The primary process checkpoints its state with the backup process.]
[Figure: Primary process and backup process. “I am alive” messages are used for fault detection.]
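The primary/backup arrangement described by these captions can be sketched as follows. Everything here is an assumption made for illustration: the class name `BackupProcess`, the timeout-based rule for suspecting the primary, and the choice to pass the current time explicitly (so the example is deterministic) are not taken from the slides.

```python
from typing import Any, Dict, Optional


class BackupProcess:
    """Holds the primary's last checkpoint and watches for "I am alive" messages."""

    def __init__(self, timeout: float, start_time: float = 0.0) -> None:
        self.timeout = timeout
        self.last_alive = start_time
        self.checkpoint: Optional[Dict[str, Any]] = None

    def receive_checkpoint(self, state: Dict[str, Any], now: float) -> None:
        """Store a copy of the primary's state; also counts as a sign of life."""
        self.checkpoint = dict(state)
        self.last_alive = now

    def receive_alive(self, now: float) -> None:
        """Record an "I am alive" message from the primary."""
        self.last_alive = now

    def primary_suspected(self, now: float) -> bool:
        """True if no message has arrived from the primary within `timeout`."""
        return now - self.last_alive > self.timeout

    def take_over(self) -> Dict[str, Any]:
        """Return the state from which the backup resumes the computation."""
        if self.checkpoint is None:
            raise RuntimeError("no checkpoint received; cannot take over")
        return dict(self.checkpoint)


backup = BackupProcess(timeout=3.0)
backup.receive_checkpoint({"pc": 10, "acc": 7}, now=1.0)
backup.receive_alive(now=2.0)
still_fine = backup.primary_suspected(now=4.0)    # 2.0 time units of silence
suspected = backup.primary_suspected(now=6.0)     # 4.0 time units of silence
resume_state = backup.take_over() if suspected else None
```

A timeout can only suspect a failure, since a slow primary and a dead one look the same to the backup. The timeout value trades detection latency against false alarms.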
18
7.4 Rollback Recovery in Communicating Multiprocessors
7.4.1 Shared-Memory Multiprocessors
Bus lines.
Bus Line                   Set by Processor to Indicate...
Shared                     sharing a block on the bus.
Establish Rollback Point   that a rollback point is being established.
Rollback                   that it is backing up to the prior rollback point.
19
7.4.2 Distributed-Memory Multiprocessors
[Figure: Domino effect in recovery of multiple processes (P1, P2). Legend: checkpoint, communication.]
20
[Figure: Consistent and inconsistent recovery lines.]
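Since only the captions of these two figures survive, the following sketch illustrates the ideas they name under the usual textbook definitions, which are an assumption here: a recovery line (one checkpoint per process) is treated as inconsistent if some message is recorded as received before the receiver's checkpoint although it was sent after the sender's checkpoint. The names `Message`, `is_consistent`, and `latest_consistent_line`, and the message times in the example, are hypothetical. Event times are assumed to be positive and distinct from checkpoint times, and every process is assumed to have an initial checkpoint at time 0.

```python
from typing import Dict, List, NamedTuple


class Message(NamedTuple):
    sender: str
    send_time: int
    receiver: str
    recv_time: int


def is_orphan(m: Message, line: Dict[str, int]) -> bool:
    """Received before the receiver's checkpoint, sent after the sender's."""
    return m.recv_time < line[m.receiver] and m.send_time > line[m.sender]


def is_consistent(line: Dict[str, int], messages: List[Message]) -> bool:
    """A recovery line is consistent if it cuts no orphan message."""
    return not any(is_orphan(m, line) for m in messages)


def latest_consistent_line(
    checkpoints: Dict[str, List[int]], messages: List[Message]
) -> Dict[str, int]:
    """Roll processes back, one checkpoint at a time, until the line is consistent.

    `checkpoints[p]` is sorted in ascending order and starts with 0.
    """
    index = {p: len(times) - 1 for p, times in checkpoints.items()}
    while True:
        line = {p: checkpoints[p][i] for p, i in index.items()}
        orphan = next((m for m in messages if is_orphan(m, line)), None)
        if orphan is None:
            return line
        index[orphan.receiver] -= 1  # the receiver must undo the receipt


CHECKPOINTS = {"P1": [0, 4, 8], "P2": [0, 6, 10]}
MESSAGES = [
    Message("P2", 2, "P1", 3),
    Message("P1", 5, "P2", 5),
    Message("P2", 7, "P1", 7),
    Message("P1", 9, "P2", 9),
]
line = latest_consistent_line(CHECKPOINTS, MESSAGES)
```

With this interleaving of checkpoints and messages, each rollback of one process orphans a message received by the other, so the rollbacks cascade all the way to the initial state. That cascade is the domino effect.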
21
7.4.3 Recovery in Distributed Shared-Memory Systems
– Typically, distributed shared-memory (DSM) systems are loosely coupled, geographically distributed systems of processors, each processor with its own memory.
– They are implemented using virtual memory: programmers see a single shared memory, which in reality is made up of individual memories residing in different processors.
– Pages are used as the basic blocks of memory transfer. Each node keeps in its own local memory a subset of the pages of the shared virtual memory.
– A page fault is generated whenever a node tries to access a nonresident page. A page request is then generated and sent to a distinguished owner node that has a copy of the page needed. Upon reception of the page request, the owner node transfers the page to the requester, which then becomes the new owner.
– An owner node keeps a page table with information on the nodes that hold read-only copies of the pages it owns.
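The page-fault and ownership-transfer behaviour in the list above can be sketched as a toy model. The assumptions are significant: `DsmNode` and `Dsm` are hypothetical names, a single global `owner` directory is used to locate the owner (the slide does not say how the requester finds it), pages are byte strings, and read-only copies are left out entirely.

```python
from typing import Dict, List


class DsmNode:
    """A node holding the pages it currently owns in local memory."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.pages: Dict[int, bytes] = {}


class Dsm:
    """Toy distributed shared memory with ownership migrating on page faults."""

    def __init__(self, nodes: List[DsmNode]) -> None:
        self.nodes = {n.name: n for n in nodes}
        # Assumed directory: page number -> name of the owning node.
        self.owner: Dict[int, str] = {
            page: n.name for n in nodes for page in n.pages
        }
        self.page_faults = 0

    def access(self, node_name: str, page: int) -> bytes:
        node = self.nodes[node_name]
        if page not in node.pages:
            # Page fault: request the page from its owner, which transfers it.
            self.page_faults += 1
            owner = self.nodes[self.owner[page]]
            node.pages[page] = owner.pages.pop(page)
            self.owner[page] = node_name  # the requester becomes the new owner
        return node.pages[page]


a, b = DsmNode("A"), DsmNode("B")
a.pages[0] = b"page zero"
dsm = Dsm([a, b])
data = dsm.access("B", 0)      # nonresident on B: page fault, ownership moves
again = dsm.access("B", 0)     # now resident on B: no fault
```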
22
– A local checkpoint for such a system consists of the following information, which must be saved:
a) the contents of locally owned pages that have been modified since the last checkpoint on the local node;
b) the page-table entries for locally owned pages that have been modified since the last checkpoint.
– This is in addition to the state information of the local processor, which is also stored with each checkpoint in reliable storage.
– How the reliable storage is implemented depends on the resources available, as well as on the level of reliability desired from the system.
– A process on a recovering processor is expected to retrieve any clean pages that it might need from previous checkpoints stored on disk, in addition to any dirty pages that were stored in the last checkpoint before the failure.
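A local checkpoint of this kind is incremental: only what changed since the previous checkpoint is saved, so recovery has to combine several checkpoints. The sketch below shows one way this could look; `OwnedPage`, `take_local_checkpoint`, `recover_pages`, the two dirty flags, and the representation of a page-table entry as the set of nodes holding read-only copies are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Dict, FrozenSet, List


@dataclass
class OwnedPage:
    data: bytes
    copyset: FrozenSet[str] = frozenset()  # nodes holding read-only copies
    data_dirty: bool = False               # contents modified since last checkpoint
    entry_dirty: bool = False              # page-table entry modified since then


def take_local_checkpoint(
    owned: Dict[int, OwnedPage], processor_state: Dict[str, Any]
) -> Dict[str, Any]:
    """Save processor state plus modified pages and modified page-table entries."""
    checkpoint = {
        "processor_state": dict(processor_state),
        "pages": {n: p.data for n, p in owned.items() if p.data_dirty},
        "page_table": {n: p.copyset for n, p in owned.items() if p.entry_dirty},
    }
    for p in owned.values():
        p.data_dirty = False
        p.entry_dirty = False
    return checkpoint


def recover_pages(checkpoints: List[Dict[str, Any]]) -> Dict[int, bytes]:
    """Rebuild page contents from checkpoints ordered oldest to newest."""
    pages: Dict[int, bytes] = {}
    for cp in checkpoints:
        pages.update(cp["pages"])  # newer versions overwrite older ones
    return pages


owned = {
    0: OwnedPage(b"v1", data_dirty=True),
    1: OwnedPage(b"w1", data_dirty=True),
}
cp1 = take_local_checkpoint(owned, {"pc": 100})
owned[0].data, owned[0].data_dirty = b"v2", True    # only page 0 changes
cp2 = take_local_checkpoint(owned, {"pc": 200})
restored = recover_pages([cp1, cp2])
```

Page 1 is clean at the second checkpoint and is therefore absent from `cp2`; recovery finds it in the earlier checkpoint, as the last bullet describes.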
23
7.4.4 Recovery in Database Systems
– Database systems employ atomic actions known as transactions to maintain consistency and integrity in the presence of concurrent activities.
– Since transactions are atomic activities, when a transaction is aborted its actions have to be undone to restore consistency to the system.
– Because of the “all-or-nothing” property of atomic actions, a significant amount of work might be abandoned needlessly when an internal error is encountered.
24
– Shadowing is a typical recovery-oriented mechanism in database systems. It involves using a new disk page to write the modified version of a database page. When the transaction completes (commits), the page to which it was writing becomes the permanent page; if the transaction aborts, that page is discarded. Recovery is fast, since it only involves discarding the modified pages into which the transactions in the active list are writing.
– Thus, a scheme for distributed systems has been considered which uses pages as the indivisible unit of memory that is stored as part of a checkpoint and used for recovery.
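Shadowing as described above can be sketched with an in-memory model. This is an illustration under stated assumptions rather than any particular database's mechanism: `ShadowPagedDb` and its methods are hypothetical names, pages are byte strings in dictionaries instead of disk pages, and concurrency control between transactions is ignored.

```python
from typing import Dict


class ShadowPagedDb:
    """Toy page store where each transaction writes to its own shadow pages."""

    def __init__(self, pages: Dict[int, bytes]) -> None:
        self.committed: Dict[int, bytes] = dict(pages)
        # Active list: transaction id -> the new pages it is writing into.
        self.active: Dict[str, Dict[int, bytes]] = {}

    def begin(self, txn: str) -> None:
        self.active[txn] = {}

    def write(self, txn: str, page: int, data: bytes) -> None:
        # The committed page is left untouched; the update goes to a new page.
        self.active[txn][page] = data

    def read(self, txn: str, page: int) -> bytes:
        return self.active[txn].get(page, self.committed[page])

    def commit(self, txn: str) -> None:
        # The pages the transaction was writing become the permanent pages.
        self.committed.update(self.active.pop(txn))

    def abort(self, txn: str) -> None:
        self.active.pop(txn)  # discard the modified pages

    def recover(self) -> None:
        # After a failure: discard the pages of all transactions still active.
        self.active.clear()


db = ShadowPagedDb({0: b"old0", 1: b"old1"})
db.begin("t1")
db.write("t1", 0, b"new0")
db.commit("t1")
db.begin("t2")
db.write("t2", 1, b"new1")
seen_by_t2 = db.read("t2", 1)
db.recover()                      # failure before t2 commits
```

Because uncommitted updates never overwrite committed pages, recovery needs no undo work beyond dropping the shadow pages.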