SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery
Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, David A. Wood
March 31st, 2006

Target: systems where availability is crucial
- SMP commercial servers: application services, database management systems
Motivation:
- Increase in performance => decrease in feature size => decrease in reliability
- Cost of the fault-tolerant solution: important

Approach and Challenges
- Decouple:
  - Local fault detection: ECC, timeout, etc.
  - Lightweight, global fault recovery: SafetyNet
- Challenges for lightweight recovery schemes:
  - Amount of storage (checkpoint logs)
  - Maintaining a consistent global recovery point
  - Advancing the global recovery point

SafetyNet: High-Level View
- Maintain per-processor checkpoints:
  - One globally validated recovery point
  - Multiple coordinated checkpoints pending validation
  - Identified by a global logical timestamp
- Fault detected => recover state to the (global) recovery point

Solutions: Storage
- Checkpoint the architectural state
- Registers: shadow registers or cached copies
  - Copy once at the beginning of each checkpoint
- Memory and caches: Checkpoint Log Buffers (CLBs)
  - Incrementally log stores and ownership changes
  - Log only the first update per block per checkpoint
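The "log only the first update per block per checkpoint" rule can be sketched as follows. This is a minimal software model, not the hardware design from the slides: the class name `CheckpointLogBuffer`, its fields, and the use of a dictionary as memory are all assumptions made for illustration.

```python
class CheckpointLogBuffer:
    """Toy model of a CLB (hypothetical names; not the actual hardware)."""

    def __init__(self):
        self.memory = {}      # block address -> current value
        self.block_cn = {}    # block address -> checkpoint number of its last logged update
        self.log = []         # entries: (checkpoint_number, address, old_value)
        self.current_cn = 1   # checkpoint interval currently executing

    def store(self, addr, value):
        # Log the old value only if this block has not yet been
        # logged in the current checkpoint interval.
        if self.block_cn.get(addr) != self.current_cn:
            self.log.append((self.current_cn, addr, self.memory.get(addr)))
            self.block_cn[addr] = self.current_cn
        self.memory[addr] = value

    def start_checkpoint(self):
        # Beginning a new interval makes every block eligible for logging again.
        self.current_cn += 1


clb = CheckpointLogBuffer()
clb.store(0x40, 1)      # first update in interval 1: logged
clb.store(0x40, 2)      # second update in interval 1: not logged
clb.start_checkpoint()
clb.store(0x40, 3)      # first update in interval 2: logged again
```

Because only the pre-interval value is needed to roll a block back, repeated stores to a hot block cost one log entry per interval, which is what keeps the storage requirement small.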

Solution: Global Coherence
- Logical time base:
  - General agreement on the checkpoint interval for each coherence transaction
  - Loosely synchronous checkpoint clock
- Maintain a per-block checkpoint number (CN)
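One way to picture how nodes with slightly skewed clocks can still agree on the checkpoint interval of a coherence transaction is sketched below. It assumes that the node giving up ownership chooses the interval and that the requester simply adopts that number. The `Node` class, `transfer_ownership`, the cycle counts, and the skew values are all hypothetical and are not taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    clock_offset: int                              # skew vs. an ideal clock, in cycles
    block_cn: dict = field(default_factory=dict)   # block address -> checkpoint number (CN)

    def current_cn(self, ideal_cycle, interval):
        # Each node derives the checkpoint number from its own, slightly skewed clock.
        return (ideal_cycle + self.clock_offset) // interval + 1


def transfer_ownership(owner, requester, addr, ideal_cycle, interval):
    """Assumption: the owner picks the interval and the requester adopts it."""
    cn = owner.current_cn(ideal_cycle, interval)
    owner.block_cn.pop(addr, None)
    requester.block_cn[addr] = cn   # per-block CN recorded at the new owner
    return cn


a = Node("A", clock_offset=0)
b = Node("B", clock_offset=-50)     # B's clock lags by 50 cycles
# At ideal cycle 1020 with 1000-cycle intervals, A is in interval 2 while B
# still believes it is in interval 1; the transfer is assigned A's interval.
cn = transfer_ownership(a, b, addr=0x80, ideal_cycle=1020, interval=1000)
```

The point of the sketch is only that a single participant fixes the interval, so both ends of the transaction log it under the same checkpoint number even though their clocks disagree.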

Solution: Global Recovery Point
- Checkpoint validation:
  - All components agree that execution up to that point is error-free
  - Broadcast the new Recovery Point Checkpoint Number
- Restart:
  - Drain the interconnection network
  - Discard in-progress coherence state
  - Processors: restore the register checkpoint
  - Memory: undo the actions recorded in the Checkpoint Log Buffers (CLBs)
  - Caches: undo the CLB entries
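Validation and recovery can be sketched in a few lines, under stated assumptions: each log entry is a tuple `(checkpoint_number, address, old_value)`, every node reports the highest interval it has found fault-free, and the recovery point is the start of the first interval that is not yet validated everywhere. The function names and the sample data are illustrative only.

```python
def advance_recovery_point(validated_by_node):
    # The recovery point can move only as far as the slowest node has validated.
    return min(validated_by_node.values()) + 1


def recover(memory, log, rpcn):
    """Undo, newest first, every logged update from interval rpcn or later.

    Mutates `memory` in place and returns the log with the undone entries removed.
    """
    for cn, addr, old_value in reversed(log):
        if cn >= rpcn:
            if old_value is None:
                memory.pop(addr, None)    # block did not exist before this update
            else:
                memory[addr] = old_value
    return [entry for entry in log if entry[0] < rpcn]


memory = {0x40: 3, 0x80: 9}
log = [(1, 0x40, None), (2, 0x40, 1), (2, 0x80, None), (3, 0x40, 2)]

rpcn = advance_recovery_point({"P0": 1, "P1": 2})   # interval 1 is validated everywhere
log = recover(memory, log, rpcn)                    # roll back intervals 2 and 3
```

Undoing entries in reverse order matters: a block logged in several intervals must end up with the oldest unvalidated value, which is the one written last during the rollback.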

Evaluation: Performance Impact

Evaluation: Sensitivity

Evaluation: Sensitivity (Cont)

Questions
- Why is having a coordinated checkpoint important?
- Why broadcast the Recovery Point Checkpoint Number twice: when advancing the recovery point and when triggering recovery?
- Why a sequentially consistent model? Is the scheme valid for processor consistency?
- Is this a good idea? Has it caught on?
