Published by Eileen Price. Modified over 9 years ago.
1
SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery
Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, David A. Wood
March 31st, 2006
2
Target: systems where availability is crucial
  SMP commercial servers: application services, database management systems
Motivation: increase in performance => decrease in feature size => decrease in reliability
Cost of a fault-tolerant solution: important
3
Approach and Challenges
Decouple:
  Local fault detection: ECC, timeout, etc.
  Lightweight, global fault recovery: SafetyNet
Challenges for lightweight recovery schemes:
  Amount of storage (checkpoint logs)
  Maintaining a consistent global recovery point
  Advancing the global recovery point
4
SafetyNet: High-Level View
Maintain per-processor checkpoints:
  One globally validated recovery point
  Multiple coordinated checkpoints pending validation
  Identified by a global logical timestamp
Fault detected => recover state to the global recovery point
5
Solutions: Storage
Checkpoint architectural state:
  Registers: shadow registers or cached copies
    Copy once at the beginning of each checkpoint
  Memory and caches: Checkpoint Log Buffers (CLBs)
    Incrementally log stores and ownership changes
    Log only the first update per block per checkpoint
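The logging rule on this slide (log only the first update per block per checkpoint) can be illustrated with a small Python sketch. This is a toy model under stated assumptions, not the hardware design: the class name `LoggedMemory`, its fields, and the use of `None` for "block not yet written" are all hypothetical, and ownership changes are not modeled.

```python
class LoggedMemory:
    """Toy memory with a checkpoint log buffer (CLB). Hypothetical model."""

    def __init__(self):
        self.data = {}        # block address -> current value
        self.block_cn = {}    # block address -> checkpoint in which it was last logged
        self.clb = []         # log entries: (checkpoint number, address, old value)
        self.current_cn = 1   # checkpoint interval currently executing

    def start_checkpoint(self):
        # Begin a new checkpoint interval; nothing is copied eagerly.
        self.current_cn += 1

    def store(self, addr, value):
        # Log the old value only on the first update to this block
        # within the current checkpoint interval.
        if self.block_cn.get(addr, 0) < self.current_cn:
            self.clb.append((self.current_cn, addr, self.data.get(addr)))
            self.block_cn[addr] = self.current_cn
        self.data[addr] = value


mem = LoggedMemory()
mem.store(0x40, 1)      # first update in checkpoint 1: logged
mem.store(0x40, 2)      # second update in checkpoint 1: not logged
mem.store(0x80, 7)      # different block: logged
mem.start_checkpoint()
mem.store(0x40, 3)      # first update in checkpoint 2: logged with old value 2
```

Four stores produce only three log entries here, which is the point of the rule: log storage grows with the number of distinct blocks touched per interval, not with the number of stores.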
6
Solution: Global Coherence
Logical time base:
  General agreement on the checkpoint interval for each coherence transaction
  Loosely synchronous checkpoint clock
  Maintain a per-block checkpoint number (CN)
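One way to picture agreement on a checkpoint interval with only loosely synchronized clocks is sketched below in Python. The details are assumptions made for illustration and are not stated on the slide: each node derives its checkpoint number from a skewed local clock, the sender stamps the transaction with its own CN, and clock skew is assumed to be smaller than the message latency. `INTERVAL`, `Node`, and `transfer_ownership` are hypothetical names.

```python
INTERVAL = 100  # hypothetical checkpoint interval, in clock ticks


class Node:
    def __init__(self, skew):
        self.skew = skew  # offset of this node's clock from an ideal global clock

    def checkpoint_number(self, global_time):
        # Each node derives the current checkpoint number from its own clock.
        return (global_time + self.skew) // INTERVAL + 1


def transfer_ownership(owner, requester, send_time, latency):
    """Return (CN assigned to the transaction, requester's local CN on arrival)."""
    # Assumption: the owner stamps the response with its CN, and the
    # requester uses that stamp, so both attribute the transfer to one interval.
    transaction_cn = owner.checkpoint_number(send_time)
    arrival_cn = requester.checkpoint_number(send_time + latency)
    return transaction_cn, arrival_cn


owner = Node(skew=0)
requester = Node(skew=-5)
cn, arrival_cn = transfer_ownership(owner, requester, send_time=98, latency=10)
```

In this run the owner assigns the transfer to checkpoint 1 while the requester is already in checkpoint 2 when the message arrives. Because the message carries the CN, both sides still record the transfer under the same interval, and with skew below latency a message never arrives in an interval earlier than the one it was sent in.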
7
Solution: Global Recovery Point
Checkpoint validation:
  All components agree that execution up to that point is error free
  Broadcast the new Recovery Point Checkpoint Number
Restart:
  Drain the interconnection network
  Discard in-progress coherence state
  Processors: restore the register checkpoint
  Memory: undo actions in Checkpoint Log Buffers (CLBs)
  Caches: undo actions in CLBs
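The validation and restart steps can be sketched in Python for a single memory with an undo log. This is an illustrative model under assumptions, not the authors' implementation: log entries are `(checkpoint number, address, old value)` tuples, an old value of `None` means the block did not exist, checkpoint `n` denotes the state at the start of interval `n`, and the function names are hypothetical. Draining the network and restoring registers are not modeled.

```python
def global_recovery_point(validated_by_node):
    # Assumption: a checkpoint is globally validated only when every
    # node has declared it error free, so take the minimum.
    return min(validated_by_node.values())


def advance_recovery_point(clb, new_rpcn):
    # Entries older than the recovery point can never be undone again,
    # so their log storage can be reclaimed.
    return [entry for entry in clb if entry[0] >= new_rpcn]


def recover(data, clb, rpcn):
    # Undo, newest first, every update logged at or after the recovery point.
    for cn, addr, old in reversed(clb):
        if cn >= rpcn:
            if old is None:
                data.pop(addr, None)
            else:
                data[addr] = old
    return [entry for entry in clb if entry[0] < rpcn]


data = {0x40: 3, 0x80: 7}
clb = [(1, 0x40, None), (1, 0x80, None), (2, 0x40, 2)]

rpcn = global_recovery_point({"node0": 2, "node1": 3})  # all agree up to 2
clb = advance_recovery_point(clb, rpcn)                  # drop checkpoint-1 entries
clb = recover(data, clb, rpcn)                           # fault: roll back to checkpoint 2
```

After recovery, block `0x40` holds the value it had at the start of interval 2 and the log is empty. Undoing in reverse order matters when a block is logged in several unvalidated intervals, since the oldest logged value must be the one that ends up in memory.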
8
Evaluation: Performance Impact
9
Evaluation: Sensitivity
10
Evaluation: Sensitivity (Cont)
11
Questions
  Why is having a coordinated checkpoint important?
  Why broadcast the Recovery Point Checkpoint Number twice: when advancing the recovery point and when triggering recovery?
  Why a sequentially consistent model? Is the scheme valid for processor consistency?
  Is this a good idea? Has it caught on?