Published by Mariah Matthews. Modified over 9 years ago.
1
SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery
D. Sorin, M. Martin, M. Hill, D. Wood
Presented by Akin Olugbade, 03/05/2010
2
Motivation
Increasing processor speeds and shrinking technology sizes make chips more susceptible to errors
Systems need high availability
Shared memory multiprocessors make up a large share of Internet servers
Crashing or rebooting the system is an undesirable way to deal with errors
3
SafetyNet Design
Create globally consistent checkpoints that the system can recover to when an error is detected
Save the architected state: processor registers, memory state, and coherence state
Validate that a checkpoint is fault free
Recover to the most recent validated checkpoint when an error is detected
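The validate and recover steps can be pictured with a small sketch. It assumes, purely for illustration, that each component reports the newest checkpoint number it has found fault free; the function name and the component names are hypothetical and do not come from the slides. Since a checkpoint counts as validated only when every component agrees, the recovery point is the minimum of those reports.

```python
def recovery_point(validated_by_component):
    """Return the newest checkpoint that every component has validated.

    validated_by_component maps a (hypothetical) component name to the
    newest checkpoint number that component has found fault free.
    """
    if not validated_by_component:
        raise ValueError("at least one component is required")
    # A checkpoint is usable for recovery only if all components agree on it,
    # so the slowest component bounds the recovery point.
    return min(validated_by_component.values())


# Illustrative reports, not measured data.
reports = {"cpu0": 5, "cpu1": 4, "mem_ctrl": 6}
print(recovery_point(reports))  # prints 4
```

On an error, the system would roll back to the checkpoint returned here, not to the newest checkpoint any single component has reached.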
4
SafetyNet Design
Reduced logging space: a change to a given register, memory block, or coherence permission is logged only once per checkpoint interval
Point of atomicity: the requestor does not advance its recovery point until all of its outstanding requests have completed
Consistent logical time ensures that checkpoints are globally consistent
Validation: all components must agree that a checkpoint is fault free before it is validated
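The once-per-interval logging rule can be sketched as a toy undo log. This is a minimal software model under stated assumptions, not the hardware design from the paper: the class, its method names, and the use of a dictionary as memory are all hypothetical, and `None` is reserved to mean "block did not exist".

```python
class CheckpointedMemory:
    """Toy memory with a per-interval undo log (all names are hypothetical)."""

    def __init__(self):
        self.memory = {}        # block address -> current value
        self.current_ckpt = 0   # number of the checkpoint interval in progress
        self.logged_in = {}     # block address -> interval of its latest log entry
        self.log = []           # undo entries: (interval, address, old value)

    def store(self, addr, value):
        # Log the old value only on the first store to this block in the
        # current interval; later stores in the same interval add nothing.
        if self.logged_in.get(addr) != self.current_ckpt:
            self.log.append((self.current_ckpt, addr, self.memory.get(addr)))
            self.logged_in[addr] = self.current_ckpt
        self.memory[addr] = value

    def advance_checkpoint(self):
        # Start a new interval; checkpoint N is the state at the start of interval N.
        self.current_ckpt += 1

    def recover(self, ckpt):
        # Undo, newest first, every change logged in interval ckpt or later.
        while self.log and self.log[-1][0] >= ckpt:
            _, addr, old = self.log.pop()
            if old is None:
                self.memory.pop(addr, None)
            else:
                self.memory[addr] = old
        self.current_ckpt = ckpt
        self.logged_in.clear()  # force fresh log entries after the rollback
```

For example, after one store to a block in interval 0 and two stores to the same block in interval 1, the log holds two entries for that block, and `recover(1)` restores the value it had when interval 1 began.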
5
Logical Time
6
Evaluation
8
Conclusion
+ The checkpoint/recovery system can be independent of the error detection mechanism
+ Negligible performance overhead in the common, error-free case
+ Storage and bandwidth overhead can be reduced greatly by increasing the checkpoint interval
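The effect of the checkpoint interval on storage follows from logging each block at most once per interval: a longer interval lets repeated stores to the same block share one log entry. The sketch below counts log entries for a synthetic store trace. The function name and the trace are made up for illustration and are not data from the paper's evaluation.

```python
def log_entries(store_trace, interval_len):
    """Count undo-log entries when each block is logged once per interval.

    store_trace is a list of block addresses in store order, and
    interval_len is the number of stores per checkpoint interval
    (a simplification; real intervals are measured in logical time).
    """
    total = 0
    for start in range(0, len(store_trace), interval_len):
        # Only the first store to each distinct block in the interval is logged.
        total += len(set(store_trace[start:start + interval_len]))
    return total


# Synthetic trace: 16 stores cycling over 4 blocks.
trace = [i % 4 for i in range(16)]
for interval_len in (4, 8, 16):
    print(interval_len, log_entries(trace, interval_len))
# prints:
# 4 16
# 8 8
# 16 4
```

The reduction depends on how often stores revisit the same blocks; a trace with no reuse would see no savings from a longer interval.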
9
Questions
Does the validation latency matter in the case of output commit?
How do we deal with stores when the checkpoint log buffer (CLB) fills up?
Is SafetyNet suitable for mission-critical situations?
If validation is fast enough, would we want to shorten the checkpoint interval?