1 ExtraVirt: Detecting and recovering from transient processor faults
Dominic Lucchetti, Steve Reinhardt, Peter Chen
University of Michigan
2 Flips Happen
Similar die area + Decreasing transition energy = Increasing risk of transient failure
3 Multi-Processors & Virtual Machine
Multi-Processor
  - Ensure error independence
  - Enable fault detection
  - Efficient resource sharing
Virtual Machine
  - No changes to OS or applications
  - VM replay
      - Synchronize replicas
      - Recover correct state
[Figure labels: Replica 1, Replica 2, Hypervisor, Device Drivers, Replication Management Layer (RML)]
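A minimal sketch of how replay can keep two replicas synchronized, assuming the only source of nondeterminism is an external input stream. The names `EventLog`, `run_guest`, and `run_replicated` are hypothetical and are not taken from ExtraVirt. The replicas run one after the other here purely for simplicity. The first replica records every input it consumes, and the second replica consumes the recorded values instead of reading the device again, so a fault-free run ends in the same state in both.

```python
from collections import deque
from typing import Callable, Deque


class EventLog:
    """Hypothetical log of nondeterministic inputs seen by the first replica."""

    def __init__(self) -> None:
        self._events: Deque[int] = deque()

    def record(self, value: int) -> int:
        # Remember the value and pass it through unchanged.
        self._events.append(value)
        return value

    def replay(self) -> int:
        # Hand back recorded values in the order they were observed.
        return self._events.popleft()


def run_guest(read_input: Callable[[], int], steps: int) -> int:
    """Toy guest whose final state depends only on the inputs it reads."""
    state = 0
    for _ in range(steps):
        state = (state * 31 + read_input()) % 1_000_003
    return state


def run_replicated(source: Callable[[], int], steps: int) -> tuple:
    """Run the guest twice: once recording inputs, once replaying them."""
    log = EventLog()
    first = run_guest(lambda: log.record(source()), steps)
    second = run_guest(log.replay, steps)
    return first, second
```

With matching final states the replicas agree; a mismatch would indicate that one of the executions deviated.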
4 Example: Memory
Copy on write
  - Reduces overhead
  - Protects checkpoints
Merge on checkpoint
  - Verify correctness
  - Re-execute on deviation
Memory Fault Protection
  - ECC against RAM faults
  - MMU against CPU faults
[Figure: memory pages A to E in a checkpoint and in Replica 1 and Replica 2, with one replica page shown as X; a Verify step; Replica 3 with pages A B C D E]
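The copy-on-write, verify, and re-execute steps on this slide can be sketched as follows. This is an illustration under stated assumptions, not the ExtraVirt implementation: `CowReplica` and `run_epoch` are hypothetical names, pages are modeled as `bytes` objects, and the tie-breaking rule (run a third replica from the checkpoint and accept whichever result two replicas agree on) is an assumption suggested by the Replica 3 label in the figure.

```python
from typing import Callable, Dict, List

Page = bytes


class CowReplica:
    """Hypothetical copy-on-write view over a shared checkpoint."""

    def __init__(self, checkpoint: List[Page]) -> None:
        self._checkpoint = checkpoint      # shared and never modified here
        self.dirty: Dict[int, Page] = {}   # page index -> private copy

    def read(self, idx: int) -> Page:
        # Reads fall through to the checkpoint unless the page was written.
        return self.dirty.get(idx, self._checkpoint[idx])

    def write(self, idx: int, data: Page) -> None:
        # Writes only touch the private copy, so the checkpoint stays intact.
        self.dirty[idx] = data


def run_epoch(checkpoint: List[Page],
              workload: Callable[[CowReplica, int], None]) -> List[Page]:
    """Run two replicas, verify their dirty pages, and merge on agreement."""
    r1, r2 = CowReplica(checkpoint), CowReplica(checkpoint)
    workload(r1, 1)
    workload(r2, 2)
    if r1.dirty == r2.dirty:
        agreed = r1.dirty
    else:
        # Deviation: re-execute from the checkpoint (assumed tie-breaker).
        r3 = CowReplica(checkpoint)
        workload(r3, 3)
        if r3.dirty == r1.dirty:
            agreed = r1.dirty
        elif r3.dirty == r2.dirty:
            agreed = r2.dirty
        else:
            raise RuntimeError("no two replicas agree")
    # Merge the verified pages into a new checkpoint.
    merged = list(checkpoint)
    for idx, page in agreed.items():
        merged[idx] = page
    return merged
```

Because replicas never write through to the checkpoint, a corrupted replica can simply be discarded and the epoch restarted from known-good state.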
5 Status
Present
  - VM Replay
  - Beginnings of Replication Management Layer (RML)
  - Still much to do…
Future
  - Replicate the un-replicated
  - Handle faults in device drivers
  - Expanded fault model
[Figure labels: Replica 1, Replica 2, Hypervisor/RML, Device Drivers]
6 Questions?