1 ExtraVirt: Detecting and recovering from transient processor faults
Dominic Lucchetti, Steve Reinhardt, Peter Chen
University of Michigan
2 Flips Happen
Similar die area + Decreasing transition energy = Increasing risk of transient failure
3 Multi-Processors & Virtual Machine
Multi-Processor
  Ensure error independence
  Enable fault detection
  Efficient resource sharing
Virtual Machine
  No changes to OS or applications
  VM replay
  Synchronize replicas
  Recover correct state
[Figure: Replica 1, Replica 2, Hypervisor, Device Drivers, Replication Management Layer (RML)]
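A minimal sketch of the fault-detection idea on this slide, not ExtraVirt's implementation: two replicas execute the same deterministic step on the same input, and their results are compared before anything is released. The names `flip_bit`, `replicated_step`, and `accumulate` are hypothetical, and a transient fault is modeled as a single flipped bit in one replica's result.

```python
from typing import Callable, Optional, Tuple


def flip_bit(value: int, bit: int) -> int:
    """Model a transient fault as one flipped bit (an assumption of this sketch)."""
    return value ^ (1 << bit)


def replicated_step(
    step: Callable[[int, int], int],
    state: int,
    inp: int,
    fault_in_replica: Optional[int] = None,
) -> Tuple[bool, int, int]:
    """Run one deterministic step on two replicas and compare their results.

    Returns (results_match, replica_1_result, replica_2_result).
    """
    results = []
    for replica in (1, 2):
        out = step(state, inp)
        if fault_in_replica == replica:
            out = flip_bit(out, 3)  # inject a fault into this replica only
        results.append(out)
    return results[0] == results[1], results[0], results[1]


def accumulate(state: int, inp: int) -> int:
    """Hypothetical workload: a 16-bit running sum."""
    return (state + inp) & 0xFFFF


ok, r1, r2 = replicated_step(accumulate, 10, 5)
faulty_ok, f1, f2 = replicated_step(accumulate, 10, 5, fault_in_replica=2)
```

With only two replicas a mismatch reveals that a fault occurred but not which replica is wrong, so recovery has to fall back on re-execution from known-good state.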
4 Example: Memory
Copy on write
  Reduces overhead
  Protects checkpoints
Merge on checkpoint
  Verify correctness
  Re-execute on deviation
Memory Fault Protection
  ECC against RAM faults
  MMU against CPU faults
[Figure: Memory, Checkpoint, Replica 1, Replica 2, Verify, Replica 3; page rows A B C D E, A B C X E, A B C E, A B C D E]
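The copy-on-write, merge-on-checkpoint, and re-execute-on-deviation steps above can be sketched as follows. This is an illustration under stated assumptions, not the RML's actual mechanism: memory is a dictionary of named pages, each replica writes into a private overlay built with the standard `collections.ChainMap`, and `commit_epoch`, `work`, and `inject` are hypothetical names introduced here.

```python
from collections import ChainMap
from typing import Callable, Dict, MutableMapping, Optional

Pages = Dict[str, int]
View = MutableMapping[str, int]


def commit_epoch(
    checkpoint: Pages,
    work: Callable[[View], None],
    inject: Optional[Callable[[int, int, View], None]] = None,
    max_attempts: int = 3,
) -> int:
    """Run `work` on two replicas and merge only if their dirty pages agree.

    Returns the number of attempts that were needed.
    """
    for attempt in range(max_attempts):
        dirty = []
        for replica in (1, 2):
            # Copy on write: reads fall through to the checkpoint,
            # writes land in the empty overlay, so the checkpoint is protected.
            view = ChainMap({}, checkpoint)
            work(view)
            if inject is not None:
                inject(attempt, replica, view)  # hypothetical fault-injection hook
            dirty.append(view.maps[0])
        if dirty[0] == dirty[1]:
            checkpoint.update(dirty[0])  # verified: merge into the checkpoint
            return attempt + 1
        # Deviation: discard both overlays and re-execute from the checkpoint.
    raise RuntimeError("replicas diverged on every attempt")


def work(mem: View) -> None:
    """Hypothetical workload: derive page D from page C."""
    mem["D"] = mem["C"] + 1


def inject(attempt: int, replica: int, mem: View) -> None:
    """Corrupt replica 2 on the first attempt only."""
    if attempt == 0 and replica == 2:
        mem["D"] ^= 0x10


ckpt = {"A": 1, "B": 2, "C": 3, "D": 0, "E": 5}
attempts = commit_epoch(ckpt, work, inject)
```

Because the overlays are discarded on a mismatch, a faulty epoch never reaches the checkpoint, and the retry starts from the same verified state.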
5 Status
Present
  VM Replay
  Beginnings of Replication Management Layer (RML)
  Still much to do…
Future
  Replicate the un-replicated
  Handle faults in device drivers
  Expanded fault model
[Figure: Replica 1, Replica 2, Hypervisor/RML, Device Drivers]
6 Questions?