ExtraVirt: Detecting and recovering from transient processor faults. Dominic Lucchetti, Steve Reinhardt, Peter Chen, University of Michigan.

Presentation transcript:

1 ExtraVirt: Detecting and recovering from transient processor faults
Dominic Lucchetti, Steve Reinhardt, Peter Chen
University of Michigan

2 Flips Happen
Similar die area + decreasing transition energy = increasing risk of transient failure

3 Multi-Processors & Virtual Machine
Multi-Processor
- Ensure error independence
- Enable fault detection
- Efficient resource sharing
Virtual Machine
- No changes to OS or applications
- VM replay
  - Synchronize replicas
  - Recover correct state
[Diagram labels: Replica 1, Replica 2, Hypervisor, Device Drivers, Replication Management Layer (RML)]
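As a rough illustration of "enable fault detection": if two replicas receive the same inputs, any difference in their resulting state points to a fault in one of them. The sketch below is not ExtraVirt's implementation; the function names `digest` and `replicas_agree` and the use of SHA-256 are assumptions made for the example.

```python
import hashlib


def digest(state: bytes) -> str:
    """Hash a replica's state so two replicas can be compared cheaply."""
    return hashlib.sha256(state).hexdigest()


def replicas_agree(state_1: bytes, state_2: bytes) -> bool:
    """Replicas fed identical inputs should end up with identical state."""
    return digest(state_1) == digest(state_2)


# Simulate a transient fault as a single flipped bit in one replica's state.
good = b"page contents"
flipped = bytes([good[0] ^ 0x01]) + good[1:]

print(replicas_agree(good, good))     # replicas match
print(replicas_agree(good, flipped))  # replicas diverge, fault suspected
```

Comparison alone only detects a divergence; deciding which replica is wrong needs additional information, such as a third execution.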

4 Example: Memory
Copy on write
- Reduces overhead
- Protects checkpoints
Merge on checkpoint
- Verify correctness
- Re-execute on deviation
Memory Fault Protection
- ECC against RAM faults
- MMU against CPU faults
[Diagram labels: Memory, Checkpoint, Replica 1, Checkpoint, Replica 2, Verify, Replica 3; page rows: "A B C D E", "A B C X E", "A B C E", "A B C D E"]
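One possible reading of this slide, sketched under assumptions: each replica keeps private copies of the pages it writes (copy on write), so the checkpoint itself is never modified; at the next checkpoint the replicas' written pages are compared, and on a mismatch the interval is re-executed in a third replica whose result breaks the tie. The class `Replica`, the function `merge_on_checkpoint`, and the majority-vote rule are hypothetical names and choices for illustration, not taken from the presentation.

```python
class Replica:
    """Holds only the pages written since the last checkpoint (copy on write)."""

    def __init__(self, checkpoint):
        self.checkpoint = checkpoint  # shared baseline; the replica never writes to it
        self.dirty = {}               # private copies of pages this replica has written

    def write(self, page, value):
        self.dirty[page] = value

    def read(self, page):
        return self.dirty.get(page, self.checkpoint[page])


def merge_on_checkpoint(checkpoint, replica_1, replica_2, run_third_replica):
    """Verify the replicas against each other and build the next checkpoint."""
    if replica_1.dirty == replica_2.dirty:
        agreed = replica_1.dirty
    else:
        # Deviation: re-execute from the protected checkpoint and take a majority.
        replica_3 = run_third_replica(checkpoint)
        votes = [replica_1.dirty, replica_2.dirty, replica_3.dirty]
        agreed = next((d for d in votes if votes.count(d) >= 2), None)
        if agreed is None:
            raise RuntimeError("no two replicas agree")
    new_checkpoint = dict(checkpoint)
    new_checkpoint.update(agreed)
    return new_checkpoint


def workload(replica, fault=False):
    """Toy workload: writes page D; a simulated fault corrupts the value."""
    replica.write("D", "X" if fault else 1)
    return replica


checkpoint = {"A": 0, "B": 0, "C": 0, "D": 0, "E": 0}
r1 = workload(Replica(checkpoint))
r2 = workload(Replica(checkpoint), fault=True)
new_checkpoint = merge_on_checkpoint(
    checkpoint, r1, r2, lambda cp: workload(Replica(cp))
)
```

Because writes go only to each replica's private `dirty` map, the old checkpoint stays intact and can serve as the starting point for re-execution.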

5 Status
Present
- VM Replay
- Beginnings of Replication Management Layer (RML)
- Still much to do…
Future
- Replicate the un-replicated
- Handle faults in device drivers
- Expanded fault model
[Diagram labels: Replica 1, Replica 2, Hypervisor/RML, Device Drivers]

6 Questions?