Published by Phoebe Park. Modified over 6 years ago.
1
Co-designed Virtual Machines for Reliable Computer Systems
University of Wisconsin – Madison
September 2002
2
Outline
Overview of VMFTC – Dual-Mode Execution Virtual Machine for Fault-Tolerant Computing
Hardware: Processor Micro-Architecture Hooks
Hardware: Memory System Design
Hardware: Interconnection, I/O Channels, Disks and Networking
Software: Virtual Machine Monitor
VMFTC Simulation Design
Problems and Feedback Needed
4/22/2019 3:23 AM vmftc
3
Principles
Hardware resource replication is a must for both fault tolerance and performance/throughput.
Build simple HW hooks on top of replicated hardware resources, then export either a performance VM or an FTC VM.
Maintain the conventional architectural interface between HW and SW (e.g. between HW and OS), and exploit COTS components where possible.
Is this strategy problematic?
4
Architecture - End Users’ Perspective:
Two modes of virtual machine run on the same hardware platform, together with hidden mini-VMM software.
Performance Mode VM: fully architected COTS processor resources, wider interconnection bandwidth, larger memory, wider I/O channel bandwidth, larger or more disks, and lower latency.
Reliable Mode VM: ultra-reliable architected processor resources, plus ultra-reliable and highly available interconnections, memory systems, I/O channels and storage.
Positive synergistic effects:
Self-monitoring and self-healing via error detection and recovery mechanisms in the reliable mode VM.
Higher system throughput via the performance mode VM, relieving workload pressure on the whole server system.
5
Architecture - End Users’ Perspective:
6
Proposed Contributions:
Flexible and cost-effective use of replicated COTS hardware resources via virtual machine technology, while maintaining the conventional architecture.
An ultra-reliable architectural interface for the software community: hardware RAS is separated from software RAS, which allows easier hierarchical solutions.
Simple architectural support for software RAS mechanisms, when needed, for more effective whole-system solutions.
7
Processor Micro-Architecture Hooks
Performance Mode VM: more architected processing capacity.
Reliable Mode VM: RAS provided by lock-stepped processor pairs. Can monitor the runtime hardware status of the system.
Dynamic switching between the two modes.
Bootstrap: reliable mode, for better self-testing.
Power-off: the VMM takes final control of the system and enters reliable mode to check that everything is OK before powering off.
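A minimal sketch of how one processor pair could serve both modes: compared in lock-step in reliable mode, exported as two independent processors in performance mode. The slides do not specify the mechanism. All names here (`Mode`, `Core`, `DualModePair`, `LockstepMismatch`) and the single-bit fault model are assumptions made for illustration.

```python
from enum import Enum


class Mode(Enum):
    PERFORMANCE = "performance"
    RELIABLE = "reliable"


class LockstepMismatch(Exception):
    """Raised when the two cores of a pair disagree on a result."""


class Core:
    """Toy core that runs a function; can be marked faulty for illustration."""

    def __init__(self, faulty=False):
        self.faulty = faulty

    def execute(self, op, *args):
        result = op(*args)
        # Assumed fault model: a faulty core flips the low bit of an integer result.
        return result ^ 1 if self.faulty else result


class DualModePair:
    """Two physical cores exported as one checked core or two plain cores."""

    def __init__(self, core_a, core_b, mode=Mode.RELIABLE):
        self.cores = (core_a, core_b)
        self.mode = mode

    def architected_cores(self):
        # Performance mode exposes both cores; reliable mode exposes one.
        return 2 if self.mode is Mode.PERFORMANCE else 1

    def switch(self, mode):
        self.mode = mode

    def run(self, op, *args):
        if self.mode is Mode.PERFORMANCE:
            # No cross-checking: a single core executes the work.
            return self.cores[0].execute(op, *args)
        # Reliable mode: both cores execute and the results are compared.
        a = self.cores[0].execute(op, *args)
        b = self.cores[1].execute(op, *args)
        if a != b:
            raise LockstepMismatch(f"{a!r} != {b!r}")
        return a
```

In this sketch a mismatch only signals detection; recovery would be left to a separate checkpoint/rollback mechanism.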
8
Processors: Lock-stepped UP or SMP/CMP
9
Memory System Design
Performance Mode VM: more architected memory and interconnection bandwidth, slightly lower latency.
Reliable Mode VM: a smaller but more reliable and available memory system.
Exploit log bits/parity bits for each memory block to perform memory transaction logging.
Storage and communications are protected by ECC.
Optional mirrored memory images.
All logic modules, such as cache coherence processors, could also exploit dual-mode processing.
Dynamic switching between the two modes.
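One way the per-block log bit could drive memory transaction logging is sketched below: the first write to a block after a checkpoint saves the old value and sets the bit, and later writes to the same block skip the log. This is an assumed reading of the slide, not the authors' design. The class name `LoggedMemory` and the undo-log layout are hypothetical.

```python
class LoggedMemory:
    """Toy block memory with one log bit per block (illustrative only)."""

    def __init__(self, num_blocks):
        self.blocks = [0] * num_blocks
        self.log_bit = [False] * num_blocks
        self.undo_log = []  # entries are (block index, value before first write)

    def write(self, idx, value):
        # Log the old value only on the first write since the last checkpoint.
        if not self.log_bit[idx]:
            self.undo_log.append((idx, self.blocks[idx]))
            self.log_bit[idx] = True
        self.blocks[idx] = value

    def checkpoint(self):
        # Commit current contents: discard the log and clear all log bits.
        self.undo_log.clear()
        self.log_bit = [False] * len(self.blocks)

    def rollback(self):
        # Restore every logged block to its value at the last checkpoint.
        while self.undo_log:
            idx, old = self.undo_log.pop()
            self.blocks[idx] = old
        self.log_bit = [False] * len(self.blocks)
```

With this scheme the log grows with the number of distinct blocks written between checkpoints, not with the number of writes.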
10
Memory System Design
11
Interconnection & I/O Channels
Dual-mode I/O controllers and channels.
Multiple interconnections for both availability and performance.
Performance Mode VM: more architected resource capacity, thanks to physical resource replication.
Reliable Mode VM: less capacity for both communication bandwidth and storage, because of controller cross-checking and the checkpointing/logging overhead of hidden VMM activity. In return, it monitors component fault rates for fault forecasting.
Dynamic switching between the two modes.
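Fault-rate monitoring for forecasting could be as simple as counting detected faults per component over a sliding window, as sketched below. The slides give no algorithm, so the class `FaultRateMonitor`, the window length and the threshold are all assumptions.

```python
from collections import defaultdict, deque


class FaultRateMonitor:
    """Counts recent detected faults per component (illustrative only)."""

    def __init__(self, window, threshold):
        self.window = window        # length of the observation window, in time units
        self.threshold = threshold  # fault count in the window that marks a suspect
        self.events = defaultdict(deque)

    def record(self, component, timestamp):
        # Store the new fault and drop faults that fell out of the window.
        q = self.events[component]
        q.append(timestamp)
        while q and q[0] <= timestamp - self.window:
            q.popleft()

    def suspect(self, component):
        # A component is suspect when its recent fault count hits the threshold.
        return len(self.events[component]) >= self.threshold
```

A component flagged as suspect could, for example, be taken out of the performance mode resource pool before it fails outright.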
12
I/O & Interconnection Network
13
VMM Issues
Dynamic configuration/switching of the VMs.
The VMM intercepts certain interrupts in reliable mode: timer, machine check interrupts, I/O interrupts.
Timer-triggered checkpointing.
Checkpoint state: processor state, cache state and memory state. Communication states via QS.
Memory checkpoints and memory transaction logging in main memory storage – log bits to reduce work.
I/O event logging for replay during recovery.
Rollback recovery: roll back the memory image, reload system state, replay I/O events…
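The checkpoint, log and replay cycle listed above can be sketched as follows. This is a toy model under stated assumptions, not the VMFTC design: `ToyVMM` is a hypothetical name, a single dictionary stands in for processor, cache and memory state, and the guest is assumed to be deterministic given its inputs.

```python
import copy


class ToyVMM:
    """Timer-driven checkpointing with input logging and replay (illustrative only)."""

    def __init__(self, checkpoint_interval):
        self.interval = checkpoint_interval
        self.state = {"acc": 0}  # stands in for processor/cache/memory state
        self.ticks = 0
        self.saved = copy.deepcopy(self.state)
        self.io_log = []         # inputs delivered since the last checkpoint

    def _apply(self, value):
        # Deterministic guest behaviour: accumulate the input.
        self.state["acc"] += value

    def deliver_input(self, value):
        # Intercepted I/O interrupt: log the event, then let the guest see it.
        self.io_log.append(value)
        self._apply(value)

    def timer_tick(self):
        # Intercepted timer interrupt: checkpoint every `interval` ticks.
        self.ticks += 1
        if self.ticks % self.interval == 0:
            self.saved = copy.deepcopy(self.state)
            self.io_log.clear()

    def recover(self):
        # Roll back to the checkpoint, then replay the logged inputs in order.
        self.state = copy.deepcopy(self.saved)
        for value in self.io_log:
            self._apply(value)
```

Replay reproduces the pre-fault state only because the guest is deterministic here; a real system would also have to log interrupt timing and avoid repeating output events.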
14
VMFTC Simulation Design
Simulator infrastructure: PHARMsim (SimOS-PPC + SimpleMP). Precise but slow.
Fault injection.
Fault detection.
Execution mode switch in the simulator.
Checkpointing/logging and recovery with full consideration of precise I/O event handling in the PHARMsim simulator.
Co-designed VMM
Classical OS VMM
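A fault-injection campaign of the kind listed above might be structured as in the sketch below, which is independent of PHARMsim and uses none of its interfaces. The single-bit-flip fault model, the parity detector and the function names are assumptions chosen to keep the example small.

```python
import random


def parity(word):
    """Even/odd parity of an integer's set bits (0 or 1)."""
    return bin(word).count("1") % 2


def inject_bit_flip(word, rng, width=32):
    """Flip one randomly chosen bit of a `width`-bit word."""
    return word ^ (1 << rng.randrange(width))


def run_campaign(trials, seed=0):
    """Inject one bit flip per trial and return the fraction detected by parity."""
    rng = random.Random(seed)  # seeded so that campaigns are repeatable
    detected = 0
    for _ in range(trials):
        word = rng.getrandbits(32)
        stored_parity = parity(word)
        corrupted = inject_bit_flip(word, rng)
        if parity(corrupted) != stored_parity:
            detected += 1
    return detected / trials
```

Parity catches every single-bit flip, so the coverage here is complete by construction; a multi-bit fault model would need a stronger detector such as ECC or lock-step comparison.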
15
Problems
Potential applications: servers? PCs/workstations? Mobile computing? Embedded systems?
Whole-system fault models: what are the common faults, and what are their frequency, cost, etc.?
Cost models for building these hooks into the whole system. Cost of redundant resources.
How should fault-tolerant computing be evaluated, and how can that evaluation be performed within a research project?
Can HW do anything to help recover from SW Heisenberg faults? Or to help SW fault tolerance in a co-designed style?