(C) 2002 Daniel SorinWisconsin Multifacet Project SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery.

Slides:

Advertisements

Similar presentations

Remus: High Availability via Asynchronous Virtual Machine Replication

Advertisements

Full-System Timing-First Simulation Carl J. Mauer Mark D. Hill and David A. Wood Computer Sciences Department University of Wisconsin—Madison.

Coherence Ordering for Ring-based Chip Multiprocessors Mike Marty and Mark D. Hill University of Wisconsin-Madison.

Managing Wire Delay in Large CMP Caches Bradford M. Beckmann David A. Wood Multifacet Project University of Wisconsin-Madison MICRO /8/04.

1 CPS216: Data-intensive Computing Systems Failure Recovery Shivnath Babu.

CS 258 Parallel Computer Architecture Lecture 15.1 DASH: Directory Architecture for Shared memory Implementation, cost, performance Daniel Lenoski, et.

Cache Coherent Distributed Shared Memory. Motivations Small processor count –SMP machines –Single shared memory with multiple processors interconnected.

ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang, Josep Torrellas University.

Thread-Level Transactional Memory Decoupling Interface and Implementation UW Computer Architecture Affiliates Conference Kevin Moore October 21, 2004.

(C) 2003 Milo Martin Token Coherence: Decoupling Performance and Correctness Milo Martin, Mark Hill, and David Wood Wisconsin Multifacet Project

Variability in Architectural Simulations of Multi-threaded Workloads Alaa R. Alameldeen and David A. Wood University of Wisconsin-Madison

Making Services Fault Tolerant

Continuously Recording Program Execution for Deterministic Replay Debugging.

1 Lecture 22: Fault Tolerance Papers: Token Coherence: Decoupling Performance and Correctness, ISCA’03, Wisconsin A Low Overhead Fault Tolerant Coherence.

(C) 2002 Milo MartinHPCA, Feb Bandwidth Adaptive Snooping Milo M.K. Martin, Daniel J. Sorin Mark D. Hill, and David A. Wood Wisconsin Multifacet.

CS-550 (M.Soneru): Recovery [SaS] 1 Recovery. CS-550 (M.Soneru): Recovery [SaS] 2 Recovery Computer system recovery: –Restore the system to a normal operational.

(C) 2003 Milo Martin Using Destination-Set Prediction to Improve the Latency/Bandwidth Tradeoff in Shared-Memory Multiprocessors Milo Martin, Pacia Harper,

Evaluating Non-deterministic Multi-threaded Commercial Workloads Computer Sciences Department University of Wisconsin—Madison

(C) 2004 Daniel SorinDuke Architecture Using Speculation to Simplify Multiprocessor Design Daniel J. Sorin 1, Milo M. K. Martin 2, Mark D. Hill 3, David.

Checkpoint Based Recovery from Power Failures Christopher Sutardja Emil Stefanov.

Rensselaer Polytechnic Institute CSC 432 – Operating Systems David Goldschmidt, Ph.D.

Presented by Deepak Srinivasan Alaa Aladmeldeen, Milo Martin, Carl Mauer, Kevin Moore, Min Xu, Daniel Sorin, Mark Hill and David Wood Computer Sciences.

A “Flight Data Recorder” for Enabling Full-system Multiprocessor Deterministic Replay Min Xu, Rastislav Bodik, Mark D. Hill

1 Fault Tolerance in the Nonstop Cyclone System By Scott Chan Robert Jardine Presented by Phuc Nguyen.

Rensselaer Polytechnic Institute CSCI-4210 – Operating Systems CSCI-6140 – Computer Operating Systems David Goldschmidt, Ph.D.

Dynamic Verification of Cache Coherence Protocols Jason F. Cantin Mikko H. Lipasti James E. Smith.

SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill,

SafetyNet Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill,

Architectural Characterization of an IBM RS6000 S80 Server Running TPC-W Workloads Lei Yang & Shiliang Hu Computer Sciences Department, University of.

Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing Barroso, Gharachorloo, McNamara, et. Al Proceedings of the 27 th Annual ISCA, June.

Simulating a $2M Commercial Server on a $2K PC Alaa R. Alameldeen, Milo M.K. Martin, Carl J. Mauer, Kevin E. Moore, Min Xu, Daniel J. Sorin, Mark D. Hill.

Achieving Scalability, Performance and Availability on Linux with Oracle 9iR2-RAC Grant McAlister Senior Database Engineer Amazon.com Paper

(C) 2003 Daniel SorinDuke Architecture Dynamic Verification of End-to-End Multiprocessor Invariants Daniel J. Sorin 1, Mark D. Hill 2, David A. Wood 2.

Serverless Network File Systems Overview by Joseph Thompson.

ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Availability Copyright 2004 Daniel J. Sorin Duke University.

1 Lecture 24: Fault Tolerance Papers: Token Coherence: Decoupling Performance and Correctness, ISCA’03, Wisconsin A Low Overhead Fault Tolerant Coherence.

Fault-Tolerant Systems Design Part 1.

Safetynet: Improving The Availability Of Shared Memory Multiprocessors With Global Checkpoint/Recovery D. Sorin M. Martin M. Hill D. Wood Presented by.

Lazy Release Consistency for Software Distributed Shared Memory Pete Keleher Alan L. Cox Willy Z. By Nooruddin Shaik.

Availability in CMPs By Eric Hill Pranay Koka. Motivation RAS is an important feature for commercial servers –Server downtime is equivalent to lost money.

Middleware for Fault Tolerant Applications Lihua Xu and Sheng Liu Jun, 05, 2003.

3/12/2013Computer Engg, IIT(BHU)1 PARALLEL COMPUTERS- 2.

AlphaServer GS320 Architecture & Design Gharachorloo, Sharma, Steely, and Van Doren Compaq Research & High-Performance Servers Published in 2000 (ASPLOS-IX)‏

Token Coherence: Decoupling Performance and Correctness Milo M. D. Martin Mark D. Hill David A. Wood University of Wisconsin-Madison ISCA-30 (2003)

A Survey of Fault Tolerance in Distributed Systems By Szeying Tan Fall 2002 CS 633.

Dynamic Verification of Sequential Consistency Albert Meixner Daniel J. Sorin Dept. of Computer Dept. of Electrical and Science Computer Engineering Duke.

Timestamp snooping: an approach for extending SMPs Milo M. K. Martin et al. Summary by Yitao Duan 3/22/2002.

FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI Gengbin Zheng Lixia Shi Laxmikant V. Kale Parallel Programming Lab.

Running Commodity Operating Systems on Scalable Multiprocessors Edouard Bugnion, Scott Devine and Mendel Rosenblum Presentation by Mark Smith.

Operating System Reliability Andy Wang COP 5611 Advanced Operating Systems.

Architecture and Design of AlphaServer GS320

Using Destination-Set Prediction to Improve the Latency/Bandwidth Tradeoff in Shared-Memory Multiprocessors Milo Martin, Pacia Harper, Dan Sorin§, Mark.

Operating System Reliability

Operating System Reliability

Introduction to Operating Systems

CMSC 611: Advanced Computer Architecture

Address-Value Delta (AVD) Prediction

Operating System Reliability

Fault Tolerance Distributed Web-based Systems

Operating System Reliability

Improving Multiple-CMP Systems with Token Coherence

Simulating a $2M Commercial Server on a $2K PC

Operating System Reliability

Dynamic Verification of Sequential Consistency

Operating System Reliability

University of Wisconsin-Madison Presented by: Nick Kirchem

Operating System Reliability

The Design and Implementation of a Log-Structured File System

Presentation transcript:

(C) 2002 Daniel SorinWisconsin Multifacet Project SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department University of Wisconsin—Madison

SafetyNet – Daniel Sorin slide 2 Overview Hardware fault frequencies are increasing Hardware checkpoint/recovery for multiprocessors –Transparent to software SafetyNet Innovations –Efficient coordination of checkpoint creation –Optimized logging of checkpoint state –Checkpoint validation off critical path SafetyNet achieves 3 goals, existing systems get 2 –High availability –High performance –Low cost

SafetyNet – Daniel Sorin slide 3 Outline Availability –Motivation –Example targeted faults –Differences between SafetyNet and existing approaches SafetyNet: Key Features A SafetyNet Implementation Evaluation Conclusions

SafetyNet – Daniel Sorin slide 4 Availability Motivation Fault frequencies are increasing 1.Technological reasons –Smaller transistors –Denser wires 2.Architectural reasons –More components –More aggressive designs Marketing trends demand more availability Need architectural solution to improve availability

SafetyNet – Daniel Sorin slide 5 Which Faults Do We Target? Hardware faults in shared memory multiprocessors –Mostly transient, some permanent, not chipkill We focus on faults outside of processor cores –Why? Good techniques for processors (e.g., DIVA) Interconnection network –Example: dead switch –Detect with timeout Cache coherence protocols –Example: lost coherence message –Detect with timeout CPU Interconnectio n Network Interconnectio n Network

SafetyNet – Daniel Sorin slide 6 System Hardware Design Space Backward Error Recovery (Tandem NonStop) Forward Error Recovery (IBM mainframes) Servers and PCs Existing systems get only 2 out of 3 features

SafetyNet – Daniel Sorin slide 7 Outline Availability SafetyNet: Key Features –System abstraction –Innovations A SafetyNet Implementation Evaluation Conclusions

SafetyNet – Daniel Sorin slide 8 SafetyNet Abstraction Processor Current Memory Checkpoint Current Memory checkpoint Current Memory Version Active (Architectural) State of System Most Recently Validated Checkpoint Recovery Point Checkpoints Awaiting Validation

SafetyNet – Daniel Sorin slide 9 SafetyNet Execution Model CP1 time CP2 CP3 CP4 CP5 recovery pt validating active Create CP3 recovery pt validating active validating Validate CP2 recovery pt validating active Create CP4 active validating recovery pt Recovery recovery pt active

SafetyNet – Daniel Sorin slide 10 SafetyNet Goal and Innovations Goal: Recover to consistent checkpoint if fault Inefficient but correct solution –Periodically quiesce entire system to take checkpoint –Checkpoints include all system state –Stop system to validate checkpoints as fault free SafetyNet innovations: 1.Efficient coordination of checkpoint creation across system 2.Optimized checkpointing of system state 3.Pipelined validation of checkpoints in background

SafetyNet – Daniel Sorin slide 11 Checkpoints must reflect consistent system state –Nodes must agree on memory values and coherence Coordinate checkpoints in logical time –Logical time is time base that respects causality Each node maintains its own logical clock –Create checkpoint every K logical cycles We need logical time base that helps coordination Key #1 Coordinating Checkpoint Creation

SafetyNet – Daniel Sorin slide 12 Logical Time Base Many logical time bases exist –Depends on coherence protocol Broadcast snooping systems –Increment clock for every coherence request processed –Nodes can be at different logical times –All nodes can agree when coherence transaction happens Directory protocol systems –Based on loosely synchronized physical clock (10 kHz) –More complicated explanation  refer to paper for details

SafetyNet – Daniel Sorin slide 13 Key #2 Optimized Checkpointing of System State Checkpoint all state needed to resume execution –Processor registers –Memory state (including cache state) –Cache coherence state Processors save register state at each checkpoint –Copy registers into shadow registers Logically, cache/memory log old data every time: –Store overwrites an old checkpoint of block –Block’s coherence ownership is transferred How can we reduce the amount of logged state?

SafetyNet – Daniel Sorin slide 14 Optimized Logging Insight: only recover at checkpoint granularity Intervals between checkpoints group writes/transfers –E.g., checkpoint every 100,000 cycles (100 μsec at 1GHz) Only log first store/transfer per block per interval Optimization at cache: –Label cache blocks with checkpoint numbers (CNs) –If write/transfer is from same checkpoint, no logging needed Large benefit due to locality of references

SafetyNet – Daniel Sorin slide 15 Key #3 Checkpoint Validation in Background Only validate when all agree checkpoint is fault-free –Example: no outstanding coherence requests in checkpoint Nodes perform fault detection, then coordinate Can be in background and pipelined –Reason why we have checkpoints awaiting validation Can hide long fault detection latencies –Number of outstanding checkpoints x checkpoint length –Design tolerance to be longer than longest detection latency Don’t slow down execution to validate checkpoints

SafetyNet – Daniel Sorin slide 16 Outline Availability SafetyNet: Key Features A SafetyNet Implementation Evaluation Conclusions

SafetyNet – Daniel Sorin slide 17 System Model Checkpoint Log Buffer (CLB) at cache and memory Just FIFO log of block writes/transfers CPU cache(s) CLB memory network interface NS half switch EW half switch reg CPs I/O bridge

SafetyNet – Daniel Sorin slide 18 Example of SafetyNet Operation Recovery point is checkpoint 2. Most recent checkpoint is 3. Active checkpoint is 4. Processor 1 owns block B (validated). CLB Cache P1 BM2000 P2 Interconnection network Regs: CP2 Regs: CP3 Regs: CP2 Regs: CP3 Addr State CN data Addr State data Addr State CN data CLB Cache

SafetyNet – Daniel Sorin slide 19 Example of SafetyNet Operation P1 stores 3000 to block B between checkpoints 3 and 4. Logs old data. P1 BM43000 P2 BM2000 Addr State CN data Addr State data Regs: CP2 Regs: CP3 Regs: CP2 Regs: CP3 CLB Cache CLB Cache Interconnection network

SafetyNet – Daniel Sorin slide 20 Example of SafetyNet Operation P1 loads from block B. SafetyNet uninvolved. P1 BM43000 P2 BM2000 Addr State CN data Addr State data Regs: CP2 Regs: CP3 Regs: CP2 Regs: CP3 CLB Cache CLB Cache Interconnection network

SafetyNet – Daniel Sorin slide 21 Example of SafetyNet Operation Coordinated creation of checkpoint 4. Active checkpoint is 5. Save register state at beginning of checkpoint 4. P1 BM43000 P2 BM2000 Regs: CP2 Regs: CP3 Regs: CP4 Regs: CP2 Regs: CP3 Regs: CP4 Addr State CN data Addr State data CLB Cache CLB Cache Interconnection network

SafetyNet – Daniel Sorin slide 22 Example of SafetyNet Operation P2 requests ownership of block B. P1 logs old data and sends copy to P2. P1 invalidates cache entry. P1P2 BM53000 BM2000 BM3000 Addr State CN data Addr State data Regs: CP2 Regs: CP3 Regs: CP4 Regs: CP2 Regs: CP3 Regs: CP4 CLB Cache CLB Cache Interconnection network

SafetyNet – Daniel Sorin slide 23 Example of SafetyNet Operation Validation of checkpoint 3. Discard checkpoint 2 registers. Recovery point is now beginning of checkpoint 3. P1P2 BM53000 BM Regs: CP3 Regs: CP4 Regs: CP3 Regs: CP4 Addr State CN data Addr State data BM2000 CLB Cache CLB Cache Interconnection network

SafetyNet – Daniel Sorin slide 24 Example of SafetyNet Operation Recovery (to checkpoint 3). Restore CP3 registers. Restore ownership of B to P1. Invalidate B at P2. Now restart system! P1 BM P Regs: CP3 Addr State CN data Addr State data CLB Cache CLB Cache Interconnection network

SafetyNet – Daniel Sorin slide 25 System Recovery and Restart Any component can trigger recovery –E.g., processor times out on coherence request All in-progress transactions are dropped –By definition, these transactions are not validated After recovery, resume execution –May have to reconfigure (e.g., route around dead link) –Must replay work that was lost

SafetyNet – Daniel Sorin slide 26 I/O and the Outside World Output commit problem – Can’t send uncommitted data beyond sphere of recoverability SafetyNet includes processors, memory, coherence Doesn’t include network, disks, printer, etc. Standard solution: wait to communicate with I/O Only send validated data to outside world Input commit problem – Input can’t be recovered –Standard solution: log input

SafetyNet – Daniel Sorin slide 27 Outline Availability SafetyNet: Key Features A SafetyNet Implementation Evaluation –Methodology –Runtime performance Conclusions

SafetyNet – Daniel Sorin slide 28 Methodology: Simulation & Workloads Simulation –Simics full-system simulation of 16-proc SPARC system –Detailed timing simulation of memory system MOSI directory cache coherence protocol –Simple, in-order processor model –128KB L1I/D, 4MB L2, 512KB CLB Workloads (commercial and scientific) –Online transaction processing (OLTP): IBM’s DB2 –Static web server: Apache driven by SURGE –Dynamic web server: Slashcode –Java server: SpecJBB –Scientific: barnes-hut from SPLASH2

SafetyNet – Daniel Sorin slide 29 Runtime Performance Normalize results to unprotected system

SafetyNet – Daniel Sorin slide 30 Runtime Performance Unprotected system crashes if fault occurs

SafetyNet – Daniel Sorin slide 31 Runtime Performance SafetyNet has same fault-free performance as unprotected Error bars = +/- one standard deviation

SafetyNet – Daniel Sorin slide 32 Runtime Performance SafetyNet avoids crashes in presence of lost messages

SafetyNet – Daniel Sorin slide 33 Runtime Performance SafetyNet avoids crashes in presence of dead half-switch

SafetyNet – Daniel Sorin slide 34 High-Level Comparison to ReVive ReViveSafetyNet Backward error recovery scheme Yes Fault modelTransient & permanent Transient & some permanent Processor modification NoYes Software modificationMinorNone Fault-free performance 6-10% lossNo loss Output commit latencyAt least 100 milliseconds No more than 0.4 milliseconds

SafetyNet – Daniel Sorin slide 35 Conclusions SafetyNet: global, consistent checkpointing –Low cost and high performance –Efficient logical time checkpoint coordination –Optimized checkpointing of state –Pipelined, in-background checkpoint validation Improved availability –Avoid crash in case of fault –Same fault-free performance

SafetyNet – Daniel Sorin slide 36 Performance vs. CLB Size Caveats Scaled workloads 100,000 cycle intervals

SafetyNet – Daniel Sorin slide 37 Traditional Availability Forward Error Recovery (FER) –Use redundant hardware to mask faults –E.g., triple modular redundancy with voter or pair&spare –Systems: IBM mainframes, Intel 432, Stratus –Sacrifices cost to achieve availability Backward Error Recovery (BER) –If fault detected, recover system to pre-fault state –Periodically stop system and save state or log changes –Fault? Restore pre-fault checkpoint or unroll log –Systems: Sequoia, Synapse N+1, Tandem NonStop –Sacrifices performance to achieve availability