SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill,

SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery
Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, David A. Wood
In Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA)
Henry Cook, CS258, 4/7/2008

Goals
Create a system-wide, lightweight checkpoint and recovery mechanism
Provide globally consistent logical checkpoints
Have low runtime overhead
Prevent crashes in the face of hard or soft errors
Decouple recovery from detection

System Overview

Challenge 1
Saving every update, write, or response is expensive
  – Checkpoint at coarse granularity (every 100K cycles)
  – Only log the first such action per block per checkpoint interval

Challenge 2
All processors, caches, and memories must recover to a consistent point
  – Global logical time
  – Logically atomic coherence transactions
      Point of atomicity
  – Avoid checkpointing transient state or in-flight messages by waiting for transactions to complete

Challenge 2 - Global logical time
Broadcast/snooping: count number of coherence requests received
Distribute perfectly synchronous physical clock
Distribute loosely synchronized checkpoint clock
  – Valid time base if skew < communication time between nodes
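The skew condition on the loosely synchronized checkpoint clock can be illustrated with a small sketch. It assumes a worst case in which the receiver's clock lags the sender's by the full skew; the function names and the specific cycle counts are hypothetical, and only the 100K-cycle interval comes from the evaluation slide.

```python
CHECKPOINT_INTERVAL = 100_000  # cycles; interval taken from the evaluation slide


def checkpoint_number(local_cycles: int) -> int:
    """Checkpoint interval that a node's local clock value falls into."""
    return local_cycles // CHECKPOINT_INTERVAL


def is_valid_time_base(skew: int, min_latency: int) -> bool:
    """Condition from the slide: skew must be below the communication time."""
    return skew < min_latency


def receive_interval(send_local_cycles: int, skew: int, latency: int) -> int:
    """Interval in which the receiver sees a message (assumed worst case:
    the receiver's clock lags the sender's clock by `skew` cycles)."""
    arrival_local = send_local_cycles - skew + latency
    return checkpoint_number(arrival_local)


# Message sent right at the start of interval 2 on the sender's clock.
sent_at = 2 * CHECKPOINT_INTERVAL
ok = receive_interval(sent_at, skew=50, latency=80)    # skew < latency
bad = receive_interval(sent_at, skew=80, latency=50)   # skew > latency
print(ok, bad)
```

When the skew is smaller than the latency, the message cannot arrive in an earlier interval than the one it was sent in, so no message appears to travel backward in logical time.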

Challenge 2 - Transactions
1. Processor requests block B
2. Memory processes request
3. Checkpoints 2-5 are not validated until the transaction completes

Challenge 3 - Validation
Validate a checkpoint only once all previous checkpoints are validated
Each component must declare it has received fault-free responses to all of its requests
Validation latency depends on fault detection latency
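One way to picture the validation rule is that the global recovery point can only advance to the newest checkpoint that every component has already declared fault-free. The sketch below assumes each component reports the highest checkpoint number it has validated in order; the function name and component labels are hypothetical, not part of SafetyNet's interface.

```python
def new_recovery_point(current_rp: int, validated_up_to: dict) -> int:
    """Advance the recovery point to the newest checkpoint validated by all.

    validated_up_to maps a component name to the highest checkpoint number
    that component has declared fault-free (assumed contiguous, since a
    checkpoint is validated only after all earlier ones).
    """
    slowest = min(validated_up_to.values())
    # The recovery point never moves backward.
    return max(current_rp, slowest)


reports = {"cpu0": 5, "cpu1": 3, "mem0": 4}
print(new_recovery_point(2, reports))
```

Here the slowest component (`cpu1`) bounds the recovery point, which matches the observation that validation latency follows the latency of fault detection.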

Challenge 3
SafetyNet must advance the recovery point
  – Pipeline checkpoint validation off of the critical path
  – Hide latency of fault detection mechanisms
Continue execution even if detection is a long-latency mechanism

Recovery
If the recovery point cannot be advanced for a given amount of time, an error must have occurred that prevented message delivery
State is rolled back or restored
In-flight transactions are discarded
Restart message is broadcast when recovery (and reconfiguration) completes
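The rollback step can be sketched as undoing logged old values, newest first, until only entries older than the recovery point remain. This is a minimal illustration under assumed data structures: the `LogEntry` fields, the `roll_back` name, and the convention that interval k begins at checkpoint k are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    interval: int   # checkpoint interval in which the block was overwritten
    addr: int       # block address
    old_value: int  # value the block held before the overwrite


def roll_back(state: dict, log: list, recovery_point: int) -> int:
    """Restore `state` to checkpoint `recovery_point`.

    Assumes `log` is in time order. Entries are undone newest first, so the
    oldest logged value for each block is the one that survives.
    """
    undone = 0
    while log and log[-1].interval >= recovery_point:
        entry = log.pop()
        state[entry.addr] = entry.old_value
        undone += 1
    return undone


# Block 0x10 held 1, was overwritten with 2 in interval 3 and 3 in interval 4.
state = {0x10: 3}
log = [LogEntry(3, 0x10, 1), LogEntry(4, 0x10, 2)]
roll_back(state, log, recovery_point=3)
print(state)
```

Discarding in-flight transactions and broadcasting the restart message are left out of this sketch.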

Implementation
Checkpoint Log Buffer (CLB) logs stored state
  – Add a checkpoint number (CN) to blocks; log an update if CCN ≥ CN, where CCN is the component's current checkpoint number
Shadow registers hold register checkpoints
Service processors coordinate recovery
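The CN/CCN check can be sketched as follows. Two assumptions are made here and are not confirmed by the slide text: the lost comparison operator is read as CCN ≥ CN, and a logged block's CN is then set to CCN + 1 so that later stores in the same interval skip logging. The `Cache` class and its method names are hypothetical.

```python
class Cache:
    """Toy cache with a Checkpoint Log Buffer (CLB)."""

    def __init__(self):
        self.ccn = 0    # component's current checkpoint number
        self.data = {}  # block address -> value
        self.cn = {}    # block address -> block checkpoint number (CN)
        self.clb = []   # log entries: (ccn at logging, address, old value)

    def store(self, addr, value):
        # Assumed rule: log only if CCN >= CN, i.e. this is the first
        # update to the block in the current checkpoint interval.
        if self.ccn >= self.cn.get(addr, 0):
            self.clb.append((self.ccn, addr, self.data.get(addr)))
            self.cn[addr] = self.ccn + 1  # assumed update: CN := CCN + 1
        self.data[addr] = value

    def advance_checkpoint(self):
        self.ccn += 1


cache = Cache()
cache.store(0x40, 1)   # first store in interval 0: logged
cache.store(0x40, 2)   # same interval: not logged
cache.advance_checkpoint()
cache.store(0x40, 3)   # first store in interval 1: logged with old value 2
print(len(cache.clb))
```

Under these assumptions the CLB grows with the number of distinct blocks written per interval rather than with the total number of stores.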

Evaluation
Hard or soft faults
  – Dropped message, failed switch
Multiple benchmarks
  – OLTP, SPECjbb, Apache, dynamic web service, SPLASH scientific
Simulate 16-processor system with Simics
  – 100-cycle register checkpoint, 8-cycle store logging, 100K-cycle checkpoint interval

Performance
Insignificant difference for fault-free execution
No crash on faults
Energy efficiency?

Sensitivity
Stores requiring log entry decrease as checkpoint interval decreases
CLB size is dependent on interval and program behavior, not cache size

Generalizing
SafetyNet can recover from any fault where:
  – A mechanism in the system can detect the fault (or its absence)
  – Faults are detected while a recovery point is still being maintained