SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery. Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, David A. Wood

Presentation transcript:

SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery
Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, David A. Wood
March 31st, 2006

Target: systems where availability is crucial
- SMP commercial servers: application services, database management systems
Motivation:
- Increase in performance => decrease in feature size => decrease in reliability
- The cost of a fault-tolerant solution is important

Approach and Challenges
Decouple:
- Local fault detection: ECC, timeout, etc.
- Lightweight, global fault recovery: SafetyNet
Challenges for lightweight recovery schemes:
- Amount of storage (checkpoint logs)
- Maintaining a consistent global recovery point
- Advancing the global recovery point

SafetyNet: High-Level View
Maintain per-processor checkpoints:
- One globally validated recovery point
- Multiple coordinated checkpoints pending validation
- Checkpoints identified by a global logical timestamp
Fault detected => recover state to the (global) recovery point

Solutions: Storage
Checkpoint the architectural state:
- Registers: shadow registers or cached copies
  - Copied once, at the beginning of each checkpoint
- Memory and caches: Checkpoint Log Buffers (CLBs)
  - Incrementally log stores and ownership changes
  - Log only the first update per block per checkpoint
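The "log only the first update per block per checkpoint" rule can be illustrated with a small software model. This is a minimal sketch, not the hardware design from the slides: the class name `CheckpointLogBuffer`, its fields, and the use of a Python dict for memory are all assumptions made for illustration.

```python
class CheckpointLogBuffer:
    """Toy model of a CLB (all names here are hypothetical).

    The old value of a block is logged the first time the block is
    overwritten in each checkpoint interval; later stores to the same
    block in the same interval are not logged again.
    """

    def __init__(self):
        self.memory = {}       # block address -> current value
        self.block_cn = {}     # block address -> interval of its last logged update
        self.log = []          # entries: (checkpoint number, address, old value)
        self.current_cn = 0    # current checkpoint number

    def start_checkpoint(self):
        # Begin a new checkpoint interval.
        self.current_cn += 1

    def store(self, addr, value):
        # Log the old value only on the first update in this interval.
        if self.block_cn.get(addr) != self.current_cn:
            self.log.append((self.current_cn, addr, self.memory.get(addr)))
            self.block_cn[addr] = self.current_cn
        self.memory[addr] = value
```

In this model, two stores to the same block within one interval produce a single log entry, which is what keeps the log size proportional to the number of distinct blocks touched per checkpoint rather than to the number of stores.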

Solution: Global Coherence
Logical time base:
- All components agree on the checkpoint interval to which each coherence transaction belongs
- Loosely synchronous checkpoint clock
- Maintain a per-block checkpoint number (CN)
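One way to see why a loosely synchronous clock can still serve as a logical time base is to check that a message never arrives in an earlier checkpoint interval than the one it was sent in. The sketch below is illustrative only. It assumes (this condition is not stated on the slide) that the clock skew between two nodes is smaller than the minimum message latency; the function names, the fixed interval length, and the integer clock values are also hypothetical.

```python
def checkpoint_number(local_clock, interval):
    # Checkpoint interval that a local clock reading falls into.
    return local_clock // interval


def arrives_in_earlier_checkpoint(send_time, sender_skew, receiver_skew,
                                  latency, interval):
    """Return True if the receiver sees the message in an earlier
    checkpoint interval than the sender stamped it with.

    send_time is global time; each node reads global time plus its skew.
    """
    sent_cn = checkpoint_number(send_time + sender_skew, interval)
    recv_cn = checkpoint_number(send_time + latency + receiver_skew, interval)
    return recv_cn < sent_cn
```

Under the stated assumption the receiver's clock reading on arrival is always later than the sender's reading at departure, so the function returns False for every send time. If the skew exceeds the latency, a message sent just after a checkpoint boundary on one node can arrive before that boundary on another, and the nodes would disagree about which interval the transaction belongs to.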

Solution: Global Recovery Point
Checkpoint validation:
- All components agree that execution up to that point is error free
- Broadcast the new Recovery Point Checkpoint Number
Restart:
- Drain the interconnection network
- Discard in-progress coherence state
- Processors: restore the register checkpoint
- Memory: undo the actions recorded in the Checkpoint Log Buffers (CLBs)
- Caches: undo the actions recorded in the CLB
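The validation and restart steps can be sketched in software as follows. This is a simplified model under stated assumptions, not the mechanism from the paper: the `Node` class, its methods, and `global_recovery_point` are hypothetical names, checkpoint number k is taken to mean the state at the start of interval k, and register state and the interconnect are left out.

```python
class Node:
    """Toy node with an undo log (all names are hypothetical)."""

    def __init__(self, memory):
        self.memory = dict(memory)  # block address -> value
        self.log = []               # (interval, address, old value), oldest first
        self.validated = 0          # latest checkpoint this node found error free

    def store(self, interval, addr, value):
        # Record the old value so the update can be undone later.
        self.log.append((interval, addr, self.memory.get(addr)))
        self.memory[addr] = value

    def recover(self, recovery_point):
        # Undo, newest first, every update made at or after the recovery point.
        while self.log and self.log[-1][0] >= recovery_point:
            _, addr, old = self.log.pop()
            if old is None:
                self.memory.pop(addr, None)  # block did not exist before
            else:
                self.memory[addr] = old

    def commit(self, recovery_point):
        # Entries older than the recovery point can never be undone; free them.
        self.log = [entry for entry in self.log if entry[0] >= recovery_point]


def global_recovery_point(nodes):
    # The recovery point can advance only as far as every node has validated.
    return min(node.validated for node in nodes)
```

For example, if one node has validated checkpoint 2 and another only checkpoint 1, the global recovery point in this model is 1, and a fault rolls both nodes back to their state at the start of interval 1.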

Evaluation: Performance Impact

Evaluation: Sensitivity

Evaluation: Sensitivity (Cont)

Questions
- Why is having a coordinated checkpoint important?
- Why broadcast the Recovery Point Checkpoint Number twice: when advancing the recovery point and when triggering recovery?
- Why a sequentially consistent model? Is the scheme valid for processor consistency?
- Is this a good idea? Has it caught on?