ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang, Josep Torrellas University.

Slides:

Advertisements

Similar presentations

TRANSACTION PROCESSING SYSTEM ROHIT KHOKHER. TRANSACTION RECOVERY TRANSACTION RECOVERY TRANSACTION STATES SERIALIZABILITY CONFLICT SERIALIZABILITY VIEW.

Advertisements

(C) 2002 Daniel SorinWisconsin Multifacet Project SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery.

Recovery CPSC 356 Database Ellen Walker Hiram College (Includes figures from Database Systems by Connolly & Begg, © Addison Wesley 2002)

1 Cheriton School of Computer Science 2 Department of Computer Science RemusDB: Transparent High Availability for Database Systems Umar Farooq Minhas 1,

Thread-Level Transactional Memory Decoupling Interface and Implementation UW Computer Architecture Affiliates Conference Kevin Moore October 21, 2004.

1 Lecture 12: Hardware/Software Trade-Offs Topics: COMA, Software Virtual Memory.

ICS (072)Database Recovery1 Database Recovery Concepts and Techniques Dr. Muhammad Shafique.

Recovery Fall 2006McFadyen Concepts Failures are either: catastrophic to recover one restores the database using a past copy, followed by redoing.

CS-550 (M.Soneru): Recovery [SaS] 1 Recovery. CS-550 (M.Soneru): Recovery [SaS] 2 Recovery Computer system recovery: –Restore the system to a normal operational.

(C) 2004 Daniel SorinDuke Architecture Using Speculation to Simplify Multiprocessor Design Daniel J. Sorin 1, Milo M. K. Martin 2, Mark D. Hill 3, David.

Checkpoint Based Recovery from Power Failures Christopher Sutardja Emil Stefanov.

Database Systems: Design, Implementation, and Management Eighth Edition Chapter 10 Transaction Management and Concurrency Control.

CS252/Patterson Lec /28/01 CS 213 Lecture 10: Multiprocessor 3: Directory Organization.

RAID-x: A New Distributed Disk Array for I/O-Centric Cluster Computing Kai Hwang, Hai Jin, and Roy Ho.

Rebound: Scalable Checkpointing for Coherent Shared Memory Rishi Agarwal, Pranav Garg, and Josep Torrellas Department of Computer Science University of.

Highly Available ACID Memory Vijayshankar Raman. Introduction §Why ACID memory? l non-database apps: want updates to critical data to be atomic and persistent.

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.17.1 FAULT TOLERANT SYSTEMS Chapter 6 – Checkpointing.

1 Fault Tolerance in the Nonstop Cyclone System By Scott Chan Robert Jardine Presented by Phuc Nguyen.

Lazy Release Consistency for Software Distributed Shared Memory Pete Keleher Alan L. Cox Willy Z.

THE DESIGN AND IMPLEMENTATION OF A LOG-STRUCTURED FILE SYSTEM M. Rosenblum and J. K. Ousterhout University of California, Berkeley.

EEC 688/788 Secure and Dependable Computing Lecture 7 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University

SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill,

SafetyNet Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill,

Thread-Level Speculation Karan Singh CS

Chapter 15 Recovery. Topics in this Chapter Transactions Transaction Recovery System Recovery Media Recovery Two-Phase Commit SQL Facilities.

© Dennis Shasha, Philippe Bonnet 2001 Log Tuning.

1 CSE 8343 Presentation # 2 Fault Tolerance in Distributed Systems By Sajida Begum Samina F Choudhry.

(C) 2003 Daniel SorinDuke Architecture Dynamic Verification of End-to-End Multiprocessor Invariants Daniel J. Sorin 1, Mark D. Hill 2, David A. Wood 2.

Serverless Network File Systems Overview by Joseph Thompson.

ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Availability Copyright 2004 Daniel J. Sorin Duke University.

Fault Tolerant Extensions to Charm++ and AMPI presented by Sayantan Chakravorty Chao Huang, Celso Mendes, Gengbin Zheng, Lixia Shi.

Computer Science and Engineering Parallel and Distributed Processing CSE 8380 April 5, 2005 Session 22.

Safetynet: Improving The Availability Of Shared Memory Multiprocessors With Global Checkpoint/Recovery D. Sorin M. Martin M. Hill D. Wood Presented by.

Lecture 12 Fault Tolerance, Logging and recovery Thursday Oct 8 th, Distributed Systems.

Lazy Release Consistency for Software Distributed Shared Memory Pete Keleher Alan L. Cox Willy Z. By Nooruddin Shaik.

Availability in CMPs By Eric Hill Pranay Koka. Motivation RAS is an important feature for commercial servers –Server downtime is equivalent to lost money.

Middleware for Fault Tolerant Applications Lihua Xu and Sheng Liu Jun, 05, 2003.

Transactional Recovery and Checkpoints Chap

Transactional Recovery and Checkpoints. Difference How is this different from schedule recovery? It is the details to implementing schedule recovery –It.

Transactional Memory Coherence and Consistency Lance Hammond, Vicky Wong, Mike Chen, Brian D. Carlstrom, John D. Davis, Ben Hertzberg, Manohar K. Prabhu,

FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI Gengbin Zheng Lixia Shi Laxmikant V. Kale Parallel Programming Lab.

Performance of Snooping Protocols Kay Jr-Hui Jeng.

CMSC 611: Advanced Computer Architecture Shared Memory Most slides adapted from David Patterson. Some from Mohomed Younis.

Computer Science Lecture 19, page 1 CS677: Distributed OS Last Class: Fault tolerance Reliable communication –One-one communication –One-many communication.

1 Fault Tolerance and Recovery Mostly taken from

Operating System Reliability Andy Wang COP 5611 Advanced Operating Systems.

Database Recovery Techniques

Rebound: Scalable Checkpointing for Coherent Shared Memory

Enforcing the Atomic and Durable Properties

Architecture and Design of AlphaServer GS320

The University of Adelaide, School of Computer Science

Operating System Reliability

Operating System Reliability

CMSC 611: Advanced Computer Architecture

Example Cache Coherence Problem

Transactional Memory Coherence and Consistency

Operating System Reliability

Fault Tolerance Distributed Web-based Systems

Operating System Reliability

Outline Introduction Background Distributed DBMS Architecture

High Performance Computing

Operating System Reliability

The University of Adelaide, School of Computer Science

Co-designed Virtual Machines for Reliable Computer Systems

A Beginners Guide to Transactions

The University of Adelaide, School of Computer Science

Operating System Reliability

University of Wisconsin-Madison Presented by: Nick Kirchem

Operating System Reliability

Presentation transcript:

ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang, Josep Torrellas University of Illinois at Urbana-Champaign Hewlett-Packard Laboratories Isaac Liu

Introduction Targeting large scale applications that provide services (need high availability) Improvements in silicon technology make modern integrated circuits prone to transient and permanent faults FER vs. BER ◦ Hardware redundancy vs. recovery

ReVive design Goal: Cost-effective general-purpose rollback recovery ◦ Modest amount of hardware (cost-effective) ◦ Recovery from a wide class of errors (General-purpose) ◦ Short system downtime due to error (high availability) ◦ Low overhead when error-free (high performance)

Hardware Modifications

Design Choices ◦ Checkpoint Storage:  Safe Internal Storage with Distributed parity  Safe External  Specialized fault class ◦ Checkpoint Separation:  Partial separation with Logging  Full separation  Partial separation with buffering (renaming) ◦ Checkpoint Consistency:  Global  (Un) Coordinated Local

Overview Periodically establish checkpoint Between checkpoints, whenever main memory written to, log the data to maintain checkpoint state. If error is detected, then use the logs to roll back state.

Design Choices ◦ Checkpoint Storage:  Safe Internal Storage with Distributed parity ◦ Checkpoint Separation:  Partial separation with Logging ◦ Checkpoint Consistency:  Global

Distributed Parity

Design Choices ◦ Checkpoint Storage:  Safe Internal Storage with Distributed parity ◦ Checkpoint Separation:  Partial separation with Logging ◦ Checkpoint Consistency:  Global

Logging

Design Choices ◦ Checkpoint Storage:  Safe Internal Storage with Distributed parity ◦ Checkpoint Separation:  Partial separation with Logging ◦ Checkpoint Consistency:  Global Checkpoint

Global checkpoint Commit all work and states to main memory. Two phase commit protocol, first sync is tentative commit, and then sync again to fully commit. Keeps two most recent checkpoints.

Global Checkpoint

Implementation issues Extra L bit for each directory entry New states in directory protocol, new messages (parity update/ack) Race Conditions ◦ Log-Data Update race ◦ Atomic Log Update Race ◦ Log-Parity Update Race ◦ Data-Parity Update Race ◦ Checkpoint commit Race

Rollback

Overhead Logging and parity maintenance ◦ Depends on application Global Checkpoint ◦ cross-processor interrupt ◦ Write dirty data to memory Rollback ◦ Recovery + Lost work + Rebuild lost memory pages

Evaluation environment CC-NUMA multiprocessor with 16 nodes Non-blocking and write-back cache Full-map directory and cache coherent protocol similar to DASH. Cache size: ◦ 16KB for L1, 128kB for L2 *Applications run on smaller problems sizes and shorter periods

Evaluation Results Cp10ms – Parity and checkpoint every 10ms CpInf – Parity and checkpoint with infinite interval Cp10msM – Mirror and checkpoint every 10ms CpInfM –Mirror and checkpoint with infinite interval

Traffic Par – parity updates Ckp – checkpoint WB – writeback RD/RDX- cache miss LOG – writing to logs

Overhead

ReVive vs. SafetyNet Both use log-based rollback mechanisms ReVive enables recovery from a permanent node ReVive does not need to change processor’s cache ReVive is more general, so it may result in larger performance overhead.

Conclusion ReVive provides: ◦ Modest amount of hardware (cost-effective) ◦ Recovery from a wide class of errors (General-purpose) ◦ Short system downtime due to error (high availability) ◦ Low overhead when error-free (high performance)