Checkpoint Based Recovery from Power Failures
Christopher Sutardja, Emil Stefanov

Goals
Consistent checkpoint
  – A consistent snapshot of memory for a specific time in the past.
Safe even under power failure
  – The checkpoint is never "in transition".
Small storage overhead
  – Not much more than double the memory.
Low performance overhead
  – Should not stall the processor for too long.
Scalable
  – Scales well in large core networks such as meshes.

Related Work
On the Feasibility of Incremental Checkpointing for Scientific Computing, by J. Sancho et al.
  – Speculates about the future role of checkpointing in parallel machines.
  – As the number of processing nodes grows exponentially, failure of any one node becomes much more likely.
  – Error correction codes and other redundancies would introduce too much overhead when used alone.
  – As a result, research on checkpoint recovery is growing in importance.

Related Work
Modular Checkpointing for Atomicity, by L. Ziarek et al.
  – Introduces an abstraction called stabilizers to make checkpointing easier.
  – Targets message-passing machines.
      Makes consistent checkpointing more challenging.

Related Work
SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery, by D. Sorin et al.
  – Explores the concept of checkpointing in logical time.
  – Multiple checkpoints.
  – Each dirty cache line has a tag indicating when it was modified relative to a checkpoint.
  – Low execution overhead.
  – Not safe from power failures.

Related Work
ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors, by M. Prvulovic et al.
  – Explores different ways of doing rollback recovery in shared-memory multiprocessor systems. Considers:
      the scope of the checkpoint
      the memory checkpointing mechanism
  – Achieves about 6% checkpointing overhead.
  – Not safe from power failures.
  – Not geared towards non-volatile memory: requires fast writes.

Related Work
Efficient Initialization and Crash Recovery for Log-based File Systems over Flash Memory, by Chin Wu et al.
  – As flash memory becomes cheaper and denser, the uses for flash increase.
  – Uses flash for recovering file systems: yet another use of flash for recovery.
  – Uses a log-based method to accelerate remounting after a system crash by minimizing the amount of information that has to be changed upon reboot.

[Diagram: Core, L1, L2, Memory Controller, DRAM]

[Diagram: Core, L1, and L2, each with Checkpoint A and Checkpoint B storage; a Cache Checkpoint Controller for each cache; Memory Controller and DRAM; DRAM Checkpointers with an Address Decoder and four Checkpoint Buffer / Log pairs; Checkpoint Coordinator]

Checkpointing Techniques
For caches and cores:
  – Each cache/core has two flash storages adjacent to it: one for the previous checkpoint and one for the current checkpoint.
  – During a checkpoint, the cache/core internal state is copied to flash storage.
For DRAM:
  – The checkpointing system snoops on DRAM.
  – DRAM changes are continuously logged to flash memory.
  – A chain of parallel buffers ensures that DRAM checkpointing almost never causes a stall.
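The DRAM logging idea can be illustrated with a small software model. This is only a sketch under stated assumptions, not the authors' hardware design: the class name `DramCheckpointer`, the `BOUNDARY` marker record, the single buffer (the slides describe a chain of parallel buffers), and the helper `committed_writes` are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical marker record written to the log at a checkpoint boundary.
BOUNDARY = ("BOUNDARY",)


@dataclass
class DramCheckpointer:
    """Toy model: snooped DRAM writes are staged in a buffer, then appended to a flash log."""
    buffer_capacity: int = 4
    buffer: list = field(default_factory=list)     # volatile staging buffer
    flash_log: list = field(default_factory=list)  # stands in for the non-volatile log

    def on_write(self, addr, value):
        """Record one snooped DRAM write; flush when the buffer is full."""
        self.buffer.append((addr, value))
        if len(self.buffer) >= self.buffer_capacity:
            self.flush()

    def flush(self):
        """Move all buffered writes into the flash log."""
        self.flash_log.extend(self.buffer)
        self.buffer.clear()

    def mark_boundary(self):
        """Mark the end of a checkpoint interval in the flash log."""
        self.flash_log.append(BOUNDARY)


def committed_writes(flash_log):
    """Return the logged writes up to the last boundary marker, in order."""
    last = -1
    for i, entry in enumerate(flash_log):
        if entry == BOUNDARY:
            last = i
    return [e for e in flash_log[: last + 1] if e != BOUNDARY]
```

In this model, writes logged after the last boundary marker belong to an unfinished checkpoint interval, so `committed_writes` ignores them.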

Responsibilities of the Main Components
Checkpoint Coordinator
  – Notifies the nodes and DRAM checkpointers that a checkpoint is beginning.
DRAM Checkpointer
  – Continuously logs DRAM changes.
  – Checkpoints when instructed by the coordinator.
Cache Checkpoint Controller
  – Checkpoints the adjacent cache when instructed by the coordinator.

Steps for Checkpointing (1 of 2)
1. The coordinator sets the checkpoint signal to 1.
2. In parallel, each
   a. Core:
      i. Pauses processing instructions.
      ii. Copies internal state to flash memory.
   b. Cache Checkpoint Controller:
      i. Copies cache internal state to flash memory (data is copied one line at a time).
   c. DRAM Checkpointer:
      i. Flushes buffer to flash log.
      ii. Notifies checkpoint coordinator that the buffer has been flushed.

Steps for Checkpointing (2 of 2)
3. The coordinator sets the checkpoint signal to 0.
4. In parallel, each
   a. Core:
      i. Flips flash memory bit to indicate the new checkpoint buffer.
   b. Cache Checkpoint Controller:
      i. Flips flash memory bit to indicate the new checkpoint buffer.
   c. DRAM Checkpointer:
      i. Marks checkpoint boundary in flash log.
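The two-phase structure of these steps (copy first, flip the selector bit afterwards) can be sketched in software. This is a minimal model under assumptions, not the presented hardware: the names `DoubleBufferedSlot`, `run_checkpoint`, and the `crash_after` parameter are hypothetical, units run sequentially here rather than in parallel, and a power failure between individual bit flips is not modeled because the slides do not say how that case is handled.

```python
class DoubleBufferedSlot:
    """Two checkpoint slots (A and B) plus a one-bit selector for one core or cache."""

    def __init__(self, initial_state):
        self.slots = [list(initial_state), list(initial_state)]
        self.valid = 0  # index of the slot holding the last complete checkpoint

    def copy_state(self, state, crash_after=None):
        """Phase 1: copy state into the inactive slot, one item at a time.

        crash_after (hypothetical) simulates power loss after that many items.
        Returns False if the copy was interrupted.
        """
        target = 1 - self.valid
        for i, item in enumerate(state):
            if crash_after is not None and i >= crash_after:
                return False
            self.slots[target][i] = item
        return True

    def commit(self):
        """Phase 2: flip the selector bit so the new slot becomes the valid one."""
        self.valid = 1 - self.valid

    def recover(self):
        """Return the contents of the last complete checkpoint."""
        return list(self.slots[self.valid])


def run_checkpoint(units, states, crash_after=None):
    """Coordinator sketch: all units copy (signal = 1), then all units flip (signal = 0)."""
    for unit, state in zip(units, states):
        if not unit.copy_state(state, crash_after):
            return False  # power failed during phase 1; no selector bit was flipped
    for unit in units:
        unit.commit()
    return True
```

Because the selector bit only changes after every copy has finished, an interrupted copy leaves the previously valid slot untouched in this model.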

[Diagram: Core, L1, and L2, each with Checkpoint A and Checkpoint B storage; a Cache Checkpoint Controller for each cache; blocks labeled F]

[Diagram: Address Decoder and four Checkpoint Buffer / Log pairs; log regions labeled Previous Checkpoint (random access), Previous Checkpoint Changes, Next Checkpoint Changes (start, end), and Buffered Changes]

Recovering
1. Determining which checkpoint to use
   a. The system checks which checkpoint is the most recent.
   b. If the most recent checkpoint was in progress during the crash, the older checkpoint is used.
2. Restoring the previous state
   a. Each architectural register is rewritten.
   b. Each cache is rewritten from its adjacent flash buffer (one cache line at a time).
   c. Main memory is recovered.
   d. Take advantage of pipelined writes if available.
3. Resuming execution
   a. Resume from the restored program counter.
   b. Notify the CPUs that the system is restoring from a checkpoint (single bit).
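The recovery steps can be sketched as follows. This is an illustrative model only: the `Checkpoint` and `Machine` classes, the `seq` and `complete` fields, and the `restored_flag` attribute are assumptions made for this sketch (the slides use a flash memory bit to identify the current checkpoint buffer and do not describe its metadata).

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    seq: int           # hypothetical sequence number; higher means more recent
    complete: bool     # False if power failed while this checkpoint was being written
    registers: dict    # architectural registers, including the program counter
    cache_lines: list
    memory: dict


@dataclass
class Machine:
    registers: dict
    cache: list
    memory: dict
    restored_flag: bool = False  # the single "restoring from a checkpoint" bit


def select_checkpoint(a, b):
    """Step 1: use the most recent checkpoint unless it was still in progress."""
    newest, older = (a, b) if a.seq > b.seq else (b, a)
    return newest if newest.complete else older


def recover(machine, a, b):
    """Steps 2 and 3: restore state from the selected checkpoint and flag the restore."""
    cp = select_checkpoint(a, b)
    machine.registers = dict(cp.registers)        # 2a: rewrite each register
    machine.cache = [None] * len(cp.cache_lines)
    for i, line in enumerate(cp.cache_lines):     # 2b: one cache line at a time
        machine.cache[i] = line
    machine.memory = dict(cp.memory)              # 2c: recover main memory
    machine.restored_flag = True                  # 3b: notify via a single bit
    return cp.seq
```

Execution would then resume from `machine.registers["pc"]`, assuming the program counter is stored under that (hypothetical) key.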