ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Availability Copyright 2004 Daniel J. Sorin Duke University.

Slides:

Advertisements

Similar presentations

Multi-core systems System Architecture COMP25212 Daniel Goodman Advanced Processor Technologies Group.

Advertisements

Fault-Tolerant Systems Design Part 1.

(C) 2002 Daniel SorinWisconsin Multifacet Project SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery.

COE 444 – Internetwork Design & Management Dr. Marwan Abu-Amara Computer Engineering Department King Fahd University of Petroleum and Minerals.

Microprocessor Reliability

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.1 FAULT TOLERANT SYSTEMS Part 1 - Introduction.

Maintaining Data Integrity in Programmable Logic in Atmospheric Environments through Error Detection Joel Seely Technical Marketing Manager Military &

(C) 2005 Daniel SorinDuke Computer Engineering Autonomic Computing via Dynamic Self-Repair Daniel J. Sorin Department of Electrical & Computer Engineering.

3. Hardware Redundancy Reliable System Design 2010 by: Amir M. Rahmani.

CS 7810 Lecture 25 DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design T. Austin Proceedings of MICRO-32 November 1999.

Microarchitectural Approaches to Exceeding the Complexity Barrier © Eric Rotenberg 1 Microarchitectural Approaches to Exceeding the Complexity Barrier.

(C) 2002 Daniel SorinDuke Architecture Why Computer Architecture is Exciting and Challenging Daniel Sorin Department of Electrical & Computer Engineering.

1 Lecture 26: Storage Systems Topics: Storage Systems (Chapter 6), other innovations Final exam stats:  Highest: 95  Mean: 70, Median: 73  Toughest.

CS-550 (M.Soneru): Recovery [SaS] 1 Recovery. CS-550 (M.Soneru): Recovery [SaS] 2 Recovery Computer system recovery: –Restore the system to a normal operational.

7. Fault Tolerance Through Dynamic or Standby Redundancy 7.5 Forward Recovery Systems Upon the detection of a failure, the system discards the current.

(C) 2004 Daniel SorinDuke Architecture Using Speculation to Simplify Multiprocessor Design Daniel J. Sorin 1, Milo M. K. Martin 2, Mark D. Hill 3, David.

Page 1 Copyright © Alexander Allister Shvartsman CSE 6510 (461) Fall 2010 Selected Notes on Fault-Tolerance (12) Alexander A. Shvartsman Computer.

1 Fault-Tolerant Computing Systems #2 Hardware Fault Tolerance Pattara Leelaprute Computer Engineering Department Kasetsart University

2. Fault Tolerance. 2 Fault - Error - Failure Fault = physical defect or flow occurring in some component (hardware or software) Error = incorrect behavior.

Presenter: Jyun-Yan Li Multiplexed redundant execution: A technique for efficient fault tolerance in chip multiprocessors Pramod Subramanyan, Virendra.

1 Fault Tolerance in the Nonstop Cyclone System By Scott Chan Robert Jardine Presented by Phuc Nguyen.

Transient Fault Detection via Simultaneous Multithreading Shubhendu S. Mukherjee VSSAD, Alpha Technology Compaq Computer Corporation.

Introduction CSE 410, Spring 2008 Computer Systems

IBM S/390 Parallel Enterprise Server G5 fault tolerance: A historical perspective by L. Spainhower & T.A. Gregg Presented by Mahmut Yilmaz.

Dynamic Verification of Cache Coherence Protocols Jason F. Cantin Mikko H. Lipasti James E. Smith.

Chapter 1 Introduction. Architecture & Organization 1 Architecture is those attributes visible to the programmer —Instruction set, number of bits used.

SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill,

SafetyNet Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill,

Part.1.1 In The Name of GOD Welcome to Babol (Nooshirvani) University of Technology Electrical & Computer Engineering Department.

Fault-Tolerant Systems Design Part 1.

(C) 2003 Daniel SorinDuke Architecture Dynamic Verification of End-to-End Multiprocessor Invariants Daniel J. Sorin 1, Mark D. Hill 2, David A. Wood 2.

CprE 458/558: Real-Time Systems

Fault-Tolerant Systems Design Part 1.

Safetynet: Improving The Availability Of Shared Memory Multiprocessors With Global Checkpoint/Recovery D. Sorin M. Martin M. Hill D. Wood Presented by.

Introduction to Fault Tolerance By Sahithi Podila.

Process Redundancy for Future-Generation CMP Fault Tolerance Dan Gibson ECE 753 Project Presentation.

A Survey of Fault Tolerance in Distributed Systems By Szeying Tan Fall 2002 CS 633.

Adding Algorithm Based Fault-Tolerance to BLIS Tyler Smith, Robert van de Geijn, Mikhail Smelyanskiy, Enrique Quintana-Ortí 1.

DS - IX - NFT - 0 HUMBOLDT-UNIVERSITÄT ZU BERLIN INSTITUT FÜR INFORMATIK DEPENDABLE SYSTEMS Vorlesung 9 NETWORK FAULT TOLERANCE Wintersemester 99/00 Leitung:

Dynamic Verification of Sequential Consistency Albert Meixner Daniel J. Sorin Dept. of Computer Dept. of Electrical and Science Computer Engineering Duke.

CS203 – Advanced Computer Architecture Dependability & Reliability.

Operating System Reliability Andy Wang COP 5611 Advanced Operating Systems.

Introduction to: The Architecture of the Internet

Recovery Control (Chapter 17)

CSE 410, Spring 2006 Computer Systems

Fault Tolerance & Reliability CDA 5140 Spring 2006

Operating System Reliability

Operating System Reliability

Fault Tolerance In Operating System

Maintaining Data Integrity in Programmable Logic in Atmospheric Environments through Error Detection Joel Seely Technical Marketing Manager Military &

Presentation Title Greg Snider QSR, Hewlett-Packard Laboratories

RAID RAID Mukesh N Tekwani

Introduction to: The Architecture of the Internet

Operating System Reliability

Fault Tolerance Distributed Web-based Systems

Operating System Reliability

Mattan Erez The University of Texas at Austin July 2015

Introduction to Fault Tolerance

Chapter 1 Introduction.

Parallel and Distributed Simulation

Introduction to: The Architecture of the Internet

Operating System Reliability

Co-designed Virtual Machines for Reliable Computer Systems

RAID RAID Mukesh N Tekwani April 23, 2019

Dynamic Verification of Sequential Consistency

Reliability and Error Control 5/17/11

Operating System Reliability

University of Wisconsin-Madison Presented by: Nick Kirchem

Seminar on Enterprise Software

Operating System Reliability

Presentation transcript:

ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Availability Copyright 2004 Daniel J. Sorin Duke University

2 (C) 2004 Daniel J. Sorin ECE 259 / CPS 221 Outline Definition and Motivation General Principles of Available System Design Traditional High-Availability Systems Recent Innovations in Available Systems

3 (C) 2004 Daniel J. Sorin ECE 259 / CPS 221 Definitions Availability = probability that system is working OK –Function of error rate and time to recover Fault = physical defect 1)Cosmic particle disrupts charge on SRAM cell 2)Electromigration causes open circuit in switch in interconnect Error = manifestation of a fault 1)Bit flip in cache 2)Dead switch in interconnect Availability is not the same as reliability –Reliability = probability that system is OK until time T –Reliability is useful metric for mission critical systems

4 (C) 2004 Daniel J. Sorin ECE 259 / CPS 221 Availability Motivation, part 1 Fault rates are increasing –More faults  more errors  more downtime  less availability Technological reasons –Smaller transistors –Denser wires Architectural reasons –More components –More aggressive designs

5 (C) 2004 Daniel J. Sorin ECE 259 / CPS 221 Availability Motivation, part 2 We’re relying on computers more and more –Business –Education –Government High availability isn’t just for NASA any more Can’t afford to have unavailable services

6 (C) 2004 Daniel J. Sorin ECE 259 / CPS 221 Outline Definition and Motivation General Principles of Available System Design Traditional High-Availability Systems Recent Innovations in Available Systems

7 (C) 2004 Daniel J. Sorin ECE 259 / CPS 221 Goals System should correct errors as they occur –Or at least detect them Detecting an error and then crashing can be OK –Better than letting error propagate and corrupt vital data Error types –Transient: occurs once and disappears »Bit flip on link due to cosmic ray impact –Intermittent: occurs off and on »Bit flip on link due to loose connection –Permanent: occurs once and stays »Bit on link stuck at one due to short circuit

8 (C) 2004 Daniel J. Sorin ECE 259 / CPS 221 Forward Error Recovery Use redundant hardware to mask faults –E.g., triple modular redundancy with voter or pair&spare Commonly used at many levels –ECC for memory and network links –RAID for disks –Redundant processors in mainframes E.g., IBM mainframes, Intel 432, Stratus Sacrifices cost to achieve availability voter CPU

9 (C) 2004 Daniel J. Sorin ECE 259 / CPS 221 Backward Error Recovery If fault detected, recover system to pre-fault state A.Periodically stop system and save state –Fault? Restore pre-fault recovery point checkpoint B.Log all changes to system state –Fault? Unroll log to undo changes since recovery point E.g., Sequoia, Synapse N+1, Tandem NonStop Sacrifices performance to achieve availability

10 (C) 2004 Daniel J. Sorin ECE 259 / CPS 221 BER and the Output Commit Problem Problem: we can recover our system when we detect an error, but we can’t recover the outside world –Output commit: can’t undo effects of sending bad data out –Input commit: can’t ask outside world to re-send us data Outside world (I/O) –Disks, network, printer, etc. Output commit solution: don’t send data to outside world until we know it is error-free (yuck!) Input commit solution: log data received from outside world and replay it after recovery (not bad)

11 (C) 2004 Daniel J. Sorin ECE 259 / CPS 221 Outline Definition and Motivation General Principles of Available System Design Traditional High-Availability Systems Recent Innovations in Available Systems

12 (C) 2004 Daniel J. Sorin ECE 259 / CPS 221 Traditional Fault Tolerance Availability was most important aspect –Would sacrifice performance and/or hardware cost Usually simple implementations of FER and BER Used to be many companies making fault tolerant computer systems (besides IBM) –Tandem (bought by Compaq) –Synapse –Sequoia –Stratus –Etc.

13 (C) 2004 Daniel J. Sorin ECE 259 / CPS 221 IBM Mainframes The standard for high availability systems DISCUSSION

14 (C) 2004 Daniel J. Sorin ECE 259 / CPS 221 Outline Definition and Motivation General Principles of Available System Design Traditional High-Availability Systems Recent Innovations in Available Systems

15 (C) 2004 Daniel J. Sorin ECE 259 / CPS 221 Recent Advances in Uniprocessor Availability Availability, but with higher performance & lower cost DIVA – dynamic verification of a microprocessor –Uses checker core to dynamically verify aggressive core –FER approach with little performance cost or hardware cost AR-SMT – uses redundant thread to check execution –Similar to using redundant processor, but cheaper (esp. for SMT) –No recovery mechanism specified, just detection But what about multiprocessors? –FER is tough, but BER can be made practical

16 (C) 2004 Daniel J. Sorin ECE 259 / CPS 221 SafetyNet PRESENTATION

17 (C) 2004 Daniel J. Sorin ECE 259 / CPS 221 Recovery Oriented Computing (ROC) Faults will happen – we must deal with them PRESENTATION

18 (C) 2004 Daniel J. Sorin ECE 259 / CPS 221 Outline Definition and Motivation General Principles of Available System Design Traditional High-Availability Systems Recent Innovations in Available Systems