IBM S/390 Parallel Enterprise Server G5 fault tolerance: A historical perspective by L. Spainhower & T.A. Gregg Presented by Mahmut Yilmaz.


Some Terms
Concurrent error detection & repair: The system finds errors & repairs itself while still running
In-line error checking: EDC, ECC
On-line error correction: Correct errors while the system can still operate
Transient (soft) faults: Temporary faults or bit flips, such as Single Event Upsets
Hard faults: Persistent faults that remain active for a significant period of time (forever?)

Background
S/390 failure modes
– Permanent, intermittent and transient faults
– If an error occurs frequently and reaches a threshold → permanent
Thermal Conduction Module (TCM)
– TCM: A liquid cooling method introduced by IBM
– A series of spring-loaded cylinders conducts the heat from the chips to the cooling chamber
– Circuit growth rates exceed reliability gains
– Parity check and ECC were used
– Circuits were encapsulated
– System repair required all system resources
– Most repairs were concurrent
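
The threshold rule on this slide (an error that recurs often enough is treated as permanent) can be sketched as a per-component counter. This is a minimal illustration only: the class name, the component label and the threshold value of 3 are assumptions, not figures from the paper.

```python
from collections import Counter


class FaultClassifier:
    """Hypothetical threshold classifier; not IBM's implementation."""

    def __init__(self, threshold=3):
        self.threshold = threshold     # assumed value, for illustration only
        self.error_counts = Counter()  # errors observed per component

    def record_error(self, component):
        """Count one detected error; return True once the fault is deemed permanent."""
        self.error_counts[component] += 1
        return self.error_counts[component] >= self.threshold


clf = FaultClassifier(threshold=3)
# Three errors on the same (hypothetical) component: the third crosses the threshold.
history = [clf.record_error("cpu0") for _ in range(3)]
print(history)
```

Below the threshold the error is handled as transient or intermittent; at the threshold the component becomes a candidate for sparing or replacement.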

Background (cont.)
CMOS
– G1 (1994) to G5
– G1: Less reliable than 9020; system failures are more probable
– G2: Dynamic memory sparing
– G3: More robust ECC & CPU sparing (manual replacement)
– G4: Concurrent CPU sparing & CPU instruction-level retry
– G5: Most reliable; greatly exceeds any TCM system; well protected against soft faults (hard faults?)

Microprocessor Fault Tolerant Design
Duplication is used by several systems
– Intel, Himalaya systems
– Duplication requires more than 100% hardware overhead
– Error detection only!
Fetch-decode (I-unit) and execute (E-unit) are generally not protected
– S/390 protects them
Transient fault rates are increasing with decreased feature sizes

Microprocessor Fault Tolerant Design (cont.)
G5 Fault Tolerant Design Point
– 9X2: Main goal is to keep CPI low
– G5: Main goal is to keep clock period short
– In-line error protection is not suitable for G5: high fan-out/fan-in, increased chip area, longer wires, increased path length
– Result: Duplicated I-unit and E-unit
– A checker like the DIVA checker: R-unit
– Total hardware overhead: 35%
– No performance penalty (?)

Microprocessor Fault Tolerant Design (cont.)
G5 Fault Tolerant Design Point (cont.)
– Recovery and on-line repair → R-unit
– L1: Store-through cache
– L2: Shared memory, line sparing
– Upon error detection: If retry is not successful → CPU stopped
– Dynamic CPU sparing (DCS)
– Faulty CPU R-unit → spare CPU R-unit
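
The duplicate, compare and retry flow on these slides can be sketched as follows. This is a simplified model under stated assumptions, not the G5 logic: `run_with_retry`, `good_unit`, `make_flaky_unit` and the single-retry default are all hypothetical names and values, and the `state` argument stands in for the checkpoint held by the R-unit.

```python
def run_with_retry(unit_a, unit_b, state, max_retries=1):
    """Run one step on two duplicated units and compare the results.

    `state` plays the role of the checkpoint: it is only replaced when both
    copies agree. Returns (state, status).
    """
    for attempt in range(max_retries + 1):
        result_a = unit_a(state)
        result_b = unit_b(state)
        if result_a == result_b:
            status = "ok" if attempt == 0 else "recovered"
            return result_a, status
    # Retry failed: stop this CPU; the unchanged checkpoint could go to a spare.
    return state, "cpu_stopped"


def good_unit(state):
    return state + 1


def make_flaky_unit(faulty_calls):
    """Build a unit that corrupts its result for its first `faulty_calls` calls."""
    calls = {"n": 0}

    def unit(state):
        calls["n"] += 1
        result = state + 1
        return (result ^ 0x4) if calls["n"] <= faulty_calls else result

    return unit


transient_case = run_with_retry(good_unit, make_flaky_unit(1), state=10)
hard_case = run_with_retry(good_unit, make_flaky_unit(10**9), state=10)
print(transient_case, hard_case)
```

A mismatch only says that one copy is wrong, not which one, which is why the sketch falls back to the last agreed state instead of voting.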

Memory Fault Tolerance
ECC
Permanent fault in L1 → Cache line or quarter cache delete
Permanent fault in L2 → Cache delete
– Data array or address directory marked as invalid
– Spare lines
L3: Main memory
– Background scrubbing
– On-line repair: Built-in spare chips
– Word line or chip kill → After reaching threshold, replace module
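
ECC with background scrubbing can be illustrated with a toy single-error-correcting code. The sketch below uses a textbook Hamming(7,4) code as a stand-in; it is an assumption chosen for brevity, not the ECC used in the G5 memory, and all function names are hypothetical.

```python
def hamming_encode(nibble):
    """Encode 4 data bits as a 7-bit Hamming codeword (list of bits, position 1 first)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 unused; positions 1..7
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code[1:]


def hamming_correct(bits):
    """Return (corrected_bits, error_position); position 0 means no error found."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of set positions locates a single flip
    if syndrome:
        code[syndrome] ^= 1
    return code[1:], syndrome


def hamming_decode(bits):
    """Extract the 4 data bits from a (corrected) codeword."""
    code = [0] + list(bits)
    return sum(bit << i for i, bit in enumerate((code[3], code[5], code[6], code[7])))


def scrub(memory):
    """One background scrubbing pass: fix single-bit errors in place, report where."""
    corrected = []
    for index, word in enumerate(memory):
        fixed, position = hamming_correct(word)
        if position:
            memory[index] = fixed
            corrected.append(index)
    return corrected


memory = [hamming_encode(n) for n in (0x3, 0xA, 0xF)]
memory[1][4] ^= 1                       # inject a soft error into word 1
repaired = scrub(memory)
values = [hamming_decode(word) for word in memory]
print(repaired, values)
```

Scrubbing matters because it removes single-bit errors before a second flip in the same word makes them uncorrectable; the list of corrected words could also feed a threshold counter that triggers sparing.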

I/O & Power/Cooling Subsystem Fault Tolerance
Multiple paths → Path redundancy
Power/Cooling subsystems

Questions
Is duplication the optimal choice? No protection against hard faults!
How to protect a CPU against intermittent faults (delay faults)? Generally, they are the beginning phase of a hard fault
How to protect an ALU by parity check? An adder? (page 868, 1st paragraph)
If the retry is unsuccessful, the CPU is stopped. Wouldn't it be better to use a counter to account for transient faults?
What if a transient fault occurs while retrying?
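
On the adder question, one textbook answer is parity prediction: each sum bit is a_i XOR b_i XOR c_i, so the parity of the sum equals parity(A) XOR parity(B) XOR parity(C), where C is the vector of carries into each bit. The sketch below illustrates that identity only; it is not taken from the paper, and the 8-bit width, function names and `fault_mask` parameter are assumptions.

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1


def parity(x):
    """Return 1 if x has an odd number of set bits, else 0."""
    return bin(x).count("1") & 1


def carry_vector(a, b):
    """Carry into each bit position of a WIDTH-bit ripple adder (carry-in = 0)."""
    carries, carry = 0, 0
    for i in range(WIDTH):
        carries |= carry << i
        ai, bi = (a >> i) & 1, (b >> i) & 1
        carry = (ai & bi) | (ai & carry) | (bi & carry)
    return carries


def predicted_sum_parity(a, b):
    """Predict the parity of the sum without using the sum itself."""
    return parity(a) ^ parity(b) ^ parity(carry_vector(a, b))


def checked_add(a, b, fault_mask=0):
    """Add two WIDTH-bit values; `fault_mask` flips sum bits to model a fault."""
    total = ((a + b) & MASK) ^ fault_mask
    error_detected = parity(total) != predicted_sum_parity(a, b)
    return total, error_detected


clean = checked_add(0x5A, 0x37)
single_flip = checked_add(0x5A, 0x37, fault_mask=0x08)
double_flip = checked_add(0x5A, 0x37, fault_mask=0x03)
print(clean, single_flip[1], double_flip[1])
```

The check catches any odd number of flipped sum bits but misses an even number, and a fault inside the carry logic can corrupt both the sum and the prediction, which is part of why the slides favor duplication over in-line checking.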