Mattan Erez The University of Texas at Austin July 2015


Toward Exascale Resilience, Part 2: System Architecture and Components
Resilience fundamentals
Mattan Erez, The University of Texas at Austin, July 2015

Resilience and reliability cut across the whole system stack:
Algorithm
Program
Runtime
Operating system
Compiler
HW Architecture
HW Devices
(c) Mattan Erez

A concentrated exaflop: a cohesive system for cohesive problems. (Image (c) NVIDIA)

[Photo: from a Cray XC30]

Hardware summary
Cabinets (w/ VRU): shared power, cooling, and links
Modules (w/ VRU)
Sockets
Processors + memories
Gates, wires, and storage devices
Storage separated out (for now)

System software manages resources
Node operating system
Distributed file system and I/O nodes
System management and supervision

Resilience fundamentals
Terminology
Principles
Basic actions

Faults, errors, and failures (c) Mattan Erez, Jungrae Kim
Fault: the cause of a failure, e.g. an open/short circuit, transistor wear-out, a particle strike, ...
Error: a component departs from spec, e.g. a wrong value or wrong control state.
Failure: the service departs from spec, e.g. an incorrect application result.

Fault taxonomy (others exist)
What? A component malfunction; may lead to errors.
Where? Which component? Hardware or software?
How? Natural / accidental / intentional?
When? Permanent / transient; active / dormant; reproducible / random; 0-day?

When do we care about faults?

Error taxonomy
What? State that differs from the fault-free state.
Where? Values or control state.
How? Data dependent / independent.
When? Permanent / transient / intermittent; reproducible / random.
Impact? Masked / detected / silent.

Failure taxonomy
What? Actionable outputs not within spec; performance/power not within spec (degraded operation).
Where? Can be hierarchical, down to a subsystem.
When? Signaled / silent; failstop / failthrough; consistent / inconsistent.

Resilience = operating without failure in the presence of faults and errors
Fault tolerance / fault mitigation / fault avoidance
Error tolerance / error mitigation

The basic actions: (forecast), detect, contain, recover, repair.

Fault/failure forecast
Use past experience to forecast future problems; avoid faults and failures through reconfiguration and preventative repair.
Traditionally done offline, with stress diagnostics.
Online work looks promising, mostly machine-learning approaches; requires effective monitoring.
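
The forecasting idea above can be sketched with a deliberately simple heuristic. This is an illustrative assumption, not a method from the slides: a sustained spike in a node's correctable-error rate is treated as a warning to schedule preventative repair or migration; real systems typically use learned models over richer monitoring data.

```python
from collections import deque

class ErrorRateForecaster:
    """Hypothetical online forecaster: flag a node when its recent
    correctable-error rate stays above a threshold. Window size and
    threshold are made-up illustrative values."""
    def __init__(self, window=6, threshold=5.0):
        self.window = deque(maxlen=window)  # recent hourly error counts
        self.threshold = threshold          # errors/hour deemed "at risk"

    def observe(self, errors_this_hour):
        self.window.append(errors_this_hour)

    def at_risk(self):
        # A sustained elevated correctable-error rate often precedes
        # an uncorrectable error or a component failure.
        if not self.window:
            return False
        return sum(self.window) / len(self.window) > self.threshold

node = ErrorRateForecaster(window=6, threshold=5.0)
for count in [0, 1, 0, 2, 1, 0]:
    node.observe(count)
print(node.at_risk())   # healthy history: not flagged

for count in [8, 12, 9, 11, 10, 14]:
    node.observe(count)
print(node.at_risk())   # sustained spike: schedule preventative repair
```

The point of the sketch is the workflow (monitor, aggregate, act before the failure), not the threshold rule itself.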

Detection
Diagnostics for faults; redundancy for errors (the redundancy may be implicit).
Detected errors may identify faults (root-cause analysis).
Requires effective reporting.
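
As a minimal sketch of redundancy-based error detection (not code from the slides): execute a computation twice and compare. A mismatch detects an error but, with only two copies, cannot say which result is wrong, so detection must be paired with a recovery action. The `flaky_square` fault model below is a hypothetical stand-in for a transient hardware error.

```python
import random

def detect_by_duplication(f, x):
    """Dual modular redundancy: execute twice, compare results.
    Detects (but cannot correct) a transient error in one execution."""
    a, b = f(x), f(x)
    if a != b:
        raise RuntimeError("error detected: redundant executions disagree")
    return a

def flaky_square(x, error_rate=0.0, rng=random.Random(0)):
    # Model a transient fault: occasionally flip a low bit of the result.
    y = x * x
    if rng.random() < error_rate:
        y ^= 1
    return y

print(detect_by_duplication(lambda x: flaky_square(x), 7))  # 49, no fault injected
```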

Containment
Error propagation increases costs: more to mitigate, and more likely to lead to a failure.
Containment enables resilience; it takes careful design and timely reporting.

Recover (error mitigation)
Forward recovery (masking): requires redundancy; some errors are ignorable.
Backward recovery (re-execution): requires preserved state.
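
The two recovery styles can be sketched side by side; both snippets are illustrative assumptions, not implementations from the talk. Forward recovery masks an error in place using redundancy (here, triple modular redundancy with a majority vote); backward recovery preserves state so execution can roll back and retry.

```python
import copy

def tmr_vote(a, b, c):
    """Forward recovery (masking): triple modular redundancy.
    A single erroneous copy is outvoted; no re-execution needed."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no majority: error not maskable")

class Checkpointed:
    """Backward recovery: preserve state, roll back and re-execute on error."""
    def __init__(self, state):
        self.state = state
        self.saved = copy.deepcopy(state)

    def checkpoint(self):
        self.saved = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self.saved)

print(tmr_vote(42, 42, 41))  # the single erroneous copy is masked
job = Checkpointed({"step": 0})
job.state["step"] = 7        # make progress, then an error is detected
job.rollback()
print(job.state["step"])     # back at the preserved checkpoint
```

Note the trade-off the slide implies: forward recovery pays for redundancy continuously, while backward recovery pays in preserved state plus re-executed work.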

Repair (fault mitigation)
Reconfigure to degraded operation.
Replace with a hot spare.
Replace with a cold spare.
Fix (usually for software).
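
A small sketch of the sparing options above; the class and names are hypothetical, not from the slides. While spares remain, a failed unit is swapped for a hot spare; once spares are exhausted, the system reconfigures to degraded operation with fewer active units.

```python
class SparePool:
    """Illustrative repair via sparing: replace failed units from a
    spare pool, degrade gracefully when the pool is empty."""
    def __init__(self, active, spares):
        self.active = list(active)
        self.spares = list(spares)

    def repair(self, failed_unit):
        self.active.remove(failed_unit)
        if self.spares:
            self.active.append(self.spares.pop())  # hot-spare replacement
        # else: degraded operation with one fewer active unit

pool = SparePool(active=["n0", "n1", "n2"], spares=["s0"])
pool.repair("n1")
print(sorted(pool.active))  # spare takes over: ['n0', 'n2', 's0']
pool.repair("n2")
print(sorted(pool.active))  # no spares left, degraded: ['n0', 's0']
```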

Metrics: why measure, and who is measuring.

"If you cannot measure it, you cannot improve it." (Lord Kelvin)

FIT, the unit of resilience and reliability
Generic "faults in time": 1 FIT = 1 fault in 10^9 hours of operation.
A really annoying unit, but rates accumulate.
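
"Rates accumulate" is the saving grace of the unit: FIT rates of independent components simply add, so a system's aggregate rate is a sum over its parts. The component counts and per-device FIT rates below are made-up examples to show the arithmetic.

```python
FIT_HOURS = 1e9  # 1 FIT = 1 fault per 10**9 device-hours

def system_fit(components):
    # components: list of (count, fit_per_unit); independent rates add.
    return sum(count * fit for count, fit in components)

def mttf_hours(fit):
    # Mean time to fault implied by an aggregate FIT rate.
    return FIT_HOURS / fit

# Hypothetical machine: 100k DRAM chips at 50 FIT each,
# 10k processor sockets at 200 FIT each.
total = system_fit([(100_000, 50), (10_000, 200)])
print(total)              # 7,000,000 FIT for the whole system
print(mttf_hours(total))  # roughly 143 hours between faults
```

This is why per-device rates that sound negligible become a first-order concern at exascale component counts.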

What do users care about?

What do users care about?
Time to completion; cost of computation; correctness of the answer.
Are they part of the solution?

Relative efficiency
Performance, power, time, energy, ... with resilience, relative to an assumed perfectly-reliable system.
How much overhead is a user paying for a particular resilience scheme?
Setting the baseline in reality is tricky; there are differences between components.
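
A worked example of the metric, with illustrative numbers rather than measurements from the talk: compare a run under some resilience scheme against the idealized perfectly-reliable baseline.

```python
def relative_efficiency(baseline, with_resilience):
    # Ratio of the ideal cost (time, energy, ...) to the actual cost;
    # values below 1.0 mean the user is paying resilience overhead.
    return baseline / with_resilience

# e.g. checkpointing stretches a 100-hour job to 112 hours:
eff = relative_efficiency(baseline=100.0, with_resilience=112.0)
print(f"relative efficiency = {eff:.3f}")          # 0.893
print(f"overhead paid = {112.0 / 100.0 - 1:.1%}")  # 12.0%
```

The "tricky baseline" point from the slide shows up immediately: the 100-hour figure assumes a perfectly-reliable machine that cannot actually be measured, so in practice it must be modeled or extrapolated.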

Relative correctness
Bit-by-bit perfect; same printout; matches expectations.
Need to quantify expectations.

Relative effort
How much is required from the user? How often are they interrupted?
Users are accustomed to reliable systems, and users always resist change.

What about systems people?

What about systems people?
Underlying system failure characteristics:
Availability: overall / "scheduled"; unscheduled downtime
*MTTI (mean time to interrupt)
*MTBI (mean time between interrupts)
*MTTR (mean time to repair)
* = system, application, job, node, ...
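
The starred metrics are related by simple arithmetic, sketched below with hypothetical numbers. In the usual convention (one of several in use), each interrupt is followed by a repair, so MTBI = MTTI + MTTR, and availability is the fraction of each interrupt-to-interrupt cycle the system spends up.

```python
def availability(mtti_hours, mttr_hours):
    mtbi = mtti_hours + mttr_hours  # mean time between interrupts
    return mtti_hours / mtbi

# A system interrupted every 24 hours on average, repaired in 1 hour:
a = availability(mtti_hours=24.0, mttr_hours=1.0)
print(f"availability = {a:.3f}")  # 0.960
```

The asterisk on the slide matters here: the same formula gives very different answers depending on whether MTTI/MTTR are measured per system, per application, per job, or per node.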