1 Taxonomy and Trends Dan Siewiorek Carnegie Mellon University June 2012.

Presentation transcript:

1 Taxonomy and Trends Dan Siewiorek Carnegie Mellon University June 2012

2 Outline
- Taxonomy and Trends
- General Purpose Examples
- High Availability Examples
- A Methodology
- Conclusion

3 Application Taxonomy
- General purpose: wide range of applications; frequently high performance
- High availability: occasional loss of a single user but not the system; rapid restart
- Long life: no human maintenance; automatically detect and reconfigure; high coverage
- Critical computations: usually real-time control systems; low recovery time; high coverage

5 General Purpose Examples

6 Error Detection Techniques in Typical General-Purpose System
- Memory
  - Double-error-detection code on memory data
  - Parity on address and control information
- Cache
  - Parity on data, address, control information
- I/O Unit
  - Parity on data and control
- CPU
  - Parity on data paths
  - Parity on control store
  - Duplication and comparison of control logic
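Parity, the most frequently listed detection technique above, can be illustrated with a minimal sketch. It assumes even parity over an integer word; the function names `parity_bit`, `protect`, and `check` are hypothetical and not taken from any of the systems discussed.

```python
def parity_bit(word: int) -> int:
    """Return the even-parity bit for a non-negative integer word."""
    return bin(word).count("1") % 2


def protect(word: int) -> tuple:
    """Store a word together with its parity bit (hypothetical helper)."""
    return word, parity_bit(word)


def check(word: int, stored_parity: int) -> bool:
    """Return True when the word still matches its stored parity bit."""
    return parity_bit(word) == stored_parity


data, p = protect(0b10110010)

# A single flipped bit changes the parity and is detected.
single = data ^ (1 << 3)
print(check(single, p))   # False

# Two flipped bits cancel out: plain parity cannot see this error.
double = data ^ 0b11
print(check(double, p))   # True
```

The second case shows why memory data, unlike address and control lines, is usually given a stronger code than plain parity.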

7 Error Recovery Techniques in Typical General-Purpose System
- Memory
  - Single-error-correction code on data
  - Retry on address or control information parity error
- Cache
  - Retry on data, address, control information parity error
- I/O Unit
  - Retry on data or control parity errors
- CPU
  - Retry on control store parity error
  - Invert sense of control store
  - Macroinstruction retry
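Single-error correction combined with double-error detection on memory data can be sketched with an extended Hamming code. The example below is an illustrative assumption, not the code used in any particular machine: it protects only 4 data bits with 4 check bits, and `encode` and `decode` are hypothetical names.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1..7 are the usual
    Hamming positions, with check bits at positions 1, 2, and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    c = [0] * 8
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    c[0] = sum(c[1:]) % 2
    return c


def decode(codeword: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, data); data is None for an uncorrectable double error."""
    c = list(codeword)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos          # XOR of the positions holding a 1
    overall = sum(c) % 2             # 0 for an even number of bit flips
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: assume one error at position `syndrome`
        # (0 means the overall parity bit itself) and correct it.
        c[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a non-zero syndrome: detect, do not correct.
        return "double-error", None
    return status, c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)


word = encode(0b1010)
word[6] ^= 1                          # inject a single-bit fault
print(decode(word))                   # ('corrected', 10)
```

Real memory systems apply the same idea to much wider words, where the relative overhead of the check bits is far lower.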

8 IBM 3090 Series Fault-Tolerance Features
- Reliability
  - Low intrinsic failure rate technology
  - Extensive component burn-in during manufacture
  - Dual processor controller that incorporates switchover
  - Dual 3370 Direct Access Storage units support switchover
  - Multiple consoles for monitoring processor activity and for backup
  - LSI packaging vastly reduces number of circuit connections
  - Internal machine power and temperature monitoring
  - Chip sparing in memory replaces defective chips automatically

9 IBM 3090 Series Fault-Tolerance Features
- Availability
  - Two or four central processors
  - Automatic error detection and correction in central and expanded storage
    - Single bit error correction and double bit error detection in central storage
    - Double bit error correction and triple bit error detection in expanded storage
  - Storage deallocation in 4K-byte increments under system program control
  - Ability to vary channels off line in one channel increments
  - Instruction retry
  - Channel command retry
  - Error detection and fault isolation circuits provide improved recovery and serviceability
  - Multipath I/O controllers and units

10 IBM 3090 Series Fault-Tolerance Features
- Data integrity
  - Key controlled storage protection (store and fetch)
  - Critical address storage protection
  - Storage error checking and correction
  - Processor cache error handling
  - Parity and other internal error checking
  - Segment protection (S/370 mode)
  - Page protection (S/370 mode)
  - Clear reset of registers and main storage
  - Automatic Remote Support authorization
  - Block multiplexer channel command retry
  - Extensive I/O recovery by hardware and control programs

11 IBM 3090 Series Fault-Tolerance Features
- Serviceability
  - Automatic fault isolation (analysis routines) concurrent with operation
  - Automatic remote support capability: auto call to IBM if authorized by the customer
  - Automatic customer engineer and parts dispatching
  - Trace facilities
  - Error logout recording
  - Microcode update distribution via remote support facilities
  - Remote service console capability
  - Automatic validation tests after repair
  - Customer problem analysis facilities

12 ED/FI in IBM 308X / 3090
- Hundreds of thousands of isolation domains
- Parity checks account for 70-80% of checkers: data, address, and shift/increment parity predictors
- Decoder/encoder checkers
- 25% of IBM 3090 circuits for RAS
- Can instantaneously detect 90% of all errors
- 25% of faults assumed solid for the technology
- If less than two weeks between events, the cause is assumed to be the same intermittent
- Call service if 24 errors in 2 hours
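The two thresholding rules at the end of this slide can be sketched as a small event tracker. The slide does not say how the rules are combined or implemented, so the structure below is an assumption; the class name `ErrorThreshold` and its fields are hypothetical, and events are assumed to arrive in chronological order.

```python
from collections import deque
from datetime import datetime, timedelta

SAME_FAULT_GAP = timedelta(weeks=2)    # "less than two weeks between events"
SERVICE_WINDOW = timedelta(hours=2)    # "24 errors in 2 hours"
SERVICE_THRESHOLD = 24


class ErrorThreshold:
    """Track error events for one isolation domain (illustrative only)."""

    def __init__(self) -> None:
        self.recent = deque()          # timestamps inside the service window
        self.last_event = None
        self.episode_count = 0         # events attributed to the same intermittent

    def record(self, when: datetime) -> dict:
        # Events closer together than two weeks are blamed on the same intermittent.
        same = self.last_event is not None and when - self.last_event < SAME_FAULT_GAP
        self.episode_count = self.episode_count + 1 if same else 1
        self.last_event = when

        # Keep only the events that fall inside the sliding two-hour window.
        self.recent.append(when)
        while when - self.recent[0] > SERVICE_WINDOW:
            self.recent.popleft()

        return {
            "same_intermittent": same,
            "episode_count": self.episode_count,
            "call_service": len(self.recent) >= SERVICE_THRESHOLD,
        }


tracker = ErrorThreshold()
start = datetime(2012, 6, 1)
for i in range(24):
    result = tracker.record(start + timedelta(minutes=5 * i))
print(result["call_service"])          # True: 24 errors within two hours
```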

13 High Availability Examples

14 Tandem Design Objectives
- “Nonstop” operation, where failures are detected, components are configured out of service, and repaired components are configured back in without stopping other system components
- No single hardware failure can compromise data integrity of the system
- Modular system expansion through adding more processing power, memory, and peripherals without impacting application software

17 Fault Containment
- Software: processes do not share state; they communicate only by message passing
- Hardware: no shared memory, dual-ported I/O, multiple power supplies

18 Fast-Fail Modules (detection)
- Software: consistency checks, defensive programming
- Hardware: software-generated status probes, hardware self-tests

19 Software Bugs
- The backup process does not encounter the same state and environment, so the code takes a different path

20 Software
- Process pairs
- Transaction processing: two-phase commit protocol
- Log write-ahead protocol: record before- and after-image of database in an audit trail
- Network systems management: programmed operators help reduce administrative errors
- Tandem maintenance and diagnostic system: analyzes event logs to successfully call out the FRU 90% of the time
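The log write-ahead idea, recording a before-image and an after-image in an audit trail ahead of each update, can be sketched as follows. This is a toy in-memory model under stated assumptions: the `Database` class and its `write`, `undo`, and `redo` methods are hypothetical, and a Python list stands in for the durable audit trail a real system would force to disk.

```python
_MISSING = object()   # sentinel marking "key did not exist before the write"


class Database:
    """Toy key-value store with a write-ahead audit trail (illustrative only)."""

    def __init__(self) -> None:
        self.data = {}
        self.audit_trail = []          # records of (txid, key, before, after)

    def write(self, txid: str, key: str, value) -> None:
        before = self.data.get(key, _MISSING)
        # Write-ahead rule: the log record is appended BEFORE the data changes.
        self.audit_trail.append((txid, key, before, value))
        self.data[key] = value

    def undo(self, txid: str) -> None:
        """Back out a transaction by restoring before-images, newest first."""
        for t, key, before, _after in reversed(self.audit_trail):
            if t != txid:
                continue
            if before is _MISSING:
                self.data.pop(key, None)
            else:
                self.data[key] = before

    def redo(self, committed: set) -> dict:
        """Rebuild state from scratch by replaying after-images of committed work."""
        rebuilt = {}
        for t, key, _before, after in self.audit_trail:
            if t in committed:
                rebuilt[key] = after
        return rebuilt


db = Database()
db.write("t1", "balance", 100)
db.write("t2", "balance", 40)
db.undo("t2")                          # t2 aborts: the before-image is restored
print(db.data)                         # {'balance': 100}
```

Before-images support backing out an aborted transaction, while after-images allow committed work to be replayed after a failure.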

22 Error Handling
- Error detection logic records error
- Operating system runs diagnostics
  - Incident of failure algorithm
  - If transient, return board to service
  - If permanent, call Customer Assistance Center (CAC)
- CAC determines problem
  - Selects board of same revision level
  - Prints installation instructions
  - Ships via overnight courier
- 22 field engineers support 400 systems
- Service 6% / year of LCC vs. 9% for others
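The transient-versus-permanent decision in the flow above can be sketched in a few lines. The slide does not describe the actual algorithm, so the rule used here is an assumption: a fault counts as permanent if any repeated diagnostic pass fails. The function `handle_error` and its `run_diagnostics` callback are hypothetical.

```python
from typing import Callable


def handle_error(board_id: str,
                 run_diagnostics: Callable[[str], bool],
                 passes: int = 3) -> str:
    """Decide what to do with a board after its detection logic logged an error.

    Assumed rule: the fault is permanent if any diagnostic pass fails,
    and transient if every pass succeeds.
    """
    for _ in range(passes):
        if not run_diagnostics(board_id):
            return "call_cac"              # permanent fault: request a replacement
    return "return_to_service"             # transient fault: keep using the board


# Hypothetical diagnostics: one board tests clean, the other keeps failing.
print(handle_error("board-7", lambda board: True))    # return_to_service
print(handle_error("board-9", lambda board: False))   # call_cac
```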

24 A Methodology

25 A Methodology
- Define objectives
- Limit the scope
- Define confinement regions
- Design error handling mechanisms
- Design error reporting mechanisms
- Testing of error handling/reporting mechanisms
- Evaluate design

35 Exercising Latent Faults

Dormant Area          Exercise
Memory locations      MCU periodically reads every array location (scrubbing)
Detection mechanisms  Software* periodically forces error conditions into the detection mechanisms
Reporting mechanisms  Software* periodically initiates and observes error reports
Recovery mechanisms   Software* periodically invokes recovery operations

*Special commands to support exercising dormant areas are provided in BIUs and MCUs
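Memory scrubbing, the first exercise listed, can be sketched as a loop that reads every location so dormant faults surface before the data is needed. The model below is a simplification under stated assumptions: each word carries a single even-parity bit rather than a real ECC, and the `Memory` class, `ParityError`, and `scrub` are hypothetical names.

```python
class ParityError(Exception):
    """Raised when a word no longer matches its stored parity bit."""


def parity(word: int) -> int:
    return bin(word).count("1") % 2


class Memory:
    """Toy memory array with one parity bit per word (illustrative only)."""

    def __init__(self, size: int) -> None:
        self.words = [0] * size
        self.parity_bits = [0] * size

    def write(self, addr: int, value: int) -> None:
        self.words[addr] = value
        self.parity_bits[addr] = parity(value)

    def read(self, addr: int) -> int:
        word = self.words[addr]
        if parity(word) != self.parity_bits[addr]:
            raise ParityError(f"parity mismatch at address {addr}")
        return word


def scrub(memory: Memory) -> list:
    """Read every location once and return the addresses that failed the check."""
    bad = []
    for addr in range(len(memory.words)):
        try:
            memory.read(addr)
        except ParityError:
            bad.append(addr)
    return bad


mem = Memory(8)
for a in range(8):
    mem.write(a, a * 3)
mem.words[5] ^= 0b100      # inject a latent single-bit fault in a rarely read word
print(scrub(mem))          # [5]
```

Running such a pass periodically keeps a fault in a rarely accessed word from lingering until a second fault makes it uncorrectable.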

36 Recovery Mechanisms and Coverage

Mechanism         Coverage
Retry             Transient errors
ECC               Storage array address and data
Spare bit         DRAM replacement
Memory bus pairs  Memory bus failure
Module shadowing  Module failure, GDP, IP, or memory
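Retry, the mechanism listed as covering transient errors, can be sketched as a bounded loop around an operation. The names `TransientError` and `with_retry`, and the default limit of three attempts, are assumptions made for illustration.

```python
class TransientError(Exception):
    """Stands in for an error that may disappear when the operation is repeated."""


def with_retry(operation, max_attempts: int = 3):
    """Run `operation`, repeating it when it reports a transient error.

    The last failure is re-raised so a persistent fault can be escalated
    to another recovery mechanism.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise


# Hypothetical bus transfer that glitches twice before succeeding.
state = {"calls": 0}


def flaky_transfer():
    state["calls"] += 1
    if state["calls"] < 3:
        raise TransientError("bus glitch")
    return "done"


print(with_retry(flaky_transfer))   # done
```

Bounding the number of attempts matters: a fault that survives every retry is no longer transient and falls to the other mechanisms in the table.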

37 Conclusion

38 Conclusion
- Designing from first principles to produce an architecture that tolerates failures achieves better reliability, availability, and cost-effectiveness than an ad-hoc, add-on approach
- It is possible to build systems in which the activities of fault detection, diagnosis, and recovery are completely automated and transparent to the user