1 Fault Tolerance in the Nonstop Cyclone System By Scott Chan Robert Jardine Presented by Phuc Nguyen.

Presentation transcript:

1 Fault Tolerance in the Nonstop Cyclone System By Scott Chan Robert Jardine Presented by Phuc Nguyen

2 Introduction In a traditional system, if hardware, software, or an operation fails, the system may fail and stop delivering service. In a Tandem Nonstop system, parts of the system may fail, but the rest of the system tolerates the fault and continues delivering service.

3 Tandem computers The basic philosophy of Tandem: no single hardware component failure should result in a system failure. Fault tolerance means staying available even if major components fail. Tandem Nonstop means errors are detected and recovered from without stopping the system. The Tandem Nonstop Cyclone system is the sixth generation of the Tandem Nonstop system, which was first introduced in 1976.

4 Tandem Nonstop Cyclone system A multiprocessor mainframe. The system consists of 4 to 16 processors interconnected by dual high-speed buses (Dynabus). Sections of 4 processors are interconnected by dual fiber optic cables (Dynabus+). Each processor has a CPU, memory, and up to 4 I/O channels, and executes a copy of the Guardian OS.

5 …Tandem Nonstop Cyclone system Fault detection is performed primarily by the hardware. Fault recovery is performed by the message-based OS. The system can tolerate a fault in a processor, peripheral controller, power supply, or cooling system. Failed components can be serviced online without disrupting processing.

7 Design principles Cyclone was designed for high performance around four principles: fault detection by both hardware and software, fail-fast operation, high reliability, and continuous system operation.

8 Hardware architecture All major components are duplicated. Redundant hardware components allow processes to keep executing when a component fails: if one fails, another takes over. The components operate concurrently, which enhances system performance as well as providing fault tolerance.

9 Dual-Bus path Each processor is connected to all other processors through dual high-speed buses (Dynabuses). When one bus fails, the other bus takes over without interruption.

10 Dual-Ported Controllers I/O controllers are dual-ported to separate processors, so each I/O device can be controlled by either of two processors. Only one processor controls the I/O device at a time. An ownership bit in each controller selects which port is the primary. If the owning processor fails, the other processor takes over control.
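
The ownership rule for dual-ported controllers can be sketched in a few lines of Python. This is an illustrative model only: the class `DualPortedController`, its methods, and the processor names are assumptions made for the example, not part of the Cyclone hardware or Guardian interfaces.

```python
class DualPortedController:
    """Hypothetical model of an I/O controller with two ports and an ownership bit."""

    def __init__(self, port_cpus):
        # port_cpus: the two processors wired to port 0 and port 1 (assumed names).
        assert len(port_cpus) == 2
        self.port_cpus = tuple(port_cpus)
        self.ownership_bit = 0  # 0 -> port 0 is primary, 1 -> port 1 is primary

    @property
    def owner(self):
        return self.port_cpus[self.ownership_bit]

    def request(self, cpu, operation):
        # Only the owning processor may drive the device.
        if cpu != self.owner:
            raise PermissionError(f"{cpu} does not own this controller")
        return f"{operation} executed via port {self.ownership_bit}"

    def take_over(self, failed_cpu):
        # If the owner failed, flip the ownership bit so the other port becomes primary.
        if failed_cpu == self.owner:
            self.ownership_bit ^= 1
        return self.owner


ctrl = DualPortedController(["cpu0", "cpu1"])
first = ctrl.request("cpu0", "read")   # cpu0 owns the controller initially
new_owner = ctrl.take_over("cpu0")     # cpu0 fails, cpu1 becomes the owner
second = ctrl.request("cpu1", "read")  # the same device is now reached via port 1
```

The point of the single bit is that ownership is never ambiguous: at any moment exactly one port is primary, and takeover is a one-bit state change.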

11 Dual-Ported Disks Disks are dual-ported to separate controllers. If one controller fails, the disks can communicate with the system through the other port.

12 Guardian OS Plays a vital role in detecting failures and performing recovery. Supports the management of processes, memory, interprocess communication, names, I/O, synchronization, and debugging. Detects failures of processors, buses, and I/O channels. Performs recovery.

13 …Guardian OS Each processor executes a copy of the Guardian OS. Two mechanisms provide fault detection and recovery: - Process pair - I am alive protocol.

14 Process pair A primary process and a backup process execute concurrently in two different processors. The primary performs the work and occasionally sends a checkpoint message (state information) to the backup. If the primary fails, the backup takes over at the last checkpoint it received.

15 …Process pair… The primary is started first, then it starts the backup. The primary sends checkpoint messages to the backup. Meanwhile, the backup enters a loop in which it receives checkpoint and error messages. If no error occurs, the backup updates its memory with the checkpoint message and continues the loop. If an error occurs, the backup takes over at the last point it updated.

16 …Process pair When the backup becomes the primary, it creates a new backup, then resumes processing at the last checkpoint it received. If the backup fails, the primary creates a replacement backup.
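
The process-pair steps above can be sketched as follows. This is a minimal single-threaded simulation under stated assumptions: the class `PairedProcess`, the function `run_pair`, the `checkpoint_every` interval, and the summing workload are all hypothetical, and a real pair runs in two processors and exchanges messages over the bus rather than sharing Python objects.

```python
class PairedProcess:
    """Hypothetical process that can act as primary or backup in a pair."""

    def __init__(self, name):
        self.name = name
        self.state = {"next_item": 0, "total": 0}
        self.backup = None

    def checkpoint(self):
        # The primary sends a copy of its state to the backup.
        if self.backup is not None:
            self.backup.state = dict(self.state)

    def work(self, items, checkpoint_every=2, fail_at=None):
        # Process items starting from the position recorded in the state.
        while self.state["next_item"] < len(items):
            i = self.state["next_item"]
            if fail_at is not None and i == fail_at:
                raise RuntimeError(f"{self.name} failed at item {i}")
            self.state["total"] += items[i]
            self.state["next_item"] = i + 1
            if self.state["next_item"] % checkpoint_every == 0:
                self.checkpoint()
        return self.state["total"]


def run_pair(items, fail_at=None):
    primary, backup = PairedProcess("primary"), PairedProcess("backup")
    primary.backup = backup  # the primary is started first, then starts the backup
    try:
        return primary.work(items, fail_at=fail_at)
    except RuntimeError:
        # The backup becomes the primary, creates a new backup,
        # and resumes from the last checkpoint it received.
        backup.backup = PairedProcess("new-backup")
        return backup.work(items)


result_without_failure = run_pair([1, 2, 3, 4, 5])
result_with_failure = run_pair([1, 2, 3, 4, 5], fail_at=3)
```

Work done after the last checkpoint is lost with the primary and redone by the backup, so the final result is the same with or without the failure.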

17 I am alive protocol All processors continually check on each other. Each processor transmits an I am alive message to the other processors every second. Each processor checks that it has received a message from every other processor. A missing message implies that a processor has failed. The backup process then takes over from the failed primary process.
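
A minimal sketch of the detection side of the I am alive protocol follows. The one-second send interval is from the slide; the two-second timeout, the `Monitor` class, and the processor names are assumptions chosen for the example, and time is simulated with plain numbers rather than a real clock.

```python
def find_failed(last_heard, now, timeout=2.0):
    """Return processors whose last 'I am alive' message is older than timeout."""
    return sorted(cpu for cpu, t in last_heard.items() if now - t > timeout)


class Monitor:
    """Hypothetical per-processor record of when each peer was last heard from."""

    def __init__(self, peers, start=0.0):
        self.last_heard = {p: start for p in peers}

    def on_alive(self, peer, now):
        # Called whenever an 'I am alive' message arrives from a peer.
        self.last_heard[peer] = now

    def check(self, now, timeout=2.0):
        return find_failed(self.last_heard, now, timeout)


peers = ["cpu1", "cpu2", "cpu3"]
monitor = Monitor(peers)
for second in range(1, 5):
    for peer in peers:
        if peer == "cpu2" and second >= 2:
            continue  # cpu2 stops sending after t = 1 (simulated failure)
        monitor.on_alive(peer, float(second))

failed = monitor.check(now=4.0)  # cpu2 has been silent for 3 seconds
```

Once a peer is reported as failed, the backups of the processes that ran on it would take over, as described in the process-pair slides.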

18 Tandem Nonstop Cyclone system Some new features of Cyclone: a superscalar design that executes 2 instructions per clock cycle; dynamic branch prediction (strategy 2); direct memory access; multiple I/O channels; multiple, physically separate sections connected by dual fiber optic cables (Dynabus+).

19 Fault detection Error detection in data paths is done by parity checking and parity prediction. Error detection in control paths is done by parity, illegal state detection, and self-checking. If the processor hardware detects an error it cannot recover from, it shuts itself down in 2 clock cycles, and the error is flagged in error identification registers.

21 …Fault detection… Faults in each processor are detected by hardware and microcode methods. Parity checking is used to detect single-bit errors. Parity prediction is used on devices that change data. The hardware multiplier is protected by Recomputation with Shifted Operands (RESO).
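
The three techniques named above can be illustrated in software. This is a sketch of the general ideas, not of the Cyclone circuits: the 16-bit width, the function names, and the stuck-at-zero fault model for the multiplier are assumptions made for the example.

```python
def parity(x):
    """Parity bit of a non-negative integer: 1 if it has an odd number of 1 bits."""
    return bin(x).count("1") & 1


def carry_bits(a, b, width):
    """Carry into each bit position of a + b, computed by a ripple-carry walk."""
    carries, c = 0, 0
    for i in range(width):
        carries |= c << i
        ai, bi = (a >> i) & 1, (b >> i) & 1
        c = (ai & bi) | (c & (ai ^ bi))
    return carries


def predicted_sum_parity(a, b, width=16):
    """Parity prediction for an adder: derive the parity of the sum from the
    input parities and the carries, without looking at the sum itself."""
    return parity(a) ^ parity(b) ^ parity(carry_bits(a, b, width))


def reso_multiply(a, b, multiplier):
    """RESO: multiply, then recompute with one operand shifted left by one bit
    and shift the result back. A mismatch signals a fault in the multiplier."""
    first = multiplier(a, b)
    second = multiplier(a << 1, b) >> 1
    if first != second:
        raise ArithmeticError("multiplier fault detected")
    return first


def good_multiplier(a, b):
    return a * b


def faulty_multiplier(a, b):
    # Assumed fault model: result bit 3 is stuck at 0.
    return (a * b) & ~(1 << 3)


# Parity checking: a single flipped bit changes the parity.
word = 0b1011
stored_parity = parity(word)
corrupted = word ^ 0b0100
single_bit_error_detected = parity(corrupted) != stored_parity

# Parity prediction: the predicted parity matches the parity of the real sum.
prediction_matches = predicted_sum_parity(1234, 4321) == parity((1234 + 4321) & 0xFFFF)

product = reso_multiply(3, 5, good_multiplier)
```

Shifting the operand moves the computation onto different bit positions, so a fault tied to one position corrupts the two results differently; calling `reso_multiply(3, 5, faulty_multiplier)` raises `ArithmeticError` for that reason.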

22 …Fault detection Invalid state checking or duplication-and-comparison is used in sequential state machines. A checksum protects multiple-word transmissions. Microcode and the OS perform consistency checks. If an error is detected and is unrecoverable, the processor executes a HALT instruction and transmits an error code to the Remote Maintenance Subsystem.

23 Fault recovery Each processor has online recovery mechanisms. The caches, main store, and subsystem store can recover from data errors by reloading from an alternative copy. The main store and caches have spare RAMs that automatically replace failed RAMs.

24 Fault recovery A single-error-correcting, double-error-detecting code protects the dynamic RAMs in main memory. An asynchronous microcode process periodically checks for correctable memory errors.
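
To show what single error correction with double error detection means in practice, here is a small sketch using an extended Hamming (8,4) code: 4 data bits, 3 Hamming check bits, and 1 overall parity bit. The slides do not say which code or word width Cyclone uses, so this particular code and the `encode`/`decode` names are assumptions chosen only because they are small enough to read.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit list.
    Index 0 is the overall parity bit; indices 1..7 are Hamming positions."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) & 1            # makes the total number of 1s even
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected', or 'double-error'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                # XOR of the positions holding a 1
    overall = sum(code) & 1                # 0 when the overall parity is consistent
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Exactly one bit flipped; the syndrome is its position
        # (0 means the overall parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with consistent overall parity: two bits flipped.
        return None, "double-error"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1010)

one_flip = list(word)
one_flip[5] ^= 1
single = decode(one_flip)   # the data is recovered

two_flips = list(word)
two_flips[2] ^= 1
two_flips[6] ^= 1
double = decode(two_flips)  # reported as uncorrectable
```

A periodic check such as the microcode process described above would read each word, correct and rewrite it on a single-bit error, and report a double-bit error as unrecoverable.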

26 Diagnosis facilities Each processor has a microprocessor that executes quick diagnostics, generates and collects signatures for quick fault detection and isolation, and serves as the interface to a separate fault-tolerant Remote Maintenance Subsystem. That subsystem monitors and logs events in a running system, supports diagnosis of failures anywhere in the system, and reports problems.

27 Conclusion The Tandem Nonstop Cyclone system provides a high level of performance, system availability, and data integrity. Redundant hardware components and process pairs provide continuous service. Duplicating hardware components makes a Tandem Nonstop system more expensive than a traditional system of the same size.