Published by Bethany Sparks. Modified over 8 years ago.
1
Commercial Fault Tolerance: A Tale of Two Systems. Umut Bultan
2
Outline: Introduction; Initial Target Audiences; Initial Fault Tolerant Philosophies; Design Principles; Advanced Design; Chip Technology; Conclusion
3
Introduction: Since the 1960s, many commercial systems have been introduced with a specific focus on high availability or fault tolerance. This talk covers the design points and evolution of IBM's S/360 (now zSeries) and Tandem's NonStop (now HP's). Availability for enterprise customers means the ability to provide service with acceptable response time, accuracy, and consistency of results. For applications it means "7 x 24 x forever" (ATM/POS), 100 percent availability within given timeframes (stock market), or infrequent short disruptions.
4
Availability Dimensions: The families described here first targeted the High Availability/Fault Tolerant quadrant and have since evolved toward Continuous Availability.
5
Initial Target Audiences: S/360 systems were used for background jobs and overnight batch processing. First goals: detect failures quickly to preserve data integrity; locate and repair those failures quickly enough to finish the job before its deadline. Over time, mainframes were used for real-time work. Today the goals are continuous operation and limiting planned downtime.
6
Initial Target Audiences: The original Tandem NonStop systems were targeted at a few specific applications with very high availability requirements: ATMs and POS. The emphasis was on fault tolerance rather than continuous operation (the infrastructure might be taken down for service). NonStop servers soon moved into stock exchanges, telecommunications, manufacturing floors, emergency services, and health care.
7
Initial Fault Tolerant Philosophies: zSeries. Maximize hardware utilization during fault-free operation. Retry and recover when failures are detected. Maintain hardware components online (repair and upgrade of HW). Ensure data integrity, even if application disruption is required.
8
Initial Fault Tolerant Philosophies: NonStop. Keep applications running after a single failure of any kind. Online repair. Integration of hardware components. Preserve data integrity. These goals have been met with a loosely coupled design based on multiple modules with multiple interconnections among them.
9
Design Principles: zSeries. Include Error Correction Code (ECC) in memory. An ECC is a code in which each data signal conforms to specific rules of construction, so that departures from those rules can be detected and corrected. Extended Recovery Facility: permits one mainframe to monitor another. Online repair capability. An operational element can take over for one that has failed. Redundant elements were packaged.
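The "rules of construction" idea behind ECC can be illustrated with a toy Hamming(7,4) code. This is only a sketch: the slides do not say which code zSeries memory uses, and production memory ECC works on much wider words. The function names below are made up for this example.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword (bit positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, syndrome); a nonzero syndrome is the corrected position."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        # The syndrome names the 1-based position that breaks the rules; flip it back.
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome


word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[4] ^= 1  # simulate a single-bit memory fault at position 5
recovered, position = hamming74_decode(stored)
```

A single flipped bit always produces a syndrome equal to its position, which is why one-bit faults can be repaired on the fly instead of stopping the job.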
10
Design Principles: zSeries. Today zSeries memory uses a store-through cache design, spare elements, and Error Correction Codes. Each microprocessor has a cache, and they share a main memory. Pending instruction results are kept in both a store-through microprocessor cache and an ECC-protected store buffer. CPU instruction retry. Repair of faulty DRAMs is done using built-in spare chips. The I/O subsystem uses redundant paths between all devices and main memory. The I/O channel adapters perform direct memory access with robust memory protection. Power and cooling systems are designed with no single points of failure.
11
Design Principles: NonStop. The first design had 2 to 16 independent processors. Processors were connected by a pair of independent buses (Dynabus). The general design principle was that there be at least two of everything. Controllers had their own fail-fast requirements so as not to corrupt data. HW modules were fault-intolerant: they fail quickly and cleanly.
12
Original Tandem System Architecture
13
Design Principles: NonStop. The system was architected to maximize useful active modules and minimize redundancy. Both HW and SW were optimized to support a message-based operating system. Highly distributed and fault tolerant. Easy to repair and expand online. SW fault tolerance was built into the operating system from the beginning. Process pairs: a primary process runs the application and checkpoints to a backup process in a different processor.
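The process-pair idea can be sketched as a small simulation. This is not the NonStop API; `Process`, `ProcessPair`, the account example, and the request sequence numbers are all assumptions made for illustration. The primary applies a request, checkpoints its state to the backup, and only then replies; after a takeover, a retried request is recognized by its sequence number and not applied twice.

```python
class Process:
    """One half of a process pair, living on a given (simulated) CPU."""

    def __init__(self, cpu):
        self.cpu = cpu
        self.balance = 0
        self.last_seq = 0  # highest request sequence number already applied

    def snapshot(self):
        return {"balance": self.balance, "last_seq": self.last_seq}

    def restore(self, snap):
        self.balance = snap["balance"]
        self.last_seq = snap["last_seq"]


class ProcessPair:
    def __init__(self):
        self.primary = Process(cpu=0)
        self.backup = Process(cpu=1)

    def deposit(self, seq, amount):
        p = self.primary
        if seq <= p.last_seq:
            return p.balance  # duplicate (client retry): already applied
        p.balance += amount
        p.last_seq = seq
        # Checkpoint to the backup before replying to the client.
        self.backup.restore(p.snapshot())
        return p.balance

    def fail_primary(self):
        """Simulate loss of the primary's CPU: the backup takes over."""
        failed_cpu = self.primary.cpu
        self.primary = self.backup
        # Assumption: a fresh backup is started once the failed CPU is repaired.
        self.backup = Process(cpu=failed_cpu)
        self.backup.restore(self.primary.snapshot())


pair = ProcessPair()
pair.deposit(1, 100)
pair.deposit(2, 50)
pair.fail_primary()
after_retry = pair.deposit(2, 50)  # client retries the request it was unsure about
```

Checkpointing before the reply is the design choice that matters here: anything the client has been told is already held by the backup.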
14
Advanced Design: Both systems have been continuously redesigned as technologies evolved, driven by the decreasing size and increasing speed of microprocessors.
15
Chip Technology: zSeries. Early zSeries: within the CPU, parity and ECC were used extensively. Control logic such as state machines and ALUs was checked by parity prediction or illegal-state detection. With CMOS technology, a totally new design was introduced: a highly checked single-chip microprocessor. Inline checking of control and arithmetic logic was ruled out by its performance penalty. Instead, the I-unit (instruction unit) and E-unit (execution unit) are duplicated.
16
Logical Layout of the zSeries Microprocessor Chip
17
Chip Technology: zSeries. Updates to memory are maintained in an ECC-protected store buffer. The key element of the R-unit is an ECC-protected register file, the checkpoint array, which keeps track of the entire state of the CPU. On a permanent error, the service element moves the checkpoint to another microprocessor.
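The checkpoint-and-retry flow can be sketched in software as follows. This is a loose model, not the hardware mechanism: the `Cpu` class, the `run` function, the retry limit, and the register-add "instructions" are all assumptions for illustration. State is committed to the checkpoint only after an instruction completes without a detected error; a detected error causes a retry from the checkpoint, and repeated errors are treated as permanent, so the checkpoint moves to a spare processor.

```python
class DetectedError(Exception):
    """Raised when the (simulated) checking logic detects a fault."""


class Cpu:
    def __init__(self, name, transient_faults=0, permanent=False):
        self.name = name
        self.transient_faults = transient_faults  # errors that vanish on retry
        self.permanent = permanent                # errors that never go away

    def execute(self, regs, instr):
        if self.permanent:
            raise DetectedError(self.name)
        if self.transient_faults > 0:
            self.transient_faults -= 1
            raise DetectedError(self.name)
        reg, value = instr  # toy instruction: add value to a register
        new_regs = dict(regs)
        new_regs[reg] = new_regs.get(reg, 0) + value
        return new_regs


def run(program, cpus, max_retries=2):
    checkpoint = {}  # last known-good architected state
    current = 0      # index of the processor in use
    log = []
    for instr in program:
        retries = 0
        while True:
            try:
                result = cpus[current].execute(checkpoint, instr)
            except DetectedError:
                if retries < max_retries:
                    retries += 1
                    log.append(("retry", cpus[current].name))
                    continue
                # Treat as permanent: move the checkpoint to a spare processor.
                current += 1
                if current == len(cpus):
                    raise RuntimeError("no spare processor left")
                retries = 0
                log.append(("moved", cpus[current].name))
                continue
            checkpoint = result  # commit only after an error-free completion
            break
    return checkpoint, log


program = [("r1", 5), ("r2", 7), ("r1", 1)]
state, events = run(program, [Cpu("cp0", permanent=True), Cpu("cp1")])
```

Because the checkpoint never contains results from a failed attempt, the spare processor can resume at the failing instruction without the application noticing.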
18
Chip Technology: NonStop. The first NonStop systems used TTL integrated circuits. Designers were able to increase self-checking logic by parity on buses and branch prediction. Tandem chose the MIPS R3000 for its initial CMOS system. The design changed to tight lockstepping of a pair of MIPS chips. A looser, but still effective, level of synchronization between microprocessors supports a triplex configuration.
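The difference between a lockstepped pair and a triplex configuration can be sketched with output comparison. This is a simplified model with made-up function names, not Tandem's hardware: a pair can only detect a disagreement (and so must fail fast), while three units allow a majority vote that also identifies the unit that disagreed.

```python
from collections import Counter


class Miscompare(Exception):
    """Raised when redundant units disagree and no majority exists."""


def lockstep_pair(out_a, out_b):
    """Duplex: a mismatch is detectable but not correctable, so fail fast."""
    if out_a != out_b:
        raise Miscompare((out_a, out_b))
    return out_a


def triplex_vote(outputs):
    """Triplex: return (majority value, indices of units that disagreed)."""
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise Miscompare(tuple(outputs))
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty


result, suspects = triplex_vote([42, 42, 7])
```

With the pair, a miscompare stops the module so that a partner can take over; with the triplex, processing continues on the majority value while the disagreeing unit is flagged for repair.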
20
Conclusion: Fault tolerance and rapid fault detection for both HW and SW are the key building blocks of continuously available applications. Built-in error recovery is important. Today's challenges are the network, the environment, cyber attacks, people, and the operational procedures responsible for keeping applications and systems running.
21
Thank you. Any questions?