Introduction. High-Availability Systems: An Example (AT&T)
AT&T pioneered fault tolerance (FT) in telephone switching applications.
Aggressive availability goal: 2 hours of downtime in 40 years (i.e., 3 min/year), with less than 0.01% of calls handled incorrectly.
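The downtime goal can be checked with a few lines of arithmetic. The sketch below converts 2 hours in 40 years into minutes per year and into an availability fraction. The year length of 365.25 days is an assumption made here for illustration, not a figure from the slides.

```python
# Worked arithmetic for the stated goal: 2 hours of downtime in 40 years.
HOURS_PER_YEAR = 365.25 * 24  # assumption: average year length, leap years included

downtime_hours = 2.0
service_years = 40.0

# Downtime budget per year, in minutes.
downtime_min_per_year = downtime_hours * 60 / service_years

# Fraction of the 40-year period during which the system is up.
availability = 1.0 - downtime_hours / (service_years * HOURS_PER_YEAR)

print(f"downtime per year: {downtime_min_per_year:.1f} min")
print(f"availability: {availability:.7f}")
```

Under that assumption the goal corresponds to an availability of roughly 99.9994%.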

In 1978, Bell Labs collected data on historic trends in the causes of system downtime:
20% attributed to HW (good diagnostics and trouble-location programs can help minimize HW-induced downtime).
15% attributed to SW (SW deficiencies included improper translation of algorithms into code or improper specifications).
35% attributed to recovery deficiencies (these deficiencies can be caused by undetected faults or incorrect fault isolation).
30% attributed to human procedural error.

Other studies in the same direction...

There is some natural redundancy in the telephone switching network: “a telephone user will redial if he gets a wrong number or is disconnected”. However, there is a user aggravation level that must be avoided: “users will redial as long as it does not happen too frequently”.

Note, however, that the thresholds are different for failure to establish a call (moderately high) and disconnection of an established call (very low).

Levels of recovery in a telephone switching system:
Phase | Recovery Action | Effect
1 | Initialize transient memory. | Affects temporary storage, no calls lost.
2 | Reconfigure peripheral HW; initialize all transient memory. | Lose calls in process of being established, calls in progress not lost.
3 | Verify memory operation; establish a workable processor configuration; verify program; configure peripheral HW; initialize all transient memory. | Lose calls in process of being established, calls in progress not affected.
4 | Establish a workable processor configuration; configure peripheral HW; initialize all memory. | All calls lost.
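The recovery levels can be read as a table of increasing disruption. The sketch below encodes the "Effect" column as data and picks the mildest phase that clears a fault. The escalation policy (try phase 1 first, then 2, and so on) and the names `RecoveryPhase` and `mildest_phase_that_recovers` are assumptions for illustration; the slides list the phases but do not state how one is selected.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class RecoveryPhase:
    phase: int
    loses_calls_being_established: bool
    loses_calls_in_progress: bool


# Effects taken from the recovery-level table.
PHASES = [
    RecoveryPhase(1, False, False),  # temporary storage only, no calls lost
    RecoveryPhase(2, True, False),
    RecoveryPhase(3, True, False),
    RecoveryPhase(4, True, True),    # all calls lost
]


def mildest_phase_that_recovers(
    recovers: Callable[[int], bool],
) -> Optional[RecoveryPhase]:
    """Try phases in order of increasing disruption (assumed policy)."""
    for candidate in PHASES:
        if recovers(candidate.phase):
            return candidate
    return None


# Hypothetical fault that is only cleared from phase 3 onward.
chosen = mildest_phase_that_recovers(lambda phase: phase >= 3)
print(chosen)
```

In this hypothetical case phase 3 is chosen, so calls being established are lost while established calls survive, which matches the lower tolerance for disconnecting an established call.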

In a typical telephone switching system, the tasks of the Central Control Unit are related to:
Overall system control/administration
Call processing
System maintenance
Automatic isolation of faulty units
Defensive SW strategies
Support for rapid repair

Typical switching system diagram. Components shown: Central Control (CC), AU Bus Interface, Program Store (PS), Call Store (CS), Auxiliary Unit (AU) Bus.

CC instructions reside in the program store (PS), while transient information (e.g., telephone calls, routing, equipment configuration) is held in the call store (CS). The Auxiliary Unit (AU) Bus interfaces to disk and magnetic tape mass storage.

Duplex configuration for the switching computer. (Assuming that only one of each component is required for a functional system, there are 64 possible system configurations.)
Diagram components: Central Control 1 and 2 (CC), AU Bus Interface 1 and 2, Program Store 1 and 2 (PS), Call Store 1 and 2 (CS), Auxiliary Unit (AU) Bus, PSB1, PSB2, PUB1, PUB2.
Legend: PSB: Program Store Bus; PUB: Peripheral Unit Bus.
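The count of 64 follows from choosing one of two copies for each of six duplicated unit types (2^6 = 64). The sketch below enumerates those choices. Which six unit types are counted is an assumption here: the names in `UNIT_TYPES` are read off the diagram labels and are placeholders, not a list given in the slides.

```python
from itertools import product

# Assumption: six duplicated unit types, named after the diagram labels.
UNIT_TYPES = ["CC", "AU bus interface", "PS", "CS", "PSB", "PUB"]

# Each configuration selects copy 1 or copy 2 of every unit type.
configurations = [
    dict(zip(UNIT_TYPES, choice))
    for choice in product((1, 2), repeat=len(UNIT_TYPES))
]

print(len(configurations))
print(configurations[0])
```

Each entry maps a unit type to the copy in use, so a reconfiguration routine could walk this list when searching for a working combination.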

1- Both CCs operate in synchronism. Two matcher circuits compare 24 bits of internal state during each 5.5 µs machine cycle.
2- There are 6 different sets of internal nodes that can be monitored, depending on the instruction being executed.
3- A mismatch generates an interrupt, which calls fault recognition programs to determine which half of the system is faulty.
4- Information can be sampled by the matchers and retained for later examination by diagnostic programs.
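The matching described in items 1 to 4 can be sketched in software as an XOR over a 24-bit sample of each CC's state. This is only an illustration of the comparison idea, since the real matchers are hardware circuits. The function names and the callback standing in for the mismatch interrupt are hypothetical.

```python
MATCH_WIDTH = 24                 # bits compared per machine cycle
MASK = (1 << MATCH_WIDTH) - 1


def match(state_a: int, state_b: int) -> int:
    """Return a bitmask of the sampled bits that differ (0 means a match)."""
    return (state_a ^ state_b) & MASK


def machine_cycle(state_a: int, state_b: int, on_mismatch) -> bool:
    """Compare both CCs for one cycle; call on_mismatch with the differing bits."""
    diff = match(state_a, state_b)
    if diff:
        # Stand-in for the interrupt that calls the fault recognition programs.
        on_mismatch(diff)
    return diff == 0


# The retained samples play the role of data kept for later diagnosis.
retained = []
in_sync = machine_cycle(0xABCDEF, 0xABCDEF, retained.append)
out_of_sync = machine_cycle(0xABCDEF, 0xABCDEE, retained.append)
print(in_sync, out_of_sync, [hex(d) for d in retained])
```

Note that a mismatch only shows that the two halves disagree. Deciding which half is faulty is left to the fault recognition programs, as item 3 states.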

5- The PS employs a Hamming code on the 37 data bits.
6- There are parity check bits over address plus data: the CS has one parity bit over address and data, and another parity bit over the address alone.
7- Both PS and CS automatically retry operations upon error detection.
8- After a fault has been detected, the system configuration logic attempts to establish various combinations of subunits.
9- A sanity program is then executed.
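The call store parity scheme in item 6 and the retry in item 7 can be sketched as follows. Even parity, the dictionary layout of a stored word, and the function names are assumptions for illustration; the slides state only which fields each parity bit covers.

```python
def parity(value: int) -> int:
    """Even-parity bit of a non-negative integer (assumed parity sense)."""
    return bin(value).count("1") & 1


def cs_write(address: int, data: int) -> dict:
    """Build a stored word with the two parity bits described in item 6."""
    return {
        "address": address,
        "data": data,
        "p_addr_data": parity(address) ^ parity(data),  # over address and data
        "p_addr": parity(address),                      # over address alone
    }


def cs_check(word: dict) -> bool:
    """Recompute both parity bits and compare with the stored ones."""
    addr_ok = parity(word["address"]) == word["p_addr"]
    all_ok = (parity(word["address"]) ^ parity(word["data"])) == word["p_addr_data"]
    return addr_ok and all_ok


def read_with_retry(read_once, retries: int = 1) -> dict:
    """Retry the read when a parity error is detected (item 7)."""
    for _ in range(retries + 1):
        word = read_once()
        if cs_check(word):
            return word
    raise RuntimeError("parity error persisted after retry")


# Hypothetical transient fault: the first read flips one data bit.
good = cs_write(address=0x1A2, data=0x0F0F)
bad = dict(good, data=good["data"] ^ 0x4)
reads = iter([bad, good])
result = read_with_retry(lambda: next(reads))
print(result == good)
```

A single parity bit detects any odd number of flipped bits in the fields it covers, so the retry helps with transient errors but does not correct a persistent one.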

Summarizing some features of the FT system:
Duplication of ALU.
30% of control logic devoted to self-checking.
EDAC on disks.
SW audits.
Sanity timer (a sanity program is similar to a maze that the HW must traverse before the sanity timer times out; if a time-out occurs, the reconfiguration logic generates a new configuration to be tried).
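The sanity timer can be sketched as a loop over candidate configurations. In this simulation a "sanity program" reports how many steps it needed, or `None` if it never finishes, and the timer is a step budget rather than a real clock. The function names, the configuration labels, and the step counts are all hypothetical.

```python
from typing import Callable, Iterable, Optional


def find_sane_configuration(
    candidates: Iterable[str],
    run_sanity_program: Callable[[str], Optional[int]],
    timeout_steps: int,
) -> Optional[str]:
    """Return the first configuration whose sanity program beats the timer."""
    for config in candidates:
        steps = run_sanity_program(config)
        if steps is not None and steps <= timeout_steps:
            return config
        # Time-out: the reconfiguration logic moves on to the next candidate.
    return None


# Hypothetical outcomes: A hangs, B is too slow, C finishes in time.
simulated_steps = {"A": None, "B": 500, "C": 40}
sane = find_sane_configuration(["A", "B", "C"], simulated_steps.get, timeout_steps=100)
print(sane)
```

The point of the design is that the hardware must prove it can execute a known path within a deadline, so a configuration that hangs or runs wild is rejected without needing to diagnose why.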

Integrity monitor (Supervisor).
Byte parity on datapaths.
Parity checking where parity is preserved, duplication otherwise.
Two parity bits on registers.
Modified Hamming code on main memory.
Maintenance channel for observability and controllability.
