
1 DS - X - CS - 0 HUMBOLDT-UNIVERSITÄT ZU BERLIN INSTITUT FÜR INFORMATIK DEPENDABLE SYSTEMS Lecture 10 CASE STUDIES Winter Semester 99/00 Lecturer: Prof. Dr. Miroslaw Malek www.informatik.hu-berlin.de/~rok/ftc

2 DS - X - CS - 1 CASE STUDIES OBJECTIVES: –TO SHOW EXAMPLES OF EXISTING SYSTEMS WHICH ARE DESIGNED TO ASSURE HIGH RELIABILITY –TO RELATE GENERAL RELIABILITY METHODOLOGIES DESCRIBED EARLIER TO PRACTICAL IMPLEMENTATIONS OF THOSE IDEAS –TO SURVEY THE GENERAL EXISTING RELIABILITY CONCEPTS WITH EXEMPLARY CASES CONTENTS: –COMMERCIAL SYSTEMS FROM AT&T, SEQUOIA, STRATUS AND TANDEM –FTMP - FAULT-TOLERANT MULTIPROCESSOR –SIFT - SOFTWARE IMPLEMENTED FAULT TOLERANCE –COMMUNICATION CONTROLLER –FAULT-TOLERANT BUILDING BLOCK ARCHITECTURE

3 DS - X - CS - 2 AT&T's ELECTRONIC SWITCHING SYSTEMS ESS1A - ESS 5 AND 3B20 (1) REQUIREMENTS: –Downtime for the entire system not to exceed 2 hours over 40 years life –% of calls handled incorrectly < 0.02% –System outage ≤ 3 min/year –100% availability 24 hours a day from user's perspective Two minutes of downtime are contributed by –24 sec - hardware faults (20%) –18 sec - software deficiencies (15%) –36 sec - procedural errors (30%) –42 sec - recovery deficiencies (35%)
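The figures on this slide can be cross-checked with a little arithmetic. The sketch below is only an illustration: the function names are made up for this example, and the average year length is an assumption. It shows that 2 hours over 40 years equals the 3 min/year outage limit, and that the four percentage shares of a 2-minute (120-second) budget give the listed seconds.

```python
# Downtime budget check for the requirements listed on the slide.
# All names here are hypothetical; only the numbers come from the slide.

LIFETIME_YEARS = 40
TOTAL_DOWNTIME_MIN = 2 * 60           # 2 hours over the system lifetime
MIN_PER_YEAR = 365.25 * 24 * 60       # assumption: average year length

SHARES_PERCENT = {
    "hardware faults": 20,
    "software deficiencies": 15,
    "procedural errors": 30,
    "recovery deficiencies": 35,
}


def yearly_outage_minutes() -> float:
    """Average allowed outage per year implied by the lifetime budget."""
    return TOTAL_DOWNTIME_MIN / LIFETIME_YEARS


def availability(outage_min_per_year: float) -> float:
    """Fraction of a year during which the system is up."""
    return 1.0 - outage_min_per_year / MIN_PER_YEAR


def split_budget(total_seconds: float) -> dict:
    """Split a downtime budget (in seconds) according to the slide's shares."""
    return {cause: total_seconds * pct / 100
            for cause, pct in SHARES_PERCENT.items()}


if __name__ == "__main__":
    print(yearly_outage_minutes())            # minutes per year
    print(availability(yearly_outage_minutes()))
    print(split_budget(120))                  # seconds per cause
```

An outage limit of 3 min/year corresponds to an availability slightly above 0.99999, which is why the system appears "100% available" from the user's perspective.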

4 DS - X - CS - 3 AT&T's ELECTRONIC SWITCHING SYSTEMS ESS1A - ESS 5 AND 3B20 (2) OTHER FEATURES: –95% of hardware and software faults detected and diagnosed automatically –90% of hardware faults diagnosed within field replaceable unit (FRU). Repair time less than 2 hours on ESS, 1 minute on 3B20 REDUNDANCY –FULL DUPLICATION (of critical modules) CPU, memory, I/O, disks, bus systems –STANDBY SPARES call store ERROR DETECTION (at both hardware and software levels) –replication checks –timing checks –coding checks –internal checks (self-checking)

5 DS - X - CS - 4 AT&T's ELECTRONIC SWITCHING SYSTEMS ESS1A - ESS 5 AND 3B20 (3) replication checks –duplex system with comparison on every cycle timing checks –used in all hardware components; also several timer resets driven by software interrupts coding –m-out-of-n (4-out-of-8) codes, parity and cyclic codes internal checks –address limits –multiple comparators help software to locate faults faster
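As an illustration of the coding checks, the following sketch validates a 4-out-of-8 codeword: a word is legal only if exactly 4 of its 8 bits are 1. The function name and the integer encoding of codewords are assumptions made for this example; they do not describe the actual ESS hardware checker.

```python
def is_valid_m_out_of_n(word: int, m: int = 4, n: int = 8) -> bool:
    """Return True if `word` is an n-bit value with exactly m bits set."""
    if not 0 <= word < (1 << n):
        return False                      # does not fit in n bits
    return bin(word).count("1") == m      # population count must equal m


if __name__ == "__main__":
    codeword = 0b11110000
    print(is_valid_m_out_of_n(codeword))              # legal codeword
    # Any single-bit error changes the number of ones and is detected.
    for bit in range(8):
        corrupted = codeword ^ (1 << bit)
        print(bit, is_valid_m_out_of_n(corrupted))
```

Because every legal word has the same weight, any error that only turns ones into zeros (or only zeros into ones) changes the weight and is detected; such codes detect all unidirectional errors.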

6 DS - X - CS - 5 SYSTEM VIEW (3B20)

7 DS - X - CS - 6 FAULT TREATMENT Detection of an error generates an interrupt and the fault treatment and recovery programs (FT/RP) are invoked Three priority categories: –immediate interrupt (maintenance interrupt): if the fault is severe enough to affect the execution of the currently executing program –deferred interrupt: if too many calls are potentially affected by the interrupt, then wait until the completion of the currently executing program –polite interrupt: waits until the periodic routine diagnostic is executed FT/RP identify and isolate the faulty unit and reconfigure the system to use one fault-free CPU If storage has no duplication, another memory area will be assigned

8 DS - X - CS - 7 RELIABLE SOFTWARE GOALS OPERATE CONTINUOUSLY FOR MONTHS OR YEARS TECHNIQUES USED FOR HIGH SOFTWARE RELIABILITY –PROCESSES HAVE INDIVIDUAL FAULT RECOVERY AND ROLLBACK MECHANISMS WHICH RECOVER FROM HARDWARE FAILURES OR TRANSIENT SOFTWARE FAILURES –SYSTEM INTEGRITY SOFTWARE MONITORS CORRECT OPERATION OF THE ENTIRE HARDWARE AND SOFTWARE SYSTEM –AUDITS VALIDATE DATA CONSISTENCY AND RECLAIM LOST RESOURCES USING ROBUST DATA STRUCTURES –OVERLOAD CONTROLS ENSURE THE AVAILABILITY OF RESOURCES AND PREVENT CATASTROPHIC FAILURES EXCEPTION HANDLING TECHNIQUES –NONCRITICAL PROGRAMS USUALLY TERMINATE AND RESTART –CRITICAL PROGRAMS WILL ROLLBACK AND RETRY

9 DS - X - CS - 8 PROGRESSIVE RECOVERY
EFFORT LEVEL   ACTION
LOCAL          LOCAL RECOVERY
1              OPERATING SYSTEM AND I/O DRIVER ROLLBACK
2              QUICK BOOTSTRAP
3              COMPLETE BOOTSTRAP; RELOAD CONFIGURATION DATABASE
4              MANUAL: CLEAR ALL OF MEMORY; DO #3 ABOVE
ALTHOUGH DOWNTIME DOES NOT INCREASE SIGNIFICANTLY AS RECOVERY ACTIONS ESCALATE, DISRUPTIONS TO USERS OF APPLICATIONS (ABORTED TRANSACTIONS) DO INCREASE SIGNIFICANTLY
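The escalation idea can be sketched as a loop that tries the cheapest recovery action first and moves to the next level only if the current one fails. This is a minimal sketch under stated assumptions: the level names paraphrase the slide, while `progressive_recovery`, `example_levels` and the boolean "action succeeded" convention are hypothetical and not taken from the 3B20 software.

```python
from typing import Callable, List, Tuple

# A recovery level is a (name, action) pair; the action returns True on success.
RecoveryLevel = Tuple[str, Callable[[], bool]]

LEVEL_NAMES = [
    "local recovery",
    "operating system and I/O driver rollback",
    "quick bootstrap",
    "complete bootstrap with configuration database reload",
    "manual memory clear followed by complete bootstrap",
]


def progressive_recovery(levels: List[RecoveryLevel]) -> str:
    """Try each level in order and return the name of the first that succeeds."""
    for name, action in levels:
        if action():
            return name
    raise RuntimeError("all recovery levels failed")


def example_levels(failing: int) -> List[RecoveryLevel]:
    """Build stub levels in which the first `failing` actions do not succeed."""
    return [(name, (lambda i=i: i >= failing))
            for i, name in enumerate(LEVEL_NAMES)]


if __name__ == "__main__":
    # The two cheapest actions fail, so recovery escalates to the third level.
    print(progressive_recovery(example_levels(failing=2)))
```

Ordering the levels by disruption keeps most faults from ever reaching a bootstrap, which is the point the slide makes about aborted transactions.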

10 DS - X - CS - 9 SYSTEM ENHANCEMENT GOALS INSTALL NEW HARDWARE AND SOFTWARE –WITHOUT TAKING DOWN THE SYSTEM METHODS TO ADD UPDATES –CHANGE HARDWARE AND SOFTWARE WITH NO DISRUPTION IN SERVICE –INSTALL NEW HARDWARE, FIRMWARE, OR SOFTWARE WITH MINIMAL DISRUPTION IN SERVICE OFF-LINE SOFTWARE REPLACEMENT SYSTEM –COMPILE THE NEW SOURCE CODE –COMPARE NEW OBJECT CODE TO OLD OBJECT CODE –DETERMINE KINDS OF REPLACEMENTS NEEDED –GENERATE THE REPLACEMENT FILES METHODS TO REMOVE FAULTY UPDATES –BACK OUT ANY UPDATES WHICH WERE FOUND TO CONTAIN FAULTS –AUTOMATICALLY BACK OUT OF ANY UPDATE SUSPECTED OF CAUSING A FAILURE

11 DS - X - CS - 10 OPERATOR INTERFACE GOALS HELP EFFECT A QUICK REPAIR PROVIDE IMMEDIATE FEEDBACK ON STATUS OF SYSTEM HELP OPERATOR MAKE QUICK, ACCURATE DECISIONS PREVENT DANGEROUS OPERATOR MISTAKES PROVIDE POSITIVE CONTROL OF ALL PARTS OF SYSTEM

12 DS - X - CS - 11 FAULT INJECTION AND REPAIR SIMULATION 1) OVER 10,000 SINGLE HARDWARE FAULTS WERE INJECTED AT RANDOM AND AUTOMATIC SYSTEM RECOVERY WORKED IN OVER 99.8% OF CASES 2) IN 133 SIMULATED REPAIR CASES THE TROUBLE LOCATION PROCEDURE (TLP) FAILED TO LOCATE THE FAULTY MODULE IN 5 CASES, AND IN 94% OF THE LISTS OF SUSPECTED FAULTY COMPONENTS THE FAULT WAS LOCATED WITHIN THE FIRST FIVE MODULES

13 DS - X - CS - 12 AVAILABILITY ASSURANCE MODEL AVAILABILITY –THROUGH ENTIRE LIFECYCLE TEST FOR AVAILABILITY –TO MEET SPECIFIED AVAILABILITY TRACK ON-SITE EXPERIENCE –TO ENSURE AVAILABILITY OBJECTIVES ARE MET

14 DS - X - CS - 13 SEQUOIA (Marlboro, MA 01752; ph. 617-480-0800) TIGHTLY-COUPLED MULTIPROCESSOR capable of trading performance for dependability and vice versa MC68020 PROCESSORS (20MHz clock) –up to 64 PEs –up to 128 MEs (16 M bytes with ECC) –up to 96 IOEs –two 40-bit 10MHz buses FAULT DETECTION –error-detecting codes (e.g., half odd-half even parity) –comparison of duplicated operations (duplex microprocessors) –protocol monitoring –PE faults are located by polling RECONFIGURATION –reassignment to fault-free processors

15 DS - X - CS - 14 STRATUS (also IBM's System/88) (Natick, MA 01760; ph. 617-653-1466) TWO PAIRS OF DUPLEXED PEs (PAIR AND SPARE PAIR) UP TO 32 PEs ON RING-TYPE LOCAL AREA NETWORK RED-LIGHT NOTIFICATION ABOUT FAULTY BOARD ABILITY TO EXCHANGE BOARDS ON LINE ECC ON MEMORIES (up to 32M bytes per PE) PERFORMANCE/FAULT TOLERANCE OPTIONS

16 DS - X - CS - 15 STRATUS XA/R SERIES 300 MODULE: PAIR AND SPARE CONCEPT [Figure labels: CPU A, B; Memory Subsystem (x4); IOP; Disk Control; Comm; Ethernet]

17 DS - X - CS - 16 TANDEM (Cupertino, CA 95014; ph. 408-725-6000) CONFIGURATIONS: –SINGLE SYSTEM: 2-16 PEs –FIBER OPTIC CABLE-CONNECTED SYSTEM UP TO 224 PEs (14X16) –WORLD-WIDE NETWORK UP TO 4,080 PEs –THE FAULT-TOLERANT COMPUTER OF THE EIGHTIES FEATURES: NONSTOP II OR NONSTOP TXP PROCESSOR WITH 64KB CACHE DUAL DYNABUS (26 Mbytes/sec) 2-8 Mbyte memories Dual disk (MTBF for a single disk is 3-5 years; with dual disk, the MTBF increases to 1500 years) –FAULT DETECTION - 100% by duplication or by timeout mechanism (absence of "I'm alive" message) –FAULT-TOLERANT WITH RESPECT TO ANY SINGLE HARDWARE FAULT –RECOVERY by rollback to the latest checkpoint in memory –LATEST SYSTEM: INTEGRITY S2 USES TMR OF MIPS PROCESSORS ("SELECTIVE" TMR)
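The jump from a single-disk MTBF of a few years to a mirrored-pair MTBF on the order of a thousand years can be made plausible with the usual approximation for a repairable duplex, MTBF_pair ≈ MTBF² / (2 · MTTR). The slide does not state the repair time or the model behind its 1500-year figure, so the MTTR values below are assumptions chosen only to show the order of magnitude.

```python
HOURS_PER_YEAR = 8766.0   # assumption: 365.25 days


def mirrored_mtbf_years(disk_mtbf_years: float, mttr_hours: float) -> float:
    """Approximate MTBF of a mirrored disk pair with repair.

    Uses MTBF_pair ~= MTBF^2 / (2 * MTTR), valid when MTTR << MTBF and
    disk failures are independent (both are modelling assumptions).
    """
    mttr_years = mttr_hours / HOURS_PER_YEAR
    return disk_mtbf_years ** 2 / (2.0 * mttr_years)


if __name__ == "__main__":
    for mtbf in (3.0, 4.0, 5.0):
        # MTTR of 24 hours is a hypothetical repair time.
        print(mtbf, round(mirrored_mtbf_years(mtbf, mttr_hours=24.0)))
```

With a one-day repair time the pair MTBF comes out in the low thousands of years, the same order of magnitude as the slide's figure; the result is very sensitive to how quickly the failed disk is replaced.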

18 DS - X - CS - 17 NONSTOP CYCLONE (TANDEM COMPUTERS Inc.) CYCLONE TOLERATES SINGLE HARDWARE OR SOFTWARE FAULT IT USES A FAULT-TOLERANT LOAD BALANCING OPERATING SYSTEM CALLED GUARDIAN 90 GUARDIAN 90 MAINTAINS BACKUP OF USER PROCESSES ON SEPARATE PROCESSORS AND KEEPS CONSISTENCY BY PERIODIC CHECKPOINTING 16 AND 64 PROCESSOR CONFIGURATIONS WITH UP TO 2 GB MEMORY; 64 I/O CHANNELS (WITH FOX NETWORK UP TO 255 PROCESSORS CAN WORK TOGETHER)
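The backup-process idea can be illustrated with a toy process pair: the primary periodically copies its state to the backup, and after a primary failure the backup resumes from the last checkpoint, so work done since that checkpoint is lost. The class and method names are hypothetical and do not reflect the GUARDIAN 90 interfaces.

```python
import copy


class ProcessPair:
    """Toy model of a primary process with a checkpointed backup."""

    def __init__(self, initial_state: dict):
        self.primary_state = copy.deepcopy(initial_state)
        self.checkpoint = copy.deepcopy(initial_state)   # held by the backup

    def apply(self, update) -> None:
        """Run one unit of work on the primary."""
        update(self.primary_state)

    def take_checkpoint(self) -> None:
        """Send a consistent copy of the primary state to the backup."""
        self.checkpoint = copy.deepcopy(self.primary_state)

    def fail_over(self) -> dict:
        """Primary has failed: the backup takes over from the last checkpoint."""
        self.primary_state = copy.deepcopy(self.checkpoint)
        return self.primary_state


if __name__ == "__main__":
    pair = ProcessPair({"balance": 100})
    pair.apply(lambda s: s.update(balance=s["balance"] + 50))
    pair.take_checkpoint()
    pair.apply(lambda s: s.update(balance=s["balance"] + 25))  # not checkpointed
    print(pair.fail_over())
```

The checkpoint interval is the trade-off: frequent checkpoints cost performance during normal operation, while infrequent ones increase the work that must be redone after a takeover.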

19 DS - X - CS - 18 NONSTOP CYCLONE (TANDEM COMPUTERS Inc.) [Figure: TANDEM NONSTOP CYCLONE SYSTEM]

20 DS - X - CS - 19 CYCLONE SYSTEM ARCHITECTURE Superscalar proprietary CISC processors A “section” is a quad of processors connected by a duplexed DYNABUS (a proprietary, fault-tolerant bus, 40 MB/sec) “Sections” are also redundantly (duplexed both ways) interconnected by DYNABUS+, also a proprietary fault-tolerant bus, up to 50 m long, which uses fiber optics BASIC PRINCIPLE – FAIL FAST (concurrent error detection or “I’m alive” messages, combined with immediate termination of operation upon detection to minimize error propagation) Replacement of components: on line SEC-DED on memories Mirrored disks [Figure: four separate sections connected by DYNABUS+]
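Detection by missing “I’m alive” messages amounts to a timeout check per processor. The sketch below is a minimal illustration; the class name, the explicit `now` timestamps and the timeout value are assumptions and do not describe Tandem's implementation.

```python
class HeartbeatMonitor:
    """Declare a node failed when no "I'm alive" message arrives in time."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self.last_seen = {}          # node name -> time of last message

    def alive(self, node: str, now: float) -> None:
        """Record an "I'm alive" message received from `node` at time `now`."""
        self.last_seen[node] = now

    def failed_nodes(self, now: float) -> list:
        """Return the nodes whose last message is older than the timeout."""
        return sorted(node for node, seen in self.last_seen.items()
                      if now - seen > self.timeout)


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout=2.0)
    monitor.alive("cpu0", now=0.0)
    monitor.alive("cpu1", now=0.0)
    monitor.alive("cpu0", now=1.5)       # cpu1 stays silent
    print(monitor.failed_nodes(now=3.0))
```

Fail-fast design makes this check meaningful: a processor that stops at the first detected error goes silent instead of sending corrupted messages, so silence is a reliable failure indicator.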

21 DS - X - CS - 20 HIMALAYA K10000 (TANDEM COMPUTERS Inc.) [Figure labels: V, H; Processor; Multifunction Controller; I/O SLOT; Network Controller]

22 DS - X - CS - 21 HIMALAYA K10000’s INTERSECTION NETWORK [Figure labels: Dual Fiber Optic Rings; Section; Node]

23 DS - X - CS - 22 FTMP - FAULT-TOLERANT MULTIPROCESSOR (DRAPER LABS) THREE TRIADS IN TMR CONFIGURATION (NINE PROCESSOR SYSTEM) TMR ON COMMUNICATION LINES FAULT-TOLERANT TMR CLOCK FAULT-TOLERANT WITH RESPECT TO ANY SINGLE FAULT DESIGN GOALS –10^-9 FAILURES/HOUR –10 HOUR MISSION TIME –300 HOUR MAINTENANCE INTERVALS
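Two textbook formulas help put the design goals in perspective: the failure probability of a unit with constant failure rate over a mission, 1 − e^(−λt), and the reliability of a TMR triad built from units of reliability R, R_TMR = 3R² − 2R³. The sketch assumes independent failures and a perfect voter; the example failure rate for a single processor is hypothetical and not taken from the FTMP design.

```python
import math


def mission_failure_probability(failure_rate_per_hour: float, hours: float) -> float:
    """Probability of at least one failure, assuming a constant failure rate."""
    return -math.expm1(-failure_rate_per_hour * hours)   # 1 - exp(-lambda * t)


def tmr_reliability(r: float) -> float:
    """Reliability of a 2-out-of-3 voted triad with a perfect voter."""
    return 3 * r ** 2 - 2 * r ** 3


if __name__ == "__main__":
    # Design goal from the slide: 1e-9 failures/hour over a 10-hour mission.
    print(mission_failure_probability(1e-9, 10))

    # Hypothetical single processor with 1e-4 failures/hour over 10 hours.
    r_single = 1 - mission_failure_probability(1e-4, 10)
    print(1 - r_single, 1 - tmr_reliability(r_single))
```

TMR only helps while the individual units are more reliable than 0.5; the short mission time and the 300-hour maintenance interval keep each unit well inside that region.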

24 DS - X - CS - 23 FAULT-TOLERANT PARALLEL PROCESSOR (FTPP FROM Draper Labs) A four-triplex group cluster Byzantine resilience An ensemble of 16 triplex groups [Figure labels: Network Element; I/O; triplex groups T1, T2, T3, T4]

25 DS - X - CS - 24 SIFT - SOFTWARE IMPLEMENTED FAULT TOLERANCE NINE PROCESSOR SYSTEM WITH CAPABILITY TO SCHEDULE TASKS TO RUN ON 1, 3, 5, 7 OR 9 PROCESSORS DEPENDING ON TASK CRITICALITY LOCAL EXECUTIVE FOR EACH TASK –error handler/detector –scheduler –software voter –repeated communication GLOBAL EXECUTIVE –runs in TMR mode –allocates resources –diagnoses reports from local error handlers SYSTEM SHOULD HAVE FAILURE RATE < 10^-9 OVER 10 HOUR MISSION TIME FLEXIBLE TRADING OF PERFORMANCE AND RELIABILITY
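A software voter over an odd number of replicas can be sketched as a majority vote that also reports which processors disagreed, so that an error handler can pass the information on for diagnosis. The function name and the dictionary-based interface are assumptions for this example, not the SIFT executive's actual interface.

```python
from collections import Counter


def majority_vote(results: dict):
    """Vote over {processor: value}; return (majority value, dissenting processors).

    Raises ValueError when no value is held by a strict majority.
    """
    if not results:
        raise ValueError("no results to vote on")
    value, count = Counter(results.values()).most_common(1)[0]
    if count <= len(results) // 2:
        raise ValueError("no majority among replicas")
    suspects = sorted(p for p, v in results.items() if v != value)
    return value, suspects


if __name__ == "__main__":
    # A task replicated on three processors; p3 delivers a wrong result.
    print(majority_vote({"p1": 7, "p2": 7, "p3": 9}))
    # A more critical task replicated on five processors tolerates two faults.
    print(majority_vote({"p1": 4, "p2": 4, "p3": 4, "p4": 0, "p5": 1}))
```

Running a task on 2f + 1 processors masks up to f faulty results, which is how the replication level trades performance for reliability.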

26 DS - X - CS - 25 COMMUNICATION CONTROLLER EXAMPLE OF A SELF-TESTING MICROPROCESSOR-BASED SYSTEM A COMMUNICATION CONTROLLER FROM E-SYSTEMS, INC. THE CPU OF A SELF-TESTING SYSTEM SELF TEST PROGRAM IS STORED IN THE 1K TEST ROM. SELF TEST PROGRAM IS EXECUTED IN BACKGROUND MODE (INVOKED BY A LOW PRIORITY INTERRUPT). DETECTION OF FAULT CAUSES AN INDICATION LIGHT TO BE TURNED ON IN AN LED PANEL. THE ACTIVE MICROPROCESSOR MUST ACCESS AND RESET A TIMER AT REGULAR INTERVALS. FAILURE TO DO SO CAUSES A TIME-OUT CIRCUIT TO TRANSFER CONTROL TO THE BACK-UP MICROPROCESSOR AND TURN ON THE CPU FAULT LIGHT.

27 DS - X - CS - 26 THE CPU OF A SELF-TESTING SYSTEM ROMs ARE TESTED BY CHECK SUMMING RAM IS TESTED BY CHECKERBOARD PATTERNS WITH BUFFERING A CURRENT WORD UNDER TEST IN THE CPU REGISTER I/O TESTS ARE PERFORMED USING THE LOOP-BACK PROCEDURE, I.E., OUTPUTS ARE CONNECTED TO INPUTS UNDER THE CPU CONTROL. [Figure labels: MICROPROCESSOR NO. 1; MICROPROCESSOR NO. 2; SYSTEM BUS; CLOCK; TEST ROM; FAULT DISPLAY UNIT; TIME-OUT CIRCUIT; DISABLE µP NO. 1; DISABLE µP NO. 2] (from J.P. Hayes and E.J. McCluskey, IEEE Computer, March 1980)
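The ROM and RAM tests can be sketched as follows. This is a simplified illustration, not the controller's actual self-test program: the 8-bit additive checksum is an assumption, and the RAM test is a per-word variant that writes the two complementary checkerboard patterns (0x55 and 0xAA) to each word while saving and restoring the word under test, as the slide describes.

```python
def rom_checksum(rom: bytes) -> int:
    """8-bit additive checksum over the ROM contents (assumed scheme)."""
    return sum(rom) & 0xFF


def rom_ok(rom: bytes, expected_checksum: int) -> bool:
    """Compare the computed checksum with the stored reference value."""
    return rom_checksum(rom) == expected_checksum


def checkerboard_test(ram: bytearray) -> list:
    """Return the addresses that fail a write/read-back pattern test.

    The word under test is buffered (as in a CPU register) and restored
    afterwards, so the test can run in the background without losing data.
    """
    failing = []
    for addr in range(len(ram)):
        saved = ram[addr]                    # buffer the current word
        for pattern in (0x55, 0xAA):         # 01010101 and 10101010
            ram[addr] = pattern
            if ram[addr] != pattern:
                failing.append(addr)
                break
        ram[addr] = saved                    # restore original contents
    return failing


if __name__ == "__main__":
    rom = bytes([0x10, 0x20, 0x30])
    print(rom_ok(rom, expected_checksum=0x60))
    print(checkerboard_test(bytearray(b"\x01\x02\x03\x04")))
```

A full checkerboard test would also alternate the patterns between neighbouring addresses to expose coupling between adjacent cells; the per-word version above only catches stuck bits.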

28 DS - X - CS - 27 SPACE SHUTTLE SYSTEM The Data Processing System (DPS) of the Space Shuttle A FAULT-TOLERANT BUILDING BLOCK ARCHITECTURE Five General-Purpose Computers (GPCs) Time-shared Data Bus Two Magnetic Tape Mass Storage Units Specialized hardware components with redundancy level 2 to 5

29 DS - X - CS - 28 A FAULT-TOLERANT BUILDING BLOCK ARCHITECTURE (1) SELF-CHECKING AND FAULT TOLERANCE ARE PROVIDED AT THE PROCESSOR, MEMORY, I/O AND BUS. SELF-CHECKING COMPUTER MODULE (SCCM) CONTAINS FOUR TYPES OF BUILDING BLOCK CIRCUITS WHICH INTERFACE MEMORIES, PROCESSORS, I/O AND EXTERNAL BUSES TO AN INTERNAL SCCM BUS. THE BUILDING BLOCKS PROVIDE CONCURRENT FAULT DETECTION WITHIN THEMSELVES AND IN THEIR ASSOCIATED CIRCUITRY.

30 DS - X - CS - 29 A FAULT-TOLERANT BUILDING BLOCK

31 DS - X - CS - 30 SELF-CHECKING COMPUTER MODULES THE MEMORY INTERFACE BUILDING BLOCK (MIBB) –THE MIBB SUPPORTS SINGLE ERROR CORRECTION OR DOUBLE ERROR DETECTION –THE MIBB CAN BE COMMANDED TO REPLACE ANY TWO SPECIFIED BITS (IN ALL WORDS) WITH THE TWO SPARE BITS (PERMANENT CORRECTION) THE CORE BUILDING BLOCK (CBB) –DUAL PROCESSOR SYSTEM CONTINUOUSLY COMPARES PROCESSOR OUTPUTS AND SIGNALS A FAULT IF IT DETECTS A DISAGREEMENT –THE CBB ALSO SERVES AS A BUS ARBITER AND COLLECTS ALL FAULT INDICATIONS FROM OTHER BUILDING BLOCKS AND ITS OWN INTERNAL CIRCUITRY –IF A FAULT IS DETECTED, THE CBB ATTEMPTS EITHER A PROGRAM ROLLBACK OR RESTART –IF THE FAULT RECURS, THE CBB DISABLES ITS HOST COMPUTER BY HALTING THE PROCESSORS AND DISABLING THE SCCM OUTPUTS –ANOTHER OPTION IS TO CONTINUE OPERATION USING ONE FAULT-FREE PROCESSOR AND DEFER THE MAINTENANCE –THE CBB USES INTERNAL DUPLICATION AND SELF-CHECKING LOGIC
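The CBB policy described above (compare the two processors, retry once after a disagreement, disable the module if the fault recurs) can be sketched as follows. The names `run_with_comparison` and `ModuleDisabled` and the single-retry default are assumptions for this illustration; the real CBB is hardware logic, not software.

```python
class ModuleDisabled(Exception):
    """Raised when the module halts because a detected fault recurred."""


def run_with_comparison(execute_pair, max_retries: int = 1):
    """Run a step on two processors and accept it only if the outputs agree.

    `execute_pair()` returns (output_a, output_b). A disagreement triggers a
    rollback and retry; after `max_retries` failed retries the module is
    disabled, which stops faulty outputs from leaving the module.
    """
    for _ in range(max_retries + 1):
        out_a, out_b = execute_pair()
        if out_a == out_b:
            return out_a                 # outputs agree: result is accepted
    raise ModuleDisabled("fault recurred after retry; outputs disabled")


if __name__ == "__main__":
    # Transient fault: the first attempt disagrees, the retry agrees.
    attempts = iter([(1, 2), (3, 3)])
    print(run_with_comparison(lambda: next(attempts)))
```

Retrying before disabling distinguishes transient faults, which usually vanish on re-execution, from permanent ones, which recur and justify taking the module out of service.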

32 DS - X - CS - 31 BUS INTERFACE BUILDING BLOCKS (BIBBS) THE BIBBS PROVIDE COMMUNICATIONS THROUGH REDUNDANT BUSES WITH OTHER COMPUTERS IN THE NETWORK STATUS MESSAGES AND CODING VERIFY PROPER TRANSMISSION AND REDUNDANT BUSES PROVIDE BACKUP TRANSMISSION PATHS OVERHEAD ANALYSIS –NONREDUNDANT SYSTEM REQUIRES 35 LSI CHIPS –ADDING SCCMs INCREASES THE CHIP COUNT TO 43 (23% INCREASE) –MEMORY OVERHEAD, IF ALL OPTIONS ARE INCLUDED, MAY BE AS HIGH AS 60%

33 DS - X - CS - 32 SIFT CLOCK SYNCHRONIZATION ALGORITHM 1. "READ" CLOCK VALUES C1, C2, ..., CN FROM OTHER CLOCKS 2. COMPUTE [formula missing from transcript]* 3. COMPUTE NEW CLOCK VALUE [formula missing from transcript] 4. CLOCKS SYNCHRONIZED TO ≤ 50 µs *(ELIMINATES EFFECTS OF GROSSLY DIFFERENT OR FAILED CLOCKS)
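The formulas for this slide did not survive in the transcript. The sketch below therefore shows one common fault-tolerant averaging scheme of the kind the starred note describes, in which clock readings that differ from the local clock by more than a threshold are treated as if they equalled the local clock before averaging. Whether this matches the slide's exact formulas is an assumption, and the function name and threshold parameter are hypothetical.

```python
def clock_correction(own: float, others: list, threshold: float) -> float:
    """Return the correction to add to the local clock.

    Differences larger than `threshold` are replaced by 0, which removes
    the influence of grossly different or failed clocks on the average.
    """
    differences = [0.0]                      # the local clock differs from itself by 0
    for reading in others:
        diff = reading - own
        differences.append(diff if abs(diff) <= threshold else 0.0)
    return sum(differences) / len(differences)


if __name__ == "__main__":
    own = 100.0
    readings = [102.0, 98.0, 10_000.0]       # the last clock has failed
    correction = clock_correction(own, readings, threshold=10.0)
    print(own + correction)                  # new local clock value
```

Clamping outliers instead of discarding them keeps the divisor fixed at the number of clocks, so a faulty clock can shift the average only by a bounded amount.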

34 DS - X - CS - 33 CONCLUSIONS USE COMBINED METHODS OF: –CODING –RECONFIGURATION –REPLICATION –TIMERS –WATCHDOG PROCESSOR –RECOVERY POINTS –ROLL BACK OR ROLL FORWARD REMEMBER THE CONCEPT OF VERTICAL MIGRATION

