Published by Erica Houston. Modified over 8 years ago.

DS - IX - NFT - 0
HUMBOLDT-UNIVERSITÄT ZU BERLIN
INSTITUT FÜR INFORMATIK
DEPENDABLE SYSTEMS
Lecture 9
NETWORK FAULT TOLERANCE
Winter semester 1999/2000
Instructor: Prof. Dr. Miroslaw Malek
www.informatik.hu-berlin.de/~rok/ftc

DS - IX - NFT - 1
NETWORK FAULT TOLERANCE
OBJECTIVES:
–TO INTRODUCE FAULT TOLERANCE TECHNIQUES USED IN COMPUTER NETWORKS
CONTENTS:
–COMPUTER NETWORKS
–BASIC TECHNIQUES
–EXAMPLE: MULTISTAGE NETWORKS

DS - IX - NFT - 2
COMPUTER NETWORKS
PACKET SWITCHING VS. CIRCUIT SWITCHING
POINT-TO-POINT VS. INDIRECT
STATIC VS. DYNAMIC
SINGLE PATH VS. MULTIPATH
EXAMPLES:
–BUS
–RING
–MULTISTAGE (e.g., BANYAN)
–CUBE
–STAR
–TREE

DS - IX - NFT - 3
BASIC TECHNIQUES
RETRY (RETRANSMISSION)
COMPLEMENTED RETRY WITH CORRECTION
REPLICATION (e.g., DUAL BUS)
CODING
SPECIAL PROTOCOLS (SINGLE HANDSHAKE, DOUBLE HANDSHAKE, ETC.)
TIMING CHECKS
REROUTING
RETRANSMISSION WITH SHIFT (INTELLIGENT RETRY)
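
As an illustration of complemented retry with correction, the sketch below assumes an 8-bit data path with a single stuck-at data line and a parity bit carried on a separate, fault-free line. None of these details come from the slides, and `faulty_link` and `send_with_complemented_retry` are hypothetical names. If the first transfer fails the parity check, the stuck line must have flipped a bit, so the complemented word passes through that line unchanged and the receiver recovers the data by complementing it again.

```python
WIDTH = 8                      # assumed data-path width
MASK = (1 << WIDTH) - 1


def parity(word: int) -> int:
    """Return the parity (0 or 1) of the set bits in the data word."""
    return bin(word & MASK).count("1") & 1


def faulty_link(word: int, stuck_bit: int, stuck_value: int) -> int:
    """Hypothetical link whose data line `stuck_bit` is stuck at `stuck_value`."""
    if stuck_value:
        return (word | (1 << stuck_bit)) & MASK
    return word & ~(1 << stuck_bit) & MASK


def send_with_complemented_retry(word: int, link) -> int:
    """Transfer `word` over `link`; on a parity error, retry with the complement.

    Assumes at most one stuck-at data line and a fault-free parity line.
    """
    expected_parity = parity(word)
    received = link(word)
    if parity(received) == expected_parity:
        return received                    # first attempt passed the check
    # A single stuck line flipped one bit, so the original bit differs from
    # the stuck value; its complement equals the stuck value and gets through.
    received_complement = link(~word & MASK)
    return ~received_complement & MASK


# Usage: line 0 stuck at 0 corrupts 0b00000001 on the first attempt.
link = lambda w: faulty_link(w, stuck_bit=0, stuck_value=0)
recovered = send_with_complemented_retry(0b00000001, link)
```

The same idea does not cover several stuck lines at once, since some may corrupt the original word and others its complement.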

DS - IX - NFT - 4
EXAMPLE MULTICOMPUTER NETWORKS (1)
OBJECTIVE:
–RELIABLE AND TIMELY, HIGH BANDWIDTH DATA TRANSFER
ISSUES:
–FAULT IMPACT
–RELIABILITY EVALUATION
–TESTING
–FAULT DIAGNOSIS
–RECOVERY
–FAULT TOLERANCE

DS - IX - NFT - 5
EXAMPLE MULTICOMPUTER NETWORKS (2)
LEVEL:
–SWITCH LEVEL: CODES, PROTOCOLS, CONTROL, DATA, TIME
–SYSTEM LEVEL: CODES, PROTOCOLS, CONTROL, DATA, TIME

DS - IX - NFT - 6
MULTICOMPUTER NETWORK FAULT CLASSES AND THEIR IMPACT
FAULT CLASS I - DATA LINK OR DATA REGISTERS
–STUCK-AT-0
–STUCK-AT-1
–OR-BRIDGE
–AND-BRIDGE
FAULT CLASS II - CONTROL LINES
–(DATA VALID LINE) STUCK AT VALID
–(REQUEST/ACK) STUCK-AT-0, STUCK-AT-1
–(DATA STROBE) STUCK-AT-1, STUCK-AT-0
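
The Class I faults can be modeled as simple functions on a data word. The sketch below is an illustration only; the function names and the bit-numbering convention (bit 0 is the least significant line) are assumptions, not part of the lecture.

```python
def _bit(word: int, i: int) -> int:
    """Return bit i of a non-negative word."""
    return (word >> i) & 1


def _set(word: int, i: int, value: int) -> int:
    """Return word with bit i forced to value (0 or 1)."""
    return (word & ~(1 << i)) | (value << i)


def stuck_at(word: int, line: int, value: int) -> int:
    """Data line `line` always reads `value`, whatever was driven on it."""
    return _set(word, line, value)


def or_bridge(word: int, a: int, b: int) -> int:
    """Lines a and b are shorted so that both carry the OR of their values."""
    v = _bit(word, a) | _bit(word, b)
    return _set(_set(word, a, v), b, v)


def and_bridge(word: int, a: int, b: int) -> int:
    """Lines a and b are shorted so that both carry the AND of their values."""
    v = _bit(word, a) & _bit(word, b)
    return _set(_set(word, a, v), b, v)


# Usage: the same driven word seen through three different faults.
driven = 0b0001
seen_stuck = stuck_at(driven, line=0, value=0)   # line 0 stuck at 0
seen_or = or_bridge(driven, 0, 1)                # lines 0 and 1 OR-bridged
seen_and = and_bridge(driven, 0, 1)              # lines 0 and 1 AND-bridged
```

Note that a bridge fault is invisible whenever the two shorted lines already carry the same value, which is one reason such faults can stay latent.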

DS - IX - NFT - 7
FAULT IMPACT
DATA BIT ERROR
–NO IMMEDIATE IMPACT, BUT THE ERROR WILL SHOW UP AT HIGHER LEVELS LATER. IT MAY BE OUTSIDE THE SPHERE OF CONTROL BY THE TIME IT IS DETECTED.
ADDRESS TAG ERROR
–THE DATA PACKET CANNOT REACH THE INTENDED DESTINATION. THIS MAY CAUSE WRONG DATA TO BE RETRIEVED.
STUCK AT SOME VALID CONFIGURATION
–THE DATA PACKET WILL BE MISDIRECTED
OPEN CONNECTION
–COMPLETE DATA LOSS
SHORT CONNECTION
–MAY CAUSE A BROADCASTING EFFECT; THE DATA PACKET IS MISDIRECTED

DS - IX - NFT - 8
THE FAULT IMPACTS CAN BE GROUPED INTO:
1. CORRUPTED DATA
2. LOST DATA
3. UNEXPECTED DATA
THESE FAULTS CAN BE EXTRACTED FROM THE SWITCH AND MODELED BY A FAULTY CHANNEL THAT WILL CORRUPT, LOSE, OR DELAY DATA TRANSMITTED THROUGH IT.
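
A minimal sketch of such a faulty channel follows. The class name `FaultyChannel`, the per-message probabilities, and the discrete tick-based notion of delay are all assumptions made for illustration; the slides only state that the channel may corrupt, lose, or delay data.

```python
import random
from dataclasses import dataclass, field


@dataclass
class FaultyChannel:
    """Hypothetical channel that may corrupt, lose, or delay each payload."""
    p_corrupt: float = 0.0
    p_lose: float = 0.0
    p_delay: float = 0.0
    max_delay: int = 3                      # maximum extra ticks of delay
    rng: random.Random = field(default_factory=lambda: random.Random(0))
    _in_flight: list = field(default_factory=list)   # (deliver_at, payload)
    _now: int = 0

    def send(self, payload: bytes) -> None:
        if self.rng.random() < self.p_lose:
            return                          # lost data: never delivered
        if payload and self.rng.random() < self.p_corrupt:
            # Corrupted data: flip one randomly chosen bit.
            i = self.rng.randrange(len(payload))
            flipped = payload[i] ^ (1 << self.rng.randrange(8))
            payload = payload[:i] + bytes([flipped]) + payload[i + 1:]
        delay = 0
        if self.rng.random() < self.p_delay:
            delay = self.rng.randint(1, self.max_delay)
        self._in_flight.append((self._now + delay, payload))

    def tick(self) -> list:
        """Advance time by one step and return the payloads due at this step."""
        due = [p for t, p in self._in_flight if t <= self._now]
        self._in_flight = [(t, p) for t, p in self._in_flight if t > self._now]
        self._now += 1
        return due


# Usage: a channel that always delays by exactly one tick.
slow = FaultyChannel(p_delay=1.0, max_delay=1)
slow.send(b"req")
first_tick = slow.tick()    # nothing arrives yet
second_tick = slow.tick()   # the delayed payload arrives
```

Detection and recovery mechanisms can then be exercised against this one model instead of against each individual switch fault.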

DS - IX - NFT - 9
WHERE TO DETECT AND RECOVER
THERE ARE THREE LEVELS WHERE WE CAN PERFORM ERROR DETECTION AND RECOVERY:
1. SWITCH LEVEL
2. PME LEVEL
3. SOFTWARE LEVEL

DS - IX - NFT - 10
SWITCH LEVEL
COSTS THE LEAST (IN TERMS OF COMPUTATION) TO RECOVER
HAS THE HIGHEST COVERAGE; MOST ERRORS ARE WITHIN THE "SPHERE OF CONTROL"
NEEDS EXTRA HARDWARE
THE DESIGN OF THE DETECTION/CORRECTION MECHANISM NEEDS TO CONSIDER IMPLEMENTATION LIMITS SUCH AS LOGIC COMPLEXITY AND I/O PIN USAGE

DS - IX - NFT - 11
LOCALIZED RECOVERY
SINCE 99 PERCENT OF ERRORS ARE "SOFT", RETRY IS AN EFFECTIVE WAY TO RECOVER FROM FAULTS
100 PERCENT COVERAGE OF SINGLE MESSAGE LOSS
REQUIRES ONLY A MODEST NUMBER OF PINS
ERROR CORRECTING CODES HAVE PROHIBITIVE PINOUT (62% OVERHEAD FOR AN 8-BIT DATA CHANNEL)
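
The slide does not say which error correcting code lies behind the 62% figure. One reading that reproduces it, offered here as an assumption, is a Hamming code with single error correction and double error detection (SEC-DED): 8 data bits need 4 check bits for single error correction plus 1 overall parity bit, and 5/8 = 62.5%. The helper names below are hypothetical.

```python
def sec_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (Hamming bound for SEC)."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits: int) -> int:
    """SEC-DED adds one overall parity bit to the SEC Hamming code."""
    return sec_check_bits(data_bits) + 1


def overhead(check_bits: int, data_bits: int) -> float:
    """Extra lines (pins) as a fraction of the data lines."""
    return check_bits / data_bits


# Usage: compare a single parity bit with SEC-DED on an 8-bit channel.
parity_overhead = overhead(1, 8)                       # detection only
secded_overhead = overhead(secded_check_bits(8), 8)    # correction
```

Under this assumption, detection by a single parity line plus retry costs one extra line per channel, whereas correction costs five.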

DS - IX - NFT - 12
FAULT TOLERANCE TECHNIQUES FOR GLOBAL RECOVERY
1. DYNAMIC FULL ACCESS (DFA)
–IF THE NETWORK GRAPH IS MAXIMALLY CONNECTED, THE RECOVERY IS FEASIBLE
2. MULTIPLE NETWORKS (FAULT TOLERANCE + IMPROVED PERFORMANCE)
–WITH OR WITHOUT BRIDGES
3. REDUNDANT SWITCHES
4. EXTRA-STAGE
5. CODING
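
Dynamic full access is commonly described as the property that every input can still reach every output in a finite number of passes through the network, with each output fed back to the input of the same index. The sketch below checks that property for a given one-pass reachability relation. The representation (a dict from terminal to the set of terminals reachable in one pass) and the two example relations are hypothetical and not taken from the lecture.

```python
def has_dynamic_full_access(one_pass: dict) -> bool:
    """Return True if every terminal reaches every terminal in >= 1 passes.

    `one_pass[i]` is the set of outputs that input i can reach in a single
    pass through the (possibly faulty) network.
    """
    terminals = set(one_pass)
    for src in terminals:
        seen = set()
        frontier = list(one_pass[src])
        while frontier:
            t = frontier.pop()
            if t in seen:
                continue
            seen.add(t)
            frontier.extend(one_pass[t])   # output t re-enters as input t
        if seen != terminals:
            return False
    return True


# Usage: two hypothetical 4-terminal networks degraded by faults.
still_connected = {0: {0, 1}, 1: {2, 3}, 2: {0, 1}, 3: {2, 3}}
partitioned = {0: {0, 1}, 1: {0, 1}, 2: {2, 3}, 3: {2, 3}}
dfa_ok = has_dynamic_full_access(still_connected)
dfa_lost = has_dynamic_full_access(partitioned)
```

In the first relation, input 0 cannot reach outputs 2 and 3 directly but can do so in two passes via terminal 1; in the second, the faults split the terminals into two groups and no number of passes helps.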

DS - IX - NFT - 13
PME LEVEL
THERE ARE 8 BYTES IN ONE REQUEST; THEREFORE 3 EXTRA BITS MAY BE NEEDED FOR SEQUENCING. ON A 4X4 UNIDIRECTIONAL SWITCH, THIS MEANS 24 MORE PINS.
FOR REQUESTS WHOSE RELATIVE ORDER NEEDS TO BE KEPT, SOME EXTRA BITS ARE NEEDED, OR ELSE SEQUENTIAL CONSISTENCY MAY BE VIOLATED.
ANOTHER WAY TO GET AROUND THIS IS TO ALLOW ONLY ONE OUTSTANDING REQUEST FOR SHARED DATA. HOWEVER, NOT ALL SHARED DATA MAY BE USED FOR SYNCHRONIZING, SO A FENCE COUNTER SHOULD BE PROVIDED TO LET THE PROGRAMMER DECIDE ON THE NUMBER OF ALLOWED OUTSTANDING REQUESTS.
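
The pin arithmetic above can be reproduced under one assumption that the slide does not spell out: the 3 sequence bits are added to each of the 4 input ports and each of the 4 output ports of the switch. The fence counter is sketched as a simple limit on outstanding requests. `FenceCounter` and the other names are hypothetical and only illustrate the idea.

```python
def sequence_bits(items_per_request: int) -> int:
    """Bits needed to number the items of one request (8 items -> 3 bits)."""
    return (items_per_request - 1).bit_length()


def extra_pins(inputs: int, outputs: int, bits_per_port: int) -> int:
    """Extra pins if every input and output port carries the extra bits."""
    return (inputs + outputs) * bits_per_port


class FenceCounter:
    """Hypothetical counter limiting the number of outstanding requests."""

    def __init__(self, limit: int) -> None:
        self.limit = limit          # chosen by the programmer
        self.outstanding = 0

    def try_issue(self) -> bool:
        """Issue a request if the limit allows it; return whether it was issued."""
        if self.outstanding >= self.limit:
            return False
        self.outstanding += 1
        return True

    def complete(self) -> None:
        """Record that one outstanding request has been acknowledged."""
        if self.outstanding == 0:
            raise ValueError("no outstanding request to complete")
        self.outstanding -= 1


# Usage: the numbers from the slide, then a fence of one outstanding request.
bits = sequence_bits(8)
pins = extra_pins(inputs=4, outputs=4, bits_per_port=bits)
fence = FenceCounter(limit=1)
issued_first = fence.try_issue()
issued_second = fence.try_issue()   # blocked until the first completes
```

A limit of 1 corresponds to the strict "one outstanding request" rule; a larger limit trades ordering guarantees for throughput on data that is not used for synchronization.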

DS - IX - NFT - 14
SOFTWARE LEVEL
BY THE TIME AN ERROR IS DETECTED, IT MAY BE TOO LATE TO RECOVER.
EVEN IF RECOVERY IS STILL POSSIBLE, IT IS OFTEN EXPENSIVE (IN TERMS OF COMPUTATION REQUIRED).
TO BE ABLE TO ROLL BACK, CHECKPOINT INFORMATION HAS TO BE SAVED FREQUENTLY. THIS INCREASES SYSTEM OVERHEAD.
RESTART (OR GLOBAL RESET) IS VERY EXPENSIVE IN TERMS OF TIME.
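
The cost trade-off of checkpointing can be seen in a small sketch: state is copied every `interval` steps, and a detected error rolls execution back to the last copy. The function and exception names are hypothetical, and the deep copy stands in for whatever checkpoint mechanism a real system would use.

```python
import copy


class DetectedError(Exception):
    """Raised by a step when an error is detected at the software level."""


def run_with_checkpoints(steps, state, interval, max_rollbacks=10):
    """Run `steps` on `state`, checkpointing every `interval` steps.

    Returns (final_state, number_of_rollbacks).
    """
    saved_state, saved_index = copy.deepcopy(state), 0
    rollbacks = 0
    i = 0
    while i < len(steps):
        if i % interval == 0:
            # Checkpoint: this copy is the recurring overhead.
            saved_state, saved_index = copy.deepcopy(state), i
        try:
            steps[i](state)
            i += 1
        except DetectedError:
            rollbacks += 1
            if rollbacks > max_rollbacks:
                raise
            # Roll back: discard everything done since the checkpoint.
            state = copy.deepcopy(saved_state)
            i = saved_index
    return state, rollbacks


# Usage: the third step corrupts the state once, then works on re-execution.
failures = {"left": 1}

def inc(s):
    s["x"] += 1

def flaky_inc(s):
    if failures["left"] > 0:
        failures["left"] -= 1
        s["x"] += 100               # corrupt the state before detection
        raise DetectedError("bad result detected")
    s["x"] += 1

final_state, rollbacks = run_with_checkpoints(
    [inc, inc, flaky_inc, inc], {"x": 0}, interval=2)
```

A shorter `interval` means less re-execution after an error but more copies during fault-free operation, which is the overhead the slide refers to.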

DS - IX - NFT - 15
OBSERVATIONS
THE IMPACT OF A FAULT ON A MULTISTAGE NETWORK MAY BE SEVERE.
THE FAULT IMPACT DEPENDS ON THE FAULT LOCATION (LEVEL).
A SWITCH FAULT IS MORE SEVERE THAN A LINE FAULT.
EXTRA-STAGE WILL NOT HELP IF INSTANTANEOUS RECOVERY IS NOT ASSURED.
USE RETRY FOR TRANSIENT AND INTERMITTENT FAULTS.
USE LOCALIZED REROUTING FOR PERMANENT FAULTS.
DFA AND EXTRA-STAGE COMBINED MAY PROVIDE A VERY EFFECTIVE SOLUTION IN THE CASE OF MULTIPLE FAULTS.
A FAULT-TOLERANT SWITCHING ELEMENT PROTOCOL AND MINIMIZATION OF ERROR LATENCY ARE CRUCIAL TO SATISFACTORY SYSTEM OPERATION.
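
The retry-then-reroute recommendation can be sketched as a small delivery policy: retry on the current path a bounded number of times (covering transient and intermittent faults), and if that fails, treat the fault as permanent and move to the next path. The `deliver` function, the `send` callback signature, and the retry limit are assumptions for illustration.

```python
def deliver(packet, paths, send, max_retries=2):
    """Try each path in order, retrying up to `max_retries` times per path.

    `send(path, packet)` is a caller-supplied function returning True on
    success. Returns (path_used, total_attempts).
    """
    attempts = 0
    for path in paths:
        for _ in range(1 + max_retries):
            attempts += 1
            if send(path, packet):
                return path, attempts
        # Retries exhausted: assume a permanent fault and reroute.
    raise RuntimeError("no working path to destination")


# Usage: path "A" has a permanent fault, path "B" works.
def permanent_fault_on_a(path, packet):
    return path == "B"

used_path, total_attempts = deliver("pkt", ["A", "B"], permanent_fault_on_a)
```

With the default limit, the permanent fault on path "A" costs three failed attempts before the packet is rerouted, which illustrates why error latency matters for overall recovery time.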