Reliability and Error Control 5/17/11

EE 382C - S11 - Lecture 14

Announcements
Don't forget to iterate with us on your checkpoint 1 report. Send time slot preferences for checkpoint 2. Project presentations are next week; let us know if you are OK with presenting on Tuesday, May 24th.

Question of the day
Consider a symmetric multiprocessing (SMP) network that does not allow packet loss and needs an availability of 0.99999. Link BER is 10^-15. Router components have a failure rate of 1000 FITs. How best can you achieve this reliability requirement?
BER: bit-error rate. SER: soft error rate. The inverse of an error rate is the mean interval between errors.

Reliability and Availability
Reliability, R(t): the probability that the system is working at time t, given that it was working at time t = 0 and has had no failures in between.
Availability, A(t): the probability that the system is working when needed, at a given point in time t. It is often affected by the repair process: A ≈ MTBF / (MTBF + MTTR).
MTBF: mean time between failures. FIT: failures in time, the inverse of MTBF with zero repair time. MTTR: mean time to recovery.
RAS requirements: reliability, availability, and serviceability.
Failure modes are the physical causes of errors.
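The availability formula above can be tried numerically. The following is a minimal sketch, not part of the lecture: the function names are made up for illustration, and it assumes the usual convention that one FIT is one failure per 10^9 hours and that a single component is being considered in isolation.

```python
# Assumption: 1 FIT = 1 failure per 10^9 hours of operation.
HOURS_PER_FIT_WINDOW = 1e9


def mtbf_hours(fits: float) -> float:
    """Mean time between failures, in hours, for a failure rate given in FITs."""
    return HOURS_PER_FIT_WINDOW / fits


def availability(mtbf: float, mttr: float) -> float:
    """Steady-state availability A = MTBF / (MTBF + MTTR), both in hours."""
    return mtbf / (mtbf + mttr)


def max_mttr_hours(mtbf: float, target: float) -> float:
    """Largest MTTR (hours) that still meets the target availability.

    Obtained by solving target = MTBF / (MTBF + MTTR) for MTTR.
    """
    return mtbf * (1.0 - target) / target
```

For example, `mtbf_hours(1000)` gives 10^6 hours for one 1000-FIT component, and `max_mttr_hours(1e6, 0.99999)` is roughly 10 hours. A network contains many such components, so the budget for the whole system would be tighter than this single-component figure.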

Examples of RAS Requirements
Enterprise server: A = 0.99999 is a system-level requirement. It can be reflected as a network-level requirement, or the system can detect and recover from network failures. In general, every packet must be correctly received or the system will fail.
Internet router: by contrast, it is OK to drop packets (at a rate of 10^-15), so failures can be turned into packet drops.

RAS Requirements in Those Systems
Dropping (reliability): allowed or not; rate allowed (e.g., 10^-15).
Availability (A): 0.999 to 0.99999.
Serviceability (MTTR).

MTTF and MTTR
Which is cheaper to improve, MTTF or MTTR? Which is easier?

Failure Modes and Fault Models
Failure mode                       Fault model   Rate     Units
Gaussian noise on channel          Transient     10^-20   BER
Alpha particle strike on memory    Soft          10^-9    SER
Alpha particle strike on logic                   10^-10
Electromigration                   Stuck-at      1        MTBF
Connector corrosion                              10
Operator removes module            Fail-stop     10^5
Software failure                                 10^4
Failure mode: the physical cause of a failure. This table is for interconnection networks. Rate is an indication of the magnitude, not a time scale. Byzantine failures: the system continues operating in a corrupted state and violates protocols in an attempt to make adjacent modules fail too; they are rare in practice. BER and SER differ in that SER applies to memory elements.

An Analogy

The Bathtub Curve
Infant mortality occurs because of marginal components, so manufacturers burn in their chips before shipping them.

Detection, Containment, and Recovery
A three-step program for dealing with errors:
Detection (discover the error): CRC codes on channels; parity or ECC codes on memories; self-checking logic.
Containment (prevent the error from propagating further): mask it; drop the packet (and retry); fail-stop.
Recovery (resume normal service): return to a known state; resume sending traffic; possibly resend the faulted packet.
These mechanisms exist at many levels: link level, router level, end-to-end protocol level, and so on. It is often desirable to protect every flit, because if the CRC is only in the tail flit, the earlier flits may already have corrupted intermediate state. One option is to use two CRCs in every flit: one for the control information and another for the rest of the flit. Random flits should be sent over idle channels to keep testing them.

Example – Link-Level Error Control
Detection: CRC on the channel. Containment: drop the packet with the error. Recovery: request retransmission and resume the normal sequence.
How can this fail? How to fix it?
The sending router keeps copies of flits until it receives an acknowledgment. This can fail if the ack itself has an error. It can be fixed by piggybacking the ack onto a flit traveling in the reverse direction, so that the ack is checked by the same mechanism. The operation is masked from the router: the router never sees that an error happened. An n-bit CRC detects all but 1 in 2^n multibit errors, and all errors that involve fewer than n bits.
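The detection step can be illustrated with a short sketch. This is not the lecture's implementation: it uses the 32-bit CRC from Python's standard `zlib` module as a stand-in for whatever CRC a real link would compute in hardware, and the helper names `add_crc`, `check_crc`, and `flip_bit` are invented for the example.

```python
import zlib

CRC_BYTES = 4  # zlib.crc32 produces a 32-bit checksum


def add_crc(payload: bytes) -> bytes:
    """Sender side: append the CRC of the payload to form a frame."""
    return payload + zlib.crc32(payload).to_bytes(CRC_BYTES, "big")


def check_crc(frame: bytes) -> bool:
    """Receiver side: recompute the CRC and compare it with the received one."""
    payload = frame[:-CRC_BYTES]
    received = int.from_bytes(frame[-CRC_BYTES:], "big")
    return zlib.crc32(payload) == received


def flip_bit(frame: bytes, bit_index: int) -> bytes:
    """Model a single-bit channel error at the given bit position."""
    corrupted = bytearray(frame)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)
```

A frame built with `add_crc` passes `check_crc`, while the same frame after `flip_bit` fails it, whether the flipped bit lies in the payload or in the CRC field itself. A receiver that sees a failed check would drop the flit and request retransmission.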

Link-Level Error Control (2)
Flit 2 was in error, so flits 2-6 are retransmitted. Why would you want to retransmit flits 3-6? Retransmitting only flit 2 would reorder the flits on the channel and considerably complicate the router input logic; for instance, a body flit might be received before a head flit.
Pointers: Ack, the next flit to be acknowledged; Tx, the next flit to be transmitted; Tail, the next free slot.
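The three pointers can be sketched as a small retransmit buffer. This is an illustrative model under simplifying assumptions, not the hardware design: the class name and methods are hypothetical, the buffer is an unbounded Python list rather than a fixed-size circular buffer, and acknowledgments are assumed to arrive in order.

```python
class RetransmitBuffer:
    """Sender-side buffer that rolls back to the oldest unacknowledged flit."""

    def __init__(self):
        self.flits = []  # stored copies of flits, in transmission order
        self.ack = 0     # index of the next flit to be acknowledged
        self.tx = 0      # index of the next flit to be transmitted

    @property
    def tail(self):
        """Index of the next free slot."""
        return len(self.flits)

    def enqueue(self, flit):
        """Accept a new flit from the router at the tail."""
        self.flits.append(flit)

    def transmit(self):
        """Send the flit at the Tx pointer, or None if nothing is waiting."""
        if self.tx == self.tail:
            return None
        flit = self.flits[self.tx]
        self.tx += 1
        return flit

    def on_ack(self):
        """The receiver accepted the oldest outstanding flit."""
        if self.ack >= self.tx:
            raise RuntimeError("ack received with no flit outstanding")
        self.ack += 1

    def on_error(self):
        """The receiver reported an error: resend from the Ack pointer onward."""
        self.tx = self.ack
```

With flits 1 to 6 enqueued and transmitted, one `on_ack()` for flit 1 followed by `on_error()` for flit 2 moves Tx back to flit 2, so the next five calls to `transmit()` return flits 2, 3, 4, 5, 6 in their original order.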

Channel Configuration
Reconfigure channels with frequent errors (BER above some threshold): swap in “spare bits”, reduce the width of the channel, or reduce the bit rate. The bit rate is reduced by using only the good bits of the channel. If malfunctions continue, decommission the channel. This assumes the routing algorithm will adapt; how it adapts also depends on the topology.

Cray BlackWidow Example
Each channel is 3 bits wide at 6.25 Gb/s per bit (b = 18.75 Gb/s); each 24-bit flit is serialized over the 3 bits. Link-level retry rates are monitored, and each retry is attributed to one bit of the channel. If the retry rate exceeds a threshold, the “bad bit” is switched off. The channel degrades to two bits, then one bit, and is then switched off.
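The degradation policy can be modeled in a few lines. This is a simplified sketch and not Cray's mechanism: the class name is invented, the threshold value is hypothetical, and a plain retry count per bit stands in for the monitored retry rate.

```python
class DegradableChannel:
    """Channel whose bit lanes are switched off one by one as retries accumulate."""

    def __init__(self, lanes=3, gbps_per_lane=6.25, retry_threshold=4):
        self.gbps_per_lane = gbps_per_lane
        self.retry_threshold = retry_threshold  # hypothetical value, for illustration only
        self.retries = [0] * lanes
        self.active = [True] * lanes

    def record_retry(self, lane):
        """Attribute one link-level retry to a lane; disable it past the threshold."""
        if not self.active[lane]:
            return
        self.retries[lane] += 1
        if self.retries[lane] > self.retry_threshold:
            self.active[lane] = False

    @property
    def bandwidth_gbps(self):
        """Bandwidth provided by the lanes that are still switched on."""
        return sum(self.active) * self.gbps_per_lane

    @property
    def in_service(self):
        """False once every lane has been switched off."""
        return any(self.active)
```

A fresh channel reports 18.75 Gb/s. Once one lane has collected more retries than the threshold, the channel drops to 12.5 Gb/s, then to 6.25 Gb/s, and finally goes out of service when the last lane is disabled.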

Router Error Control
What would happen if: a header bit in the input buffer flips; the credit count is corrupted; the router picks the wrong output; the selected output flips mid-packet; ...
There are numerous failure modes inside the router. Many lead to catastrophic failure, perhaps hundreds of cycles after the error occurred. Many others lead to insidious performance problems, e.g., losing credits.

Router Error Control (2)
The same steps of detect, confine, recover apply.
Detect: parity or CRC on all storage and communication; quick consistency checks (e.g., on allocators and credits); two copies of all other logic (in space or time).
Confine: stop propagating faulty packets; operate via confinement regions (e.g., a channel).
Recover: return to a known good state (sometimes via reset); resend faulted packets (if available); disable part of the router (fault-containment regions); replace part of the router (hot swapping).

Network-Level Error Control
Model faulty routers and links as “fail-stop” components and use adaptive routing to avoid them. Table-based: recompute the tables periodically. Local adaptive: pick another minimal (or non-minimal) link. Routing must avoid “dead ends” and deadlocks. Local rerouting replaces a single hop over a failed link with a series of hops that reach the same intermediate point.
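The table-based option can be sketched as a recomputation that simply ignores failed links. The example below is an assumption-laden illustration rather than the lecture's algorithm: it uses a breadth-first search to pick shortest paths, the function name is hypothetical, and it makes no attempt to keep the resulting routes deadlock-free.

```python
from collections import deque


def build_next_hop_table(links, failed, source):
    """Return {destination: next hop from source} using only working links.

    links and failed are collections of (node, node) pairs describing
    bidirectional links; destinations that became unreachable are omitted.
    """
    adjacency = {}
    for a, b in links:
        if (a, b) in failed or (b, a) in failed:
            continue  # treat the failed link as fail-stop: it no longer exists
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)

    next_hop = {}
    visited = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor in visited:
                continue
            visited.add(neighbor)
            # The first hop is inherited from the node through which we arrived.
            next_hop[neighbor] = neighbor if node == source else next_hop[node]
            queue.append(neighbor)
    return next_hop
```

On a four-node ring with links `[(0, 1), (1, 2), (2, 3), (3, 0)]`, node 0 normally reaches node 1 directly. Recomputing the table with `failed={(0, 1)}` sends traffic for node 1 out through node 3 instead.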

End-To-End Error Control
Keep a copy of each packet at the source until it is acknowledged or a timeout occurs (this buffer can get large).
If an error is detected: drop the packet and (optionally) send a negative acknowledgement.
When a packet is correctly received: send a positive acknowledgement.
When an acknowledgement is received: discard the stored copy of the packet.
When a negative acknowledgement is received (or on timeout): resend the packet.
The same packet may be transmitted multiple times, so receivers must check whether they have already received each packet they see. Packets carry a bit indicating whether they are a resend, as well as a serial number.
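The protocol above can be sketched with two small classes. This is a minimal model for illustration only: the class and method names are invented, packets are plain tuples of (serial number, payload, resend bit), timers are left to the caller, and a corrupted packet is assumed to still carry a readable serial number.

```python
class Source:
    """Keeps a copy of every packet until it is positively acknowledged."""

    def __init__(self):
        self.next_serial = 0
        self.pending = {}  # serial number -> payload awaiting acknowledgement

    def send(self, payload):
        serial = self.next_serial
        self.next_serial += 1
        self.pending[serial] = payload
        return (serial, payload, False)  # resend bit is clear on first transmission

    def on_ack(self, serial):
        """Positive acknowledgement: the stored copy can be discarded."""
        self.pending.pop(serial, None)

    def on_nack_or_timeout(self, serial):
        """Negative acknowledgement or timeout: resend if still pending."""
        if serial not in self.pending:
            return None
        return (serial, self.pending[serial], True)


class Receiver:
    """Delivers each serial number at most once, even if it arrives repeatedly."""

    def __init__(self):
        self.seen = set()
        self.delivered = []

    def receive(self, packet, corrupted=False):
        serial, payload, _is_resend = packet
        if corrupted:
            return ("nack", serial)  # drop the packet and ask for a resend
        if serial not in self.seen:
            self.seen.add(serial)
            self.delivered.append(payload)
        return ("ack", serial)  # duplicates are acknowledged again but not redelivered
```

If an acknowledgement is lost, the source times out and resends a packet the receiver already has. The serial-number check means the receiver acknowledges it again without delivering the payload twice.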

Question of the day
Consider a symmetric multiprocessing (SMP) network that does not allow packet loss and needs an availability of 0.99999. Link BER is 10^-15. Router components have a failure rate of 1000 FITs. How best can you achieve this reliability requirement?

Summary
The specification sets reliability requirements: drop rate and availability.
Failures are abstracted with fault models: bit errors, soft errors, stuck-at, fail-stop.
Detection, containment, and recovery apply at each level.
Link level: ack and retransmit; reconfigure.
Router level: detect all failures; mask, retry, or reset.
Network level: route around faulty components.
End-to-end: retransmit on nack or timeout.