Reliability and Error Control
EE382C Lecture 14, 5/17/11
EE 382C - S11 - Lecture 14
Announcements
  Don’t forget to iterate with us on your checkpoint 1 report
  Send time slot preferences for checkpoint 2
  Project presentations next week
    Let us know if you are OK with presenting on Tuesday, May 24th
Question of the day
  Consider a symmetric multiprocessing (SMP) network that does not allow packet loss and needs an availability of 0.99999
    Link BER is 10^-15
    Router components have a failure rate of 1000 FITs
  How best can you achieve this reliability requirement?
  BER: bit-error rate. SER: soft error rate. The inverse of an error rate is the mean time between errors.
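
The remark that the inverse of an error rate is the time between errors can be made concrete with a small calculation. This is only a sketch: the slide gives the BER but no link speed, so the 10 Gb/s bit rate below is an assumed value, and the function name is made up for illustration.

```python
SECONDS_PER_DAY = 86_400.0


def mean_seconds_between_bit_errors(ber: float, bit_rate_bps: float) -> float:
    """Mean time between bit errors on one link: 1 / (BER * bit rate)."""
    return 1.0 / (ber * bit_rate_bps)


# Assumed for illustration only: a 10 Gb/s link (the lecture gives no bit rate).
seconds = mean_seconds_between_bit_errors(ber=1e-15, bit_rate_bps=10e9)
days = seconds / SECONDS_PER_DAY
print(f"{seconds:.0f} s between errors, about {days:.2f} days")
```

Under that assumption a single link sees a bit error roughly every 10^5 seconds, a little more than once a day, so a network with many links cannot simply ignore bit errors.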
Reliability and Availability
  Reliability: R(t)
    Probability that the system is working at time t, given that it was working at time t=0 and has had no failures in between
  Availability: A(t)
    Probability that the system is working when needed, at a given point in time t
    Often affected by the repair process
    A ~ MTBF/(MTBF+MTTR)
  MTBF: mean time between failures
  MTTR: mean time to recovery
  FIT: failures in time, the number of failures per 10^9 hours. A failure rate in FITs is the inverse of the MTBF when repair time is zero
  RAS requirements: reliability, availability, and serviceability
  Failure modes are physical causes of errors.
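
The availability formula and the FIT definition can be combined into a quick budget calculation. The sketch below assumes the standard convention that one FIT is one failure per 10^9 hours; the function names are hypothetical, and it treats a single 1000-FIT component in isolation rather than a whole network.

```python
HOURS_PER_FIT_WINDOW = 1e9  # one FIT = one failure per 10^9 hours


def mtbf_hours_from_fits(fits: float) -> float:
    """Convert a failure rate in FITs to an MTBF in hours."""
    return HOURS_PER_FIT_WINDOW / fits


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """A ~ MTBF / (MTBF + MTTR), as on the slide."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def max_mttr_hours(mtbf_hours: float, target_availability: float) -> float:
    """Largest MTTR that still meets the target: MTTR = MTBF * (1 - A) / A."""
    return mtbf_hours * (1.0 - target_availability) / target_availability


mtbf = mtbf_hours_from_fits(1000)           # 10^6 hours for one component
budget = max_mttr_hours(mtbf, 0.99999)      # repair-time budget in hours
print(f"MTBF = {mtbf:.0f} h, MTTR budget = {budget:.2f} h")
```

For one such component the repair budget comes out at about 10 hours. A system built from many components has a correspondingly higher total failure rate, so its repair budget shrinks.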
Examples of RAS Requirements
  Enterprise Server
    A = 0.99999
    System-level requirement
    Can translate into a network-level requirement, or the system can detect and recover from network failures
    In general, every packet must be correctly received or the system will fail
  Internet Router
    But OK to drop packets (at a rate of 10^-15)
    Turn failures into packet drops
RAS Requirements in Those Systems
  Dropping (reliability)
    Allowed or not
    Rate allowed (e.g., 10^-15)
  Availability (A)
    0.999 to 0.99999
  Serviceability (MTTR)
MTTF and MTTR
  Which is cheaper to improve: MTTF or MTTR?
  Which is easier?
Failure Modes and Fault Models
  Failure mode                      Fault model   Rate     Units
  Gaussian noise on channel         Transient     10^-20   BER
  Alpha particle strike on memory   Soft          10^-9    SER
  Alpha particle strike on logic                  10^-10
  Electromigration                  Stuck-at      1        MTBF
  Connector corrosion                             10
  Operator removes module           Fail-stop     10^5
  Software failure                                10^4
  Failure mode: physical cause of failure.
  This table is for interconnection networks. Rate is an indication of the magnitude, not a time scale.
  Byzantine failures: the system continues operating in a corrupted state and violates protocols in an attempt to make adjacent modules fail too. They are rare in practice.
  BER and SER differ in that SER is for memory elements.
An Analogy
The Bathtub Curve
  Infant mortality is caused by marginal components.
  Manufacturers burn in their chips before shipping them.
Detection, Containment, and Recovery
  Three-step program for dealing with errors
  Detection: discover the error
    CRC codes on channels
    Parity or ECC codes on memories
    Self-checking logic
  Containment: prevent the error from propagating further
    Mask it
    Drop the packet (and retry)
    Fail stop
  Recovery: resume normal service
    Return to a known state
    Resume sending traffic
    Possibly resend the faulted packet
  These mechanisms exist at many levels: link level, router level, end-to-end protocol level, etc.
  Often want to protect every flit, because if the CRC is only at the tail flit, the earlier flits might already have corrupted intermediate state.
  Can use two CRCs: one for the control information of every flit, and another for the rest of the flit.
  Random flits should be sent over idle channels to keep testing them.
Example: Link-Level Error Control
  Detection: CRC on channel
  Containment: drop packet with error
  Recovery: request retransmission and resume normal sequence
  How can this fail? How to fix it?
  The sending router keeps copies of flits until it receives an acknowledgment.
  This can fail if the ack itself has an error. It can be fixed by piggybacking the ack onto a flit traveling in the reverse direction and checking the whole flit with the same CRC mechanism.
  The retry is hidden from the rest of the router: the router is never told that an error happened.
  An n-bit CRC detects all but 1 in 2^n multibit errors, and all error bursts that span fewer than n bits.
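
The detect, drop, and retransmit loop can be sketched in a few lines. This is a minimal illustration, not the hardware mechanism: it uses the 32-bit CRC from Python's `zlib` as a stand-in for whatever CRC a real link would use, and `noisy_once`, `flip_bit`, and `send_with_retry` are hypothetical helpers invented for the example.

```python
import zlib


def add_crc(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_crc(frame: bytes) -> bool:
    """Recompute the CRC over the payload and compare with the received one."""
    payload, received = frame[:-4], int.from_bytes(frame[-4:], "big")
    return zlib.crc32(payload) == received


def flip_bit(frame: bytes, bit_index: int) -> bytes:
    """Model a single-bit channel error."""
    corrupted = bytearray(frame)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def send_with_retry(payload: bytes, channel, max_tries: int = 4):
    """Sender keeps its copy of the flit and resends until the CRC checks."""
    frame = add_crc(payload)
    for attempt in range(1, max_tries + 1):
        received = channel(frame, attempt)
        if check_crc(received):           # detection
            return received[:-4], attempt
        # containment: the bad frame is dropped; recovery: loop and resend
    raise RuntimeError("link retry limit exceeded")


def noisy_once(frame: bytes, attempt: int) -> bytes:
    """Hypothetical channel that corrupts one bit on the first attempt only."""
    return flip_bit(frame, 5) if attempt == 1 else frame


data, tries = send_with_retry(b"flit-2", noisy_once)
print(data, tries)
```

The caller only ever sees the correct payload, which is the sense in which the retry is hidden from the rest of the router.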
Link-Level Error Control (2)
  Flit 2 was in error. Flits 2-6 are retransmitted.
  Why would you want to retransmit flits 3-6?
  Retransmitting only flit 2 would reorder the flits on the channel and considerably complicate the router input logic. For instance, a body flit might be received before a head flit.
  Pointers:
    Ack: next flit to be ACKed
    Tx: next flit to be transmitted
    Tail: next free slot
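
The three pointers suggest a simple sender-side structure. The sketch below is one possible reading of the slide, with assumptions stated: the `RetransmitBuffer` class and its method names are invented, flits are numbered from 0, and a growing Python list stands in for the fixed-size circular buffer a real router would use (so slots are never freed here).

```python
class RetransmitBuffer:
    """Sender-side flit buffer tracked by the Ack, Tx, and Tail pointers."""

    def __init__(self):
        self.flits = []   # flit number i is stored at index i
        self.ack = 0      # next flit to be ACKed
        self.tx = 0       # next flit to be transmitted

    @property
    def tail(self):
        return len(self.flits)   # next free slot

    def enqueue(self, flit):
        self.flits.append(flit)

    def transmit(self):
        """Send the flit at Tx, or return None if nothing is pending."""
        if self.tx == self.tail:
            return None
        seq = self.tx
        self.tx += 1
        return seq, self.flits[seq]

    def on_ack(self, seq):
        """Flits up to and including seq arrived intact."""
        self.ack = max(self.ack, seq + 1)

    def on_error(self, seq):
        """Flit seq was bad: rewind Tx so seq and all later flits are resent in order."""
        self.tx = seq


buf = RetransmitBuffer()
for n in range(7):
    buf.enqueue(f"flit{n}")
first_pass = [buf.transmit()[0] for _ in range(7)]   # flits 0..6 sent once
buf.on_ack(0)
buf.on_ack(1)
buf.on_error(2)                                      # error report for flit 2 arrives
resent = []
while (item := buf.transmit()) is not None:
    resent.append(item[0])
print(resent)
```

Rewinding Tx rather than resending a single flit is what keeps the flits in order on the channel, at the cost of resending flits that had already arrived correctly.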
Channel Configuration
  Reconfigure channels with frequent errors
    Swap in “spare bits”
    Reduce width of channel
    Reduce bit rate
  If malfunctions continue, decommission the channel
    Assumes the routing algorithm will adapt
  Frequent errors: BER above some threshold.
  Reduce bit rate by using only the good bits of the channel.
  How the routing algorithm adapts depends on the topology too.
Cray BlackWidow Example
  Each channel is 3 bits wide at 6.25 Gb/s per bit (b = 18.75 Gb/s)
  3 bits serialized from a 24-bit flit
  Link-level retry rates are monitored
  Each retry is attributed to one bit of the channel
  If the retry rate exceeds a threshold, the “bad bit” is switched off
  Channel degrades to two bits, then one bit, then is switched off
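
The degradation policy can be modeled as a small state machine. This is a sketch under stated assumptions, not the BlackWidow implementation: the `DegradableChannel` class is hypothetical, the threshold of 3 is an arbitrary value, and a simple retry count per lane stands in for the monitored retry rate.

```python
LANE_RATE_GBPS = 6.25   # per-bit signalling rate from the slide


class DegradableChannel:
    """Channel that switches off a lane once too many retries are blamed on it."""

    def __init__(self, lanes=3, retry_threshold=3):
        self.active = set(range(lanes))
        self.retries = {lane: 0 for lane in range(lanes)}
        self.retry_threshold = retry_threshold   # assumed value, not from the lecture

    def record_retry(self, lane):
        """Attribute one link-level retry to a lane; disable the lane past the threshold."""
        if lane not in self.active:
            return
        self.retries[lane] += 1
        if self.retries[lane] > self.retry_threshold:
            self.active.discard(lane)

    @property
    def bandwidth_gbps(self):
        return LANE_RATE_GBPS * len(self.active)


channel = DegradableChannel()
history = [channel.bandwidth_gbps]
for lane in (1, 0, 2):            # lanes go bad one after another
    for _ in range(4):            # four retries exceed the threshold of three
        channel.record_retry(lane)
    history.append(channel.bandwidth_gbps)
print(history)
```

The bandwidth steps down from three lanes to two, to one, and finally to zero, at which point the routing algorithm has to treat the channel as failed.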
Router Error Control
  What would happen if:
    A header bit in an input buffer flips
    The credit count is corrupted
    The router picks the wrong output
    The selected output flips mid-packet
    …
  Numerous failure modes inside the router
  Many lead to catastrophic failure
    Perhaps hundreds of cycles after the error occurred
  Many others lead to insidious performance problems
    E.g., losing credits
Router Error Control (2)
  The same steps of Detect, Confine, Recover apply
  Detect
    Parity or CRC on all storage and communication
    Quick consistency checks (e.g., on allocators and credits)
    Two copies of all other logic (in space or time)
  Confine
    Stop propagating faulty packets
    Operate via confinement regions (e.g., channel)
  Recover
    Return to a known good state (sometimes via reset)
    Resend faulted packets (if available)
    Disable part of the router (fault-containment regions)
    Replace part of the router (hot swapping)
Network-Level Error Control
  Model faulty routers and links as “fail-stop” components
  Use adaptive routing to avoid them
    Table based: recompute tables periodically
    Local adaptive: pick another minimal link (or a non-minimal one)
  Need to avoid “dead ends” and deadlocks
  Local rerouting: replace a single hop over a failed link with a series of hops that reach the same intermediate point.
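
Table-based rerouting amounts to recomputing next hops over the links that still work. The sketch below uses a breadth-first search on a hypothetical four-node ring; the function name and topology are made up for illustration, and it only finds shortest working paths. It does nothing about deadlock freedom, which a real routing algorithm must also guarantee.

```python
from collections import deque


def next_hop_table(links, failed, source):
    """BFS over working links; maps each reachable destination to source's next hop."""
    neighbors = {}
    for a, b in links:
        if (a, b) in failed or (b, a) in failed:
            continue                      # treat the failed link as fail-stop
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    table = {}
    visited = {source}
    queue = deque()
    for n in neighbors.get(source, []):
        visited.add(n)
        table[n] = n                      # direct neighbor: next hop is itself
        queue.append(n)
    while queue:
        node = queue.popleft()
        for n in neighbors.get(node, []):
            if n not in visited:
                visited.add(n)
                table[n] = table[node]    # inherit the first hop used to reach node
                queue.append(n)
    return table


ring = [(0, 1), (1, 2), (2, 3), (3, 0)]   # hypothetical 4-node ring
healthy = next_hop_table(ring, set(), source=0)
degraded = next_hop_table(ring, {(0, 1)}, source=0)
print(healthy, degraded)
```

With link 0-1 failed, node 0 reaches node 1 the long way around through node 3, and a destination missing from the table would indicate a partitioned network.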
End-To-End Error Control
  Keep a copy of each packet at the source until acknowledged or timeout (this buffer can get large)
  If error detected
    Drop packet
    (Optionally) send a negative acknowledgement
  When packet correctly received
    Send positive acknowledgement
  When acknowledgement received
    Discard the stored copy of the packet
  When negative acknowledgement received (or timeout)
    Resend packet
  May transmit the same packet multiple times
  Receivers must check whether they have already received each packet they see. Packets carry a serial number and a bit indicating whether they are a resend.
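
The end-to-end protocol, including the duplicate check at the receiver, can be sketched as two small classes. The names `Source` and `Destination` and the dictionary packet format are assumptions made for this example; timers and the network itself are left out, and the receiver's set of seen serial numbers grows without bound, which a real design would have to limit.

```python
class Source:
    """Keeps a copy of every packet until it is acknowledged."""

    def __init__(self):
        self.next_serial = 0
        self.unacked = {}   # serial number -> payload

    def send(self, payload):
        serial = self.next_serial
        self.next_serial += 1
        self.unacked[serial] = payload
        return {"serial": serial, "resend": False, "payload": payload}

    def on_ack(self, serial):
        self.unacked.pop(serial, None)       # discard the stored copy

    def on_nack_or_timeout(self, serial):
        return {"serial": serial, "resend": True, "payload": self.unacked[serial]}


class Destination:
    """Delivers each serial number once, even if the packet arrives twice."""

    def __init__(self):
        self.seen = set()
        self.delivered = []

    def receive(self, packet):
        if packet["serial"] not in self.seen:
            self.seen.add(packet["serial"])
            self.delivered.append(packet["payload"])
        return packet["serial"]              # serial number to acknowledge


src, dst = Source(), Destination()
packet = src.send("hello")
dst.receive(packet)                                  # delivered, but assume the ack is lost
retry = src.on_nack_or_timeout(packet["serial"])     # source times out and resends
acked = dst.receive(retry)                           # duplicate is discarded, ack sent again
src.on_ack(acked)
print(dst.delivered, src.unacked)
```

The lost-acknowledgement case is the reason the receiver needs the serial number: without it the resent packet would be delivered twice.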
Question of the day
  Consider a symmetric multiprocessing (SMP) network that does not allow packet loss and needs an availability of 0.99999
    Link BER is 10^-15
    Router components have a failure rate of 1000 FITs
  How best can you achieve this reliability requirement?
Summary
  Specification sets reliability requirements
    Drop rate
    Availability
  Failures are abstracted with fault models
    Bit errors, soft errors, stuck-at, fail stop
  Detection, Containment, and Recovery
    Link level
      Ack and retransmit
      Reconfigure
    Router level
      Detect all failures
      Mask, retry, or reset
    Network level
      Route around faulty components
    End-to-end
      Retransmit on nack or timeout