Caltech CS184 Spring2005 -- DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 17: May 9, 2005 Defect and Fault Tolerance.

Presentation transcript:

Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 17: May 9, 2005 Defect and Fault Tolerance

Caltech CS184 Spring DeHon 2 Today Defect and Fault Tolerance –Problem –Defect Tolerance –Fault Tolerance

Caltech CS184 Spring DeHon 3 Motivation: Probabilities Given: –N objects –P yield probability What’s the probability for yield of composite system of N items? –Assume iid faults –P(N items good) = P^N

Caltech CS184 Spring DeHon 4 Probabilities P(N items good) = P^N N=10^6, P=0.999999 P(all good) ~= 0.37 N=10^7, P= P(all good) ~=
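The composite-yield formula on slides 3 and 4 can be checked numerically. The sketch below is only an illustration: the function name `yield_all_good` is hypothetical, and P = 0.999999 is an assumption chosen because it reproduces the 0.37 figure for N = 10^6. Reusing the same P for N = 10^7 is also an assumption, since that value is missing from the transcript.

```python
def yield_all_good(p: float, n: int) -> float:
    """Probability that all n independent elements are good,
    when each element is good with probability p (iid assumption)."""
    return p ** n


if __name__ == "__main__":
    p = 0.999999  # assumed per-element yield (not stated in the transcript)
    for n in (10**6, 10**7):
        print(f"N={n:.0e}  P(all good)={yield_all_good(p, n):.6f}")
```

Even a one-in-a-million failure probability per element leaves only about a 37% chance that a million-element system is entirely good, which is the motivation for tolerating failures rather than only improving P.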

Caltech CS184 Spring DeHon 5 Simple Implications As N gets large –must either increase reliability –…or start tolerating failures N –memory bits –disk sectors –wires –transmitted data bits –processors –transistors –molecules –As devices get smaller, failure rates increase – chemists think P=0.95 is good

Caltech CS184 Spring DeHon 6 Defining Problems

Caltech CS184 Spring DeHon 7 Three “problems” Manufacturing imperfection –Shorts, breaks –wire/node X shorted to power, ground, another node –Doping/resistance variation too high Parameters vary over time –Electromigration –Resistance increases Incorrect operation –node X value flips (crosstalk, alpha particle, bad timing)

Caltech CS184 Spring DeHon 8 Defects Shorts are an example of a defect Persistent problem –reliably manifests Occurs before computation Can test for at fabrication / boot time and then avoid (1st half of lecture)

Caltech CS184 Spring DeHon 9 Faults Alpha particle bit flips are an example of a fault Fault occurs dynamically during execution At any point in time, can fail –(produce the wrong result) (2nd half of lecture)

Caltech CS184 Spring DeHon 10 Lifetime Variation Starts out fine Over time changes –E.g. resistance increases until out of spec. Persistent –So can use defect techniques to avoid But, onset is dynamic –Must use fault detection techniques to recognize?

Caltech CS184 Spring DeHon 11 Shekhar Borkar, Intel Fellow, Micro37 (Dec. 2004)

Caltech CS184 Spring DeHon 12 Defect Rate Device with 10^11 elements (100BT) 3 year lifetime = 10^8 seconds Accumulating up to 10% defects 10^10 defects in 10^8 seconds → 1 defect every 10ms At 10GHz operation: One new defect every 10^8 cycles P_newdefect = 10^-19
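The arithmetic on the Defect Rate slide can be reproduced step by step. This is a sketch under stated assumptions: "100BT" is read as 10^11 elements, the 3 year lifetime is rounded to 10^8 seconds as on the slide, and the function and variable names are hypothetical.

```python
def defect_rate(elements, lifetime_s, defect_fraction, clock_hz):
    """Return (seconds per new defect, cycles per new defect,
    per-element per-cycle probability of a new defect)."""
    total_defects = elements * defect_fraction          # defects accumulated over the lifetime
    seconds_per_defect = lifetime_s / total_defects     # average time between new defects
    cycles_per_defect = seconds_per_defect * clock_hz   # same interval measured in clock cycles
    p_new_defect = 1.0 / (cycles_per_defect * elements) # chance a given element fails in a given cycle
    return seconds_per_defect, cycles_per_defect, p_new_defect


if __name__ == "__main__":
    # Assumed inputs: 10^11 elements, 10^8 s lifetime, 10% defects, 10 GHz clock.
    s, c, p = defect_rate(10**11, 10**8, 0.10, 10 * 10**9)
    print(f"one defect every {s * 1e3:.0f} ms, every {c:.0e} cycles, P_newdefect={p:.0e}")
```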

Caltech CS184 Spring DeHon 13 First Step to Recover Admit you have a problem (observe that there is a failure)

Caltech CS184 Spring DeHon 14 Detection Determine if something wrong? –Some things easy ….won’t start –Others tricky …one AND gate computes False & True → True Observability –can see effect of problem –some way of telling if defect/fault present

Caltech CS184 Spring DeHon 15 Detection Coding –space of legal values < space of all values –should only see legal –e.g. parity, ECC (Error Correcting Codes) Explicit test (defects, recurring faults) –ATPG = Automatic Test Pattern Generation –Signature/BIST=Built-In Self-Test –POST = Power On Self-Test Direct/special access –test ports, scan paths
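The coding idea (legal values form a subset of all values) is easiest to see with a single even-parity bit. The following is a minimal sketch, not the lecture's own example; the helper names `encode` and `is_legal` are hypothetical.

```python
def encode(bits):
    """Append one even-parity bit so every legal codeword has an even number of 1s."""
    return list(bits) + [sum(bits) % 2]


def is_legal(codeword):
    """A codeword is legal only if its total number of 1s is even."""
    return sum(codeword) % 2 == 0


if __name__ == "__main__":
    word = encode([1, 0, 1, 1])
    print(is_legal(word))      # legal as written
    word[2] ^= 1               # a single bit flip
    print(is_legal(word))      # now outside the legal set, so the error is detected
```

Only half of all bit patterns are legal, so any single flip lands outside the legal set. Two flips land back inside it, which is why parity alone detects but cannot correct, and why it misses even numbers of errors.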

Caltech CS184 Spring DeHon 16 Coping with defects/faults? Key idea: redundancy Detection: –Use redundancy to detect error Mitigating: use redundant hardware –Use spare elements in place of faulty elements (defects) –Compute multiple times so can discard faulty result (faults) –Exploit Law-of-Large Numbers

Caltech CS184 Spring DeHon 17 Defect Tolerance

Caltech CS184 Spring DeHon 18 Two Models Disk Drives Memory Chips

Caltech CS184 Spring DeHon 19 Disk Drives Expose defects to software –software model expects faults Create table of good (bad) sectors –manages by masking out in software (at the OS level) –yielded capacity varies

Caltech CS184 Spring DeHon 20 Memory Chips Provide model in hardware of perfect chip Model of perfect memory at capacity X Use redundancy in hardware to provide perfect model Yielded capacity fixed –discard part if not achieve

Caltech CS184 Spring DeHon 21 Example: Memory Correct memory: –N slots –each slot reliably stores last value written Millions, billions, etc. of bits… –have to get them all right?

Caltech CS184 Spring DeHon 22 Memory Defect Tolerance Idea: –few bits may fail –provide more raw bits –configure so yield what looks like a perfect memory of specified size

Caltech CS184 Spring DeHon 23 Memory Techniques Row Redundancy Column Redundancy Block Redundancy

Caltech CS184 Spring DeHon 24 Row Redundancy Provide extra rows Mask faults by avoiding bad rows Trick: –have address decoder substitute spare rows in for faulty rows –use fuses to program
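Row substitution can be modeled in software as a remap table sitting in front of the array. This is a hypothetical sketch: the class name, the `remap` dictionary standing in for programmed fuses, and the policy of rejecting a part with too few good spares are all assumptions made for illustration.

```python
class RowRepairedMemory:
    """Presents `rows` logical rows backed by rows + spare_rows physical rows."""

    def __init__(self, rows, spare_rows, bad_physical_rows):
        bad = set(bad_physical_rows)
        # Spare rows can be defective too, so only good spares are usable.
        good_spares = [r for r in range(rows, rows + spare_rows) if r not in bad]
        self.rows = rows
        self.remap = {}      # logical row -> spare physical row (the "fuse" settings)
        self.storage = {}    # physical row -> stored value
        for logical in range(rows):
            if logical in bad:
                if not good_spares:
                    raise ValueError("not enough good spare rows; part is discarded")
                self.remap[logical] = good_spares.pop(0)

    def _physical(self, logical):
        if not 0 <= logical < self.rows:
            raise IndexError(logical)
        return self.remap.get(logical, logical)

    def write(self, logical, value):
        self.storage[self._physical(logical)] = value

    def read(self, logical):
        return self.storage[self._physical(logical)]
```

From the outside the memory always looks like a perfect array of `rows` rows; the substitution is invisible to the user, which matches the memory-chip model described earlier.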

Caltech CS184 Spring DeHon 25 Spare Row

Caltech CS184 Spring DeHon 26 Column Redundancy Provide extra columns Program decoder/mux to use subset of columns

Caltech CS184 Spring DeHon 27 Spare Memory Column Provide extra columns Program output mux to avoid

Caltech CS184 Spring DeHon 28 Block Redundancy Substitute out entire block –e.g. memory subarray include 5 blocks –only need 4 to yield perfect (N+1 sparing more typical for larger N)

Caltech CS184 Spring DeHon 29 Spare Block

Caltech CS184 Spring DeHon 30 Yield M of N P(M of N) = P(yield N) +(N choose N-1) P(exactly N-1) +(N choose N-2) P(exactly N-2)… +(N choose N-M) P(exactly N-M)… [think binomial coefficients]

Caltech CS184 Spring DeHon 31 M of 5 example 1*P^5 + 5*P^4(1-P)^1 + 10P^3(1-P)^2 + 10P^2(1-P)^3 + 5P^1(1-P)^4 + 1*(1-P)^5 Consider P=0.9 –1*P^5 = 0.59 (M=5, P(sys)=0.59) –5*P^4(1-P)^1 = 0.33 (M=4, P(sys)=0.92) –10P^3(1-P)^2 = 0.07 (M=3, P(sys)=0.99) –10P^2(1-P)^3 = 0.008 –5P^1(1-P)^4 = 0.00045 –1*(1-P)^5 = 0.00001 Can achieve higher system yield than individual components!
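The M-of-5 numbers follow from summing binomial terms. The sketch below assumes "M of N" means "at least M of the N components are good", which is the reading that gives P(sys) = 0.59 for M = 5; the function names are hypothetical.

```python
from math import comb


def p_exactly(k, n, p):
    """Probability that exactly k of n independent components are good."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def p_at_least(m, n, p):
    """Probability that at least m of n independent components are good."""
    return sum(p_exactly(k, n, p) for k in range(m, n + 1))


if __name__ == "__main__":
    for m in (5, 4, 3):
        print(f"M={m}  P(sys)={p_at_least(m, 5, 0.9):.2f}")
```

With one spare (M = 4 of 5) the system yield already exceeds the 0.9 yield of a single component.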

Caltech CS184 Spring DeHon 32 Repairable Area Not all area in a RAM is repairable –memory bits spare-able –io, power, ground, control not redundant

Caltech CS184 Spring DeHon 33 Repairable Area P(yield) = P(non-repair) * P(repair) P(non-repair) = P^N –N << Ntotal –Maybe P > Prepair (e.g. use coarser feature size) P(repair) ~ P(yield M of N)
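Combining the two factors gives a quick way to see how the non-repairable area limits total yield. This is an illustrative sketch only: the function name, parameter names, and the example numbers are assumptions, not values from the lecture.

```python
from math import comb


def p_yield(p_nonrepair, n_nonrepair, p_block, m_needed, n_blocks):
    """P(yield) = P(non-repair) * P(repair).

    P(non-repair): all n_nonrepair non-redundant elements must be good.
    P(repair):     at least m_needed of n_blocks repairable blocks must be good.
    """
    p_nr = p_nonrepair ** n_nonrepair
    p_rep = sum(
        comb(n_blocks, k) * p_block**k * (1 - p_block) ** (n_blocks - k)
        for k in range(m_needed, n_blocks + 1)
    )
    return p_nr * p_rep


if __name__ == "__main__":
    # Hypothetical part: 100 non-redundant elements, 4-of-5 block sparing.
    print(f"{p_yield(0.999, 100, 0.9, 4, 5):.3f}")
```

No amount of sparing can lift the total above P(non-repair), which is why the slide suggests making the non-repairable elements individually more reliable.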

Caltech CS184 Spring DeHon 34 Consider a Crossbar Allows me to connect any of N things to each other –E.g. N processors N memories N/2 processors +N/2 memories

Caltech CS184 Spring DeHon 35 Crossbar Buses and Defects Two crossbars Wires may fail Switches may fail Provide more wires –Any wire fault avoidable M choose N

Caltech CS184 Spring DeHon 36 Crossbar Buses and Defects Two crossbars Wires may fail Switches may fail Provide more wires –Any wire fault avoidable M choose N

Caltech CS184 Spring DeHon 37 Crossbar Buses and Faults Two crossbars Wires may fail Switches may fail Provide more wires –Any wire fault avoidable M choose N

Caltech CS184 Spring DeHon 38 Crossbar Buses and Faults Two crossbars Wires may fail Switches may fail Provide more wires –Any wire fault avoidable M choose N –Same idea

Caltech CS184 Spring DeHon 39 Simple System P Processors M Memories Wires

Caltech CS184 Spring DeHon 40 Simple System w/ Spares P Processors M Memories Wires Provide spare –Processors –Memories –Wires

Caltech CS184 Spring DeHon 41 Simple System w/ Defects P Processors M Memories Wires Provide spare –Processors –Memories –Wires...and defects

Caltech CS184 Spring DeHon 42 Simple System Repaired P Processors M Memories Wires Provide spare –Processors –Memories –Wires Use crossbar to switch together good processors and memories

Caltech CS184 Spring DeHon 43 In Practice Crossbars are inefficient [CS184A] Use switching networks with –Locality –Segmentation –CS184A …but basic idea for sparing is the same

Caltech CS184 Spring DeHon 44 Fault Tolerance

Caltech CS184 Spring DeHon 45 Faults Bits, processors, wires –May fail during operation Basic Idea same: –Detect failure using redundancy –Correct Now –Must identify and correct online with the computation

Caltech CS184 Spring DeHon 46 Simple Memory Example Problem: bits may lose/change value –Alpha particle –Molecule spontaneously switches Idea: –Store multiple copies –Perform majority vote on result

Caltech CS184 Spring DeHon 47 Redundant Memory

Caltech CS184 Spring DeHon 48 Redundant Memory Like M-choose-N Only fail if >(N-1)/2 faults P=0.9 P(2 of 3) All good: (0.9)^3 = 0.729 Any 2 good: 3(0.9)^2(0.1) = 0.243 Total = 0.972
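Majority voting over stored copies, and the probability that the vote is correct, can be sketched as follows. The function names are hypothetical, and the model assumes each copy is independently correct with probability p.

```python
from collections import Counter
from math import comb


def majority(copies):
    """Return the value held by a strict majority of the copies."""
    value, count = Counter(copies).most_common(1)[0]
    if count <= len(copies) // 2:
        raise ValueError("no strict majority among copies")
    return value


def p_majority_good(p, n=3):
    """Probability that more than half of n independent copies are correct."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


if __name__ == "__main__":
    print(majority([1, 0, 1]))               # the single corrupted copy is outvoted
    print(f"{p_majority_good(0.9, 3):.3f}")  # 0.729 + 0.243
```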

Caltech CS184 Spring DeHon 49 Better: Less Overhead Don’t have to keep N copies Block data into groups Add a small number of bits to detect/correct errors

Caltech CS184 Spring DeHon 50 Row/Column Parity Think of NxN bit block as array Compute row and column parities –(total of 2N bits)

Caltech CS184 Spring DeHon 51 Row/Column Parity Think of NxN bit block as array Compute row and column parities –(total of 2N bits) Any single bit error

Caltech CS184 Spring DeHon 52 Row/Column Parity Think of NxN bit block as array Compute row and column parities –(total of 2N bits) Any single bit error By recomputing parity –Know which one it is –Can correct it
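Locating and correcting a single flipped bit from the row and column parities can be sketched as below. This is a minimal illustration with hypothetical function names, and it assumes the stored parity bits themselves are error free; an error in a parity bit is reported rather than corrected.

```python
def parities(block):
    """Return (row parities, column parities) for a square bit array."""
    rows = [sum(r) % 2 for r in block]
    cols = [sum(c) % 2 for c in zip(*block)]
    return rows, cols


def correct_single_error(block, stored_rows, stored_cols):
    """Recompute parities, locate a single flipped data bit, and fix it in place.

    Returns the (row, col) that was corrected, or None if no error is seen.
    """
    rows, cols = parities(block)
    bad_rows = [i for i, (a, b) in enumerate(zip(rows, stored_rows)) if a != b]
    bad_cols = [j for j, (a, b) in enumerate(zip(cols, stored_cols)) if a != b]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        r, c = bad_rows[0], bad_cols[0]
        block[r][c] ^= 1  # the failing row and column intersect at the bad bit
        return r, c
    raise ValueError("pattern is not a single data-bit error")
```

The overhead is 2N parity bits for N×N data bits, far less than keeping whole extra copies of the data.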

Caltech CS184 Spring DeHon 53 In Use Today Conventional DRAM Memory systems –Use 72b ECC (Error Correcting Code) –On 64b words –Correct any single bit error –Detect multibit errors CD blocks are ECC coded –Correct errors in storage/reading Learn more about ECC in EE127

Caltech CS184 Spring DeHon 54 Interconnect Also uses checksums/ECC –Guard against data transmission errors –Environmental noise, crosstalk, trouble sampling data at high rates… Often just detect error Recover by requesting retransmission –E.g. TCP/IP (Internet Protocols)

Caltech CS184 Spring DeHon 55 Interconnect Also guards against whole path failure Sender expects acknowledgement If no acknowledgement will retransmit If have multiple paths –…and select well among them –Can route around any fault in interconnect

Caltech CS184 Spring DeHon 56 Interconnect Fault Example Send message Expect Acknowledgement

Caltech CS184 Spring DeHon 57 Interconnect Fault Example Send message Expect Acknowledgement If Fail

Caltech CS184 Spring DeHon 58 Interconnect Fault Example Send message Expect Acknowledgement If Fail –No ack

Caltech CS184 Spring DeHon 59 Interconnect Fault Example If Fail  no ack –Retry –Preferably with different resource

Caltech CS184 Spring DeHon 60 Interconnect Fault Example If Fail  no ack –Retry –Preferably with different resource Ack signals success
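The send, wait for acknowledgement, and retry on a different resource sequence can be sketched as a small loop. This is not a real network API: `send_with_retry`, the `deliver` callback (which returns True when an acknowledgement arrives), and the path labels are all hypothetical.

```python
def send_with_retry(message, paths, deliver, max_attempts=None):
    """Try each path in turn until one attempt is acknowledged.

    deliver(path, message) models one send and returns True if an ack came back.
    Returns the path that succeeded, or None if every attempt failed.
    """
    attempts = max_attempts if max_attempts is not None else len(paths)
    for attempt in range(attempts):
        path = paths[attempt % len(paths)]  # prefer a different resource on each retry
        if deliver(path, message):
            return path                     # ack signals success
    return None


if __name__ == "__main__":
    faulty = {"A"}  # hypothetical failed link
    ok = send_with_retry("hello", ["A", "B", "C"], lambda p, m: p not in faulty)
    print(ok)
```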

Caltech CS184 Spring DeHon 61 Transit Multipath Butterfly (or Fat-Tree) networks with multiple paths –CS184B:Day4

Caltech CS184 Spring DeHon 62 Multiple Paths Provide bandwidth Minimize congestion Provide redundancy to tolerate faults

Caltech CS184 Spring DeHon 63 Routers May be faulty (links may be faulty) Dynamic –Corrupt data –Misroute –Send data nowhere

Caltech CS184 Spring DeHon 64 Multibutterfly Performance w/ Faults

Caltech CS184 Spring DeHon 65 Compute Elements Simplest thing we can do: –Compute redundantly –Vote on answer –Similar to redundant memory

Caltech CS184 Spring DeHon 66 Compute Elements Unlike Memory –State of computation important –Once a processor makes an error All subsequent results may be wrong Response –“reset” processors which fail vote –Go to spare set to replace failing processor

Caltech CS184 Spring DeHon 67 In Use NASA Space Shuttle –Uses set of 4 voting processors Boeing 777 –Uses voting processors (different architectures, code)

Caltech CS184 Spring DeHon 68 Forward Recovery Can take this voting idea to gate level –von Neumann 1956 Basic gate is a majority gate –Example 3-input voter Number of technical details… High level bit: –Requires Pgate>0.996 –Can make whole system as reliable as individual gate

Caltech CS184 Spring DeHon 69 Majority Multiplexing [Roy+Beiu/IEEE Nano2004] Maybe there’s a better way… …next time.

Caltech CS184 Spring DeHon 70 Rollback Recovery Commit state of computation at key points –to memory (ECC, RAID protected...) –…reduce to previously solved problem… On faults (lifetime defects) –recover state from last checkpoint –like going to last backup…. –…(snapshot) –[analysis next time]
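Checkpoint and rollback can be sketched as a loop that saves state periodically and restores it when a fault is detected. This is a simplified model under stated assumptions: the checkpoint store is taken to be reliable (as the slide suggests via ECC or RAID protection), fault detection is supplied by the caller, and all names are hypothetical.

```python
import copy


def run_with_rollback(initial_state, n_steps, step, detect_fault,
                      interval=2, max_rollbacks=10):
    """Run n_steps of step(state, i), checkpointing every `interval` steps.

    When detect_fault(state, i) reports a problem, restore the last checkpoint
    and re-execute from there. Returns (final state, number of rollbacks).
    """
    state = copy.deepcopy(initial_state)
    checkpoint, checkpoint_step = copy.deepcopy(state), 0
    i, rollbacks = 0, 0
    while i < n_steps:
        state = step(state, i)
        if detect_fault(state, i):
            rollbacks += 1
            if rollbacks > max_rollbacks:
                raise RuntimeError("too many rollbacks; fault may be persistent")
            # Discard the suspect state and resume from the last committed point.
            state, i = copy.deepcopy(checkpoint), checkpoint_step
            continue
        i += 1
        if i % interval == 0:
            checkpoint, checkpoint_step = copy.deepcopy(state), i
    return state, rollbacks
```

A shorter interval wastes less work per rollback but spends more time committing state, which is the trade-off the slide defers to the later analysis.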

Caltech CS184 Spring DeHon 71 Defect vs. Fault Tolerance Defect –Can tolerate large defect rates (10%) Use virtually all good components Small overhead beyond faulty components Fault –Require lower fault rate (e.g. VN <0.4%) Overhead to do so can be quite large

Caltech CS184 Spring DeHon 72 Summary Possible to engineer practical, reliable systems from –Imperfect fabrication processes (defects) –Unreliable elements (faults) We do it today for large scale systems –Memories (DRAMs, Hard Disks, CDs) –Internet …and critical systems –Space ships, Airplanes Engineering Questions –Where invest area/effort? Higher yielding components? Tolerating faulty components? –Where do we invoke law of large numbers? Above/below the device level

Caltech CS184 Spring DeHon 73 Big Ideas Left to itself: –reliability of system << reliability of parts Can design –system reliability >> reliability of parts [defects] –system reliability ~= reliability of parts [faults] For large systems –must engineer reliability of system –…all systems becoming “large”

Caltech CS184 Spring DeHon 74 Big Ideas Detect failures –static: directed test –dynamic: use redundancy to guard Repair with Redundancy Model –establish and provide model of correctness perfect model part (memory model) visible defects in model (disk drive model)