CALTECH CS137 Fall2005 -- DeHon CS137: Electronic Design Automation Day 9: October 17, 2005 Fault Detection.

1 CALTECH CS137 Fall2005 -- DeHon CS137: Electronic Design Automation Day 9: October 17, 2005 Fault Detection

2 Today
Faults in Logic
Error Detection Schemes
Optimization Problem

3 Problem
Gates, wires, memories:
– built out of physical media
– may fail

4 Device Physics
Represent a 1 or 0 with charge
– On a gate, in a memory
Charge may be disrupted
– α-particle (other ionizing particles)
– Ground bounce
– Noise coupling
– Tunneling
– Thermal noise
– Behavior of individual electrons is statistical

5 DRAMs
Small cells
Store charge dynamically on capacitor
Store about 50,000 electrons
Must be refreshed
– Data leaks away through parasitic resistance
α-particle can be 1,000,000 carriers?

6 System Reliability
Each device fails with probability $P_{fail}$
Have $N$ components in system
All must work for the system to work
$P_{sys} = (1-P_{fail})^N$

7 System Reliability
If $N \times P_{fail} \ll 1$, the $N \times P_{fail}$ term dominates the higher-order terms…
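The "higher-order terms" here are presumably those of the binomial expansion of $(1-P_{fail})^N$. A short sketch of that expansion, using only the symbols already defined on these slides:

```latex
\begin{align*}
P_{sys} &= (1-P_{fail})^N \\
        &= 1 - N P_{fail} + \binom{N}{2} P_{fail}^2 - \binom{N}{3} P_{fail}^3 + \cdots \\
\frac{\binom{N}{2} P_{fail}^2}{N P_{fail}} &= \frac{(N-1)\,P_{fail}}{2} < \frac{N P_{fail}}{2} \ll 1
\end{align*}
```

Each successive term is smaller than the previous one by roughly a factor of $N \times P_{fail}$, so everything after the linear term can be dropped.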

8 System Reliability
$P_{sysfail} \approx N \times P_{fail}$
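A quick numerical check of this approximation, as a minimal sketch. The function names are illustrative and not from the lecture; `log1p` and `expm1` are used only because `1 - 1e-20` rounds to exactly 1.0 in double precision.

```python
import math


def system_failure_probability(n_components: int, p_fail: float) -> float:
    """Exact 1 - (1 - p_fail)**n_components, evaluated stably for tiny p_fail."""
    # log1p/expm1 keep the tiny terms that a naive (1 - p_fail)**n would lose.
    return -math.expm1(n_components * math.log1p(-p_fail))


def approx_failure_probability(n_components: int, p_fail: float) -> float:
    """First-order approximation N * P_fail, valid when N * P_fail << 1."""
    return n_components * p_fail


# Regime where N * P_fail << 1: the two values agree closely.
exact_small = system_failure_probability(10**10, 1e-20)
approx_small = approx_failure_probability(10**10, 1e-20)

# Regime where N * P_fail = 0.3: the approximation overestimates.
exact_large = system_failure_probability(3 * 10**10, 1e-11)
approx_large = approx_failure_probability(3 * 10**10, 1e-11)

print(exact_small, approx_small)
print(exact_large, approx_large)
```

The second pair shows where the linear approximation starts to break down: once $N \times P_{fail}$ is no longer much smaller than 1, the dropped terms matter.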

9 Modern System
100 million → 1 billion transistors
– Not to mention wiring…
> GHz ⇒ > 1 billion transitions / sec.
$N = 10^{18}$ per second… ($10^9$ transistors × $10^9$ transitions / sec.)

10 As we scale?
$N$ increases
Charge/gate decreases
– Fewer electrons
– Higher probability they wander
– Greater variability in behavior
Voltage levels decrease
– Smaller barriers
Greater variability in device parameters
⇒ $P_{fail}$ increases

11 Exacerbated at Nanoscale
Small numbers of dopants (10s)
– High variability
Small numbers of electrons (10-1000s?)
– High variability
– Highly susceptible to noise
Small number of molecules
– May break, decay…

12 What do we do about it?
Tolerate faulty components
Detect faults
– Not do anything bad
– Try it again
If the error is statistically unlikely,
– high likelihood it won't recur.
…Focus on detection…

13 Detect Faults
Key Idea: redundancy
Include enough redundancy in computation
– Can tell that an error occurred

14 What kind of redundancy can we use?
Multiple copies of logic
Compute something about result
– Parity on number of outputs
– Count of number of 1's in output

15 Error Detection

16 What do we protect against?
Any $n$ errors
– Worst-case selection of errors

17 Single Error Detection
If $P_{fail}$ small:
– No error: $(1-P_{fail})^N \approx 1 - N \times P_{fail}$
– One error: $N \times P_{fail} \times (1-P_{fail})^{N-1} \approx N \times P_{fail}$
– Two errors: $[N \times (N-1)/2] \times (P_{fail})^2 \times (1-P_{fail})^{N-2}$
Probability of an error going undetected
– For: $N \times P_{fail} \ll 1$
– Goes from $\approx N \times P_{fail}$ to $\approx (N \times P_{fail})^2$
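These three cases are the first terms of a binomial distribution over the number of failing components. A minimal sketch that evaluates them directly, assuming independent failures; the function name is hypothetical and the values of N and P_fail are borrowed from the example slide that follows.

```python
import math


def error_count_probability(n: int, p_fail: float, k: int) -> float:
    """Binomial probability that exactly k of n independent components fail."""
    # (1 - p_fail)**(n - k), computed via log1p so tiny p_fail is not rounded away.
    survive = math.exp((n - k) * math.log1p(-p_fail))
    return math.comb(n, k) * p_fail**k * survive


N, P_FAIL = 10**10, 1e-20

p_none = error_count_probability(N, P_FAIL, 0)  # close to 1 - N * P_fail
p_one = error_count_probability(N, P_FAIL, 1)   # close to N * P_fail
p_two = error_count_probability(N, P_FAIL, 2)   # close to (N * P_fail)**2 / 2

print(p_none, p_one, p_two)
```

The exact two-error term carries a factor of 1/2 that the slide's order-of-magnitude estimate $(N \times P_{fail})^2$ leaves out.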

18 Single Error Detection (Example)
Probability of an error going undetected
– For: $N \times P_{fail} \ll 1$
– Goes from $\approx N \times P_{fail}$ to $\approx (N \times P_{fail})^2$
$N = 10^{10}$, $P_{fail} = 10^{-20}$
– $N \times P_{fail} = 10^{-10} \ll 1$
– ~$10^{10}$ cycles MTTF (Mean Time To Failure)
– At 1 GHz = 10 s
– $(N \times P_{fail})^2 = 10^{-20}$
– $10^{20}$ cycles MTTUF (Mean Time To Undetected Fault)
– $10^{11}$ s ≈ 3000 years
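The arithmetic in this example can be reproduced with a few lines. This is a sketch only: it assumes a 1 GHz clock as on the slide, treats the mean time to an event as the reciprocal of its per-cycle probability, and uses helper names that are not from the lecture.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def mean_cycles_to_event(p_per_cycle: float) -> float:
    """Expected number of cycles until an event of per-cycle probability p."""
    return 1.0 / p_per_cycle


def cycles_to_seconds(cycles: float, clock_hz: float = 1e9) -> float:
    """Convert a cycle count to seconds at the given clock rate (1 GHz assumed)."""
    return cycles / clock_hz


n, p_fail = 10**10, 1e-20

# Without detection: any single fault is a failure.
mttf_seconds = cycles_to_seconds(mean_cycles_to_event(n * p_fail))

# With single-error detection: only double faults go unnoticed.
mttuf_seconds = cycles_to_seconds(mean_cycles_to_event((n * p_fail) ** 2))
mttuf_years = mttuf_seconds / SECONDS_PER_YEAR

print(mttf_seconds, mttuf_seconds, mttuf_years)
```

With these inputs the sketch gives about 10 s without detection and a little over 3000 years with it, matching the slide's rounded figures.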

19 Detection Overhead
…but: correction and detection circuitry increase circuit size.
$N_{detect} > N_{logic}$
$N_{detect} = c \times N_{logic}$
Probability of an error going undetected
– Goes from $\approx N \times P_{fail}$ to $\approx (c \times N \times P_{fail})^2$
– To come out ahead, want: $c^2 \ll 1/(N \times P_{fail})$
$c = 3$, $N = 10^{10}$, $P_{fail} = 10^{-20}$
– $(c \times N \times P_{fail})^2 = 9 \times 10^{-20}$
– ≈ $10^{19}$ cycles MTTUF
– $10^{10}$ s ≈ 300 years

20 Detection Overhead
…but: correction and detection circuitry increase circuit size.
$N_{detect} > N_{logic}$
$N_{detect} = c \times N_{logic}$
Probability of an error going undetected
– Goes from $\approx N \times P_{fail}$ to $\approx (c \times N \times P_{fail})^2$
– To come out ahead, want: $c^2 \ll 1/(N \times P_{fail})$
$c = 3$, $N = 3 \times 10^{10}$, $P_{fail} = 10^{-11}$
– $N \times P_{fail} = 0.3$
– $(c \times N \times P_{fail})^2 = 0.81$ ⇒ worse
– Neither workable!
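The two Detection Overhead examples can be compared side by side with a small sketch. The function names are illustrative; `detection_helps` tests only the break-even point, which is a weaker condition than the slide's "much less than" requirement.

```python
def undetected_probability(c: float, n: float, p_fail: float) -> float:
    """Per-cycle probability of an undetected (double) fault with overhead c."""
    return (c * n * p_fail) ** 2


def detection_helps(c: float, n: float, p_fail: float) -> bool:
    """True when the guarded design fails silently less often than the bare one fails."""
    return undetected_probability(c, n, p_fail) < n * p_fail


# First example: N * P_fail = 1e-10, so the overhead is easily paid for.
low_rate = undetected_probability(3, 1e10, 1e-20)
low_rate_helps = detection_helps(3, 1e10, 1e-20)

# Second example: N * P_fail = 0.3, so adding checking logic makes things worse.
high_rate = undetected_probability(3, 3e10, 1e-11)
high_rate_helps = detection_helps(3, 3e10, 1e-11)

print(low_rate, low_rate_helps)
print(high_rate, high_rate_helps)
```

The sketch makes the trade-off explicit: the overhead factor is squared, so detection only pays off while $N \times P_{fail}$ is small enough.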

21 Reliability Tuning
Want $N \times P_{fail}$ small
– Want: $(c \times N \times P_{fail})^2$ very small
Idea:
– Guard subsystems independently
– Make $N_s$ suitably small
– Smaller probability there is a double error localized in this small subsystem
That is: as long as compartmentalization guarantees very small $(c \times N_s \times P_{fail})^2$:
– can reduce to single detect case.

22 Guarding Subsystems

23 Composing Subsystems
$P_{sysundetect} = (N_{sys}/N_s) \times P_{subundetect}$
$P_{subundetect} = (c \times N_s \times P_{fail})^2$
$P_{sysundetect} = (N_{sys}/N_s) \times (c \times N_s \times P_{fail})^2$
$P_{sysundetect} = N_{sys} \times N_s \times (c \times P_{fail})^2$
Extremes:
– $N_s = N_{sys}$: no benefit
– $N_s = 1$: maximum benefit, a factor of $N_{sys}$ [in practice $c = f(N_s)$]

24 Composing Subsystems
$P_{sysundetect} = N_{sys} \times N_s \times (c \times P_{fail})^2$
Example: $c = 3$, $N_{sys} = 3 \times 10^{10}$, $P_{fail} = 10^{-11}$, $N_s = 10^3$
$3 \times 10^{10} \times 10^3 \times (3 \times 10^{-11})^2 = 3^3 \times 10^{-9} \approx 3 \times 10^{-8}$ ($\ll 0.81$)
Still < 1 s MTTUF…
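The subsystem composition can be checked numerically. A minimal sketch under the slide's assumptions (independent subsystems, constant overhead factor c, 1 GHz clock); the function name is hypothetical.

```python
def system_undetected_probability(
    n_sys: float, n_sub: float, c: float, p_fail: float
) -> float:
    """Undetected-fault probability when n_sys devices are guarded in blocks of n_sub."""
    num_subsystems = n_sys / n_sub
    p_sub_undetect = (c * n_sub * p_fail) ** 2  # double fault inside one block
    return num_subsystems * p_sub_undetect


# Example from the slide: c = 3, N_sys = 3e10, P_fail = 1e-11, N_s = 1e3.
p_partitioned = system_undetected_probability(3e10, 1e3, 3, 1e-11)

# Extreme case N_s = N_sys: one big guarded block, no benefit over 0.81.
p_monolithic = system_undetected_probability(3e10, 3e10, 3, 1e-11)

# Mean time to undetected fault at an assumed 1 GHz clock.
mttuf_seconds = 1.0 / p_partitioned / 1e9

print(p_partitioned, p_monolithic, mttuf_seconds)
```

Partitioning improves on the monolithic case by a factor of $N_{sys}/N_s$, but with these parameters an undetected fault is still expected within a fraction of a second.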

25 Problem
This motivates the problem: generate logic capable of detecting any single error

26 Terminology
Fault-secure: system never produces incorrect code word
– Either produces correct result
– Or detects the error
Self-testing: for every fault, there is some input that produces a non-code word
– That detects the error

27 Terminology
Totally Self Checking: system is both fault-secure and self-testing.

28 Duplication
Detects any single fault (even in checker)

29 Duplication
$N$ original gates
Duplicate: + $N$
$O$ outputs
– $O$ xors
– $(O/2) \times 2 \times 2$ ors
– Total $3O$ gates
Total: $2N + 3O$
$O < N$
$2 < c < 5$
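A software model of duplicate-and-compare, as a sketch only. It compares the two copies with one XOR per output followed by a plain OR reduction; it does not model the self-checking checker structure that the gate counts above refer to. The one-bit full adder, the function names, and the fault-injection parameter are all illustrative assumptions.

```python
from typing import Callable, List, Optional, Tuple


def full_adder(bits: List[int]) -> List[int]:
    """Example logic block: returns [sum, carry_out] for inputs [a, b, carry_in]."""
    a, b, cin = bits
    return [a ^ b ^ cin, (a & b) | (cin & (a ^ b))]


def duplicate_and_compare(
    logic: Callable[[List[int]], List[int]],
    inputs: List[int],
    fault_bit: Optional[int] = None,
) -> Tuple[List[int], bool]:
    """Run two copies of `logic`; optionally flip one output bit of the first copy."""
    primary = logic(inputs)
    shadow = logic(inputs)
    if fault_bit is not None:
        primary[fault_bit] ^= 1  # inject a single fault into one copy only
    mismatches = [p ^ s for p, s in zip(primary, shadow)]  # one XOR per output
    error = any(mismatches)  # OR reduction of the XOR results
    return primary, error


outputs, error = duplicate_and_compare(full_adder, [1, 1, 0])
faulty_outputs, faulty_error = duplicate_and_compare(full_adder, [1, 1, 0], fault_bit=0)
print(outputs, error)
print(faulty_outputs, faulty_error)
```

Because a single fault corrupts only one of the two copies, any output difference it causes shows up as a mismatch.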

30 Duplication
Total: $2N + 3O$
$O < N$
Rent's Rule: $O \sim k N^p$
– $p < 1$
Total: $2N + 3kN^p$
$c(N) = 2 + 3k/N^{(1-p)}$
– $N$ small → 5
– $N$ large → 2
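The way the overhead factor falls from 5 toward 2 as blocks grow can be tabulated with a short sketch. The default Rent parameters `k=1.0` and `p=0.5` are placeholder values chosen for illustration, not figures from the lecture.

```python
def duplication_overhead(n_gates: float, k: float = 1.0, p: float = 0.5) -> float:
    """Overhead c(N) = (2N + 3O) / N with O = k * N**p outputs (Rent's Rule)."""
    outputs = k * n_gates**p
    return (2 * n_gates + 3 * outputs) / n_gates


# Overhead for a range of block sizes, from a single gate to a very large block.
overheads = {n: duplication_overhead(n) for n in (1, 100, 10**4, 10**6, 10**12)}
for n, c in overheads.items():
    print(n, c)
```

This is the tension behind $c = f(N_s)$: small guarded blocks limit double faults but pay close to 5x, while large blocks approach 2x.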

31 Duplication with PLA
[Figure labels: Logic, Duplicate]

32 PLA Duplication
$N$ product terms in original
$N$ in duplicate
$2O$ product terms for matching
$O \le N$
$2 < c < 4$

33 Can we do better?
Seems like overkill to compute twice?

34 Idea
Encode so outputs have some checkable property
– E.g. parity

35 Will this work?
[Figure labels: Original Logic, Extra cubes for parity, parity]

36 Problem
Single fault may produce multiple output errors

37 How Fix?
How do we fix?

38 No Logic Sharing
No sharing
Single fault affects single output
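Why sharing defeats a single parity bit can be seen in a toy model of a PLA OR plane. This is a sketch under simplifying assumptions: product-term values are given directly as bits, a fault flips exactly one product term, and the expected parity is taken from the fault-free outputs as a stand-in for separate parity logic. All names are hypothetical.

```python
from functools import reduce
from operator import xor
from typing import List, Tuple


def or_plane(term_values: List[int], connections: List[List[int]]) -> List[int]:
    """Each output ORs together the product terms listed in its connection row."""
    return [int(any(term_values[t] for t in row)) for row in connections]


def parity(bits: List[int]) -> int:
    return reduce(xor, bits, 0)


def inject_term_fault(
    term_values: List[int], connections: List[List[int]], faulty_term: int
) -> Tuple[bool, bool]:
    """Flip one product term; return (outputs_corrupted, parity_check_fires)."""
    good = or_plane(term_values, connections)
    expected_parity = parity(good)  # stand-in for the fault-free parity logic
    faulty_terms = list(term_values)
    faulty_terms[faulty_term] ^= 1  # single fault in one product term
    bad = or_plane(faulty_terms, connections)
    return bad != good, parity(bad) != expected_parity


# Term 0 is shared by both outputs: the fault flips two outputs, parity is unchanged.
shared = inject_term_fault([0, 0, 0], [[0, 1], [0, 2]], faulty_term=0)

# No sharing: term 0 feeds only output 0, so the fault flips exactly one output.
unshared = inject_term_fault([0, 0], [[0], [1]], faulty_term=0)

print(shared, unshared)
```

In the shared case the outputs are corrupted but the parity check stays silent; without sharing, the same fault changes the parity and is caught.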

39 Parity Checking
To check parity
– Need xor tree on outputs/parity
– $[(O+1)/2] \times 2 \times 2 = 2(O+1)$ xors
For PLA
– xor would blow up
– Wrap multiple times
– 2 product terms per xor
– $4 \times O$ product terms

40 nanoPLA Wrapped xor
Note: two planes here just for buffering/inversion

41 Better or Worse than Dual?
Design   Ins  Outs  Orig Pterms  Parity  Dual
add4       9     5          135     283   240
ex1010    10                284     880   568
inc        7     9           29      53    58
misex1     8     7           12      40    24
rd73       7     3            7      20    10
rd84       8     4          255     389   441
sao2      10     4            9      31    14
squar5     5     8           25      38    49
z5xp1      7    10           63      96   125
(does not include checking)

42 Can we allow sharing?
When?

43 Multiple Parity Groups
Can share with different parity groups
Common error flagged in both groups
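A small sketch of why sharing across different parity groups is safe. It assumes outputs are partitioned into groups with one parity bit each, and that a single shared-logic fault flips one output in each of two groups; the group assignments and names are made up for illustration.

```python
from functools import reduce
from operator import xor
from typing import List


def group_parities(outputs: List[int], groups: List[List[int]]) -> List[int]:
    """Parity of the output bits belonging to each group."""
    return [reduce(xor, (outputs[i] for i in group), 0) for group in groups]


def flagged_groups(good: List[int], bad: List[int], groups: List[List[int]]) -> List[int]:
    """Indices of the groups whose parity differs between good and faulty outputs."""
    pairs = zip(group_parities(good, groups), group_parities(bad, groups))
    return [index for index, (pg, pb) in enumerate(pairs) if pg != pb]


good_outputs = [0, 0, 1, 0]
bad_outputs = [1, 1, 1, 0]  # a shared term flipped outputs 0 and 1 together

# One parity group over all outputs: the double flip cancels, nothing is flagged.
single_group = flagged_groups(good_outputs, bad_outputs, [[0, 1, 2, 3]])

# Outputs 0 and 1 placed in different groups: both groups flag the error.
split_groups = flagged_groups(good_outputs, bad_outputs, [[0, 2], [1, 3]])

print(single_group, split_groups)
```

As long as every pair of outputs that share logic sits in different groups, a single fault changes at most one bit per group and each affected group reports it.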

44 Multi-Parity Group Compare (AMD)
Design   grps  Mparity  Orig  Parity  Dual
add4        4      209   135     283   240
ex1010      2      822   284     880   568
inc         6       44    29      53    58
misex1      6       25    12      40    24
rd73        7       10     7      20    10
rd84        1      402   255     389   441
sao2        9       17     9      31    14
squar5      5       36    25      38    49
z5xp1       9      103    63      96   125
(does not include checking)

45 Best Results from Winter2004 CS137
Design   class  AMD  Orig  Parity  Dual
add4       193  209   135     283   240
ex1010          822   284     880   568
inc              44    29      53    58
misex1      23   25    12      40    24
rd73 *       8   10     7      20    10
rd84       385  402   255     389   441
sao2        13   17     9      31    14
squar5      34   36    25      38    49
z5xp1           103    63      96   125
(does not include checking)

46 Better or Worse than Dual?
Typical results from Mitra [ITC2002]
– Multi-level gate mapping to LSI std. cell library
(parity here includes multiple parity)

47 Admin
Assignment #2 due Friday
Wednesday reading online
Friday reading handout

48 Big Ideas
Low-level physics imperfect
– Statistical, noisy
Larger number of devices ⇒ greater likelihood of faults
Redundancy
Self-checking circuits

