Published by Arlene Blankenship. Modified over 9 years ago.
1
Design and Simulation of an EM-Fault-Tolerant Processor with Micro-Rollback, Control-Flow Checking and ECC
Franco Trovo, Shantanu Dutt & Hasan Arslan
Univ. of Illinois at Chicago
2
Outline
- Goals
- Solution Adopted
  - Control Flow Checking
  - Hamming encoding on the buses
  - Instruction Micro rollback
- Motorola 68040 and VHDL description
- Simulation results
- Conclusion
3
Assumptions/Scenarios of Past FD/FT Work
- Past work on general fault detection:
  - Random single (sometimes double) faults
  - Deterministic faults
  - Types of faults: permanent, transient, intermittent; the intermittent type is not generally tackled
- Past work on EM-induced faults:
  - No how/why/what analysis and classification of computer failure due to EM interference
4
Broad Goals of Our Work
- Determine and classify the following types of computer system behavioral errors (i.e., program errors) caused by different patterns, extent, duration and location of faults under EM-type faults:
  - Control flow errors: incorrect sequence of instruction execution. Causes: address generation errors, memory faults, bus faults
  - Data errors. Causes: computation errors, memory and bus faults
  - Termination errors (hung processor and crashes). Causes: control unit (C.U.) transition to dead-end states, invalid instruction, out-of-bound address, divide-by-zero, spurious interrupts
  - Note: error types are NOT mutually exclusive
- Provide recipes for FT and reliable operation
5
In This Work
- Will detect:
  - Control flow errors: incorrect sequence of instruction execution. Causes: address generation errors, memory faults, bus faults
  - Raw bus errors using ECC
- Provide an FT mechanism using these detections for reliable operation
6
Outline
- Goals
- Solution Adopted
  - Control Flow Checking
  - Hamming encoding on the buses
  - Instruction Micro rollback
- Motorola 68040 and VHDL description
- Simulation results
- Conclusion
7
FD/FT Solutions
- Fault detection:
  - Control flow checking (CFC) by concurrent error detection using a watchdog (WD) processor
  - Hamming ECC (2-error detecting) on data and address buses
- Fault tolerance:
  - Instruction micro rollback triggered by:
    - Hamming ECC
    - WD-monitored CFC
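The slides do not give the bus widths or the exact code parameters, so the following is only a minimal sketch of a 2-error-detecting Hamming code, assuming an extended Hamming (8,4) code over a 4-bit value. The function names `encode` and `check` are illustrative and are not taken from the authors' VHDL model.

```python
def encode(nibble: int) -> int:
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    # Positions 1..7 follow the classic Hamming(7,4) layout, with parity
    # bits at the power-of-two positions 1, 2 and 4.
    bits = [0] * 8                      # index 0 holds the overall parity bit
    bits[3], bits[5], bits[6], bits[7] = d
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    bits[0] = sum(bits[1:]) & 1         # makes the total parity even
    return sum(b << i for i, b in enumerate(bits))


def check(word: int) -> str:
    """Classify a received 8-bit word as 'ok', 'single' or 'double'."""
    bits = [(word >> i) & 1 for i in range(8)]
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    overall = sum(bits) & 1
    if syndrome == 0 and overall == 0:
        return "ok"
    if overall == 1:
        return "single"                 # odd parity: one flipped bit
    return "double"                     # even parity but nonzero syndrome
```

In a detection-only scheme such as the one described on this slide, either non-"ok" result could serve as the trigger for a micro rollback request.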
8
General Structure of a System with a Watchdog
[Block diagram: main processor, main memory, data bus, address (ADD.) bus and watchdog processor. The watchdog processor performs various checks (CFC, address, etc.).]
9
General Structure of a WD-Monitored System with On-Chip Cache
[Block diagram: CPU, cache, main memory (MM), watchdog (WD), address (ADD.) bus and data bus.]
10
Control Flow Checking [Mahmood, et al., IEEE TC '88]
- Hybrid solution for detecting wrong block sequence execution
- Starting from a program, it extracts a Control Flow Graph
- Each node is associated with a block of branch-free instructions plus a branch at the end
- Each edge is associated with a possible branch between two blocks

Example program:
    Block A
    If cond1 then
        Block B
        if cond2 then Block D else Block E
    Else
        Block C
    End if
    Block F

[Figure: control flow graph with nodes A, B, C, D, E, F]
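The control flow graph for the example program above can be written down directly. The sketch below is an illustration only: the dictionary encoding and the helper name `first_illegal_transition` are assumptions, not part of the cited method. It shows how an observed block sequence can be checked against the legal edges.

```python
# Edges of the control flow graph for the example program.
# Node names follow the slide; the dict encoding is an assumption.
CFG = {
    "A": {"B", "C"},   # cond1 true -> B, false -> C
    "B": {"D", "E"},   # cond2 true -> D, false -> E
    "C": {"F"},
    "D": {"F"},
    "E": {"F"},
    "F": set(),
}


def first_illegal_transition(trace):
    """Return the first (src, dst) pair that is not a CFG edge, or None."""
    for src, dst in zip(trace, trace[1:]):
        if dst not in CFG.get(src, set()):
            return (src, dst)
    return None


# A legal path and a path with a control flow error (B jumps straight to F).
legal = first_illegal_transition(["A", "B", "D", "F"])
faulty = first_illegal_transition(["A", "B", "F"])
```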
11
Control Flow Checking
- Block: branch-free set of instructions
- Signature: information added to the block in order to distinguish one block from another

[Figure: block augmentation and signature insertion, shown for the control flow graph A, B, C, D, E, F. Labels for the original block: jump-free set of instructions, JUMP. Labels for the augmented block: BLOCK sign, branch-free set of instructions, sign of 1st branch, sign of 2nd branch, Branch.]
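The slides do not say how signatures are computed, so the following sketch assumes a deliberately simple one: the XOR of all 16-bit instruction words in a block. The names `signature`, `augment`, `body_ok` and `branch_ok` are hypothetical, and the instruction words are arbitrary example values.

```python
from functools import reduce


def signature(words):
    """Hypothetical signature: XOR of all 16-bit instruction words."""
    return reduce(lambda acc, w: (acc ^ w) & 0xFFFF, words, 0)


def augment(words, successor_signatures):
    """Attach a header signature and the signatures of the legal successors."""
    return {
        "header": signature(words),
        "body": list(words),
        "succ": list(successor_signatures),
    }


def body_ok(block):
    """Recompute the signature over the fetched body and compare to the header."""
    return signature(block["body"]) == block["header"]


def branch_ok(prev_block, next_block):
    """The next block's header must match a signature stored in the previous block."""
    return next_block["header"] in prev_block["succ"]


# Arbitrary example words; block_b may legally branch only to block_d.
block_d = augment([0x1111, 0x2222], [])
block_e = augment([0x0001], [])
block_b = augment([0x4E71, 0x7000], [block_d["header"]])
```

An XOR signature misses some multi-bit faults (for example, the same bit flipped in two words of one block), so a real design would likely use a stronger function.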
12
CFC Implemented State Diagram
[State diagram. States: Reset, Begin Block, Header, Middle Block, Signature 1, Signature 2, Branch, GET1S, GET2S. Error states: Error Wrong Branch, Error Wrong Jump or Faulted Signature, Error Wrong Computed Signature, Error Signature Expected. Decisions (Y/N): "Computed signature equal to header signature?" and "Header signature equal to branch signatures?". The diagram is shown next to the control flow graph (A, B, C, D, E, F) and the augmented block layout (BLOCK sign, branch-free set of instructions, sign of 1st branch, sign of 2nd branch, Branch); one block is labeled "No Branch signs".]
13
Micro Rollback [Tamir, et al., IEEE TC '90]
- Individual state registers (RAM based)
- Register file, caches, main memory (DWB based)
14
Support for Micro Rollback for Register File - example
- MOVE 0000, D0
- ADD 000F, D0
- MOVE 0001, A3 (f)
- SUB 0002, D0
- …
15
Support for Micro Rollback for Register File - example
- MOVE 0000, D0
- ADD 000F, D0
- MOVE 0001, A3 (f)
- SUB 0002, D0
- …
Micro rollback 2 levels
[Figure: micro rollback buffer contents at this step. Extracted values: 1 000 00 D0 XX 0000 XXXX]
16
Support for Micro Rollback for Register File - example
- MOVE 0000, D0
- ADD 000F, D0
- MOVE 0001, A3 (f)
- SUB 0002, D0
- …
Micro rollback 2 levels
[Figure: micro rollback buffer contents at this step. Extracted values: 1 100 00 D0 XX D0 000F XXXX 0000]
17
Support for Micro Rollback for Register File - example
- MOVE 0000, D0
- ADD 000F, D0
- MOVE 0001, A3 (f)
- SUB 0002, D0
- …
Micro rollback 2 levels
[Figure: micro rollback buffer contents at this step. Extracted values: 1 110 00 A3 XX D0 0101 XXXX 0000 000F]
18
Support for Micro Rollback for Register File - example
- MOVE 0000, D0
- ADD 000F, D0
- MOVE 0001, A3 (f)
- SUB 0002, D0
- …
Micro rollback 2 levels
[Figure: micro rollback buffer contents at this step. Extracted values: 00 XX XXXX 1 111 D0 A3 D0 0000 000D 0101 000F]
19
Support for Micro Rollback for Register File - example
- MOVE 0000, D0
- ADD 000F, D0
- MOVE 0001, A3 (f)
- SUB 0002, D0
- …
Micro rollback 2 levels
[Figure: micro rollback buffer contents at this step. Extracted values: 00 XX XXXX 1 100 D0 A3 D0 0000 000D 0101 000F]
20
Support for Micro Rollback for Register File - example
- MOVE 0000, D0
- ADD 000F, D0
- MOVE 0001, A3 (f)
- SUB 0002, D0
- …
[Figure: micro rollback buffer contents at this step. Extracted values: 00 XX XXXX 1 1 D0 0000 10 D0 A3 000D 0001 000F]
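The register file example can be replayed in software. The sketch below is a simplified model under stated assumptions, not the authors' design: writes are held in a delayed write buffer (DWB) and committed only once they are older than the maximum rollback distance, the buffer depth of 3 is chosen for illustration, and "(f)" is read as a fault that makes MOVE 0001, A3 write 0101 instead (the value that appears in the extracted figure data). The class name `RollbackRegisterFile` is hypothetical.

```python
from collections import deque


class RollbackRegisterFile:
    def __init__(self, depth):
        self.regs = {}          # committed register state
        self.dwb = deque()      # pending writes, oldest first
        self.depth = depth      # maximum rollback distance

    def read(self, reg):
        # The newest pending write to a register shadows older values.
        for r, v in reversed(self.dwb):
            if r == reg:
                return v
        return self.regs.get(reg, 0)

    def write(self, reg, value):
        self.dwb.append((reg, value & 0xFFFF))
        if len(self.dwb) > self.depth:
            r, v = self.dwb.popleft()   # oldest write can no longer be undone
            self.regs[r] = v

    def rollback(self, levels):
        if levels > len(self.dwb):
            raise ValueError("rollback distance exceeds buffered history")
        for _ in range(levels):
            self.dwb.pop()              # discard the newest pending writes


rf = RollbackRegisterFile(depth=3)
rf.write("D0", 0x0000)                   # MOVE 0000, D0
rf.write("D0", rf.read("D0") + 0x000F)   # ADD 000F, D0
rf.write("A3", 0x0101)                   # MOVE 0001, A3 corrupted by the fault
rf.write("D0", rf.read("D0") - 0x0002)   # SUB 0002, D0
rf.rollback(2)                           # undo SUB and the faulty MOVE
rf.write("A3", 0x0001)                   # re-execute MOVE 0001, A3
rf.write("D0", rf.read("D0") - 0x0002)   # re-execute SUB 0002, D0
```

After the replay, D0 reads 000D and A3 reads 0001, which agrees with the values in the last frame of the slide sequence.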
21
CFC with Micro Rollback - Priority
- Two concurrent fault detection techniques can request a micro rollback from the processor
- They generally request different numbers of rollback levels
- Which technique should have priority in case of simultaneous detection by both HC and WD?
  - We assign the priority to the Hamming code
    - Reason: shorter jump-backs
    - Although a rationale exists for WD priority

[Figure: HC and WD both send requests to the MRB unit, e.g. uRB = 1 and uRB = 3; which one is served?]
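The priority rule can be summarized in a few lines. This is a sketch only; the function name `arbitrate` and the use of `None` for "no request" are assumptions made for illustration.

```python
def arbitrate(hc_levels=None, wd_levels=None):
    """Pick the rollback request to serve in one cycle.

    Each argument is the number of rollback levels requested by that
    detector, or None if it raised no request.
    """
    if hc_levels is not None:
        return ("HC", hc_levels)    # Hamming code wins simultaneous requests
    if wd_levels is not None:
        return ("WD", wd_levels)
    return None


both = arbitrate(hc_levels=1, wd_levels=3)
wd_only = arbitrate(wd_levels=3)
```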
22
CFC with Instruction Micro Rollback – State Diagram
[State diagram with multiple points of micro rollback. States: Reset, Begin Block, Header, Middle Block, Signature 1, Signature 2, Branch, GET1S, GET2S. Error states: Error Wrong Branch, Error Wrong Computed Signature, Error Wrong Branch or Faulted Signatures. Decisions (Y/N): "Computed signature equal to header signature?" and "Header signature equal to jump signatures?". Rollback distances annotated on the diagram: urb_d = 1, urb_d = 2, urb_d = 3, urb_d = bsize. The diagram is shown next to the control flow graph (A, B, C, D, E, F) and the augmented block layout.]

t = number of times the same error state is encountered:
- t < t1: urb to BEGIN_BLOCK (1 instr); read the header signature again
- t1 <= t < t2: urb to "Branch" (2 instr); re-execute the previous block's branch
- t >= t2: urb to MIDDLE BLOCK (3 instr); re-read the 2 branch signatures of the previous block

Hamming code: urb_d = 1 (re-execute previous branch)
23
Outline
- Goals
- Solution Adopted
  - Control Flow Checking
  - Hamming encoding on the buses
  - Instruction Micro rollback
- Motorola 68040 and VHDL description
- Simulation results
- Conclusion
24
Improved VHDL Model of 68040 + Watchdog connections
[Block diagram labels: WD, Hamming code error detection bits, control lines, data buses.]
25
Outline
- Goals
- Solution Adopted
  - Control Flow Checking
  - Hamming encoding on the buses
  - Instruction Micro rollback
- Motorola 68040 and VHDL description
- Simulation results
- Conclusion
26
Simulation Environment
- The Total Fault Injection Time is the total duration of the intermittent fault on the bus or buses considered.
- The Delay Time is the time that the fault generator (FG) waits before starting the fault injection.
- The Period Time is the period of the intermittent fault.
- The Fault Time is the duration of the injection of a single fault.

[Timing diagram of the Fault Enable signal. Labels: Start Fault Injection, First Fault Injected, Second Fault Injected, Delay Time, Period Time, Fault Time, Total Fault Injection Time.]
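The four timing parameters can be combined into a single predicate for the Fault Enable signal. The sketch below assumes that the fault is active at the start of each period and that the Total Fault Injection Time is measured from the first injection; the slides do not state either detail, and the function name is hypothetical.

```python
def fault_enabled(t, delay, period, fault_time, total):
    """True if the intermittent fault is active at time t.

    All arguments share one time unit (for example clock cycles).
    """
    if t < delay or t >= delay + total:
        return False                    # outside the injection window
    return (t - delay) % period < fault_time


# Delay 5, period 4, fault time 1 (a 25% duty cycle), total injection time 12.
active = [t for t in range(25) if fault_enabled(t, 5, 4, 1, 12)]
```

With these definitions the duty cycle is simply Fault Time divided by Period Time.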
27
Fault Parameters Values
Simulations run on the model:
- Faults injected on all cache buses
- Fault types
  - Random: double, triple, quadruple faults
  - Clustered: 1 cluster of 2 bits, 1 cluster of 4 bits, 2 clusters of 2 bits
- Three values of repeat frequency
  - Low (100 clock cycles = 100 kHz)
  - Medium (10 clock cycles = 1 MHz)
  - High (1 clock cycle = 10 MHz)
- Three values of duty cycle
  - 25%: all the simulations
  - 50%: all except high freq and 4 faults
  - 75%: all 2 faults and 3 faults middle frequencies
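To make the two fault families concrete, the sketch below generates bit masks for random and clustered faults on a bus. It is an illustration under assumptions: the 32-bit width, the function names, and the region-based placement of clusters are not taken from the slides, and the slides do not say whether a fault flips a bit or forces it to a fixed value.

```python
import random


def random_faults(width, n, rng):
    """Mask with n distinct, randomly placed faulty bit positions."""
    mask = 0
    for pos in rng.sample(range(width), n):
        mask |= 1 << pos
    return mask


def clustered_faults(width, clusters, size, rng):
    """Mask with `clusters` runs of `size` adjacent faulty bits.

    The bus is split into equal regions with one run per region, so runs
    never overlap (they may touch at a region boundary).
    """
    region = width // clusters
    if region < size:
        raise ValueError("bus too narrow for the requested clusters")
    mask = 0
    for i in range(clusters):
        start = i * region + rng.randrange(region - size + 1)
        mask |= ((1 << size) - 1) << start
    return mask


rng = random.Random(0)                          # fixed seed for repeatability
triple = random_faults(32, 3, rng)              # random triple fault
one_by_four = clustered_faults(32, 1, 4, rng)   # 1 cluster of 4 bits
two_by_two = clustered_faults(32, 2, 2, rng)    # 2 clusters of 2 bits
```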
28
Simulation Results (contd.)
29
NOTE:
- HC has better error coverage for clustered faults
- The block signature check (part of CFC) has better error coverage for random faults
30
Simulation Results (contd.)
31
Conclusions
- Micro-rollback coupled with FD for the first time
- Micro-rollable WD state diagram for the first time
- More extensive fault patterns than previous work
- Good reliability for our FD/FT solutions (correct or fail-safe execution)
  - 3 faults: 94% low freq, 90% mid freq, 90% high freq
  - 4 faults: 86% low freq, 80% mid freq, 80% high freq
- Average execution time is linear in the duty cycle and almost quadratic in the fault injection frequency
  - Time overhead, 3 faults: 11% low, 12% med, 64% high freq
  - Time overhead, 4 faults: 16% low, 32% med, 182% high freq
- Data buses are less tolerant to faults than address buses (faults on the latter cause more CFC errors and so are detected more easily)
32
Future Work
- Introduction of other fault detection techniques as triggers for micro rollback
  - Lower-level fault detection such as micro-instruction control flow checking: can detect internal processor faults
  - Higher-level fault detection such as algorithm-based fault tolerance (ABFT) for checking data errors: can detect external and internal faults affecting data