DESIGN AND EVALUATION OF HYBRID FAULT-DETECTION SYSTEMS Qing Xu Kevin Wang.


1 DESIGN AND EVALUATION OF HYBRID FAULT-DETECTION SYSTEMS Qing Xu Kevin Wang

2 OUTLINE
- Background
- Motivation
- Key Ideas
- Introduction to CRAFT
- Summary and Discussion Points

3 BACKGROUND
Smaller and faster transistors:
- Lower threshold voltage
- Tighter noise margins
- Less reliable
Cause: alpha particles lead to transient faults
Results:
- Incorrect program execution
Recovery: REDUNDANCY
- Software only
- Hardware only
Figure: the same program shown twice to illustrate redundant execution: int main() { cout << "Hello\n"; }

4 MOTIVATION AND GOAL
Software only:
- Inadequate coverage
- Slow
Hardware only:
- Large overhead/area
- High cost
Hybrid solution:
- Better reliability and performance
- Lower hardware area and cost

5 KEY IDEA: COMPILER ASSISTED FAULT TOLERANCE (CRAFT)
Characteristics:
- Based on a software technique
- Minimal hardware adaptations
- Takes advantage of both software and hardware solutions
Benefits:
- Nearly perfect reliability
- Low performance degradation
- Low hardware cost

6 CRAFT: HYBRID OF EXISTING METHODS
Hardware methods:
- Redundant Multithreading Technique (RMT)
- Error Correcting Codes (ECC)
Advantages: almost-perfect fault coverage, low performance cost
Software methods:
- Software Implemented Fault Tolerance (SWIFT)
- Error Detection by Duplicating Instructions (EDDI)
Advantages: high fault coverage, modest performance cost, zero hardware cost

7 EXISTING METHOD: HARDWARE RMT
- Redundant Multithreading (RMT) uses SMT resources to run loosely synchronized redundant threads: an original thread and a checker thread
- Components not covered by redundant execution must employ alternative techniques, such as Error Correction Code (ECC)

8 EXISTING METHOD: SOFTWARE SWIFT
- A compiler-based transformation
- The store instruction is the synchronization point
- Assumes that Error Correction Code (ECC) guards the correctness of the memory subsystem
Original code:
    ld r3 = [r4]
    add r1 = r2, r3
    st m[r1] = r2
SWIFT code:
    ld r3 = [r4]
    mov r3' = r3
    add r1 = r2, r3
    add r1' = r2', r3'
    br Fault, r1 != r1'
    br Fault, r2 != r2'
    br Fault, r3 != r3'
    st m[r1] = r2
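The SWIFT sequence on this slide can be mimicked in a few lines of Python. This is only an illustrative sketch, not the authors' compiler pass: the dict-based register files, the `flip` parameter, and the `run_swift_sequence` and `FaultDetected` names are all assumptions made for the example.

```python
class FaultDetected(Exception):
    """Raised when a register and its shadow copy disagree."""


def run_swift_sequence(memory, regs, shadow, flip=None):
    """Run the slide's SWIFT sequence on dict-based register files.

    `flip` (hypothetical) names a register in `regs` whose lowest bit is
    corrupted just before the checks, modelling a transient fault.
    """
    # ld r3 = [r4] ; mov r3' = r3
    regs["r3"] = memory[regs["r4"]]
    shadow["r3"] = regs["r3"]
    # add r1 = r2, r3 ; add r1' = r2', r3'
    regs["r1"] = regs["r2"] + regs["r3"]
    shadow["r1"] = shadow["r2"] + shadow["r3"]
    if flip is not None:
        regs[flip] ^= 1
    # br Fault, rX != rX' for each register checked before the store
    for name in ("r1", "r2", "r3"):
        if regs[name] != shadow[name]:
            raise FaultDetected(name)
    # st m[r1] = r2 (the store is the synchronization point)
    memory[regs["r1"]] = regs["r2"]
    return memory


fault_free = run_swift_sequence({100: 5}, {"r2": 7, "r4": 100}, {"r2": 7, "r4": 100})
```

Passing `flip="r1"` makes the first check fail and raises `FaultDetected`. A fault that strikes after the checks but before the store would go unnoticed in this sketch.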

9 CRAFT: SUITE OF THREE DETECTION SYSTEMS
List of the suite:
1. Checking Store Buffer (CSB)
2. Load Value Queue (LVQ)
3. CSB + LVQ
Preliminaries:
- Assume the Single Event Upset (SEU) fault model
- Architecturally Correct Execution (ACE)
- Detected Unrecoverable Error (DUE)
- Silent Data Corruption (SDC)

10 SUITE 1: CHECKING STORE BUFFER (CSB)
Problem to improve: SWIFT is vulnerable to faults in the time interval between the validation and the use of a register value
Timeline: values validated -> vulnerable to faults -> validated values used
Solution: add a store buffer that performs the checks

11 CSB: IMPLEMENTATION
Basic idea: commit a store only when the two copies of the store data match
Method: create a CSB that keeps track of all original and duplicated instructions
Step 1: The compiler duplicates each store and tags the copies with a single-bit version name: st [r1] = r2 becomes st1 [r1] = r2 and st2 [r1'] = r2'
Step 2: New store entries are placed in the CSB. When a duplicate matches an existing entry, the duplicate is discarded and the entry is marked OK to execute
Step 3: Unchecked stores clog the head of the CSB. A fault is detected when the CSB fills up

12 CSB: IMPLEMENTATION
Basic idea: commit a store only when the two copies of the store data match
Method: create a CSB that keeps track of all original and duplicated instructions
Compiler duplicates stores: st [r1] = r2 -> st1 [r1] = r2, st2 [r1'] = r2'
CSB contents:
CSB #      0     1     2     3
Address    --    --    0xFF  0xEE
Value      --    --    0x8   0x1
Validated  --    --    N     N
Incoming duplicates:
- Insn duplicate #1 (0xFF, 0x8): match (Y). Store value checks out; send to MEM.
- Insn duplicate #2 (0xEE, 0x2): no match (N). Not OK to go to MEM; the table fills up and causes a structural hazard.
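The CSB behaviour described on the two implementation slides can be sketched as a small Python class. This is a simplified model under stated assumptions, not the hardware design: the `CheckingStoreBuffer` class, its `store` method, the in-order drain from the head, and the default size of 4 are all hypothetical choices for illustration.

```python
class FaultDetected(Exception):
    """Raised when the CSB fills up with unchecked stores."""


class CheckingStoreBuffer:
    def __init__(self, size=4):
        self.size = size
        self.entries = []   # oldest first; one dict per pending store
        self.memory = {}    # stands in for the memory subsystem

    def store(self, version, addr, value):
        """Accept one copy (version 1 or 2) of a duplicated store."""
        # Look for an unvalidated twin issued under the other version bit.
        for entry in self.entries:
            if (not entry["validated"] and entry["version"] != version
                    and entry["addr"] == addr and entry["value"] == value):
                entry["validated"] = True   # duplicate discarded, entry OK
                break
        else:
            if len(self.entries) == self.size:
                raise FaultDetected("CSB full of unchecked stores")
            self.entries.append({"version": version, "addr": addr,
                                 "value": value, "validated": False})
        # Commit validated stores to memory, in order, from the head.
        while self.entries and self.entries[0]["validated"]:
            head = self.entries.pop(0)
            self.memory[head["addr"]] = head["value"]


csb = CheckingStoreBuffer(size=4)
csb.store(1, 0xFF, 0x8)
csb.store(2, 0xFF, 0x8)   # copies match: the store commits
csb.store(1, 0xEE, 0x1)
csb.store(2, 0xEE, 0x2)   # copies differ: both entries clog the buffer
```

In this sketch the mismatched pair is never committed, and later stores eventually find the buffer full, which is when the fault is reported.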

13 CSB: ADVANTAGES / DISADVANTAGES
Advantages:
- Checking is implemented at the hardware level: validation code is no longer needed, which reduces code size
- Store instructions are no longer synchronization points (as in SWIFT), which allows more dynamic scheduling
Disadvantages:
- Additional compiler requirement: the distance between duplicated instructions must not exceed the size of the CSB

14 SUITE 2: LOAD VALUE QUEUE (LVQ)
Problem to improve: SWIFT verifies loads by generating a move instruction after each load to keep a copy of the value
    br faultDet, r2 != r2'
    ld r1 = [r2]
    mov r1' = r1
Solution: add a load value queue

15 SUITE 2: LOAD VALUE QUEUE (LVQ)
Problem to improve: in SWIFT there is a window of vulnerability between the load instruction and the value duplication
Timeline: loading values -> vulnerable to faults -> copying values
Solution: add a load value queue

16 LVQ: IMPLEMENTATION PROCEDURE
Basic idea: duplicate the load to enable redundant computation
Method: the LVQ provides redundant load instruction execution
Step 1: The compiler duplicates each load and tags the copies with a single-bit version name: ld [r1] = r2 becomes ld1 [r1] = r2 and ld2 [r1'] = r2'
Step 2: The duplicated load is bypassed from the LVQ. The entry is de-allocated from the LVQ when the two copies match
Step 3: A fault is detected when the two duplicated copies fail to match

17 LVQ: IMPLEMENTATION PROCEDURE
Basic idea: duplicate the load to enable redundant computation
Method: the LVQ provides redundant load instruction execution
Compiler duplicates loads: ld [r1] = r2 -> ld1 [r1] = r2, ld2 [r1'] = r2'
LVQ contents:
LVQ #      0     1     2     3
Address    --    --    0xAA  0xBB
Value      --    --    0x2   0x1
Validated  --    --    N     N
Incoming duplicates: Insn #1 (0xAA, 0x2) and Insn #2 (0xBB, 0x1)
- Match (Y): load value checks out
- No match (N): error detected

18 LVQ: IMPLEMENTATION PROCEDURE
Basic idea: duplicate the load to enable redundant computation
Method: the LVQ provides redundant load instruction execution
Compiler duplicates loads: ld [r1] = r2 -> ld1 [r1] = r2, ld2 [r1'] = r2'
Figure: the ld insn occupies an LVQ entry for address 0xAA; the ld insn duplicate is served (0xAA, 0x2) from that entry
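The three LVQ steps can be sketched in Python as follows. This is a toy model assuming a FIFO queue and a dict for memory; the `LoadValueQueue` class and its `load` and `load_dup` methods are hypothetical names chosen for the example, not part of CRAFT itself.

```python
from collections import deque


class FaultDetected(Exception):
    """Raised when a duplicate load disagrees with its original."""


class LoadValueQueue:
    def __init__(self, memory):
        self.memory = memory    # dict standing in for the memory subsystem
        self.queue = deque()    # (address, value) pairs awaiting a duplicate

    def load(self, addr):
        """Original load (ld1): read memory and allocate an LVQ entry."""
        value = self.memory[addr]
        self.queue.append((addr, value))
        return value

    def load_dup(self, addr):
        """Duplicate load (ld2): bypass the value from the LVQ, not memory."""
        if not self.queue or self.queue[0][0] != addr:
            raise FaultDetected(f"no matching LVQ entry for {addr:#x}")
        _, value = self.queue.popleft()   # copies match: de-allocate the entry
        return value


lvq = LoadValueQueue({0xAA: 0x2, 0xBB: 0x1})
original = lvq.load(0xAA)
duplicate = lvq.load_dup(0xAA)   # served from the queue, no second memory access
```

Because the duplicate never touches memory, memory traffic stays the same as in the unprotected program, and a corrupted address in the redundant copy shows up as a mismatch at the head of the queue.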

19 LVQ: ADVANTAGES / DISADVANTAGES
Advantages:
- Reduces the window of vulnerability by issuing a duplicated load instruction
- Keeps memory traffic low by bypassing the load value
Disadvantages:
- Extra hardware is needed to ensure that loads and their duplicates access the same entry in the LVQ

20 SUITE 3: CSB + LVQ
- Adds both the CSB and the LVQ simultaneously to a software-only solution like SWIFT

21 COMPARISON OF DIFFERENT APPROACHES
Technique          Category  Opcode/Control  Load/Store  Memory  Hardware Requirement
RMT                HW        All             All         None    SMT base machine + CSB + LVQ
SWIFT              SW        Some            Some        None    None
CRAFT: CSB + LVQ   Hybrid    Some            All         None    CSB + LVQ

22 EXPERIMENTAL EVALUATION
Evaluation method (performance vs. reliability):
- Inject randomly chosen faults into a detailed microarchitectural simulation
- Track each chosen bit flip until the program completes
- Analyze the final result to determine:
  - How much SDC is converted to DUE
  - How much work (# of applications) the program completed before encountering an SDC
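The flavour of such a fault-injection campaign can be shown with a toy Python sketch. It is far simpler than a microarchitectural simulation: the summing "program", the three fault sites, and the `run_protected` and `campaign` functions are all assumptions invented for this illustration.

```python
import random


def run_protected(values, fault_site=None, bit=0):
    """Run a duplicated sum with one optional single-bit flip; classify it."""
    total = sum(values)     # original computation
    shadow = sum(values)    # redundant computation
    if fault_site == "original":
        total ^= 1 << bit
    elif fault_site == "shadow":
        shadow ^= 1 << bit
    if total != shadow:
        return "DUE"        # detected unrecoverable error
    if fault_site == "after_check":
        total ^= 1 << bit   # fault in the window after validation
    return "correct" if total == sum(values) else "SDC"


def campaign(values, trials, seed=0):
    """Inject one randomly chosen bit flip per trial and tally the outcomes."""
    rng = random.Random(seed)
    counts = {"correct": 0, "DUE": 0, "SDC": 0}
    for _ in range(trials):
        site = rng.choice(["original", "shadow", "after_check"])
        bit = rng.randrange(32)
        counts[run_protected(values, site, bit)] += 1
    return counts


tally = campaign([1, 2, 3], trials=1000)
```

In this toy model, flips before the check become DUEs and flips in the window after the check become SDCs, which is the distinction the evaluation measures.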

23 EXPERIMENTAL EVALUATION
Results: measures the # of applications the program completed before encountering an SDC
Implementation  Performance
CSB             Enables better performance because it eliminates scheduling constraints
LVQ             Impact varies by benchmark

24 SUMMARY AND CONCLUSION
A hybrid technique can provide better reliability at relatively low cost
CRAFT, as compared to a software-only technique:
- Execution time reduced by 5%
- SDC-to-DUE conversion rate increased by 75%
CRAFT, as compared to a hardware-only technique:
- Significantly reduced area overhead
- Comparable reliability maintained

25 DISCUSSION POINTS
- CRAFT detects a fault when the CSB is clogged: is there a tradeoff between detection latency and more flexible scheduling?
- Recovery method?
- Evaluation in terms of coverage?

26 CRAFT
Advantages:
- Maintains all of SWIFT's advantages and increases reliability
- Low cost, relatively high reliability
- Better performance than SWIFT
Disadvantages:
- No recovery method
- CSB gives much higher performance than LVQ
- No evaluation of coverage
- Compiler is ISA- and microarchitecture-dependent

