SWAT: Designing Resilient Hardware by Treating Software Anomalies
Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup K. Sahoo, Siva Kumar Sastry Hari, Rahmet Ulya Karpuzcu, Sarita Adve, Vikram Adve, Yuanyuan Zhou
Department of Computer Science, University of Illinois at Urbana-Champaign
swat@cs.uiuc.edu

2 Motivation
Hardware failures will happen in the field
–Aging, soft errors, inadequate burn-in, design defects, …
→ Need in-field detection, diagnosis, recovery, repair
Reliability problem pervasive across many markets
–Traditional redundancy (e.g., nMR) too expensive
–Piecemeal solutions for specific fault model too expensive
–Must incur low area, performance, power overhead
Today: low-cost solution for multiple failure sources

3 Observations
Need to handle only hardware faults that propagate to software
Fault-free case remains common, must be optimized
→ Watch for software anomalies (symptoms)
Hardware fault detection ~ Software bug detection
Zero to low overhead “always-on” monitors
Diagnose cause after symptom detected
May incur high overhead, but rarely invoked
→ SWAT: SoftWare Anomaly Treatment

4 SWAT Framework Components
Detection: Symptoms of S/W misbehavior, minimal backup H/W
Recovery: Hardware/Software checkpoint and rollback
Diagnosis: Rollback/replay on multicore
Repair/reconfiguration: Redundant, reconfigurable hardware
Flexible control through firmware
[Figure: timeline with labels Checkpoint, Fault, Error, Symptom detected, Recovery, Diagnosis, Repair]

5 SWAT
1. Detectors w/ Hardware support [ASPLOS ’08]
2. Detectors w/ Software support [Sahoo et al., DSN ’08]
3. Trace-Based Fault Diagnosis [Li et al., DSN ’08]
4. Accurate Fault Modeling
[Figure: timeline with labels Checkpoint, Fault, Error, Symptom detected, Recovery, Diagnosis, Repair]

6 Hardware-Only Symptom-based detection
Observe anomalous symptoms for fault detection
–Incur low overheads for “always-on” detectors
–Minimal support from hardware
Fatal traps generated by hardware
–Division by Zero, RED State, etc.
Hangs detected using simple hardware hang detector
High OS activity detected with performance counter
–Typical OS invocations take 10s or 100s of instructions
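
The high-OS-activity symptom can be pictured as a counter over retired instructions. The following is a minimal Python sketch of that idea, not the authors' hardware: the names `HighOSDetector`, `observe`, and `first_symptom`, and any threshold value passed in, are assumptions made for illustration. The slide only states that typical OS invocations take 10s or 100s of instructions.

```python
from dataclasses import dataclass


@dataclass
class HighOSDetector:
    """Flags a symptom when too many contiguous instructions retire in the OS.

    Hypothetical model of a performance-counter-based monitor; the threshold
    is an assumed tunable, not a value taken from the slides.
    """
    threshold: int
    contiguous_os_instrs: int = 0

    def observe(self, in_os: bool) -> bool:
        """Record one retired instruction; return True if the symptom fires."""
        if in_os:
            self.contiguous_os_instrs += 1
        else:
            # Returning to the application resets the run length.
            self.contiguous_os_instrs = 0
        return self.contiguous_os_instrs > self.threshold


def first_symptom(trace, threshold):
    """Return the index of the first instruction that raises the symptom, or None."""
    detector = HighOSDetector(threshold)
    for index, in_os in enumerate(trace):
        if detector.observe(in_os):
            return index
    return None
```

Here `trace` is a sequence of booleans, one per retired instruction, that is True while the core executes OS code. A short OS invocation never crosses the threshold, while an abnormally long stay in the OS does.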

7 Experimental Methodology
Microarchitecture-level fault injection
–GEMS timing models + Simics full-system simulation
–SPEC workloads on Solaris-9 OS
Permanent fault models
–Stuck-at, bridging faults in latches of 8 µarch structures
–12,800 faults, <0.3% error @ 95% confidence
Simulate impact of fault in detail for 10M instructions
[Figure: fault injected, then timing simulation for 10M instr; if no symptom in 10M instr, run to completion in functional simulation; outcomes: app masked, or symptom > 10M, or silent data corruption (SDC)]
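
The outcome categories of one injection run can be written down as a small classifier. This is a sketch under stated assumptions: the function name `classify_injection`, its parameters, and the string labels are hypothetical and only mirror the categories named on the slide (detected within the 10M-instruction window, symptom after 10M, masked, SDC).

```python
DETAILED_WINDOW = 10_000_000  # instructions simulated in detail, per the slide


def classify_injection(symptom_at, output_matches_fault_free):
    """Classify one fault-injection run.

    symptom_at: instruction count at which a symptom fired, or None if no
        symptom was ever observed (hypothetical representation).
    output_matches_fault_free: True if the run, taken to completion, produced
        the same output as a fault-free run.
    """
    if symptom_at is not None and symptom_at <= DETAILED_WINDOW:
        return "detected"
    if symptom_at is not None:
        return "symptom > 10M"
    # No symptom at all: the fault either had no effect or silently
    # corrupted the output.
    return "masked" if output_matches_fault_free else "SDC"
```

For example, a run with no symptom whose output differs from the fault-free output is counted as silent data corruption.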

8 Efficacy of Hardware-only Detectors
Coverage: Percentage of unmasked faults detected
–98% faults detected, 0.4% give SDC (w/o FPU)
→ Additional support required for FPU-like units
–66% of detected faults corrupt OS state, need recovery
→ Despite low OS activity in fault-free execution
Latency: Number of instr between activation and detection
–HW recovery for up to 100k instr, SW for longer latencies
–App in 87% of detections recoverable using HW
–OS recoverable in virtually all detections using HW
→ OS recovery using SW hard

9 Improving SWAT Detection Coverage
Can we improve coverage, SDC rate further?
SDC faults primarily corrupt data values
–Illegal control/address values caught by other symptoms
–Need detectors to capture “semantic” information
Software-level invariants capture program semantics
–Use when higher coverage desired
–Sound program invariants → expensive static analysis
–We use likely program invariants

10 Likely Program Invariants
Likely program invariants
–Hold on all observed inputs, expected to hold on others
–But suffer from false positives
–Use SWAT diagnosis to detect false positives on-line
iSWAT - Compiler-assisted symptom detectors
–Range-based value invariants [Sahoo et al., DSN ’08]
–Check MIN ≤ value ≤ MAX on data values
–Disable invariant when diagnosed as a false positive
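
The range-based check and the disable-on-false-positive rule can be sketched in a few lines of Python. This is an illustration only, not the iSWAT compiler pass: the class `RangeInvariant`, the function `check`, and the `is_false_positive` callback (a stand-in for SWAT diagnosis) are hypothetical names.

```python
class RangeInvariant:
    """Likely invariant MIN <= value <= MAX learned from training runs."""

    def __init__(self):
        self.min = None
        self.max = None
        self.enabled = True

    def train(self, value):
        """Widen the range to include a value observed during training."""
        self.min = value if self.min is None else min(self.min, value)
        self.max = value if self.max is None else max(self.max, value)

    def holds(self, value):
        """True if the value is inside the trained range or the check is off."""
        if not self.enabled or self.min is None:
            return True
        return self.min <= value <= self.max


def check(invariant, value, is_false_positive):
    """Return True if a fault is reported for this value.

    `is_false_positive` is a hypothetical callback standing in for SWAT
    diagnosis; when it returns True the invariant is disabled for good.
    """
    if invariant.holds(value):
        return False
    if is_false_positive(value):
        invariant.enabled = False
        return False
    return True
```

Because the range is learned only from observed inputs, a legitimate value outside it is possible; disabling the invariant after diagnosis keeps the same false positive from firing again.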

11 iSWAT implementation
[Figure: Training phase. Application → compiler pass in LLVM → application with invariant monitoring code; runs on test, train, and external inputs produce ranges for i/p #1 … i/p #n, which are combined into invariant ranges]

12 iSWAT implementation
[Figure: Training phase. Application → compiler pass in LLVM → application with invariant monitoring code; runs on test, train, and external inputs produce ranges for i/p #1 … i/p #n, which are combined into invariant ranges. Fault detection phase. Application → compiler pass in LLVM → application with invariant checking code → full-system simulation on ref input with injected faults; an invariant violation goes to SWAT diagnosis, which reports either a false positive (disable invariant) or a fault detection]

13 iSWAT Results
Explored SWAT with 5 apps on previous methodology
Undetected faults reduced by 30%
Invariants reduce SDCs by 73% (33 to 9)
Overheads: 5% on x86, 14% on UltraSparc IIIi
–Reasonably low overheads on some machines
–Un-optimized invariants used, can be further reduced
Exploring more sophistication for ↑ coverage, ↓ overheads

14 Fault Diagnosis
Symptom-based detection is cheap but
–High latency from fault activation to detection
–Difficult to diagnose root cause of fault
–How to diagnose SW bug vs. transient vs. permanent fault?
For permanent fault within core
–Disable entire core? Wasteful!
–Disable/reconfigure µarch-level unit?
–How to diagnose faults to µarch unit granularity?
Key ideas
–Single core fault model, multicore → fault-free core available
–Checkpoint/replay for recovery → replay on good core, compare
–Synthesizing DMR, but only for diagnosis

15 SW Bug vs. Transient vs. Permanent
Rollback/replay on same/different core
Watch if symptom reappears
[Figure: decision tree]
Symptom detected → rollback on faulty core
–No symptom: transient or non-deterministic s/w bug, continue execution
–Symptom: false positive (iSWAT), deterministic s/w bug, or permanent h/w fault → rollback/replay on good core
  –No symptom: permanent h/w fault, needs repair!
  –Symptom: false positive (iSWAT) or deterministic s/w bug, send to s/w layer
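
The two-step replay decision can be sketched as a short Python function. This is an illustrative sketch: the enum `Verdict`, the function `classify_symptom`, and the two replay callbacks are hypothetical names, and each callback is assumed to return True when the symptom reappears during that replay.

```python
from enum import Enum


class Verdict(Enum):
    TRANSIENT_OR_NONDETERMINISTIC_SW = "continue execution"
    PERMANENT_HW_FAULT = "needs repair"
    DETERMINISTIC_SW_OR_FALSE_POSITIVE = "send to s/w layer"


def classify_symptom(replay_on_faulty_core, replay_on_good_core):
    """Classify the cause of a detected symptom by rollback/replay.

    Both arguments are hypothetical callables that roll back to the last
    checkpoint, replay, and return True if the symptom reappears.
    """
    if not replay_on_faulty_core():
        # The symptom vanished on the same core, so it was not repeatable.
        return Verdict.TRANSIENT_OR_NONDETERMINISTIC_SW
    if not replay_on_good_core():
        # Repeatable on the original core only: the hardware is at fault.
        return Verdict.PERMANENT_HW_FAULT
    # Repeatable even on fault-free hardware: a software-level cause.
    return Verdict.DETERMINISTIC_SW_OR_FALSE_POSITIVE
```

The second replay runs only when the first one reproduces the symptom, so the fault-free core is borrowed only in the rarer repeatable case.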

16 Diagnosis Framework
[Figure: Symptom detected → diagnosis → software bug, transient fault, or permanent fault; permanent fault → microarchitecture-level diagnosis → “Unit X is faulty”]

17 Trace-Based Fault Diagnosis (TBFD)
[Figure: permanent fault detected → invoke TBFD; faulty core execution and fault-free core execution are compared (=?) and the result feeds the diagnosis algorithm]

18 Trace-Based Fault Diagnosis (TBFD)
[Figure: permanent fault detected → invoke TBFD → roll back faulty core to checkpoint → replay execution, collect info; the collected info is compared (=?) with fault-free core execution and the result feeds the diagnosis algorithm]

19 Trace-Based Fault Diagnosis (TBFD)
[Figure: permanent fault detected → invoke TBFD. Faulty core: roll back to checkpoint, replay execution, collect info. Fault-free core: load checkpoint, fault-free instruction exec. The two are compared (=?) and the result feeds the diagnosis algorithm]
What info to collect?
What info to compare?
What to do on divergence?

20 Can a Divergent Instruction Lead to Diagnosis?
Simpler case: ALU fault
[Figure: table of two divergent instructions (add r1,r3,r5 and sub r6,r1,r2) listing the HW used (dec, alu, dst preg) and the faulty vs. fault-free results]
Both divergent instructions used same ALU → ALU1 faulty
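
The simple datapath case can be sketched as an intersection over the hardware used by divergent instructions. This is a minimal Python sketch under stated assumptions, not the TBFD algorithm itself: `TraceEntry`, `divergent_indices`, `candidate_units`, and the resource labels such as "alu1" are hypothetical, and the traces are assumed to be aligned instruction by instruction.

```python
from typing import FrozenSet, List, NamedTuple, Set


class TraceEntry(NamedTuple):
    instr: str               # instruction text, for readability only
    hw_used: FrozenSet[str]  # hypothetical labels, e.g. {"dec0", "alu1"}
    result: int              # value written to the destination register


def divergent_indices(faulty: List[TraceEntry], good: List[TraceEntry]) -> List[int]:
    """Positions where the faulty replay disagrees with the fault-free one."""
    return [i for i, (f, g) in enumerate(zip(faulty, good)) if f.result != g.result]


def candidate_units(faulty: List[TraceEntry], good: List[TraceEntry]) -> Set[str]:
    """Intersect the hardware used by every divergent instruction."""
    suspects = None
    for i in divergent_indices(faulty, good):
        used = set(faulty[i].hw_used)
        suspects = used if suspects is None else suspects & used
    return suspects or set()
```

If two divergent instructions share only one unit, that unit is the sole candidate. This intersection is too naive for faults such as the RAT case, where the divergent instruction itself may not use the faulty hardware.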

21 Can a Divergent Instruction Lead to Diagnosis?
Complex example: Fault in register alias table (RAT) entry
[Figure: RAT (logical-to-physical register map) and register file (physical register values) for I_A: r3 ← r2 + r2 and I_B: r1 ← r5 * r2; an error appears in a RAT mapping; fault-free r1 = 12. Diverged! But I_B does not use faulty HW…]
Divergent instructions do not directly lead to faulty unit
Instead, look backward/forward in instruction stream
–Need to collect and analyze instruction trace

22 Diagnosing Permanent Fault to µarch Granularity
Trace-based fault diagnosis (TBFD)
–Compare instruction trace of faulty vs. good execution
–Divergence → faulty hardware used → diagnosis clues
Diagnose faults to µarch units of processor
–Check µarch-level invariants in several parts of processor
–Front end, meta-datapath, datapath faults
–Diagnosis in out-of-order logic (meta-datapath) complex
Results
–98% of the faults detected by SWAT successfully diagnosed
–TBFD flexible for other detectors/granularity of repair

23 SWAT
1. Detectors w/ Hardware support [ASPLOS ’08]
2. Detectors w/ Software support [Sahoo et al., DSN ’08]
3. Trace-Based Fault Diagnosis [Li et al., DSN ’08]
4. Accurate Fault Modeling
[Figure: timeline with labels Checkpoint, Fault, Error, Symptom detected, Recovery, Diagnosis, Repair]

24 SWAT-Sim: Fast and Accurate Fault Models
Need accurate µarch-level fault models
–Gate-level injections accurate but too slow
–µarch (latch) level injections fast but inaccurate
Can we achieve µarch-level speed at gate-level accuracy?
Mixed-mode (hierarchical) simulation
–µarch-level + gate-level simulation
–Simulate only faulty component at gate level, on demand
–Invoke gate-level sim online for permanent faults
→ Simulating fault effect with real-world vectors

25 SWAT-Sim: Gate-level Accuracy at µarch Speeds
[Figure: µarch-level simulation executes r3 ← r1 op r2. Faulty unit used? No: continue µarch simulation. Yes: the inputs are passed as stimuli to gate-level fault simulation, and the response (fault propagated to output) is returned as the output r3 before µarch simulation continues]
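
The on-demand hand-off between the two simulation levels can be sketched in Python. This is a toy illustration, not SWAT-Sim: an 8-bit ripple-carry adder evaluated gate by gate stands in for the gate-level simulator, and the names `gate_level_add`, `execute_add`, the unit labels, and the `(bit, value)` stuck-at encoding are all assumptions.

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1


def gate_level_add(a, b, stuck_carry=None):
    """Ripple-carry add, one full adder per bit.

    stuck_carry=(bit, value) forces the carry-out net of that full adder to
    `value`, modeling a stuck-at fault on an internal net (toy fault model).
    """
    carry = 0
    result = 0
    for i in range(WIDTH):
        x = (a >> i) & 1
        y = (b >> i) & 1
        result |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
        if stuck_carry is not None and stuck_carry[0] == i:
            carry = stuck_carry[1]
    return result


def execute_add(a, b, unit, faulty_unit, fault):
    """Execute an add on `unit`, using the slow model only when needed."""
    if unit != faulty_unit:
        # Fast path: fault-free units stay at the µarch level.
        return (a + b) & MASK
    # Slow path: hand the real operands to the gate-level model.
    return gate_level_add(a, b, stuck_carry=fault)
```

With the carry out of bit 0 stuck at 0, `1 + 1` on the faulty unit yields 0, while `2 + 4` still yields 6 because those operands never exercise the faulty net. This input dependence is what the slide means by simulating the fault effect with real-world vectors.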

26 Results from SWAT-Sim
SWAT-Sim implemented within full-system simulation
–NCVerilog + VPI for gate-level sim of ALU/AGEN modules
SWAT-Sim: High accuracy at low overheads
–100,000x faster than gate-level, same modeling fidelity
–2x slowdown over µarch-level, at higher accuracy
Accuracy of µarch models using SWAT coverage/latency
–µarch stuck-at models generally inaccurate
–Differences in activation rate, multi-bit flips
Complex manifestations → Hard to derive better models
–Need SWAT-Sim, at least for now

27 SWAT Summary
SWAT: SoftWare Anomaly Treatment
–Handle all and only faults that matter
–Low, amortized overheads
–Holistic systems view enables novel solutions
–Customizable and flexible
Prior results:
–Low-cost h/w detectors gave high coverage, low SDC rate
This talk:
–iSWAT: Higher coverage w/ software-assisted detectors
–TBFD: µarch-level fault diagnosis by synthesizing DMR
–SWAT-Sim: Gate-level fault accuracy at µarch-level speed

28 Future Work
Recovery: hybrid, application-specific
Aggressive use of software reliability techniques
–Leverage diagnosis mechanism
Multithreaded software
Off-core faults
Post-silicon debug and test
–Use faulty trace as fault-model oblivious test vector
Validation on FPGA (w/ Michigan)
Hardware assertions to complement software symptoms

BACKUP SLIDES

30 Breakdown of Detections by SW symptoms
98% unmasked faults detected within 10M instr (w/o FPU)
–Need HW support or SW monitoring for FPU

31 SW Components Corrupted
66% of faults corrupt system state before detection
–Need to recover system state

32 Latency from Application mismatch
86% of faults detected under 100k
–42% detected under 10k

33 Latency from OS mismatch
99% of faults detected under 100k

34 iSWAT implementation
[Figure: Training phase. Application → compiler pass in LLVM → application with invariant monitoring code; runs on test, train, and external inputs produce ranges for i/p #1 … i/p #n, which are combined into invariant ranges. Fault detection phase. Application → compiler pass in LLVM → application with invariant checking code → full-system simulation on ref input with injected faults; an invariant violation goes to SWAT diagnosis, which reports either a false positive (disable invariant) or a fault detection]

35 Trace-Based Fault Diagnosis (TBFD)
[Figure: permanent fault detected → invoke diagnosis. Faulty core: roll back to checkpoint, replay execution, collect µarch info (faulty trace). Fault-free core: load checkpoint, fault-free instruction exec (test trace). The traces are compared (=?) and passed to TBFD, which covers faults in front-end, meta-datapath faults, and datapath faults]

36 Fault Diagnosability
98% of detected faults are diagnosed
–89% diagnosed to unique unit/array entry
–Meta-datapath faults in out-of-order exec mislead TBFD

37 Accuracy of existing Fault Models
SWAT-Sim implemented within full-system simulator
–NCVerilog + VPI to simulate gate-level ALU and AGEN
Existing µarch-level fault models inaccurate
–Differences in activation rate, multi-bit flips
Accurate models hard to derive → need SWAT-Sim!

38 Summary: SWAT Advantages
Handles all faults that matter
–Oblivious to low-level failure modes & masked faults
Low, amortized overheads
–Optimize for common case, exploit s/w reliability solutions
Holistic systems view enables novel solutions
–Invariant detectors use diagnosis mechanisms
–Diagnosis uses recovery mechanisms
Customizable and flexible
–Firmware-based control affords hybrid, app-specific recovery (TBD)
Beyond hardware reliability
–SWAT treats hardware faults as software bugs
→ Long-term goal: unified system (hw + sw) reliability at lowest cost
–Potential applications to post-silicon test and debug

39 Transients Results
6400 transient faults injected across 8 structures
83% unmasked faults detected within 10M instr
Only 0.4% of injected faults result in SDCs
