SWAT: Designing Resilient Hardware by Treating Software Anomalies


1 SWAT: Designing Resilient Hardware by Treating Software Anomalies
Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup K. Sahoo, Siva Kumar Sastry Hari, Rahmet Ulya Karpuzcu, Sarita Adve, Vikram Adve, Yuanyuan Zhou
Department of Computer Science, University of Illinois at Urbana-Champaign

2 Motivation
- Hardware failures will happen in the field
  - Aging, soft errors, inadequate burn-in, design defects, …
  - ⇒ Need in-field detection, diagnosis, recovery, and repair
- The reliability problem is pervasive across many markets
  - Traditional redundancy (e.g., nMR) is too expensive
  - Piecemeal solutions for specific fault models are too expensive
  - Must incur low area, performance, and power overhead
- Today: a low-cost solution for multiple failure sources

3 Observations
- Need to handle only hardware faults that propagate to software
- The fault-free case remains the common case and must be optimized
- ⇒ Watch for software anomalies (symptoms)
  - Hardware fault detection ~ software bug detection
  - Zero- to low-overhead "always-on" monitors
- Diagnose the cause after a symptom is detected
  - May incur high overhead, but rarely invoked
- ⇒ SWAT: SoftWare Anomaly Treatment

4 SWAT Framework Components
- Detection: symptoms of software misbehavior, minimal backup hardware
- Recovery: hardware/software checkpoint and rollback
- Diagnosis: rollback/replay on a multicore
- Repair/reconfiguration: redundant, reconfigurable hardware
- Flexible control through firmware
[Diagram: Fault → Error → Symptom detected → Recovery → Diagnosis → Repair, with checkpoint support]

5 SWAT Framework Components
1. Detectors w/ simple hardware [ASPLOS ’08]
2. Detectors w/ compiler support [DSN ’08a]
3. Trace-Based Fault Diagnosis [DSN ’08b]
4. Accurate Fault Models [HPCA ’09]
[Diagram: Fault → Error → Symptom detected → Recovery → Diagnosis → Repair, with checkpoint support]

6 Simple Hardware-Only Symptom-Based Detection
- Observe anomalous symptoms for fault detection
  - Incur low overheads for "always-on" detectors
  - Minimal support from hardware, no software support
- Fatal traps generated by hardware (division by zero, RED state, etc.)
- Hangs detected using a simple hardware hang detector
- High OS activity detected with a performance counter
  - Typical OS invocations take 10s or 100s of instructions
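The two "always-on" monitors above can be sketched in software. This is an illustrative model, not the SWAT hardware: the class name, callbacks, and threshold values are all assumptions chosen for clarity.

```python
class SymptomDetectors:
    """Toy model of two always-on symptom detectors: a hang detector
    (no instruction retires for too long) and a high-OS-activity
    detector (too many consecutive privileged instructions).
    Thresholds are illustrative, not SWAT's actual values."""

    def __init__(self, hang_window=100_000, os_window=50_000):
        self.hang_window = hang_window      # cycles without a retire -> hang
        self.os_window = os_window          # contiguous OS-mode instrs -> anomaly
        self.cycles_since_retire = 0
        self.privileged_run = 0

    def on_cycle(self, retired):
        """Called every cycle; `retired` is True if an instruction retired.
        Returns True when the hang symptom fires."""
        self.cycles_since_retire = 0 if retired else self.cycles_since_retire + 1
        return self.cycles_since_retire >= self.hang_window

    def on_retire(self, privileged):
        """Called per retired instruction; `privileged` marks OS-mode
        execution. Returns True when the high-OS-activity symptom fires."""
        self.privileged_run = self.privileged_run + 1 if privileged else 0
        return self.privileged_run >= self.os_window
```

In hardware both detectors reduce to a saturating counter and a comparator, which is why they fit the "zero to low overhead" requirement.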

7 Experimental Methodology
- Microarchitecture-level fault injection
  - GEMS out-of-order timing models + Simics full-system simulation
  - SPEC apps on OpenSolaris, UltraSPARC III ISA
- Fault model
  - Stuck-at and bridging faults in latches of 8 architectural structures
  - 12,800 fault injections (< 0.3% margin of error at 95% confidence)
  - Also studied transients, but this talk focuses on permanent faults
- Simulate the impact of each fault in detail (timing simulation) for 10M instructions
  - If no symptom within 10M instructions, run to completion in functional simulation
  - Outcomes: fault masked by the app, symptom after 10M instructions, or silent data corruption (SDC)
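The latch-level fault models used above can be stated concretely. A minimal sketch, assuming integer-encoded latch values; the function names and the dominant-bridging variant are illustrative, not the exact models from the study:

```python
def stuck_at(value, bit, stuck_to):
    """Permanent stuck-at fault: force `bit` of a latch value to a
    constant (1 for stuck-at-1, 0 for stuck-at-0) on every access."""
    if stuck_to:
        return value | (1 << bit)     # stuck-at-1: OR in the bit
    return value & ~(1 << bit)        # stuck-at-0: mask the bit out

def bridge(value, bit_a, bit_b):
    """Simple dominant bridging fault: the value on bit_a overwrites
    bit_b, modeling a short between adjacent latch bits."""
    a = (value >> bit_a) & 1
    return stuck_at(value, bit_b, a)
```

In µarch-level injection such a transform is applied to the chosen latch every cycle, so the fault is persistent but only corrupts state when the structure is actually exercised.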

8 Efficacy of Simple HW-Only Detectors – Coverage
- Permanent faults: 98% of unmasked faults detected within 10M instructions (w/o FPU)
- 0.4% of injected faults result in SDC (w/o FPU)
- Need hardware support or other monitors for the FPU

9 Latency to Detection from Software Corruption
- 88% detected within 100K instructions, the rest within 10M instructions
- Can use hardware recovery methods such as SafetyNet and ReVive

10 Conclusions So Far
- The SWAT approach is feasible and attractive
- Very low-cost hardware detectors are already effective
  - 98% coverage, only 0.4% SDC for 7 of 8 structures
- Next: can we get even better coverage, especially a lower SDC rate?

11 SWAT Framework Components
1. Detectors w/ simple hardware [ASPLOS ’08]
2. Detectors w/ compiler support [DSN ’08a]
3. Trace-Based Fault Diagnosis [DSN ’08b]
4. Accurate Fault Models [HPCA ’09]
[Diagram: Fault → Error → Symptom detected → Recovery → Diagnosis → Repair, with checkpoint support]

12 Improving SWAT Detection Coverage
- Can we improve coverage and the SDC rate further?
  - SDC faults primarily corrupt data values
  - Illegal control/address values are caught by other symptoms
- Need detectors that capture "semantic" information
  - Software-level invariants capture program semantics
  - Use when higher coverage is desired
- Sound program invariants ⇒ expensive static analysis
- We use likely program invariants instead

13 Likely Program Invariants
- Hold on all observed inputs, expected to hold on others
  - But suffer from false positives
  - Use SWAT diagnosis to detect false positives on-line
- iSWAT invariant detectors
  - Range-based value invariants [Sahoo et al. DSN ’08]
  - Check MIN ≤ value ≤ MAX on data values
  - Disable an invariant when it is diagnosed as a false positive
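The train/check/disable cycle for range-based invariants can be sketched as follows. A minimal illustration: the class, the string program-point keys, and the dict-based store are assumptions; iSWAT actually emits this logic as compiler-inserted code.

```python
class RangeInvariants:
    """Range-based likely invariants: training records MIN/MAX per
    monitored program point; detection checks MIN <= value <= MAX;
    diagnosed false positives disable the offending invariant."""

    def __init__(self):
        self.ranges = {}       # program point -> (min, max) seen in training
        self.disabled = set()  # invariants turned off after a false positive

    def train(self, point, value):
        """Widen the observed range at `point` to include `value`."""
        lo, hi = self.ranges.get(point, (value, value))
        self.ranges[point] = (min(lo, value), max(hi, value))

    def check(self, point, value):
        """Return True if `value` violates the trained range (a symptom)."""
        if point in self.disabled or point not in self.ranges:
            return False
        lo, hi = self.ranges[point]
        return not (lo <= value <= hi)

    def disable(self, point):
        """Diagnosis found a false positive: turn this invariant off."""
        self.disabled.add(point)
```

A violation is only a *likely* fault: it is handed to SWAT diagnosis, which either confirms a hardware fault or calls `disable` so the overly tight range never fires again.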

14 iSWAT Implementation – Training Phase
[Diagram: an LLVM compiler pass instruments the application with invariant monitoring code; running it on test, train, and external inputs produces per-input ranges (range i/p #1 … range i/p #n) that are merged into the final invariant ranges.]

15 iSWAT Implementation – Fault Detection Phase
[Diagram: the training phase (as above) produces invariant ranges from test, train, and external inputs. In the fault detection phase, a second LLVM compiler pass inserts invariant checking code; the application runs on the ref input under full-system simulation with injected faults. An invariant violation signals fault detection; SWAT diagnosis separates true detections from false positives, and a false positive disables the offending invariant.]

16 iSWAT Results
- Evaluated iSWAT on 5 apps w/ the previous methodology
- Key results
  - Undetected faults reduced by 30%
  - SDCs reduced by 73% (33 to 9)
- Runtime overhead: 5% on x86, 14% on UltraSPARC IIIi
  - Can be further reduced with optimized invariants
- Exploring more sophisticated invariants to raise coverage and lower overhead

17 SWAT Framework Components
1. Detectors w/ simple hardware [ASPLOS ’08]
2. Detectors w/ compiler support [DSN ’08a]
3. Trace-Based Fault Diagnosis [DSN ’08b]
4. Accurate Fault Models [HPCA ’09]
[Diagram: Fault → Error → Symptom detected → Recovery → Diagnosis → Repair, with checkpoint support]

18 Fault Diagnosis
- Symptom-based detection is cheap, but
  - High latency from fault activation to detection
  - Difficult to diagnose the root cause of a fault
- How to diagnose a s/w bug vs. a transient vs. a permanent fault?
- For a permanent fault within a core
  - Disable the entire core? Wasteful!
  - Disable/reconfigure a µarch-level unit? How to diagnose faults at µarch-unit granularity?
- Key ideas
  - Single-core fault model + multicore ⇒ a fault-free core is available
  - Checkpoint/replay for recovery ⇒ replay on a good core and compare
  - Synthesizing DMR, but only for diagnosis

19 SW Bug vs. Transient vs. Permanent
- Rollback/replay on the same/different core; watch whether the symptom reappears
- Rollback on the faulty core
  - No symptom ⇒ transient h/w fault or non-deterministic s/w bug ⇒ continue execution
  - Symptom ⇒ permanent h/w fault, deterministic s/w bug, or false positive (iSWAT) ⇒ rollback/replay on a good core
    - Symptom ⇒ false positive (iSWAT) or deterministic s/w bug (send to s/w layer)
    - No symptom ⇒ permanent h/w fault, needs repair!
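The decision tree above reduces to two replay experiments. A sketch of that logic, assuming two callbacks that re-execute from a checkpoint and report whether the symptom reappears (names and return strings are illustrative):

```python
def diagnose(replay_on_faulty_core, replay_on_good_core):
    """Classify a detected symptom via rollback/replay. Each callback
    replays from the last checkpoint and returns True if the symptom
    reappears on that core."""
    if not replay_on_faulty_core():
        # Did not recur even on the suspect core: not deterministic.
        return "transient h/w fault or non-deterministic s/w bug: continue"
    if replay_on_good_core():
        # Recurs on a known-good core too: the symptom follows the
        # software, not the hardware.
        return "false positive (iSWAT) or deterministic s/w bug: send to s/w layer"
    # Deterministic on the faulty core only: blame the hardware.
    return "permanent h/w fault: repair"
```

Note the second replay is only invoked when the first one recurs, so the expensive cross-core experiment stays off the common path.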

20 Diagnosis Framework
[Diagram: Symptom detected → Diagnosis → software bug, transient fault, or permanent fault; a permanent fault proceeds to microarchitecture-level granularity diagnosis, which outputs "Unit X is faulty".]

21 Trace-Based Fault Diagnosis (TBFD)
[Diagram: a detected permanent fault invokes TBFD, whose diagnosis algorithm compares (=?) the faulty core's execution against a fault-free core's execution.]

22 Trace-Based Fault Diagnosis (TBFD)
[Diagram, continued: TBFD rolls the faulty core back to a checkpoint and replays execution, collecting information to compare against the fault-free core's execution.]

23 Trace-Based Fault Diagnosis (TBFD)
[Diagram, continued: the checkpoint is also loaded on the fault-free core for fault-free instruction execution. Open questions: what info to collect, what info to compare, and what to do on divergence?]

24 Trace-Based Fault Diagnosis (TBFD)
[Diagram, complete: the faulty core replays from the checkpoint, collecting µarch info (the faulty trace); the fault-free core executes the same instructions (the test trace); on divergence (=?), TBFD synchronizes state and the diagnosis algorithm examines 1. the front-end, 2. the meta-datapath, 3. the datapath.]
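The comparison step can be sketched as follows. A simplified illustration, not the full TBFD algorithm: the per-instruction record format and voting by frequency are assumptions, and real TBFD also resynchronizes the faulty core's state after each divergence.

```python
from collections import Counter

def tbfd(faulty_trace, golden_trace):
    """Compare per-instruction results from the faulty core against a
    fault-free replay and blame the µarch resource used most often by
    mismatching instructions. Each trace entry is a tuple:
    (retired result value, [µarch resources the faulty core used])."""
    blame = Counter()
    for (faulty_val, resources), (good_val, _) in zip(faulty_trace, golden_trace):
        if faulty_val != good_val:       # divergence between the traces
            blame.update(resources)      # suspect every resource involved
    # Most frequently implicated resource, or None if traces matched.
    return blame.most_common(1)[0][0] if blame else None
```

This reflects the key idea on the slide: corrupted instructions are correlated with the µarch resources they flowed through, so a permanent fault shows up as one resource dominating the blame counts.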

25 Diagnosis Results
- 98% of detected faults are diagnosed
- 89% diagnosed to a unique unit/array entry
- Meta-datapath faults in out-of-order execution can mislead TBFD

26 SWAT Framework Components
1. Detectors w/ simple hardware [ASPLOS ’08]
2. Detectors w/ compiler support [DSN ’08a]
3. Trace-Based Fault Diagnosis [DSN ’08b]
4. Accurate Fault Models [HPCA ’09]
[Diagram: Fault → Error → Symptom detected → Recovery → Diagnosis → Repair, with checkpoint support]

27 SwatSim: Fast and Accurate Fault Modeling
- Need accurate µarch-level fault models
  - Gate-level injections are accurate but too slow
  - µarch (latch) level injections are fast but inaccurate
- Can we achieve µarch-level speed at gate-level accuracy?
- SwatSim – hierarchical (mixed-mode) simulation
  - Simulate mostly at the µarch level
  - Simulate only the faulty component at the gate level, on demand
  - Invoke gate-level simulation online for permanent faults, simulating the fault's effect with real-world input vectors
  - Used OpenSPARC RTL models

28 SwatSim: Gate-Level Accuracy at µarch Speeds
[Diagram: during µarch simulation of r3 ← r1 op r2, if the faulty unit is used, its inputs are sent as stimuli to a gate-level fault simulation of that unit; if the fault propagates to the output, the corrupted response is written back to r3; otherwise (unit not used, or fault masked) µarch-level simulation continues unaffected.]
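The mixed-mode dispatch in the diagram can be sketched in a few lines. An illustrative model only: the operation table and the stand-in `gate_level_sim` callable are assumptions; in SwatSim the slow path is real RTL simulated through a Verilog interface.

```python
def execute_op(op, a, b, faulty_unit, gate_level_sim):
    """Mixed-mode execution of r3 = a op b: route the operation through
    a gate-level fault simulation only when it exercises the faulty
    unit; otherwise stay on the fast µarch-level path."""
    if op == faulty_unit:
        # Slow path, taken on demand: gate-level simulation of just the
        # faulty component, with the live operands as stimuli.
        return gate_level_sim(op, a, b)
    # Fast path: fault-free units are modeled at the µarch level.
    return {"add": a + b, "mul": a * b}[op]

# Stand-in gate-level model (assumed for illustration): an adder whose
# output bit 0 is stuck at 1.
stuck_adder = lambda op, a, b: (a + b) | 1
```

Because the slow path fires only when the faulty unit is actually used, the overall run stays near µarch-simulation speed, which is the source of the 100,000x speedup over full gate-level simulation claimed on the next slide.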

29 Results from SwatSim
- SwatSim implemented within full-system simulation
  - GEMS + Simics for µarch simulation; NCVerilog + VPI for gate-level simulation
  - ALU and AGEN units from OpenSPARC models
- Performance overhead
  - 100,000x faster than gate-level full-processor simulation
  - 2x slowdown over µarch-level simulation
- Accuracy of µarch fault models, measured via SWAT coverage/latency
  - Compared µarch stuck-at faults with SwatSim stuck-at and delay faults
  - µarch fault models are generally inaccurate; accuracy varies with structure and fault model
  - Differences stem from activation rates and multi-bit flips
  - Attempts to derive more accurate µarch fault models were unsuccessful
- ⇒ Need SwatSim, at least for now

30 Summary – SWAT Works!
1. Detectors w/ simple hardware [ASPLOS ’08]
2. Detectors w/ compiler support [DSN ’08a]
3. Trace-Based Fault Diagnosis [DSN ’08b]
4. Accurate Fault Models [HPCA ’09]
[Diagram: Fault → Error → Symptom detected → Recovery → Diagnosis → Repair, with checkpoint support]

31 Summary: SWAT Advantages
- Handles all faults that matter
  - Oblivious to low-level failure modes and masked faults
- Low, amortized overheads
  - Optimize for the common case; exploit s/w reliability solutions
- Holistic system view enables novel, synergistic solutions
  - Invariant detectors use the diagnosis mechanisms
  - Diagnosis uses the recovery mechanisms
- Customizable and flexible
  - Firmware control can adapt to specific reliability needs, e.g., hybrid, app-specific recovery (TBD)
- Beyond hardware reliability
  - SWAT treats hardware faults as software bugs
  - Long-term goal: unified system (h/w + s/w) reliability at the lowest cost
  - Potential applications to post-silicon test and debug

32 Ongoing and Future Work
- Complete SWAT system implementation
  - Recovery and firmware control w/ the OpenSPARC hypervisor/OS
  - Multithreaded software on multicore: initial results promising
- More aggressive detection
  - More aggressive software reliability techniques
  - H/W assertions to complement software (w/ Shobha Vasudevan)
- Modeling
  - Comprehensive SwatSim w/ OpenSPARC RTL for more h/w modules
  - Off-core faults
- Validation on FPGA (w/ Michigan) using a Leon-based system
  - Would be nice to have a state-of-the-art multicore SPARC system
- Post-silicon debug and test
  - Engagements with Sun: student summer intern w/ Dr. Ishwar Parulkar, teleconferences, visits

