Presentation on theme: "SWAT: Designing Resilient Hardware by Treating Software Anomalies Byn Choi, Siva Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup Sahoo, Sarita Adve, Vikram Adve, Shobha Vasudevan, Yuanyuan Zhou"— Presentation transcript:

1 SWAT: Designing Resilient Hardware by Treating Software Anomalies Byn Choi, Siva Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup Sahoo, Sarita Adve, Vikram Adve, Shobha Vasudevan, Yuanyuan Zhou Department of Computer Science University of Illinois at Urbana-Champaign swat@cs.uiuc.edu

2 Motivation
Hardware will fail in the field for several reasons: transient errors (high-energy particles), wear-out (weaker devices), design bugs, and so on
⇒ Need in-field detection, diagnosis, recovery, repair
Reliability problem pervasive across many markets
–Traditional redundancy solutions (e.g., nMR) too expensive
⇒ Need low-cost solutions for multiple failure sources
⇒ Must incur low area, performance, power overhead

3 Observations
Need to handle only hardware faults that propagate to software
Fault-free case remains common, must be optimized
⇒ Watch for software anomalies (symptoms)
–Zero to low overhead “always-on” monitors
Diagnose cause after symptom detected
–May incur high overhead, but rarely invoked
⇒ SWAT: SoftWare Anomaly Treatment

4 SWAT Framework Components
Detection: symptoms of software misbehavior
Recovery: checkpoint and rollback
Diagnosis: rollback/replay on multicore
Repair/reconfiguration: redundant, reconfigurable hardware
Flexible control through firmware
[Framework diagram: Fault → Error → Symptom detected, with Checkpoint, Recovery, Diagnosis, Repair stages]

5 Advantages of SWAT
Handles all faults that matter
–Oblivious to low-level failure modes and masked faults
Low, amortized overheads
–Optimize for common case, exploit SW reliability solutions
Customizable and flexible
–Firmware control adapts to specific reliability needs
Holistic systems view enables novel solutions
–Synergistic detection, diagnosis, recovery solutions
Beyond hardware reliability
–Long-term goal: unified system (HW+SW) reliability
–Potential application to post-silicon test and debug

6 SWAT Contributions
[Annotated framework diagram (Fault → Error → Symptom detected; Checkpoint, Recovery, Diagnosis, Repair) labeled with the contributions:]
–Very low-cost detectors [ASPLOS’08, DSN’08]: low SDC rate, low latency
–In-situ diagnosis [DSN’08]
–Accurate fault modeling [HPCA’09]
–Multithreaded workloads [MICRO’09]
–Application-Aware SWAT: even lower SDC rate, latency

7 This Talk
[Same annotated framework diagram as the previous slide, indicating which contributions the talk covers: very low-cost detectors, in-situ diagnosis, accurate fault modeling, multithreaded workloads, Application-Aware SWAT]

8 Outline
–Introduction to SWAT
–SWAT Detection
–SWAT Diagnosis
–Analysis of Recovery in SWAT
–Conclusions
–Future work

9 Fault Detection w/ HW Detectors [ASPLOS ’08]
Simple HW-only detectors to observe anomalous SW behavior
–Minimal hardware area ⇒ low-cost detectors
–Incur near-zero perf overhead in fault-free operation
–Require no changes to SW
Detectors:
–Fatal Traps: division by zero, RED state, etc.
–Kernel Panic: OS enters panic state due to fault
–High OS: high contiguous OS activity
–Hangs: simple HW hang detector
–App Abort: application abort due to fault
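
The transcript describes these monitors only at a high level. As a rough illustration (C++, not the published SWAT hardware), the sketch below models two of them as per-retired-instruction hooks: a hang heuristic that fires when one branch PC dominates a window of execution, and a High-OS heuristic that fires on unusually long contiguous kernel-mode runs. The class name, window size, and thresholds are invented placeholders.

```cpp
// Illustrative sketch of two SWAT-style anomaly monitors, modeled as
// simulator hooks. Thresholds and window sizes are made-up placeholders;
// the published detectors differ in their exact heuristics.
#include <cstdint>
#include <unordered_map>

class AnomalyMonitors {
 public:
  // Called once per retired instruction by the simulator/hardware model.
  // Returns true when a symptom should be raised to the SWAT firmware.
  bool on_retire(uint64_t pc, bool is_branch, bool in_kernel_mode) {
    bool symptom = false;

    // Hang heuristic: within a window, does a single branch PC account for
    // an implausibly large share of execution?
    if (is_branch && ++branch_count_[pc] > kHangThreshold) symptom = true;
    if (++window_retired_ == kWindowSize) {  // end of window: reset counters
      branch_count_.clear();
      window_retired_ = 0;
    }

    // High-OS heuristic: too many *contiguous* kernel-mode instructions,
    // far beyond a normal syscall or interrupt service routine.
    contiguous_os_ = in_kernel_mode ? contiguous_os_ + 1 : 0;
    if (contiguous_os_ > kHighOsThreshold) symptom = true;

    return symptom;
  }

 private:
  static constexpr uint64_t kWindowSize = 100000;      // placeholder
  static constexpr uint64_t kHangThreshold = 50000;    // placeholder
  static constexpr uint64_t kHighOsThreshold = 30000;  // placeholder
  std::unordered_map<uint64_t, uint64_t> branch_count_;
  uint64_t window_retired_ = 0;
  uint64_t contiguous_os_ = 0;
};
```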

10 Fault Detection w/ SW-assisted Detectors
Simple HW detectors effective, require no SW changes
SW-assisted detectors to augment HW detectors
–Minimal changes to SW for more effective detectors
–Amortize resiliency cost with SW bug detection
Explored two simple SW-assisted schemes
–Detecting out-of-bounds addresses
 ⇒ Low HW overhead, near-zero impact on performance
–Using likely program invariants
 ⇒ Instrumented binary, no HW changes
 ⇒ <5% performance overhead on x86 processors

11 Fault Detection w/ SW-assisted Detectors
Address out-of-bounds detector
–Monitor boundaries of heap, stack, globals
–Address beyond these bounds ⇒ HW fault
–HW-only detectors catch such faults only at longer latency
iSWAT: using likely program invariants to detect HW faults
–Mine “likely” invariants on data values
 E.g., 10 ≤ x ≤ 20: holds on observed inputs, expected to hold on others
–Violation of likely invariant ⇒ HW fault
–Useful to detect faults that affect only data
 iSWAT not explored in this talk [Sahoo et al., DSN ‘08]
[Figure: application address space – app code, globals, heap, stack, libraries, empty, reserved]
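
To make the iSWAT idea concrete, here is a hypothetical example of the kind of range check an instrumented binary could embed for a mined likely invariant such as 10 ≤ x ≤ 20. The variable, bounds, and report_swat_symptom() helper are illustrative, not taken from the iSWAT implementation.

```cpp
// Sketch of an iSWAT-style likely-invariant check: a range invariant mined
// offline is compiled into the binary as a cheap comparison; a violation
// raises a symptom that SWAT then treats (HW fault vs. false invariant).
#include <cstdio>
#include <cstdlib>

static void report_swat_symptom(const char* what) {
  // In a real system this would trap to the SWAT firmware; here we just log.
  std::fprintf(stderr, "likely-invariant violated: %s\n", what);
  std::abort();
}

// Invariant mined on training inputs for this program point: 10 <= x <= 20.
static inline void check_x_range(int x) {
  if (x < 10 || x > 20) report_swat_symptom("10 <= x <= 20");
}

int compute(int x) {
  check_x_range(x);   // check inserted by the instrumentation pass
  return x * 3 + 7;   // original application code
}

int main() {
  std::printf("%d\n", compute(15));  // within the mined range: no symptom
  return 0;
}
```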

12 Evaluating Fault Detection
Microarchitecture-level fault injection (latch elements)
–GEMS timing models + Simics full-system simulation
–All SPEC 2k C/C++ workloads on 64-bit OpenSolaris OS
Stuck-at, transient faults in 8 µarch units (single fault model)
–10,000 of each type ⇒ statistically significant
Simulate impact of each fault in detail for 10M instructions (timing simulation); if no symptom in 10M instr, run to completion in functional simulation ⇒ fault masked or Silent Data Corruption (SDC)
Metrics: SDC rate, detection latency
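
As a sketch of what latch-level injection means in practice (the actual campaign used the GEMS/Simics infrastructure, not this code), the snippet below forces or flips one bit of a chosen latch value: stuck-at faults corrupt the bit on every update, a transient flips it once. The FaultSite structure and inject() hook are invented for illustration.

```cpp
// Minimal model of latch-level fault injection for a simulation campaign.
#include <cstdint>

struct FaultSite {
  enum class Kind { StuckAt0, StuckAt1, TransientFlip } kind;
  unsigned bit;        // which bit of the latch to corrupt
  bool fired = false;  // transients fire only once
};

// Called by the timing model wherever the targeted latch value is produced.
uint64_t inject(uint64_t latch_value, FaultSite& f) {
  const uint64_t mask = uint64_t{1} << f.bit;
  switch (f.kind) {
    case FaultSite::Kind::StuckAt0: return latch_value & ~mask;
    case FaultSite::Kind::StuckAt1: return latch_value | mask;
    case FaultSite::Kind::TransientFlip:
      if (!f.fired) { f.fired = true; return latch_value ^ mask; }
      return latch_value;
  }
  return latch_value;
}
```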

13 SDC Rate of HW-only Detectors
Simple detectors give a 0.7% SDC rate for permanent faults
Faults in the FPU need better detectors
–Mostly corrupt only data ⇒ iSWAT may detect them

14 SDC Rate of HW-only Detectors
Transient faults also have a low SDC rate of 0.3%
High rate of masking for transients
–Consistent with prior work on transients

15 Application-Aware SDC Analysis
SDCs ⇒ undetected faults that corrupt only data values
–SWAT detectors catch other corruptions
–Most faults do not corrupt only data values
But some “SDCs” are actually acceptable outputs!
–Traditionally, SDC ⇒ output differs from fault-free output
–But different outputs may still be acceptable
 Different solutions, different solutions with degraded quality, etc.
 E.g., same-cost place & route, acceptable PSNR, etc.
SWAT detectors cannot detect acceptable changes in output
–For each app, define the acceptable % degradation in output quality

16 Application-Aware SDC Analysis
10/16 SPEC apps have multiple correct solutions (results shown for all)
App-aware analysis ⇒ remarkably low SDC rate for SWAT
–Only 28 faults show >0% degradation from golden output
–10 of >16,000 injected faults are SDCs at >1% degradation
Ongoing work: formalization of why/when SWAT works

17 Detection Latency
Detection latency dictates recoverability
–Fault is recoverable as long as a fault-free checkpoint exists
Traditional detection latency = arch state corruption to detection
–Checkpoint recording bad arch state ⇒ SW assumed affected
–But not all arch state corruptions affect SW output
New detection latency = SW state corruption to detection
[Timeline figure: fault → bad arch state (old latency begins) → bad SW state (new latency begins) → detection; checkpoints taken before SW state corruption remain recoverable]

18 Detection Latency
>98% of all faults detected within 10M instructions
–Recoverable using HW checkpoint schemes

19 Detection Latency
>98% of all faults detected within 10M instructions
–Recoverable using HW checkpoint schemes
Out-of-bounds detector further reduces detection latency
–Many of the longer-latency detections are address violations

20 Detection Latency
Measuring the new latency is important for studying recoverability
–Significant differences between old and new latency

21 Fault Detection – Summary
Simple detectors are effective in detecting HW faults
Low SDC rate even with HW-only detectors
Short detection latencies for hardware faults
–SW-assisted out-of-bounds detector reduces latency further
–Measuring the new detection latency is important for recovery
Next: diagnosis of detected faults

22 Fault Diagnosis
Symptom-based detection is cheap, but
–May incur long latency from activation to detection
–Difficult to diagnose root cause of fault
Goal: diagnose the fault with minimal hardware overhead
–Rarely invoked ⇒ higher perf overhead acceptable
[Figure: a detected symptom may stem from a SW bug, a transient fault, or a permanent fault]

23 SWAT Single-threaded Fault Diagnosis [Li et al., DSN ‘08]
First, diagnosis for a single-threaded workload on one core
–Multithreaded w/ multicore later – several new challenges
Key ideas
–Single-core fault model, multicore ⇒ fault-free core available
–Chkpt/replay for recovery ⇒ replay on good core, compare
–Synthesizing DMR, but only for diagnosis
[Figure: traditional DMR (P1, P2 always on ⇒ expensive) vs. synthesized DMR (fault-free P1; DMR only on a fault)]

24 SW Bug vs. Transient vs. Permanent
Rollback/replay on same/different core; watch if the symptom reappears
–Symptom detected ⇒ rollback on the faulty core
 No symptom on replay ⇒ transient or non-deterministic s/w bug ⇒ continue execution
 Symptom reappears ⇒ deterministic s/w bug or permanent h/w fault ⇒ rollback/replay on a good core
  Symptom reappears ⇒ deterministic s/w bug (send to s/w layer)
  No symptom ⇒ permanent h/w fault, needs repair!
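
The decision tree above can be written compactly as firmware-style pseudocode. The sketch below assumes a replay(core) hook that rolls the given core back to the last checkpoint, re-executes, and reports whether the symptom reappears; it mirrors the slide's logic rather than the actual SWAT firmware.

```cpp
#include <functional>

enum class Diagnosis { TransientOrNondetSwBug, DeterministicSwBug, PermanentHwFault };

// replay(core) stands in for the real checkpoint/replay hardware plus SWAT
// firmware: re-execute from the last checkpoint on `core`, return true if
// the symptom reappears.
Diagnosis diagnose(int faulty_core, int good_core,
                   const std::function<bool(int)>& replay) {
  if (!replay(faulty_core))
    return Diagnosis::TransientOrNondetSwBug;  // symptom gone: transient fault or non-det SW bug
  if (replay(good_core))
    return Diagnosis::DeterministicSwBug;      // reappears even on good HW: hand to SW layer
  return Diagnosis::PermanentHwFault;          // only the suspect core misbehaves: repair needed
}
```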

25 µarch-level Fault Diagnosis
[Figure: a detected symptom is first classified as a software bug, transient fault, or permanent fault; for a permanent fault, microarchitecture-level diagnosis identifies “unit X is faulty”]

26 Trace-Based Fault Diagnosis (TBFD)
µarch-level fault diagnosis using rollback/replay
Key: the execution that caused the symptom ⇒ its trace activates the fault
–Deterministically replay the trace on faulty and fault-free cores
–Divergence ⇒ faulty hardware used ⇒ diagnosis clues
Diagnose faults to µarch units of the processor
–Check µarch-level invariants in several parts of the processor
–Diagnosis in out-of-order logic (meta-datapath) is complex
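
A minimal sketch of the divergence check TBFD relies on, under simplifying assumptions: both replays are reduced to per-retired-instruction records, and the µarch resources used by the first diverging instruction on the suspect core are returned as diagnosis clues. The record format and resource IDs are placeholders, not the paper's data structures.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

struct RetiredInstr {
  uint64_t pc;
  uint64_t dest_value;               // architectural result of the instruction
  std::vector<int> uarch_resources;  // e.g., which ALU, physical register, ROB entry it used
};

// Compare the suspect core's replay against the fault-free core's replay and
// return the µarch resources used by the first diverging instruction (the
// diagnosis clues), or nothing if the compared window never diverges.
std::optional<std::vector<int>> first_divergence(
    const std::vector<RetiredInstr>& faulty_trace,
    const std::vector<RetiredInstr>& good_trace) {
  const std::size_t n = std::min(faulty_trace.size(), good_trace.size());
  for (std::size_t i = 0; i < n; ++i) {
    if (faulty_trace[i].pc != good_trace[i].pc ||
        faulty_trace[i].dest_value != good_trace[i].dest_value)
      return faulty_trace[i].uarch_resources;
  }
  return std::nullopt;
}
```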

27 Trace-Based Fault Diagnosis: Evaluation
Goal: diagnose faults at reasonable latency
Faults diagnosed in 10 SPEC workloads
–~8,500 detected faults (98% of unmasked)
Results
–98% of the detections successfully diagnosed
–91% diagnosed within 1M instr (~0.5 ms on a 2 GHz processor)

28 SWAT Multithreaded Fault Diagnosis [Hari et al., MICRO ‘09]
Challenge 1: deterministic replay involves high overhead
Challenge 2: multithreaded apps share data among threads
–Symptom-causing core may not be faulty
–No known fault-free core in the system
[Figure: a fault on core 2 propagates through a store/load in shared memory and triggers the symptom on fault-free core 1]

29 mSWAT Diagnosis – Key Ideas
Challenges ⇒ key ideas:
–Multithreaded applications, full-system deterministic replay ⇒ isolated deterministic replay
–No known good core ⇒ emulated TMR
[Figure: threads T_A–T_D running on cores A–D; each thread’s captured trace is replayed in isolation]

30 mSWAT Diagnosis – Key Ideas
Challenges ⇒ key ideas:
–Multithreaded applications, full-system deterministic replay ⇒ isolated deterministic replay
–No known good core ⇒ emulated TMR
[Figure: emulated TMR – the captured traces T_A–T_D are re-executed with rotated thread-to-core assignments (A runs T_D, B runs T_A, …), so each thread executes on multiple cores]

31 mSWAT Diagnosis: Evaluation
Diagnose detected permanent faults in multithreaded apps
–Goal: identify the faulty core, then TBFD for µarch-level diagnosis
–Challenges: non-determinism, no fault-free core known
–~4% of faults detected from a fault-free core
Results
–95% of detected faults diagnosed
 All detections from a fault-free core diagnosed
–96% of diagnosed faults require <200KB buffers
 Can be stored in lower-level cache ⇒ low HW overhead
SWAT diagnosis can work with other symptom detectors

32 SWAT Recovery
Recovery masks the effect of a fault for continuous operation
Checkpointing is “always-on” ⇒ must incur minimal overhead
–Low area overhead, minimal performance impact
SWAT symptom detection assumes checkpoint recovery
–Fault allowed to corrupt the architecture state
[Figure: recovery = checkpoint/replay (rollback to pristine state, re-execute) + I/O buffering (prevent irreversible effects)]

33 Components of Recovery
Checkpointing
–Periodic snapshot of registers, undo log for memory
–Restore register/memory state upon detection
I/O buffering
–External outputs buffered until known to be fault-free
–HW buffer records I/O until the next checkpoint interval
[Figure: register snapshots 1–3 with per-interval memory undo logs (old values of stores); an I/O buffer holds device output until it can be committed]
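
A toy model of this checkpoint/rollback scheme (register snapshot plus memory undo log), assuming a simplified flat memory; it is meant to show the mechanism, not the ReVive hardware.

```cpp
#include <array>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct Checkpoint {
  std::array<uint64_t, 32> regs{};                  // register snapshot
  std::vector<std::pair<uint64_t, uint64_t>> undo;  // (address, old value)
};

struct Machine {
  std::array<uint64_t, 32> regs{};
  std::unordered_map<uint64_t, uint64_t> mem;  // sparse stand-in for memory
  Checkpoint ckpt;

  // Start of a checkpoint interval: snapshot registers, clear the undo log.
  void take_checkpoint() { ckpt.regs = regs; ckpt.undo.clear(); }

  // Every store first logs the old value for the current interval.
  void store(uint64_t addr, uint64_t value) {
    ckpt.undo.emplace_back(addr, mem[addr]);
    mem[addr] = value;
  }

  // On symptom detection: undo memory in reverse order, restore registers.
  void rollback() {
    for (auto it = ckpt.undo.rbegin(); it != ckpt.undo.rend(); ++it)
      mem[it->first] = it->second;
    regs = ckpt.regs;
    ckpt.undo.clear();
  }
};
```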

34 Analysis of Recovery Overheads [Led by Alex Li]
Goal: measure overheads from checkpointing, I/O buffering
–Measured on 2 server applications – apache, sshd
–ReVive for chkpt, several techniques for I/O buffering
State-of-the-art incurs high overhead at short chkpt intervals
–>30% performance overhead at intervals of <1M cycles!
Long chkpt interval ⇒ I/O buffering incurs high HW overhead
–Checkpoint intervals of 10M ⇒ HW buffer of 100KB
Push-and-pull effect between recovery components
Ongoing work: SWAT recovery module with low overheads

35 Summary: SWAT works!
[Annotated framework diagram recapping the contributions: very low-cost detectors [ASPLOS’08, DSN’08] with low SDC rate and latency, in-situ diagnosis [DSN’08], accurate fault modeling [HPCA’09], multithreaded workloads [MICRO’09], and Application-Aware SWAT with even lower SDC and latency]

36 SWAT Advantages and Limitations
Advantages
–Handles all faults that matter, oblivious to failure modes
–Low, amortized overheads across HW/SW reliability
–Customizable and flexible due to firmware implementation
–Concepts applicable beyond hardware reliability
Limitations
–SWAT reliability guarantees largely empirical
–SWAT firmware, recovery module not yet ready
–Off-core faults, other fault models not evaluated

37 Future Work
–Formalization of when/why SWAT works
–Near-zero-cost recovery
–More server/distributed applications
–Other core and off-core parts, other fault models
–Prototyping SWAT on FPGA
 With T. Austin / V. Bertacco at University of Michigan

38 SWAT: Designing Resilient Hardware by Treating Software Anomalies Byn Choi, Siva Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup Sahoo, Sarita Adve, Vikram Adve, Shobha Vasudevan, Yuanyuan Zhou Department of Computer Science University of Illinois at Urbana-Champaign swat@cs.uiuc.edu

39 Backup Slides

40 Address Out-of-Bounds Detector
Address faults may result in long detection latencies
–Corrupted address unallocated but in a valid page
–Many data value corruptions before a symptom
Low-cost address out-of-bounds detector
–Can amortize cost across software bug detectors
–Bounds communicated to HW: compiler tells hardware, malloc reports to HW, limits recorded on function execution
[Figure: app address space from 0x0 to 0xffff… (2^64 – 1) – app code, globals, heap, stack, libraries, empty, reserved]
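
A simple software model of the bounds check, assuming the hardware holds one [lo, hi) range each for globals, heap, and stack (the real detector tracks additional regions such as code and libraries); any data address outside all tracked regions raises a symptom.

```cpp
#include <cstdint>

struct Region { uint64_t lo, hi; };  // half-open range [lo, hi)

struct BoundsRegisters {
  Region globals, heap, stack;  // kept up to date by compiler/allocator/runtime

  bool contains(uint64_t addr) const {
    auto in = [addr](const Region& r) { return addr >= r.lo && addr < r.hi; };
    return in(globals) || in(heap) || in(stack);
  }
};

// Conceptually checked on every load/store address; a miss raises a SWAT symptom.
bool address_out_of_bounds(const BoundsRegisters& b, uint64_t addr) {
  return !b.contains(addr);
}
```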

41 Permanent Faults: HW-only Detectors
Fatal traps and panics detect most faults
A large fraction of detections come from OS-level symptoms

42 Measuring Detection Latency
New detection latency = SW state corruption to detection
But identifying SW state corruption is hard!
–Need to know how the faulty value is used by the application
–If the faulty value affects output, then SW state is corrupted
Measure latency by rolling back to successively older checkpoints
–Only for analysis, not required in a real system
[Timeline figure: fault → bad arch state → bad SW state → detection; rollback & replay from each checkpoint – if the fault effect is masked on replay, SW state was not yet corrupted at that checkpoint]
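
One way to picture this analysis loop: walk backwards through the saved checkpoints, replay from each, and ask whether the run comes out clean; the newest clean checkpoint brackets where software state was corrupted. The replay_is_clean() hook below is hypothetical and stands in for the simulator machinery.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Returns the index of the newest checkpoint from which replay produces a
// clean run (fault effect masked, output correct), or -1 if even the oldest
// checkpoint is already tainted. SW state corruption therefore happened
// somewhere after the returned checkpoint.
int newest_clean_checkpoint(
    const std::vector<uint64_t>& checkpoint_cycles,
    const std::function<bool(uint64_t)>& replay_is_clean) {
  for (int i = static_cast<int>(checkpoint_cycles.size()) - 1; i >= 0; --i) {
    if (replay_is_clean(checkpoint_cycles[i]))
      return i;
  }
  return -1;
}
```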

43 Extending SWAT Diagnosis to Multithreaded Apps
Naïve extension – N known-good cores to replay the trace
–Too expensive – area
–Requires full-system deterministic replay
Simple optimization – one spare core
–Not scalable: requires N full-system deterministic replays
–Requires a spare core – single point of failure
[Figure: spare core S stands in for each core in turn; the symptom disappears only when S replaces C2 ⇒ faulty core is C2]

44 mSWAT Fault Diagnosis Algorithm
Symptom detected ⇒ capture the fault-activating trace ⇒ re-execute the captured trace ⇒ diagnosis
[Example figure: threads T_A–T_D running on cores A–D]

45 mSWAT Fault Diagnosis Algorithm
(Animation step: same content as the previous slide)

46 mSWAT Fault Diagnosis Algorithm
Symptom detected ⇒ capture the fault-activating trace ⇒ re-execute the captured trace ⇒ look for divergence ⇒ identify the faulty core ⇒ diagnosis
[Example figure: captured traces T_A–T_D are replayed with rotated core assignments; comparing which replays diverge and which do not isolates the fault to core B]

47 Recording the Fault-Activating Trace
What info to capture for deterministic isolated replay?
–Capture all inputs to the thread as the trace: record the data values of all loads
–Ensures isolated deterministic replay
–Isolated replay ⇒ lower overhead
[Flow figure: symptom detected → capture fault-activating trace → deterministic isolated replay → look for divergence → faulty core]

48 Comparing Deterministic Replays
How to identify divergence?
–Comparing all instructions ⇒ large buffer needed
–Faults propagate to SW through branches, loads, stores; other faults manifest in these instructions
–Record and compare only these instructions ⇒ lower HW buffer overhead
[Flow figure: symptom detected → capture fault-activating trace → deterministic isolated replay → look for divergence → faulty core]
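
A sketch of the per-thread buffers these two slides describe, with invented formats: a load-value log that makes isolated replay possible, plus a compact log of branch/load/store outcomes that the replay is compared against to locate divergence.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct TraceEntry {
  enum class Kind : uint8_t { Load, Store, Branch } kind;
  uint64_t addr_or_target;  // memory address, or branch target
  uint64_t value;           // loaded/stored value; 1/0 for branch taken
};

struct ThreadTrace {
  std::vector<uint64_t> load_values;  // inputs needed for isolated replay
  std::vector<TraceEntry> outcomes;   // what gets compared after replay

  // Native execution: log every load's value and every ld/st/br outcome.
  void record(const TraceEntry& e) {
    if (e.kind == TraceEntry::Kind::Load) load_values.push_back(e.value);
    outcomes.push_back(e);
  }
};

// During replay on another core, each ld/st/br outcome is checked in order;
// the index of the first mismatch is the divergence point used for diagnosis.
std::ptrdiff_t find_divergence(const ThreadTrace& original,
                               const std::vector<TraceEntry>& replay) {
  const std::size_t n = std::min(original.outcomes.size(), replay.size());
  for (std::size_t i = 0; i < n; ++i) {
    const TraceEntry& a = original.outcomes[i];
    const TraceEntry& b = replay[i];
    if (a.kind != b.kind || a.addr_or_target != b.addr_or_target ||
        a.value != b.value)
      return static_cast<std::ptrdiff_t>(i);
  }
  return -1;  // no divergence in the compared prefix
}
```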

49 mSWAT Diagnosis: Hardware Cost
Trace captured during native execution
–HW support for trace collection
Deterministic replay is firmware-emulated
–Requires minimal hardware support
–Replaying threads in isolation ⇒ no need to capture memory orderings
Long detection latency ⇒ large trace buffers (8MB/core)
–Iterative diagnosis algorithm reduces buffer overhead: repeatedly capture and replay short traces (e.g., 100,000 instructions)
[Flow figure: symptom detected → capture fault-activating trace → deterministic isolated replay → look for divergence → faulty core]

50 Results: mSWAT Fault Diagnosis
Over 95% of detected faults are successfully diagnosed
All faults detected on a fault-free core are diagnosed

51 Trace-Based Fault Diagnosis (TBFD)
[Figure: a detected permanent fault invokes the TBFD diagnosis algorithm, which compares (=?) the faulty-core execution against a fault-free core execution]

52 Trace-Based Fault Diagnosis (TBFD)
Algorithm: on permanent-fault detection, invoke TBFD; roll the faulty core back to a checkpoint and replay the execution, collecting info; load the checkpoint on a fault-free core and execute the instructions fault-free; compare (=?)
Open questions: What info to collect? What info to compare? What to do on divergence?

53 Trace-Based Fault Diagnosis (TBFD)
Algorithm (continued): the faulty core’s replay collects µarch info into a faulty trace; the fault-free core’s execution produces a test trace; the two are compared (=?), and state is synchronized on divergence

54 What to Buffer and Checkpoint: Methodology
Buffer requirements of I/O-intensive workloads
–Apache, SSH daemon serving client requests over the network
Server daemon serving multiple client threads
–Fault injection and detection only at the server
 ⇒ SDC rate similar to single-threaded; focus on recovery
After detection, rollback to checkpoint, replay without the fault
–Outcomes: recoverable, Detected Unrecoverable Error (DUE), SDC
[Figure: Simics simulation – a simulated server (application/OS/hardware, with the injected fault) connected to a simulated client over a simulated network]

55 SWAT Recovery
Recovery: mask the effect of an error for continuous operation
Is I/O buffering required?
–Previous software solutions are vulnerable to hardware faults
Overheads for I/O buffering and checkpointing?
–I/O buffering interval > 2 × detection latency
–Chkpt interval × # of chkpts > 2 × detection latency
[Figure: recovery = checkpoint/replay (rollback to pristine state, re-execute) + I/O buffering (prevent irreversible effects)]

56 What to Checkpoint and Buffer?
[Figure: system with CPU, memory, SCSI controller, host–PCI bridge, console, and network; candidate configurations of checkpointed state (Proc+Mem, Proc+AllMem, FullSystem, Proc+AllMem+OutBuf) differ in whether device-to-memory writes are covered and whether CPU-to-device writes are buffered]

57 What to Checkpoint, Buffer: Methodology
(Repeat of the methodology slide above: Apache and sshd servers with simulated clients over a simulated network in Simics; fault injection and detection only at the server; after detection, rollback to checkpoint and replay without the fault; outcomes classified as recoverable, DUE, or SDC)

58 What to Checkpoint, Buffer? Need output buffering for full recovery!

59 How Much Output Buffering?
Monitored CPU-to-device writes for intervals from 10K to 100M
–An interval of 10M needs a >100KB output buffer
–An interval of 10K needs a <1KB buffer!

60 How Much Checkpoint Overhead?
Used the state of the art: ReVive
–Effect of different cache sizes, intervals unknown
Methodology
–16-core multicore with shared L2
–4 SPLASH parallel apps (worst case for original ReVive)
–Vary L2 cache size from 256KB to 2048KB
–Vary checkpoint interval from 500K to 50M

61 Overhead of Hardware Checkpointing
Intervals and cache sizes have a large impact on performance (shown for Ocean)
Checkpoint intervals of <5M can have unacceptable overheads
–But avoiding them with longer intervals requires a 100KB output buffer!
–Need cheaper checkpoint mechanisms

62 SWAT-Sim: Fast and Accurate Fault Modeling
Need accurate µarch-level fault models
–Must be able to observe system-level effects
–µarch (latch)-level injections are fast but inaccurate
–Gate-level injections are accurate but too slow
Can we achieve µarch-level speed with gate-level accuracy?
SWAT-Sim – gate-level accuracy at µarch-level speeds
–Simulate mostly at the µarch level
–Simulate only the faulty component at the gate level, on demand
–Permanent faults require back-and-forth between levels

63 SWAT-Sim: Gate-level Accuracy at µarch Speeds
[Flowchart: during µarch simulation of r3 ← r1 op r2, if the faulty unit is not used, continue µarch simulation; if it is used, send the inputs as stimuli to a gate-level fault simulation of that unit and propagate the response (a possibly faulty r3) back to the µarch simulation]
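
The flowchart boils down to a dispatch decision per operation. The sketch below illustrates it for an ALU op, with gate_level_alu() standing in for the NCVerilog/VPI co-simulation of the faulty unit's netlist; the names and the two-unit enum are illustrative, not the actual SWAT-Sim interface.

```cpp
#include <cstdint>
#include <functional>

enum class Unit { ALU, AGEN };

struct SwatSim {
  Unit faulty_unit;
  // Hook into the gate-level simulator: (op, a, b) -> result under the fault.
  std::function<uint64_t(char, uint64_t, uint64_t)> gate_level_alu;

  uint64_t execute_alu(char op, uint64_t a, uint64_t b) {
    if (faulty_unit == Unit::ALU)
      return gate_level_alu(op, a, b);  // on-demand gate-level simulation of the faulty unit
    switch (op) {                        // fast µarch/functional path otherwise
      case '+': return a + b;
      case '-': return a - b;
      case '&': return a & b;
      default:  return a | b;
    }
  }
};
```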

64 SWAT-Sim Results
SWAT-Sim implemented within full-system simulation
–GEMS + Simics for µarch and functional simulations
–NCVerilog + VPI for gate-level ALU, AGEN
Performance overhead
–100,000× faster than gate-level full-processor simulation
–2× slowdown over µarch-level simulation
Accuracy of µarch fault models using SWAT coverage/latency
–Compared µarch stuck-at with SWAT-Sim stuck-at, delay faults
–µarch fault models generally inaccurate
 Accuracy varies depending on structure, fault model
 Differences in activation rate, multi-bit flips
Unsuccessful attempts to derive more accurate µarch fault models
⇒ Need SWAT-Sim, at least for now

