SWAT: Designing Resilient Hardware by Treating Software Anomalies

Presentation transcript:

SWAT: Designing Resilient Hardware by Treating Software Anomalies
Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup K. Sahoo, Siva Kumar Sastry Hari, Rahmet Ulya Karpuzcu, Sarita Adve, Vikram Adve, Yuanyuan Zhou
Department of Computer Science, University of Illinois at Urbana-Champaign
swat@cs.uiuc.edu

Motivation
- Hardware failures will happen in the field
  - Aging, soft errors, inadequate burn-in, design defects, …
  → Need in-field detection, diagnosis, recovery, repair
- Reliability problem pervasive across many markets
  - Traditional redundancy (e.g., nMR) too expensive
  - Piecemeal solutions for specific fault model too expensive
  - Must incur low area, performance, power overhead
- Today: low-cost solution for multiple failure sources

Observations
- Need to handle only hardware faults that propagate to software
- Fault-free case remains common, must be optimized
→ Watch for software anomalies (symptoms)
  - Hardware fault detection ~ Software bug detection
  - Zero to low overhead “always-on” monitors
- Diagnose cause after symptom detected
  - May incur high overhead, but rarely invoked
→ SWAT: SoftWare Anomaly Treatment

SWAT Framework Components
- Detection: Symptoms of software misbehavior, minimal backup hardware
- Recovery: Hardware/software checkpoint and rollback
- Diagnosis: Rollback/replay on multicore
- Repair/reconfiguration: Redundant, reconfigurable hardware
- Flexible control through firmware
[Figure: timeline labeled Checkpoint, Fault, Error, Symptom detected, Recovery, Diagnosis, Repair]

SWAT Framework Components
1. Detectors w/ simple hardware [ASPLOS ’08]
2. Detectors w/ compiler support [DSN ’08a]
3. Trace-Based Fault Diagnosis [DSN ’08b]
4. Accurate Fault Models [HPCA ’09]
[Figure: timeline labeled Checkpoint, Fault, Error, Symptom detected, Recovery, Diagnosis, Repair]

Simple Hardware-only Symptom-based Detection
- Observe anomalous symptoms for fault detection
  - Incur low overheads for “always-on” detectors
  - Minimal support from hardware, no software support
- Fatal traps generated by hardware
  - Division by Zero, RED State, etc.
- Hangs detected using simple hardware hang detector
- High OS activity detected with performance counter
  - Typical OS invocations take 10s or 100s of instructions
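The three symptom classes above can be illustrated with a small software model. This is only a sketch: the event encoding, the `first_symptom` name, and both thresholds (`os_limit`, `hang_limit`) are hypothetical, and the hang check is modeled as a simple no-progress counter because the slides do not describe the actual hang detector.

```python
# Hypothetical set of trap names treated as fatal (the slide lists
# Division by Zero and RED State as examples).
FATAL_TRAPS = {"division_by_zero", "red_state"}


def first_symptom(events, os_limit=10_000, hang_limit=1_000):
    """Return (event_index, symptom_name) for the first symptom, or None.

    Each event is a (kind, value) pair, an encoding assumed for this sketch:
      ("trap", name)   a hardware trap was raised
      ("stall", n)     n cycles passed without a retired instruction
      ("os", n)        n instructions retired inside the OS
      ("user", n)      n instructions retired in the application
    """
    os_run = 0   # contiguous instructions executed inside the OS
    stalled = 0  # contiguous cycles without a retired instruction
    for i, (kind, value) in enumerate(events):
        if kind == "trap":
            if value in FATAL_TRAPS:
                return i, "fatal_trap"
        elif kind == "stall":
            stalled += value
            if stalled > hang_limit:
                return i, "hang"
        elif kind == "os":
            stalled = 0
            os_run += value
            if os_run > os_limit:
                return i, "high_os"
        elif kind == "user":
            stalled = 0
            os_run = 0
    return None
```

Because typical OS invocations take tens or hundreds of instructions, a contiguous-OS-instruction limit set well above that range leaves fault-free runs untouched and flags only abnormally long OS activity.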

Experimental Methodology
- Microarchitecture-level fault injection
  - GEMS ooo timing models + Simics full-system simulation
  - SPEC apps on OpenSolaris, UltraSPARC III ISA
- Fault model
  - Stuck-at, bridging faults in latches of 8 arch structures
  - 12,800 faults, <0.3% error @ 95% confidence
  - Also studied transients, but this talk on permanents
- Simulate impact of fault in detail for 10M instructions
- If no symptom in 10M instr, run to completion
[Figure: after the fault is injected, timing simulation runs for 10M instr, followed by functional simulation; outcomes are app masked, symptom > 10M, or silent data corruption (SDC)]
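As a rough illustration of the fault model, the sketch below applies a stuck-at or bridging fault to the bits of a latch value. The function names and the wired-AND/wired-OR treatment of bridging are assumptions made for this example, not the injection code used in the study.

```python
def stuck_at(value: int, bit: int, stuck_to: int) -> int:
    """Force one bit of a latch value to 0 or 1 (stuck-at-0 / stuck-at-1)."""
    mask = 1 << bit
    return (value | mask) if stuck_to else (value & ~mask)


def bridge(value: int, bit_a: int, bit_b: int, wired: str = "and") -> int:
    """Short two bits together; both take the AND (or OR) of their values.

    The wired-AND / wired-OR behaviour is an assumed simplification.
    """
    a = (value >> bit_a) & 1
    b = (value >> bit_b) & 1
    shorted = (a & b) if wired == "and" else (a | b)
    value = stuck_at(value, bit_a, shorted)
    return stuck_at(value, bit_b, shorted)
```

Note that `stuck_at(0b1010, 1, 1)` returns the value unchanged: the fault is present but not activated, which is one reason many injected faults are masked.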

Efficacy of Simple HW Only Detectors - Coverage
- Permanent faults
  - 98% of unmasked faults detected in 10M instr (w/o FPU)
  - 0.4% of injected faults result in SDC (w/o FPU)
- Need hardware support or other monitors for FPU

Latency to Detection from Software Corruption
- 88% detected within 100K instructions, rest within 10M instr
- Can use hardware recovery methods – SafetyNet, Revive

Conclusions So Far
- SWAT approach feasible and attractive
- Very low-cost hardware detectors already effective
  - 98% coverage, only 0.4% SDC for 7 of 8 structures
- Next: Can we get even better coverage, especially SDC rate?

Improving SWAT Detection Coverage
- Can we improve coverage, SDC rate further?
- SDC faults primarily corrupt data values
  - Illegal control/address values caught by other symptoms
  - Need detectors to capture “semantic” information
- Software-level invariants capture program semantics
  - Use when higher coverage desired
  - Sound program invariants → expensive static analysis
  - We use likely program invariants

Likely Program Invariants
- Hold on all observed inputs, expected to hold on others
- But suffer from false positives
- Use SWAT diagnosis to detect false positives on-line
iSWAT invariant detectors
- Range-based value invariants [Sahoo et al. DSN ’08]
- Check MIN ≤ value ≤ MAX on data values
- Disable invariant when diagnosis identifies a false positive
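A range-based likely invariant of this kind can be sketched in a few lines. The `RangeInvariant` class and its method names are hypothetical; the sketch only shows the train, check, and disable steps named on the slide, not the LLVM instrumentation itself.

```python
class RangeInvariant:
    """Likely invariant MIN <= value <= MAX for one monitored data value."""

    def __init__(self):
        self.lo = None       # smallest value seen in training (MIN)
        self.hi = None       # largest value seen in training (MAX)
        self.enabled = True  # cleared once diagnosis reports a false positive

    def train(self, value):
        """Widen the range to include a value observed on a training input."""
        self.lo = value if self.lo is None else min(self.lo, value)
        self.hi = value if self.hi is None else max(self.hi, value)

    def holds(self, value) -> bool:
        """Return False only when an enabled, trained invariant is violated."""
        if not self.enabled or self.lo is None:
            return True
        return self.lo <= value <= self.hi

    def disable(self):
        """Stop checking after the violation was diagnosed as a false positive."""
        self.enabled = False
```

A violation is only a symptom: the value may be legitimate but outside the training range, which is why the invariant is disabled rather than trusted once diagnosis rules out a hardware fault.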

iSWAT Implementation
[Figure: Training phase. A compiler pass in LLVM adds invariant monitoring code to the application; running it on test, train, and external inputs produces ranges for i/p #1 … i/p #n, which yield the invariant ranges.]

iSWAT Implementation
[Figure: Training phase. A compiler pass in LLVM adds invariant monitoring code to the application; running it on test, train, and external inputs produces ranges for i/p #1 … i/p #n, which yield the invariant ranges. Fault detection phase. A compiler pass in LLVM adds invariant checking code; the application runs on the ref input under full-system simulation with injected faults. An invariant violation invokes SWAT diagnosis, which reports either fault detection or a false positive (disable invariant).]

iSWAT Results
- Evaluated iSWAT on 5 apps w/ previous methodology
- Key results
  - Undetected faults reduce by 30%
  - SDCs reduce by 73% (33 to 9)
  - Runtime overhead 5% on x86, 14% on UltraSparc IIIi
    - Can be further reduced with optimized invariants
- Exploring more sophistication to ↑ coverage, ↓ overhead

Fault Diagnosis
- Symptom-based detection is cheap but
  - High latency from fault activation to detection
  - Difficult to diagnose root cause of fault
- How to diagnose SW bug vs. transient vs. permanent fault?
- For permanent fault within core
  - Disable entire core? Wasteful!
  - Disable/reconfigure µarch-level unit?
  - How to diagnose faults to µarch unit granularity?
- Key ideas
  - Single core fault model, multicore → fault-free core available
  - Checkpoint/replay for recovery → replay on good core, compare
  - Synthesizing DMR, but only for diagnosis

SW Bug vs. Transient vs. Permanent
- Rollback/replay on same/different core
- Watch if symptom reappears
[Figure: decision tree]
Symptom detected → Rollback on faulty core
- No symptom: Transient h/w bug or non-deterministic s/w bug → Continue execution
- Symptom: Permanent h/w bug or deterministic s/w bug or false positive (iSWAT) → Rollback/replay on good core
  - No symptom: Permanent h/w fault, needs repair!
  - Symptom: False positive (iSWAT) or deterministic s/w bug (send to s/w layer)
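The decision tree can be written as a short function. This is a sketch under stated assumptions: the `Diagnosis` enum, the `diagnose` name, and the two callables (each standing for "roll back, replay, and report whether the symptom reappears") are hypothetical stand-ins for the firmware-controlled replay.

```python
from enum import Enum
from typing import Callable


class Diagnosis(Enum):
    TRANSIENT_OR_NONDETERMINISTIC_SW = "continue execution"
    PERMANENT_HW = "needs repair"
    DETERMINISTIC_SW_OR_FALSE_POSITIVE = "send to s/w layer"


def diagnose(symptom_on_same_core: Callable[[], bool],
             symptom_on_good_core: Callable[[], bool]) -> Diagnosis:
    """Classify a detected symptom by replaying from the last checkpoint.

    Each callable performs one rollback/replay and returns True if the
    symptom reappears. The good-core replay runs only when it is needed.
    """
    if not symptom_on_same_core():
        # Symptom vanished on the original core.
        return Diagnosis.TRANSIENT_OR_NONDETERMINISTIC_SW
    if not symptom_on_good_core():
        # Reappears on the original core only: the core itself is faulty.
        return Diagnosis.PERMANENT_HW
    # Reappears even on a fault-free core: not a hardware problem.
    return Diagnosis.DETERMINISTIC_SW_OR_FALSE_POSITIVE
```

Passing callables instead of precomputed booleans mirrors the slide's ordering: the second, more expensive replay on a good core happens only if the symptom persists on the original core.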

Diagnosis Framework
[Figure: Symptom detected → Diagnosis → Software bug, Transient fault, or Permanent fault; Permanent fault → Microarchitecture-Level Granularity Diagnosis → Unit X is faulty]

Trace-Based Fault Diagnosis (TBFD)
[Figure: Permanent fault detected → Invoke TBFD. Faulty core execution is compared (=?) with fault-free core execution, feeding the diagnosis algorithm.]

Trace-Based Fault Diagnosis (TBFD)
[Figure: Permanent fault detected → Invoke TBFD. Faulty core: rollback to checkpoint; replay execution, collect info. The result is compared (=?) with fault-free core execution, feeding the diagnosis algorithm.]

Trace-Based Fault Diagnosis (TBFD)
[Figure: Permanent fault detected → Invoke TBFD. Faulty core: rollback to checkpoint; replay execution, collect info. Fault-free core: load checkpoint; fault-free instruction exec. The two are compared (=?), feeding the diagnosis algorithm.]
Open questions
- What info to collect?
- What info to compare?
- What to do on divergence?

Trace-Based Fault Diagnosis (TBFD)
[Figure: Permanent fault detected → Invoke TBFD. Faulty core: rollback to checkpoint; replay execution, collect µarch info → faulty trace. Fault-free core: load checkpoint; fault-free instruction exec → test trace. The traces are compared (=?); on divergence, synch state.]
Diagnosis algorithm:
1. Front-end
2. Meta-datapath
3. Datapath
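The trace comparison step can be sketched as a search for the first point where the two traces disagree. The trace entry layout `(pc, dest_reg, value)` and the `first_divergence` name are assumptions for illustration; the actual TBFD algorithm goes on to analyze the front-end, meta-datapath, and datapath, which this sketch does not attempt.

```python
from itertools import zip_longest


def first_divergence(faulty_trace, test_trace):
    """Return (index, faulty_entry, test_entry) at the first mismatch.

    Entries are assumed to be tuples such as (pc, dest_reg, value).
    A trace that ends early counts as a divergence (missing entry is None).
    Returns None when the traces are identical.
    """
    pairs = zip_longest(faulty_trace, test_trace)
    for i, (faulty, test) in enumerate(pairs):
        if faulty != test:
            return i, faulty, test
    return None
```

For example, if the faulty core wrote a different value to `r3` at the third instruction, the function returns index 2 together with both entries, which is the point where state would be resynchronized before continuing.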

Diagnosis Results
- 98% of detected faults are diagnosed
- 89% diagnosed to unique unit/array entry
- Meta-datapath faults in out-of-order execution mislead TBFD

SwatSim: Fast and Accurate Fault Modeling
- Need accurate µarch-level fault models
  - Gate level injections accurate but too slow
  - µarch (latch) level injections fast but inaccurate
- Can we achieve µarch-level speed at gate-level accuracy?
- SwatSim – Hierarchical (mixed mode) simulation
  - Simulate mostly at µarch level
  - Simulate only faulty component at gate-level, on-demand
  - Invoke gate-level simulation online for permanent faults
    - Simulating fault effect with real-world vectors
  - Used OpenSPARC RTL models

SWAT-Sim: Gate-level Accuracy at µarch Speeds
[Figure: µarch simulation executes r3 ← r1 op r2 and asks "Faulty unit used?". No: continue µarch-level simulation. Yes: the inputs are passed as stimuli to gate-level fault simulation; the response (fault propagated to output) is returned as output r3, and µarch simulation continues.]
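The control flow in this figure can be sketched as follows. Everything here is a hypothetical stand-in: `make_gate_level_stub` merely forces one output bit to mimic a stuck-at fault, whereas the real system hands the stimuli to a gate-level simulator, and the instruction tuple layout is invented for the example.

```python
import operator

MASK64 = (1 << 64) - 1
OPS = {"add": operator.add, "sub": operator.sub,
       "and": operator.and_, "or": operator.or_}


def functional_exec(op, a, b):
    """Fault-free uarch-level execution of one ALU operation."""
    return OPS[op](a, b) & MASK64


def make_gate_level_stub(stuck_bit, stuck_to):
    """Stand-in for gate-level simulation: one output bit is stuck."""
    def gate_sim(op, a, b):
        out = functional_exec(op, a, b)
        mask = 1 << stuck_bit
        return (out | mask) if stuck_to else (out & ~mask)
    return gate_sim


def step(instr, regs, faulty_unit, gate_sim):
    """Execute one instruction; return True if gate-level simulation ran."""
    op, unit, dst, src1, src2 = instr
    a, b = regs[src1], regs[src2]
    if unit == faulty_unit:
        # Faulty unit used: send stimuli to the gate-level model on demand.
        regs[dst] = gate_sim(op, a, b)
        return True
    # Otherwise stay at the fast uarch level.
    regs[dst] = functional_exec(op, a, b)
    return False
```

Only instructions that touch the faulty unit pay the gate-level cost, which is the reason the mixed-mode approach can stay close to µarch simulation speed.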

Results from SwatSim
- SwatSim implemented within full-system simulation
  - GEMS+Simics for µarch simulation
  - NCVerilog + VPI for gate-level ALU, AGEN from OpenSPARC models
- Performance overhead
  - 100,000X faster than gate level full processor simulation
  - 2X slowdown over µarch level simulation
- Accuracy of µarch fault models using SWAT coverage/latency
  - Compared µarch stuck-at with SwatSim stuck-at, delay
  - µarch fault models generally inaccurate
    - Accuracy varies depending on structure, fault model
    - Differences in activation rate, multi-bit flips
- Unsuccessful attempts to derive more accurate µarch fault models
  → Need SwatSim, at least for now

Summary – SWAT Works!
1. Detectors w/ simple hardware [ASPLOS ’08]
2. Detectors w/ compiler support [DSN ’08a]
3. Trace-Based Fault Diagnosis [DSN ’08b]
4. Accurate Fault Models [HPCA ’09]
[Figure: timeline labeled Checkpoint, Fault, Error, Symptom detected, Recovery, Diagnosis, Repair]

Summary: SWAT Advantages
- Handles all faults that matter
  - Oblivious to low-level failure modes & masked faults
- Low, amortized overheads
  - Optimize for common case, exploit s/w reliability solutions
- Holistic systems view enables novel, synergistic solutions
  - Invariant detectors use diagnosis mechanisms
  - Diagnosis uses recovery mechanisms
- Customizable and flexible
  - Firmware control can adapt to specific reliability needs
  - E.g., hybrid, app-specific recovery (TBD)
- Beyond hardware reliability
  - SWAT treats hardware faults as software bugs
  - Long-term goal: unified system (hw + sw) reliability at lowest cost
  - Potential applications to post-silicon test and debug

Ongoing and Future Work
- Complete SWAT system implementation
  - Recovery and firmware control w/ OpenSPARC hypervisor/OS
- Multithreaded software on multicore: Initial results promising
- More aggressive detection
  - More aggressive software reliability techniques
  - H/W assertions to complement software (w/ Shobha Vasudevan)
- Modeling
  - Comprehensive SWATSim w/ OpenSPARC RTL for more h/w modules
  - Off-core faults
- Validation on FPGA (w/ Michigan) using Leon based system
  - Would be nice to have state-of-the-art multicore SPARC system
- Post-silicon debug and test
- Engagements with Sun
  - Student summer intern w/ Dr. Ishwar Parulkar, teleconferences, visits