SWAT: Designing Resilient Hardware by Treating Software Anomalies Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup K. Sahoo, Siva Kumar Sastry Hari, Rahmet.

Presentation transcript:

SWAT: Designing Resilient Hardware by Treating Software Anomalies
Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup K. Sahoo, Siva Kumar Sastry Hari, Rahmet Ulya Karpuzcu, Sarita Adve, Vikram Adve, Yuanyuan Zhou
Department of Computer Science, University of Illinois at Urbana-Champaign

2 Motivation
Hardware failures will happen in the field
–Aging, soft errors, inadequate burn-in, design defects, …
→ Need in-field detection, diagnosis, recovery, repair
Reliability problem pervasive across many markets
–Traditional redundancy (e.g., nMR) too expensive
–Piecemeal solutions for specific fault model too expensive
–Must incur low area, performance, power overhead
Today: low-cost solution for multiple failure sources

3 Observations
Need to handle only hardware faults that propagate to software
Fault-free case remains common, must be optimized
→ Watch for software anomalies (symptoms)
Hardware fault detection ~ Software bug detection
–Zero to low overhead “always-on” monitors
Diagnose cause after symptom detected
–May incur high overhead, but rarely invoked
→ SWAT: SoftWare Anomaly Treatment

4 SWAT Framework Components
Detection: Symptoms of S/W misbehavior, minimal backup H/W
Recovery: Hardware/Software checkpoint and rollback
Diagnosis: Rollback/replay on multicore
Repair/reconfiguration: Redundant, reconfigurable hardware
Flexible control through firmware
[Figure: timeline labeled Checkpoint, Fault, Error, Symptom detected, Recovery, Diagnosis, Repair]

5 SWAT
1. Detectors w/ Hardware support [ASPLOS ‘08]
2. Detectors w/ Software support [Sahoo et al., DSN ‘08]
3. Trace Based Fault Diagnosis [Li et al., DSN ‘08]
4. Accurate Fault Modeling
[Figure: timeline labeled Checkpoint, Fault, Error, Symptom detected, Recovery, Diagnosis, Repair]

6 Hardware-Only Symptom-based detection
Observe anomalous symptoms for fault detection
–Incur low overheads for “always-on” detectors
–Minimal support from hardware
Fatal traps generated by hardware
–Division by Zero, RED State, etc.
Hangs detected using simple hardware hang detector
High OS activity detected with performance counter
–Typical OS invocations take 10s or 100s of instructions
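
The three symptom classes on this slide (fatal traps, hangs, high OS activity) can be pictured as one small always-on monitor. The following is a minimal sketch only: the class name, the per-instruction event interface, the `made_progress` flag (a stand-in for whatever heuristic the hang detector really uses), and both thresholds are assumptions made for illustration, not details given in the talk.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed thresholds. The slide only says that typical OS invocations
# take tens or hundreds of instructions.
HIGH_OS_THRESHOLD = 20_000   # contiguous OS-mode instructions (hypothetical)
HANG_THRESHOLD = 100_000     # instructions without progress (hypothetical)

FATAL_TRAPS = {"division_by_zero", "red_state"}  # examples named on the slide


@dataclass
class SymptomMonitor:
    os_run: int = 0    # contiguous instructions retired in OS mode
    stalled: int = 0   # instructions retired since progress was last seen

    def observe(self, in_os: bool, made_progress: bool,
                trap: Optional[str] = None) -> Optional[str]:
        """Consume one retired-instruction event; return a symptom or None."""
        if trap in FATAL_TRAPS:
            return "fatal_trap"
        # A performance-counter-like count of back-to-back OS instructions.
        self.os_run = self.os_run + 1 if in_os else 0
        # A hang-detector-like count of instructions without progress.
        self.stalled = 0 if made_progress else self.stalled + 1
        if self.stalled >= HANG_THRESHOLD:
            return "hang"
        if self.os_run >= HIGH_OS_THRESHOLD:
            return "high_os"
        return None
```

In the fault-free case every call returns `None` after a few counter updates, which is the sense in which such detectors are cheap enough to leave always on.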

7 Experimental Methodology
Microarchitecture-level fault injection
–GEMS timing models + Simics full-system simulation
–SPEC workloads on Solaris-9 OS
Permanent fault models
–Stuck-at, bridging faults in latches of 8 µarch structures
–12,800 faults, <0.3% 95% confidence
Simulate impact of fault in detail for 10M instructions
[Figure: after fault injection, timing simulation for 10M instr; if no symptom in 10M instr, run to completion in functional simulation. Outcomes: app masked, or symptom > 10M, or silent data corruption (SDC)]

8 Efficacy of Hardware-only Detectors
Coverage: Percentage of unmasked faults detected
–98% faults detected, 0.4% give SDC (w/o FPU)
→ Additional support required for FPU-like units
–66% of detected faults corrupt OS state, need recovery
→ Despite low OS activity in fault-free execution
Latency: Number of instr between activation and detection
–HW recovery for up to 100k instr, SW for longer latencies
–App in 87% of detections recoverable using HW
–OS recoverable in virtually all detections using HW
→ OS recovery using SW hard

9 Improving SWAT Detection Coverage
Can we improve coverage, SDC rate further?
SDC faults primarily corrupt data values
–Illegal control/address values caught by other symptoms
–Need detectors to capture “semantic” information
Software-level invariants capture program semantics
–Use when higher coverage desired
–Sound program invariants → expensive static analysis
–We use likely program invariants

10 Likely Program Invariants
Likely program invariants
–Hold on all observed inputs, expected to hold on others
–But suffer from false positives
–Use SWAT diagnosis to detect false positives on-line
iSWAT - Compiler-assisted symptom detectors
–Range-based value invariants [Sahoo et al. DSN ‘08]
–Check MIN ≤ value ≤ MAX on data values
–Disable invariant when diagnosis finds a false positive
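
A range-based likely invariant of the kind described on this slide can be sketched as follows. This is an illustrative model only, assuming one invariant per instrumented program site; the names `RangeInvariant`, `InvariantTable`, and the string site keys are hypothetical and are not taken from iSWAT itself, where the monitoring and checking code is inserted by a compiler pass.

```python
from dataclasses import dataclass


@dataclass
class RangeInvariant:
    lo: float = float("inf")     # smallest value seen in training (MIN)
    hi: float = float("-inf")    # largest value seen in training (MAX)
    enabled: bool = True

    def train(self, value: float) -> None:
        """Training phase: widen the range to cover an observed value."""
        self.lo = min(self.lo, value)
        self.hi = max(self.hi, value)

    def violated(self, value: float) -> bool:
        """Detection phase: check MIN <= value <= MAX if still enabled."""
        return self.enabled and not (self.lo <= value <= self.hi)


class InvariantTable:
    """Maps a hypothetical program-site key to its range invariant."""

    def __init__(self) -> None:
        self.invariants: dict[str, RangeInvariant] = {}

    def train(self, site: str, value: float) -> None:
        self.invariants.setdefault(site, RangeInvariant()).train(value)

    def check(self, site: str, value: float) -> bool:
        """Return True when the value violates the invariant at this site."""
        inv = self.invariants.get(site)
        return inv is not None and inv.violated(value)

    def disable(self, site: str) -> None:
        """Called when diagnosis classifies a violation as a false positive."""
        self.invariants[site].enabled = False
```

Because the ranges come only from observed inputs, a legitimate value outside them raises a violation; disabling the invariant after diagnosis keeps that false positive from recurring.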

11 iSWAT implementation
[Figure: Training phase. A compiler pass in LLVM adds invariant monitoring code to the application. Runs on test, train, and external inputs produce ranges for i/p #1 .. i/p #n, which are combined into invariant ranges.]

12 iSWAT implementation
[Figure, part 1: Training phase. A compiler pass in LLVM adds invariant monitoring code to the application. Runs on test, train, and external inputs produce ranges for i/p #1 .. i/p #n, which are combined into invariant ranges.]
[Figure, part 2: Fault detection phase. A compiler pass in LLVM adds invariant checking code to the application, which runs on the ref input in full-system simulation with injected faults. An invariant violation invokes SWAT diagnosis, which reports either a false positive (disable invariant) or a fault detection.]

13 iSWAT Results
Explored SWAT with 5 apps on previous methodology
Undetected faults reduce by 30%
Invariants reduce SDCs by 73% (33 to 9)
Overheads: 5% on x86, 14% on UltraSparc IIIi
–Reasonably low overheads on some machines
–Un-optimized invariants used, can be further reduced
Exploring more sophistication for ↑ coverage, ↓ overheads

14 Fault Diagnosis
Symptom-based detection is cheap but
–High latency from fault activation to detection
–Difficult to diagnose root cause of fault
–How to diagnose SW bug vs. transient vs. permanent fault?
For permanent fault within core
–Disable entire core? Wasteful!
–Disable/reconfigure µarch-level unit?
–How to diagnose faults to µarch unit granularity?
Key ideas
–Single core fault model, multicore → fault-free core available
–Checkpoint/replay for recovery → replay on good core, compare
–Synthesizing DMR, but only for diagnosis

15 SW Bug vs. Transient vs. Permanent
Rollback/replay on same/different core
Watch if symptom reappears
[Figure: decision flow]
Symptom detected → rollback and replay on faulty core
–No symptom: transient or non-deterministic s/w bug, continue execution
–Symptom: false positive (iSWAT), deterministic s/w bug, or permanent h/w fault → rollback/replay on good core
Rollback/replay on good core
–No symptom: permanent h/w fault, needs repair!
–Symptom: false positive (iSWAT) or deterministic s/w bug, send to s/w layer
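
The decision flow on this slide reduces to two replays and two checks. The sketch below is one way to write it down, assuming each replay is available as a callable that returns whether the symptom reappeared; the function and enum names are hypothetical, and the real mechanism is firmware-controlled checkpoint rollback rather than a Python call.

```python
from enum import Enum
from typing import Callable


class Diagnosis(Enum):
    TRANSIENT_OR_NONDETERMINISTIC_SW = "continue execution"
    PERMANENT_HW = "needs repair"
    DETERMINISTIC_SW_OR_FALSE_POSITIVE = "send to s/w layer"


def diagnose(replay_on_faulty_core: Callable[[], bool],
             replay_on_good_core: Callable[[], bool]) -> Diagnosis:
    """Each callable rolls back to the checkpoint, replays, and returns
    True if the symptom reappears (assumed interface)."""
    # Step 1: replay on the core where the symptom was first seen.
    if not replay_on_faulty_core():
        # Symptom gone: transient h/w fault or non-deterministic s/w bug.
        return Diagnosis.TRANSIENT_OR_NONDETERMINISTIC_SW
    # Step 2: symptom persists, so replay on a different, fault-free core.
    if not replay_on_good_core():
        # Symptom only on the original core: permanent h/w fault.
        return Diagnosis.PERMANENT_HW
    # Symptom on both cores: deterministic s/w bug or iSWAT false positive.
    return Diagnosis.DETERMINISTIC_SW_OR_FALSE_POSITIVE
```

Note that the second replay runs only when the first one reproduces the symptom, so the common transient case costs a single rollback.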

16 Diagnosis Framework
[Figure: Symptom detected → Diagnosis → software bug, transient fault, or permanent fault. Permanent fault → microarchitecture-level diagnosis → "Unit X is faulty".]

17 Trace-Based Fault Diagnosis (TBFD)
[Figure: Permanent fault detected → invoke TBFD. Faulty core execution and fault-free core execution are compared (=?) and fed to the diagnosis algorithm.]

18 Trace-Based Fault Diagnosis (TBFD)
[Figure: Permanent fault detected → invoke TBFD → rollback faulty core to checkpoint → replay execution, collect info. The collected info is compared (=?) with fault-free core execution and fed to the diagnosis algorithm.]

19 Trace-Based Fault Diagnosis (TBFD)
[Figure: Permanent fault detected → invoke TBFD → rollback faulty core to checkpoint → replay execution, collect info. In parallel: load checkpoint on fault-free core → fault-free instruction exec. The two are compared (=?) and fed to the diagnosis algorithm.]
What info to collect?
What info to compare?
What to do on divergence?

20 Can a Divergent Instruction Lead to Diagnosis?
Simpler case: ALU fault
[Figure: table of two divergent instructions (add r1,r3,r5 and a sub instruction) listing faulty vs. fault-free results and the hardware each used (dec, alu, dst preg)]
Both divergent instructions used same ALU → ALU1 faulty

21 Can a Divergent Instruction Lead to Diagnosis?
Complex example: Fault in register alias table (RAT) entry
Divergent instructions do not directly lead to faulty unit
Instead, look backward/forward in instruction stream
–Need to collect and analyze instruction trace
[Figure: RAT (logical → physical map) and register file contents for I_A: r3 ← r2 + r2 and I_B: r1 ← r5 * r2. The RAT entry for r3 is in error (p24 instead of p55), so I_A's result lands in r5's physical register and I_B computes r1 = 32 instead of the fault-free r1 = 12.]
Diverged! But I_B does not use faulty HW…

22 Diagnosing Permanent Fault to µarch Granularity
Trace-based fault diagnosis (TBFD)
–Compare instruction trace of faulty vs. good execution
–Divergence → faulty hardware used → diagnosis clues
Diagnose faults to µarch units of processor
–Check µarch-level invariants in several parts of processor
–Front end, meta-datapath, datapath faults
–Diagnosis in out-of-order logic (meta-datapath) complex
Results
–98% of the faults detected by SWAT successfully diagnosed
–TBFD flexible for other detectors/granularity of repair
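
For the simple datapath case (slide 20), the trace comparison amounts to intersecting the hardware used by the divergent instructions. The sketch below shows only that case, under the assumption that each trace entry records a PC, a result, and the set of µarch resources the instruction used; the record layout and resource names such as `alu1` are hypothetical. It does not implement the backward/forward trace analysis that the RAT example on slide 21 requires, and it does not try to separate a divergence caused by faulty hardware from one inherited through a data dependence.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Set


@dataclass(frozen=True)
class TraceEntry:
    pc: int
    result: int
    resources: FrozenSet[str]   # e.g. {"dec0", "alu1", "preg5"} (assumed names)


def candidate_faulty_units(faulty: List[TraceEntry],
                           golden: List[TraceEntry]) -> Set[str]:
    """Intersect the resources used by every divergent instruction."""
    candidates: Optional[Set[str]] = None
    for f, g in zip(faulty, golden):
        if f.pc != g.pc:
            # Control flow diverged; later entries no longer line up.
            break
        if f.result != g.result:
            used = set(f.resources)
            candidates = used if candidates is None else candidates & used
    return candidates or set()


if __name__ == "__main__":
    golden = [
        TraceEntry(0x10, 9, frozenset({"dec0", "alu1", "preg5"})),
        TraceEntry(0x14, 3, frozenset({"dec1", "alu1", "preg7"})),
    ]
    faulty = [
        TraceEntry(0x10, 8, frozenset({"dec0", "alu1", "preg5"})),
        TraceEntry(0x14, 2, frozenset({"dec1", "alu1", "preg7"})),
    ]
    print(candidate_faulty_units(faulty, golden))
```

In the usage example the only resource shared by both divergent instructions is `alu1`, which mirrors the slide's conclusion that the common ALU is the faulty unit.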

23 SWAT
1. Detectors w/ Hardware support [ASPLOS ‘08]
2. Detectors w/ Software support [Sahoo et al., DSN ‘08]
3. Trace Based Fault Diagnosis [Li et al., DSN ‘08]
4. Accurate Fault Modeling
[Figure: timeline labeled Checkpoint, Fault, Error, Symptom detected, Recovery, Diagnosis, Repair]

24 SWAT-Sim: Fast and Accurate Fault Models
Need accurate µarch-level fault models
–Gate-level injections accurate but too slow
–µarch (latch) level injections fast but inaccurate
Can we achieve µarch-level speed at gate-level accuracy?
Mixed-mode (hierarchical) simulation
–µarch-level + gate-level simulation
–Simulate only faulty component at gate level, on demand
–Invoke gate-level sim online for permanent faults
→ Simulating fault effect with real-world vectors

25 SWAT-Sim: Gate-level Accuracy at µarch Speeds
[Figure: µarch-level simulation executes r3 ← r1 op r2 and asks "Faulty unit used?". No: continue µarch simulation. Yes: the inputs are passed as stimuli to gate-level fault simulation, and the response, with any fault propagated to the output, is returned as r3.]
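
The control flow in this figure can be sketched as a dispatcher that routes an operation to a detailed faulty model only when the faulty unit is actually used. This is a toy illustration under stated assumptions: the 8-bit adder with output bit 0 stuck at 1 is a Python stand-in for the gate-level fault simulation (the talk uses NCVerilog + VPI for that part), and the unit names and function names are hypothetical.

```python
from typing import Callable

MASK = 0xFF  # toy 8-bit datapath (assumption for illustration)

AluModel = Callable[[int, int], int]


def uarch_add(a: int, b: int) -> int:
    """Fast functional model used by the µarch-level simulator."""
    return (a + b) & MASK


def gate_level_add_stuck_at_1_bit0(a: int, b: int) -> int:
    """Stand-in for gate-level simulation of a faulty adder:
    output bit 0 is stuck at 1."""
    return ((a + b) & MASK) | 0x01


def make_mixed_mode_executor(fast_model: AluModel,
                             faulty_model: AluModel,
                             faulty_unit: str) -> Callable[[str, int, int], int]:
    """Return execute(unit, a, b) that invokes the detailed faulty model
    only when the instruction is scheduled on the faulty unit."""
    def execute(unit: str, a: int, b: int) -> int:
        if unit == faulty_unit:
            # Faulty unit used: pass the live inputs as stimuli.
            return faulty_model(a, b)
        # Otherwise stay in fast µarch-level simulation.
        return fast_model(a, b)
    return execute


if __name__ == "__main__":
    execute = make_mixed_mode_executor(
        uarch_add, gate_level_add_stuck_at_1_bit0, faulty_unit="alu1")
    print(execute("alu0", 2, 2))  # fault-free unit
    print(execute("alu1", 2, 2))  # fault activated by these inputs
    print(execute("alu1", 2, 3))  # fault masked by these inputs
```

The last two calls show why driving the detailed model with real operand values matters: whether the fault is activated depends on the inputs, which a fixed latch-level stuck-at model does not capture.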

26 Results from SWAT-Sim
SWAT-Sim implemented within full-system simulation
–NCVerilog + VPI for gate-level sim of ALU/AGEN modules
SWAT-Sim: High accuracy at low overheads
–100,000x faster than gate-level, same modeling fidelity
–2x slowdown over µarch-level, at higher accuracy
Accuracy of µarch models using SWAT coverage/latency
–µarch stuck-at models generally inaccurate
–Differences in activation rate, multi-bit flips
Complex manifestations → Hard to derive better models
–Need SWAT-Sim, at least for now

27 SWAT Summary
SWAT: SoftWare Anomaly Treatment
–Handle all and only faults that matter
–Low, amortized overheads
–Holistic systems view enables novel solutions
–Customizable and flexible
Prior results:
–Low-cost h/w detectors gave high coverage, low SDC rate
This talk:
–iSWAT: Higher coverage w/ software-assisted detectors
–TBFD: µarch level fault diagnosis by synthesizing DMR
–SWAT-Sim: Gate-level fault accuracy at µarch level speed

28 Future Work
Recovery: hybrid, application-specific
Aggressive use of software reliability techniques
–Leverage diagnosis mechanism
Multithreaded software
Off-core faults
Post-silicon debug and test
–Use faulty trace as fault-model oblivious test vector
Validation on FPGA (w/ Michigan)
Hardware assertions to complement software symptoms

BACKUP SLIDES

30 Breakup of Detections by SW symptoms 98% unmasked faults detected within 10M instr (w/o FPU) –Need HW support or SW monitoring for FPU

31 SW Components Corrupted 66% of faults corrupt system state before detection –Need to recover system state

32 Latency from Application mismatch 86% of faults detected under 100k –42% detected under 10k

33 Latency from OS mismatch 99% of faults detected under 100k

34 iSWAT implementation
[Figure, part 1: Training phase. A compiler pass in LLVM adds invariant monitoring code to the application. Runs on test, train, and external inputs produce ranges for i/p #1 .. i/p #n, which are combined into invariant ranges.]
[Figure, part 2: Fault detection phase. A compiler pass in LLVM adds invariant checking code to the application, which runs on the ref input in full-system simulation with injected faults. An invariant violation invokes SWAT diagnosis, which reports either a false positive (disable invariant) or a fault detection.]

35 Trace-Based Fault Diagnosis (TBFD)
[Figure: Permanent fault detected → invoke diagnosis → rollback faulty core to checkpoint → replay execution, collect µarch info, giving the faulty trace. In parallel: load checkpoint on fault-free core → fault-free instruction exec, giving the test trace. TBFD compares the two (faulty trace =? test trace) and classifies the fault as a front-end fault, a meta-datapath fault, or a datapath fault.]

36 Fault Diagnosability 98% of detected faults are diagnosed –89% diagnosed to unique unit/array entry –Meta-datapath faults in out-of-order exec mislead TBFD

37 Accuracy of existing Fault Models
SWAT-Sim implemented within full-system simulator
–NCVerilog + VPI to simulate gate-level ALU and AGEN
Existing µarch-level fault models inaccurate
–Differences in activation rate, multi-bit flips
Accurate models hard to derive → need SWAT-Sim!

38 Summary: SWAT Advantages
Handles all faults that matter
–Oblivious to low-level failure modes & masked faults
Low, amortized overheads
–Optimize for common case, exploit s/w reliability solutions
Holistic systems view enables novel solutions
–Invariant detectors use diagnosis mechanisms
–Diagnosis uses recovery mechanisms
Customizable and flexible
–Firmware based control affords hybrid, app-specific recovery (TBD)
Beyond hardware reliability
–SWAT treats hardware faults as software bugs
→ Long-term goal: unified system (hw + sw) reliability at lowest cost
–Potential applications to post-silicon test and debug

39 Transients Results
6400 transient faults injected across 8 structures
83% unmasked faults detected within 10M instr
Only 0.4% of injected faults result in SDCs