Understanding the Propagation of Hard Errors to Software and Implications for Resilient System Design. M. Li, P. Ramachandran, S.K. Sahoo, S.V. Adve, V.S. Adve, Y. Zhou


Understanding the Propagation of Hard Errors to Software and Implications for Resilient System Design
M. Li, P. Ramachandran, S.K. Sahoo, S.V. Adve, V.S. Adve, Y. Zhou (UIUC), ASPLOS’08
Shimin Chen, LBA Reading Group Presentation

Introduction
- Hardware reliability
  - Aging/wear-out
  - Infant mortality (insufficient burn-in)
  - Soft errors (radiation)
  - Design defects
- Willing to pay 10% area overhead for reliability
  - Industry panel discussion at SELSE II
  - Conventional dual modular redundancy is too costly
- How?

Two Observations
- Only observable device faults need to be handled
  - Faults that propagate through the higher levels of the system and are observable by software
- Fault-free operation is the common case
  - It must be optimized
  - Increased overhead is acceptable after a fault is detected

Proposals: Cooperative HW-SW
- Detect high-level anomalous SW behavior (symptoms of faults)
- Checkpoint/replay + diagnosis components
- (For mission-critical systems, may incorporate previous backup detection techniques)

Potential Advantages
- Generality: oblivious to the numerous failure mechanisms and microarchitectures
- Ignoring masked faults
- Optimizing for the common case
- Customizability: which action to take upon a fault?
- Amortizing overhead across other system functions
  - Reuse online SW bug detection support

Investigation in This Paper
Questions to answer:
- Coverage: Which HW faults produce detectable anomalous SW behavior with high probability?
- Latency: What is the fault detection latency?
- Impact on OS: How frequently is OS state corrupted by HW faults? What are the detection coverage and latency for such faults?
Focus on permanent faults (increasingly important)
Methodology: fault-injection study using simulations

Major Results
- Detection coverage: most permanent faults that propagate to SW are easily detectable
- Detection latency: <= 100K instructions in 86% of cases
- Impact on OS: faults often corrupt OS state

Outline
- SWAT System Assumptions
- Methodology
- Results
- Implications for Resilient System Design

SWAT (SoftWare Anomaly Treatment)
The investigation assumes the following context:
- Always-on SW symptom-based detection
- A multicore system with at least one fault-free core
- Checkpoint/replay mechanism
  - Replay when a fault is detected
  - If the anomalous behavior is deterministic, it is a HW fault; recover using a fault-free core
  - Otherwise ignore it (transient)
- HW has the ability to repair or reconfigure around permanent faults
- Firmware-controlled diagnosis and recovery keep HW errors from becoming externally visible
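The replay-based decision on this slide can be sketched roughly as follows. This is only an illustration of the rule as stated (deterministic symptom means HW fault, otherwise transient); the function name, the `replay_shows_symptom` hook, and the returned labels are hypothetical and do not come from the paper or the slides.

```python
from typing import Callable


def diagnose_symptom(replay_shows_symptom: Callable[[], bool]) -> str:
    """Decide what to do after a software symptom has been detected.

    replay_shows_symptom is a hypothetical hook (an assumption of this
    sketch): it rolls back to the last checkpoint, re-executes, and
    returns True if the anomalous behavior shows up again.
    """
    if replay_shows_symptom():
        # Deterministic anomaly: treat it as a HW fault and recover
        # using a fault-free core.
        return "hw_fault_recover_on_fault_free_core"
    # The anomaly did not repeat on replay: treat it as transient.
    return "transient_ignore"
```

In a real system the hook would wrap the checkpoint/replay mechanism; here it is just a callable so the decision rule can be read in isolation.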


Simulation Environment
- Virtutech Simics + Wisconsin GEMS microarchitectural and memory timing simulators
- SPARC V9 ISA; 6 SPECint2000 and 4 SPECfp2000 benchmarks
- OS activity < 1% in fault-free runs

Fault Injection
- Timing-first approach in GEMS
  - Cycle-accurate GEMS timing simulator
  - Simics functional simulator
  - Compare GEMS state against Simics state and, on a mismatch, set GEMS state from Simics state (so GEMS can skip support for some rare instructions)
- Fault injection
  - Inject the fault into the GEMS timing simulator
  - If the mismatched state is due to the fault injection, corrupt the Simics state
- Activated fault vs. architecturally masked fault: does GEMS state mismatch Simics state?
- OS or user mode? Check the privilege mode
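A minimal sketch of the compare-and-set step described above, assuming architectural state can be modeled as a dictionary of register values. The function name, the `due_to_injection` flag, and the result labels are assumptions made for illustration; they are not Simics or GEMS APIs.

```python
def reconcile(timing: dict, functional: dict, due_to_injection: bool) -> str:
    """Compare timing-simulator state with functional-simulator state.

    timing and functional are hypothetical register-name -> value maps.
    due_to_injection is an assumed flag saying whether the mismatch can
    be attributed to the injected fault.
    """
    if timing == functional:
        # No architectural difference: nothing to do (fault, if any,
        # is still architecturally masked at this point).
        return "match"
    if due_to_injection:
        # The fault has been activated: propagate the corrupted state
        # into the functional simulator.
        functional.clear()
        functional.update(timing)
        return "activated"
    # Ordinary timing-first mismatch: the functional simulator is the
    # reference, so reload the timing state from it.
    timing.clear()
    timing.update(functional)
    return "resynced"
```

The two mismatch branches differ only in which simulator is treated as the source of truth, which is the point of the slide.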

Fault Model: Permanent Faults
- Stuck-at-0: a bit is always 0
- Stuck-at-1: a bit is always 1
- Dominant-0: acts like a logical AND between adjacent faulty bits
- Dominant-1: acts like a logical OR between adjacent faulty bits
- Dominant-x is also known as a bridging fault
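The four fault models can be illustrated on a plain integer treated as a bit vector. This sketch assumes a bridging fault couples bit `bit` with bit `bit + 1` and that both bits read the combined value; the function names and that pairing convention are illustrative assumptions, not details taken from the paper.

```python
def stuck_at(value: int, bit: int, level: int) -> int:
    """Force one bit of a non-negative value to 0 (level=0) or 1 (level=1)."""
    mask = 1 << bit
    return value | mask if level else value & ~mask


def dominant(value: int, bit: int, level: int) -> int:
    """Bridge bits `bit` and `bit + 1` of a non-negative value.

    level=0 models dominant-0 (both bits read the AND of the pair);
    level=1 models dominant-1 (both bits read the OR of the pair).
    """
    a = (value >> bit) & 1
    b = (value >> (bit + 1)) & 1
    bridged = (a | b) if level else (a & b)
    pair = 0b11 << bit
    # Clear the pair, then write the bridged value into both positions.
    return (value & ~pair) | ((bridged * 0b11) << bit)
```

Note that a bridging fault leaves the value unchanged whenever the two adjacent bits already agree, for example `dominant(0b1111, 2, 0)` still yields `0b1111`.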

Number of Injected Faults
- 10 benchmarks
- 40 random points per benchmark after initialization
- 4 fault models
- 8 microarchitectural structures
- Total = 10 x 40 x 4 x 8 = 12,800

Fault Detection
- Run 10 million instructions with detailed simulation
- If no SW symptom is detected, run fast simulation to finish the benchmark and check for corruption

Fatal HW Trap
Typically not thrown during a correct execution. On SPARC:
- Data access exception
- Division by zero
- Illegal instruction
- Misaligned memory access
- RED state (Reset, Error, and Debug): too many nested traps
- Watchdog reset: no instruction has retired for a threshold number of cycles

Abnormal Application Exit
- The application may hit a segmentation fault or an assertion failure
- The OS knows the exit status
- In simulation, look for the OS idle loop as an indication of such an exit

Hangs
- Monitored during the 10 million instructions
- Keep a counter per observed branch PC
- Increment the counter each time that branch executes
- If any counter exceeds 100,000 (1% of the total instructions), flag a hang
- Profile the fault-free executions and mask out the handful of branches that exceed this threshold even without a fault
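The hang heuristic above can be sketched as a per-branch counter. The sketch assumes the monitor sees a stream of branch PCs as integers; the function name and the `masked_pcs` parameter (the branches excluded after fault-free profiling) are hypothetical names chosen for illustration.

```python
from collections import Counter
from typing import FrozenSet, Iterable, Optional

HANG_THRESHOLD = 100_000  # 1% of the 10 million instruction window


def detect_hang(
    branch_pcs: Iterable[int],
    masked_pcs: FrozenSet[int] = frozenset(),
    threshold: int = HANG_THRESHOLD,
) -> Optional[int]:
    """Return the first branch PC whose count exceeds threshold, else None."""
    counts: Counter = Counter()
    for pc in branch_pcs:
        if pc in masked_pcs:
            # Branch known from fault-free profiling to be this hot.
            continue
        counts[pc] += 1
        if counts[pc] > threshold:
            return pc
    return None
```

For example, `detect_hang([0x40] * 5, threshold=3)` flags PC `0x40`, while the same stream with `masked_pcs=frozenset({0x40})` is not flagged.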

High OS Activity
- Measures the amount of time the execution remains in the OS
- Typically, control returns to user mode after a few tens of instructions, except for:
  - A timer interrupt after a quantum expires (< 10,000 instructions)
  - System calls (could be 100K to 1 million instructions)
- Detection threshold: over 30,000 contiguous OS instructions, but not in a system call
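A minimal sketch of the High-OS check, assuming the monitor sees one `(privileged, in_syscall)` pair per retired instruction and that instructions inside a system call do not count toward the contiguous run. That trace format and the function name are assumptions of this sketch, not the paper's implementation.

```python
from typing import Iterable, Optional, Tuple

HIGH_OS_THRESHOLD = 30_000  # contiguous OS instructions outside system calls


def detect_high_os(
    trace: Iterable[Tuple[bool, bool]],
    threshold: int = HIGH_OS_THRESHOLD,
) -> Optional[int]:
    """Return the index of the instruction at which High-OS fires, else None.

    Each trace entry is (privileged, in_syscall) for one retired instruction.
    """
    run = 0
    for index, (privileged, in_syscall) in enumerate(trace):
        if privileged and not in_syscall:
            run += 1
            if run > threshold:
                return index
        else:
            # Back in user mode, or inside a system call: reset the run.
            run = 0
    return None
```

With a small threshold for illustration, `detect_high_os([(True, False)] * 5, threshold=3)` returns index 3, the fourth contiguous OS instruction.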

Metrics
- Coverage
  - Masked faults: architecture + application
- Detection latency: total number of instructions retired from the first architectural state corruption until the detection of the fault, within 10M instructions


How do faults manifest in SW? FPU Excluded

What are masked faults?
- Stuck-at faults
  - Register file: unused physical register
  - RAT: unused logical register
  - FPU: integer benchmarks
- Bridging faults
  - Upper 32 bits in a 64-bit operation
  - Often sign extensions: all 1s or all 0s
  - In SW, small data sizes

Large number of detections in OS
- Although OS activity is low, over 65% of detected faults are detected through symptoms from the OS
- Why?
  - A fault in user mode often results in a memory access to a cold address, causing a TLB miss
  - SPARC TLB misses are software managed
  - The OS trap handler runs on the same faulty HW
  - The OS is more control and memory intensive
  - The result is often corrupted OS state

Fatal HW Traps
- Illegal instruction traps: opcode bit changes result in an illegal opcode (decoder)
- Watchdog timer reset: no instruction has retired for a threshold number of cycles
  - ROB or RAT errors: source and destination register dependences are corrupted, resulting in an indefinite wait
- Misaligned accesses: memory addresses are wrong
- RED state exception: more than 4 nested traps

High-OS
- OS trap handling a TLB miss
- A permanent HW fault corrupts the TLB handler, so the code never returns to user mode
- Significant overlap with fatal traps and hangs
  - High-OS detects 30% of the faults
  - Removing it reduces coverage by 15%
  - Many cases eventually lead to fatal traps or hangs
  - But detecting High-OS reduces latency

Others
- Application aborts: 1% coverage
- Hangs: 3% coverage
  - Mostly in the application, because OS hangs are often detected first as High-OS
  - E.g., a loop index variable is wrong, so the loop never terminates

Undetected Faults
- For all structures except the FPU, 0.8% of injected faults result in silent data corruption
- FPU: 10% of faults result in silent data corruption
  - Why? FPU results rarely affect memory addresses or program control

Which SW components are corrupted?
- The OS needs to be checkpointed
- "None" case: watchdog reset trap; the first instruction in the ROB is blocked

Detection latency
- Application state corruption
- OS state corruption

Latency from App State Corruption
- Some combination of SW and HW checkpointing schemes is needed

Latency from OS State Corruption
- HW checkpointing schemes may be sufficient

Transient Faults Have Different Characteristics
- 94% are architecturally masked within the 10M instruction window
- 3.4% are detected in the 10M window
- 1.2% are masked by applications
- 1.3% eventually result in detectable symptoms
- Only 0.1% of the total injections result in silent data corruption


Detection
- A majority of permanent faults that propagate to SW are detectable through low-cost monitoring of simple symptoms
- Preliminary experiments show that the use of value-based invariants can significantly improve latency and coverage
- FPU: use more HW mechanisms

Recovery
- OS recovery is necessary
  - HW recovery mechanisms (e.g. ReVive, SafetyNet) may be sufficient
- Application recovery requires SW checkpoints