MSWAT: Hardware Fault Detection and Diagnosis for Multicore Systems

Slides:



Advertisements
Similar presentations
NC STATE UNIVERSITY 1 Assertion-Based Microarchitecture Design for Improved Fault Tolerance Vimal K. Reddy Ahmed S. Al-Zawawi, Eric Rotenberg Center for.
Advertisements

Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors Abhishek Bhattacharjee Margaret Martonosi.
Department of Computer Sciences Revisiting the Complexity of Hardware Cache Coherence and Some Implications Rakesh Komuravelli Sarita Adve, Ching-Tsun.
Department of Computer Science iGPU: Exception Support and Speculative Execution on GPUs Jaikrishnan Menon, Marc de Kruijf Karthikeyan Sankaralingam Vertical.
Helper Threads via Virtual Multithreading on an experimental Itanium 2 processor platform. Perry H Wang et. Al.
Hardware Fault Recovery for I/O Intensive Applications Pradeep Ramachandran, Intel Corporation, Siva Kumar Sastry Hari, NVDIA Manlap (Alex) Li, Latham.
EEC 688/788 Secure and Dependable Computing Lecture 12 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
Continuously Recording Program Execution for Deterministic Replay Debugging.
What Great Research ?s Can RAMP Help Answer? What Are RAMP’s Grand Challenges ?
Justin Meza Qiang Wu Sanjeev Kumar Onur Mutlu Revisiting Memory Errors in Large-Scale Production Data Centers Analysis and Modeling of New Trends from.
MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,
Towards a Hardware-Software Co-Designed Resilient System Man-Lap (Alex) Li, Pradeep Ramachandran, Sarita Adve, Vikram Adve, Yuanyuan Zhou University of.
Software Faults and Fault Injection Models --Raviteja Varanasi.
Defining Anomalous Behavior for Phase Change Memory
SWAT: Designing Reisilent Hardware by Treating Software Anomalies Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup K. Sahoo, Siva Kumar Sastry Hari, Rahmet.
Michael Ernst, page 1 Collaborative Learning for Security and Repair in Application Communities Performers: MIT and Determina Michael Ernst MIT Computer.
Priority Research Direction (use one slide for each) Key challenges -Fault understanding (RAS), modeling, prediction -Fault isolation/confinement + local.
Understanding the Propagation of Hard Errors to Software and Implications for Resilient System Design M. Li, P. Ramachandra, S.K. Sahoo, S.V. Adve, V.S.
Eliminating Silent Data Corruptions caused by Soft-Errors Siva Hari, Sarita Adve, Helia Naeimi, Pradeep Ramachandran, University of Illinois at Urbana-Champaign,
ReSlice: Selective Re-execution of Long-retired Misspeculated Instructions Using Forward Slicing Smruti R. Sarangi, Wei Liu, Josep Torrellas, Yuanyuan.
Using Likely Program Invariants to Detect Hardware Errors Swarup Kumar Sahoo, Man-Lap Li, Pradeep Ramachandran, Sarita Adve, Vikram Adve, Yuanyuan Zhou.
CprE 458/558: Real-Time Systems
SWAT: Designing Resilient Hardware by Treating Software Anomalies Lei Chen, Byn Choi, Xin Fu, Siva Hari, Man-lap (Alex) Li, Pradeep Ramachandran, Swarup.
Application-Aware SoftWare AnomalyTreatment (SWAT) of Hardware Faults Byn Choi, Siva Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup Sahoo, Sarita.
Relyzer: Exploiting Application-level Fault Equivalence to Analyze Application Resiliency to Transient Faults Siva Hari 1, Sarita Adve 1, Helia Naeimi.
On-Demand Dynamic Software Analysis Joseph L. Greathouse Ph.D. Candidate Advanced Computer Architecture Laboratory University of Michigan December 12,
CISC Machine Learning for Solving Systems Problems Presented by: Suman Chander B Dept of Computer & Information Sciences University of Delaware Automatic.
SWAT: Designing Resilient Hardware by Treating Software Anomalies Byn Choi, Siva Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup Sahoo, Sarita Adve,
Preserving Application Reliability on Unreliable Hardware Siva Hari Department of Computer Science University of Illinois at Urbana-Champaign.
Harnessing Soft Computation for Low-Budget Fault Tolerance Daya S Khudia Scott Mahlke Advanced Computer Architecture Laboratory University of Michigan,
Low-cost Program-level Detectors for Reducing Silent Data Corruptions Siva Hari †, Sarita Adve †, and Helia Naeimi ‡ † University of Illinois at Urbana-Champaign,
Adding Algorithm Based Fault-Tolerance to BLIS Tyler Smith, Robert van de Geijn, Mikhail Smelyanskiy, Enrique Quintana-Ortí 1.
Flashback : A Lightweight Extension for Rollback and Deterministic Replay for Software Debugging Sudarshan M. Srinivasan, Srikanth Kandula, Christopher.
CS717 1 Hardware Fault Tolerance Through Simultaneous Multithreading (part 2) Jonathan Winter.
Software Managed Resiliency Siva Hari Lei Chen, Xin Fu, Pradeep Ramachandran, Swarup Sahoo, Rob Smolenski, Sarita Adve Department of Computer Science University.
GangES: Gang Error Simulation for Hardware Resiliency Evaluation Siva Hari 1, Radha Venkatagiri 2, Sarita Adve 2, Helia Naeimi 3 1 NVIDIA Research, 2 University.
University of Michigan Electrical Engineering and Computer Science 1 Low Cost Control Flow Protection Using Abstract Control Signatures Daya S Khudia and.
Approximate Computing: (Old) Hype or New Frontier? Sarita Adve University of Illinois, EPFL Acks: Vikram Adve, Siva Hari, Man-Lap Li, Abdulrahman Mahmoud,
Qin Zhao1, Joon Edward Sim2, WengFai Wong1,2 1SingaporeMIT Alliance 2Department of Computer Science National University of Singapore
Fail-stutter Behavior Characterization of NFS
Presented by: Daniel Taylor
Ph.D. in Computer Science
Modeling Stream Processing Applications for Dependability Evaluation
nZDC: A compiler technique for near-Zero silent Data Corruption
SWAT: Designing Resilient Hardware by Treating Software Anomalies
Effective Data-Race Detection for the Kernel
Operating System Reliability
Operating System Reliability
Spare Register Aware Prefetching for Graph Algorithms on GPUs
Daya S Khudia, Griffin Wright and Scott Mahlke
Hwisoo So. , Moslem Didehban#, Yohan Ko
Fault Injection: A Method for Validating Fault-tolerant System
Operating System Reliability
Operating System Reliability
Soft Error Detection for Iterative Applications Using Offline Training
15-740/ Computer Architecture Lecture 5: Precise Exceptions
Yikes! Why is my SystemVerilog Testbench So Slooooow?
InCheck: An In-application Recovery Scheme for Soft Errors
Mengjia Yan† , Jiho Choi† , Dimitrios Skarlatos,
José A. Joao* Onur Mutlu‡ Yale N. Patt*
Hardware Counter Driven On-the-Fly Request Signatures
Prof. Leonardo Mostarda University of Camerino
Operating System Reliability
Dynamic Verification of Sequential Consistency
Abstractions for Fault Tolerance
MapReduce: Simplified Data Processing on Large Clusters
MicroScope: Enabling Microarchitectural Replay Attacks
Operating System Reliability
DMP: Deterministic Shared Memory Multiprocessing
Operating System Reliability
Presentation transcript:

MSWAT: Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi, Sarita Adve Department of Computer Science University of Illinois at Urbana-Champaign swat@cs.uiuc.edu

Motivation Goal Hardware resilience with low overhead Previous Work SWAT – low-cost fault detection and diagnosis For single-threaded workloads This work Fault detection and diagnosis for multithreaded apps The goal of this project is to provide high hardware reliability with low overehad in terms of area, perf and power. Since the reliability probelm is becoming pervasive across different markets, we need a low cost reliability solution. Previously we propsed SWAT that detects faults with low overhead symptom detectors. And it diagnoses faults at microarchitecture level With the advent of multicore era, multithreaded applications are becoming common. Therefore, in this work we focus on fault detection and diagnosis for multithreaded applications

SWAT Background SWAT Observations SWAT Approach Need handle only hardware faults that propagate to software Fault-free case remains common, must be optimized SWAT Approach Watch for software anomalies (symptoms) Zero to low overhead “always-on” monitors Diagnose cause after symptom detected May incur high overhead, but rarely invoked

SWAT Framework Components Detectors with simple hardware Detectors with compiler support Fault Error Symptom detected Recovery Diagnosis Repair Checkpoint µarch-level Fault Diagnosis (TBFD)

Shown to work well for single-threaded apps Challenge Detectors with simple hardware Detectors with compiler support Fault Error Symptom detected Recovery Diagnosis Repair Checkpoint Shown to work well for single-threaded apps Does SWAT approach work on multithreaded apps? µarch-level Fault Diagnosis (TBFD)

Multithreaded Applications Multithreaded apps share data among threads Symptom causing core may not be faulty Need to diagnose faulty core Core 2 Fault Core 1 Store Memory Load Symptom Detection on a fault-free core

Contributions Evaluate SWAT detectors on multithreaded apps High fault coverage for multithreaded workloads too Observed symptom from fault-free cores Novel fault diagnosis for multithreaded apps Identifies the faulty core despite fault propagation Provides high diagnosability

Outline Motivation MSWAT Detection MSWAT Diagnosis Results Summary and Advantages Future Work

SWAT Hardware Fault Detection Low-cost monitors to detect anomalous software behavior Fatal traps detected by hardware Division by Zero, RED State, etc. Hangs detected using simple hardware hang detector High OS activity using performance counters Typical OS invocations take 10s or 100s of instructions

MSWAT Fault Detection New symptom: Panic detected when kernel panics Detected using hardware debug registers SWAT-like detectors provide high coverage

Fault Diagnosis After detection, invoke diagnosis to identify the faulty core Replay fault activating execution

SWAT Fault Diagnosis Rollback/replay on same/different core Single-threaded application on multicore Faulty Good Symptom detected Rollback on faulty core No symptom Symptom Deterministic s/w or Permanent h/w bug Transient or Non-deterministic s/w bug Rollback/replay on good core Continue Execution No symptom Symptom Permanent h/w fault, needs repair! Deterministic s/w bug, send to s/w layer

No known good core available SWAT Fault Diagnosis Rollback/replay on same/different core Single-threaded application on multicore No known good core available Faulty core is unknown Faulty Good Symptom detected Rollback on faulty core No symptom Symptom Deterministic s/w or Permanent h/w bug Transient or Non-deterministic s/w bug Rollback/replay on good core Continue Execution No symptom Symptom Permanent h/w fault, needs repair! Deterministic s/w bug, send to s/w layer

Extending SWAT Diagnosis to Multithreaded Apps Naïve extension – N known good cores to replay the trace Too expensive – area Requires full-system deterministic replay Simple optimization – One spare core Not Scalable, requires N full-system deterministic replays Requires a spare core Single point of failure C1 S C2 C3 Symptom Detected C1 S C2 C3 Symptom Detected Faulty core is C2 C1 S C2 C3 No Symptom Detected

MSWAT Diagnosis - Key Ideas Multithreaded applications Full-system deterministic replay No known good core Challenges Isolated deterministic replay Emulated TMR Key Ideas A B C D A B C D TA TB TC TD TA TB TC TD TA TA TA

MSWAT Diagnosis - Key Ideas Multithreaded applications Full-system deterministic replay No known good core Challenges Isolated deterministic replay Emulated TMR Key Ideas A B C D A B C D TA TB TC TD TA TB TC TD TA TD TA TB TC TC TD TA TB

Multicore Fault Diagnosis Algorithm Symptom detected Capture fault activating trace Re-execute Captured trace TA TB TC TD A B C D Example

Multicore Fault Diagnosis Algorithm Symptom detected Capture fault activating trace Re-execute Captured trace A B C D A B C D TA TB TC TD Example

Multicore Fault Diagnosis Algorithm Symptom detected Capture fault activating trace Re-execute Captured trace Faulty core Look for divergence TA TB TC TD A B C D TD TA TB TC A B C D Example TA A B C D No Divergence Divergence Faulty core is B

Multicore Fault Diagnosis Algorithm What info to capture for deterministic isolated replay? Symptom detected Capture fault activating trace Deterministic isolated replay Faulty core Look for divergence

Enabling Deterministic Isolated Replay Thread Input to thread Ld Capturing input to thread is sufficient for deterministic replay Record all retiring loads Enables isolated replay of each thread

Multicore Fault Diagnosis Algorithm How to identify divergence? Symptom detected Capture fault activating trace Deterministic isolated replay Faulty core Look for divergence

Identifying Divergence Comparing all instructions  Large buffer requirement Faults corrupt software through Memory and control instructions Comparing all retiring store and branch is sufficient Thread Store Branch

Hardware Cost The first replay is native execution Minor support for collection of trace Deterministic replay is firmware emulated Requires minimal hardware support Replay threads in isolation  No need to capture memory orderings

Repeatedly execute on short traces Trace Buffer Size Long detection latency  large trace buffers (8MB/core) Need to reduce the size requirement  Iterative Diagnosis Algorithm Repeatedly execute on short traces e.g. 100,000 instrns Symptom detected Capture fault activating trace Deterministic isolated replay Faulty core

Experimental Methodology Microarchitecture-level fault injection GEMS timing models + Simics full-system simulation Six multithreaded applications on OpenSolaris Permanent fault models Stuck-at faults in latches of 7 arch structures Simulate impact of fault in detail for 20M instructions 20M instr Timing simulation If no symptom in 20M instr, run to completion Functional simulation Fault App masked, or symptom > 20M, or silent data corruption (SDC)

Experimental Methodology Iterative algorithm with 100,000 instrns in each iteration Until divergence or 20M instrns Deterministic replay is native execution not firmware emulated

Results: MSWAT Fault Detection Coverage: Over 98% faults detected Only 0.2% give Silent Data Corruptions (SDCs) Low SDC rate of 0.4% for transient faults as well 12% of detections occur in fault-free core Data sharing propagates faults from faulty to fault-free core

Results: MSWAT Fault Diagnosis (1/2) Over 95% of detected faults are successfully diagnosed All faults detected in fault-free core are diagnosed

Results: MSWAT Fault Diagnosis (2/2) Diagnosis Latency 97% diagnosed <10 million cycles (10ms in 1GHz system) 93% of these were diagnosed in 1 iteration Showing the effectiveness of iterative approach Trace Buffer size 96% require <200KB/core of loadLog & compareLog Trace buffer can easily fit in L2 or L3 cache

MSWAT Summary and Advantages Detection Coverage over 98% with low SDC rate of 0.2% Diagnosis High diagnosability over 95% with low diagnosis latency Firmware based replay reduces hw overhead Scalable - maximum of 3 replays for any system Iterative approach significantly reduces Trace buffer size (8MB/core → 400KB/core) Diagnosis latency

Future Work Extending this study to server applications Off-core faults Post-silicon debug and test Use faulty trace as test vector Validation on FPGA (w/ Michigan)

Thank you

Backup

Hardware support Detection Simple hardware detectors Hw support to ensure correct invocation of firmware Diagnosis Small hardware buffer for memory backed trace buffer Minor design changes to capture retiring instrns Hw checks to prevent trace corruption of good cores

Trace Buffer Collect loadLog and compareLog as a merged trace buffer A small FIFO that is memory backed Minimizes hardware cost Diagnosis can tolerate small performance slack Similar to one used in BugNet and SWAT’s TBFD Potential problem: Faulty core can corrupt trace buffer of other cores One solution: H/W bounds check – a core writes only to its trace region

Transient vs. SW Bug vs. Permanent Fault Symptom detected Screening phase Symptom? No Yes Transient h/w fault or Non-deterministic s/w bug Deterministic s/w bug or Permanent h/w fault Continue Execution

Multicore Fault Diagnosis Algorithm Example: A three core system Deterministic s/w bug or Permanent h/w fault A B C Trace generation phase TraceA TraceB TraceC Divergence Divergence A B C TraceC TraceA TraceB First replay phase Number of divergences? Zero One Deterministic s/w bug Second replay phase Faulty core identified A TraceB

Multicore Fault Diagnosis Algorithm Example: A three core system Deterministic s/w bug or Permanent h/w fault A B C TraceA TraceB TraceC Trace generation phase Divergence Divergence A B C TraceC TraceA TraceB First replay phase Number of divergences? Zero Two One Deterministic s/w bug Second replay phase Faulty core identified SWAT TBFD to diagnose -arch level faulty unit

Multicore Fault Diagnosis Algorithm First replay phase Trace generation phase Symptom detected Number of divergences? Zero Deterministic s/w bug Second replay phase One Two Faulty core identified SWAT TBFD to diagnose -arch level faulty unit

Reliability of firmware SWAT philosophy Low hw overhead  firmware based implementation How to guarantee correct execution of firmware on faulty hw? Detection Hw support ensures correct invocation of firmware Diagnosis Use hw check to not corrupt trace buffers of other cores Diagnosis outcome checked by two cores Prevents faulty core from subverting the process