MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

Slides:



Advertisements
Similar presentations
Remus: High Availability via Asynchronous Virtual Machine Replication
Advertisements

PEREGRINE: Efficient Deterministic Multithreading through Schedule Relaxation Heming Cui, Jingyue Wu, John Gallagher, Huayang Guo, Junfeng Yang Software.
RAMP Gold : An FPGA-based Architecture Simulator for Multiprocessors Zhangxi Tan, Andrew Waterman, David Patterson, Krste Asanovic Parallel Computing Lab,
NC STATE UNIVERSITY 1 Assertion-Based Microarchitecture Design for Improved Fault Tolerance Vimal K. Reddy Ahmed S. Al-Zawawi, Eric Rotenberg Center for.
UW-Madison Computer Sciences Multifacet Group© 2011 Karma: Scalable Deterministic Record-Replay Arkaprava Basu Jayaram Bobba Mark D. Hill Work done at.
Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors Abhishek Bhattacharjee Margaret Martonosi.
Simulation Fault-Injection & Software Fault-Tolerance
Recording Inter-Thread Data Dependencies for Deterministic Replay Tarun GoyalKevin WaughArvind Gopalakrishnan.
Helper Threads via Virtual Multithreading on an experimental Itanium 2 processor platform. Perry H Wang et. Al.
UW-Madison Computer Sciences Vertical Research Group© 2010 Relax: An Architectural Framework for Software Recovery of Hardware Faults Marc de Kruijf Shuou.
F ORMAL D IAGNOSIS OF H ARDWARE T RANSIENT E RRORS IN P ROGRAMS Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan T HE E LECTRICAL AND C OMPUTER.
Hardware Support for Spin Management in Overcommitted Virtual Machines Philip Wells Koushik Chakraborty Gurindar Sohi {pwells, kchak,
1 Cheriton School of Computer Science 2 Department of Computer Science RemusDB: Transparent High Availability for Database Systems Umar Farooq Minhas 1,
Hardware Fault Recovery for I/O Intensive Applications Pradeep Ramachandran, Intel Corporation, Siva Kumar Sastry Hari, NVDIA Manlap (Alex) Li, Latham.
EEC 688/788 Secure and Dependable Computing Lecture 12 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
Continuously Recording Program Execution for Deterministic Replay Debugging.
BugNet Continuously Recording Program Execution for Deterministic Replay Debugging Satish Narayanasamy Gilles Pokam Brad Calder.
What Great Research ?s Can RAMP Help Answer? What Are RAMP’s Grand Challenges ?
EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing Lecture 12 Wenbing Zhao Department of Electrical and Computer Engineering.
DoublePlay: Parallelizing Sequential Logging and Replay Kaushik Veeraraghavan Dongyoon Lee, Benjamin Wester, Jessica Ouyang, Peter M. Chen, Jason Flinn,
Justin Meza Qiang Wu Sanjeev Kumar Onur Mutlu Revisiting Memory Errors in Large-Scale Production Data Centers Analysis and Modeling of New Trends from.
DTHREADS: Efficient Deterministic Multithreading
Understanding Network Failures in Data Centers: Measurement, Analysis and Implications Phillipa Gill University of Toronto Navendu Jain & Nachiappan Nagappan.
Adaptive Video Coding to Reduce Energy on General Purpose Processors Daniel Grobe Sachs, Sarita Adve, Douglas L. Jones University of Illinois at Urbana-Champaign.
GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications Bo Fang , Karthik Pattabiraman, Matei Ripeanu, The University of British.
Towards a Hardware-Software Co-Designed Resilient System Man-Lap (Alex) Li, Pradeep Ramachandran, Sarita Adve, Vikram Adve, Yuanyuan Zhou University of.
NVSleep: Using Non-Volatile Memory to Enable Fast Sleep/Wakeup of Idle Cores Xiang Pan and Radu Teodorescu Computer Architecture Research Lab
Defining Anomalous Behavior for Phase Change Memory
SWAT: Designing Reisilent Hardware by Treating Software Anomalies Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup K. Sahoo, Siva Kumar Sastry Hari, Rahmet.
Parallelizing Security Checks on Commodity Hardware E.B. Nightingale, D. Peek, P.M. Chen and J. Flinn U Michigan.
- 1 - Dongyoon Lee †, Mahmoud Said*, Satish Narayanasamy †, Zijiang James Yang*, and Cristiano L. Pereira ‡ University of Michigan, Ann Arbor † Western.
Understanding the Propagation of Hard Errors to Software and Implications for Resilient System Design M. Li, P. Ramachandra, S.K. Sahoo, S.V. Adve, V.S.
Eliminating Silent Data Corruptions caused by Soft-Errors Siva Hari, Sarita Adve, Helia Naeimi, Pradeep Ramachandran, University of Illinois at Urbana-Champaign,
ReSlice: Selective Re-execution of Long-retired Misspeculated Instructions Using Forward Slicing Smruti R. Sarangi, Wei Liu, Josep Torrellas, Yuanyuan.
Using Likely Program Invariants to Detect Hardware Errors Swarup Kumar Sahoo, Man-Lap Li, Pradeep Ramachandran, Sarita Adve, Vikram Adve, Yuanyuan Zhou.
1 Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly Koushik Chakraborty Philip Wells Gurindar Sohi
We are living in a New Virtualized World Sorav Bansal IIT Delhi Feb 26, 2011.
SWAT: Designing Resilient Hardware by Treating Software Anomalies Lei Chen, Byn Choi, Xin Fu, Siva Hari, Man-lap (Alex) Li, Pradeep Ramachandran, Swarup.
Application-Aware SoftWare AnomalyTreatment (SWAT) of Hardware Faults Byn Choi, Siva Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup Sahoo, Sarita.
Relyzer: Exploiting Application-level Fault Equivalence to Analyze Application Resiliency to Transient Faults Siva Hari 1, Sarita Adve 1, Helia Naeimi.
On-Demand Dynamic Software Analysis Joseph L. Greathouse Ph.D. Candidate Advanced Computer Architecture Laboratory University of Michigan December 12,
DeNovoSync: Efficient Support for Arbitrary Synchronization without Writer-Initiated Invalidations Hyojin Sung and Sarita Adve Department of Computer Science.
StealthTest: Low Overhead Online Software Testing Using Transactional Memory Jayaram Bobba, Weiwei Xiong*, Luke Yen †, Mark D. Hill, and David A. Wood.
SWAT: Designing Resilient Hardware by Treating Software Anomalies Byn Choi, Siva Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup Sahoo, Sarita Adve,
BarrierWatch: Characterizing Multithreaded Workloads across and within Program-Defined Epochs Socrates Demetriades and Sangyeun Cho Computer Frontiers.
Hardware Architectures for Power and Energy Adaptation Phillip Stanley-Marbell.
Preserving Application Reliability on Unreliable Hardware Siva Hari Department of Computer Science University of Illinois at Urbana-Champaign.
Harnessing Soft Computation for Low-Budget Fault Tolerance Daya S Khudia Scott Mahlke Advanced Computer Architecture Laboratory University of Michigan,
Low-cost Program-level Detectors for Reducing Silent Data Corruptions Siva Hari †, Sarita Adve †, and Helia Naeimi ‡ † University of Illinois at Urbana-Champaign,
Adding Algorithm Based Fault-Tolerance to BLIS Tyler Smith, Robert van de Geijn, Mikhail Smelyanskiy, Enrique Quintana-Ortí 1.
Flashback : A Lightweight Extension for Rollback and Deterministic Replay for Software Debugging Sudarshan M. Srinivasan, Srikanth Kandula, Christopher.
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Efficient Soft Error.
Software Managed Resiliency Siva Hari Lei Chen, Xin Fu, Pradeep Ramachandran, Swarup Sahoo, Rob Smolenski, Sarita Adve Department of Computer Science University.
KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association SYSTEM ARCHITECTURE GROUP DEPARTMENT OF COMPUTER.
Preserving Application Reliability on Unreliable Hardware Siva Hari Adviser: Sarita Adve Department of Computer Science University of Illinois at Urbana-Champaign.
GangES: Gang Error Simulation for Hardware Resiliency Evaluation Siva Hari 1, Radha Venkatagiri 2, Sarita Adve 2, Helia Naeimi 3 1 NVIDIA Research, 2 University.
University of Michigan Electrical Engineering and Computer Science 1 Low Cost Control Flow Protection Using Abstract Control Signatures Daya S Khudia and.
Approximate Computing: (Old) Hype or New Frontier? Sarita Adve University of Illinois, EPFL Acks: Vikram Adve, Siva Hari, Man-Lap Li, Abdulrahman Mahmoud,
MSWAT: Hardware Fault Detection and Diagnosis for Multicore Systems
nZDC: A compiler technique for near-Zero silent Data Corruption
SWAT: Designing Resilient Hardware by Treating Software Anomalies
Daya S Khudia, Griffin Wright and Scott Mahlke
Hwisoo So. , Moslem Didehban#, Yohan Ko
Fault Injection: A Method for Validating Fault-tolerant System
Soft Error Detection for Iterative Applications Using Offline Training
InCheck: An In-application Recovery Scheme for Soft Errors
Hardware Counter Driven On-the-Fly Request Signatures
Co-designed Virtual Machines for Reliable Computer Systems
Dynamic Verification of Sequential Consistency
Abstractions for Fault Tolerance
Presentation transcript:

mSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi, Sarita Adve Department of Computer Science University of Illinois at Urbana-Champaign

Motivation Hardware will fail in-the-field due to several reasons  Need in-field detection, diagnosis, repair, and recovery Reliability problem pervasive across many markets –Traditional redundancy solutions (e.g., nMR) too expensive  Need low-cost solutions for multiple failure sources  Must incur low area, performance, power overhead Transient errors (High-energy particles ) Wear-out (Devices are weaker) Design Bugs … and so on

SWAT: Low-Cost Hardware Reliability SWAT Observations Need handle only hardware faults that propagate to software Fault-free case remains common, must be optimized SWAT Approach  Watch for software anomalies (symptoms) Zero to low overhead “always-on” monitors Diagnose cause after symptom detected May incur high overhead, but rarely invoked 3

SWAT Framework Components FaultErrorSymptom detected Recovery DiagnosisRepair Checkpoint Detectors with simple hardware [Li et al. ASPLOS’08] µarch-level Fault Diagnosis (TBFD) [Li et al. DSN’09] 4

Challenge FaultErrorSymptom detected Recovery DiagnosisRepair Checkpoint Detectors with simple hardware [Li et.al. ASPLOS’08] µarch-level Fault Diagnosis (TBFD) [Li et.al. DSN’09] 5 Shown to work well for single-threaded apps Does SWAT approach work on multithreaded apps?

Challenge: Data sharing in multithreaded apps Multithreaded apps share data among threads Does symptom detection work? Symptom causing core may not be faulty –How to diagnose faulty core? 6 Memory Store Symptom Detection on a fault-free core Load Core 1 Core 2 Error Fault

Contributions Evaluate SWAT detectors on multithreaded apps –Low Silent Data Corruption rate for multithreaded apps –Observed symptom from fault-free cores Novel fault diagnosis for multithreaded apps –Identifies the faulty core despite error propagation –Provides high diagnosability 7

Outline Motivation mSWAT Detection mSWAT Diagnosis Results Summary and Future Work 8

mSWAT Fault Detection SWAT Detectors: –Low-cost monitors to detect anomalous sw behavior –Incur near-zero perf overhead in fault-free operation Symptom detectors provide low Silent Data Corruption rate Fatal Traps Division by zero, RED state, etc. Kernel Panic OS enters panic State due to fault High OS High contiguous OS activity Hangs Simple HW hang detector App Abort Application abort due to fault

SWAT Fault Diagnosis Rollback/replay on same/different core –Single-threaded application on multicore No symptom Symptom Deterministic s/w or Permanent h/w bug Symptom detected Faulty Rollback on faulty core Rollback/replay on good core Continue Execution Transient or Non-deterministic s/w bug Symptom Permanent h/w fault, needs repair! No symptom Deterministic s/w bug, send to s/w layer 10 Good

Challenges Rollback/replay on same/different core –Single-threaded application on multicore No symptom Symptom Deterministic s/w or Permanent h/w bug Symptom detected Rollback on faulty core Rollback/replay on good core Continue Execution Transient or Non-deterministic s/w bug Symptom Permanent h/w fault, needs repair! No symptom Deterministic s/w bug, send to s/w layer 11 FaultyGood No known good cores available Faulty core is unknown How to replay multithreaded apps?

Extending SWAT Diagnosis to Multithreaded Apps Assumptions: In-core faults, single core fault model Naïve extension – N known good cores to replay the trace Too expensive – area Requires full-system deterministic replay Simple optimization – One spare core Not scalable, requires N full-system deterministic replays High hardware overhead – requires a spare core Single point of failure – spare core 12 Faulty core is C2 C1C2C3 No Symptom Detected Spare C1C2C3 Symptom Detected Spare C1C2C3 Symptom Detected Spare

mSWAT Diagnosis - Key Ideas 13 Challenges Multithreaded applications Full-system deterministic replay No known good core Emulated TMR TATA TBTB TCTC TDTD TATA TATA TBTB TCTC TDTD TATA A B C D TATA Key Ideas Isolated deterministic replay

mSWAT Diagnosis - Key Ideas 14 Challenges Multithreaded applications Full-system deterministic replay No known good core Emulated TMR TATA TBTB TCTC TDTD TATA TATA TBTB TCTC TDTD A B C D Key Ideas Isolated deterministic replay TATA TBTB TATA TBTB

mSWAT Diagnosis - Key Ideas 15 Challenges Multithreaded applications Full-system deterministic replay No known good core Emulated TMR TATA TBTB TCTC TDTD TATA TATA TBTB TCTC TDTD A B C D Key Ideas Isolated deterministic replay TATA TBTB TCTC TCTC TATA TBTB

mSWAT Diagnosis - Key Ideas 16 Challenges Multithreaded applications Full-system deterministic replay No known good core Emulated TMR TATA TBTB TCTC TDTD TATA TATA TBTB TCTC TDTD A B C D Key Ideas Isolated deterministic replay TDTD TATA TBTB TCTC TCTC TDTD TATA TBTB Maximum 3 replays

Multicore Fault Diagnosis Algorithm Overview 17 Replay & capture fault activating trace TATA TBTB TCTC TDTD A B C D Example Symptom detected Diagnosis Deterministically replay captured trace

Multicore Fault Diagnosis Algorithm Overview 18 Replay & capture fault activating trace Symptom detected Diagnosis Deterministically replay captured trace TATA TBTB TCTC TDTD A B C D Example

Multicore Fault Diagnosis Algorithm Overview 19 Replay & capture fault activating trace Symptom detected Diagnosis Deterministically replay captured trace TATA TBTB TCTC TDTD A B C D TDTD TATA TBTB TCTC Example Look for divergence

Multicore Fault Diagnosis Algorithm Overview 20 Replay & capture fault activating trace Symptom detected Diagnosis Deterministically replay captured trace Look for divergence TATA TBTB TCTC TDTD A B C D TDTD TATA TBTB TCTC Divergence Example TATA A B C D Faulty core is A Fault-free Cores Divergence Faulty core

Digging Deeper 21 Symptom detected Replay & capture fault activating trace Isolated deterministic replay Faulty core Look for divergence What info to capture to enable isolated deterministic replay? How to identify divergence? Hardware costs?

Enabling Isolated Deterministic Replay 22 Thread Input to thread Ld Recording thread inputs sufficient – similar to BugNet Record all retiring loads values

Digging Deeper (Contd.) 23 Symptom detected Replay & capture fault activating trace Isolated deterministic replay Faulty core Look for divergence How to identify divergence? Trace Buffer What info to capture to enable isolated deterministic replay? Hardware costs?

Identifying Divergence 24 Thread Comparing all instructions  Large buffer requirement Faults corrupt software through memory and control instrns Capture memory and control instructions Store Load Branch Store

Digging Deeper (Contd.) 25 Symptom detected Replay & capture fault activating trace Isolated deterministic replay Faulty core Look for divergence How to identify divergence? Trace Buffer What info to capture to enable isolated deterministic replay? Hardware costs?

How Big? Hardware Costs 26 Native Execution Memory Backed Log Small hardware support Firmware Emulation Minor support for firmware reliability Symptom detected Replay & capture fault activating trace Isolated deterministic replay Faulty core Look for divergence Trace Buffer What if the faulty core subverts the process? Key Idea: On a divergence two cores take over

Trace Buffer Size Long detection latency  large trace buffers (8MB/core) –Need to reduce the size requirement  Iterative Diagnosis Algorithm 27 Repeatedly execute on short traces e.g. 100,000 instrns Symptom detected Replay & capture fault activating trace Isolated deterministic replay Faulty core Look for divergence Trace Buffer

Experimental Methodology Microarchitecture-level fault injection –GEMS timing models + Simics full-system simulation –Six multithreaded applications on OpenSolaris  4 Multimedia apps and 1 each from SPLASH and PARSEC –4 core system running 4-threades apps Faults in latches of 7  arch units –Permanent (stuck-at) and transients faults 28

Experimental Methodology Detection: Metrics: SDC Rate, detection latency Diagnosis: Iterative algorithm with 100,000 instrns in each iteration Until divergence or 20M instrns Deterministic replay is native execution Not firmware emulated Metrics: Diagnosability, overheads 29 10M instr Timing simulation If no symptom in 10M instr, run to completion Functional simulation Fault Masked or Silent Data Corruption (SDC)

Results: mSWAT Detection Summary SDC Rate: Only 0.2% for permanents & 0.55% for transients Detection Latency: Over 99% detected within 10M instrns 30 Detected

Results: mSWAT Detection Summary SDC Rate: Only 0.2% for permanents & 0.55% for transients Detection Latency: Over 99% detected within 10M instrns % detected in a good core

Results: mSWAT Diagnosability Over 95% of detected faults are successfully diagnosed All faults detected in fault-free core are diagnosed Undiagnosed faults: 88% did not activate faults

Results: mSWAT Diagnosis Overheads Diagnosis Latency –98% diagnosed <10 million cycles (10ms in 1GHz system) –93% were diagnosed in 1 iteration  Iterative approach is effective Trace Buffer size –96% require <400KB/core  Trace buffer can easily fit in L2 or L3 cache 33

mSWAT Summary Detection: Low SDC rate, detection latency Diagnosis – identifying the faulty core –Challenges: no known good core, deterministic replay –High diagnosability with low diagnosis latency –Low Hardware overhead - Firmware based implementation –Scalable – maximum 3 replays for any system Future Work: –Reducing SDCs, detection latency, recovery overheads –Extending to server apps; off-core faults –Validation on FPGAs (w/ Michigan) 34