Using Likely Program Invariants to Detect Hardware Errors Swarup Kumar Sahoo, Man-Lap Li, Pradeep Ramachandran, Sarita Adve, Vikram Adve, Yuanyuan Zhou.


Using Likely Program Invariants to Detect Hardware Errors Swarup Kumar Sahoo, Man-Lap Li, Pradeep Ramachandran, Sarita Adve, Vikram Adve, Yuanyuan Zhou Department of Computer Science University of Illinois, Urbana-Champaign

Motivation
In-the-field hardware failures are expected to become more pervasive
  – Traditional solutions (e.g., nMR) are too expensive
  => Need low-cost in-field detection, diagnosis, recovery, and repair
Two key observations
  – Handle only hardware faults that propagate to software
  – The fault-free case remains common and must incur low overhead
Watch for software anomalies (symptoms)
  – Observe simple symptoms for permanent and transient faults [ASPLOS ‘08]
  => SWAT: SoftWare Anomaly Treatment

Motivation – Improving SWAT
SWAT error detection coverage is excellent [ASPLOS ‘08]
  – Effective for faults affecting control flow and most pointer values
SWAT symptoms are ineffective if only data values are corrupted
  => Non-negligible Silent Data Corruptions (1.0% SDCs)
This work reduces SDCs for symptom-based detection
  – Uses software-level likely invariants

Likely Program Invariants
Original code: … x = … ; y = fun (x) …
Instrumented code: … x = … ; y = fun (x) ; check( 0 <= y <= 100 ) …
Training runs may determine that “y” lies between 0 and 100
  – Insert checks to monitor this likely invariant
A bit flip in the ALU may make the value of “y” > 100
  – Inserted checks will identify such faults
[Figure labels: ALU Fault, Register Fault]
Likely invariants: properties that hold on all training inputs and are expected to hold on others
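A minimal sketch of the idea on this slide, written in Python rather than as the compiler-inserted code the talk describes. The names `fun`, `train_range`, and `check` are hypothetical stand-ins, and the bit flip is simulated by setting one bit of the result.

```python
def fun(x):
    # Hypothetical stand-in for the slide's fun(x); results stay in 0..100.
    return x % 101

def train_range(training_inputs):
    """Derive a likely invariant MIN <= y <= MAX from training runs."""
    ys = [fun(x) for x in training_inputs]
    return min(ys), max(ys)

def check(y, lo, hi):
    """Return True when the monitored value satisfies the range invariant."""
    return lo <= y <= hi

lo, hi = train_range(range(0, 1000))   # training observes y in 0..100
good = fun(42)                          # fault-free value
faulty = good | (1 << 7)                # simulated bit flip in bit 7
```

Here the fault-free value passes the check, while the corrupted value falls outside the trained range and is flagged.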

False Positive Invariants
Original code: … y = sin (x) …
Instrumented code: … y = sin (x) ; check( 0 <= y <= 1 ) …
Training runs may determine that “y” lies between 0 and 1
For a particular input outside the training set
  – Value of “y” may be < 0
  – This violation is a false positive
False positive: a likely invariant that does not hold for a particular input
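The sin(x) case can be reproduced in a few lines. This is only an illustration, assuming a hypothetical training set of angles in [0, pi/2]; the helper name `train_range` is not from the talk.

```python
import math

def train_range(xs):
    """Learn a MIN/MAX range for y = sin(x) from the training inputs."""
    ys = [math.sin(x) for x in xs]
    return min(ys), max(ys)

# Hypothetical training inputs: 101 angles spread over [0, pi/2].
training = [i * math.pi / 200 for i in range(101)]
lo, hi = train_range(training)

# An input outside the training set, with no hardware fault involved.
y_unseen = math.sin(1.5 * math.pi)
violated = not (lo <= y_unseen <= hi)
```

The unseen input yields a negative value, so the check fires even though the hardware is fault-free, which is exactly the false positive case the slide describes.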

Challenges
Previous work
  – Likely invariants have been used for software debugging
  – Some work on hardware faults, but only for transient faults
Challenge 1
  – Are invariants effective for permanent faults? Which types of invariants?
Challenge 2
  – How to handle false positive invariants efficiently for permanent faults?
      Simple techniques like a pipeline flush will not work for software-level invariants
      Will need some form of checkpoint and rollback/replay mechanism
        – Expensive; cost of replay depends on detection latency
      Rollback/replay on the original core will not work with permanent faults

Summary of Contributions
First work to use likely invariants to detect permanent faults
First method to handle false positives efficiently for software-level invariant-based detection
  – Leverages the SWAT hardware diagnosis framework [Li et al., DSN ’08]
Full-system simulation for realistic programs
SDCs reduced by nearly 74%

Outline
Motivation and Likely Program Invariants
Invariant-based Detection Framework
Implementation Details
Experimental Results
Conclusion and Future Work

Invariant-based Detection Framework
Which types of invariants to use?
  – Value-based: ranges, multiple ranges, …?
  – Address-based?
  – Control-flow?
How to handle false positive invariants?

Which types of invariants to use?
Our focus is on data value corruptions
  – Need value-based invariants as a detection method
  – Many possible invariants; we started with the simplest likely invariants
Uses range-based likely invariants
  – Checks of the form MIN ≤ value ≤ MAX on data values
Advantages
  – Easily enforced with little overhead
  – Easily and efficiently generated
  – Composable, so training can be done in parallel
Disadvantages
  – Restrictive; does not capture general program properties
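The composability point can be made concrete: ranges learned from separate training runs combine by taking the smallest minimum and largest maximum per monitored site, so runs can proceed in parallel and be merged in any order. The sketch below assumes a hypothetical per-run map from a site name to a `(min, max)` pair; the site names and the `merge` helper are illustrative only.

```python
from functools import reduce

def merge(a, b):
    """Combine two {site: (min, max)} maps from independent training runs."""
    out = dict(a)
    for site, (lo, hi) in b.items():
        if site in out:
            out[site] = (min(out[site][0], lo), max(out[site][1], hi))
        else:
            out[site] = (lo, hi)
    return out

# Hypothetical ranges observed in three separate training runs.
run1 = {"store_1": (0, 40), "store_2": (-5, 5)}
run2 = {"store_1": (10, 100)}
run3 = {"store_2": (-8, 2), "store_3": (7, 7)}

merged = reduce(merge, [run1, run2, run3], {})
```

Because `min` and `max` are commutative and associative, merging the runs in a different order gives the same final ranges.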

How to identify false positives?
Assume a rollback/restart mechanism and a fault-free core
Handling false positives for permanent faults
[Figure: timeline. After "Inv Violation detected", execution is replayed on a fault-free core from the latest checkpoint (execution in absence of any fault); if the replay again reports "Inv Violation detected" => false positive]

How to limit false positives?
Train with many different inputs to reduce false positives
To limit the overhead due to rollback/replay
  – We observe that some of the invariants are sound invariants
  – Among the remaining invariants, very few static false positives for individual inputs
  – Disable static invariants found to be false positives
      Maximum number of rollbacks <= number of static false positives
      Limits overhead (max rollbacks found to be 7 for ref input in our apps)
  – We still have most of the invariants enabled for effective detection
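A small simulation illustrates why disabling a static invariant after its first false positive bounds the number of rollbacks. It assumes a fault-free run, so every violation is a false positive; the function name and the sequence of violated invariant ids are made up for the example.

```python
def count_rollbacks(violations):
    """Count rollbacks in a fault-free run.

    `violations` is the dynamic sequence of invariant ids whose checks fail.
    Each first failure of a static invariant costs one rollback/replay, after
    which that invariant is disabled and never triggers again.
    """
    disabled = set()
    rollbacks = 0
    for inv_id in violations:
        if inv_id in disabled:
            continue            # check is switched off, no rollback
        rollbacks += 1          # diagnosis requires a rollback/replay
        disabled.add(inv_id)    # found to be a false positive, disable it
    return rollbacks, disabled

# Hypothetical trace: three static invariants fail eight times in total.
trace = [3, 3, 3, 7, 3, 7, 9, 9]
rollbacks, disabled = count_rollbacks(trace)
```

Eight dynamic violations cost only three rollbacks here, one per static false positive invariant.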

False Positive Detection Methodology
Modified SWAT diagnosis module [Li et al., DSN ‘08]
Invariant violation detected => start diagnosis
Rollback to previous checkpoint, restart on original core
  – Inv violation doesn’t recur: transient h/w bug or non-deterministic s/w bug => continue execution
  – Inv violation recurs: deterministic s/w bug, false positive inv, or permanent h/w bug => rollback, restart on different core
      No violation: permanent defect in original core
      Violation: deterministic s/w bug or false positive inv => disable invariant, continue execution
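The diagnosis flow on this slide can be read as a two-step decision. The sketch below encodes it in Python under the assumption that each replay is modelled as a callable returning whether the violation recurred; `Outcome`, `diagnose`, and the callables are hypothetical names, not part of the SWAT diagnosis module.

```python
from enum import Enum

class Outcome(Enum):
    TRANSIENT_OR_NONDETERMINISTIC = "continue execution"
    PERMANENT_DEFECT = "permanent defect in original core"
    FALSE_POSITIVE_OR_SW_BUG = "disable invariant, continue execution"

def diagnose(replay_on_original_core, replay_on_other_core):
    """Classify an invariant violation.

    Each argument is a zero-argument callable that rolls back to the last
    checkpoint, replays, and returns True if the violation recurs.
    """
    if not replay_on_original_core():
        return Outcome.TRANSIENT_OR_NONDETERMINISTIC
    if not replay_on_other_core():
        # Recurs only on the original core: that core is defective.
        return Outcome.PERMANENT_DEFECT
    # Recurs on a different core too: not a hardware problem.
    return Outcome.FALSE_POSITIVE_OR_SW_BUG
```

The second replay is attempted only when the violation recurs on the original core, which matches the order of the steps in the flow.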

Template of Invariant Checking Code
Insert checks after the monitored value is produced
An array indexed by the invariant-id is used
  – Keeps track of found false positive invariants

    if ( ( value < min ) || ( value > max ) ) {   // This invariant is violated
        if ( FalsePosArray[Inv_Id] != true ) {    // Invariant not yet disabled
            if ( isFalsePos( Inv_Id ) )           // Perform diagnosis
                FalsePosArray[Inv_Id] = true;     // Disable the invariant
            // else hardware fault detected
        }
    }
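For readers who want to experiment with the template, here is a rough Python analogue. It is a sketch under assumptions: the class name, the string results, and the `is_false_pos` diagnosis hook are hypothetical, and the real checks are inserted by a compiler pass rather than called as a method.

```python
class InvariantChecker:
    def __init__(self, ranges, is_false_pos):
        self.ranges = ranges                     # list of (min, max), indexed by invariant id
        self.false_pos = [False] * len(ranges)   # mirrors FalsePosArray in the template
        self.is_false_pos = is_false_pos         # hypothetical diagnosis hook

    def check(self, inv_id, value):
        lo, hi = self.ranges[inv_id]
        if value < lo or value > hi:             # this invariant is violated
            if not self.false_pos[inv_id]:       # invariant not yet disabled
                if self.is_false_pos(inv_id):    # perform diagnosis
                    self.false_pos[inv_id] = True
                    return "false_positive_disabled"
                return "hardware_fault"
            return "ignored"                     # already known false positive
        return "ok"

# Usage: invariant 0 is diagnosed as a false positive, invariant 1 is not.
checker = InvariantChecker([(0, 100), (0, 1)], is_false_pos=lambda i: i == 0)
first = checker.check(0, 150)
second = checker.check(0, 150)
fault = checker.check(1, 5)
```

After the first diagnosed false positive, later violations of the same invariant are skipped without another diagnosis.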

iSWAT: Invariant-based Detection Framework
iSWAT = SWAT + invariant detection
SWAT symptoms [Li et al., ASPLOS ‘08]
  – Fatal-Trap
  – Application aborts
  – Hangs
  – High-OS


iSWAT: Implementation Details
iSWAT has two distinct phases
1. Training phase
   o Generation of invariant ranges using training inputs
2. Code generation phase
   o Generation of binary with invariant checking code inserted

iSWAT: Training Phase
[Figure labels: App; Compiler Pass written in LLVM (Invariant Generation); App with Invariant Monitoring Code; Training Runs; Ranges i/p #1 ... Ranges i/p #n; Invariant Ranges]
Invariant generation pass
  – Extracts invariants from training runs
  – Training set determined by accepted false positive rate
  – Invariants for stores of integers of 2/4/8 bytes, floats, and doubles

iSWAT: Code Generation Phase
[Figure labels: App; Invariant Ranges; Compiler Pass written in LLVM (Invariant Checking Code Generation); App with Invariant Checking Code]
Invariant insertion pass
  – Inserts invariant checking code into binary
  – Generated code monitors value ranges at runtime


Methodology-1
Simics+GEMS* full-system simulator: Solaris-9, SPARC V9
Stuck-at and bridging fault models
Structures
  – Decoder, Integer ALU, Register bus, Integer register, ROB, RAT, AGEN unit, FP ALU
Five applications: 4 SpecInt and 1 SpecFP
  – gzip, bzip2, mcf, parser, art
  – Training inputs comprised train, test, and external inputs
  – Ref input used for evaluation
6400 total fault injections
  – 5 apps * 40 points per app * 4 fault models * 8 structures
* Thanks to WISC GEMS group

Methodology-2
Metrics
  – False positives
  – SDCs
  – Latency
  – Overhead
Faults injected for 10M instructions using timing simulation
  – SDCs identified by running functional simulation to completion
  – Faults not injected after 10M instr => act as intermittents
  – Invariants not monitored after 10M instr => SDC counts are conservative
  – We consider faults identified after 10M instr as unrecoverable

False positives
False positive rate < 5%
Very few rollbacks to detect false positives (max 7 for ref input)
In the worst case, 231 rollbacks (for gzip)
False positive rate: % of static invariants that are false positives

SDCs
% of non-masked faults detected by each detection method
        Previous SWAT symptoms  Invariants  Unrecoverable  SDC
SWAT    96%                     N/A         4.0% (168)     0.74% (31)
iSWAT   89%                     7.7%        2.9% (120)     0.19% (08)
iSWAT detects many faults that go undetected in SWAT
In 10M instr
  – Reduction in unrecoverable faults: 28.6%
  – Reduction in SDCs: 74%
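The two reduction figures follow directly from the fault counts shown in parentheses on this slide (168 and 120 unrecoverable faults, 31 and 8 SDCs). A short check, with `reduction_pct` as a hypothetical helper name:

```python
def reduction_pct(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before

unrecoverable = reduction_pct(168, 120)   # SWAT vs. iSWAT unrecoverable counts
sdc = reduction_pct(31, 8)                # SWAT vs. iSWAT SDC counts

print(f"unrecoverable: {unrecoverable:.1f}%, SDC: {sdc:.1f}%")
```

Both results agree with the rounded values quoted on the slide.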

SDC Analysis - 1
Most effective in ALU, register, register bus units

SDC Analysis - 2
For the remaining SDCs, corrupted values are still within range
  – Faults result in slight value perturbations
  – Can potentially be reduced with better invariants
Most of the SDCs are due to bridging faults
In SDC cases, values mismatch in the lower-order bits
  – In most cases in the lowest 3 bits
Latency improvements are not significant
  – There is a 2%-3% improvement for various latency categories
  – More sophisticated invariants are needed

Overhead
Mean overhead on UltraSPARC-IIIi: 14%
Mean overhead on AMD Athlon: 5%
Not optimized
  – Overhead should be less due to parallelism

Summary of Results
False positive rate < 5% with only 12 training inputs
Reduction in SDCs: 74%
Low overhead: 5% to 14%

Conclusion and Future Work
Simple range-based value invariants
  – Reduce SDCs significantly
  – False positives are handled with low overhead
  – Low checking overhead
Investigation of more sophisticated invariants
  – More sophisticated value invariants
  – Address-based and control-flow-based invariants
Monitoring of other program values
Strategy to select the most effective invariants
Exploring hardware support to reduce overhead

Questions?

Back up slides

Coverage
iSWAT detects many faults that go undetected in SWAT
In 10M instr
  – Coverage improves from 96% to 97.2%
  – Reduction in unknowns: 28.6%
Most effective in ALU, register, register bus units
Coverage improvement of iSWAT over SWAT after 10M instructions
        Fatal-Trap-App  Fatal-Trap-OS  Hang-App  INV   High-OS  Unknown
SWAT    22.4%           25.4%          0.8%      -     23.3%    3.0%
iSWAT   21.2%           24.3%          0.5%      5.8%  21.1%    2.1%

Latency of Detection
Latency improvements are not significant
There is a 2%-3% improvement for various latency categories
More sophisticated invariants are needed
Latency improvement of iSWAT over SWAT
Latency  <1k    <10k   <100k  <1M    <10M
SWAT     41.1%  50.7%  81.0%  90.3%  98.7%
iSWAT    43.1%  53.4%  83.3%  92.7%  100.0%

Comparisons
Racunas
  – Uses hardware monitoring
  – Only for transient faults
  – Little checking overhead, but needs a lot of hardware
  – Lower coverage (50%-70%), as short detection latency is needed
Pattabiraman
  – Only for transient faults
  – No concrete solution for false positives
  – 45% h/w area overhead
  – 5% clock period slowdown
  – Overhead of extra check instructions?

Comparisons
Argus
  – Only works for simple cores
  – Technique doesn’t work with I/O, interrupts, exceptions, etc.
  – Area overhead of nearly 17%
  – Performance overhead of 4%
  – Some errors will go undetected
      Multi-bit errors in structures protected by parity
      Errors in unprotected areas
      Multiple-error scenarios
      Some memory access errors
      Errors hidden by aliasing
  – Argus h/w is unprotected => can cause false positives
  – Evaluation only with micro-benchmarks
  – Piece-by-piece solution rather than a uniform/integrated solution