University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Efficient Soft Error.

Slides:



Advertisements
Similar presentations
Link-Time Path-Sensitive Memory Redundancy Elimination Manel Fernández and Roger Espasa Computer Architecture Department Universitat.
Advertisements

IMPACT Second Generation EPIC Architecture Wen-mei Hwu IMPACT Second Generation EPIC Architecture Wen-mei Hwu Department of Electrical and Computer Engineering.
Mehmet Can Vuran, Instructor University of Nebraska-Lincoln Acknowledgement: Overheads adapted from those provided by the authors of the textbook.
Quantitative Analysis of Control Flow Checking Mechanisms for Soft Errors Aviral Shrivastava, Abhishek Rhisheekesan, Reiley Jeyapaul, and Carole-Jean Wu.
8. Code Generation. Generate executable code for a target machine that is a faithful representation of the semantics of the source code Depends not only.
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science University of Michigan.
University of Michigan Electrical Engineering and Computer Science 1 A Distributed Control Path Architecture for VLIW Processors Hongtao Zhong, Kevin Fan,
UPC Microarchitectural Techniques to Exploit Repetitive Computations and Values Carlos Molina Clemente LECTURA DE TESIS, (Barcelona,14 de Diciembre de.
This project and the research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/ ] under.
Instruction-Level Parallelism (ILP)
IVF: Characterizing the Vulnerability of Microprocessor Structures to Intermittent Faults Songjun Pan 1,2, Yu Hu 1, and Xiaowei Li 1 1 Key Laboratory of.
Using Hardware Vulnerability Factors to Enhance AVF Analysis Vilas Sridharan RAS Architecture and Strategy AMD, Inc. International Symposium on Computer.
Limits on ILP. Achieving Parallelism Techniques – Scoreboarding / Tomasulo’s Algorithm – Pipelining – Speculation – Branch Prediction But how much more.
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science 1 Shoestring: Probabilistic.
CS 7810 Lecture 25 DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design T. Austin Proceedings of MICRO-32 November 1999.
NATW 2008 Using Implications for Online Error Detection Nuno Alves, Jennifer Dworak, R. Iris Bahar Division of Engineering Brown University Providence,
Mitigating the Performance Degradation due to Faults in Non-Architectural Structures Constantinos Kourouyiannis Veerle Desmet Nikolas Ladas Yiannakis Sazeides.
DRACO Architecture Research Group. DSN, Edinburgh UK, Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance.
Transient Fault Tolerance via Dynamic Process-Level Redundancy Alex Shye, Vijay Janapa Reddi, Tipp Moseley and Daniel A. Connors University of Colorado.
UPC Reducing Misspeculation Penalty in Trace-Level Speculative Multithreaded Architectures Carlos Molina ψ, ф Jordi Tubella ф Antonio González λ,ф ISHPC-VI,
LIFT: A Low-Overhead Practical Information Flow Tracking System for Detecting Security Attacks Feng Qin, Cheng Wang, Zhenmin Li, Ho-seop Kim, Yuanyuan.
Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt.
Cost-Efficient Soft Error Protection for Embedded Microprocessors
University of Michigan Electrical Engineering and Computer Science 1 StageNet: A Reconfigurable CMP Fabric for Resilient Systems Shantanu Gupta Shuguang.
University of Michigan Electrical Engineering and Computer Science 1 A Microarchitectural Analysis of Soft Error Propagation in a Production-Level Embedded.
Efficient Software-Based Fault Isolation—sandboxing Presented by Carl Yao.
TASK ADAPTATION IN REAL-TIME & EMBEDDED SYSTEMS FOR ENERGY & RELIABILITY TRADEOFFS Sathish Gopalakrishnan Department of Electrical & Computer Engineering.
Presenter: Jyun-Yan Li Multiplexed redundant execution: A technique for efficient fault tolerance in chip multiprocessors Pramod Subramanyan, Virendra.
Transient Fault Detection via Simultaneous Multithreading Shubhendu S. Mukherjee VSSAD, Alpha Technology Compaq Computer Corporation.
University of Michigan Electrical Engineering and Computer Science 1 Dynamic Acceleration of Multithreaded Program Critical Paths in Near-Threshold Systems.
Assuring Application-level Correctness Against Soft Errors Jason Cong and Karthik Gururaj.
IVEC: Off-Chip Memory Integrity Protection for Both Security and Reliability Ruirui Huang, G. Edward Suh Cornell University.
INTRODUCTION Crusoe processor is 128 bit microprocessor which is build for mobile computing devices where low power consumption is required. Crusoe processor.
Speculative Software Management of Datapath-width for Energy Optimization G. Pokam, O. Rochecouste, A. Seznec, and F. Bodin IRISA, Campus de Beaulieu
Presenter: Jyun-Yan Li Effective Software-Based Self-Test Strategies for On-Line Periodic Testing of Embedded Processors Antonis Paschalis Department of.
Eliminating Silent Data Corruptions caused by Soft-Errors Siva Hari, Sarita Adve, Helia Naeimi, Pradeep Ramachandran, University of Illinois at Urbana-Champaign,
Architectural Optimizations Ed Carlisle. DARA: A LOW-COST RELIABLE ARCHITECTURE BASED ON UNHARDENED DEVICES AND ITS CASE STUDY OF RADIATION STRESS TEST.
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science 1 Encore: Low-Cost,
1 Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly Koushik Chakraborty Philip Wells Gurindar Sohi
ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Availability Copyright 2004 Daniel J. Sorin Duke University.
Relyzer: Exploiting Application-level Fault Equivalence to Analyze Application Resiliency to Transient Faults Siva Hari 1, Sarita Adve 1, Helia Naeimi.
Advanced Computer Architecture Lab University of Michigan Compiler Controlled Value Prediction with Branch Predictor Based Confidence Eric Larson Compiler.
U NIVERSITY OF D ELAWARE C OMPUTER & I NFORMATION S CIENCES D EPARTMENT Optimizing Compilers CISC 673 Spring 2009 Overview of Compilers and JikesRVM John.
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science 1 Bundled Execution.
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University IWPSE 2003 Program.
Transmeta’s New Processor Another way to design CPU By Wu Cheng
Using Loop Invariants to Detect Transient Faults in the Data Caches Seung Woo Son, Sri Hari Krishna Narayanan and Mahmut T. Kandemir Microsystems Design.
11 Online Computing and Predicting Architectural Vulnerability Factor of Microprocessor Structures Songjun Pan Yu Hu Xiaowei Li {pansongjun, huyu,
Harnessing Soft Computation for Low-Budget Fault Tolerance Daya S Khudia Scott Mahlke Advanced Computer Architecture Laboratory University of Michigan,
Methodology to Compute Architectural Vulnerability Factors Chris Weaver 1, 2 Shubhendu S. Mukherjee 1 Joel Emer 1 Steven K. Reinhardt 1, 2 Todd Austin.
A Binary Agent Technology for COTS Software Integrity Anant Agarwal Richard Schooler InCert Software.
Low-cost Program-level Detectors for Reducing Silent Data Corruptions Siva Hari †, Sarita Adve †, and Helia Naeimi ‡ † University of Illinois at Urbana-Champaign,
Evaluating the Fault Tolerance Capabilities of Embedded Systems via BDM M. Rebaudengo, M. Sonza Reorda Politecnico di Torino Dipartimento di Automatica.
Static Analysis to Mitigate Soft Errors in Register Files Jongeun Lee, Aviral Shrivastava Compiler Microarchitecture Lab Arizona State University, USA.
GangES: Gang Error Simulation for Hardware Resiliency Evaluation Siva Hari 1, Radha Venkatagiri 2, Sarita Adve 2, Helia Naeimi 3 1 NVIDIA Research, 2 University.
University of Michigan Electrical Engineering and Computer Science Dynamic Voltage/Frequency Scaling in Loop Accelerators using BLADES Ganesh Dasika 1,
University of Michigan Electrical Engineering and Computer Science 1 Low Cost Control Flow Protection Using Abstract Control Signatures Daya S Khudia and.
Qin Zhao1, Joon Edward Sim2, WengFai Wong1,2 1SingaporeMIT Alliance 2Department of Computer Science National University of Singapore
Memory Protection through Dynamic Access Control Kun Zhang, Tao Zhang and Santosh Pande College of Computing Georgia Institute of Technology.
nZDC: A compiler technique for near-Zero silent Data Corruption
Daya S Khudia, Griffin Wright and Scott Mahlke
RegLess: Just-in-Time Operand Staging for GPUs
Hwisoo So. , Moslem Didehban#, Yohan Ko
Fault Injection: A Method for Validating Fault-tolerant System
Ann Gordon-Ross and Frank Vahid*
Soft Error Detection for Iterative Applications Using Offline Training
NEMESIS: A Software Approach for Computing in Presence of Soft Errors
InCheck: An In-application Recovery Scheme for Soft Errors
Sampoorani, Sivakumar and Joshua
Software Techniques for Soft Error Resilience
Presentation transcript:

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Efficient Soft Error Protection for Commodity Embedded Microprocessors using Profile Information Daya S Khudia, Griffin Wright and Scott Mahlke Computer Science and Engineering University of Michigan Ann Arbor

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Soft Errors Soft errors, also called single-event upsets(SEUs) ► Occur because of Neutrons from cosmic rays Alpha particles from packaging materials Voltage fluctuations or signal interference Parameters affecting soft error rates ► Technology scaling ► Voltage scaling Image credit: Certichip 2

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Soft error rate (SER) Past Present Future Aggressive voltage scaling (near-threshold computing) One failure per MONTH per 100 chips One failure per DAY per 100 chips One failure per DAY per chip [Feng’10, Shivakumar’02] At high error rates, mainstream systems can experience unacceptable soft errors There is a need for mainstream solutions 3

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science  Traditional dual/triple – modular redundancy  Run on separate hardware and compare results  Mission-critical reliability w/ high hardware costs [IBM Z-series, HP NonStop]  Traditional dual/triple – modular redundancy  Run on separate hardware and compare results  Mission-critical reliability w/ high hardware costs [IBM Z-series, HP NonStop]  Utilize multiple threads (temporal) instead of separate hardware (spatial)  Retain high coverage but sacrifice performance costs to save area [AR-SMT, Reunion]  Utilize multiple threads (temporal) instead of separate hardware (spatial)  Retain high coverage but sacrifice performance costs to save area [AR-SMT, Reunion]  Perform selective checking  software invariants  critical μarch structures [ARGUS, DIVA, Reddy:DSN`08]  Perform selective checking  software invariants  critical μarch structures [ARGUS, DIVA, Reddy:DSN`08]  Redundant execution in a single- threaded context  compiler interleaves original and redundant instructions  “tunable” coverage [SWIFT, EDDI]  Redundant execution in a single- threaded context  compiler interleaves original and redundant instructions  “tunable” coverage [SWIFT, EDDI]  Relies on anomalous behavior to identify faults  extremely cheap  decent coverage [RESTORE, SWAT]  Relies on anomalous behavior to identify faults  extremely cheap  decent coverage [RESTORE, SWAT]  Bridge the gap between symptom- based schemes and instruction duplication  “reliability for the masses”  sacrifice a little on coverage to maintain very low costs  Bridge the gap between symptom- based schemes and instruction duplication  “reliability for the masses”  sacrifice a little on coverage to maintain very low costs Traditional Architectural Solutions Increasing Fault Coverage Increasing Overheads (area, power, performance, etc.) n-Modular Redundancy Redundant Multi-threading Invariant Checking Symptom-based Shoestring Instruction Duplication Performance Overhead and Fault Coverage Target Performance Overhead and Fault Coverage Target 4

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Target solution Fault coverage from different sources of masking (75% to 92%) Proposed Solution Increasing Overheads (performance) Instruction duplication-based detection Symptom-based detection Hardware exceptions Branch mispredicts Cache misses Target solution Increasing Fault Coverage Provide affordable reliability for commodity systems on a cheap budget  Exploit cheap symptom-based fault detection  Judiciously apply sw-level instruction duplication to improve coverage Provide affordable reliability for commodity systems on a cheap budget  Exploit cheap symptom-based fault detection  Judiciously apply sw-level instruction duplication to improve coverage 5

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Contributions Exploit synergy between symptoms and SW-duplication A software solution to generate intelligent code to detect soft errors ► No user annotations are required Profile based intelligence in the analysis ► Memory profiling and Edge profiling Some instructions are more important than others ► Value profiling for generating more software symptoms Exploit statistical invariance 6

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science 7 Outline Background System level overview Intelligent duplication Profiling techniques Experimental evaluation Conclusions

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science System Overview Intelligent Duplication  Instrument and get profile data for  edge-profiling  memory profiling for alias analysis  value profiling Profiling  Analyze program structure  Load profile data and perform selective duplication Operating System Physical Hardware  Trigger Lightweight recovery based on  selective symptoms (hardware exceptions)  comparison fail in duplicated code Runtime Compilation 8

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Compilation Code analysis and intelligent duplication ► can be used with code written in various languages ► Is independent of target machine For our Low-cost solution, we target an ARM backend intermediate representation (IR) Code analysis and intelligent duplication (IR to IR) Code generation Application source code Application binary analyses and optimizations ClassificationAnalysisDuplication 9

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Baseline Classification and Analysis High value instructions: ► Likely to produce corrupted program output if they consume corrupted input Symptom-generating instructions: ► Likely to produce symptom if they consume corrupted inputs Safe Instructions: ► Normally covered by symptom generating instructions Refs: Shoestring, EDDI, SWIFT 10 ld1 call printf (---, op1, ---) op1 = ld1 + 1 op1 ld1 = load addr --- addr = addr1 + 8 Safe Symp-Gen High Value

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Instruction Duplication Recursively duplicate instructions starting from operands of high-value instructions ► Stop if Already duplicated Safe No more producers == Recovery or continue execution original instrs duplicated cmps and branches Safe Symptom generating High value

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Baseline Coverage and Overhead 90% of fault coverage at 40.50% overhead Goal: Reduce overhead without affecting fault coverage 12

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Profile Based Duplication Memory Profiling ► Silent stores Need not be protected because expected to write the same value again ► Get load/store alias information Edge Profiling ► Do not protect an infrequently executed instruction by duplicating frequently executing instructions Value Profiling ► Use the statistical invariance to generate more software symptoms By incorporating dynamic behavior, perform intelligent duplication 13

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Refined Duplication Process ld1 call printf (---, op1, ---) op1 = ld1 + 1 op1 ld1 = load addr Store dt1, addr st1 dt1 = incr + 1 D op1 D op1 = ld1 + 1 cmpD st1 D st1 = D incr + 1 cmp Trigger recovery --- br original instrs duplicated cmps and branches Trigger recovery br F T F T --- Control flow Data flow Dependence through memory lib calls Only consider lib calls 14

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Memory Profiling: Silent Stores A store is silent if it writes the already existing value at memory location A large fraction of silent stores exist and can be exploited for intelligent code duplication 15

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Exploiting Silent Stores ld1 call printf (---, op1, ---) op1 = ld1 + 1 op1 ld1 = load addr store dt1, addr st1 dt1 = incr + 1 D op1 D op1 = ld1 + 1 cmpD st1 D st1 = D incr + 1 cmp Trigger recovery --- br original instrs duplicated cmps and branches Trigger recovery br F T F T --- Silent stores are expected to write the same value again soon silent store Control flow Data flow 16

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Edge Profiling: Recursive Duplication Break Do not protect a non frequently executed instruction by duplicating a frequently executed instruction BB0: BB1: BB2: Control flow Data flow BB3:BB4: 10 17

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Value Profiling What are the most frequently value produced by an instruction? If a value is produced more than 99% of the times, use that value for symptom generation addxor Most of the times, result is 0 Most of the times, result is 71 18

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Creating Software Symptoms If in a chain of instruction, last instruction produces the same value ► Insert the comparison with the frequently generated value ---op--- Generates the same value very frequently 19

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science --- Software Symptom Generation op2 call printf (---, op1, ---) op1 = op2 + 1 op1--- op2 = op3 * op4 D op1 D op1 = D op2 + 1 cmp D op2 = D op3 * D op4 cmp Trigger recovery br original instrs duplicated cmps and branches Trigger recovery br F T F --- 0D op2 --- T Produces 0 more than 99% 20

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Evaluation Methodology Program analysis, Profiling and intelligent duplication ► Implemented as compiler pass in the LLVM compiler Input sensitivity of profiling ► train input of SPECINT2K for training ► ref input for the actual fault injection runs Statistical fault injection (SFI) experiments ► GEM5 simulator in ARM syscall emulation mode Random (single) bit flip in physical register file ► Simulated entire benchmarks after fault injection ► Log files analyzed for results classification 21

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Fault Injection Outcome Classification Masked ► No corruption in the program output SWDetects ► Detected by duplication Covered by symptoms ► Produces a symptom such as page fault in 1000 cycles of fault injection Failures ► Fail status on program termination or program did not terminate in 20 Hrs. SDCs (Silent Data Corruptions) ► Fault injections which results in user visible corruptions 22

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Performance Overhead 23 Over 40% reduction in overhead

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Fault Coverage 24

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Conclusions Proposed software-only low overhead fault detection technique ► yields intelligent and efficiently duplicated code Profile based analysis ► Silent stores need not be protected ► Don’t protect infrequently executed instructions by duplicating frequently executed instructions ► Value profiling can be used to generate software symptoms Intelligent duplication yields ► Over 40% reduction in overhead ► Statistical insignificant difference in fault coverage 25

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science 26