DESIGN AND EVALUATION OF HYBRID FAULT-DETECTION SYSTEMS Qing Xu Kevin Wang.


OUTLINE
- Background
- Motivation
- Key Ideas
- Introduction to CRAFT
- Summary and Discussion Points

BACKGROUND
Smaller and Faster Transistors
- Lower threshold voltage
- Tighter noise margins
- Less reliable
Transient Faults (for example, an alpha particle strike)
Results
- Incorrect program execution
Recovery: REDUNDANCY, either Software Only or Hardware Only
(Figure: the same program shown twice: int main() { cout << "Hello\n"; } and int main() { cout << "Hello\n"; })

MOTIVATION AND GOAL
Software Only
- Inadequate coverage
- Slow
Hardware Only
- Large overhead/area
- High cost
Hybrid Solution
- Better reliability and performance
- Lower hardware area and cost

KEY IDEA: COMPILER-ASSISTED FAULT TOLERANCE (CRAFT)
Characteristics:
- Based on a software technique
- Minimal hardware adaptations
- Takes advantage of both software and hardware solutions
Benefits:
- Nearly perfect reliability
- Low performance degradation
- Low hardware cost

CRAFT: HYBRID OF EXISTING METHODS
Hardware Methods
- Redundant Multithreading Technique (RMT)
- Error Correcting Codes (ECC)
Advantages: almost-perfect fault coverage, low performance cost
Software Methods
- Software Implemented Fault Tolerance (SWIFT)
- Error Detection by Duplicated Instructions (EDDI)
Advantages: high fault coverage, modest performance cost, zero hardware cost

EXISTING METHOD: HARDWARE RMT
- RMT makes use of SMT resources through loosely synchronized redundant threads
- Components not covered by redundant execution must employ alternative techniques, such as Error Correction Code (ECC)
(Figure: Redundant Multithreading (RMT) with an original thread and a checker thread)

EXISTING METHOD: SOFTWARE SWIFT
- A compiler-based transformation
- The store instruction is the synchronization point
- Assumes that Error Correction Code (ECC) guards the correctness of the memory subsystem
Original Code:
    ld r3 = [r4]
    add r1 = r2, r3
    st m[r1] = r2
SWIFT Code:
    ld r3 = [r4]
    mov r3' = r3
    add r1 = r2, r3
    add r1' = r2', r3'
    br Fault, r1 != r1'
    br Fault, r2 != r2'
    br Fault, r3 != r3'
    st m[r1] = r2
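The SWIFT code above can be walked through with a small Python sketch. This is only an illustration under stated assumptions: the function name `swift_sequence`, the `flip` parameter and the dictionaries used for registers and memory are hypothetical and are not part of SWIFT itself.

```python
class FaultDetected(Exception):
    """Raised when a register and its redundant copy disagree."""


def swift_sequence(memory, regs, flip=None):
    """Run the SWIFT version of: ld r3=[r4]; add r1=r2,r3; st m[r1]=r2.

    `regs` must hold r2, r2' and r4. `flip` is an optional
    (step, register, bit) triple: after that step one bit of the named
    register is inverted, modelling a single transient fault.
    """
    def maybe_flip(step):
        if flip is not None and flip[0] == step:
            _, reg, bit = flip
            regs[reg] ^= 1 << bit

    regs["r3"] = memory[regs["r4"]]          # ld  r3  = [r4]
    maybe_flip(1)                            # between load and copy
    regs["r3'"] = regs["r3"]                 # mov r3' = r3
    maybe_flip(2)
    regs["r1"] = regs["r2"] + regs["r3"]     # add r1  = r2, r3
    regs["r1'"] = regs["r2'"] + regs["r3'"]  # add r1' = r2', r3'
    maybe_flip(3)
    for reg in ("r1", "r2", "r3"):           # br Fault, rX != rX'
        if regs[reg] != regs[reg + "'"]:
            raise FaultDetected(reg)
    maybe_flip(4)                            # between validation and use
    memory[regs["r1"]] = regs["r2"]          # st m[r1] = r2
    return memory
```

With `memory = {100: 5}` and `regs = {"r2": 7, "r2'": 7, "r4": 100}`, a flip at step 2 is caught by the checks before the store. A flip at step 1 (before the value is copied) or at step 4 (after the checks) reaches memory unnoticed in this toy model.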

CRAFT: SUITE OF THREE DETECTION SYSTEMS
Preliminaries
- Assume Single Event Upset fault model
- Architecturally Correct Execution (ACE)
- Detected Unrecoverable Error (DUE)
- Silent Data Corruption (SDC)
List of the Suite:
1. Checking Store Buffer (CSB)
2. Load Value Queue (LVQ)
3. CSB + LVQ

SUITE 1: CHECKING STORE BUFFER (CSB)
Problem to Improve: SWIFT is vulnerable to faults in the time interval between the validation and the use of a register value
Solution: Add a store buffer to perform the checks
(Figure: timeline from "validated values" to "use of validated values"; the interval in between is vulnerable to faults)

CSB : IMPLEMENTATION
Basic Idea: Commit a store when the two copies of the store data match
Method: Create a CSB to keep track of all original and duplicated store instructions
Step 1: The compiler duplicates each store and tags the copies with a single-bit version name
    st [r1] = r2   becomes   st1 [r1] = r2
                             st2 [r1'] = r2'
Step 2: New store entries are put into the CSB. When a duplicate matches its original, the duplicate is discarded and the entry is marked OK to execute
Step 3: Unchecked stores clog the head of the CSB. A fault is detected when the CSB fills up

CSB : IMPLEMENTATION
Compiler duplicates stores: st [r1] = r2  ->  st1 [r1] = r2, st2 [r1'] = r2'
CSB with entries 0-3; two entries are occupied and not yet validated:
    Address    0xFF  0xEE
    Value      0x8   0x1
    Validated  N     N
Insn duplicate #1 (address 0xFF, value 0x8): match (Y). Store value checks out! Send to MEM.
Insn duplicate #2 (address 0xEE, value 0x2): no match (N). Not OK to go to MEM. The table will fill up and cause a structural hazard.
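The CSB behaviour described above can be sketched as a toy Python model. It is a simplification under stated assumptions: the class and method names are hypothetical, the original store is assumed to arrive before its duplicate, and a duplicate is always compared with the oldest unchecked entry.

```python
from collections import deque


class CSBFault(Exception):
    """Raised when the buffer is full of stores that cannot be validated."""


class CheckingStoreBuffer:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = deque()  # oldest store sits at the left (head)
        self.memory = {}        # stands in for the ECC-protected memory

    def store(self, version, address, value):
        """Accept st1 (version=1, original) or st2 (version=2, duplicate)."""
        if version == 1:
            if len(self.entries) == self.capacity:
                raise CSBFault("CSB is full of unchecked stores")
            self.entries.append({"addr": address, "value": value, "ok": False})
            return
        # A duplicate is compared with the oldest entry still awaiting a check.
        pending = next((e for e in self.entries if not e["ok"]), None)
        if pending and pending["addr"] == address and pending["value"] == value:
            pending["ok"] = True
        self._commit_validated()

    def _commit_validated(self):
        """Send validated stores at the head of the buffer to memory."""
        while self.entries and self.entries[0]["ok"]:
            entry = self.entries.popleft()
            self.memory[entry["addr"]] = entry["value"]
```

Feeding it the values from the slide, the pair (0xFF, 0x8) commits to memory, while the original (0xEE, 0x1) followed by the duplicate (0xEE, 0x2) never validates. That entry stays at the head, later stores queue behind it, and the fault is reported only once the buffer is full.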

CSB : ADVANTAGES/ DISADVANTAGES
Advantages
- Checking is implemented at the hardware level: validation code is no longer needed, which reduces code size
- Store instructions are no longer synchronization points (as in SWIFT): more dynamic scheduling can be exploited
Disadvantages
- Additional compiler requirement: the distance between duplicated instructions should not exceed the size of the CSB

SUITE 2: LOAD VALUE QUEUE (LVQ)
Problem to Improve: SWIFT verifies loads by generating a move instruction after each load to keep a copy of the value
    br faultDet, r2 != r2'
    ld r1 = [r2]
    mov r1' = r1
Solution: Add a load value queue

SUITE 2: LOAD VALUE QUEUE (LVQ)
Problem to Improve: In SWIFT there is a window of vulnerability between the load instruction and the value duplication
Solution: Add a load value queue
(Figure: timeline from "loading values" to "copying values"; the interval in between is vulnerable to faults)

LVQ : IMPLEMENTATION PROCEDURE
Basic Idea: Duplicate loads to enable redundant computation
Method: The LVQ provides redundant load instruction execution
Step 1: The compiler duplicates each load and tags the copies with a single-bit version name
    ld r2 = [r1]   becomes   ld1 r2 = [r1]
                             ld2 r2' = [r1']
Step 2: The duplicated load is bypassed from the LVQ. The entry is de-allocated from the LVQ when the two copies match
Step 3: A fault is detected when the two duplicated copies fail to match

LVQ : IMPLEMENTATION PROCEDURE
Compiler duplicates loads: ld r2 = [r1]  ->  ld1 r2 = [r1], ld2 r2' = [r1']
LVQ with entries 0-3; two entries are occupied and not yet validated:
    Address    0xAA  0xBB
    Value      0x2   0x1
    Validated  N     N
Insn #1 (address 0xAA, value 0x2) and Insn #2 (address 0xBB, value 0x1) are compared with their entries:
- Match (Y): Load value checks out!
- No match (N): Error detected!

LVQ : IMPLEMENTATION PROCEDURE
Compiler duplicates loads: ld r2 = [r1]  ->  ld1 r2 = [r1], ld2 r2' = [r1']
(Figure: LVQ with entries 0-3. The ld insn places address 0xAA and value 0x2 in an entry; the ld insn duplicate is served from that entry.)
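A rough Python sketch of the LVQ idea follows. The names are hypothetical, and the sketch assumes that loads and their duplicates arrive in the same order and that the check compares addresses; the slides do not spell out either detail.

```python
from collections import deque


class LVQFault(Exception):
    """Raised when a duplicate load does not match its queued original."""


class LoadValueQueue:
    def __init__(self, memory):
        self.memory = memory    # stands in for the ECC-protected memory
        self.queue = deque()    # (address, value) of loads awaiting a duplicate
        self.memory_reads = 0   # counts real memory accesses

    def load(self, address):
        """Original load (ld1): read memory and remember what was read."""
        value = self.memory[address]
        self.memory_reads += 1
        self.queue.append((address, value))
        return value

    def load_duplicate(self, address):
        """Duplicate load (ld2): bypass the value from the queue, not memory."""
        if not self.queue:
            raise LVQFault("duplicate load without a pending original")
        queued_address, value = self.queue.popleft()  # de-allocate the entry
        if queued_address != address:
            raise LVQFault("address mismatch between load and duplicate")
        return value
```

Because the duplicate never touches memory, both copies receive the same value even if memory changes in between, and the number of memory reads stays the same as in the unprotected program.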

LVQ : ADVANTAGES/ DISADVANTAGES
Advantages
- Reduces the window of vulnerability by issuing a duplicated load instruction
- Keeps memory traffic low by bypassing the load value
Disadvantages
- Extra hardware is needed to ensure that loads and their duplicates access the same entry in the LVQ

SUITE 3: CSB + LVQ
- Adds both CSB and LVQ simultaneously to software-only solutions like SWIFT

COMPARISON OF DIFFERENT APPROACHES
Technique          Category  Opcode/Control  Load/Store  Memory  Hardware Requirement
RMT                HW        All             All         None    SMT Base Machine + CSB + LVQ
SWIFT              SW        Some            Some        None    None
CRAFT: CSB + LVQ   Hybrid    Some            All         None    CSB + LVQ

EXPERIMENTAL EVALUATION
Evaluation Method (Performance vs. Reliability):
- Inject randomly chosen faults into a detailed microarchitectural simulation
- Track each chosen bit flip until completion of the program
- Analyze the final result to determine:
  - How much SDC is converted to DUE
  - How much work (# of applications) the program completed before encountering an SDC
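The injection-and-classification loop can be illustrated with a deliberately tiny Python sketch. Everything in it is an assumption made for illustration: the "program" is just a sum whose low 8 bits are the visible output, and the "detector" compares against a fault-free copy, standing in for any detection technique. It is not the microarchitectural simulation used in the evaluation.

```python
import random


def run(data, flip=None, protected=False):
    """Sum `data`; `flip` = (index, bit) inverts one bit of one input first.

    Returns (visible_result, detected). Only the low 8 bits of the sum are
    treated as program output, so some faults are masked.
    """
    work = list(data)
    if flip is not None:
        index, bit = flip
        work[index] ^= 1 << bit
    detected = protected and sum(work) != sum(data)
    return sum(work) & 0xFF, detected


def classify(data, flip, protected):
    """Label one injected fault as DUE, SDC or benign."""
    golden, _ = run(data)
    result, detected = run(data, flip, protected)
    if detected:
        return "DUE"
    return "SDC" if result != golden else "benign"


def campaign(data, trials, protected, seed=0):
    """Inject `trials` random single-bit faults and count the outcomes."""
    rng = random.Random(seed)
    counts = {"benign": 0, "DUE": 0, "SDC": 0}
    for _ in range(trials):
        flip = (rng.randrange(len(data)), rng.randrange(16))
        counts[classify(data, flip, protected)] += 1
    return counts
```

Comparing `campaign(data, n, protected=False)` with `campaign(data, n, protected=True)` shows the effect the slide describes: faults that would have been SDC are reported as DUE instead. In this toy model, some faults that were benign are also flagged as DUE.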

EXPERIMENTAL EVALUATION
Results: Measured as the # of applications the program completed before encountering an SDC
Implementation  Performance
CSB             Enables better performance because it eliminates scheduling constraints
LVQ             Impact varies by benchmark
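One way to read a "work completed before an SDC" metric is as the inverse of the expected number of SDCs per program run. The sketch below works through that arithmetic. The function name, its parameters and the example numbers are hypothetical and are not taken from the evaluation.

```python
def runs_per_sdc(raw_fault_rate, sdc_fraction, run_time):
    """Expected number of program runs completed per silent data corruption.

    raw_fault_rate: faults per unit time striking the processor (assumed)
    sdc_fraction:   fraction of faults that end as SDC (assumed)
    run_time:       execution time of one run, in the same time unit
    """
    sdc_per_run = raw_fault_rate * run_time * sdc_fraction
    if sdc_per_run <= 0:
        raise ValueError("expected SDCs per run must be positive")
    return 1.0 / sdc_per_run


# Made-up example: a technique that cuts the SDC fraction tenfold
# but makes each run 1.5 times longer.
baseline = runs_per_sdc(raw_fault_rate=1e-6, sdc_fraction=0.2, run_time=100.0)
hardened = runs_per_sdc(raw_fault_rate=1e-6, sdc_fraction=0.02, run_time=150.0)
improvement = hardened / baseline
```

The point of such a metric is that it penalizes slowdown: a longer run is exposed to faults for longer, so the reliability gain is smaller than the reduction in SDC fraction alone would suggest.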

SUMMARY AND CONCLUSION
CRAFT, as compared to:
Software-only Technique
- Execution time reduced by 5%
- SDC to DUE conversion rate increased by 75%
Hardware-only Technique
- Significantly reduced area overhead
- Comparable reliability maintained
Hybrid techniques can provide better reliability at relatively low cost

DISCUSSION POINTS
- CRAFT detects a fault when the CSB is clogged. Is there a tradeoff between detection latency and more flexible scheduling?
- Recovery method?
- Evaluation in terms of coverage?

CRAFT
Advantages:
- Keeps all of SWIFT's advantages and increases reliability
- Low cost, relatively high reliability
- Better performance than SWIFT
Disadvantages:
- No recovery method
- CSB gives much higher performance than LVQ
- No evaluation of coverage
- Compiler is ISA- and microarchitecture-dependent