Assessing SEU Vulnerability via Circuit-Level Timing Analysis (WAR-1, Barcelona, Spain, November 13, 2005)

Presentation transcript:

Slide 1: Assessing SEU Vulnerability via Circuit-Level Timing Analysis
Kypros Constantinides ‡, Stephen Plaza ‡, Jason Blome ‡, Bin Zhang †, Valeria Bertacco ‡, Scott Mahlke ‡, Todd Austin ‡, Michael Orshansky †
‡ Advanced Computer Architecture Lab, University of Michigan
† Department of Electrical and Computer Engineering, University of Texas at Austin

Slide 2: Introduction
There is growing concern about transient faults in combinational logic.
Numerous techniques already exist that deal with the effects of transient faults:
- Error Correction Codes (ECC)
- DIVA
- Simultaneous and Redundantly Threaded (SRT) processors
- and many others
However, these techniques come at a cost in performance, power, die size and design time.

Slide 3: Introduction
Designers have to trade off the reliability provided against the implementation cost:
- Inadequate soft-error protection may be useless due to poor reliability
- Excessive soft-error protection makes a design uncompetitive in cost and/or performance
To balance this trade-off, system designers need accurate soft-error rates (SERs) for their designs.
The device community provides raw SERs for devices of current technologies and projections for devices of future technologies.
However, architecture-level and circuit-level phenomena derate the raw SER.
Accurately assessing a design's SER requires a detailed circuit-level analysis infrastructure.

Slide 4: In This Work…
We introduce a high-fidelity, high-performance simulation infrastructure for estimating soft-error rates. It:
- asynchronously injects voltage pulses of various durations at the gate level
- accurately gauges detailed circuit phenomena to model fault introduction, fault propagation and possible fault masking
- simulates fast enough to permit the examination of entire workloads on complex designs (thousands of gates)

Slide 5: Soft Error Masking
Fortunately, not all transient faults cause an error. Circuit and architectural phenomena can prevent a fault from propagating to the design's output and causing an error:
- Logic masking
- Timing masking
- Electrical masking
- Microarchitecture masking
- Software masking

Slide 6: Soft Error Masking
- Logic masking: the fault is blocked by a following gate whose output is completely determined by its other inputs
- Timing masking: the fault affects the input of a latch only during the period in which the latch is not sensitive to its input
- Electrical masking: the fault's pulse is attenuated by subsequent logic gates due to electrical properties, and does not affect any latch's input
- Microarchitectural masking: the fault alters the value of at least one flip-flop, but the incorrect values are overwritten without being used in any computation affecting the design's output
- Software masking: the fault propagates to the design's output but is subsequently masked by software without affecting the application's correct execution
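The first two masking definitions can be made concrete with a small sketch. It is an illustration only, not the authors' simulator: the function names, the 2-input AND gate, and the setup/hold window around each clock edge are assumptions introduced here.

```python
import math


def and_gate_masks_glitch(faulty_input: int, other_input: int) -> bool:
    """Logic masking on a 2-input AND gate (illustrative assumption).

    Returns True when flipping `faulty_input` leaves the gate output
    unchanged, i.e. the output is fully determined by `other_input`.
    """
    good = faulty_input & other_input
    bad = (faulty_input ^ 1) & other_input
    return good == bad


def is_timing_masked(pulse_start, pulse_width, clock_period, setup, hold):
    """Timing masking under an assumed setup/hold latching window.

    Clock edges are assumed at integer multiples of `clock_period`.
    The pulse is masked if it never overlaps [edge - setup, edge + hold].
    """
    pulse_end = pulse_start + pulse_width
    first_edge = math.floor(pulse_start / clock_period)
    last_edge = math.ceil(pulse_end / clock_period)
    for k in range(first_edge, last_edge + 1):
        edge = k * clock_period
        if pulse_start <= edge + hold and pulse_end >= edge - setup:
            return False  # pulse is present while the latch samples
    return True
```

Under this simplified window model, a pulse as wide as the clock period always spans a clock edge, so it can never be timing-masked.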

Slide 7: Simulation Infrastructure
- Design Under Test: gate-level description of the design (netlist)
  - Fault-Exposed Model: subjected to fault injection
  - Golden Model: no faults injected
- Fault Generator: injects voltage pulses of various durations at any gate in the design and flips the value of any flip-flop in the design
  - faults are uniformly distributed in time, location and duration
- Fault Analyzer: monitors manifested errors and tracks all the possible ways a fault can be masked
- Model Stimuli: workload traces that exercise the design under test
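The golden versus fault-exposed arrangement can be sketched on a toy design. Everything below is hypothetical (a one-entry buffer, the `step` and `run` names, the stimulus format) and stands in for the gate-level netlist that the actual infrastructure simulates.

```python
def step(data, stimulus):
    """Toy one-entry buffer (hypothetical design under test).

    stimulus = (write, value, read); returns (next_data, output).
    """
    write, value, read = stimulus
    output = data if read else 0
    next_data = value if write else data
    return next_data, output


def run(trace, fault_cycle=None, fault_bit=0):
    """Run golden and fault-exposed models in lockstep on the same trace.

    A single bit flip is injected into the fault-exposed state at
    `fault_cycle`. Returns "error" if the outputs ever differ, "masked"
    if the states have re-converged, and "latent" otherwise.
    """
    golden = faulty = 0
    for cycle, stimulus in enumerate(trace):
        if cycle == fault_cycle:
            faulty ^= 1 << fault_bit  # flip one flip-flop's value
        golden, golden_out = step(golden, stimulus)
        faulty, faulty_out = step(faulty, stimulus)
        if golden_out != faulty_out:
            return "error"
    return "masked" if golden == faulty else "latent"


# Write 5, idle, write 7, read.
TRACE = [(1, 5, 0), (0, 0, 0), (1, 7, 0), (0, 0, 1)]
```

With this trace, a flip at cycle 1 is overwritten by the write at cycle 2 before any read (a case of microarchitectural masking), whereas a flip at cycle 3 reaches the output.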

Slide 8: Statistical Model for Transient Faults
- Pulse-based model for transient faults caused by energetic particle strikes
- Faults injected into combinational logic are classified by their duration: 20%, 40%, 60%, 80% and 100% of the design's clock period
- Faults injected into sequential elements flip their value
- The arrival rate of each type of fault is modeled by a separate random variable
- The mean inter-arrival times for each fault type are derived from previously published data and detailed SPICE simulations
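A fault generator along these lines might look as follows. The slide does not name the distribution of the inter-arrival times, so the exponential distribution used here is an assumption, as are the function name, the unit of time (clock cycles), and any mean values passed in. Flip-flop upsets would be a further fault type with its own rate and are left out for brevity.

```python
import random

# Pulse durations as fractions of the clock period (from the slide).
PULSE_FRACTIONS = (0.2, 0.4, 0.6, 0.8, 1.0)


def generate_faults(mean_interarrival, sim_cycles, num_gates, rng):
    """Draw a fault schedule for one simulation run.

    mean_interarrival: dict mapping pulse fraction -> mean number of
        cycles between faults of that type (hypothetical values).
    Inter-arrival times are ASSUMED to be exponentially distributed.
    Returns a time-sorted list of (time, gate_index, pulse_fraction).
    """
    faults = []
    for fraction, mean in mean_interarrival.items():
        t = rng.expovariate(1.0 / mean)
        while t < sim_cycles:
            gate = rng.randrange(num_gates)  # uniform over locations
            faults.append((t, gate, fraction))
            t += rng.expovariate(1.0 / mean)
    faults.sort()
    return faults
```

Seeding the generator, for example with `random.Random(0)`, keeps a fault-injection campaign reproducible.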

Slide 9: Design Under Test – CMP Switch
We chose a single-chip multiprocessor interconnection switch as the design under test (baseline provided by Li-Shiuan Peh):
- Much less complex than a microprocessor, yet not too simplistic (it includes finite state machines, buffers, control logic, and buses)
- Wormhole switch pipelined at the flit level
- Specified in Verilog and synthesized to a gate-level netlist
- ~9K logic gates and 1700 sequential elements
- Realistic workload: communication traces derived from the TRIPS architecture

Slide 10: Characterization per Fault Type
- High microarchitectural masking: 95% of the faults that flip a flip-flop's value are masked
- Timing masking is significant only for faults with small pulse durations
- Logic masking increases as the fault's pulse duration decreases
Figure labels: 51.7% logic masking, 2.2% timing masking, 42.9% μarch masking, 3.2% error

Slide 11: Derating Factor
- Derating factor = 1 / error rate
  - e.g. a derating factor of 30 means that one of every 30 injected faults will cause an error (an error rate of 3.3%)
- The average derating factor for realistic workloads is 31 (error rate: 3.2%)
- A synthetic high-utilization workload leads to a derating factor of 12 (error rate: 8.3%)
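The relation between derating factor and error rate is simple arithmetic, sketched below; the function names are introduced here for illustration.

```python
def derating_factor(injected_faults: int, observed_errors: int) -> float:
    """Derating factor = injected faults per observed error = 1 / error rate."""
    return injected_faults / observed_errors


def error_rate_percent(derating: float) -> float:
    """Error rate, in percent, implied by a given derating factor."""
    return 100.0 / derating
```

The figures on the slide agree with this relation: factors of 30, 31 and 12 give roughly 3.3%, 3.2% and 8.3% respectively.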

Slide 12: Failure Rate Projections
- Taking into account projections from ITRS and raw SER estimates for future process technologies, we make failure rate projections that consider the transient-fault derating effects
- The design architecture is kept intact for future process technologies
- Two different designs:
  - one clocked at the projected clock frequencies for microprocessors
  - one clocked at the projected clock frequencies for interconnection networks
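One way to picture such a projection is to scale a raw failure rate by the derating factor. The sketch below is an assumption about the general shape of the calculation, not the authors' method. The per-element raw rates in the usage line are placeholders rather than ITRS data, and a single derating factor is applied to all fault types for simplicity. FIT denotes failures per 10^9 device-hours.

```python
FIT_HOURS = 1e9  # one FIT = one failure per 10^9 device-hours


def derated_fit(raw_fit_per_gate, num_gates, raw_fit_per_ff, num_ffs, derating):
    """Design-level failure rate after derating (simplified sketch).

    Sums the raw rates over all gates and flip-flops, then divides by
    one overall derating factor.
    """
    raw_total = raw_fit_per_gate * num_gates + raw_fit_per_ff * num_ffs
    return raw_total / derating


def mttf_hours(fit):
    """Mean time to failure, in hours, for a failure rate given in FIT."""
    return FIT_HOURS / fit


# Placeholder raw rates; the gate and flip-flop counts and the derating
# factor of 31 are the figures reported for the switch.
example_fit = derated_fit(1e-4, 9000, 1e-3, 1700, 31)
```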

Slide 13: Transient-fault Vulnerability per Component
- We observed that each switch component exhibits a different vulnerability to transient faults
- Derating effects depend greatly on the component's characteristics
- Most vulnerable component: the switch arbiter (12.8% error), which takes up 6% of the switch's area
- The input controllers dominate the switch design, taking up 86% of the switch's area
- The switch's vulnerability matches that of the input controllers

Slide 14: Effects of Multi-fault Strikes
A single strike can cause multiple faults on neighbouring gates or flip-flops:
- data on the frequency of such events, and models for multi-fault strikes on logic gates and flip-flops, are lacking
- we assume that each strike causes multiple faults, which is extremely pessimistic
- even under this severe environment the failure rates are relatively low

Slide 15: Conclusions – Directions for Future Work
Conclusions
- For complex designs there is significant fault masking, with derating factors as high as 30
- Soft-error derating effects depend heavily on the design's characteristics and utilization
- Our observations suggest that the soft-error reliability threat might have been overstated by the computer architecture community
  - Designers need to evaluate their design's soft-error tolerance with detailed analysis tools that consider circuit-level derating effects, and to better trade off the protection provided against the implementation cost
Future Work
- Study the soft-error derating effects for several designs with different levels of complexity and different characteristics
- Enhance our simulation infrastructure so that it can simulate large, high-complexity systems (millions of gates) with short simulation runs

Slide 16: Questions?