1 Dependability Benchmarking of VLSI Circuits Cristian Constantinescu Intel Corporation.

Presentation transcript:

1 Dependability Benchmarking of VLSI Circuits Cristian Constantinescu Intel Corporation

2 Outline
- Neutron SER characterization of microprocessors
  - SER scaling trends
  - Experimental set-up
  - Experimental results
- Other sources of errors
  - Memory intermittent faults
  - Front side bus intermittent faults
- Using environmental tests as dependability benchmarking tools
  - Temperature and Voltage Operating Test
  - ESD Operating Test
- Summary
- Backup
  - Linpack benchmark
  - References
Acknowledgement
- Neutron SER characterization: Bruce Takala, Steve Wander (LANSCE), Nelson Tam, Pat Armstrong (Intel Corp.)
- Environmental testing: John Blair, Scott Scheuneman (Intel Corp.)

3 Neutron SER Characterization of Microprocessors

4 Single Event Upsets
- Single event upsets (SEU) are induced by
  - Alpha particles, generated during radioactive decay of the package and interconnect materials
  - Neutrons, protons and pions, generated by cosmic rays penetrating the atmosphere
- SEU may induce errors both in storage elements and in combinational logic
- Frequency of occurrence of particle-induced errors: soft error rate (SER)

5 SER Scaling Trends
- [Figure: SRAM SER per bit and per chip]
- [Figure: Latch SER per bit and per chip]
- Assumption: SRAM/latch count increases ~2x per generation
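The relation between the per-bit and per-chip curves can be sketched as chip SER = per-bit SER x bit count, with the bit count doubling each generation as the slide assumes. The per-bit values below are hypothetical placeholders in arbitrary units (the slide's actual data points are not in the transcript), and `chip_ser_trend` is an illustrative name.

```python
def chip_ser_trend(per_bit_ser, bits_gen0, growth=2.0):
    """Chip-level SER per generation: per-bit SER times the bit count.

    The bit count starts at bits_gen0 and is multiplied by `growth`
    each generation (2x, following the assumption on the slide).
    """
    trend = []
    for gen, ser_bit in enumerate(per_bit_ser):
        bits = bits_gen0 * growth ** gen
        trend.append(ser_bit * bits)
    return trend


# Hypothetical per-bit SER (arbitrary units) that falls slowly per generation.
per_bit = [1.0, 0.9, 0.8, 0.7]
chip = chip_ser_trend(per_bit, bits_gen0=1_000_000)

for gen, (bit_ser, chip_ser) in enumerate(zip(per_bit, chip)):
    print(f"generation {gen}: per-bit {bit_ser:.2f}, per-chip {chip_ser:.3e}")
```

With these made-up inputs the per-chip SER keeps rising even though the per-bit SER declines, because the 2x growth in bit count outweighs the per-bit improvement.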

6 Hadron Cascades
- Neutrons represent 94% of the hadrons reaching sea level
- For terrestrial applications it makes sense to benchmark the impact of neutron SER
- [Figure: Main constituents of atmospheric hadron cascades]

7 LANSCE Neutron Beam
- Los Alamos Neutron Science Center (LANSCE)
- Generates high-energy neutrons by spallation: a linear accelerator generates a pulsed proton beam that strikes a tungsten target
- [Figure: Energy dependence of the natural cosmic-ray neutron flux and the LANSCE neutron flux]

8 Experimental Set Up
- Itanium processor based server
- Windows NT 4.0 operating system
- Linpack benchmark
  - Performs matrix computations
  - Derives residues, which can reveal silent data corruption (SDC)
- Fission ion chamber to determine neutron fluence
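How a residue check can expose silent data corruption is sketched below in pure Python. This is not the Linpack source: `solve` and `normalized_residual` are illustrative names, the matrix is small (n = 50, not the 800 and 1000 used in the experiment), the normalization only follows the spirit of the residual the benchmark reports, and the corruption is emulated by perturbing one element of the solution.

```python
import random
import sys


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def normalized_residual(a, x, b):
    """max|Ax - b| scaled by (row-sum norm of A) * max|x| * n * machine eps."""
    n = len(b)
    resid = max(abs(sum(a[i][j] * x[j] for j in range(n)) - b[i])
                for i in range(n))
    norm_a = max(sum(abs(v) for v in row) for row in a)
    norm_x = max(abs(v) for v in x)
    return resid / (norm_a * norm_x * n * sys.float_info.epsilon)


random.seed(0)
n = 50
a = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
b = [sum(row) for row in a]  # the exact solution is close to all ones

x = solve(a, b)
clean = normalized_residual(a, x, b)

x_bad = x[:]
x_bad[7] += 1e-3  # emulate a corrupted result word (hypothetical fault)
corrupt = normalized_residual(a, x_bad, b)

print(f"clean residual:     {clean:.3e}")
print(f"corrupted residual: {corrupt:.3e}")
```

A correct run gives a normalized residual of modest size, while the perturbed solution gives one many orders of magnitude larger, which is the signature the experiment relies on to flag SDC.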

9 Deriving MTTF
MTTF = Tua / U
- Tua: duration of an equivalent experiment, taking place in unaccelerated conditions [h]
- U: total number of upsets (failures) over the duration of the experiment
Tua = (Fcp * Nc) / Nf
- Fcp: total number of fission chamber pulses over the duration of the experiment
- Nc: average neutron conversion factor [neutrons/fission pulse/cm²]
- Nf: cosmic-ray induced neutron flux at the desired geographical location and altitude [neutrons/cm²/h]
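The two formulas on this slide translate directly into code. The sketch below uses the slide's symbols as parameter names; every numeric input is a hypothetical value chosen only to keep the arithmetic easy to follow, not a measurement from the experiment.

```python
def equivalent_unaccelerated_hours(fcp, nc, nf):
    """Tua = (Fcp * Nc) / Nf, in hours.

    fcp: fission chamber pulses counted during the experiment
    nc:  neutron conversion factor [neutrons / fission pulse / cm^2]
    nf:  natural neutron flux at the target location [neutrons / cm^2 / h]
    """
    if nf <= 0:
        raise ValueError("neutron flux Nf must be positive")
    return (fcp * nc) / nf


def mttf_hours(fcp, nc, nf, upsets):
    """MTTF = Tua / U, in hours."""
    if upsets <= 0:
        raise ValueError("at least one upset is needed to estimate MTTF")
    return equivalent_unaccelerated_hours(fcp, nc, nf) / upsets


# Hypothetical inputs (assumptions, not data from the presentation).
fcp = 2_000_000   # fission chamber pulses
nc = 70_000.0     # neutrons / fission pulse / cm^2
nf = 14.0         # neutrons / cm^2 / h
upsets = 40       # failures observed under the beam

tua = equivalent_unaccelerated_hours(fcp, nc, nf)
mttf = mttf_hours(fcp, nc, nf, upsets)
print(f"Tua  = {tua:.3e} h")
print(f"MTTF = {mttf:.3e} h")
```

The units confirm the formula: pulses times neutrons/pulse/cm² gives a fluence in neutrons/cm², and dividing by a flux in neutrons/cm²/h leaves hours.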

10 Experimental Results
- Run Linpack benchmark for square matrices of size 800 and 1000
- Completed 40 runs
- Duration of one run: 10 s to 5 min
- Failure types
  - Blue screen
  - Hang
  - Silent data corruption (SDC)

11 Experimental Results
- [Figure: Itanium processor MTTF due to neutrons, as a function of number of runs]

12 Experimental Results
- [Figure: MTTF confidence intervals]
- SDC: one event
  - Insufficient for statistical analysis
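Why a single SDC event is too little for statistics can be illustrated with a confidence interval on the number of upsets. The transcript does not say how the slide's intervals were computed; the sketch below assumes Poisson-distributed upset counts and uses the exact two-sided Poisson interval, solved by bisection with the standard library only. The function names and the `tua_hours` value are hypothetical.

```python
import math


def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu)."""
    term = math.exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return total


def solve_mean(k, target):
    """Find mu with P(X <= k | mu) == target (the CDF decreases as mu grows)."""
    lo, hi = 0.0, 1.0
    while poisson_cdf(k, hi) > target:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if poisson_cdf(k, mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def poisson_mean_bounds(upsets, confidence=0.95):
    """Exact two-sided bounds on the expected upset count."""
    alpha = 1.0 - confidence
    lower = solve_mean(upsets - 1, 1.0 - alpha / 2) if upsets > 0 else 0.0
    upper = solve_mean(upsets, alpha / 2)
    return lower, upper


def mttf_interval(tua_hours, upsets, confidence=0.95):
    """Bounds on MTTF = Tua / U: more expected upsets means a shorter MTTF."""
    lower_mean, upper_mean = poisson_mean_bounds(upsets, confidence)
    mttf_low = tua_hours / upper_mean
    mttf_high = tua_hours / lower_mean if lower_mean > 0 else math.inf
    return mttf_low, mttf_high


tua_hours = 1.0e6  # hypothetical equivalent unaccelerated duration
for upsets in (1, 40):
    low, high = mttf_interval(tua_hours, upsets)
    print(f"U = {upsets:2d}: MTTF between {low:.3e} h and {high:.3e} h "
          f"(ratio {high / low:.1f})")
```

Under these assumptions the 95% interval for one event spans more than two orders of magnitude, while the interval for 40 events spans less than a factor of two.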

13 Practical Considerations
- Error handling techniques differ greatly from one manufacturer to another
  - HW error detection and correction, e.g. ECC, is faster
  - FW/SW implemented recovery may be overwhelmed by an accelerated test (near-coincident faults scenario)
  - Acceleration factor is an important variable
- Failure prediction and automatic deconfiguration may lead to misleading results
- Multiple experiments
  - Beam divergence
  - Beam attenuation

14 Other Sources of Errors

15 Memory Intermittent Faults
- Intermittent faults are induced by unstable or marginal hardware
  - Intermittent shorts/opens
  - Manufacturing residuals
  - Timing faults
- [Figure: Number of memory single-bit errors reported by 193 systems over 16 months]
- [Figure: Daily number of memory single-bit errors reported by one system over 16 months]

16 Front Side Bus Intermittent Faults
- Front side bus (FSB) errors
  - Bursts of single-bit errors (SBE) on data path
  - SBE detected and corrected (data path protected by ECC)
- Failure analysis results
  - Intermittent contacts at solder joints
  - Fault injection showed that similar faults experienced by control signals induce SDC

17 Using Environmental Tests as Dependability Benchmarking Tools

18 Temperature and Voltage Operating Test
- Ten systems were tested
- Workload: Linpack benchmark
- [Figure: Profile of the test, with temperature levels of 70 °C, 25 °C and -10 °C]
- 9 systems experienced SDC
  - SDC events: 134 (90.5%)
  - Detected errors: 14 (9.5%)
  - SDC preceded detected errors
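The percentages on this slide are consistent with shares of all observed error events. The short check below assumes that reading (134 SDC events plus 14 detected errors forming the total); the slide does not state the denominator explicitly.

```python
sdc_events = 134
detected_errors = 14

# Assumption: the percentages are taken over all observed error events.
total = sdc_events + detected_errors

sdc_share = round(100 * sdc_events / total, 1)
detected_share = round(100 * detected_errors / total, 1)

print(f"total events: {total}")
print(f"SDC share: {sdc_share}%  detected share: {detected_share}%")
```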

19 Temperature and Voltage Operating Test
- [Figure: Distribution of the SDC events]
- Failure analysis results
  - Memory controller setup and hold-time violations

20 ESD Operating Test
- 4 servers from 2 manufacturers
- Workload: Linpack benchmark
- 30 test points per server
- 20 positive and 20 negative discharges per test point
  - Air discharge: 4 kV – 15 kV
  - Contact discharge: 8 kV
- One server experienced SDC
  - 8% of the discharges targeted to the disk bay area (15 kV, air)
- First ESD operating test to reveal SDC in a commercially available server

21 Summary
- The need for dependability benchmarking is increasing
  - Wider use of COTS components in critical applications
  - Technology is a double-edged sword
    - Higher performance
    - Higher rates of occurrence of transient and intermittent faults
- SDC is a real threat
  - We take for granted the correctness of the computer data
  - Dependability benchmarks should determine whether the circuits/systems under evaluation experience SDC
- Fault injection techniques require in-depth knowledge of the evaluated system
  - Appropriate for designers and manufacturers
- Accelerated neutron tests and environmental tests are a “black box approach”
  - Capable of unveiling SDC
  - In-depth knowledge of the system under test is not required
- Linpack benchmark is available for free
  - Can be used both by manufacturers and independent evaluators

22 Backup

23 Linpack Benchmark
- [Figure: Example of Linpack output; large residues indicate SDC]

24 References
- “Neutron SER characterization of microprocessors”, Proc. of the International Conference on Dependable Systems and Networks, Yokohama, Japan, June 2005, pp
- “Dependability benchmarking using environmental test tools”, Proc. of the Reliability and Maintainability Symposium, Alexandria, VA, USA, January 2005, pp. 567–571.
- “Impact of deep submicron technology on dependability of VLSI circuits”, Proc. of the International Conference on Dependable Systems and Networks, Washington, DC, USA, June 2002, pp