Towards the design of tomorrow’s Reliable Computing Systems

Presentation transcript:

Towards the design of tomorrow’s Reliable Computing Systems Reiley Jeyapaul PhD, Compiler Microarchitecture Lab, ASU

What is Reliable Computing? Definition of reliability: “The quality of being dependable or reliable.” In computing, how would you define reliability? Who cares about reliability? The system user (you and me) and the system architect/designer. Why does it matter, and to what extent? It is application dependent: media applications → lower priority; financial applications → high priority; medical applications → critical priority.

Sources of Failures in Systems. Hard faults (permanent faults): stuck-at faults, i.e. a faulty circuit element (e.g., a wire or the output of a gate); delay faults (effects of temperature, process variations); ageing, the onset of physical wear-out, e.g. electromigration and NBTI (Negative Bias Temperature Instability). Soft faults (temporary faults): program errors → software bugs, incorrect initialization, etc.; environmental factors such as cosmic particles, temperature, physical effects, and electromagnetic interference; non-environmental factors such as loose connections, ageing, process variations, and noise.

Saving Galileo. 1978: Galileo commissioned for Jupiter exploration. 1980: design and architecture decided, with the AT 2901 used for attitude control. 1982: Voyager reaches Jupiter and experiences intermittent resets; sulfur ions from Jupiter’s volcanic moon were being whipped up to high energy by the Jovian gravity. After extensive testing of Galileo, the chief engineer decided it was “not worth flying if soft error problem not solved”. Overheads: 5 years and 5 million dollars; Sandia National Laboratories was subcontracted to custom-make a radiation-hardened AT 2901. The soft error problem is not new: about 30 years ago NASA started the Galileo project to explore Jupiter and used the AT 2901 for attitude control, but the design initially failed, and the team realized that soft errors were the main problem.

It started with nuclear tests… Electronic anomalies appeared in monitoring equipment; they could not be traced to any hardware fault, and the equipment worked properly after a restart. 1962: Wallmark and Marcus (RCA Labs, Princeton), “Minimum Size and Maximum Packing Density of Non-Redundant Semiconductor Devices”, March 1962, predicted that cosmic rays would start affecting microelectronics. 1962: Telstar, the first communication satellite. July 9, 1962: Starfish Prime. The United States tested a high-altitude nuclear device (called Starfish Prime) which super-energized the Earth's Van Allen belt where Telstar took orbit, causing a 100X increase in radiation. This rendered the satellite inoperable; it worked after a reboot.

Radioactive Contamination. 1978: Intel could not deliver chips to AT&T to upgrade its switching system from mechanical relays to ICs. May and Woods traced the problem to packaging: the packaging modules were contaminated with uranium from an old uranium mine upstream. They also proposed the Q_critical model of soft errors: the charge accumulated from a particle strike must exceed Q_critical to cause a fault. 1986-87: IBM faced problems of radioactive contamination and traced them to a distant chemical plant that used a radioactive contaminant to clean the bottles used to store an acid required in the chip manufacturing process.

Fiscal Losses Mount. 2000: Sun Microsystems. Cosmic ray strikes on an L2 cache with defective error protection caused Sun’s flagship servers to suddenly and mysteriously crash. Baby Bell (Atlanta), America Online, eBay, and dozens of other corporations were affected; Verisign moved to IBM Unix servers (for the most part). 2000: Cisco line routers. Intermittent router resets, due to soft errors in the processor memory, affected operation. 2004: Cypress Semiconductor reported a number of soft error incidents: a single soft error crashed an entire system farm, and soft errors brought a billion-dollar automotive factory to a halt every month. 2005: HP server farm. A 2048-CPU server at LANL crashed frequently. Transition: of course, companies have started reacting to such strikes.

Reactions from Companies. Fujitsu SPARC in 130 nm technology: 80% of 200k latches protected with parity, compared with very few latches protected in McKinley (ISSCC, 2003). IBM declared a 1000-year system MTBF as a product goal, which is very hard to achieve in a cost-effective way (Bossen, 2002 IRPS Workshop Talk). NVIDIA Fermi GPUs: all memory and the register file are protected using ECC.

Soft Error Skeptics. “Applications crash more often due to software bugs”: there is a limited number of bugs in mature software (e.g., servers, company environments), and if we don’t do anything, soft errors will become the dominant contributor to the failure rate. “This is a server problem, not a desktop problem”: it is definitely a server (e.g., data center) problem, and a desktop problem from the IT manager’s point of view; soft error rates are increasing exponentially with scaling and will soon become a problem even for embedded systems. “Soft errors are not a problem today”: the industry is at the cross-over point, and the future is worse if we don’t do anything.

Radiation Induced Soft Errors. When an energetic particle hits a sensitive area of the chip, it generates electron-hole pairs, and the resulting charge can change a bit value from zero to one or vice versa. The slide gives two time constants for the induced current pulse: 1.64 × 10^-10 s and 5.10 × 10^-11 s. Typically the induced current has a rapid rise time but a more gradual fall time.

Soft Error Trends. DRAM: the system error rate of DRAMs is fairly constant. SRAM: increasing exponentially. Logic: also increasing. What is the issue with soft errors? The main problem is that the trend is increasing, and significantly so, in particular for SRAM and logic.

Increasing Soft Error Rates. Shrinking feature sizes and lower supply voltages: node capacitance and noise margins decrease, so Q_critical shrinks, and there are exponentially more low-energy particles than high-energy ones. More transistors per chip: more functionality is moving on-chip, and more potential fault sites mean a higher probability of error. Increasing clock rates: the window between setup and hold times becomes a larger fraction of the clock cycle, so a transient is more likely to be latched.

One Failure per Day per Chip [Shivakumar et al 2002] Soft error rates could increase from one error per year to one error per day in a decade!
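
As a rough sanity check on that projection, the two endpoints can be turned into an implied yearly growth factor. This is only a sketch: it assumes smooth exponential growth over the decade, which the slide does not state, and the variable names are illustrative.

```python
# Endpoints taken from the slide: one error per year growing to
# one error per day (365 per year) over a decade.
errors_per_year_start = 1.0
errors_per_year_end = 365.0
years = 10

# Assumption: constant exponential growth between the two endpoints.
annual_growth = (errors_per_year_end / errors_per_year_start) ** (1 / years)

print(f"implied growth factor per year: {annual_growth:.2f}")
```

Under that assumption the error rate would have to grow by roughly 80% every year.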

Transient Faults, Bit Flips, Soft Errors, etc. A FAULT becomes an ERROR on activation (after the fault latency), and an ERROR becomes a FAILURE on propagation (after the error latency). In a storage device (e.g., memory, cache, registers): bit flips = transient faults = soft errors. In the processor pipeline, transient faults in a logical device (e.g., ALU) pass through circuit masking before becoming soft errors in a sequential device (e.g., FF), and soft errors pass through MA/SW masking before becoming system failures (e.g., a crash). In storage devices like main memory and caches, bit flips = TF = SE. Elsewhere, a TF becomes a soft error only if it is not masked by effects such as circuit masking. Similarly, an SE that is not masked can cause a system failure such as a crash.

Approaching the Soft Error Problem Hardware Interface Visible: Processor Hardware Chip design Obscured: Application behavior Soft Error Perspective: Fault → Soft Error (bit-flip) Protection: Correct the Soft Error

Processor Hardware Layers Device Level Circuit Level Chip Packaging Processor Microarchitecture

Hardware Techniques. Transient faults are incident on hardware bits: transistors and gates are affected by transient faults, so attack the problem at its source. Packaging techniques shield the transistors, e.g. by refining the packaging material. Circuit and device techniques make them resilient to transient faults: SOI (Silicon on Insulator), fault-resilient (hardened) latches. Architecture techniques detect and correct bit faults: parity (detect only), ECC (1-bit detect and correct), SECDED (1-bit detect and correct, 2-bit detect).
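
The difference between detect-only parity and a correcting code can be sketched in a few lines. The example below is a minimal illustration, assuming a 4-bit data word and a textbook Hamming(7,4) code; the function names are hypothetical and not taken from the slides. SECDED extends such a code with one overall parity bit, which is omitted here.

```python
def parity_bit(word):
    """Even-parity bit: 1 if the word has an odd number of set bits."""
    return bin(word).count("1") % 2


def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d = [(data >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]  # checks positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # checks positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # checks positions 4, 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]  # index i holds position i+1
    return sum(b << i for i, b in enumerate(bits))


def hamming74_decode(code):
    """Return (data, error_position); error_position is 0 if no error is seen."""
    bits = [(code >> i) & 1 for i in range(7)]
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos - 1]:
            syndrome ^= pos  # XOR of the positions of all set bits
    if syndrome:
        bits[syndrome - 1] ^= 1  # a single-bit error sits at the syndrome position
    data = bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)
    return data, syndrome


word = 0b1011
stored = hamming74_encode(word)
corrupted = stored ^ (1 << 4)  # flip the bit at position 5
print(parity_bit(word) != parity_bit(word ^ 0b0100))  # parity notices one flip
print(hamming74_decode(corrupted))                    # data is recovered
```

Parity only reports that something changed, and it misses any even number of flips; the Hamming syndrome also locates a single flipped bit so it can be corrected.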

Soft Error Masking Effects in Electronic Circuitry – Electrical Masking Pulse attenuated by electrical resistance in the circuit Pulse still strong enough to be latched at output

Soft Error Masking Effects in Electronic Circuitry – Logical Masking Value unchanged at the gate

Soft Error Masking Effects in Electronic Circuitry – Logical Masking Error propagated to the output

Soft Error Masking Effects in Electronic Circuitry – Temporal Masking. Whether a transient fault becomes a soft error depends on where the transient pulse falls relative to the latching window: before t_setup → masked (not latched); after t_setup and before t_hold → race condition; at the latching window → not masked (latched). [Firouzi ROCS 2010]

Hardware Techniques for Protection. Shielding at the package layer: a method to prevent strikes, limited by packaging design/cost and the technology available. Device-level techniques: scalability, design flexibility, and cost are concerns; fabrication cost governs commercial applicability. Circuit-level techniques: circuit implementations enhance the masking effects (electrical, temporal, and logical masking) to protect against SEUs. Limitations: hardware overheads in area, power, and design cost; overhead versus need governs the commercial applicability of these methods.

Approaching the Soft Error Problem Software Interface Visible: Application behavior Obscured: Processor Microarchitecture Chip design Soft Error Perspective: Anomaly in program execution Protection: Recover from incorrect execution

How do Soft Errors Manifest in the System? ESWEEK 2012 Tutorial Presentation 9/21/2018. Flow shown on the slide: application/software → compiler → executable binary → PROCESSOR → program output, with random charged particles causing bit errors in h/w components. Outcomes from a soft error: output data corruption, incorrect program execution, system crash, or silent undetected data corruption. Masked soft errors: correct program execution, no system crash. Not all bit errors in the processor hardware translate into system-level errors (or failures); the reason is masking.

Software-level Masking Effects - Logical/Arithmetic Masking. Program: If A > 0 … Block 1 Then Block 2 EndIf. Scenario 1: A = 5 (0b0101) is corrupted to A = 7 (0b0111). Expected: Block 1 executed. The test A > 0 gives the same result for both values, so the error is masked. Scenario 2: A = 0 (0b0000) is corrupted to A = 2 (0b0010). Expected: Block 2 is NOT executed. Here the test A > 0 changes its result, so the error is not masked.
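
The two scenarios can be replayed with a small sketch. It assumes A is an unsigned 4-bit value and that a soft error flips exactly one bit; the helper names are made up for illustration.

```python
def branch_taken(a):
    """The condition from the slide's program: If A > 0."""
    return a > 0


def flip_bit(value, bit):
    """Model a soft error as a single-bit flip."""
    return value ^ (1 << bit)


def is_masked(a, bit):
    """True if the flipped value takes the same branch as the correct one."""
    return branch_taken(a) == branch_taken(flip_bit(a, bit))


print(flip_bit(5, 1), is_masked(5, 1))  # scenario 1: A = 5 becomes 7
print(flip_bit(0, 1), is_masked(0, 1))  # scenario 2: A = 0 becomes 2
```

For A = 5 every single-bit flip within the 4 bits leaves the value positive, so the branch decision is unaffected; for A = 0 every flip makes the value positive and changes the decision.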

Software-level Masking Effects - Control Flow Masking. Program: B = 5; C = 10; If A > 0 … B = B + 5 Then C = C + 2 EndIf. Scenario 1 (A = 5): the error in B manifests into a failure. Scenario 2 (A = 0): the error in B is masked.

Software-level Masking Effects Other Masking Effects Data-value masking: If the value of one variable in a multiplication is 0, error in the other variable is masked. Error in a variable which may be reset for future use in the program is masked. e.g., error in the index variable of a for loop. Dynamic dead-code: Error in a variable, when used in a computation block results in incorrect temporary data. But if this temporary data computed is not used in the computation of program output or execution, this computed block of code is dynamically dead.
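
Both effects can be seen in a toy function. This is a hypothetical sketch, not code from the presentation: `program`, its parameters, and the chosen values are assumptions made for illustration.

```python
def program(x, y, use_temp):
    """Toy program whose output may or may not depend on x."""
    temp = x + 1      # temporary value derived from x
    product = x * y   # a wrong x is hidden here whenever y == 0
    # temp is dynamically dead when use_temp is False
    return (product + temp) if use_temp else product


good_x = 3
bad_x = good_x ^ (1 << 2)  # single-bit flip: 3 becomes 7

print(program(bad_x, 0, False) == program(good_x, 0, False))  # masked
print(program(bad_x, 2, False) == program(good_x, 2, False))  # not masked
print(program(bad_x, 0, True) == program(good_x, 0, True))    # not masked
```

The error in x stays invisible only when the multiplication by zero hides it and the temporary computed from it never reaches the output.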

Software Techniques for Protection The key ideas include: Reduce time that vulnerable data resides on the components Detect and correct errors (if any) after execution through the components. Software based techniques vary based on the components protected (coverage). L1 Cache Register File protection Pipeline core and buffers Redundancy based techniques Control flow based techniques

Research Front. Error detection and recovery techniques: recovery through restart may be an acceptable option, since the cost of an integrated recovery mechanism is substantial; but recovery through restart may not be acceptable for HPC systems. Advantages of compiler-based methods: they can analyze soft errors that transcend the microarchitecture- and software-level masking effects, and can implement smart optimizations for efficient protection. Limitations of software approaches: the granularity of optimizations is coarser; and the code added for detection/correction is itself vulnerable to soft errors and may therefore cancel out the protection achieved.

Approaching the Soft Error Problem Hardware-Software Interface Visible: Processor Hardware Application behavior Obscured: Chip Design Soft Error Perspective: Fault → Soft Error (bit-flip) → Failure Protection: Protect against Soft Error induced Failures

How to Estimate Soft Errors? Soft error: a data bit-flip during program execution that translates into erroneous output. [Figure: an application binary runs on a processor (pipeline, register file, instruction/data cache, buffers) and produces output.] Analysis is most relevant at the interface between a data bit and a sequential element.

Data Vulnerability & Soft Errors. Data processed by the system is exposed by the hardware components in the processor to charged particles that strike the processor and to other sources of transient errors (electrical noise, cross-talk, etc.). Exposed data is vulnerable to bit-flips that manifest as failures in the system: erroneous output, system failure, output data errors. The probability of a bit error manifesting as a soft error ∝ the time duration for which the data is exposed in h/w. An error on an exposed data bit in the processor can lead to system errors only if the bit will be used later in the process execution. Only the exposed and actively used data bits in the system are deemed vulnerable during process execution.

Vulnerability in the Cache. [Figure: two timelines. Instruction cache: events I, R, E over t0 to t7. Data cache: events I, W, R, E over t0 to t5.]
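
One way to turn such a timeline into a number is to add up the intervals during which a cache line holds data that will still be consumed. The sketch below is an assumption-laden illustration: it reads I, R, W, E as incoming (fill), read, write, and eviction, counts an interval as vulnerable if it ends in a read, or ends in an eviction of a dirty line in a write-back cache, and uses made-up timestamps because the slide's figure did not survive. The function name is hypothetical.

```python
def vulnerable_time(trace, write_back=True):
    """Sum the vulnerable intervals of one cache line.

    trace: list of (time, event) pairs sorted by time, with event one of
    'I' (fill), 'R' (read), 'W' (write), 'E' (eviction).
    """
    total = 0
    dirty = False
    for (t_prev, ev_prev), (t_next, ev_next) in zip(trace, trace[1:]):
        if ev_prev == "W":
            dirty = True   # line now differs from the next memory level
        elif ev_prev == "I":
            dirty = False  # freshly filled line is clean
        interval = t_next - t_prev
        if ev_next == "R":
            total += interval  # the data is consumed, so an error would matter
        elif ev_next == "E" and dirty and write_back:
            total += interval  # dirty data is written back on eviction
        # an interval ending in 'W' is overwritten and counts as not vulnerable
    return total


instruction_line = [(0, "I"), (2, "R"), (5, "R"), (7, "E")]
data_line = [(0, "I"), (1, "W"), (3, "R"), (5, "E")]
print(vulnerable_time(instruction_line))  # 5
print(vulnerable_time(data_line))         # 4
```

The instruction line stops being vulnerable after its last read, while the written data line stays vulnerable until eviction because its contents are written back.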

Vulnerability Distribution in the processor with protected Cache

Code Transformations for Vulnerability Reduction [Shrivastava et al 2010]. Loop interchange on matrix multiplication: the vulnerability trend is not the same as the performance trend. Interesting configurations exist, with either low vulnerability or low runtime: a 52X variation in vulnerability for a 1% variation in runtime. Opportunities may exist to trade off a little runtime for large savings in vulnerability.

Vulnerability depends on the data access pattern

Low vulnerability but bad performance:

    for ( i : 0 ≤ i < N ) {
      for ( k : 0 ≤ k < N ) {
        for ( j : 0 ≤ j < N ) {
          A[i][k] += B[i][j] * C[j][k]
        }
      }
    }

A[i][k] is completely computed in the innermost loop → shorter lifetime of A[i][k].

High vulnerability but good performance:

    for ( i : 0 ≤ i < N ) {
      for ( j : 0 ≤ j < N ) {
        for ( k : 0 ≤ k < N ) {
          A[i][k] += B[i][j] * C[j][k]
        }
      }
    }

A[i][k] is needed across iterations of an outer loop → longer lifetime of A[i][k].
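
The lifetime argument can be checked by counting how many loop steps separate the first and last access to each A[i][k] under the two loop orders. This is only a proxy expressed in iteration counts, not the vulnerability metric of the cited work, and it ignores the cache entirely; the function name and the order labels are assumptions.

```python
def accumulator_lifetime(n, order):
    """Sum over all A[i][k] of (last access step - first access step).

    order "ikj": j is the innermost loop (first nest on the slide).
    order "ijk": k is the innermost loop (second nest on the slide).
    """
    first, last = {}, {}
    step = 0
    for i in range(n):
        for outer in range(n):
            for inner in range(n):
                j, k = (inner, outer) if order == "ikj" else (outer, inner)
                first.setdefault((i, k), step)  # remember the first touch
                last[(i, k)] = step             # keep updating the last touch
                step += 1
    return sum(last[key] - first[key] for key in first)


print(accumulator_lifetime(4, "ikj"))  # 48
print(accumulator_lifetime(4, "ijk"))  # 192
```

With j innermost each element of A is finished within N consecutive steps, while with k innermost its accesses are spread over N passes of the inner loop, so the summed lifetime grows by a factor of N.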

Soft Error Protection at H/w-S/w interface. Vulnerability of data is directly proportional to the probability of soft error failures. It can be estimated by fault injection, by simulation, or by static estimation. Estimation of vulnerability is a key factor for comparative analysis of two protection designs and for design space exploration of architecture designs.
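
A fault injection campaign can be sketched on a toy workload: run it once to get the correct ("golden") output, then rerun it with each input bit flipped in turn and count how often the output changes. Everything here is hypothetical (the kernel, the 4-bit inputs, and the function names); real campaigns inject into hardware state and usually sample faults rather than enumerate them.

```python
def kernel(a, b):
    """Toy workload: keep only the upper bits of a product."""
    return (a * b) >> 4  # the low 4 bits of the product are discarded


def inject(workload, inputs, bits=4):
    """Flip every bit of every input once; return (masked, corrupted) counts."""
    golden = workload(*inputs)
    masked = corrupted = 0
    for idx in range(len(inputs)):
        for bit in range(bits):
            faulty = list(inputs)
            faulty[idx] ^= 1 << bit  # single-bit fault in one input
            if workload(*faulty) == golden:
                masked += 1
            else:
                corrupted += 1
    return masked, corrupted


masked, corrupted = inject(kernel, (5, 0))
print(masked, corrupted)               # 6 2
print(corrupted / (masked + corrupted))  # fraction of faults that reach the output
```

The fraction of injected faults that corrupt the output is the kind of quantity that can be compared across two protection designs.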

Interesting Research Questions. How to design cross-layer protection? Which component should be protected, and at which layer? How to coordinate among the techniques across layers? Evaluation metric for soft error protection: there is no comprehensive system-level estimation method; the compiler has immense potential to contribute but lacks comprehensive static estimation methods. Time-based vulnerability in systems: the vulnerability of an application is not static through time. Vulnerability of multi-core systems: inter-thread communication affects vulnerability analysis. Soft error protection in HPC systems: communication overhead limits the use of multi-core techniques.

Thank You !