Towards the design of tomorrow’s Reliable Computing Systems


1 Towards the design of tomorrow’s Reliable Computing Systems
Reiley Jeyapaul PhD, Compiler Microarchitecture Lab, ASU

2 What is Reliable Computing?
Definition of reliability: “The quality of being dependable or reliable”
- In computing, how would you define reliability?
- Who cares about reliability?
  - System user (you and me)
  - System architect/designer
- Why does it matter, and to what extent? It is application dependent:
  - Media applications → lower priority
  - Financial applications → high priority
  - Medical applications → critical priority

3 Sources of Failures in Systems
Hard faults (permanent faults)
- Stuck-at faults: faulty circuit element (e.g., a wire or the output of a gate)
- Delay faults: effects of temperature and process variations
- Ageing (onset of physical wear-out): electromigration, NBTI (Negative Bias Temperature Instability)
Soft faults (temporary faults)
- Program errors → software bugs, incorrect initialization, etc.
- Environmental factors: cosmic particles, temperature, physical effects, electromagnetic interference
- Non-environmental factors: loose connections, ageing, process variations, noise

4 Saving Galileo
- 1978 – Galileo commissioned for Jupiter exploration
- 1980 – Design and architecture decided; AT 2901 chosen for attitude control
- 1982 – Voyager reaches Jupiter
  - Intermittent resets
  - Sulfur ions from Jupiter’s volcanic moon were being whipped up to high energy by the Jovian gravity
- After extensive testing of Galileo, the chief engineer decided it was “not worth flying if soft error problem not solved”
- Overheads: 5 years, 5 million dollars
  - Sandia National Laboratories was subcontracted to custom-make a radiation-hardened AT 2901
The soft error problem is not new. About 30 years ago, NASA launched the Galileo project to explore Jupiter and used the AT 2901 for attitude control, but the design failed. The team realized that soft errors were the main problem.

5 It started with nuclear tests…
- Electronic anomalies in monitoring equipment
  - Could not be traced to any hardware fault
  - Equipment worked properly after restart
- 1962: Wallmark and Marcus (RCA Labs, Princeton), “Minimum Size and Maximum Packing Density of Non-Redundant Semiconductor Devices”, March 1962
  - Predicted that cosmic rays would start affecting microelectronics
- 1962: Telstar, the first communication satellite
- July 9, 1962: Starfish Prime
  - The United States tested a high-altitude nuclear device (called Starfish Prime), which super-energized the Earth's Van Allen Belt, where Telstar orbited
  - 100X increase in radiation
  - Rendered the satellite inoperable; it worked after reboot

6 Radioactive Contamination
- 1978: Intel could not deliver chips to AT&T to upgrade its switching system from mechanical relays to ICs
  - May and Woods traced the problem to packaging: packaging modules were contaminated with uranium from an old uranium mine upstream
  - They also proposed the Q_critical model of soft errors: the charge accumulated from a particle strike must exceed Q_critical to cause a fault
- IBM faced problems of radioactive contamination
  - Traced the problem to a distant chemical plant that used a radioactive contaminant to clean bottles that stored an acid required in the chip manufacturing process

7 Fiscal Losses Mount
- 2000: Sun Microsystems
  - Caused Sun’s flagship servers to suddenly and mysteriously crash!
  - Cosmic ray strikes on L2 cache with defective error protection
  - Baby Bell (Atlanta), America Online, eBay, and dozens of other corporations affected
  - Verisign moved to IBM Unix servers (for the most part)
- 2000: Cisco Line Routers
  - Intermittent router resets, due to soft errors in the processor memory, affecting operation
- 2004: Cypress Semiconductor
  - Reported a number of soft error incidents
  - A single soft error crashed an entire system farm
  - Brought a billion-dollar automotive factory to a halt every month
- 2005: HP server farm
  - 2048-CPU server at LANL crashed frequently
Transition: Of course, companies have started reacting to such strikes.

8 Reactions from Companies
- Fujitsu SPARC in 130 nm technology
  - 80% of 200k latches protected with parity
  - Compare with very few latches protected in McKinley
  - ISSCC, 2003
- IBM declared a 1000-year system MTBF as a product goal
  - Very hard to achieve this goal in a cost-effective way
  - Bossen, 2002 IRPS Workshop Talk
- NVIDIA Fermi GPUs
  - Protect all memory and register files using ECC

9 Soft Error Skeptics
- “Applications crash more often due to software bugs”
  - Limited number of bugs in mature software (e.g., servers, company environment)
  - If we don’t do anything, soft errors will dominate the failure rate
- “This is a server problem, not a desktop problem”
  - Definitely a server (e.g., data center) problem
  - Desktop problem from the IT manager’s point of view
  - Soft error rates are increasing exponentially with scaling
  - Will soon become a problem even for embedded systems
- “Soft errors are not a problem today”
  - Industry is at the cross-over point
  - Future is worse, IF we don’t do anything

10 Radiation Induced Soft Errors
When an energetic particle hits a sensitive area of the chip, it generates electron-hole pairs, which can change a bit value from zero to one or vice versa. Time constants of the induced current pulse: 1.64 × 10^-10 s and 5.10 × 10^-11 s. Typically, the induced current has a rapid rise time but a more gradual fall time.
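The pulse shape described above (fast rise, slower fall) is commonly approximated by a double-exponential current model. The sketch below assumes that model and assumes the larger time constant governs the fall and the smaller one the rise; the names `TAU_FALL`, `TAU_RISE`, `induced_current`, and `peak_time` are illustrative and do not come from the slides.

```python
import math

# Assumption: the two constants on the slide are the fall and rise time
# constants of a double-exponential current pulse.
TAU_FALL = 1.64e-10  # seconds, slower decay of the pulse
TAU_RISE = 5.10e-11  # seconds, faster build-up of the pulse


def induced_current(t, charge, tau_fall=TAU_FALL, tau_rise=TAU_RISE):
    """Current (A) at time t (s) for a strike depositing `charge` coulombs."""
    if t < 0:
        return 0.0
    scale = charge / (tau_fall - tau_rise)
    return scale * (math.exp(-t / tau_fall) - math.exp(-t / tau_rise))


def peak_time(tau_fall=TAU_FALL, tau_rise=TAU_RISE):
    """Time at which the pulse reaches its maximum (derivative equals zero)."""
    return (tau_fall * tau_rise) / (tau_fall - tau_rise) * math.log(tau_fall / tau_rise)


if __name__ == "__main__":
    q = 10e-15  # hypothetical 10 fC deposited charge
    t_peak = peak_time()
    print(f"peak at {t_peak:.3e} s, current {induced_current(t_peak, q):.3e} A")
```

With this normalization the area under the pulse equals the deposited charge, which is the quantity compared against Q_critical elsewhere in the deck.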

11 Soft Error Trends
- DRAM: system error rate is fairly constant
- SRAM: increasing exponentially
- Logic
What is the issue with soft errors? The main problem is that the trend is increasing, and significantly so for SRAM and logic in particular.

12 Increasing Soft Error Rates
- Reducing feature sizes and lower supply voltage
  - Decreasing node capacitance and noise margins
  - Q_critical reducing
  - Exponentially more low-energy particles than high-energy ones
- More transistors per chip
  - More functionality is moving on-chip
  - Higher probability of error due to more faults
- Increasing clock rates
  - Larger fraction of time falls between setup and hold times, so a transient pulse is more likely to be latched

13 One Failure per Day per Chip
[Shivakumar et al 2002] Soft error rates could increase from one error per year to one error per day in a decade!
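To put the two rates on a common scale, failure rates are often quoted in FIT (failures per 10^9 device-hours). The sketch below is a small arithmetic illustration, not a figure from the cited study; it assumes a 365-day year, and `failures_per_hour_to_fit` is a name chosen here.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: 365-day year
FIT_HOURS = 1e9            # one FIT = one failure per 10^9 device-hours


def failures_per_hour_to_fit(rate_per_hour):
    """Convert a failure rate in failures/hour to FIT."""
    return rate_per_hour * FIT_HOURS


one_per_year = failures_per_hour_to_fit(1 / HOURS_PER_YEAR)
one_per_day = failures_per_hour_to_fit(1 / 24)

if __name__ == "__main__":
    print(f"one error per year: {one_per_year:.0f} FIT")
    print(f"one error per day:  {one_per_day:.0f} FIT")
    print(f"increase factor:    {one_per_day / one_per_year:.0f}x")
```

Moving from one error per year to one error per day is therefore a roughly 365-fold increase in failure rate.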

14 Transient Faults, Bit Flips, Soft Errors, etc.
Fault → (activation, fault latency) → Error → (propagation, error latency) → Failure
- Storage device (e.g., memory, cache, registers): bit flips = transient faults = soft errors
- Processor pipeline: transient faults in a logical device (e.g., ALU) or sequential device (e.g., FF) → circuit masking → soft errors → MA/SW masking → system failures (e.g., crash)
In storage devices such as main memory and caches, bit flips = transient faults (TF) = soft errors (SE). Elsewhere, a TF becomes a soft error only if it is not masked by effects such as circuit masking. Similarly, an SE that is not masked can cause system failures such as crashes.

15 Approaching the Soft Error Problem
Hardware interface
- Visible: processor hardware, chip design
- Obscured: application behavior
- Soft error perspective: fault → soft error (bit-flip)
- Protection: correct the soft error

16 Processor Hardware Layers
- Device level
- Circuit level
- Chip packaging
- Processor microarchitecture

17 Hardware Techniques
- Transient faults are incident on the hardware bit
  - Transistors and gates are affected by transient faults
- Attack the problem at its source
  - Packaging techniques to shield the transistors: refining packaging material
  - Circuit and device techniques to make them resilient to transient faults: SOI (Silicon on Insulator), fault-resilient (hardened) latches
  - Architecture techniques to detect and correct bit faults: parity (detect only), ECC (1-bit detect and correct), SECDED (1-bit detect and correct, 2-bit detect)
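The difference between parity and SECDED can be sketched on a 4-bit data word. The example below uses an extended Hamming(8,4) code as one possible SECDED construction; it is an illustration chosen here, not the scheme of any particular processor, and the function names are hypothetical.

```python
def parity(word):
    """Even-parity bit of an integer: detects any odd number of bit flips."""
    return bin(word).count("1") & 1


def secded_encode(nibble):
    """Encode 4 data bits as 8 bits: index 0 is overall parity, 1..7 is Hamming(7,4)."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    for pos in range(1, 8):
        code[0] ^= code[pos]               # overall parity over positions 1..7
    return code


def secded_decode(code):
    """Return (nibble, status); nibble is None when a double error is detected."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                # XOR of set positions locates a single flip
    overall = 0
    for bit in code:
        overall ^= bit                     # 0 when the total parity is intact
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1                # single flip; syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        return None, "double-error"        # parity intact but syndrome nonzero
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


if __name__ == "__main__":
    word = secded_encode(0b1011)
    word[6] ^= 1                           # inject a single bit flip
    print(secded_decode(word))
```

Parity alone costs one extra bit and can only flag the error, whereas the SECDED sketch doubles the storage for a 4-bit word; real designs amortize the check bits over wider words.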

18 Soft Error Masking Effects in Electronic Circuitry – Electrical Masking
- Pulse attenuated by electrical resistance in the circuit
- Pulse still strong enough to be latched at output

19 Soft Error Masking Effects in Electronic Circuitry – Logical Masking
Value unchanged at the gate

20 Soft Error Masking Effects in Electronic Circuitry – Logical Masking
Error propagated to the output

21 Soft Error Masking Effects in Electronic Circuitry – Temporal Masking
(Figure: transient fault, soft error)
A transient pulse at the latching window:
- Before t_setup → masked (not latched)
- After t_setup, before t_hold → race condition
- At the latching window → not masked (latched)
[Firouzi ROCS 2010]
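A minimal sketch of temporal masking follows, under a simplifying assumption made here: a pulse counts as latched whenever it overlaps the window from `t_setup` before the clock edge to `t_hold` after it. The race-condition case is not modeled separately, and all names and numbers are illustrative.

```python
def is_latched(pulse_start, pulse_end, clock_edge, t_setup, t_hold):
    """True if the transient pulse overlaps the latching window around the clock edge."""
    window_start = clock_edge - t_setup
    window_end = clock_edge + t_hold
    return pulse_start < window_end and pulse_end > window_start


if __name__ == "__main__":
    # Hypothetical timings in nanoseconds, clock edge at t = 5.0
    print(is_latched(0.0, 1.0, 5.0, 0.5, 0.5))  # pulse ends well before the window
    print(is_latched(4.8, 5.2, 5.0, 0.5, 0.5))  # pulse spans the clock edge
```

Under this model, a shorter clock period makes the window a larger fraction of each cycle, so a randomly timed pulse is more likely to be latched.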

22 Hardware Techniques for Protection
- Shielding at the package layer
  - Method to prevent strikes
  - Limited by packaging design/cost and technology available
- Device-level techniques
  - Scalability, design flexibility and cost are concerns
  - Fabrication cost governs commercial applicability
- Circuit-level techniques
  - Masking effects: electrical masking, temporal masking, logical masking
  - Circuit implementations enhance the masking effects to protect against SEUs
- Limitations
  - Hardware overheads of area/power and design cost
  - Overhead vs. need governs commercial applicability of methods

23 Approaching the Soft Error Problem
Software interface
- Visible: application behavior
- Obscured: processor microarchitecture, chip design
- Soft error perspective: anomaly in program execution
- Protection: recover from incorrect execution

24 How do Soft Errors Manifest in the System?
ESWEEK 2012 Tutorial Presentation 9/21/2018
(Figure: application/software → compiler → executable binary → PROCESSOR → program output, with random charged particles causing bit errors in hardware components)
Outcomes from a soft error:
- Output data corruption
- Incorrect program execution
- System crash
- Silent undetected data corruption
Masked soft errors:
- Correct program execution
- No system crash
Not all bit errors in the processor hardware translate into system-level errors (or failures). The reason is masking.

25 Software-level Masking Effects - Logical/Arithmetic Masking
Program:
  If A > 0 Then Block 1 Else Block 2 EndIf
Scenario 1: A = 5 (0b0101); a bit flip changes it to A = 7 (0b0111). A > 0 still holds, so Block 1 is executed as expected: the error is masked.
Scenario 2: A = 0 (0b0000); a bit flip changes it to A = 2 (0b0010). A > 0 now holds, so the expected Block 2 is NOT executed: the error is not masked.
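The two scenarios on this slide can be replayed in a few lines. The sketch assumes the branch selects Block 1 when A > 0 and Block 2 otherwise; `flip_bit`, `taken_block`, and `is_masked` are names introduced here for illustration.

```python
def flip_bit(value, bit):
    """Return `value` with the given bit position inverted."""
    return value ^ (1 << bit)


def taken_block(a):
    """Which block the branch on A selects (assumed Then/Else structure)."""
    return "Block 1" if a > 0 else "Block 2"


def is_masked(a, bit):
    """True if flipping `bit` of A leaves the branch outcome unchanged."""
    return taken_block(flip_bit(a, bit)) == taken_block(a)


if __name__ == "__main__":
    print(flip_bit(5, 1), is_masked(5, 1))  # Scenario 1: 5 becomes 7
    print(flip_bit(0, 1), is_masked(0, 1))  # Scenario 2: 0 becomes 2
```

The same single-bit flip is harmless in one scenario and changes control flow in the other, which is why masking depends on the data value as well as on the fault location.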

26 Software-level Masking Effects - Control Flow Masking
Program:
  B = 5
  C = 10
  If A > 0 Then B = B + 5 Else C = C + 2 EndIf
Scenario 1 (A = 5): the branch that uses B is taken, so an error in B manifests as a failure.
Scenario 2 (A = 0): the branch that uses B is not taken, so the error in B is masked.

27 Software-level Masking Effects
Other masking effects:
- Data-value masking: if the value of one operand in a multiplication is 0, an error in the other operand is masked.
- An error in a variable that is reset before its next use in the program is masked, e.g., an error in the index variable of a for loop.
- Dynamic dead code: an error in a variable used in a computation block results in incorrect temporary data. If that temporary data is not used in computing the program output or in steering execution, the block is dynamically dead and the error is masked.
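Two of these effects are easy to demonstrate directly. The sketch below is illustrative only: the function names are hypothetical, and the bit flip is injected in software rather than by a real particle strike.

```python
def flip_bit(value, bit):
    """Return `value` with the given bit position inverted."""
    return value ^ (1 << bit)


def masked_by_zero(x, y, bit):
    """Data-value masking: does a flip in x change the product x * y?"""
    return flip_bit(x, bit) * y == x * y


def sum_with_corrupted_index(values, bit):
    """An index corrupted before the loop is reset by the loop itself."""
    i = flip_bit(0, bit)              # stale index struck by the error
    total = 0
    for i in range(len(values)):      # the loop overwrites i before any use
        total += values[i]
    return total


if __name__ == "__main__":
    print(masked_by_zero(6, 0, 2))    # other operand is 0
    print(masked_by_zero(6, 3, 2))    # other operand is nonzero
    print(sum_with_corrupted_index([1, 2, 3], 4))
```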

28 Transient Faults, Bit Flips, Soft Errors, etc.
Fault → (activation, fault latency) → Error → (propagation, error latency) → Failure
- Storage device (e.g., memory, cache, registers): bit flips = transient faults = soft errors
- Processor pipeline: transient faults in a logical device (e.g., ALU) or sequential device (e.g., FF) → circuit masking → soft errors → MA/SW masking → system failures (e.g., crash)
In storage devices such as main memory and caches, bit flips = transient faults (TF) = soft errors (SE). Elsewhere, a TF becomes a soft error only if it is not masked by effects such as circuit masking. Similarly, an SE that is not masked can cause system failures such as crashes.

29 Software Techniques for Protection
The key ideas include:
- Reduce the time that vulnerable data resides on the components
- Detect and correct errors (if any) after execution through the components
Software-based techniques vary by the components protected (coverage):
- L1 cache
- Register file protection
- Pipeline core and buffers
- Redundancy-based techniques
- Control-flow-based techniques

30 Research Front Error Detection and Recovery Techniques
- Recovery through restart may be an acceptable option
  - Cost of an integrated recovery mechanism is substantial
  - Recovery through restart may not be acceptable for HPC systems
- Advantages of compiler-based methods
  - Can analyze soft errors that transcend the microarchitecture- and software-level masking effects
  - Can implement smart optimizations for efficient protection
- Limitations of software approaches
  - The granularity of optimizations is coarser, which limits what can be protected
  - Code added for detection/correction is itself vulnerable to soft errors and may therefore cancel out the protection achieved

31 Approaching the Soft Error Problem
Hardware-software interface
- Visible: processor hardware, application behavior
- Obscured: chip design
- Soft error perspective: fault → soft error (bit-flip) → failure
- Protection: protect against soft-error-induced failures

32 How to Estimate Soft Errors?
Soft error: a data bit-flip during program execution that translates into erroneous output.
(Figure: application binary → processor [pipeline, register file, instruction/data cache, buffers] → output)
Analysis is most relevant at the interface between a data bit and a sequential element.

33 Data Vulnerability & Soft Errors
Data processed by the system is exposed by the hardware components in the processor to:
- charged particles that strike the processor
- other sources of transient errors: electrical noise, cross-talk, etc.
Exposed data is vulnerable to bit-flips, which manifest as failures in the system: erroneous output, system failure, output data errors.
The probability of a bit error manifesting as a soft error ∝ the time duration the data is exposed in hardware.
An error on an exposed data bit in the processor can lead to system errors only if the bit will be used during process execution.
Only the exposed and actively used data bits in the system are deemed vulnerable during process execution.

34 Vulnerability in the Cache
(Figure: timelines of accesses to a cache block. Instruction cache: events I, R, E over t0 to t7. Data cache: events I, W, R, E over t0 to t5.)
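The idea behind these timelines can be sketched as an accounting of vulnerable time for one cache block. The sketch assumes that I, R, W, and E stand for incoming fill, read, write, and eviction, that an interval is vulnerable when the data it holds is later read, and that a dirty block is written back on eviction; the traces and the name `vulnerable_time` are hypothetical.

```python
def vulnerable_time(trace):
    """Sum the intervals during which a bit flip in the block could be consumed.

    trace: list of (time, event) pairs sorted by time, with event in
    {"I", "R", "W", "E"} (assumed: fill, read, write, evict).
    """
    total = 0
    dirty = False
    prev_time = None
    for time, event in trace:
        if prev_time is not None:
            # Data waiting to be read is vulnerable; data about to be
            # overwritten is not. Dirty data is consumed by the write-back.
            if event == "R" or (event == "E" and dirty):
                total += time - prev_time
        if event == "W":
            dirty = True
        elif event == "I":
            dirty = False
        prev_time = None if event == "E" else time
    return total


if __name__ == "__main__":
    icache = [(0, "I"), (2, "R"), (5, "R"), (7, "E")]   # hypothetical trace
    dcache = [(0, "I"), (1, "W"), (3, "R"), (5, "E")]   # hypothetical trace
    print(vulnerable_time(icache), vulnerable_time(dcache))
```

In the instruction-cache trace the time between the last read and the clean eviction is not counted, while in the data-cache trace the dirty block stays vulnerable until it is evicted.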

35 Vulnerability Distribution in the processor with protected Cache

36 Code Transformations for Vulnerability Reduction
[Shrivastava et al 2010] Loop interchange on matrix multiplication
- Vulnerability trend is not the same as the performance trend
- Interesting configurations exist, with either low vulnerability or low runtime
- 52X variation in vulnerability for 1% variation in runtime
- Opportunities may exist to trade off a little runtime for large savings in vulnerability

37 Vulnerability depends on the data access pattern
Low vulnerability but bad performance:
for ( i : 0 ≤ i < N ) {
  for ( k : 0 ≤ k < N ) {
    for ( j : 0 ≤ j < N ) {
      A[i][k] += B[i][j] * C[j][k]
    }
  }
}
A[i][k] is completely computed in the innermost loop → shorter lifetime of A[i][k]

High vulnerability but good performance:
for ( i : 0 ≤ i < N ) {
  for ( j : 0 ≤ j < N ) {
    for ( k : 0 ≤ k < N ) {
      A[i][k] += B[i][j] * C[j][k]
    }
  }
}
A[i][k] is needed across iterations of an enclosing loop (j) → longer lifetime of A[i][k]
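The lifetime difference between the two loop orders can be measured with a small trace. The sketch below counts, for each A[i][k], the number of innermost iterations between its first and last access; this span is used here as a rough stand-in for exposure time and is not the vulnerability metric of the cited work. The function names are illustrative.

```python
def access_spans(order, n):
    """Span between first and last access of each A[i][k] for a loop order.

    order: three letters naming the loops from outermost to innermost,
    e.g. "ikj" or "ijk".
    """
    first, last = {}, {}
    step = 0
    for x in range(n):
        for y in range(n):
            for z in range(n):
                idx = dict(zip(order, (x, y, z)))
                key = (idx["i"], idx["k"])   # element of A touched this iteration
                first.setdefault(key, step)
                last[key] = step
                step += 1
    return {key: last[key] - first[key] for key in first}


def average_span(order, n):
    """Mean access span over all elements of A."""
    spans = access_spans(order, n)
    return sum(spans.values()) / len(spans)


if __name__ == "__main__":
    n = 4
    print("ikj:", average_span("ikj", n))
    print("ijk:", average_span("ijk", n))
```

With k innermost, each A[i][k] is revisited once per j iteration, so its span grows with N squared rather than N.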

38 Soft Error Protection at H/w-S/w interface
Vulnerability of data is directly proportional to the probability of soft error failures.
Fault injection / simulation / static estimation
Estimation of vulnerability is a key factor for:
- Comparative analysis of two protection designs
- Design space exploration of architecture designs
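A fault-injection estimate can be sketched as follows: run a kernel once to get a reference output, then flip every bit of every input word in turn and count how often the output changes. The kernel, the 8-bit word width, and all names below are hypothetical choices made for illustration; real campaigns usually sample faults randomly across hardware state rather than enumerating inputs.

```python
def flip_bit(value, bit):
    """Return `value` with the given bit position inverted."""
    return value ^ (1 << bit)


def kernel(words):
    """Toy workload: only the low nibble of each word affects the output."""
    return sum(w & 0x0F for w in words)


def failure_fraction(words, width=8):
    """Fraction of single-bit input faults that change the kernel output."""
    golden = kernel(words)             # fault-free reference run
    failures = 0
    trials = 0
    for index in range(len(words)):
        for bit in range(width):
            faulty = list(words)
            faulty[index] = flip_bit(faulty[index], bit)
            trials += 1
            if kernel(faulty) != golden:
                failures += 1
    return failures / trials


if __name__ == "__main__":
    print(failure_fraction([0x12, 0xA5, 0xFF]))
```

Faults in the upper nibble never reach the output of this kernel, so only half of the injected faults count toward its vulnerability.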

39 Interesting Research Questions
- How to design cross-layer protection?
  - What component should be protected, and at which layer?
  - How to coordinate among the techniques across layers?
- Evaluation metric for soft error protection
  - No comprehensive system-level estimation method
  - Compiler has immense potential to contribute but lacks comprehensive static estimation methods
- Time-based vulnerability in systems
  - Vulnerability of an application is not static through time
- Vulnerability of multi-core systems
  - Inter-thread communication affects vulnerability analysis
- Soft error protection in HPC systems
  - Communication overhead limits use of multi-core techniques

40 Thank You !

