Barcelona, Spain November 13, 2005 WAR-1: Assessing SEU Vulnerability Via Circuit-Level Timing Analysis
Assessing SEU Vulnerability via Circuit-Level Timing Analysis
Kypros Constantinides ‡, Stephen Plaza ‡, Jason Blome ‡, Bin Zhang †, Valeria Bertacco ‡, Scott Mahlke ‡, Todd Austin ‡, Michael Orshansky †
‡ Advanced Computer Architecture Lab, University of Michigan
† Department of Electrical and Computer Engineering, University of Texas at Austin
Introduction
There has recently been growing concern about transient faults in combinational logic
Numerous techniques already exist that deal with the effects of transient faults:
– Error Correction Codes (ECC)
– DIVA
– Simultaneous and Redundant Threading (SRT)
– and many others…
However, these techniques come at a cost in performance, power, die size, and design time.
Introduction
Designers have to trade off the reliability provided against the implementation cost
– Inadequate soft-error protection: may be useless due to poor reliability
– Excessive soft-error protection: uncompetitive in cost and/or performance
To balance this trade-off, system designers need accurate soft-error rates (SERs) for their designs
The device community provides raw SERs for devices of current technologies and projections for devices of future technologies
However, architecture-level and circuit-level phenomena derate the raw SER
Accurately assessing a design’s SER requires a detailed circuit-level analysis infrastructure
In This Work…
We introduce a high-fidelity, high-performance simulation infrastructure for estimating soft-error rates
– asynchronously injects voltage pulses of various durations at the gate level
– accurately gauges detailed circuit phenomena to model fault introduction, fault propagation, and possible fault masking
– simulates fast enough to permit the examination of entire workloads on complex designs (thousands of gates)
Soft Error Masking
Fortunately, not all transient faults cause an error
– Circuit and architectural phenomena prevent the fault from propagating to the design’s output and causing an error:
  Logic masking
  Timing masking
  Electrical masking
  Microarchitecture masking
  Software masking
Soft Error Masking
Logic Masking: the fault gets blocked by a following gate whose output is completely determined by its other inputs
Timing Masking: the fault affects the input of a latch only in the period of time that the latch is not sensitive to its input
Electrical Masking: the fault’s pulse is attenuated by subsequent logic gates due to electrical properties, and does not affect any latch’s input
Microarchitectural Masking: the fault alters the value of at least one flip-flop, but the incorrect values get overwritten without being used in any computation affecting the design’s output
Software Masking: the fault propagates to the design’s output but is subsequently masked by software without affecting the application’s correct execution
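Logic masking can be illustrated on a toy circuit. The sketch below is not taken from the presented infrastructure: the circuit out = (a AND b) OR c and the names `circuit` and `logic_masking_fraction` are assumptions chosen only for illustration. It inverts the AND gate's output and counts the input vectors for which the OR gate's output is already determined by c, so the fault never reaches the output.

```python
from itertools import product


def circuit(a, b, c, flip_and=False):
    """Toy circuit out = (a AND b) OR c (hypothetical, for illustration).

    When flip_and is True, the AND gate's output is inverted to model
    a transient fault on that internal node.
    """
    n1 = a & b
    if flip_and:
        n1 ^= 1  # injected fault on the internal node
    return n1 | c


def logic_masking_fraction():
    """Fraction of input vectors for which the fault is logically masked."""
    vectors = list(product((0, 1), repeat=3))
    masked = sum(
        circuit(a, b, c) == circuit(a, b, c, flip_and=True)
        for a, b, c in vectors
    )
    return masked / len(vectors)
```

Whenever c = 1, the OR gate's output is fixed by c alone, so the flipped node is blocked; with c = 0 the fault always propagates.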
Simulation Infrastructure
Design Under Test: gate-level description of the design (netlist)
– Fault-Exposed Model: subjected to fault injection
– Golden Model: no faults injected
Fault Generator: injects voltage pulses of various durations at any gate in the design and flips the value of any flip-flop in the design
– faults are uniformly distributed in time, location, and duration
Fault Analyzer: monitors manifested errors and tracks all the possible ways a fault can be masked
Model Stimuli: workload traces that exercise the design under test
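The golden-model versus fault-exposed-model comparison can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' infrastructure: the design is a hypothetical one-entry buffer rather than a gate-level netlist, only flip-flop bit flips are modeled (no voltage pulses), and all names (`step`, `run`, `campaign`, `TRACE`) are invented for the example.

```python
import random


def step(state, stimulus):
    """One clock cycle of a hypothetical one-entry buffer.

    state is (valid, data); stimulus is (write, read, value).
    """
    valid, data = state
    write, read, value = stimulus
    out = data if (read and valid) else 0
    if read:
        valid = 0
    if write:
        valid, data = 1, value
    return (valid, data), out


def run(trace, fault=None):
    """Simulate the trace; fault=(cycle, bit) flips one data bit
    of the stored state just before that cycle is evaluated."""
    state, outputs = (0, 0), []
    for cycle, stimulus in enumerate(trace):
        if fault is not None and fault[0] == cycle:
            valid, data = state
            state = (valid, data ^ (1 << fault[1]))
        state, out = step(state, stimulus)
        outputs.append(out)
    return outputs


def campaign(trace, n_faults, seed=0):
    """Inject n_faults single bit flips, uniformly distributed in time
    and location, and compare each run against the golden run."""
    rng = random.Random(seed)
    golden = run(trace)  # golden model: no fault injected
    errors = 0
    for _ in range(n_faults):
        fault = (rng.randrange(len(trace)), rng.randrange(4))
        if run(trace, fault) != golden:
            errors += 1
    return errors, n_faults - errors  # (manifested errors, masked faults)


# Hypothetical workload trace: write 5, read, idle, write 3, read.
TRACE = [(1, 0, 5), (0, 1, 0), (0, 0, 0), (1, 0, 3), (0, 1, 0)]
```

In this toy design, a flip that lands while the buffer is empty is overwritten by the next write before it is read, which mirrors the microarchitectural masking described earlier.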
Statistical Model for Transient Faults
Pulse-based model for transient faults caused by energetic particle strikes
Faults injected into combinational logic are classified by their duration
– 20%, 40%, 60%, 80% and 100% of the design’s clock period
Faults injected into sequential elements flip their value
The arrival rate of each type of fault is modeled by a separate random variable
The mean inter-arrival times for each fault type are derived from previously published data and detailed SPICE simulations
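A fault schedule following this model might be generated as in the sketch below. Several points are assumptions rather than facts from the slides: inter-arrival times are drawn from an exponential distribution (the slides say only that each fault type has its own random variable), the mean inter-arrival values are supplied by the caller and are hypothetical, and the function name `fault_schedule` is invented. Only pulse faults in combinational logic are scheduled; flip-flop bit flips would need a separate stream.

```python
import random

# Pulse durations as fractions of the clock period (from the slide).
PULSE_FRACTIONS = (0.2, 0.4, 0.6, 0.8, 1.0)


def fault_schedule(mean_interarrival, n_gates, sim_cycles, seed=0):
    """Return a time-sorted list of (time, gate_index, pulse_fraction).

    mean_interarrival maps each pulse fraction to its mean inter-arrival
    time in clock cycles (hypothetical values supplied by the caller).
    Times are real-valued, so pulses are not aligned to clock edges.
    Exponential inter-arrival times are an assumption of this sketch.
    """
    rng = random.Random(seed)
    events = []
    for fraction, mean in mean_interarrival.items():
        t = rng.expovariate(1.0 / mean)
        while t < sim_cycles:
            # Location is drawn uniformly over all gates.
            events.append((t, rng.randrange(n_gates), fraction))
            t += rng.expovariate(1.0 / mean)
    events.sort()
    return events
```

For example, `fault_schedule({f: 50.0 for f in PULSE_FRACTIONS}, n_gates=9000, sim_cycles=1000)` yields one independent arrival stream per pulse duration, merged into a single ordered schedule.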
Design Under Test – CMP Switch
As the design under test we chose a single chip-multiprocessor (CMP) interconnection switch (baseline provided by Li-Shiuan Peh)
– Much less complex than a microprocessor, yet not too simplistic (it includes finite state machines, buffers, control logic, and buses)
Wormhole switch pipelined at the flit level
Specified in Verilog and synthesized to a gate-level netlist
~9K logic gates and 1700 sequential elements
Realistic workload
– Communication traces derived from the TRIPS architecture
Characterization per Fault Type
High microarchitectural masking
– 95% of the faults that flip a flip-flop’s value are masked
Timing masking is significant only for faults with small pulse durations
Logic masking increases as the fault’s pulse duration decreases
[Chart labels: 51.7% logic masking, 2.2% timing masking, 42.9% μarch masking, 3.2% error]
Derating Factor
Derating factor = (error rate)^-1
– i.e., a derating factor of 30 means that one of every 30 injected faults will cause an error (an error rate of 3.3%)
Average derating factor for realistic workloads is 31 (error rate: 3.2%)
A synthetic high-utilization workload leads to a derating factor of 12 (error rate: 8.3%)
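The relationship between derating factor and error rate is simple reciprocal arithmetic, which the following Python sketch checks against the figures on this slide. The function names are assumptions, and `derated_rate` is an illustrative helper whose raw-rate argument is a caller-supplied, hypothetical value rather than data from the presentation.

```python
def error_rate(derating_factor):
    """Fraction of injected faults that cause an error."""
    return 1.0 / derating_factor


def derating_factor(rate):
    """Inverse of the error rate."""
    return 1.0 / rate


def derated_rate(raw_fault_rate, factor):
    """Scale a raw fault rate (any unit, hypothetical input) down by the
    derating factor to estimate the rate of faults that become errors."""
    return raw_fault_rate / factor


# Error rates, in percent, for the derating factors quoted on the slide.
percent = {d: round(100 * error_rate(d), 1) for d in (30, 31, 12)}
```

Here `percent` maps 30 to 3.3, 31 to 3.2, and 12 to 8.3, matching the quoted error rates.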
Failure Rate Projections
Using ITRS projections and raw SER estimates for future process technologies, we project failure rates that account for transient-fault derating effects
The design architecture is kept intact for future process technologies
Two different designs:
– one clocked at the projected clock frequencies for microprocessors
– and one clocked at the projected clock frequencies for interconnection networks
Transient-fault Vulnerability per Component
We observed that each switch component exhibits a different vulnerability to transient faults
Derating effects greatly depend on the component’s characteristics
Most vulnerable component
– Switch Arbiter (12.8% error)
– 6% of switch’s area
Input Controllers
– dominate switch design
– 86% of switch’s area
The switch’s vulnerability matches that of the input controllers
Effects of Multi-fault Strikes
A single strike may cause multiple faults on neighbouring gates or flip-flops
– data on the frequency of such events, and models for multi-fault strikes on logic gates and flip-flops, are lacking
– we assume that each strike causes multiple faults (extremely pessimistic)
– even under this severe environment the failure rates are relatively low
Conclusions – Directions for Future Work
Conclusions
For complex designs there is significant fault masking, with derating factors as high as 30
Soft-error derating effects depend heavily on the design’s characteristics and utilization
Our observations suggest that the soft-error reliability threat might have been overstated by the computer architecture community
– Designers need to evaluate their design’s soft-error tolerance with detailed analysis tools that consider circuit-level derating effects, and to better trade off the protection provided against the implementation cost
Future Work
Study the soft-error derating effects for several designs with different levels of complexity and different characteristics
Enhance our simulation infrastructure so that it can simulate large, high-complexity systems (millions of gates) with short simulation runs
Questions?