Cost-Efficient Soft Error Protection for Embedded Microprocessors

Name: Cost-Efficient Soft Error Protection for Embedded Microprocessors
Uploaded: 2017-10-03T20:23:58+00:00
Duration: PTM18S54
Description: Cost-Efficient Soft Error Protection for Embedded Microprocessors

Cost-Efficient Soft Error Protection for Embedded Microprocessors
Jason Blome (1), Shuguang Feng (1), Shantanu Gupta (1), Scott Mahlke (1), Daryl Bradley (2)
(1) University of Michigan, (2) ARM, Ltd.

This work was done in collaboration with Daryl Bradley from ARM Ltd., along with Shuguang Feng, Shantanu Gupta, and Scott Mahlke from the University of Michigan.

The Soft Error Problem
[Figure: flip-flop with D, Q, and CLK signals; labels: transient fault, soft error]

To begin, we're going to start with a little bit of background on the soft error problem: what it is, and some projected trends. A soft error is a transient piece of incorrect hardware state, also known as a single event upset or a transient error. Soft errors can be caused by a number of phenomena, ranging from electrical noise such as crosstalk to high-energy particle strikes caused by radiation from the atmosphere or semiconductor packaging materials. In this work we focus on the projected trends for soft errors caused by radiated particles; however, the detection and correction techniques presented here are not particular to any specific cause.

In general, a soft error can occur in a number of ways. For example, a high-energy particle can strike a sequential state element and potentially invert the value stored in that element. Another way that a soft error may occur is if a particle strikes a combinational logic node, causing an incorrect value to be temporarily present at the output of that node. If this incorrect value then propagates to a state element and is stored, we again have a soft error. In this work, we refer to the incorrect value within combinational logic as a transient fault, and to an incorrect stored value as a soft error.

There are also a number of natural mechanisms within a semiconductor device that may mask the effects of a transient fault. We'll briefly discuss these mechanisms now.

Fault Masking
Architectural/Software: incorrect state is overwritten before it is read
Electrical: the fault pulse is electrically attenuated by subsequent gates in the circuit
Latching-Window: the fault pulse does not reach a state element within the latching window
Logical: the faulted value does not affect the logical operation of the circuit
[Figure: one illustration per mechanism, including a register file with its decoder, a CLK waveform marked with tsetup and thold, and the instructions mov r2, 4; mov r5, 8; add r6, r2, r5]

The first mechanism presented here is logical masking, which occurs when the output of a combinational circuit is unaffected by a transient fault within the circuit. Here we have a fault occurring where the faulted value is logically ANDed with 0, thus masking the effects of the transient fault.

The next fault-masking mechanism is latching-window masking. Latching-window masking occurs when the transient fault does not reach a state element during the required setup/hold time window.

Electrical attenuation occurs as a result of the electrical properties of a chain of logic gates, where each gate may reduce the severity of a voltage spike.

Lastly, architectural masking occurs when an erroneous value is overwritten before it is read. For example, in this code sequence a fault occurs in r5 in the first cycle, but because the value in r5 is overwritten in the second cycle, the result of the add instruction in the third cycle is unaffected.

In this work we model and study the effects of logical, latching-window, and architectural masking, but do not account for the effects of electrical attenuation.
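As a rough illustration of three of the masking mechanisms described above, the following Python sketch models each one in a few lines. Everything in it is an assumption made for illustration: the function names, the simplified "pulse overlaps the setup/hold window" test, the toy register model, and the instruction order (chosen so that r5 is overwritten in the second cycle, as the notes describe). Electrical attenuation is left out, as it is in the study.

```python
def and_gate(a: int, b: int) -> int:
    return a & b


def logically_masked(fault_free_a: int, b: int) -> bool:
    """True if inverting input a leaves the AND gate output unchanged."""
    faulty_a = fault_free_a ^ 1
    return and_gate(fault_free_a, b) == and_gate(faulty_a, b)


def latching_window_masked(pulse_start: float, pulse_end: float,
                           clk_edge: float, t_setup: float,
                           t_hold: float) -> bool:
    """True if the fault pulse misses the window [edge - setup, edge + hold].

    Simplification (assumed): any overlap with the window counts as latched.
    """
    window_start = clk_edge - t_setup
    window_end = clk_edge + t_hold
    return pulse_end <= window_start or pulse_start >= window_end


# Toy program: one instruction per cycle (ordering is an assumption).
PROGRAM = [
    ("mov", "r2", 4),           # cycle 1
    ("mov", "r5", 8),           # cycle 2: overwrites r5
    ("add", "r6", "r2", "r5"),  # cycle 3: reads r2 and r5
]


def run(program, fault=None):
    """Run the toy program; fault = (cycle, register, xor_mask) or None."""
    regs = {f"r{i}": 0 for i in range(8)}
    for cycle, instr in enumerate(program, start=1):
        if fault is not None and fault[0] == cycle:
            _, reg, mask = fault
            regs[reg] ^= mask  # corrupt the register at the start of the cycle
        op, dst, *src = instr
        if op == "mov":
            regs[dst] = src[0]
        elif op == "add":
            regs[dst] = regs[src[0]] + regs[src[1]]
    return regs
```

In this sketch, a fault injected into r5 in cycle 1 is architecturally masked because the mov in cycle 2 overwrites it, whereas the same fault injected in cycle 3 reaches the add and corrupts r6.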

Soft Error Rate Contributions
[Figures: left, "Soft Error Rate Contributions" (Mitra 2005); right, "Soft Error Rate Trends" (Shivakumar 2002)]
Increasing contribution of faults in combinational logic to the overall soft error rate

The graph on the left is data presented by Subhashish Mitra from Intel that breaks down the overall soft error rate (SER) of a high-performance microprocessor by design element. The yellow portion of this chart is the contribution from sequential state elements such as registers and latches, the portion in purple is the contribution from unprotected SRAMs, and the portion in blue is the contribution from combinational logic. On the right is a graph presented by Shivakumar from UT-Austin, which predicts that the SER for combinational logic will increase dramatically in the next few technology generations, while the SER for SRAM cells and sequential elements is expected to remain relatively constant.

Relating this to the embedded design space, we would expect the blue section to be a much more significant portion of the total soft error rate, simply because embedded designs are not nearly so aggressively pipelined, leading to a larger ratio of combinational logic to sequential state elements. Further, as shown in the graph on the right, the SER of logic is expected to increase over the next few technology generations, which would also broaden the effects of faults in combinational logic.

The important point to take away here is that the effects of faults in combinational logic are expected to become much more significant. However, most soft error solutions focus simply on well-structured SRAMs and state arrays, leaving designs vulnerable to faults in logic. Further, the most common techniques for protecting against faults in logic require structural duplication, which is prohibitively expensive within the embedded domain.

Outline
Soft error analysis setup
Summary of fault analysis results
Fault tolerance techniques
  Register value cache
  Strategic deployment of fault detectors
Conclusion

Fault Analysis Framework
[Figure: ARM926EJ-S block diagram (instruction fetch, instruction decode, register bank, ALU, shift, multiply, address logic, data interface, caches, MMU, write buffer, bus interface) with an area breakdown chart]
[Figure: fault injection/error analysis framework: testbench, test and reference designs, benchmark, fault injection scheduler, error checking and logging, report generation]

In this work we conducted our experiments using a Verilog model of an ARM926EJ-S microprocessor core. The ARM926 is a Harvard architecture consisting of a standard five-stage pipeline with 4KB instruction and data caches. This model was synthesized using an Artisan cell library characterized for a 130nm process with a maximum clock frequency of 200MHz. The chart in the top right corner depicts the area consumption of different design elements within the processor. This chart shows that while the core area is dominated by SRAM arrays, the area consumed by combinational logic is greater than that consumed by sequential state elements.

Next we have a high-level diagram of the fault analysis framework used to study the effects of soft errors within the processor core. In this framework, a testbench instantiates two identical copies of the ARM926 processor core. At the beginning of each simulation, the fault injection scheduler schedules a time at which to inject a fault into one of the cores.

If the experiment is meant to model the effects of faults occurring in state elements, a random clock cycle is selected for fault injection, and at the beginning of that cycle the output of a randomly selected register is inverted for the duration of the clock cycle. If the experiment is meant to model the effects of faults occurring in combinational logic, a random time instant is selected, along with a random duration between ¼ of a clock cycle and an entire clock cycle. The output of a randomly selected logic gate is then inverted at the fault injection time.

Since the fault injection time and duration are uniformly random and ignorant of clock-cycle boundaries, these experiments measure both logical masking and latching-window masking. After a fault is injected into the system, at each subsequent positive edge of the clock signal, every register within the processor core is compared with the same register in the second core. If a mismatch occurs, the location of the error and the clock cycle are logged.
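The scheduling and lockstep-comparison steps described above can be sketched in Python as follows. This is only a toy under stated assumptions: `toy_core_step` is a hypothetical four-register stand-in for the processor core (the real framework simulates a Verilog model), the state fault here flips a single bit rather than inverting a whole register output, and all names and dictionary keys are invented for illustration.

```python
import random


def schedule_state_fault(rng, num_cycles, num_registers, width=8):
    """Pick a random cycle, register, and bit for a state-element fault."""
    return {
        "cycle": rng.randrange(num_cycles),
        "register": rng.randrange(num_registers),
        "bit": rng.randrange(width),  # single-bit flip is an assumption
    }


def schedule_logic_fault(rng, num_cycles, num_gates, clock_period=1.0):
    """Pick a random instant, gate, and a duration of 1/4 to 1 clock cycle."""
    return {
        "start": rng.uniform(0.0, num_cycles * clock_period),
        "duration": rng.uniform(0.25 * clock_period, clock_period),
        "gate": rng.randrange(num_gates),
    }


def toy_core_step(regs):
    """Hypothetical next-state function standing in for one clock of a core."""
    return [(regs[3] + 1) & 0xFF, regs[0], regs[1] & 0x0F, regs[2]]


def lockstep_run(num_cycles, fault):
    """Run reference and test copies in lockstep and log register mismatches."""
    ref = [0, 0, 0, 0]
    test = [0, 0, 0, 0]
    log = []
    for cycle in range(num_cycles):
        if cycle == fault["cycle"]:
            # Inject the fault into the test copy only.
            test[fault["register"]] ^= 1 << fault["bit"]
        ref, test = toy_core_step(ref), toy_core_step(test)
        # Compare every register after the clock edge.
        for index, (expected, actual) in enumerate(zip(ref, test)):
            if expected != actual:
                log.append((cycle, index))
    return log
```

In this toy, a flip of bit 7 in register 1 is logically masked by the `& 0x0F` in the next-state function and never shows up in the log, while a flip of bit 0 in register 0 propagates to register 1 on the next edge.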

Observed Error Rates
At the software interface, error rates within 3%

Faults Occurring in Registers
  Error Site                  Error Rate
  Microarchitectural State    94%
  Architectural State         7%

Faults Occurring in Combinational Logic
  Error Site                  Error Rate
  Microarchitectural State    16%
  Architectural State         4%

In our first experiment we demonstrate the amount of observed logical masking within the 926 design while running an image processing kernel representing a typical workload on an embedded design. For this experiment we load the instruction memory with the benchmark, set the PC appropriately, and allow the test and reference designs to execute for 3000 cycles. Then, a uniform random distribution is used to select a cycle within the subsequent 3000 cycles at which a fault should be injected. The fault injection site may be either a logic node or a state element, and it is selected based on a random distribution over the set of either logic gates or state elements. In this experiment a fault is modeled as a logic state inversion, occurring at the cycle boundary and lasting for the duration of an entire clock cycle.

Here we show the observed logical masking rates for microarchitectural state, architectural state, and top-level ports in the design, where the masking rate is simply the complement of the error rate (one minus the error rate). The microarchitectural state is the set of all registers within the design, whereas the architectural state consists of the 31 32-bit GPRs and 6 status registers defined by the ARM ISA. The top-level ports are the entry and exit points for data in and out of the core.

We can see that the microarchitectural masking rate for faults occurring in state elements is considerably less than for faults occurring in combinational logic, meaning that when a fault occurs at a state element, it is much more likely to be expressed in the subsequent cycle. However, the rates of errors occurring at the software-visible level, in architectural state and at top-level ports (potentially sending incorrect data out on the memory bus), typically only differ by about 6%.

Also interesting to note here is the average number of bits corrupted when an error is observed. When a single fault is injected into a state element, it is typically expressed as a single microarchitectural bit error, whereas when a fault occurring in logic is expressed, it typically causes multiple state elements to hold incorrect values.
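To make the relationship between error rate and masking rate concrete, here is a small worked example in Python using the four error rates listed on the slide. It assumes the masking rate is one minus the error rate; the dictionary layout and names are illustrative only.

```python
# Error rates in percent, as listed on the slide.
ERROR_RATE_PCT = {
    ("registers", "microarchitectural"): 94,
    ("registers", "architectural"): 7,
    ("combinational logic", "microarchitectural"): 16,
    ("combinational logic", "architectural"): 4,
}


def masking_rate_pct(error_rate_pct: int) -> int:
    """Masking rate as the complement of the error rate, in percent."""
    return 100 - error_rate_pct


MASKING_RATE_PCT = {
    site: masking_rate_pct(rate) for site, rate in ERROR_RATE_PCT.items()
}

# Gap between the two architectural-state error rates, in percentage points.
ARCH_GAP_PCT = (ERROR_RATE_PCT[("registers", "architectural")]
                - ERROR_RATE_PCT[("combinational logic", "architectural")])
```

Under that assumption, faults in registers are masked at the microarchitectural level far less often than faults in combinational logic, while the architectural-state error rates for the two fault sites sit within a few percentage points of each other.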

Impact of Fault Injection

Targeting the Faults that Count
ARM926EJ-S register file consumes 8.7% of total core area
Responsible for 57.4% of architectural errors
Register file area dominated by combinational logic
ECC cost, efficacy?

The Register Value Cache
[Figure: register file (decoder, read/write address and data, read result) beside the register value cache, with comparators (CMP) and a Stall/Check CRC path]
Read/write values tee'd off before the register file
Can catch faults in either the register file or the cache, not both

The Register Value Cache
[Figure: register value cache structure; labels: Index Array, Valid, Value Array, Read/Write Addr, Read Data, Previous Read Values, Write Data, CRC, CMP, Error; operations shown: Write Operation, Read Operation, Error Check Operation]

Example
[Figure: step-by-step example of the register file and register cache for the sequence mov r2, 4; mov r5, 8; add r3, r2, r5 (also shown as add r3, r1, r4); the cache holds the values 4 and 8 with their CRCs, and a mismatch triggers Check CRC]
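The slides give only a structural view of the register value cache, so the following Python sketch is one possible reading of it rather than the authors' design. It assumes that written values are stored in a small cache together with a CRC, that a read compares the register file value against the cached copy, and that on a mismatch the CRC decides which copy is faulty. The class and method names, the entry count, the oldest-first replacement policy, and the use of `zlib.crc32` as the CRC are all assumptions.

```python
import zlib
from collections import OrderedDict


def crc(value: int) -> int:
    """CRC over a 32-bit register value (CRC-32 chosen for illustration)."""
    return zlib.crc32(value.to_bytes(4, "little"))


class RegisterValueCache:
    def __init__(self, num_entries: int = 4):
        self.num_entries = num_entries
        self.entries = OrderedDict()  # register index -> (value, crc)

    def on_write(self, reg: int, value: int) -> None:
        """Record a value that is also being written to the register file."""
        if reg not in self.entries and len(self.entries) >= self.num_entries:
            self.entries.popitem(last=False)  # evict oldest entry (assumed)
        self.entries[reg] = (value, crc(value))

    def check_read(self, reg: int, regfile_value: int):
        """Compare a register file read against the cache.

        Returns (value_to_use, status).
        """
        if reg not in self.entries:
            return regfile_value, "miss"        # no cached copy to compare
        cached, stored_crc = self.entries[reg]
        if cached == regfile_value:
            return regfile_value, "ok"
        # Mismatch: stall and use the CRC to decide which copy is faulty.
        if crc(cached) == stored_crc:
            return cached, "regfile_fault"      # cached copy is intact
        return regfile_value, "cache_fault"     # cached copy is corrupted
```

For instance, after `on_write(2, 4)` and `on_write(5, 8)`, a read of r5 that returns a corrupted value from the register file would be flagged as `"regfile_fault"` and the cached 8 used instead. Consistent with the slide's note, this scheme can only resolve a fault in one of the two structures at a time.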

RVC Fault Coverage
57.4%

RVC Overhead

What About the Rest?
Leverage fault fanout to place detectors at likely targets

Fault Fanout

Transient Fault Detector
[Figure: detector circuit; labels: Main Flip-Flop, Shadow Latch, Q, CLK, Delay, Error]
We want to detect these events when they happen.
We don't want to place these detectors everywhere…
Reference: S. Das, "A Self-Tuning DVS Processor Using Delay-Error Detection and Correction," 2006
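The slide shows only the detector's components, so the behavior sketched below is an assumed, simplified model: the main flip-flop samples its input at the clock edge, the shadow latch samples the same input after a fixed delay, and a disagreement between the two raises the error signal. The function names and the rectangular-pulse model of a transient fault are hypothetical.

```python
def signal_at(t: float, good_value: int,
              pulse_start: float, pulse_end: float) -> int:
    """Input value at time t: inverted while the transient pulse is active."""
    if pulse_start <= t < pulse_end:
        return good_value ^ 1
    return good_value


def detect(good_value: int, pulse_start: float, pulse_end: float,
           clk_edge: float, delay: float):
    """Sample at the clock edge and again after the delay.

    Returns (main, shadow, error).
    """
    main = signal_at(clk_edge, good_value, pulse_start, pulse_end)
    shadow = signal_at(clk_edge + delay, good_value, pulse_start, pulse_end)
    return main, shadow, main != shadow
```

In this model, a pulse that covers the clock edge but has died out by the delayed sample is caught, because the two samples disagree. A pulse long enough to cover both sampling points would go unnoticed, which is a limitation of the sketch and not a claim about the circuit in the talk.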

Glitch Detector Coverage
[Figure: two plots, Power and Area; axes: Coverage versus Percent Overhead]

Combined Technique Coverage
[Figure: two plots, Power and Area; axes: Coverage versus Percent Overhead]

Conclusion
Circuit-level soft error analysis offers significant insight
Faults in combinational logic do not require structural duplication
Coverage versus cost tradeoffs available
Significant benefits in compromise
  85% fault coverage for only 5.5% area
  2-3x increase in MTTF

Questions?

RVC Hit Rates
