Performance Implications of Faults in Prediction Arrays
Nikolas Ladas, Yiannakis Sazeides, Veerle Desmet
University of Cyprus, Ghent University
DFR'10, Pisa, Italy, 24/1/2010 (HiPEAC 2010)
2 Motivation
● Technology scaling: opportunities and challenges
● Reliability and computing tomorrow
● Failures will not be exceptional
● Various sources of failures
  ● Manufacturing: imperfections, process variation
  ● Physical phenomena: soft errors, wear-out
  ● Power constraints: control operation below Vcc-min
● Key challenge: provide reliable operation with little or no performance degradation in the presence of faults, with low-overhead solutions
Nikolas Ladas 24/1/2010
3 Architectural vs Non-Architectural Faults
● So far, research has mainly focused on correctness
  ● Emphasis on architectural structures, e.g. caches, registers, buses, ALUs
● However, faults can also occur in non-architectural structures, e.g. predictor and replacement arrays
● Faults in non-architectural structures may degrade performance
  ● Not an issue for soft errors
  ● Can be a problem for persistent faults: wear-out, process variation, operation below Vcc-min
4 Non-architectural Resources
Arrays:
● line predictor
● branch direction predictor
● return-address stack
● indirect jump predictor
● memory dependence prediction
● way, hit/miss, bank predictors
● replacement arrays (various caches)
● hysteresis arrays (various predictors)
● ...
Non-arrays:
● branch target address adder
● memory prefetch adder
● ...
[Figure: array bits breakdown for an EV6-like core]
5 This talk…
● Quantify the performance implications of faults in non-architectural array structures
● Identify which non-architectural array structures are the most sensitive to faults
● Do we need to worry about protecting these structures?
6 Outline
● Fault model / experimental framework
● Performance implications of faults when all non-architectural arrays are faulty
● Criticality of the non-architectural arrays studied
● Fault semantics
● Conclusions and future directions
7 Faults and Arrays
● Faults may occur in different parts of an array
● We only consider cell faults
8 Array Fault Modeling: Key Parameters
● Number of faults
  ● Consider the % of cells that are faulty: 0.125% and 0.5%
  ● Understand performance trends with an increasing number of faults
● Fault locations
  ● Consider random fault locations, each affecting 1 cell
  ● Try to capture average behavior
● Model for each fault
  ● Each faulty cell is randomly set to either stuck-at-1 or stuck-at-0
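The fault model on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the seeded generator, and the use of `round` to turn a fault fraction into a cell count are hypothetical choices, not taken from the talk or from sim-alpha.

```python
import random

def make_fault_map(n_entries, bits_per_entry, fault_fraction, seed=0):
    """Pick distinct random cells and mark each as stuck-at-0 or stuck-at-1.

    Returns a dict mapping (entry, bit) to the stuck value.
    """
    rng = random.Random(seed)
    total_bits = n_entries * bits_per_entry
    # Assumption: the number of faulty cells is the rounded fraction of all cells.
    n_faults = round(total_bits * fault_fraction)
    cells = rng.sample(range(total_bits), n_faults)  # each fault affects 1 cell
    return {divmod(c, bits_per_entry): rng.randint(0, 1) for c in cells}

def read_entry(stored_value, entry, bits_per_entry, fault_map):
    """Value observed on a read: faulty cells override the stored bits."""
    value = stored_value
    for bit in range(bits_per_entry):
        stuck = fault_map.get((entry, bit))
        if stuck == 1:
            value |= 1 << bit
        elif stuck == 0:
            value &= ~(1 << bit)
    return value
```

A simulator would call something like `read_entry` on every array read, so a faulty cell silently corrupts whatever the predictor last wrote there.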
9 Processor Model
● EV7-like processor with a 15-stage pipeline
● 4-way ooo, mispredictions resolved at commit
● Non-architectural arrays considered
  ● Line Predictor Array: 4K entries, 11 bits/entry
  ● Line Predictor Hysteresis Array: 4K entries, 2 bits/entry
  ● LRU array for 2-way 64KB 64B/block I$: 512 entries, 1 bit/entry
  ● LRU array for 2-way 64KB 64B/block D$: 512 entries, 1 bit/entry
  ● Gshare Direction Predictor: 32K entries, 2 bits/entry
  ● Return address stack: 16 entries, 31 bits/entry
  ● Memory dependence predictor (load-wait): 1024 entries, 1 bit/entry
● sim-alpha simulator
● SPEC CPU 2000 benchmarks: 100M instructions, representative regions
10 Experiments
● Baseline performance: runs with no faults
● For experiments with faults
  ● For each run, all arrays with faults have the same % of faulty bits: 0.125% or 0.5%
  ● All experiments are performed using the same 100 randomly generated fault maps (50 for each % of faulty bits)
Faulty bits per array:
  Array                                 Total bits   0.125%   0.5%
  Gshare Direction Predictor
  Line Predictor Array
  Line Predictor Hysteresis Array       8192         10       41
  Memory dependence predictor           1024         1        5
  2-way 64KB 64B/block I$ LRU array     512          1        3
  2-way 64KB 64B/block D$ LRU array     512          1        3
  Return address stack                  496          1        3
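As a rough cross-check of the array sizes, the total bit count of each array follows from entries times bits per entry, using the configuration given on the processor model slide. The sketch below prints the unrounded expected number of faulty bits for each fault percentage; how the study rounded these to whole cells is not stated, so no rounding is applied here. The dictionary and function names are illustrative.

```python
# (entries, bits per entry) as listed on the processor model slide
ARRAYS = {
    "Gshare direction predictor": (32 * 1024, 2),
    "Line predictor array": (4 * 1024, 11),
    "Line predictor hysteresis array": (4 * 1024, 2),
    "Memory dependence predictor": (1024, 1),
    "I$ LRU array": (512, 1),
    "D$ LRU array": (512, 1),
    "Return address stack": (16, 31),
}

def total_bits(entries, bits_per_entry):
    """Number of storage cells in the array."""
    return entries * bits_per_entry

def expected_faulty_bits(entries, bits_per_entry, percent):
    """Unrounded number of faulty cells for a given fault percentage."""
    return total_bits(entries, bits_per_entry) * percent / 100.0

for name, (entries, width) in ARRAYS.items():
    low = expected_faulty_bits(entries, width, 0.125)
    high = expected_faulty_bits(entries, width, 0.5)
    print(f"{name}: {total_bits(entries, width)} bits, {low:.2f} / {high:.2f} faulty")
```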
11 Performance with 0.125% Faulty Bits (all arrays faulty)
12 Performance with 0.5% Faulty Bits (all arrays faulty)
13 Observations with all arrays faulty
● Performance degradation is substantial even with a small % of faulty bits
  ● Both INT and FP benchmarks can degrade
  ● Average degradation: 1% at 0.125% faulty bits, 3.5% at 0.5%
  ● Max degradation: 39% at 0.125% faulty bits, 53% at 0.5%
● Degradation is benchmark specific
  ● Instruction mix (different number and type of vulnerable instructions)
  ● Programs with high prediction accuracy are more vulnerable than those with low accuracy
  ● When a program accesses few array entries, it takes a large number of faults before a faulty entry is accessed
  ● Some benchmarks are memory dominated
● Worst-case degradation is much greater than the average
  ● Will cause performance variation between otherwise identical cores/chips
● Are all bits equally vulnerable? Which unit(s) matter the most?
14 Performance for Each Structure (0.125% faulty bits) 26 benchmarks x 50 experiments for each section
15 Performance for Each Structure (0.5% faulty bits) 26 benchmarks x 50 experiments for each section
16 Observations
● For the processor configuration used in this study, the various non-architectural units are not equally vulnerable to the same fraction of faults
● RAS and BPRED are the most sensitive to faults
● Line predictor and load-wait predictor degrade performance significantly when there are 0.5% faults
● 2-way I$ and D$ are not sensitive even at 0.5% of faults in the LRU array
17 Reasons for Variable Vulnerability across Units
● Semantics of faults vary across units
● Some faults cause a pipeline flush, others delay the execution of an instruction, others cause a one-cycle bubble
● Faults causing delays can be less severe, since they can be hidden in the shadow of a misprediction or by ooo execution
● Units with typically higher accuracy are more vulnerable (RAS and conditional predictor)
● Even within a unit, faults can have different semantics
18 Semantics of Faults for a 2-bit Replacement
  State   Action
  0x      Replace (R)
  1x      No replace (N)
Legend: 0/1 = stuck-at value
[Figure: state diagrams over the states 00 (R), 01 (R), 10 (N), 11 (N) showing the effect of stuck-at-0 and stuck-at-1 faults; two of the faulty cases are labeled "Always Replace" and "Never Replace"]
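The "Always Replace" and "Never Replace" cases follow directly from the state/action encoding: only the upper bit selects the action, so a stuck upper bit pins the action. The sketch below enumerates this. It is a simplification that only looks at how a stored 2-bit value is read back and ignores the counter's update dynamics; the function names are illustrative.

```python
def apply_stuck_at(state, bit, stuck_value):
    """Force one bit of a 2-bit state to the stuck value."""
    if stuck_value:
        return state | (1 << bit)
    return state & ~(1 << bit)

def action(state):
    """0x -> replace (R), 1x -> no replace (N): only the upper bit decides."""
    return "N" if state & 0b10 else "R"

def reachable_actions(bit, stuck_value):
    """Actions that can still be observed over all four stored states."""
    return {action(apply_stuck_at(s, bit, stuck_value)) for s in range(4)}

print(reachable_actions(1, 0))  # upper bit stuck-at-0: always replace
print(reachable_actions(1, 1))  # upper bit stuck-at-1: never replace
print(reachable_actions(0, 0))  # lower bit stuck: both actions remain possible
```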
19 Repair mechanism: XOR Remapping
● Access map: counts accesses per entry during an interval
● Fault map: indicates which entries are faulty (can be determined at manufacturing test or at very coarse intervals using BIST)
● Remap the index using XOR to minimize faulty accesses
● At regular intervals, search for the optimal XOR value using the access map and fault map
[Figure: example access map and fault map with XOR value 1, and the faulty accesses after remapping]
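The search for an XOR value can be sketched as follows. This is a minimal illustration, not the search used in the study: it tries every XOR value exhaustively, assumes the array size is a power of two, and uses made-up example maps. A logical index `i` is assumed to map to physical entry `i ^ xor_value`.

```python
def faulty_accesses(access_map, fault_map, xor_value):
    """Accesses that land on a faulty physical entry under a given XOR value."""
    return sum(
        count
        for index, count in enumerate(access_map)
        if fault_map[index ^ xor_value]
    )

def best_xor(access_map, fault_map):
    """Exhaustively pick the XOR value with the fewest faulty accesses."""
    n = len(access_map)  # assumed to be a power of two
    return min(range(n), key=lambda x: faulty_accesses(access_map, fault_map, x))

# Hypothetical 4-entry example: the hot entry 0 sits on the only faulty cell.
access_map = [90, 5, 3, 2]
fault_map = [True, False, False, False]
xor_value = best_xor(access_map, fault_map)
print(xor_value, faulty_accesses(access_map, fault_map, xor_value))
```

In this example the remapping moves the least-used logical entry onto the faulty physical entry, so faulty accesses drop from 90 to 2.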
20 Results
● 26 benchmarks x 10 fault maps per category
● Recovers most of the performance degradation
● Possible to make things worse if we remap when there is no need
21 Summary-Conclusions
● Faults in non-architectural arrays can degrade processor performance
● Not all faults are equally important; fault semantics vary
● RAS and conditional branch predictor are the most critical
● Faults can cause performance non-determinism across otherwise identical chips or between the cores of the same chip
22 Future Work
● Develop an analytical model to predict the performance distribution for a given failure rate
● Understand the implications of faults for other architectural and non-architectural structures
23 Acknowledgments Costas Kourougiannis Funding: University of Cyprus, Ghent University, HiPEAC, Intel
24 Thanks!
25 BACKUP SLIDES
26 Fault Semantics
● Line Predictor Array: incorrect prediction
  ● Conditionals and returns get corrected within a cycle; indirects are resolved much later
● Line Predictor Hysteresis Array
  ● Always update the prediction on a misprediction
  ● Never update
● 2-way 64KB 64B/block I$ and D$ LRU arrays
  ● Converts sets with a faulty LRU bit to direct-mapped sets: more misses, but they can be hidden
● Gshare Direction Predictor
  ● Faulty entries always predict taken or always not-taken
  ● Incorrect prediction that gets resolved late (25% chance of being lucky)
● Return address stack
  ● Return misprediction is resolved late
● Memory dependence predictor (load-wait)
  ● Independent load waits (common case: we should not wait); can be partially hidden
  ● Dependent load does not wait (this should rarely be a serious problem)
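The LRU case can be illustrated with a toy model of a single 2-way set. This is a sketch under simplifying assumptions: the victim is always chosen by the LRU bit (even when a way is still empty), and a stuck LRU bit is modeled by always reading the stuck value. The function name and the trace are made up for illustration.

```python
def count_misses(trace, stuck_at=None):
    """Misses in one 2-way set; stuck_at=None models a healthy LRU bit."""
    ways = [None, None]   # tags currently held by way 0 and way 1
    lru = 0               # which way to evict next
    misses = 0
    for tag in trace:
        if tag in ways:
            lru = 1 - ways.index(tag)   # the other way becomes LRU
        else:
            misses += 1
            # A faulty LRU cell always reads back the stuck value.
            victim = lru if stuck_at is None else stuck_at
            ways[victim] = tag
            lru = 1 - victim
    return misses

trace = ["A", "B", "A", "B"]
print(count_misses(trace))              # healthy set keeps both blocks
print(count_misses(trace, stuck_at=0))  # faulty set keeps evicting way 0
```

With the stuck bit, every fill goes to the same way, so two blocks that alternate evict each other just as they would in a direct-mapped set.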
27 Processor Pipeline
28 Line predictor Logical structure
29 Remapping Issues
● Remapping overhead: the time to find the best remapping has a performance penalty (cold effects), but this is acceptable because
  ● Remapping is performed every 500K instructions
  ● Once the best remapping is found, there may be no need to remap again for a while
  ● No need to consider all possible XOR remappings
● Cost/energy optimizations
  ● Update the access map and search for a new remap vector only if there is a need: count the number of defective accesses and check whether it is above a threshold
  ● Consider one unit at a time (share the access map across units)
● Remapping function: XOR remapping is in the critical path, so we use a simple remapping function to minimize the hardware overhead
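The threshold check mentioned under the cost/energy optimizations might look like the sketch below. The class name, the per-interval reset, and the strict "greater than" comparison are assumptions made for illustration; the slide only states that defective accesses are counted and compared against a threshold.

```python
class RemapTrigger:
    """Counts accesses to faulty entries and decides when a new search is worthwhile."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.defective_accesses = 0

    def record_access(self, entry_is_faulty):
        """Call on every array access with the fault-map bit of the accessed entry."""
        if entry_is_faulty:
            self.defective_accesses += 1

    def end_of_interval(self):
        """Return True if a remap search is needed, then reset the counter."""
        needed = self.defective_accesses > self.threshold
        self.defective_accesses = 0
        return needed
```

The point of such a gate is that the access map only needs to be updated and searched in intervals where the current remapping is visibly hurting.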
30 Methodology: Performance Implications of Faults
● Worst-case
  ● Faults were injected on the most frequently used entries
  ● Most-used entry: provided the most correct predictions in execution without faults
● Average
  ● Impossible to do experimentally: too many combinations
● Random
  ● Faults are injected at random entries
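The two injection policies can be contrasted in a short sketch. It assumes a per-entry count of correct predictions collected from a fault-free run; the function names and the example counts are hypothetical.

```python
import random

def worst_case_entries(correct_predictions, n_faulty):
    """Entries that supplied the most correct predictions in a fault-free run."""
    ranked = sorted(
        range(len(correct_predictions)),
        key=lambda i: correct_predictions[i],
        reverse=True,
    )
    return ranked[:n_faulty]

def random_entries(n_entries, n_faulty, seed=0):
    """Distinct entries chosen uniformly at random."""
    return random.Random(seed).sample(range(n_entries), n_faulty)

counts = [5, 40, 0, 12]                # hypothetical per-entry correct predictions
print(worst_case_entries(counts, 2))   # the two most useful entries
print(random_entries(len(counts), 2))
```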
31 Functional Faults and Array Logical View
● Not practical to study faults at the physical level
● Functional models: abstractions that ease the study of faults
● Fault locations: cell, input address, input/output data
● We only consider cell faults
32 BIST for Detecting Faults and Updating Fault Map
33 Example Remapping Search Algo
34 Interleaved vs Non-Interleaved Design Style (1)
● Each array wordline contains many entries
● Entries in the physical implementation are bit-interleaved
● More area efficient
35 Interleaved vs Non-Interleaved Design Style (2)
● But a cluster of faults affects more entries in an interleaved design
● For architectural structures
  ● Soft errors: prefer interleaved
  ● Hard errors: map to spare / disable block/set
● For non-architectural structures
  ● Soft errors: no need for protection
  ● Hard errors: prefer non-interleaved (if area is not an issue)
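Why a cluster of faults touches more entries in an interleaved layout can be seen with a toy column mapping. The mapping below is an assumption for illustration: in the interleaved row, adjacent columns belong to different entries (column `c` belongs to entry `c % entries_per_row`), while in the non-interleaved row each entry occupies `bits_per_entry` adjacent columns.

```python
def affected_entries(fault_columns, entries_per_row, bits_per_entry, interleaved):
    """Set of entries in one wordline hit by faults at the given physical columns."""
    if interleaved:
        # Adjacent columns hold the same bit position of different entries.
        return {c % entries_per_row for c in fault_columns}
    # Each entry occupies a contiguous run of bits_per_entry columns.
    return {c // bits_per_entry for c in fault_columns}

cluster = range(0, 4)  # four adjacent faulty columns
print(len(affected_entries(cluster, 8, 11, interleaved=True)))   # 4 entries hit
print(len(affected_entries(cluster, 8, 11, interleaved=False)))  # 1 entry hit
```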
36 4K LP: No Interleaving vs Interleaving (average random)
37 Random results without and with remapping
38 Expected Invariants
● With increasing faults, more performance degradation
● Frequently accessed entries are more critical than less accessed entries
● A cell stuck-at-1 is more critical if the bits stored in the cell are biased towards zero
39 Worst-case - Hit rate
40 Random results without and with remapping