Mitigating the Performance Degradation due to Faults in Non-Architectural Structures. Constantinos Kourouyiannis, Veerle Desmet, Nikolas Ladas, Yiannakis Sazeides. University of Cyprus, Ghent University. 6th HiPEAC Industrial Workshop, Paris, 26/11/2008.
Motivation: Technology scaling brings both opportunities and challenges. Reliability and computing tomorrow: failures will not be exceptional. Failures have various sources: soft errors, process variation, wear-out, and hardware and software bugs. Key challenge: provide correct operation with little or no performance degradation in the presence of faults, using low-cost solutions.
Architectural vs Non-Architectural Faults: So far, research has mainly focused on correctness, with emphasis on architectural structures, e.g. caches, registers, buses. However, faults can also occur in non-architectural structures, e.g. predictor and replacement arrays. Faults in non-architectural structures may degrade performance.
Non-Architectural Faults: Why care? Missed deadlines are unacceptable for real-time applications. Non-architectural resources cover a significant fraction of the active area of modern cores, where temperature is higher, which makes them more susceptible to wear-out and process-variation faults. If architectural resources are protected, then with increasing fault frequency per chip, non-architectural resources will eventually become a performance bottleneck.
This talk: Quantifies the performance implications of faults in a non-architectural array structure, specifically a line predictor. Introduces and evaluates a simple detection scheme and repair technique to protect it against faults.
Outline: Fault modeling; arrays background; performance implications of faults in a line predictor; detection and repair mechanisms; results; conclusions and future direction. Work in progress.
Array Fault Modeling, Key Parameters: Number of faults: more faults mean a higher potential for performance degradation. Location of faults: frequently accessed entries are more critical, and faults in output bits are more serious. Fault clustering: granularity/“radius” of faults. Model for each fault: e.g. a cell stuck-at-1 is more critical if the bits stored in the cell are biased towards zero.
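The last point, that a stuck-at-1 cell hurts more when the stored bits are biased towards zero, can be illustrated with a minimal sketch. The function names and the sample values below are hypothetical and are not taken from the talk; the sketch only assumes that a stuck cell forces one bit of every value read from an entry.

```python
def read_with_stuck_at(stored_value, faults):
    """Return what is read from an entry whose cells may be stuck.

    faults maps a bit position to its stuck value (0 or 1).
    """
    value = stored_value
    for bit, stuck in faults.items():
        if stuck == 1:
            value |= 1 << bit      # cell always reads as 1
        else:
            value &= ~(1 << bit)   # cell always reads as 0
    return value


def corrupted_fraction(values, faults):
    """Fraction of stored values that read back incorrectly."""
    bad = sum(1 for v in values if read_with_stuck_at(v, faults) != v)
    return bad / len(values)


# Hypothetical values whose bit 0 is biased towards zero.
values = [0, 0, 0, 1]
print(corrupted_fraction(values, {0: 1}))  # stuck-at-1 on bit 0 -> 0.75
print(corrupted_fraction(values, {0: 0}))  # stuck-at-0 on bit 0 -> 0.25
```

With this bias, the stuck-at-1 fault corrupts three of the four reads, while the stuck-at-0 fault corrupts only one.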
Non-architectural Resources: Arrays: line predictor, branch direction predictor, return-address stack, indirect jump predictor, memory dependence prediction, replacement arrays (various caches), ... Non-arrays: branch target address adder, memory prefetch adder, ...
Worst-case performance (cell faults): degradation of up to 27%.
Worst-case: Hit rate
Detection and Repair: Previously proposed techniques for architectural arrays could be considered, but detection and correction for non-architectural arrays do not have to be exact or provide full repair; it is sufficient to minimize the performance effects of faults. Our proposition: address remapping, which exploits the non-uniformity of accesses. We observed experimentally that few entries in the line predictor are accessed, so the remapping has a wide range of entries to choose from.
(Sorted) Access Distributions for LP
Proposed Approach for Remapping: Figure legend: accessed cells, accessed defective cells, not accessed cells, not accessed defective cells. Original access-fault map vs. accesses rotated down by 1 row: 1 instead of 3 accessed faulty cells.
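A minimal sketch of the rotation idea follows. The 4x4 access and fault maps are hypothetical (the figure from the slide is not reproduced here); they were chosen only so that the counts match the "1 instead of 3" outcome stated on the slide. The function name `accessed_faulty` is also an assumption.

```python
def accessed_faulty(access, fault, shift=0):
    """Count cells that are both accessed and faulty.

    access and fault are equal-sized grids of 0/1 values. A logical row r
    is redirected to physical row (r + shift) % rows, i.e. accesses are
    rotated down by `shift` rows.
    """
    rows = len(access)
    return sum(
        a & f
        for r in range(rows)
        for a, f in zip(access[r], fault[(r + shift) % rows])
    )


# Hypothetical maps, for illustration only.
access = [[1, 1, 0, 0],
          [0, 1, 1, 0],
          [0, 0, 0, 0],
          [1, 0, 0, 0]]
fault = [[0, 1, 0, 0],
         [0, 1, 1, 0],
         [0, 0, 0, 1],
         [0, 0, 0, 0]]

print(accessed_faulty(access, fault))           # original mapping -> 3
print(accessed_faulty(access, fault, shift=1))  # rotated by 1 row -> 1
```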
Proposed Approach (for cell faults): Figure legend: accessed cells, accessed defective cells, not accessed cells, not accessed defective cells. Original access-fault map vs. remapped row accesses: 1 instead of 3 accessed faulty cells.
Detection and Repair Scheme
Index Remapping Unit: The original index is XORed with a value decided by the search engine (1 in this example) to produce the remapped index.
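The remapping unit can be sketched in a few lines. This is an illustrative software model, not the hardware design from the talk; `remap_index` and the `index_bits` parameter are assumed names.

```python
def remap_index(index, xor_value, index_bits):
    """XOR the original index with the chosen value, kept within the array."""
    mask = (1 << index_bits) - 1
    return (index ^ xor_value) & mask


# With a 3-bit index and XOR value 1, neighbouring entries swap places.
print([remap_index(i, 1, 3) for i in range(8)])  # [1, 0, 3, 2, 5, 4, 7, 6]
```

XOR with a fixed value is a one-to-one mapping over the index range, so no two original indices collide, and applying the same value twice restores the original index.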
Remapping Search Engine: Access map, Fault map.
Remapping Search Engine: Access map, Fault map. Defective_accessed_A = Σ_i (Access_map_i × Fault_map_i) = 143
Remapping Search Engine: Remapped accesses, Fault map. Defective_accessed_B = Σ_i (Access_map_i × Fault_map_i) = 20 + 50 = 70. Best remapping = XOR 1 (fewer defective accessed entries).
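The search can be sketched as follows, under stated assumptions: the access map holds per-entry access counts, the fault map holds 0/1 flags, and every XOR value is tried exhaustively. The talk does not say how many candidates the engine evaluates, and the maps below are hypothetical rather than the ones behind the 143 and 70 figures.

```python
def defective_accessed(access, fault, xor_value=0):
    """Sum of access counts that land on faulty entries after remapping."""
    return sum(count * fault[i ^ xor_value] for i, count in enumerate(access))


def best_xor(access, fault):
    """Return the XOR value with the fewest defective accesses.

    Assumes len(access) is a power of two so that i ^ x stays in range.
    Ties are resolved in favour of the smallest XOR value.
    """
    return min(range(len(access)),
               key=lambda x: defective_accessed(access, fault, x))


# Hypothetical 8-entry maps, for illustration only.
access = [40, 0, 25, 0, 5, 0, 0, 3]
fault = [1, 1, 1, 0, 0, 0, 0, 1]

print(defective_accessed(access, fault))     # no remapping -> 68
x = best_xor(access, fault)
print(x, defective_accessed(access, fault, x))  # -> 4 5
```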
Simulator: sim-alpha simulator, EV6 processor with a 15-stage pipeline. Baseline configuration: no hard faults, no remapping. SPEC CPU 2000 benchmarks, 100 M instructions from representative regions. We compare performance without and with remapping for random fault maps.
Random results without and with remapping
Summary and Conclusions: Reliability should not be limited to correctness but should also consider performance. Faults in non-architectural resources can degrade the performance of a processor, which may make them important to deal with. Proposed framework for detection and repair: detects the case where there are many defective accessed entries, finds the best possible remapping, and applies the remapping. Remapping works very well in almost all cases.
Future Work: Experiments with other non-architectural structures, such as direction and indirect predictors and replacement arrays for the I-cache, D-cache, and TLB. Applicability of the ideas to architectural structures.
Acknowledgments: Elli Demetriou and Costas Vrionis. Funding: University of Cyprus, Ghent University, SARC, HiPEAC, Intel.
Thanks!
BACKUP SLIDES
Processor Pipeline
Line predictor structure
Remapping Issues: Remapping overhead: the time to find the best remapping has a penalty on performance, but this is acceptable because remapping is performed every 100 K intervals, and once the best remapping is found the problem is solved and there is no need to remap again. Design space: the remapping function is XOR; because remapping is on the critical path, we use a simple remapping function to minimize the hardware overhead.
Methodology: Performance Implications of Faults: Determine the performance implications of faults in the LP and RAS for different scenarios. Worst case: faults are injected into the most frequently used entries, where the most-used entry is the one that provided the most correct predictions in an execution without faults. Average case: impossible to determine experimentally because there are too many combinations. Random: faults are injected at random entries.
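The worst-case and random scenarios differ only in how the faulty entries are chosen, which a short sketch can make concrete. The function names, the fixed seed, and the sample counts are assumptions for illustration; the sketch assumes a per-entry count of correct predictions gathered from a fault-free run.

```python
import random


def worst_case_faults(correct_predictions, k):
    """Pick the k entries that gave the most correct predictions."""
    ranked = sorted(range(len(correct_predictions)),
                    key=lambda i: correct_predictions[i],
                    reverse=True)
    return set(ranked[:k])


def random_faults(num_entries, k, seed=0):
    """Pick k distinct entries uniformly at random (seeded for repeatability)."""
    rng = random.Random(seed)
    return set(rng.sample(range(num_entries), k))


# Hypothetical per-entry counts from a fault-free run.
counts = [5, 90, 0, 40]
print(worst_case_faults(counts, 2))  # {1, 3}
print(len(random_faults(16, 4)))     # 4
```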
Random results without and with remapping
Faults and Arrays: Faults may occur in different parts of an array. It is not practical to study faults at the physical level.
Functional Faults and Array Logical View: Abstractions that ease the study of faults. Fault locations: cell, input address, output data.