Performance Implications of Faults in Prediction Arrays Nikolas Ladas Yiannakis Sazeides Veerle Desmet University of Cyprus Ghent University DFR’ 10 Pisa,

Presentation transcript:

Performance Implications of Faults in Prediction Arrays Nikolas Ladas Yiannakis Sazeides Veerle Desmet University of Cyprus Ghent University DFR’ 10 Pisa, Italy - 24/1/2010 HiPEAC2010

2 Motivation
● Technology scaling: Opportunities and Challenges
● Reliability and computing tomorrow
● Failures will not be exceptional
● Various sources of failures
  ● Manufacturing: imperfections, process-variation
  ● Physical phenomena: soft-errors, wear-out
  ● Power constraints: control operation below Vcc-min
● Key challenge: provide reliable operation with little or no performance degradation in the presence of faults with low-overhead solutions
Nikolas Ladas 24/1/2010

3 Architectural vs Non-Architectural Faults
● So far research has mainly focused on correctness
  ● Emphasis on architectural structures, e.g. caches, registers, buses, ALUs etc.
● However, faults can occur in non-architectural structures, e.g. predictor and replacement arrays
● Faults in non-architectural structures may degrade performance
  ● Not an issue for soft-errors
  ● Can be a problem for persistent faults: wear-out, process-variation, operation below Vcc-min

4 Non-architectural Resources
● Arrays
  ● line predictor
  ● branch direction predictor
  ● return-address-stack
  ● indirect jump predictor
  ● memory dependence prediction
  ● way, hit/miss, bank predictors
  ● replacement arrays (various caches)
  ● hysteresis arrays (various predictors)
  ● ...
● Non-Arrays
  ● branch target address adder
  ● memory prefetch adder
  ● ...
[Figure: EV6-like core array bits breakdown]

5 This talk…
● Quantify performance implications of faults in non-architectural array-structures
● Identify which non-architectural array-structures are the most sensitive to faults
● Do we need to worry about protecting these structures?

6 Outline
● Fault model / Experimental framework
● Performance implications of faults when all non-architectural arrays are faulty
● Criticality of the non-architectural arrays studied
● Fault semantics
● Conclusions and future direction

7 Faults and Arrays
● Faults may occur in different parts of an array
● We only consider cell faults

8 Array Fault Modeling: Key Parameters
● Number of faults
  ● Consider % of cells that are faulty: 0.125 and 0.5
  ● Understand performance trends with increasing number of faults
● Fault locations
  ● Consider random fault locations, each affecting 1 cell
  ● Try to capture average behavior
● Model for each fault
  ● Each faulty cell is randomly set to either stuck-at-1 or stuck-at-0
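The fault model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `make_fault_map` and `read_entry`, the seed argument, and the rounding of the fault count to the nearest whole cell are not from the slides. The fraction is passed as a plain ratio, so 0.5% is 0.005.

```python
import random

def make_fault_map(num_entries, bits_per_entry, fault_fraction, seed):
    """Return {(entry, bit): stuck_value} for randomly chosen faulty cells."""
    rng = random.Random(seed)
    total_cells = num_entries * bits_per_entry
    # Assumption: round to the nearest whole number of faulty cells.
    num_faults = round(total_cells * fault_fraction)
    cells = rng.sample(range(total_cells), num_faults)  # distinct random cells
    # Each faulty cell is stuck at 0 or 1 with equal probability.
    return {divmod(c, bits_per_entry): rng.randint(0, 1) for c in cells}

def read_entry(stored_value, entry, bits_per_entry, fault_map):
    """Value observed on a read: stuck cells override the stored bits."""
    value = stored_value
    for bit in range(bits_per_entry):
        stuck = fault_map.get((entry, bit))
        if stuck == 1:
            value |= 1 << bit
        elif stuck == 0:
            value &= ~(1 << bit)
    return value
```

For example, `make_fault_map(4096, 2, 0.005, seed=1)` would model a 4K-entry, 2-bit array with 0.5% faulty cells.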

9 Processor Model
● EV7-like processor with 15-stage pipeline
● 4-way ooo, mispredictions resolved at commit
● Non-Architectural Arrays Considered
  ● Line Predictor Array: 4K entries, 11 bits/entry
  ● Line Predictor Hysteresis Array: 4K entries, 2 bits/entry
  ● LRU array for 2-way 64KB 64B/block I$: 512 entries, 1 bit/entry
  ● LRU array for 2-way 64KB 64B/block D$: 512 entries, 1 bit/entry
  ● Gshare Direction Predictor: 32K entries, 2 bits/entry
  ● Return address stack: 16 entries, 31 bits/entry
  ● Memory dependence predictor (load-wait): 1024 entries, 1 bit/entry
● sim-alpha simulator
● SPEC CPU 2000 benchmarks – 100 M instructions
  ● Representative regions

10 Experiments
● Baseline performance: runs with no faults
● For experiments with faults:
  ● For each run, all arrays with faults have the same % of faulty bits: 0.125, 0.5
  ● All experiments are performed using the same 100 randomly generated fault maps (50 for each % of faulty bits)
● Number of faulty bits per array (at 0.125% / at 0.5%):
  ● Gshare Direction Predictor, bits:
  ● Line Predictor Array, bits:
  ● Line Predictor Hysteresis Array, 8192 bits: 10 / 41
  ● Memory dependence predictor, 1024 bits: 1 / 5
  ● 2-way 64KB 64B/block I$ LRU array, 512 bits: 1 / 3
  ● 2-way 64KB 64B/block D$ LRU array, 512 bits: 1 / 3
  ● Return address stack, 496 bits: 1 / 3
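As a rough cross-check of the array sizes, the bit counts can be recomputed from the entry counts and widths given on the Processor Model slide. The sketch below assumes "4K" and "32K" mean 4096 and 32768 entries; the names `ARRAYS`, `array_bits` and `expected_faults` are illustrative. It reports the unrounded expected number of faulty bits, because the slides do not state how the counts were rounded.

```python
# (entries, bits per entry), taken from the Processor Model slide.
ARRAYS = {
    "Gshare direction predictor": (32 * 1024, 2),
    "Line predictor array": (4 * 1024, 11),
    "Line predictor hysteresis array": (4 * 1024, 2),
    "Memory dependence predictor": (1024, 1),
    "I$ LRU array": (512, 1),
    "D$ LRU array": (512, 1),
    "Return address stack": (16, 31),
}

def array_bits(name):
    """Total number of cells in the named array."""
    entries, bits = ARRAYS[name]
    return entries * bits

def expected_faults(fault_percent):
    """Unrounded expected faulty bits per array for a given % of faulty bits."""
    return {name: array_bits(name) * fault_percent / 100 for name in ARRAYS}

if __name__ == "__main__":
    for pct in (0.125, 0.5):
        for name, faults in expected_faults(pct).items():
            print(f"{pct}%  {name}: {array_bits(name)} bits, {faults:.2f} expected faults")
```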

11 Performance with 0.125% Faulty Bits (all arrays faulty)

12 Performance with 0.5% of Faulty Bits (all arrays faulty)

13 Observations with all arrays faulty
● Performance degradation is substantial even with a small % of faulty bits
  ● Both INT and FP benchmarks can degrade
  ● Average degradation: 1% and 3.5% (at 0.125% and 0.5% faulty bits, respectively)
  ● Max degradation: 39% and 53%
● Degradation is benchmark specific
  ● Instruction mix (different number and type of vulnerable instructions)
  ● Programs with high accuracy are more vulnerable than those with low accuracy
  ● When few array entries are accessed by a program, it takes a large number of faults to have faulty entries accessed
  ● Some benchmarks are memory dominated
● Worst-case degradation is much greater than average
  ● Will cause performance variation between otherwise identical cores/chips
● Are all bits equally vulnerable? Which unit(s) matter the most?

14 Performance for Each Structure (0.125% faulty bits)
● 26 benchmarks x 50 experiments for each section

15 Performance for Each Structure (0.5% faulty bits)
● 26 benchmarks x 50 experiments for each section

16 Observations
● For the processor configuration used in this study, the various non-architectural units are not equally vulnerable to the same fraction of faults
● RAS and BPRED are the most sensitive to faults
● Line predictor and load-wait predictor degrade performance significantly when there are 0.5% faults
● 2-way I$ and D$ are not sensitive even at 0.5% of faults in the LRU array

17 Reasons for Variable Vulnerability across units
● Semantics of faults vary across units
  ● Some faults cause flushing of the pipeline, others delay the execution of an instruction, others cause a one-cycle bubble
  ● Faults causing delays can be less severe since they can be hidden in the shadow of a misprediction or by ooo execution
● Units with typically higher accuracy are more vulnerable (RAS and conditional predictor)
● Even within a unit, faults can have different semantics

18 Semantics of Faults for a 2-bit Replacement
● State and action
  ● 0x: Replace
  ● 1x: No replace
● 0/1: stuck-at value
[Figure: reachable states under each stuck-at fault. High bit stuck at 0: 00 R, 01 R (Always Replace). High bit stuck at 1: 11 N, 10 N (Never Replace). Low bit stuck at 1: 11 N, 01 R. Low bit stuck at 0: 00 R, 10 N.]
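The state/action table on this slide can be explored with a small sketch that lists which 2-bit states remain observable when one cell is stuck. It relies only on the rule that the high bit selects the action (0x: Replace, 1x: No replace) and deliberately does not model how the state is updated, since the slide text does not give the update rule. The names `ACTION` and `observable_states` are illustrative.

```python
# Action is decided by the high bit of the 2-bit state (0x / 1x).
ACTION = {0: "Replace", 1: "No replace"}

def observable_states(bit, stuck_value):
    """List (state, action) pairs readable when `bit` is stuck.

    bit: 0 for the low bit, 1 for the high bit.
    stuck_value: 0 for stuck-at-0, 1 for stuck-at-1.
    """
    states = []
    for state in range(4):
        if (state >> bit) & 1 == stuck_value:
            states.append((format(state, "02b"), ACTION[state >> 1]))
    return states
```

A stuck high bit leaves only states with a single action (always replace or never replace), while a stuck low bit still allows both actions.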

19 Repair mechanism: XOR Remapping
[Figure: access map and fault map, remapped with XOR 1; faulty accesses after remapping]
● Access map: counts accesses per entry during an interval
● Fault map: indicates which entries are faulty (can be determined at manufacturing test or at very coarse intervals using BIST)
● Remap the index using XOR to minimize faulty accesses
● At regular intervals, search for the optimal XOR value using the access map and fault map
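A minimal sketch of the XOR search follows, assuming an exhaustive scan over all XOR values and a power-of-two number of entries. The function names and the list-based maps are illustrative; the slides do not give the search algorithm used in hardware.

```python
def faulty_accesses(access_map, fault_map, xor_value):
    """Accesses that hit faulty entries when logical index i maps to i ^ xor_value."""
    return sum(count for i, count in enumerate(access_map)
               if fault_map[i ^ xor_value])

def best_xor(access_map, fault_map):
    """Exhaustively pick the XOR value with the fewest faulty accesses.

    Assumes len(access_map) is a power of two so i ^ x stays in range.
    """
    n = len(access_map)
    return min(range(n), key=lambda x: faulty_accesses(access_map, fault_map, x))

# Entry 0 is faulty and heavily used; the search moves the rarely used
# logical index 3 onto the faulty physical entry instead.
access_map = [90, 5, 3, 2]
fault_map = [True, False, False, False]
xor_value = best_xor(access_map, fault_map)
```

With these example maps, the number of faulty accesses drops from 90 without remapping to 2 with the selected XOR value.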

20 Results
● 26 benchmarks x 10 fault maps per category
● Recovers most of the performance degradation
● Possible to make things worse if we remap when there is no need

21 Summary-Conclusions
● Faults in non-architectural arrays can degrade processor performance
● Not all faults are equally important. Fault semantics vary.
● RAS and conditional branch predictor are the most critical
● Faults can cause performance non-determinism across otherwise identical chips or within the cores of the same chip

22 Future Work
● Develop analytical model to predict the performance distribution for a given failure rate
● Understand implications of faults for other architectural and non-architectural structures

23 Acknowledgments
● Costas Kourougiannis
● Funding: University of Cyprus, Ghent University, HiPEAC, Intel

24 Thanks!

25 BACKUP SLIDES

26 Fault Semantics
● Line Predictor Array: incorrect prediction
  ● Conditionals and returns get corrected within a cycle, indirects are resolved much later
● Line Predictor Hysteresis Array:
  ● Always update prediction on a misprediction
  ● Never update
● 2-way 64KB 64B/block I$ and D$ LRU arrays
  ● Converts sets with a faulty LRU bit to direct-mapped sets: more misses, but these can be hidden
● Gshare Direction Predictor
  ● Faulty entries always predict taken or always not-taken
  ● Incorrect prediction that gets resolved late (25% chance of being lucky)
● Return address stack
  ● Return misprediction is resolved late
● Memory dependence predictor (load-wait)
  ● Independent load waits (common case: we should not wait); can be partially hidden
  ● Dependent load does not wait (this should rarely be a serious problem)

27 Processor Pipeline

28 Line predictor: Logical structure

29 Remapping Issues
● Remapping overhead: the time to find the best remapping has a penalty on performance (cold effects), but this is acceptable because
  ● Remapping is performed at 500 K instruction intervals
  ● Once the best remapping is found, there is possibly no need to remap again for a while
  ● No need to consider all possible XOR remappings
● Cost/Energy Optimizations
  ● Update access map and search for new remap-vector only if there is a need: count number of defective accesses and check if above threshold
  ● Consider one unit at a time (share access map across units)
● Remapping function: XOR remapping is in the critical path, so we use a simple remapping function to minimize the overhead in hardware
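The threshold-based optimization might look like the sketch below. It is one possible reading of the slide, not the authors' design: the class name `RemapController`, the two-phase scheme (first detect too many defective accesses, then profile one interval, then search), and the `threshold` parameter are all assumptions made for illustration. The caller is expected to invoke `end_interval()` at each interval boundary.

```python
class RemapController:
    """Searches for a new XOR value only when defective accesses exceed a threshold."""

    def __init__(self, fault_map, threshold):
        self.fault_map = fault_map      # True where a physical entry is faulty
        self.threshold = threshold      # hypothetical tuning parameter
        self.xor_value = 0              # current remap vector
        self.defective = 0              # defective accesses in this interval
        self.profiling = False          # access map is updated only when True
        self.access_map = [0] * len(fault_map)

    def access(self, index):
        """Translate a logical index and update the counters."""
        physical = index ^ self.xor_value
        if self.fault_map[physical]:
            self.defective += 1
        if self.profiling:
            self.access_map[index] += 1
        return physical

    def _faulty_accesses(self, xor_value):
        return sum(c for i, c in enumerate(self.access_map)
                   if self.fault_map[i ^ xor_value])

    def end_interval(self):
        """Call at each interval boundary (e.g. every 500 K instructions)."""
        if self.profiling:
            # Access map is ready: pick the XOR value with the fewest faulty accesses.
            self.xor_value = min(range(len(self.fault_map)), key=self._faulty_accesses)
            self.profiling = False
        elif self.defective > self.threshold:
            # Too many defective accesses: profile the next interval.
            self.profiling = True
            self.access_map = [0] * len(self.fault_map)
        self.defective = 0
```

Keeping the access map idle until the threshold is crossed avoids both the counter updates and the search when the current mapping is already good enough.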

30 Methodology: Performance Implications of Faults
● Worst-case
  ● Faults were injected on the most frequently used entries
  ● Most-used entry: provided most correct predictions for execution without faults
● Average
  ● Impossible to do experimentally: too many combinations
  ● Random: faults are injected at random entries
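The worst-case selection can be sketched as a simple ranking. The sketch assumes a per-entry count of correct predictions collected from a fault-free run; the function name `worst_case_entries` and the list representation are illustrative.

```python
def worst_case_entries(correct_predictions, num_faulty_entries):
    """Pick the entries that gave the most correct predictions in a fault-free run.

    correct_predictions[i] is the number of correct predictions entry i provided.
    Ties keep their original index order because Python's sort is stable.
    """
    ranked = sorted(range(len(correct_predictions)),
                    key=lambda i: correct_predictions[i],
                    reverse=True)
    return ranked[:num_faulty_entries]
```

For instance, `worst_case_entries([5, 40, 0, 12], 2)` selects entries 1 and 3 as the fault locations.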

31 Functional Faults and Array Logical View
● Not practical to study faults at physical level
● Functional Models: Abstractions that ease study of faults
● Fault locations: cell, input address, input/output data
● We only consider cell faults

32 BIST for Detecting Faults and Updating Fault Map

33 Example Remapping Search Algo

34 Interleaved vs Non-Interleaved Design Style (1)
● Each array wordline contains many entries
● Entries in the physical implementation are bit-interleaved
  ● More area efficient

35 Interleaved vs Non-Interleaved Design Style (2)
● But a cluster of faults affects more entries in an interleaved design
● For architectural structures:
  ● Soft-errors: prefer interleaved
  ● Hard-errors: map to spare/disable block/set
● For non-architectural structures:
  ● Soft-errors: no need for protection
  ● Hard-errors: prefer non-interleaved (if area is not an issue)

36 4K LP: No Interleaving vs Interleaving (average random)

37 Random results without and with remapping

38 Expected Invariants
● With increasing faults, more performance degradation
● Frequently accessed entries are more critical than less accessed entries
● Cell stuck-at-1 is more critical if bits stored in the cell are biased towards zero

39 Worst-case - Hit rate

40 Random results without and with remapping