COMP 206: Computer Architecture and Implementation
Montek Singh
Wed., Oct. 8, 2003
Topic: Instruction-Level Parallelism (Dynamic Branch Prediction)

Reading: HP3, Section

Why Do We Need Branch Prediction?
- Basic blocks are short, and we have done about all we can do for them with dynamic scheduling; control dependences now become the bottleneck
- Since branches disrupt the sequential flow of instructions, we need to be able to predict branch behavior to avoid stalling the pipeline
- What we must predict:
  - Branch outcome (is the branch taken?)
  - Branch target address (what is the next non-sequential PC value?)

A General Model of Branch Prediction
- T: probability of a branch being taken
- p: fraction of branches that are predicted to be taken
- A: accuracy of the prediction
- j, k, m, n: associated delays (penalties) for the four prediction/outcome events (n is usually 0)
- (Figure: branch penalty of a particular prediction method, as a function of branch predictor accuracy and branch penalties)

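Putting the model into a formula (a sketch; the pairing of j, k, m, n with the four events below follows the usual convention and is an assumption, not taken from the figure):

    penalty = j*Pr[predict T, actual T] + k*Pr[predict T, actual NT]
            + m*Pr[predict NT, actual T] + n*Pr[predict NT, actual NT]

With perfect prediction (A = 1) only the first and last events occur, so the penalty collapses to j*T + n*(1-T) = j*T when n = 0, which is the limit quoted on the next slide.
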
Theoretical Limits of Branch Prediction
- Best case: branches are perfectly predicted (A = 1); also assume that n = 0
  - minimum branch penalty = j*T
- Let s be the pipeline stage where the BTA becomes known; then j = s-1
  - (see static prediction methods in Lecture 7)
- Thus, the performance of any branch prediction strategy is limited by
  - s, the location of the pipeline stage that develops the BTA
  - A, the accuracy of the prediction

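As an illustrative calculation (numbers assumed, not from the lecture): if the BTA is developed in stage s = 4 and 60% of branches are taken (T = 0.6), even a perfect predictor pays (s-1)*T = 3 * 0.6 = 1.8 cycles per branch on average.
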
Review: Static Branch Prediction Methods
- Several static prediction strategies:
  - Predict all branches as NOT TAKEN
  - Predict all branches as TAKEN
  - Predict all branches with certain opcodes as TAKEN, and all others as NOT TAKEN
  - Predict all forward branches as NOT TAKEN, and all backward branches as TAKEN
  - Opcodes have default predictions, which the compiler may reverse by setting a bit in the instruction
- Review material in Lecture 7

Dynamic Branch Prediction
- Premise: the history of a branch instruction's outcomes matters!
  - whether a branch will be taken depends greatly on the way previous dynamic instances of the same branch were decided
- Dynamic prediction methods take advantage of this fact by making their predictions dependent on the past behavior of the same branch instruction
  - such methods are called Branch History Table (BHT) methods

BHT Methods for Branch Prediction

A One-Bit Predictor
- State diagram: State 0 = Predict Not Taken, State 1 = Predict Taken; a taken outcome (T) moves the machine toward State 1, a not-taken outcome (NT) moves it toward State 0
- Predictor misses twice on typical loop branches
  - once at the end of the loop
  - once at the end of the 1st iteration of the next execution of the loop
- The outcome sequence NT-T-NT-T-… makes it miss all the time

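A minimal C sketch of this predictor for a single branch (names are illustrative): remember the last outcome and predict the same thing next time.

    #include <stdbool.h>

    static bool last_taken = false;          /* state 0: predict Not Taken */

    bool predict_1bit(void)      { return last_taken; }   /* the prediction   */
    void update_1bit(bool taken) { last_taken = taken; }  /* record outcome   */
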
A Two-Bit Predictor
- A four-state Moore machine
- State diagram: States 0 and 1 = Predict Not Taken, States 2 and 3 = Predict Taken; taken (T) and not-taken (NT) outcomes move the machine between the states
- Predictor misses once on typical loop branches, hence popular
- The outcome sequence NT-NT-T-T-NT-NT-T-T-… makes it miss all the time

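A C sketch of one common four-state scheme, a 2-bit saturating counter; the slide's Moore machine may differ in its exact transitions.

    #include <stdbool.h>
    #include <stdint.h>

    static uint8_t state = 0;                /* 0,1: predict NT; 2,3: predict T */

    bool predict_2bit(void) { return state >= 2; }

    void update_2bit(bool taken)
    {
        if (taken  && state < 3) state++;    /* move toward strongly taken      */
        if (!taken && state > 0) state--;    /* move toward strongly not taken  */
    }
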
Correlating Branch Outcome Predictors
- The history-based branch predictors seen so far base their predictions on the past history of the branch that is being predicted
- A completely different idea: the outcome of a branch may well be predicted successfully based on the outcomes of the last k branches executed, i.e., the path leading to the branch being predicted
- Much-quoted example from the SPEC92 benchmark eqntott:

      if (aa == 2)    /* b1 */
          aa = 0;
      if (bb == 2)    /* b2 */
          bb = 0;
      if (aa != bb)   /* b3 */
          { … }

- TAKEN(b1) && TAKEN(b2) implies NOT-TAKEN(b3): if b1 and b2 are both taken, aa and bb have both been set to 0, so aa == bb and b3 is not taken

Another Example of Branch Correlation

      if (d == 0)    // b1
          d = 1;
      if (d == 1)    // b2
          ...

- Assume multiple runs of the code fragment, with d alternating between 2 and 0
- How would a 1-bit predictor initialized to state 0 behave?
- The corresponding assembly:

          BNEZ  R1, L1
          ADDI  R1, R0, 1
      L1: SUBI  R3, R1, 1
          BNEZ  R3, L2
          …
      L2:

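A sketch of the answer, assuming each branch has its own 1-bit predictor initialized to Not Taken and d repeats 2, 0, 2, 0, …:

    d    b1 predicted / actual    b2 predicted / actual
    2    NT / T   (miss)          NT / T   (miss)
    0    T  / NT  (miss)          T  / NT  (miss)
    2    NT / T   (miss)          NT / T   (miss)
    0    T  / NT  (miss)          T  / NT  (miss)

Every branch is mispredicted, even though b2's outcome is completely determined by b1's.
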
A Correlating Branch Predictor
- Think of having a pair of 1-bit predictors [p0, p1] for each branch, where we choose between predictors (and update them) based on the outcome of the most recent branch (i.e., b1 for b2, and b2 for b1)
  - if the most recent branch was not taken, use and update (if needed) predictor p0
  - if the most recent branch was taken, use and update (if needed) predictor p1
- How would such (1,1) correlating predictors behave if initialized to [0,0]? (a worked trace follows)

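A sketch of that trace for the same d = 2, 0, 2, 0, … sequence, assuming the "most recent branch" outcome starts as Not Taken and all 1-bit entries start as Not Taken:

    d    b1: uses / predicted / actual    b2: uses / predicted / actual
    2    p0 / NT / T   (miss)             p1 / NT / T   (miss)
    0    p1 / NT / NT  (ok)               p0 / NT / NT  (ok)
    2    p0 / T  / T   (ok)               p1 / T  / T   (ok)
    0    p1 / NT / NT  (ok)               p0 / NT / NT  (ok)

Only the first execution of each branch is mispredicted; after that the correlating predictors are always right.
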
Organization of an (m,n) Correlating Predictor
- Uses the results of the last m branches
  - 2^m possible outcome histories
  - can be kept in an m-bit shift register
- Each entry is an n-bit "self-history" predictor
- The BHT is addressed using
  - m bits of global history, which select the column (a particular predictor)
  - some lower bits of the branch address, which select the row (a particular branch instruction)
- Aliasing can occur, since the BHT uses only a portion of the branch instruction address
  - the state in the various predictors in a single row may correspond to different branches at different points of time
- m = 0 is the ordinary BHT
- (Figure: 4 lower bits of the branch address select the row, a 2-bit global branch history selects among the 2-bit branch predictors in that row, and the selected predictor supplies the prediction)

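A minimal C sketch of such a predictor with m = 2 bits of global history and n = 2-bit counters; table sizes and names are illustrative, not the lecture's.

    #include <stdbool.h>
    #include <stdint.h>

    #define ROW_BITS   4                    /* low-order branch-address bits */
    #define ROWS       (1 << ROW_BITS)
    #define HIST_BITS  2                    /* m = 2 bits of global history  */
    #define HISTS      (1 << HIST_BITS)

    static uint8_t counters[ROWS][HISTS];   /* 2-bit counters, values 0..3   */
    static uint8_t global_history;          /* m-bit shift register          */

    bool predict(uint32_t pc)
    {
        uint8_t ctr = counters[(pc >> 2) & (ROWS - 1)][global_history];
        return ctr >= 2;                    /* values 2,3 predict taken      */
    }

    void update(uint32_t pc, bool taken)
    {
        uint8_t *ctr = &counters[(pc >> 2) & (ROWS - 1)][global_history];
        if (taken  && *ctr < 3) (*ctr)++;   /* saturate at 3                 */
        if (!taken && *ctr > 0) (*ctr)--;   /* saturate at 0                 */
        global_history = ((global_history << 1) | taken) & (HISTS - 1);
    }
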
Improved Dynamic Branch Prediction
- Recall that, even with perfect accuracy of prediction, the branch penalty of a prediction method is (s-1)*T
  - s is the pipeline stage where the BTA is developed
  - T is the frequency of taken branches
- Further improvements can be obtained only by using a cache storing BTAs, and accessing it simultaneously with the I-cache
  - such a cache is called a Branch Target Buffer (BTB)
- BHT and BTB can be used together
  - Coupled: one table holds all the information
  - Uncoupled: two independent tables

Using BTB and BHT Together
- Uncoupled solution
  - BTB stores only the BTAs of taken branches recently executed
  - No separate branch outcome prediction (the presence of an entry in the BTB can be used as an implicit prediction of the branch being TAKEN next time)
  - Use the BHT in case of a BTB miss
- Coupled solution
  - Stores the BTAs of all branches recently executed
  - Has a separate branch outcome prediction for each table entry
  - Use the BHT in case of a BTB hit
  - Predict NOT TAKEN otherwise

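A minimal fetch-time C sketch of the coupled organization (entry layout and sizes are illustrative): each entry stores a BTA plus outcome-prediction bits, and a miss defaults to predict NOT TAKEN.

    #include <stdbool.h>
    #include <stdint.h>

    struct btb_entry {
        bool     valid;
        uint32_t tag;        /* full branch address              */
        uint32_t bta;        /* predicted branch target address  */
        uint8_t  counter;    /* 2-bit outcome predictor (0..3)   */
    };

    #define BTB_ENTRIES 256
    static struct btb_entry btb[BTB_ENTRIES];

    uint32_t next_fetch_pc(uint32_t pc)
    {
        struct btb_entry *e = &btb[(pc >> 2) % BTB_ENTRIES];
        if (e->valid && e->tag == pc && e->counter >= 2)
            return e->bta;   /* BTB hit, outcome bits say taken: redirect    */
        return pc + 4;       /* BTB miss or predicted not taken: sequential  */
    }
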
Parameters of Real Machines

Coupled BTB and BHT

Decoupled BTB and BHT

Reducing Misprediction Penalties
- Need to recover whenever the branch prediction is not correct
  - Discard all speculatively executed instructions
  - Resume execution along the alternative path (this is the costly step)
- Scenarios where recovery is needed
  - Predict taken, branch is taken, BTA wrong (case 7)
  - Predict taken, branch is not taken (cases 4 and 6)
  - Predict not taken, branch is taken (case 3)
- Preparing for recovery involves working on the alternative path
  - On the instruction level
    - Two fetch address registers per speculated branch (PPC 603 & 604)
    - Two instruction buffers (IBM 360/91, SuperSPARC, Pentium)
  - On the I-cache level
    - For PT (predict taken), also do next-line prefetching
    - For PNT (predict not taken), also do target-line prefetching

Predicting Dynamic BTAs
- The vast majority of dynamic BTAs come from procedure returns (85% for SPEC95)
- Since procedure call-return for the most part follows a stack discipline, a specialized return address buffer operated as a stack (a return address stack, RAS) is appropriate for high prediction accuracy
  - Pushes the return address on a call
  - Pops the return address on a return
- The depth of the RAS should be as large as the maximum call depth to avoid mispredictions
  - 8-16 entries are generally sufficient

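A minimal C sketch of such a return address stack (depth and names are illustrative):

    #include <stdint.h>

    #define RAS_DEPTH 16
    static uint32_t ras[RAS_DEPTH];
    static int      ras_top;                      /* number of pushes outstanding  */

    void ras_push(uint32_t return_addr)           /* on a call                     */
    {
        ras[ras_top++ % RAS_DEPTH] = return_addr; /* overflow overwrites oldest    */
    }

    uint32_t ras_pop(void)                        /* on a return: predicted BTA    */
    {
        if (ras_top > 0) ras_top--;
        return ras[ras_top % RAS_DEPTH];          /* stale if the stack overflowed */
    }

Returns from call chains deeper than RAS_DEPTH will pop stale addresses, which is why the slide recommends sizing the stack to the maximum call depth.
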