COMP 740: Computer Architecture and Implementation Montek Singh Oct 10, 2016 Topic: Instruction-Level Parallelism - I (Dynamic Branch Prediction)
Instruction-Level Parallelism. Exploit parallelism that can be “squeezed out” of programs written sequentially: overlap the execution of instructions to improve performance. This requires sophisticated hardware and software. Hardware: helps discover and exploit parallelism dynamically (at runtime); dominates the desktop/server markets (e.g., Intel Core series), but is creeping into mobile devices as well. Software: finds parallelism statically at compile time; used in scientific computing and also in personal mobile devices (where hardware must be simpler and energy-efficient). Pipelining became universal around 1985; the goal now is to go beyond basic pipelining.
Exploiting ILP There are several techniques. Here’s a summary:
Branch Prediction: the first technique we will study for ILP.
Why Do We Need Branch Prediction? Parallelism within a basic block is limited: basic blocks are short (3-6 instructions), so control dependences can become the bottleneck. We must optimize across branches. Since branches disrupt the sequential flow of instructions, we need to be able to predict branch behavior to avoid stalling the pipeline. What must we predict? Two things: (1) Branch outcome: is the branch taken? (2) Branch target address: what is the next PC value?
A General Model of Branch Prediction. Branch predictor accuracy and branch penalties are described by: T: probability of a branch being taken; p: fraction of branches that are predicted to be taken; A: accuracy of prediction; j, k, m, n: associated delays (penalties) for the four events (n is usually 0). These quantities determine the branch penalty of a particular prediction method.
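A minimal sketch of how these parameters could combine into an expected penalty per branch. It assumes the four events are the four (predicted, actual) combinations, with j for predicted taken/actually taken, k for predicted taken/actually not taken, m for predicted not taken/actually taken, and n for predicted not taken/actually not taken. The assignment of k and m, and the function names, are assumptions made for illustration, not taken from the slide.

```python
def event_probabilities(T, p, A):
    """Joint probabilities of the four (predicted, actual) events.

    Solves a + c = T, a + b = p, a + d = A, a + b + c + d = 1.
    """
    a = (T + p + A - 1) / 2   # predicted taken, actually taken
    b = p - a                 # predicted taken, actually not taken
    c = T - a                 # predicted not taken, actually taken
    d = A - a                 # predicted not taken, actually not taken
    if any(x < -1e-12 for x in (a, b, c, d)):
        raise ValueError("T, p and A are mutually inconsistent")
    return a, b, c, d


def branch_penalty(T, p, A, j, k, m, n=0):
    """Expected penalty per branch (assumed mapping of j, k, m, n to events)."""
    a, b, c, d = event_probabilities(T, p, A)
    return j * a + k * b + m * c + n * d
```

With perfect prediction (A = 1, hence p = T) and n = 0, this reduces to j*T.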
Theoretical Limits of Branch Prediction. Best case: branches are perfectly predicted (A = 1); also assume that n = 0. Then the minimum branch penalty = j*T. Let s be the pipeline stage where the BTA becomes known; then j = s-1 (see static prediction methods in previous lecture). Thus, the performance of any branch prediction strategy is limited by: s, the location of the pipeline stage that develops the BTA; and A, the accuracy of the prediction.
Review: Static Branch Prediction Methods. Several static prediction strategies: (1) Predict all branches as NOT TAKEN. (2) Predict all branches as TAKEN. (3) Predict all branches with certain opcodes as TAKEN, and all others as NOT TAKEN. (4) Predict all forward branches as NOT TAKEN, and all backward branches as TAKEN. (5) Opcodes have default predictions, which the compiler may reverse by setting a bit in the instruction.
Dynamic Branch Prediction. Premise: the history of a branch instruction's outcomes matters! Whether a branch will be taken depends greatly on the way previous dynamic instances of the same branch were decided. Dynamic prediction methods take advantage of this fact by making their predictions dependent on the past behavior of the same branch instruction. Such methods are called Branch History Table (BHT) methods.
BHT Methods for Branch Prediction
A One-Bit Predictor. Two states: State 1 (Predict Taken) and State 0 (Predict Not Taken); outcome T moves to (or stays in) State 1, and outcome NT moves to (or stays in) State 0. The predictor misses twice on typical loop branches: once at the end of the loop, and once at the end of the 1st iteration of the next execution of the loop. The outcome sequence NT-T-NT-T makes it miss all the time.
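A minimal sketch of the one-bit predictor, counting mispredictions over a sequence of outcomes. The function name and the 4-iteration loop used in the usage note are illustrative assumptions.

```python
def one_bit_misses(outcomes, state=True):
    """Count mispredictions of a one-bit predictor.

    outcomes: iterable of booleans (True = taken).
    state:    initial prediction (True = State 1, predict taken).
    """
    misses = 0
    for taken in outcomes:
        if state != taken:
            misses += 1
        state = taken  # remember only the most recent outcome
    return misses
```

For a loop branch that is taken three times and then falls through (T, T, T, NT), every execution after the first costs two misses; the alternating sequence NT, T, NT, T starting in State 1 misses every time.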
A Two-Bit Predictor. A four-state Moore machine: States 2 and 3 predict Taken; States 0 and 1 predict Not Taken; transitions are labeled with the actual outcome (T or NT). The predictor misses once on typical loop branches, hence its popularity. The outcome sequence NT-NT-T-T-NT-NT-T-T makes it miss all the time.
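A minimal sketch of a two-bit predictor, assuming the common saturating-counter transitions (increment on taken, decrement on not taken). The exact transitions in the slide's state diagram may differ, and the worst-case outcome sequence depends on them; only the loop-branch behavior is illustrated here.

```python
def two_bit_misses(outcomes, state=3):
    """Count mispredictions of a 2-bit saturating counter (assumed variant).

    States 2 and 3 predict taken; states 0 and 1 predict not taken.
    """
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            misses += 1
        # Saturating update: move toward 3 on taken, toward 0 on not taken.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses
```

For the loop branch (T, T, T, NT) executed repeatedly, only the loop exit is mispredicted: one miss per execution.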
Correlating Branch Outcome Predictors. The history-based branch predictors seen so far base their predictions on the past history of the branch that is being predicted. A completely different idea: the outcome of a branch may well be predicted successfully based on the outcomes of the last k branches executed, i.e., the path leading to the branch being predicted. Much-quoted example from the SPEC92 benchmark eqntott:
    if (aa == 2) /*b1*/ aa = 0;
    if (bb == 2) /*b2*/ bb = 0;
    if (aa != bb) /*b3*/ { … }
TAKEN(b1) && TAKEN(b2) implies NOT-TAKEN(b3).
Another Example of Branch Correlation.
    if (d == 0) //b1
        d = 1;
    if (d == 1) //b2
        ...
Assume multiple runs of the code fragment, where d alternates between 2 and 0. How would a 1-bit predictor initialized to state 0 behave? The corresponding assembly (with d in R1):
    BNEZ R1, L1
    ADDI R1, R0, 1
L1: SUBI R3, R1, 1
    BNEZ R3, L2
    …
L2:
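One way to answer the question is to simulate it. The sketch below assumes one 1-bit predictor per branch, both initialized to state 0 (predict not taken), and models the branches at the assembly level: b1 (BNEZ R1, L1) is taken when d != 0, and b2 (BNEZ R3, L2) is taken when d != 1. The function names are illustrative.

```python
def run_fragment(d):
    """Outcomes (True = taken) of b1 and b2 for one run of the fragment."""
    b1_taken = d != 0          # BNEZ R1, L1
    if not b1_taken:
        d = 1                  # ADDI R1, R0, 1
    b2_taken = d != 1          # BNEZ R3, L2, with R3 = d - 1
    return b1_taken, b2_taken


def one_bit_misses(d_values):
    """Mispredictions with an independent 1-bit predictor per branch."""
    state = {"b1": False, "b2": False}   # state 0: predict not taken
    misses = 0
    for d in d_values:
        for name, taken in zip(("b1", "b2"), run_fragment(d)):
            if state[name] != taken:
                misses += 1
            state[name] = taken
    return misses
```

Under these assumptions, with d = 2, 0, 2, 0 every one of the eight branch instances is mispredicted.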
A Correlating Branch Predictor. Think of having a pair of 1-bit predictors [p0, p1] for each branch, where we choose between the predictors (and update them) based on the outcome of the most recent branch (i.e., B1 for B2, and B2 for B1). If the most recent branch was not taken, use and update (if needed) predictor p0; if the most recent branch was taken, use and update (if needed) predictor p1. How would such (1,1) correlating predictors behave if initialized to [0,0]?
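A sketch of the (1,1) scheme on the d = 2, 0, 2, 0 example, with b1 taken when d != 0 and b2 taken when d != 1 (assembly-level view). It assumes the "most recent branch" before the first run counts as not taken; that initial value and the function names are assumptions.

```python
def run_fragment(d):
    """Outcomes (True = taken) of b1 and b2 for one run of the fragment."""
    b1_taken = d != 0
    if not b1_taken:
        d = 1
    b2_taken = d != 1
    return b1_taken, b2_taken


def correlating_misses(d_values, last_taken=False):
    """Mispredictions with a [p0, p1] pair of 1-bit predictors per branch."""
    table = {"b1": [False, False], "b2": [False, False]}  # initialized to [0,0]
    misses = 0
    for d in d_values:
        for name, taken in zip(("b1", "b2"), run_fragment(d)):
            idx = 1 if last_taken else 0   # p1 if last branch taken, else p0
            if table[name][idx] != taken:
                misses += 1
            table[name][idx] = taken       # update only the selected predictor
            last_taken = taken
    return misses
```

Under these assumptions, only the two branches of the first run are mispredicted; all later instances are predicted correctly.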
Organization of (m,n) Correlating Predictor. Uses the results of the last m branches: 2^m possible outcomes, which can be kept in an m-bit shift register. Each entry is an n-bit “self-history” predictor. The BHT is addressed using: m bits of global history, which select the column (a particular predictor); and some lower bits of the branch address, which select the row (a particular branch instruction). Each entry holds n previous outcomes. Aliasing can occur, since the BHT uses only a portion of the branch instruction address: the state in the various predictors in a single row may correspond to different branches at different points of time. m=0 is an ordinary BHT. [Figure: 2-bit branch predictors indexed by 4 bits of the branch address and 2 bits of global branch history, producing the prediction.]
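A minimal sketch of this organization. It assumes each entry is an n-bit saturating counter, that the row is selected by the low `row_bits` bits of the branch address (ignoring instruction alignment), and that the global history starts at all zeros; the class and parameter names are hypothetical.

```python
class CorrelatingPredictor:
    """(m, n) predictor: 2**row_bits rows, 2**m columns of n-bit counters."""

    def __init__(self, m, n, row_bits):
        self.m, self.n, self.row_bits = m, n, row_bits
        self.history = 0                    # m-bit global shift register
        self.max_count = (1 << n) - 1
        self.table = [[0] * (1 << m) for _ in range(1 << row_bits)]

    def _index(self, branch_addr):
        row = branch_addr & ((1 << self.row_bits) - 1)  # aliasing possible
        return row, self.history

    def predict(self, branch_addr):
        row, col = self._index(branch_addr)
        return self.table[row][col] >= (1 << (self.n - 1))

    def update(self, branch_addr, taken):
        row, col = self._index(branch_addr)
        count = self.table[row][col]
        count = min(count + 1, self.max_count) if taken else max(count - 1, 0)
        self.table[row][col] = count
        # Shift the outcome into the global history (stays 0 when m = 0).
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)
```

With m = 0 the table has a single column, so this degenerates to an ordinary BHT; `CorrelatingPredictor(m=2, n=2, row_bits=4)` would correspond to a 4-bit row index with 2 bits of global history.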
Improved Dynamic Branch Prediction. Recall that, even with perfect accuracy of prediction, the branch penalty of a prediction method is (s-1)*T, where s is the pipeline stage where the BTA is developed and T is the frequency of taken branches. Further improvements can be obtained only by using a cache storing BTAs, and accessing it simultaneously with the I-cache. Such a cache is called a Branch Target Buffer (BTB). BHT and BTB can be used together. Coupled: one table holds all the information. Uncoupled: two independent tables.
Branch Target Buffers. Figure 3.21: A branch-target buffer. The PC of the instruction being fetched is matched against a set of instruction addresses stored in the first column; these represent the addresses of known branches. If the PC matches one of these entries, then the instruction being fetched is a taken branch, and the second field, predicted PC, contains the prediction for the next PC after the branch. Fetching begins immediately at that address. The third field, which is optional, may be used for extra prediction state bits.
How BTB is used. During instruction fetch: read the BTB concurrently with instruction memory; for the most recently taken branches, the BTB immediately provides the branch target address (BTA).
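A minimal sketch of a BTB lookup during fetch. It assumes a direct-mapped buffer, 4-byte instructions, and that an entry is simply overwritten on a conflict; the class, method names, and sizes are hypothetical.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB sketch: tag = full branch PC, data = predicted PC."""

    def __init__(self, entries=16, instr_size=4):
        self.entries = entries
        self.instr_size = instr_size
        self.tags = [None] * entries
        self.targets = [0] * entries

    def _slot(self, pc):
        return (pc // self.instr_size) % self.entries

    def lookup(self, pc):
        """Return the stored BTA on a hit, or None on a miss."""
        i = self._slot(pc)
        return self.targets[i] if self.tags[i] == pc else None

    def insert(self, pc, target):
        """Record a taken branch; overwrites whatever shared the slot."""
        i = self._slot(pc)
        self.tags[i], self.targets[i] = pc, target

    def next_fetch_pc(self, pc):
        """Hit: fetch from the BTA. Miss: fetch inline (next sequential PC)."""
        target = self.lookup(pc)
        return target if target is not None else pc + self.instr_size
```

In hardware the lookup happens in parallel with the I-cache access; the sequential code here only models the resulting choice of next PC.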
Using BTB and BHT Together. Uncoupled solution: the BTB stores only the BTAs of taken branches recently executed; there is no separate branch outcome prediction (the presence of an entry in the BTB can be used as an implicit prediction of the branch being TAKEN next time); use the BHT in case of a BTB miss. Coupled solution: stores the BTAs of all branches recently executed; has a separate branch outcome prediction for each table entry; use the BHT in case of a BTB hit; predict NOT TAKEN otherwise.
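The two lookup policies can be sketched as follows. This is an illustrative model only: tables are plain dictionaries, `bht_predicts_taken` is a hypothetical callback standing in for the BHT, and each function returns a (predict taken?, BTA or None) pair.

```python
def predict_uncoupled(pc, btb, bht_predicts_taken):
    """Uncoupled: a BTB hit is an implicit TAKEN prediction; a miss asks the BHT.

    btb maps branch PC -> BTA for recently taken branches.
    """
    if pc in btb:
        return True, btb[pc]
    # On a miss the BTA is not available from the BTB, even if the BHT says taken.
    return bht_predicts_taken(pc), None


def predict_coupled(pc, table):
    """Coupled: one table holds BTA and outcome prediction for recent branches.

    table maps branch PC -> (BTA, predicts_taken).
    """
    if pc in table:
        bta, predicts_taken = table[pc]
        return (True, bta) if predicts_taken else (False, None)
    return False, None  # miss: predict NOT TAKEN
```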
Parameters of Real Machines
Coupled BTB and BHT [flowchart]. Access the BTB and I-cache. Miss in BTB: predict not taken (PNT) and fetch inline. Not a branch: OK, zero penalty (Case 1). Instruction is a branch: branch not taken (Case 2); branch taken, next instruction killed, enter into BTB (Case 3). Hit in BTB (instruction is a branch): Predict Not Taken using the BHT (Cases 4 and 5, with a possible BTB update), or Predict Taken and go to the BTA stored in the BTB (Case 6; wrong BTA, update BTB: Case 7; correct BTA: Case 8).
Decoupled BTB and BHT [flowchart]. Access the BTB and I-cache. Miss in BTB: fetch inline and wait for the opcode. Not a branch: OK, zero penalty (Case 1). Instruction is a branch: predict using the BHT. Predict Not Taken: branch not taken (Case 2); branch taken, next instruction killed, enter into BTB (Case 3). Predict Taken: Cases 4 and 5. Hit in BTB: Predict Taken and go to the BTA stored in the BTB: branch not taken, remove from BTB (Case 6); wrong BTA, update BTB (Case 7); correct BTA (Case 8).
Reducing Misprediction Penalties. We need to recover whenever a branch prediction is not correct: discard all speculatively executed instructions, and resume execution along the alternative path (this is the costly step). Scenarios where recovery is needed: predict taken, branch is taken, BTA wrong (case 7); predict taken, branch is not taken (cases 4 and 6); predict not taken, branch is taken (case 3). Preparing for recovery involves working on the alternative path. On the instruction level: two fetch address registers per speculated branch (PPC 603 & 640); two instruction buffers (IBM 360/91, SuperSPARC, Pentium). On the I-cache level: for PT, also do next-line prefetching; for PNT, also do target-line prefetching.
Predicting Dynamic BTAs. The vast majority of dynamic BTAs come from procedure returns (85% for SPEC95); others come from case/switch statements, indirect procedure calls, etc., which are common in OOP (C++, Java). Procedure call-return for the most part follows a stack discipline, so a specialized return address buffer operated as a stack (a return address stack, RAS) is appropriate for high prediction accuracy: it pushes the return address on a call and pops the return address on a return. The depth of the RAS should be as large as the maximum call depth to avoid mispredictions; 8-16 elements are generally sufficient.
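A minimal sketch of a return address stack. It assumes a fixed depth and that a push onto a full stack silently discards the oldest entry (one possible overflow policy among several); the class and method names are hypothetical.

```python
from collections import deque


class ReturnAddressStack:
    """Fixed-depth stack used to predict the targets of procedure returns."""

    def __init__(self, depth=8):
        # deque(maxlen=...) drops the oldest entry when a push overflows.
        self.stack = deque(maxlen=depth)

    def on_call(self, return_addr):
        """Push the return address when a call is fetched."""
        self.stack.append(return_addr)

    def on_return(self):
        """Pop the predicted return address, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None
```

When the call depth exceeds the stack depth, the outermost return addresses are lost and the corresponding returns cannot be predicted from the stack, which is why the depth should cover the typical maximum call depth.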