Korea University, G. Lee, 2009. CRE652 Processor Architecture: Dynamic Branch Prediction


Dynamic Branch Prediction
Predict the branch outcome at run time, along with where the target instruction is.
- avoids control hazards
- has a heavy effect on multi-issue processors
Example:
  bne
  add
  …
  sub
Pipeline: the branch is in IF ID … while the next instruction is already in IF.
To avoid a stall, the fetch stage needs to know which one, ADD or SUB, to fetch even before the branch is decoded.

Example: suppose a branch comes every six instructions, and the prediction success rates are Static-Taken: 70% and Dynamic: 90%. Assume a 2-cycle stall per misprediction (and no other stalls in the pipe).
With single issue:
  CPI = 1 + (0.3 * 2) * (1/6) = 1.10 for static
  CPI = 1 + (0.1 * 2) * (1/6) = 1.03 for dynamic
  About 7% difference.
With 6-issue, branches come six times as fast (one per cycle):
  CPI = 1 + (0.3 * 2) * (6/6) = 1.6 for static
  CPI = 1 + (0.1 * 2) * (6/6) = 1.2 for dynamic
  About 33% difference (if one commit/cycle).
  CPI = 1/6 + (0.3 * 2) * (6/6) = 0.77 for static
  CPI = 1/6 + (0.1 * 2) * (6/6) = 0.37 for dynamic
  Static CPI is about 2.1 times the dynamic CPI (if six commits/cycle).
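The arithmetic of this example can be checked with a small sketch. The function name `cpi` and its parameter names are illustrative only; the formula is the slide's own model (base CPI plus misprediction rate times penalty times branch rate), not a general pipeline model.

```python
def cpi(base_cpi, mispredict_rate, penalty_cycles, branch_rate):
    """Base CPI plus the average misprediction stall.

    branch_rate follows the slide's usage: 1/6 for single issue
    (one branch per six instructions), 6/6 for 6-issue (one per cycle).
    """
    return base_cpi + mispredict_rate * penalty_cycles * branch_rate


# Single issue: static-taken (30% wrong) vs. dynamic (10% wrong).
single_static = cpi(1.0, 0.3, 2, 1 / 6)
single_dynamic = cpi(1.0, 0.1, 2, 1 / 6)

# 6-issue with six commits per cycle: the base CPI drops to 1/6.
six_static = cpi(1 / 6, 0.3, 2, 6 / 6)
six_dynamic = cpi(1 / 6, 0.1, 2, 6 / 6)

print(round(single_static, 2), round(single_dynamic, 2))
print(round(six_static, 2), round(six_dynamic, 2))
```

The lower the base CPI, the larger the share of time lost to the same misprediction stall, which is why prediction accuracy matters more on wide-issue machines.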

Branch Prediction
What to predict
- Branch direction (taken or not taken): for conditional branches; the harder part
- Branch target, if taken
When to predict
- Target at the IF stage; direction could be later, but earlier than EX
When to verify
- At the end of EX of the branch, when the branch is resolved
Predictor type
- Static: always assume the branch is either taken or not taken
- Dynamic: the prediction changes over time -> our focus
Example instruction sequence (pipeline stages IF ID IS EX EX ...):
  add   r1, r2, r3
  load  r4, 100(r5)
  subi  r4, r4, 200
  store r4, 120(r5)
  addi  r5, r5, 1
  bne   r1, r5, offset

Branch Misprediction Recovery
[Figure: control-flow graph with blocks A, B, C, and D, marking the predicted path and the actual path]
Assume the branch at A is mispredicted. The processor should
1. redirect the fetch point to the other branch of A, and
2. cancel/nullify the instructions in B and D.
The misprediction penalty is the number of cycles between the time the branch is predicted (at fetch) and the time it is resolved (typically at the end of EX). All instructions fetched, decoded, and executed in between must be canceled.

Dynamic Branch Prediction
For a single-issue pipe, dynamic branch prediction may be just a novel scheme, but it is an essential feature for multiple-issue pipes.
Dynamic prediction is based on branch history:
- Looking only at the history of the branch being predicted -> prediction in isolation
- Looking at the history of other branches in addition to the branch being predicted -> correlating prediction

Dynamic Branch Prediction
Example:
  …
  bne
  …
  …
  beq
  …
To predict "beq", consider the history of "beq" only (prediction in isolation), or the histories of "bne" and "beq" together (correlating prediction).

Branch History
Q: How many previous branch decisions should be considered (branch history depth) for good prediction success?
One-bit history, a.k.a. last-value prediction: what was the previous branch decision?
- Start with a prediction of either T or N
- If wrong, change the prediction to the other for next time
- Prediction in isolation
[State diagram: state 0 predicts N, state 1 predicts T; a taken branch (T) moves to state 1, a not-taken branch (N) moves to state 0]

More bits record more history, making the prediction more accurate (or maybe NOT?)
Two-bit history (prediction) bits, based on static profiling:
  2-bit history   Profile of Taken (%)   Prediction
  NN (00)         11                     N
  NT (01)         61                     T
  TN (10)         54                     T
  TT (11)         97                     T
[State diagram: 2-bit history states 00 (N), 01 (T), 11 (T), 10 (T), with T/N transitions between them]
The state variable is the branch history.

How large might n be?
  n    Compiler   Business   Scientific   System
  0    64.1       64.4       70.4         54.0
  1    91.9       95.2       86.6         79.7
  2    93.0       96.5       90.8         83.4
  3    93.7       96.6       91.0         83.5
  4    94.5       96.8       91.8         83.7
  5    94.7       97.0       92.0         83.9
Note: 0-bit is static Taken prediction.
Even with ∞ bits, accuracy improves little over 2-bit prediction.

A 1-bit predictor might be too sensitive.
Bi-modal predictor: count mispredictions instead of recording branch history.
  for i = 1; i <= 5; i++
    for j = 1; j <= 10; j++
      do something
  Label1: i = i + 1
  Label2: do something
          j = j + 1
          ble j, 10, Label2
          ble i, 5, Label1
For each inner loop, the blue branch (the inner loop-closing branch, ble j, 10, Label2) is mispredicted twice by a 1-bit predictor: once on the final not-taken exit, and once more on the first taken iteration of the next pass.
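A minimal sketch of the loop example with a last-value (1-bit) predictor. The helper names are hypothetical, and the inner loop-closing branch is modeled simply as nine taken outcomes followed by one not-taken outcome, repeated for each of the five outer iterations.

```python
def inner_loop_outcomes(outer=5, inner=10):
    """Outcomes of the inner loop-closing branch: taken (inner - 1) times, then not taken."""
    return ([True] * (inner - 1) + [False]) * outer


def last_value_misses(outcomes, initial_taken=True):
    """Count mispredictions of a 1-bit predictor that repeats the last outcome."""
    prediction = initial_taken
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
        prediction = taken  # the single history bit flips on every misprediction
    return misses


total = last_value_misses(inner_loop_outcomes())
print(total)
```

With the initial prediction set to taken, the first pass misses only at loop exit and every later pass misses twice, so 50 branch executions produce 9 mispredictions.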

Bi-modal (saturating counter) predictor
- 2-bit "saturating" counter: the state variable is a number
- From a strong state, only two consecutive mispredictions cause a prediction change.
[State diagram: states 0 (N), 1 (N), 2 (T), 3 (T); a taken branch adds 1, a not-taken branch subtracts 1, saturating at 0 and 3. The state variable is a counter.]
With the same hardware resources, the bi-modal predictor has better prediction accuracy than the 2-bit history predictor.
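The saturating counter can be sketched as follows. The class and function names are illustrative; the loop branch pattern (nine taken, one not taken, five times) is an assumed model of a 10-iteration inner loop run five times.

```python
class TwoBitCounter:
    """2-bit saturating counter: states 0 and 1 predict not taken, 2 and 3 predict taken."""

    def __init__(self, state=3):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Taken adds 1, not taken subtracts 1, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_misses(counter, outcomes):
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses


loop_branch = ([True] * 9 + [False]) * 5
bimodal_misses = count_misses(TwoBitCounter(state=3), loop_branch)
print(bimodal_misses)
```

Starting from the strongly-taken state, the single not-taken outcome at each loop exit only moves the counter to state 2, so the next pass is still predicted taken and each pass costs one misprediction instead of two.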

Hardware organization
[Figure: l bits (0 < l <= 32) of the 32-bit PC index a prediction table; each entry is an n-bit counter/history]
- Multiple branches could be mapped into one entry: the aliasing problem, or resolution issue
- How many entries in the prediction table?
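A small sketch of how l PC bits might select a table entry, and how aliasing arises. The function name, the choice to skip the two low-order bits (assuming 4-byte aligned instructions), and the example addresses are all assumptions for illustration.

```python
def table_index(pc, index_bits, skip_bits=2):
    """Drop the low skip_bits of the PC and keep the next index_bits bits."""
    return (pc >> skip_bits) & ((1 << index_bits) - 1)


# Two different branch addresses that differ only above the indexed bits.
index_a = table_index(0x00400100, index_bits=10)
index_b = table_index(0x00401100, index_bits=10)
print(index_a, index_b)
```

Both addresses land on the same entry of a 1024-entry table, so the two branches share (and disturb) one counter. Using more index bits reduces aliasing at the cost of a larger table.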

A branch decision may be affected by other branch decisions: correlating prediction
  if (aa == 2) aa = 0;
  if (bb == 4) bb = 0;
  if (aa != bb) { …
If the first two conditions are true, then the third will be false.

Correlating Branch Predictor
If we use 2 branches as history, there are 4 possibilities (T-T, T-N, N-T, N-N). For each possibility we need a predictor, and this repeats for every branch.
[Figure: (2,2) branch predictor; 2^4 = 16]

Another way to view the correlating branch predictor
Save recent branch outcomes to approximate the control path followed -> Branch History Register (BHR). Some people call it BHSR: branch history shift register.
- An m-bit shift register holds the outcomes of the last m branch instruction executions (recall: it is dynamic prediction!)
- Whenever a branch decision is made, the oldest bit is shifted out of the BHR and the new decision bit is shifted in.
[Figure: a sequence of branch outcomes (T N N T T N T) shifted into the BHR one bit at a time: 0, 01, 010]
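The shift-register behavior can be sketched in a few lines. The class name is hypothetical, and the encoding (1 for taken, 0 for not taken, newest outcome in the least significant bit) is an assumption; the slide's figure may use a different bit convention.

```python
class BranchHistoryRegister:
    """m-bit shift register of recent branch outcomes (assumed encoding: 1 = taken)."""

    def __init__(self, m):
        self.m = m
        self.bits = 0

    def shift_in(self, taken):
        # Shift left, insert the new outcome, and drop the bit that falls off the top.
        self.bits = ((self.bits << 1) | int(taken)) & ((1 << self.m) - 1)
        return self.bits


bhr = BranchHistoryRegister(m=3)
history = [bhr.shift_in(t) for t in (True, False, True, True)]
print([format(h, "03b") for h in history])
```

After the fourth outcome the oldest bit has been shifted out, so the register holds only the last three decisions.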

Performance of Correlating Branch Prediction
- With the same number of state bits, (2,2) performs better than a noncorrelating 2-bit predictor.
- It even outperforms a 2-bit predictor with an infinite number of entries.

Correlating Predictor
- Note: a (0,2) predictor is a 2-bit prediction in isolation.
- Sometimes the m history bits and the n-bit predictor come from the same branch instruction, e.g. a loop-closing branch without any other branch in the loop body.
- Note: an entry is not unique to a specific branch: the program can follow different execution paths, and therefore arrive with different BHR values, to reach one particular branch.
- A larger m may provide better resolution, leading to better accuracy: m = 10 or 12 seems popular.

Correlating Predictor: (m, n) predictor
- m: m-bit (global) BHR
- n: n-bit history bits or counter (per local branch)
- The PC and BHR together are used to access the branch prediction/history table (a table of history/prediction bits; in most cases a 2-bit history table).
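A minimal sketch of an (m, 2) predictor, assuming the table index is formed by concatenating a few PC bits with the global BHR and that all counters start in the weakly-not-taken state. The class name, the default sizes, and the initial state are illustrative choices, not taken from the slides.

```python
class CorrelatingPredictor:
    """(m, 2) predictor sketch: PC bits concatenated with an m-bit global BHR."""

    def __init__(self, m=2, pc_bits=4):
        self.m = m
        self.pc_bits = pc_bits
        self.bhr = 0                              # global history, 1 = taken
        self.table = [1] * (1 << (m + pc_bits))   # 2-bit counters, weakly not taken

    def _index(self, pc):
        pc_part = (pc >> 2) & ((1 << self.pc_bits) - 1)
        return (pc_part << self.m) | self.bhr

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)
        self.bhr = ((self.bhr << 1) | int(taken)) & ((1 << self.m) - 1)


def run(predictor, pc, outcomes):
    """Return the number of mispredictions over a sequence of outcomes."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


alternating = [True, False] * 50
misses = run(CorrelatingPredictor(), 0x400, alternating)
print(misses)
```

For a strictly alternating branch, the history selects a different counter before each taken and each not-taken outcome, so after a short warm-up every prediction is correct. A single 2-bit counter without history cannot learn this pattern.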

gshare Predictor by McFarling
[Figure: the PC and the m-bit BHR are combined to index a 2^m-entry history table of 2-bit history/counter predictors, which outputs the prediction]
- The PC and BHR can be used to access the 2-bit history table either concatenated or XORed (partially or fully).
- BHR information, as well as the branch's PC, is used to index into an array of isolated predictors.
- The table is called the Branch History Table (BHT) or Pattern History Table (PHT).
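The XOR indexing can be sketched as a single function. The name `gshare_index`, the decision to skip two low-order PC bits, and the example values are assumptions for illustration.

```python
def gshare_index(pc, bhr, m, skip_bits=2):
    """XOR the low m PC bits (after skipping the alignment bits) with the m-bit BHR."""
    mask = (1 << m) - 1
    return ((pc >> skip_bits) ^ bhr) & mask


# Same global history, two neighboring branches: they map to different entries.
idx_first = gshare_index(0x1000, 0b1010, m=4)
idx_second = gshare_index(0x1004, 0b1010, m=4)
print(idx_first, idx_second)
```

Compared with concatenation, XOR lets both the full m bits of history and m bits of the PC influence the index while the table stays at 2^m entries.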

2-level Predictors - extended idea: BHR and BHR table
- We can have one BHR (global BHR) for a program (G): only one register, read and updated by every branch.
- Or a per-address BHR (P): a BHR table indexed by a portion of the PC bits. Each BHR is dedicated to one particular branch; use the current branch's PC to locate one BHR, then read/update that BHR.
[Figure: one global BHR, read and updated by all branches, versus a PC-indexed BHR table containing multiple BHRs, each read and updated by one particular branch]

2-level Predictors: PHT (Pattern History Table)
- Each entry in the PHT contains an n-bit history/counter predictor.
- We can have just one PHT, indexed by the BHR (g).
- Or a per-address PHT (a set of PHTs) (p): use the PC to locate a PHT first, then use the BHR to locate one particular entry. Each PHT is dedicated to one particular branch.
[Figure: one global PHT indexed by BHR bits, versus multiple PHTs (GAp) selected by the PC and indexed by BHR bits; each entry is an n-bit history/counter predictor]

xAy predictor - GAg
Yeh and Patt proposed the 2-level predictor xAy:
- A means adaptive
- x: BHR organization; y: PHT organization
- G/g: global; P/p: per-address (individual)
e.g. GAg: global BHR with global PHT
A variation of GAg: gshare by McFarling, where the BHR bits are combined with PC bits so that the index of the PHT is randomized.
[Figure: GAg, where the bits of one global BHR index one global PHT; gshare, where BHR bits and PC bits together index the PHT]

xAy predictor - PAg
PAg predictor -> per-address BHR (local BHR) with a single global PHT (the BHRs now form a table: a table of BHRs)
-> use the PC as a tag to match the instruction address to a specific local BHR
Surprisingly, the BHR alone, without the PC, can improve the prediction success rate if the PHT is big (> 4K entries) and the BHR is big (> 12 bits).
[Figure: the PC selects a BHR from the BHR table; that BHR indexes the global PHT, which gives the prediction]
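A minimal PAg sketch under simplifying assumptions: the BHR table is indexed directly by PC bits with no tag check, the history is 4 bits, and the PHT holds 2-bit counters that start weakly not taken. All names and sizes are illustrative.

```python
class PAgPredictor:
    """Per-address BHRs feeding one global PHT of 2-bit counters (untagged sketch)."""

    def __init__(self, bhr_entries=16, history_bits=4):
        self.history_bits = history_bits
        self.bhrs = [0] * bhr_entries            # one local history per table slot
        self.pht = [1] * (1 << history_bits)     # shared pattern history table

    def _slot(self, pc):
        return (pc >> 2) % len(self.bhrs)

    def predict(self, pc):
        return self.pht[self.bhrs[self._slot(pc)]] >= 2

    def update(self, pc, taken):
        slot = self._slot(pc)
        h = self.bhrs[slot]
        self.pht[h] = min(3, self.pht[h] + 1) if taken else max(0, self.pht[h] - 1)
        self.bhrs[slot] = ((h << 1) | int(taken)) & ((1 << self.history_bits) - 1)


def mispredictions(predictor, pc, outcomes):
    """Return one flag per outcome: True where the prediction was wrong."""
    flags = []
    for taken in outcomes:
        flags.append(predictor.predict(pc) != taken)
        predictor.update(pc, taken)
    return flags


# A loop-closing branch that is taken three times and then falls through.
flags = mispredictions(PAgPredictor(), 0x400, [True, True, True, False] * 20)
print(sum(flags), sum(flags[40:]))
```

Because four bits of local history identify the position inside the four-outcome pattern, the predictor learns the loop exit as well as the taken iterations once each history pattern has been seen.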

xAy predictor - PAp
PAp predictor: per-address BHR with per-address PHT
-> use the PC as a tag to match the instruction address to a BHR and a PHT, and then use the BHR to select the PHT entry
[Figure: the PC selects a BHR from the BHR table and one PHT from the set of PHTs; the BHR indexes that PHT, which gives the prediction]

Hybrid/Tournament Predictor
S. McFarling, "Combining Branch Predictors", WRL Technical Note TN-36, June 1993.
Each predictor has its own advantage and works better than the other in certain situations.
-> combine two different predictors to create a better, i.e. more accurate, predictor
-> this needs a predictor of predictors: a meta-predictor (combining predictor)
[State diagram: Strong A, Weak A, Weak B, Strong B; "A wrong & B right" moves toward B, "A right & B wrong" moves toward A]
e.g. a 2-bit saturating counter as the meta-predictor, choosing one of two predictors: local and gshare.
Recall how a 2-bit saturating counter works: two consecutive false predictions change the selected predictor.
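The meta-predictor's state machine can be sketched as a 2-bit saturating counter over the four states named on the slide. The class name and the numeric encoding (0 = Strong A through 3 = Strong B) are assumptions for illustration.

```python
class Chooser:
    """2-bit meta-predictor: states 0, 1 select predictor A; states 2, 3 select predictor B."""

    def __init__(self, state=0):
        self.state = state

    def select_b(self):
        return self.state >= 2

    def update(self, a_correct, b_correct):
        # Only disagreements in correctness move the counter.
        if a_correct == b_correct:
            return
        if b_correct:
            self.state = min(3, self.state + 1)   # A wrong & B right: toward B
        else:
            self.state = max(0, self.state - 1)   # A right & B wrong: toward A


chooser = Chooser(state=0)                 # start at Strong A
chooser.update(a_correct=False, b_correct=True)
after_one = chooser.select_b()             # Weak A: still selects A
chooser.update(a_correct=False, b_correct=True)
after_two = chooser.select_b()             # Weak B: now selects B
print(after_one, after_two)
```

When both predictors are right or both are wrong, there is no evidence favoring either one, so the chooser stays where it is.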

Branch Prediction - Alpha 21264
[Figure: tournament predictor. Local predictor (PAg): the PC selects one of 1024 10-bit BHRs, which indexes a table of 3-bit counters. Global predictor (GAg): a 12-bit path history (BHR of the last 12 branches) indexes 4096 2-bit saturating counters.]
Different from McFarling's.

Branch Prediction
- Tournament with meta-predictor
- Aliasing: within the same process; between threads
- Effectiveness of the BHR
- How about path history? Instead of T or N, one may record (a portion of) the addresses followed.

Branch Target Buffer (BTB)
Recall: prediction alone does not remove stalls due to control hazards.
  branch: IF ID …
  next:      IF
To avoid a stall, the PC for the next instruction should be loaded even before knowing whether the fetched instruction is a branch.
-> Fetch the target address at instruction fetch.

Branch Target Buffer
To reduce restart delay, use a Branch Target Buffer (BTB):
- a small, fast cache holding target addresses
- indexed with the PC of the conditional branch instruction
- each entry contains the branch's PC as the tag, to guarantee that the current instruction is the branch buffered in the BTB
- accessed at the same time as instruction fetch
- sometimes an extension of the I-cache


BTB operation
- Entry found in the BTB and the target is correct (prediction is right): go to the target.
- Entry found in the BTB and the target is incorrect (prediction is wrong): mispredicted; needs a recovery and an update of the BTB.
- No such entry in the BTB, the next PC is fetched, and the next PC is the correct PC: execute the next instruction.
- No such entry in the BTB, the next PC is fetched, and the next PC is not the correct PC: needs a recovery and an update of the BTB.
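The four cases above can be sketched with a direct-mapped, tagged table. Everything here is an illustrative assumption: the class and function names, 4-byte instructions (next PC = PC + 4), and the policy of evicting an entry when its branch turns out not taken (a real design might instead keep the entry and update its prediction bits).

```python
class BranchTargetBuffer:
    """Direct-mapped BTB sketch; each slot holds (tag_pc, target) or None."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [None] * entries

    def _slot(self, pc):
        return (pc >> 2) % self.entries

    def lookup(self, pc):
        entry = self.table[self._slot(pc)]
        if entry is not None and entry[0] == pc:   # tag check: is this the buffered branch?
            return entry[1]
        return None

    def update(self, pc, target):
        self.table[self._slot(pc)] = (pc, target)

    def evict(self, pc):
        self.table[self._slot(pc)] = None


def next_fetch_pc(btb, pc):
    """At fetch: use the buffered target on a hit, otherwise fall through."""
    target = btb.lookup(pc)
    return target if target is not None else pc + 4


def resolve(btb, pc, taken, target):
    """When the branch resolves: fix up the BTB and report whether recovery is needed."""
    predicted = next_fetch_pc(btb, pc)
    actual = target if taken else pc + 4
    if taken:
        btb.update(pc, target)     # install or refresh the entry
    elif btb.lookup(pc) is not None:
        btb.evict(pc)              # assumed policy: drop entries for not-taken branches
    return predicted != actual


btb = BranchTargetBuffer(entries=16)
first = resolve(btb, 0x100, taken=True, target=0x200)    # miss, taken: recovery
second = resolve(btb, 0x100, taken=True, target=0x200)   # hit, correct target
print(first, second, hex(next_fetch_pc(btb, 0x100)))
```

The tag comparison in `lookup` is what lets the fetch stage redirect safely before decode: a hit implies the instruction at this PC is the buffered branch.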

e.g. BTB as a cache tagged with branch instruction addresses, accessed with the PC as index
Entry: target address, prediction (n/t), (target instruction)
[Figure: pipeline timing. The branch is in IF and ID while the target is in TIF and TID. In IF, the I-cache and BTB are accessed; on a BTB hit, the PC is updated based on the branch prediction. Once the branch is resolved, the actual decision is checked; if the prediction was wrong, fetch follows the reverse of the prediction and the BTB is updated.]

Branch Target Buffer: optimizing the BTB design
- How large? 1K to 8K entries?!?
- When to put a branch instruction into the BTB (no need to put in an instruction executed only once):
  - the first time the branch instruction is executed, or
  - the first time the branch is TAKEN: better hit rate
- When to kick out (replacement): doesn't matter much; the usual LRU is OK

Branch Folding
Store the target instruction in the BTB instead of the target address -> branch folding: a 0-cycle branch
e.g. without folding: fetch the branch instruction and the target address from the BTB
  branch: IF ID
  target:    IF ID EX

Branch Folding
Assuming separate decode for branch and other instructions:
- IF: fetch the branch and the target instruction (from the BTB)
- ID: decode both the branch and the target instruction
- If the prediction is correct, proceed to the EX stage; otherwise, fetch the correct target
Note: still a 2-cycle delay if the prediction is wrong, but a free branch if the prediction is correct.
Predicated instructions: generalized branch folding with a tag of prediction bit; a compiler-based approach, as in Intel EPIC.

Indirect Jump
- Unlike most conditional branches, which are PC-relative with a constant offset, some branches hold the target address in a register or memory location.
- For such indirect branches, the target address changes frequently at run time.
    jr register   (the register contains the target)
- A branch prediction scheme based on branch history does not work well.

Return Address Stack (RAS)
The return address changes as calls come from different places:
- the return address of the previous jump may have nothing to do with the current instance of the jump to the return address
- a BTB with the last jump address as the target does not work well: only 51.8% prediction success with SPECint95, and even worse with speculative execution
- the majority of indirect branches are returns: 85% of indirect jumps = return
Return Address Stack (implemented as a circular buffer in hardware):
- when fetching a call instruction, push the next address onto the stack
- when fetching a return, pop the address from the stack before the return gets executed; the popped value is speculated as the target address
Note: the value could be wrong because the hardware stack has limited capacity, and because of context switches.

Return Address Stack (RAS)
A small, fast hardware stack cache with the most recent return address on top. On a hit (i.e. the instruction is a return), update the PC with the address from the stack.
Notes:
1. With some instruction formats, a cache with tags is not necessary.
2. How many slots in the RAS? Maximum call depth? Intel Pentium-3: 8 slots; Alpha 21264: 32 slots.
Example:
  pc:     call xx
  pc+4:   add yy
  pc+100: call zz
  pc+104: sub yy
  ret
[Figure: stack contents as the calls push PC+4 and PC+104 and the returns pop them; aa and bb are older entries below]
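A minimal sketch of a return address stack built on a circular buffer. The class name, the wrap-around policy (a push onto a full stack silently overwrites the oldest entry), and the example addresses are assumptions for illustration.

```python
class ReturnAddressStack:
    """Fixed-size circular buffer; overflow overwrites the oldest return address."""

    def __init__(self, slots=8):
        self.buf = [0] * slots
        self.top = 0

    def push(self, return_addr):
        # On fetching a call: advance the top pointer (wrapping) and store pc + 4.
        self.top = (self.top + 1) % len(self.buf)
        self.buf[self.top] = return_addr

    def pop(self):
        # On fetching a return: the top entry is the speculated target.
        addr = self.buf[self.top]
        self.top = (self.top - 1) % len(self.buf)
        return addr


ras = ReturnAddressStack(slots=8)
pc = 0x1000
ras.push(pc + 4)       # call at pc
ras.push(pc + 104)     # nested call at pc + 100
inner_return = ras.pop()
outer_return = ras.pop()
print(hex(inner_return), hex(outer_return))

# Limited capacity: with 2 slots, a third nested call overwrites the first address.
small = ReturnAddressStack(slots=2)
for addr in (1, 2, 3):
    small.push(addr)
popped = [small.pop(), small.pop(), small.pop()]
print(popped)
```

In the two-slot case the third pop returns a stale value instead of the first return address, which illustrates why the popped value is only a speculation when the call depth exceeds the number of slots.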

Integrated instruction fetch units
An aggressive fetch unit is important in a multi-issue superscalar processor.
- Integrated branch prediction: do both target prediction and direction prediction
- Instruction prefetch: fetch ahead, beyond the cache line
- Instruction memory access and buffering: memory provides a smooth instruction flow to the fetch unit
Trace cache
- Previously, the fetch boundary was the first branch in each cycle.
- The I-cache includes "traces" rather than consecutive blocks; in each cycle, instructions are fetched across multiple branches.
