branch.1 10/14 Branch Prediction: Static and Dynamic Branch Prediction Techniques
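branch.2 10/14 Control Flow Penalty: Why Branch Prediction
[Figure: pipeline from PC and Fetch (I-cache) through Fetch Buffer, Decode, Issue Buffer, Execute (Func. Units), and Result Buffer to Commit (Arch. State); "Next fetch started" is marked at the front of the pipeline and "Branch executed" several stages later]
–Modern processors have multiple pipeline stages between next-PC calculation and branch resolution
–Work lost if the pipeline makes a wrong prediction ~ loop length x pipeline width, where the loop is the path from next fetch to branch execution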
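The estimate above can be made concrete with a small calculation. This is only a sketch: the function name and the numbers (a 10-stage loop, a 4-wide pipeline) are illustrative assumptions, not values from the slide.

```python
def wasted_issue_slots(loop_length, pipeline_width):
    """Rough upper bound on instruction slots discarded per misprediction.

    loop_length: stages between next-PC calculation and branch resolution.
    pipeline_width: instructions the machine can fetch/issue per cycle.
    Both parameters are hypothetical inputs chosen for illustration.
    """
    return loop_length * pipeline_width

# Hypothetical machine: 10 stages from fetch to branch resolution, 4-wide.
example_slots = wasted_issue_slots(loop_length=10, pipeline_width=4)
print(example_slots)
```

Under these assumed numbers, one wrong prediction can discard up to 40 instruction slots, which is why the penalty grows with both depth and width.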
branch.3 10/14 Branch Penalties in a Superscalar are extensive
branch.4 10/14 Reducing Control Flow Penalty
Software solutions
–Minimize branches: loop unrolling increases the run length
Hardware solutions
–Find something else to do: delay slots
–Speculate: dynamic branch prediction, with speculative execution of instructions beyond the branch
branch.5 10/14 Branch Prediction
Motivation:
–Branch penalties limit performance of deeply pipelined processors
–Much worse for superscalar processors
–Modern branch predictors have high accuracy (>95%) and can reduce branch penalties significantly
Required hardware support:
–Dynamic prediction HW: branch history tables, branch target buffers, etc.
–Mispredict recovery mechanisms: keep computation results separate from commit; kill instructions following the branch; restore state to the state following the branch
branch.6 10/14 Static Branch Prediction: Review
–Overall probability a branch is taken is ~60-70%, but it depends on direction: for a conditional branch such as JZ, backward branches are taken ~90% of the time, forward branches ~50%
–ISA can attach preferred direction semantics to branches, e.g., Motorola MC88110: bne0 (preferred taken), beq0 (not taken)
–ISA can allow arbitrary choice of statically predicted direction, e.g., HP PA-RISC, Intel IA-64; typically reported as ~80% accurate
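The backward/forward asymmetry suggests a simple direction-based static rule. The sketch below uses the common "backward taken, forward not taken" heuristic; the slide does not name this rule, so treat it as an assumption, and the function name is hypothetical.

```python
def predict_taken_static(branch_pc, target_pc):
    """Static heuristic (assumed, not from the slide): predict taken
    for backward branches, which usually close loops, and not taken
    for forward branches."""
    return target_pc < branch_pc

# A loop-closing branch at 0x120 jumping back to 0x100 is predicted taken.
backward_prediction = predict_taken_static(branch_pc=0x120, target_pc=0x100)
# A forward skip from 0x120 to 0x140 is predicted not taken.
forward_prediction = predict_taken_static(branch_pc=0x120, target_pc=0x140)
```

No run-time state is needed, which is what makes the scheme static: the prediction depends only on the instruction encoding.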
branch.7 10/14 Branch Prediction Needs
Target address generation
–Get register: PC, link reg., GP reg.
–Calculate: +/- offset, auto inc/dec
–Target speculation
Condition resolution
–Get register: condition code reg., count reg., other reg.
–Compare registers
–Condition speculation
branch.8 10/14 Target address generation takes time
branch.9 10/14 Condition resolution takes time
branch.10 10/14 Solution: Branch speculation
branch.11 10/14 Branch Prediction Schemes
1. 2-bit Branch-Prediction Buffer
2. Branch Target Buffer
3. Correlating Branch Prediction Buffer
4. Tournament Branch Predictor
5. Integrated Instruction Fetch Units
6. Return Address Predictors (for subroutines, Pentium, Core Duo)
7. Predicated Execution (Itanium)
branch.12 10/14 Dynamic Branch Prediction
–Learning based on past behavior
–Incoming stream of addresses
–Fast outgoing stream of predictions
–Correction information returned from pipeline
[Figure: Branch Predictor with History Information; inputs: Incoming Branches { Address } and Corrections { Address, Value }; output: Prediction { Address, Value }]
branch.13 10/14 Branch History Table (BHT)
The BHT is a table of "predictors"; each branch is given its own predictor
–Could be 1-bit or more
–Indexed by PC address of branch
Problem: in a loop, a 1-bit BHT causes two mispredictions per loop execution (avg is 9 iterations before exit):
–End-of-loop case: when it exits the loop
–First time through the loop the next time around: it predicts exit instead of looping
Most schemes use at least 2-bit predictors
Performance = ƒ(accuracy, cost of misprediction)
–Misprediction: flush reorder buffer
In fetch stage of branch:
–Use predictor to make prediction
When branch completes:
–Update corresponding predictor
[Figure: Branch PC indexes a table of predictors, Predictor 0, Predictor 1, ..., Predictor 7]
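The two-mispredictions-per-loop claim can be checked with a short simulation. This is a minimal sketch: the loop shape (taken 8 times, then not taken, for 9 iterations) and the initial not-taken state are assumptions chosen to match the slide's average case.

```python
def one_bit_mispredictions(outcomes, initial_taken=False):
    """Count mispredictions of a 1-bit predictor for a single branch.

    The predictor simply remembers the last outcome and predicts that
    the branch will do the same thing again.
    """
    last_taken = initial_taken
    misses = 0
    for taken in outcomes:
        if last_taken != taken:
            misses += 1
        last_taken = taken  # 1-bit state: overwrite with the actual outcome
    return misses

# One loop execution: branch taken 8 times, then falls through on exit.
loop_run = [True] * 8 + [False]
# Execute the whole loop three times in a row.
one_bit_misses = one_bit_mispredictions(loop_run * 3)
print(one_bit_misses)
```

Each execution of the loop costs one misprediction on entry (the bit still says "exit" from last time) and one on exit, so three executions give six mispredictions out of 27 branches.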
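branch.14 10/14 Branch History Table Organization
[Figure: the fetch PC, with its low "00" bits dropped, supplies a k-bit BHT index into a 2^k-entry BHT with 2 bits/entry, which outputs Taken/¬Taken; in parallel the I-cache delivers the instruction, whose opcode answers "Branch?" and whose offset is added to the PC to form the target PC]
–Target PC calculation takes time
–4K-entry BHT, 2 bits/entry, ~80-90% correct predictions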
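The indexing in the figure can be sketched as follows, assuming 4-byte instructions (so the two low PC bits are always 00) and k = 12 for the 4K-entry table. The function and constant names are hypothetical.

```python
BHT_INDEX_BITS = 12  # 2**12 = 4096 entries, matching the 4K-entry example

def bht_index(pc, index_bits=BHT_INDEX_BITS):
    """Drop the two always-zero low bits, then keep the next k bits."""
    return (pc >> 2) & ((1 << index_bits) - 1)

# Two branches whose PCs differ by 4096 instructions (4096 * 4 bytes)
# map to the same entry and therefore share one predictor (aliasing).
index_a = bht_index(0x1000)
index_b = bht_index(0x1000 + 4096 * 4)
```

Because the table is not tagged, aliasing branches silently share state; this is one reason accuracy depends on table size.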
branch.15 10/14 2-bit Dynamic Branch Prediction
–Better solution: a 2-bit scheme that changes the prediction only after mispredicting twice
–Adds hysteresis to the decision-making process
–More accurate than 1-bit
[Figure: four-state diagram with two Predict Taken states (green: go, taken) and two Predict Not Taken states (red: stop, not taken), connected by T and NT transitions]
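One common realization of the 2-bit scheme is a saturating counter, sketched below. The slide's state diagram may differ in its exact transitions, and the loop shape and initial state (3, strongly taken) are assumptions for illustration.

```python
def two_bit_mispredictions(outcomes, counter=3):
    """Count mispredictions of a 2-bit saturating counter.

    Counter values 0-1 predict not taken, 2-3 predict taken.
    The counter moves one step toward each actual outcome, so a single
    surprise does not flip a strongly held prediction.
    """
    misses = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            misses += 1
        if taken:
            counter = min(counter + 1, 3)  # saturate at strongly taken
        else:
            counter = max(counter - 1, 0)  # saturate at strongly not taken
    return misses

# A loop branch taken 8 times then not taken, executed three times.
outcomes = ([True] * 8 + [False]) * 3
two_bit_misses = two_bit_mispredictions(outcomes)
print(two_bit_misses)
```

With this pattern only the loop exit is mispredicted, one miss per loop execution, because the single not-taken outcome moves the counter from 3 to 2 and the prediction stays "taken" on re-entry.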
branch.16 10/14 BTB: Branch Address at Same Time as Prediction
–Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken)
–Only predicted-taken branches and jumps are held in the BTB
–Next PC is determined before the branch is fetched and decoded
–Later: check the prediction; if wrong, kill the instruction and update the prediction state bits (BPb)
[Figure: the PC of the instruction in FETCH is compared (=?) against the stored Branch PC; each entry holds Branch PC, Predicted PC, and prediction state bits]
–Yes (match): instruction is a branch; use the predicted PC as next PC
–No (no match): branch not predicted; proceed normally (Next PC = PC+4)
branch.17 10/14 BTB Contains Only Branch & Jump Instructions
–The BTB holds information for branch and jump instructions only; it is not updated for other instructions
–For all other instructions the next PC is PC+4
–This is achieved without decoding the instruction
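The lookup behavior can be sketched with a dictionary keyed by branch PC. This is a simplified model, not a hardware description: the class and method names are hypothetical, the table is unbounded, and prediction state bits are omitted.

```python
class BranchTargetBuffer:
    """Toy BTB: maps the PC of a predicted-taken branch to its target."""

    def __init__(self):
        self.entries = {}  # branch PC -> predicted target PC

    def next_pc(self, pc):
        # Hit: treat the instruction as a taken branch, without decoding it.
        # Miss: fall through to the sequential PC (4-byte instructions).
        return self.entries.get(pc, pc + 4)

    def update(self, pc, taken, target):
        # Called only when a branch or jump resolves, never for other
        # instructions. Keep only predicted-taken branches in the table.
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)

btb = BranchTargetBuffer()
miss_next = btb.next_pc(0x200)           # not in BTB yet: PC + 4
btb.update(0x200, taken=True, target=0x100)
hit_next = btb.next_pc(0x200)            # now redirected to the target
```

The key point is that `next_pc` uses only the fetch address, so the redirect is available in the same cycle as the fetch.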
branch.18 10/14 Combining BTB and BHT
–BTB entries are considerably more expensive than BHT entries, but the BTB redirects fetch earlier in the pipeline and can accelerate indirect branches (JR)
–The BHT can hold many more entries and is more accurate
–The BHT, in a later pipeline stage, corrects when the BTB misses a predicted-taken branch
–BTB/BHT are only updated after the branch resolves in the E stage
[Figure: pipeline stages, with the BTB consulted early and the BHT in a later stage]
A: PC Generation/Mux
P: Instruction Fetch Stage 1
F: Instruction Fetch Stage 2
B: Branch Address Calc/Begin Decode
I: Complete Decode
J: Steer Instructions to Functional Units
R: Register File Read
E: Integer Execute
branch.19 10/14 Subroutine Return Stack
–Small stack to accelerate subroutine returns; more accurate than BTBs
–Push the return address when a function call is executed
–Pop the return address when a subroutine return is decoded
[Figure: stack holding &nexta, &nextb, &nextc; k entries (typically k=8-16)]
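The push/pop behavior can be sketched with a bounded deque. The class name, the choice to drop the oldest entry on overflow, and the 4-byte call instruction are assumptions for illustration.

```python
from collections import deque

class ReturnAddressStack:
    """Toy return stack with k entries; the oldest entry is lost on overflow."""

    def __init__(self, k=8):
        self.stack = deque(maxlen=k)  # appending to a full deque drops the oldest

    def on_call(self, call_pc):
        # Push the address of the instruction after the call.
        self.stack.append(call_pc + 4)

    def on_return(self):
        # Pop the predicted return address; None means no prediction.
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack(k=8)
ras.on_call(0x100)   # call a, return to 0x104
ras.on_call(0x200)   # call b, return to 0x204
ras.on_call(0x300)   # call c, return to 0x304
predicted_returns = [ras.on_return(), ras.on_return(), ras.on_return()]
```

Returns are predicted in reverse call order (0x304, 0x204, 0x104), which a BTB cannot do when the same subroutine is called from several sites.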
branch.20 10/14 Mispredict Recovery
In-order execution machines:
–Instructions issued after a branch cannot write back before the branch resolves
–All instructions in the pipeline behind a mispredicted branch are killed
branch.21 10/14 Predicated Execution
Avoid branch prediction by turning branches into conditionally executed instructions:
if (x) then A = B op C else NOP
–If false, then neither store the result nor cause an exception
–Expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have conditional move; PA-RISC can annul any following instruction
–IA-64: 64 1-bit condition fields, selected so that any instruction can be conditionally executed
–This transformation is called "if-conversion"
Drawbacks to conditional instructions:
–Still takes a clock even if "annulled"
–Stall if condition evaluated late
–Complex conditions reduce effectiveness; condition becomes known late in pipeline
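If-conversion can be illustrated at the source level. The sketch below only models the dataflow: Python itself does not emit predicated instructions, the function names are hypothetical, and "op" is taken to be addition, which cannot raise an exception for integers.

```python
def branchy(x, a, b, c):
    """Original form: control flow decides whether A is written."""
    if x:
        a = b + c
    return a

def if_converted(x, a, b, c):
    """If-converted form: always compute B op C, then select.

    The predicate p is 0 or 1; the arithmetic select stands in for a
    conditional move, so there is no branch to predict.
    """
    p = 1 if x else 0
    result = b + c                  # executed even when the predicate is false
    return p * result + (1 - p) * a

same_when_true = branchy(True, 1, 2, 3) == if_converted(True, 1, 2, 3)
same_when_false = branchy(False, 1, 2, 3) == if_converted(False, 1, 2, 3)
```

The converted form does the addition on both paths, which mirrors the drawback listed above: a predicated instruction still takes a clock even when its result is discarded.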
branch.22 10/14 Accuracy v. Size (SPEC89)
branch.23 10/14 Dynamic Branch Prediction Summary
–Prediction is becoming an important part of scalar execution
–Branch History Table: 2 bits for loop accuracy
–Correlation: recently executed branches are correlated with the next branch
–Tournament predictor: devote more resources to competing solutions and pick between them
–Branch Target Buffer: includes branch address and prediction
–Predicated execution can reduce the number of branches and the number of mispredicted branches
–Return address stack for prediction of indirect jumps