Adapted from Computer Organization and Design, Patterson & Hennessy, UCB
ECE232: Hardware Organization and Design
Part 13: Branch prediction (Chapter 4/6)

Adapted from Computer Organization and Design, Patterson & Hennessy, UCB, Kundu, UMass Koren

Branch Instructions Cause Control Hazards
[Figure: pipeline timing diagram in instruction order; rows labeled beq, lw, Inst 3, Inst 4, each passing through IM, Reg, ALU, DM, Reg; stage labels F, D, EX, M, W; a jr label also appears]

BEQ resolved during the MEM stage
[Figure: pipelined datapath with the PC, the PC + 4 adder, instruction memory, register file, sign extend, shift left 2, branch-target adder, ALU, data memory, the IF/ID, ID/EX, EX/MEM and MEM/WB pipeline registers, and the Control, Branch and PCSrc signals]

One Way to “Fix” a Control Hazard
[Figure: pipeline timing diagram in instruction order for beq, lw, Inst 3, with stall cycles inserted after beq]
- Fix the branch hazard by waiting: introduce stalls

Reducing branch penalty through HW design

Reducing Control Hazards’ Penalties
- Stalls hurt performance
- Deeper pipelines have higher penalties
1. Move the decision point as early in the pipeline as possible: this reduces the number of stalls at the cost of additional hardware
2. Delay the decision (requires compiler support): “Delayed Branch”. Not effective for deeper pipelines, which require more than one delay slot to be filled
3. Predict the outcome of the branch

          beq $1,$2,NEXT
          add $4,$3,$5
          sub $7,$2,$8
    NEXT:

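To make the cost of stalling concrete, the sketch below estimates the average CPI when every branch simply stalls the pipeline. The numbers are assumptions for illustration, not figures from the slides: a 20% branch fraction, a 3-cycle stall when the branch is resolved in MEM, and a 1-cycle stall when the decision is moved to ID in a classic 5-stage pipeline. The helper name `effective_cpi` is hypothetical.

```python
def effective_cpi(base_cpi, branch_fraction, penalty_cycles):
    """Average CPI when every branch costs `penalty_cycles` stall cycles."""
    return base_cpi + branch_fraction * penalty_cycles


# Hypothetical workload: 20% of instructions are branches, ideal CPI of 1.
BRANCH_FRACTION = 0.20

# Assumed penalties for a classic 5-stage pipeline.
cpi_resolve_in_mem = effective_cpi(1.0, BRANCH_FRACTION, 3)  # decide in MEM
cpi_resolve_in_id = effective_cpi(1.0, BRANCH_FRACTION, 1)   # decide in ID

print(f"resolve in MEM: CPI = {cpi_resolve_in_mem:.2f}")
print(f"resolve in ID:  CPI = {cpi_resolve_in_id:.2f}")
```

Under these assumptions, moving the decision point earlier lowers the stall overhead from 0.6 to 0.2 cycles per instruction, which is the motivation for option 1 above.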
Branch Prediction
- Easiest: static prediction
  - Always taken, always not taken
  - Opcode based
  - Displacement based (forward not taken, backward taken)
  - Compiler directed (branch likely, branch not likely)
- Dynamic prediction: prediction per branch in the program
  - 1-bit predictor: remember last taken/not taken per branch
  - Use a branch-history table (BHT) with 1-bit entries
  - Use part of the PC (low-order bits) to index the table. Why?
  - Multiple branches may share the same bit
  - Invert the bit if the prediction is wrong
[Figure: branch PC indexing a BHT with entries Predictor 0, Predictor 1, ..., Predictor 127]

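The indexing scheme and the sharing of one bit by several branches can be sketched as follows. This is a minimal model, not a hardware description: the class name `OneBitBHT` is hypothetical, the 128 entries follow the figure, and the index assumes 4-byte aligned instructions, so the two always-zero low bits of the PC are dropped before masking.

```python
class OneBitBHT:
    """1-bit branch-history table indexed by low-order PC bits (sketch)."""

    def __init__(self, entries=128):
        # The entry count must be a power of two so a mask selects the index.
        assert entries & (entries - 1) == 0
        self.entries = entries
        self.bits = [0] * entries  # 0 = predict not taken, 1 = predict taken

    def index(self, pc):
        # Assumption: 4-byte aligned instructions, so skip the two zero bits.
        return (pc >> 2) & (self.entries - 1)

    def predict(self, pc):
        return self.bits[self.index(pc)] == 1

    def update(self, pc, taken):
        # For a 1-bit entry, "invert on a wrong prediction" is equivalent
        # to storing the actual outcome.
        self.bits[self.index(pc)] = 1 if taken else 0


bht = OneBitBHT()
branch_a = 0x1000
branch_b = 0x1000 + 128 * 4  # same low-order index bits as branch_a

bht.update(branch_a, taken=True)
shares_entry = bht.index(branch_a) == bht.index(branch_b)
prediction_for_b = bht.predict(branch_b)
print(shares_entry, prediction_for_b)
```

Here `branch_b` has never executed, yet it is predicted taken because it aliases to the entry trained by `branch_a`.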
Branch Prediction
- 1-bit predictor
  - Backward branches for loops will be mispredicted twice
  - Example: if a loop branch is taken 9 times in a row and then not taken once, what is the prediction accuracy?
  - Misprediction at the first and last loop iteration => 80% prediction accuracy, although the branch is taken 90% of the time
- Modern processors: multiple instructions issued per cycle, so more branch hazards occur per cycle
  - Cost of a mispredicted branch goes up
  - Pentium II: 3 instructions issued per cycle, 12+ cycle misprediction penalty
  - Huge penalty for a misfetched path following a branch
[Figure: branch outcome sequences T ... T T T T N T T ... N]

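The 80% figure can be checked with a short simulation. The sketch below assumes the loop is executed repeatedly, so the bit left behind by the final not-taken outcome causes the misprediction on the first iteration of the next run; the function name `one_bit_accuracy` and the choice of 100 repetitions are illustrative.

```python
def one_bit_accuracy(outcomes, initial_taken=False):
    """Fraction of correct predictions made by a single 1-bit predictor."""
    state = initial_taken  # the last outcome seen is the next prediction
    correct = 0
    for taken in outcomes:
        if state == taken:
            correct += 1
        state = taken  # remember only the most recent outcome
    return correct / len(outcomes)


# One execution of the loop: the branch is taken 9 times, then falls through.
loop_run = [True] * 9 + [False]
outcomes = loop_run * 100  # the loop is entered 100 times

accuracy = one_bit_accuracy(outcomes)
taken_fraction = sum(outcomes) / len(outcomes)
print(f"taken {taken_fraction:.0%} of the time, predicted {accuracy:.0%}")
```

Each run of the loop contributes two mispredictions out of ten branches, which gives the 80% accuracy stated above.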
2-bit Branch Prediction
- 4 states instead of 2, allowing for more information about tendencies
- A prediction must miss twice before it is changed
- Good for backward branches of loops
[Figure: state diagram of a 2-bit saturating counter, with two Predict taken states and two Predict not taken states, and transitions labeled T and N]
[Figure: branch outcome sequence T ... T T T T T T T ... N]

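A 2-bit saturating counter can be sketched in a few lines. The encoding is an assumption chosen for illustration (states 0 and 1 predict not taken, states 2 and 3 predict taken), as are the class name `TwoBitCounter` and the starting state of 3. The example reuses the loop pattern of 9 taken outcomes followed by 1 not taken.

```python
class TwoBitCounter:
    """2-bit saturating counter: 0,1 predict not taken; 2,3 predict taken."""

    def __init__(self, state=3):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Count up on taken, down on not taken, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def accuracy(counter, outcomes):
    correct = 0
    for taken in outcomes:
        if counter.predict() == taken:
            correct += 1
        counter.update(taken)
    return correct / len(outcomes)


loop_run = [True] * 9 + [False]
two_bit_accuracy = accuracy(TwoBitCounter(state=3), loop_run * 100)
print(f"2-bit accuracy on the loop pattern: {two_bit_accuracy:.0%}")
```

Starting from the saturated taken state, the single not-taken outcome at loop exit only weakens the counter, so the first iteration of the next run is still predicted taken. That leaves one misprediction per run instead of two.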
Branch History Table - BHT
- BHT: 2 bits by N entries (e.g. 4K entries)
- Uses low-order bits of the branch PC to choose an entry
- Plot misprediction instead of prediction
[Figure: branch PC indexing a BHT with entries Predictor 0, Predictor 1, ..., Predictor 4095, each holding 2 bits (e.g. 01)]

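The size and indexing of such a table can be worked out directly. The sketch below assumes 4-byte aligned instructions (so the index skips the two low PC bits) and initializes every entry to the 01 value shown in the figure; the names `bht_index`, `predict` and `update` are hypothetical.

```python
ENTRIES = 4096          # 4K entries, as in the example above
BITS_PER_ENTRY = 2

index_bits = ENTRIES.bit_length() - 1            # PC bits used as the index
storage_bytes = ENTRIES * BITS_PER_ENTRY // 8    # total predictor storage

# Every entry starts at 01 (weakly not taken in the assumed encoding).
counters = [0b01] * ENTRIES


def bht_index(pc):
    # Assumption: 4-byte aligned instructions, so drop the two zero bits.
    return (pc >> 2) & (ENTRIES - 1)


def predict(pc):
    return counters[bht_index(pc)] >= 2


def update(pc, taken):
    i = bht_index(pc)
    counters[i] = min(3, counters[i] + 1) if taken else max(0, counters[i] - 1)


print(index_bits, "index bits,", storage_bytes, "bytes of storage")
```

With these parameters the table uses 12 PC bits as its index and occupies 1 KiB, and two branches collide only if their addresses agree in those 12 bits.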
Is a Branch Predictor Enough?
- When is using branch prediction beneficial?
  - Clearly when the outcome is known later than the target
  - Otherwise: if we predict the branch is taken (and suppose the prediction is correct), what is the target address?
- Need a mechanism to provide the target address as well
  - Use a Branch Target Buffer (BTB) that includes the target address
- Can we eliminate the one-cycle delay for the 5-stage pipeline?
  - Need to fetch from the branch target immediately after the branch was fetched

Branch Target Buffer (BTB)
- The BTB is a cache that contains the predicted PC value instead of whether the branch will take place or not (e.g. loop address)
- Is the current instruction a branch?
  - The BTB provides the answer before the current instruction is decoded and therefore enables fetching to begin after the IF stage (for a branch)
- What is the branch target?
  - The BTB provides the branch target if the prediction is a taken branch (for not-taken branches the target is simply PC+4)

BTB
[Figure: BTB]

BTB operations
- BTB hit, prediction taken → 0 cycle delay
- BTB hit, misprediction: ≥ 2 cycle penalty; correct the BTB
- BTB miss, branch: ≥ 1 cycle penalty (detected at the ID stage and entered in the BTB)
[Figure: flowchart spanning the IF, ID and EX stages]
  - Send PC to memory and branch-target buffer
  - Entry found in branch-target buffer?
    - No: Is instruction a taken branch?
      - No: Normal instruction execution
      - Yes: Enter branch instruction address and next PC into branch-target buffer
    - Yes: Send out predicted PC, then: Taken branch?
      - No: Mispredicted branch, kill fetched instruction; restart fetch at other target; update target buffer
      - Yes: Branch correctly predicted; continue execution with no stalls

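The three cases above can be modeled with a dictionary that maps a branch PC to its predicted target. This is a sketch under stated assumptions rather than a description of any particular processor: the function name `btb_step` is hypothetical, the returned penalties are the minimum values from the list (0, 2 and 1 cycles), only taken branches are stored, and an entry whose branch turns out not taken is removed.

```python
def btb_step(btb, pc, is_branch, taken, target):
    """Process one fetched instruction; return the minimum penalty in cycles."""
    predicted = btb.get(pc)
    if predicted is not None:
        # BTB hit: fetch continued from the predicted PC.
        if is_branch and taken and predicted == target:
            return 0  # correctly predicted, no stall
        # Misprediction: kill the fetched instruction and correct the BTB.
        if is_branch and taken:
            btb[pc] = target  # taken, but to a different target
        else:
            del btb[pc]       # assumption: drop entries for not-taken branches
        return 2
    # BTB miss: the instruction was fetched as if it were not a taken branch.
    if is_branch and taken:
        btb[pc] = target      # enter branch address and next PC
        return 1
    return 0                  # normal instruction execution


btb = {}
# A loop branch at 0x40 that jumps back to 0x10 three times, then exits.
penalties = [btb_step(btb, 0x40, True, True, 0x10) for _ in range(3)]
exit_penalty = btb_step(btb, 0x40, True, False, 0x10)
print(penalties, exit_penalty)
```

The first encounter misses in the BTB and pays the 1-cycle penalty, later iterations hit with no delay, and the loop exit is a hit with a wrong prediction that pays the 2-cycle penalty.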
Branch Prediction Summary
- The better we predict, the lower the penalty we might incur
- 2-bit predictors capture tendencies well
- Correlating predictors improve accuracy, particularly when combined with 2-bit predictors
- Accurate branch prediction does no good if we don’t know there was a branch to predict
- The BTB identifies branches in the IF stage
- The BTB combined with a branch prediction table identifies branches to predict, and predicts them well