Branch prediction Avi Mendelson, 4/2005 1 MAMAS – Computer Architecture Branch Prediction Dr. Avi Mendelson Some of the slides were taken from: Dr. Lihu.

Presentation transcript:

MAMAS – Computer Architecture: Branch Prediction
Dr. Avi Mendelson
Some of the slides were taken from Dr. Lihu Rappoport

Control Hazard and Branch Prediction

Control Hazard on Branches
• The 3 instructions following the branch get into the pipe even if the branch is taken
[Figure: pipeline diagram (IM, Reg, DM, Reg stages per instruction) showing and, beq, sub, sw and the instruction from the branch target]

Static Option 1: Stall
• Stall the pipe when a branch is encountered, until the branch is resolved
• Stall impact: assumptions
  – CPI = 1
  – 20% of instructions are branches
  – Stall 3 cycles on every branch
• CPI new = 1 + 0.2 × 3 = 1.6
  (CPI new = CPI ideal + avg. stall cycles / instr.)
• Execution time per instruction grows by 60%

Static Option 2: Delayed Branch
• Define the branch to take place AFTER the n following instructions
  – HW executes the n instructions following the branch regardless of whether the branch is taken or not
• SW fills the n slots following the branch with instructions that need to be executed regardless of the branch resolution
  – Instructions that are before the branch instruction, or
  – Instructions from the converged path after the branch
• If no independent instructions can be found, put a NOP
Original code:
    r3 = 23
    R4 = R3 + R5
    if (r1 == r2) goto X
    R1 = R4 + R5
    X: R7 = R1
New code:
    if (r1 == r2) goto X
    r3 = 23
    R4 = R3 + R5
    NOP
    R1 = R4 + R5
    X: R7 = R1

Delayed Branch Performance
• Filling 1 delay slot is easy, 2 is hard, 3 is harder
• Assuming we can effectively fill a fraction d of the delay slots
  CPI new = 1 + 0.2 × (3 × (1-d))
• For example, for d = 0.5, we get CPI new = 1.3
• Mixing architecture with micro-architecture
  – New generations require more delay slots
  – Causes compatibility issues between generations

Static Option 3: Predict Not Taken
• Execute instructions from the fall-through (not-taken) path
  – As if there is no branch
  – If the branch is not taken (~50%), no penalty is paid
• If the branch is actually taken
  – Flush the fall-through path instructions before they change the machine state (memory / registers)
  – Fetch the instructions from the correct (taken) path
• Assuming ~50% of branches are not taken on average
  CPI new = 1 + (0.2 × 0.5) × 3 = 1.3

Dynamic Branch Prediction
• Add a Branch Target Buffer (BTB) that predicts (at fetch)
  – Instruction is a branch
  – Branch taken / not-taken
  – Taken branch target
[Figure: BTB lookup. The PC of the instruction in fetch is compared (?=) with the Branch PC field of the BTB entries; each entry also holds Target PC and History. No match: the instruction is not predicted to be a branch. Match: the instruction is predicted to be a branch, and the entry supplies the predicted target and the taken / not-taken prediction.]

BTB
• Allocation
  – Allocate instructions identified as branches (after decode)
    • Both conditional and unconditional branches are allocated
  – Not-taken branches need not be allocated
    • A BTB miss implicitly predicts not-taken
• Prediction
  – BTB lookup is done in parallel with the IC lookup
  – BTB provides
    • Indication that the instruction is a branch (BTB hit)
    • Branch predicted target
    • Branch predicted direction
    • Branch predicted type (e.g., conditional, unconditional)
• Update (when the branch outcome is known)
  – Branch target
  – Branch history (taken / not-taken)

BTB (cont.)
• Wrong prediction
  – Predict not-taken, actual taken
  – Predict taken, actual not-taken, or actual taken but wrong target
• In case of a wrong prediction: flush the pipeline
  – Reset latches (same as turning all instructions into NOPs)
  – Select the PC source to be from the correct path
    • Need to carry the fall-through address with the branch
  – Start fetching instructions from the correct path
• Assuming a correct prediction rate P
  CPI new = 1 + (0.2 × (1-P)) × 3
  – For example, if P = 0.7
    CPI new = 1 + (0.2 × 0.3) × 3 = 1.18
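The CPI estimates for the stall, delayed-branch, predict-not-taken and BTB options all have the same shape: ideal CPI plus the average number of penalty cycles per instruction. A minimal Python sketch of that arithmetic follows; the function and parameter names are illustrative and do not come from the slides.

```python
def cpi_new(ideal_cpi, branch_fraction, penalty_cycles, penalized_fraction):
    """Ideal CPI plus the average number of penalty cycles per instruction.

    penalized_fraction is the share of branches that actually pay the penalty.
    """
    return ideal_cpi + branch_fraction * penalized_fraction * penalty_cycles


# Slide assumptions: CPI = 1, 20% of instructions are branches, 3-cycle penalty.
stall = cpi_new(1.0, 0.2, 3, 1.0)          # every branch stalls
delayed = cpi_new(1.0, 0.2, 3, 1 - 0.5)    # d = 0.5 of the delay slots are filled
not_taken = cpi_new(1.0, 0.2, 3, 0.5)      # ~50% of the branches are taken
btb = cpi_new(1.0, 0.2, 3, 1 - 0.7)        # P = 0.7 correct predictions

for name, value in [("stall", stall), ("delayed branch", delayed),
                    ("predict not-taken", not_taken), ("BTB, P = 0.7", btb)]:
    print(f"{name}: CPI new = {value:.2f}")
```

Only the last argument changes between the four options, which makes it easy to see that a predictor helps exactly as much as it reduces the fraction of branches that pay the flush penalty.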

Adding a BTB to the Pipeline
[Figure: the pipelined MIPS datapath (IF/ID, ID/EX, EX/MEM, MEM/WB latches) extended with a BTB. (1) The BTB is accessed with the PC at fetch and supplies the predicted target and predicted direction, which select between PC+4 (the not-taken target) and the taken target. (2) The predicted target, predicted direction and PC+4 are carried down the pipe. (3) A Mispredict Detection Unit raises Flush on a wrong prediction and sends address, target and direction back to allocate / update the BTB.]

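Using The BTB
[Flowchart]
• IF: the PC moves to the next instruction; the instruction memory gets the PC and fetches the new instruction; the BTB gets the PC and looks it up; the IF/ID latch is loaded with the new instruction
  – BTB hit and branch predicted taken: PC ← pred addr
  – Otherwise: PC ← PC + 4
• ID: the IF/ID latch is loaded with the predicted instruction (predicted taken) or with the sequential instruction (otherwise); is the instruction a branch?
• EXE: continues on the next slide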

Using The BTB (cont.)
[Flowchart]
• ID: is the instruction a branch? If not, continue
• EXE: calculate the branch condition and target; was the prediction correct?
  – Yes: continue
  – No: flush the pipe and update the PC
• MEM: the IF/ID latch is loaded with the correct instruction; update the BTB
• WB: continue

Advanced Branch Prediction

Introduction
• Need to predict:
  – Conditional branch direction (taken or not taken)
    • Actual direction is known only after execution
    • A wrong direction prediction causes a full flush
  – Targets of all taken branches (conditional taken or unconditional)
    • Target of direct branches is known at decode
    • Target of indirect branches is known at execution
  – Branch type
    • Conditional, uncond. direct, uncond. indirect, call, return
• Goal: minimize the branch misprediction rate for a given predictor size

Branches and Performance
• MPI: mispredictions per instruction
  MPI = (# of incorrectly predicted branches) / (total # of instructions)
• MPI correlates well with performance. For example:
  – MPI = 1% (1 out of 100 instructions → 1 out of 20 branches)
  – IPC = 2 (IPC is the average number of instructions per cycle)
  – Flush penalty of 10 cycles
• We get:
  – MPI = 1% → a flush every 100 instructions
  – → a flush every 50 cycles (since IPC = 2)
  – 10 cycles of flush penalty every 50 cycles → 20% in performance
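The MPI example can be replayed as a short calculation. This is a sketch of the slide's back-of-the-envelope reasoning only; the function name is an assumption, and the model ignores second-order effects such as overlapping flushes.

```python
def flush_overhead(mpi, ipc, flush_penalty_cycles):
    """Flush penalty cycles paid per cycle of useful execution."""
    instructions_between_flushes = 1 / mpi
    cycles_between_flushes = instructions_between_flushes / ipc
    return flush_penalty_cycles / cycles_between_flushes


# Slide example: MPI = 1%, IPC = 2, flush penalty of 10 cycles.
overhead = flush_overhead(mpi=0.01, ipc=2, flush_penalty_cycles=10)
print(f"flush overhead: {overhead:.0%}")
```

Halving the MPI halves the overhead in this model, which is why MPI tracks performance more directly than the per-branch misprediction rate.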

Target Array
• TA is accessed using the branch address (branch IP)
• Implemented as an n-way set associative cache
• Tags are usually partial
  – Saves space
  – Can get false hits
  – A few branches may be aliased to the same entry
  – Affects only performance, not correctness
• TA predicts the following
  – Indication that the instruction is a branch
  – Predicted target
  – Branch type
    • Unconditional: take target
    • Conditional: predict direction
• TA is allocated / updated at execution
[Figure: the branch IP is looked up in the TA (tag, target, type fields); outputs are hit / miss (indicates a branch), predicted target and predicted type]

Conditional Branch Direction Prediction

One-Bit Predictor
• Prediction (at fetch): the previous branch outcome
• Update (at execution): update the bit with the branch outcome
• Problem: a 1-bit predictor makes a double mistake in loops
[Figure: the branch IP indexes a counter array / cache; a table compares branch outcomes with the predictions]

Bimodal (2-bit) Predictor
• A 2-bit counter avoids the double mistake on glitches
  – Need “more evidence” to change the prediction
• 2 bits encode one of 4 states
  – 00 – strongly NT, 01 – weakly NT, 10 – weakly taken, 11 – strongly taken
  – Initial state: weakly taken (most branches are taken)
• Update
  – Branch was actually taken: increment counter (saturate at 11)
  – Branch was actually not-taken: decrement counter (saturate at 00)
• Predict according to the m.s. bit of the counter (0 – NT, 1 – taken)
• Predicts monotonic branches well: one mistake per loop execution (at the loop exit)
• Does not predict well branches with patterns like 0101…01
[Figure: state diagram 00 SNT, 01 WNT, 10 WT, 11 ST; a taken branch moves one state toward ST, a not-taken branch moves one state toward SNT]
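The difference between the 1-bit and 2-bit schemes is easy to see in a small simulation. The sketch below is illustrative: the class name, the helper function and the loop workload (4 taken iterations followed by a not-taken exit, run 3 times) are assumptions, not part of the slides.

```python
class SaturatingCounter:
    """n-bit saturating counter; the prediction is its most significant bit."""

    def __init__(self, bits=2):
        self.max_state = (1 << bits) - 1
        self.state = (self.max_state + 1) // 2   # weakest "taken" state (10 for 2 bits)

    def predict(self):
        return self.state > self.max_state // 2  # True means "taken"

    def update(self, taken):
        if taken:
            self.state = min(self.state + 1, self.max_state)
        else:
            self.state = max(self.state - 1, 0)


def count_mispredictions(counter, outcomes):
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses


# Hypothetical loop branch: taken 4 times, then not taken at the exit; the loop runs 3 times.
loop_branch = ([True] * 4 + [False]) * 3
one_bit_misses = count_mispredictions(SaturatingCounter(bits=1), loop_branch)
two_bit_misses = count_mispredictions(SaturatingCounter(bits=2), loop_branch)

# A strictly alternating branch defeats the 2-bit counter.
alternating_misses = count_mispredictions(SaturatingCounter(bits=2), [False, True] * 10)

print(one_bit_misses, two_bit_misses, alternating_misses)
```

With this workload the 1-bit predictor also mispredicts the first iteration of every re-entered loop, while the 2-bit counter only misses the exit; the alternating branch keeps the 2-bit counter bouncing between the two weak states.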

Bimodal Predictor (cont.)
[Figure: the l.s. bits of the branch IP index an array of 2-bit saturating counters; prediction = msb of counter; the counter is updated with the branch outcome]

Bimodal Predictor - example
Code:
    Loop: ….
    br1:  if (n/2) { ……. }
    br2:  if ((n+1)/2) { ……. }
          n--
    br3:  JNZ n, Loop
• Br1 prediction
  – Pattern:
  – counter:
  – Prediction: T T T T T T
• Br2 prediction
  – Pattern:
  – counter:
  – Prediction: T nT T nT T nT
• Br3 prediction
  – Pattern:
  – counter:
  – Prediction: T T T T T T

2-Level Prediction: Local Predictor
• Save the history of each branch in a Branch History Register (BHR):
  – A shift register updated by the branch outcome
  – Saves the last n outcomes of the branch
  – Used as a pointer to an array of bits specifying the direction per history
• Example: assume n=6
  – Assume the pattern
  – At the steady state, the following patterns are repeated in the BHR:
    • Following , , the jump is not taken
    • Following the jump is taken
[Figure: the n-bit BHR indexes an array with entries 0 to 2^n − 1]

Local Predictor (cont.)
• There could be glitches from the pattern
  – Use 2-bit saturating counters instead of 1 bit to record the outcome
• BHRs that are too long are not good:
  – Past history may no longer be relevant
  – Warm-up is longer
  – Counter array becomes too big
[Figure: the BHR supplies the history that indexes a 2-bit-sat counter array; prediction = msb of counter; the counter and the history are both updated with the branch outcome]
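A minimal sketch of a local predictor for a single branch may help: the BHR indexes an array of 2-bit saturating counters, and both are updated with the outcome. The class name, the 4-bit history and the repeating pattern taken, taken, taken, not-taken are assumptions chosen for illustration; they are not the example from the slides.

```python
class LocalPredictor:
    """One Branch History Register indexing 2**history_bits 2-bit counters."""

    def __init__(self, history_bits):
        self.mask = (1 << history_bits) - 1
        self.bhr = 0                                # last outcomes, newest in bit 0
        self.counters = [2] * (1 << history_bits)   # all start at weakly taken

    def predict(self):
        return self.counters[self.bhr] >= 2         # msb of the 2-bit counter

    def update(self, taken):
        c = self.counters[self.bhr]
        self.counters[self.bhr] = min(c + 1, 3) if taken else max(c - 1, 0)
        self.bhr = ((self.bhr << 1) | int(taken)) & self.mask


def count_mispredictions(predictor, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# Hypothetical branch that repeats the pattern T, T, T, N ten times.
pattern = [True, True, True, False] * 10
local_misses = count_mispredictions(LocalPredictor(history_bits=4), pattern)
print(local_misses)
```

After a short warm-up every history value in the steady state is followed by exactly one outcome, so the counters stop mispredicting; a single bimodal counter would keep missing the not-taken outcome in every period.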

Local Predictor: private counter arrays
• Holding BHRs and counter arrays for many branches
• Predictor size: #BHRs × (tag_size + history_size + 2 × 2^history_size)
• Example: #BHRs = 1024; tag_size = 8; history_size = 6
  → size = 1024 × (8 + 6 + 2 × 2^6) = 142 Kbit
[Figure: the branch IP is looked up in a history cache (tag, history); each entry has its own 2-bit-sat counter array indexed by the history; prediction = msb of counter; the counter and the history are updated with the branch outcome]

Local Predictor: shared counter arrays
• Using a single counter array shared by all BHRs
  – All BHRs index the same array
  – Branches with similar patterns interfere with each other
• Predictor size: #BHRs × (tag_size + history_size) + 2 × 2^history_size
• Example: #BHRs = 1024; tag_size = 8; history_size = 6
  → size = 1024 × (8 + 6) + 2 × 2^6 = 14.1 Kbit
[Figure: the branch IP is looked up in a history cache (tag, history); the history indexes one shared 2-bit-sat counter array; prediction = msb of counter]
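The two size formulas for the local predictor can be checked numerically. This sketch only evaluates the formulas as stated; the function names are illustrative.

```python
def private_arrays_bits(num_bhrs, tag_size, history_size):
    """Each BHR entry carries its own array of 2**history_size 2-bit counters."""
    return num_bhrs * (tag_size + history_size + 2 * 2 ** history_size)


def shared_array_bits(num_bhrs, tag_size, history_size):
    """All BHR entries index one shared array of 2**history_size 2-bit counters."""
    return num_bhrs * (tag_size + history_size) + 2 * 2 ** history_size


private_bits = private_arrays_bits(1024, 8, 6)
shared_bits = shared_array_bits(1024, 8, 6)
print(f"private: {private_bits / 1024:.1f} Kbit, shared: {shared_bits / 1024:.1f} Kbit")
```

Sharing the counter array cuts the storage by roughly a factor of ten in this configuration, at the cost of interference between branches whose histories look alike.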

Local Predictor: lselect
• lselect reduces inter-branch interference in the counter array
• Predictor size: #BHRs × (tag_size + history_size) + 2 × 2^(history_size + m)
[Figure: the branch IP is looked up in a history cache (tag, history); the h history bits are concatenated with the m l.s. bits of the IP to form an (h+m)-bit index into the 2-bit-sat counter array; prediction = msb of counter]

Local Predictor: lshare
• lshare reduces inter-branch interference in the counter array by mapping common patterns in different branches to different counters
• Predictor size: #BHRs × (tag_size + history_size) + 2 × 2^(history_size + m)
[Figure: the branch IP is looked up in a history cache (tag, history); the h history bits are combined with h l.s. bits of the IP into an h-bit index into the 2-bit-sat counter array; prediction = msb of counter]

Global Predictor
• The behavior of some branches is highly correlated with the behavior of other branches:
    if (x < 1) ...
    if (x > 1) ...
• Using a Global History Register (GHR), the prediction of the second if may be based on the direction of the first if
• For other branches the history interference might be destructive

Global Predictor (cont.)
• Predictor size: history_size + 2 × 2^history_size
• Example: history_size = 12 → size = 8 Kbit
[Figure: the GHR supplies the history that indexes a 2-bit-sat counter array; prediction = msb of counter; the counter and the history are both updated with the branch outcome]
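To see how a global history captures correlation between branches, here is a hedged sketch of a GHR-indexed predictor. The class name, the 3-bit history and the workload are assumptions made for illustration: branch A follows the pattern taken, taken, not-taken, and branch B always goes the opposite way of the A that precedes it (as in the two correlated ifs).

```python
class GlobalPredictor:
    """A single Global History Register indexing 2**history_bits 2-bit counters."""

    def __init__(self, history_bits):
        self.mask = (1 << history_bits) - 1
        self.ghr = 0
        self.counters = [2] * (1 << history_bits)   # all start at weakly taken

    def predict(self):
        return self.counters[self.ghr] >= 2         # msb of the 2-bit counter

    def update(self, taken):
        c = self.counters[self.ghr]
        self.counters[self.ghr] = min(c + 1, 3) if taken else max(c - 1, 0)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask


def run(predictor, outcomes):
    """Return one misprediction flag (True = mispredicted) per outcome."""
    flags = []
    for taken in outcomes:
        flags.append(predictor.predict() != taken)
        predictor.update(taken)
    return flags


# Interleave the outcomes of A and B in program order: A, B, A, B, ...
stream = []
for a_taken in [True, True, False] * 10:
    stream += [a_taken, not a_taken]

flags = run(GlobalPredictor(history_bits=3), stream)
late_misses = sum(flags[30:])   # mispredictions after the warm-up half
print(sum(flags), late_misses)
```

Once the counters have warmed up, every 3-outcome global history in this stream is followed by a single outcome, so both branches are predicted correctly, including B, whose own outcome sequence depends entirely on A.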

Global Predictor: Gshare
• gshare combines the global history information with the branch IP
[Figure: the branch IP and the GHR history are combined into the index of a 2-bit-sat counter array; prediction = msb of counter; the counter and the history are both updated with the branch outcome]
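In the usual formulation of gshare the combination is a bitwise XOR of the global history with the low bits of the branch IP. The slide does not spell out the operator, so the sketch below assumes XOR; the function name and the bit widths are illustrative as well.

```python
def gshare_index(branch_ip, ghr, index_bits):
    """Counter-array index: low bits of the IP XORed with the global history."""
    mask = (1 << index_bits) - 1
    return (branch_ip ^ ghr) & mask


# Two hypothetical branches that see the same global history 0110
# are steered to different counters.
history = 0b0110
index_a = gshare_index(0b0001, history, index_bits=4)
index_b = gshare_index(0b0010, history, index_bits=4)
print(f"{index_a:04b} {index_b:04b}")
```

Compared with indexing by the GHR alone, the XOR spreads branches that share a history across the array without adding index bits.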

Chooser
• A chooser selects between 2 predictors that predict the same branch
  – Use the predictor that was more correct in the past
• Chooser array: an array of 2-bit sat. counters, indexed by the branch IP
  – +1 if Bimodal / Local correct and Global wrong
  – -1 if Bimodal / Local wrong and Global correct
• The chooser may also be indexed by the GHR
[Figure: the branch IP feeds the Bimodal / Local predictor and the chooser array, the GHR feeds the Global predictor; the chooser selects which prediction is used]
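The chooser update rule can be sketched for a single 2-bit counter. Selecting by the counter's msb (high values favor the Bimodal / Local predictor) is an assumption consistent with the +1 / -1 rule; the function names are illustrative.

```python
def update_chooser(counter, local_correct, global_correct):
    """+1 if only the local side was right, -1 if only the global side was right."""
    if local_correct and not global_correct:
        return min(counter + 1, 3)
    if global_correct and not local_correct:
        return max(counter - 1, 0)
    return counter   # both right or both wrong: no evidence either way


def choose(counter, local_prediction, global_prediction):
    """Use the local prediction when the msb of the chooser counter is set."""
    return local_prediction if counter >= 2 else global_prediction


counter = 2                                      # start by weakly favoring local
counter = update_chooser(counter, False, True)   # global was right, local wrong
counter = update_chooser(counter, False, True)   # and again
selected = choose(counter, local_prediction=True, global_prediction=False)
print(counter, selected)
```

Leaving the counter unchanged when both predictors agree in correctness keeps the chooser from drifting on branches that either predictor handles equally well.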

Speculative History Updates
• Deep pipeline → many cycles between fetch and branch resolution
  – If history is updated only at resolution
    • Local: future occurrences of the same branch may see stale history
    • Global: future occurrences of all branches may see stale history
  – History is speculatively updated according to the prediction
    • History must be corrected if the branch is mispredicted
    • Speculative updates are done in a special field to enable recovery
• Speculative history update
  – Speculative history is updated assuming previous predictions are correct
  – Speculation bit is set to indicate that speculative history is used
  – Counter array is not updated speculatively: the prediction can change (state change from 01 to 10 or 10 to 01) only on a misprediction
• On branch resolution
  – Update the real history, and reset the speculative histories if mispredicted

Return Stack Buffer
• A return instruction is a special case of an indirect branch:
  – Each time it may jump to a different target
  – The target is determined by the location of the corresponding call instruction
• The idea:
  – Hold a small stack of targets
  – When the target array predicts a call
    • Push the address of the instruction which follows the call onto the stack
  – When the target array predicts a return
    • Pop a target from the stack and use it as the return address
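A minimal sketch of the push / pop behavior follows. The depth, the drop-the-oldest overflow policy, the `None` result on an empty stack and all names are assumptions; the slides only describe the push on a predicted call and the pop on a predicted return.

```python
class ReturnStackBuffer:
    """Small stack of predicted return addresses."""

    def __init__(self, depth=16):
        self.depth = depth
        self.stack = []

    def on_call(self, call_ip, call_length):
        # Assumed overflow policy: discard the oldest entry when full.
        if len(self.stack) == self.depth:
            self.stack.pop(0)
        self.stack.append(call_ip + call_length)   # address of the instruction after the call

    def on_return(self):
        # Returns None when there is nothing to predict from.
        return self.stack.pop() if self.stack else None


rsb = ReturnStackBuffer()
rsb.on_call(call_ip=0x1000, call_length=5)   # outer call
rsb.on_call(call_ip=0x2000, call_length=5)   # nested call
inner_return = rsb.on_return()
outer_return = rsb.on_return()
print(hex(inner_return), hex(outer_return))
```

Because calls and returns nest, the last pushed address is the right prediction for the next return, even though the same return instruction jumps to a different target each time.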

Branch Prediction in Commercial Processors

Older Processors
• 386 / 486
  – All branches are statically predicted Not Taken
• Pentium
  – IP based, 2-bit saturating counters (Lee-Smith)
  – BTB miss: statically predicted Not Taken

Intel Pentium III
• 2-level, local histories, per-set counters
• 4-way set associative: 512 entries in 128 sets
• Branch type: 00 - cond, 01 - ret, 10 - call, 11 - uncond
[Figure: the IP indexes 128 sets of 4 ways (Way 0 to Way 3); each way holds a tag, a history (e.g., 1001), a target and P, T, V fields; each set has LRR bits and per-set counters; pred = msb of counter; a Return Stack Buffer sits alongside]

Alpha LG Chooser
• A new entry in the local stage is allocated on a global-stage misprediction
• Chooser state machines: 2 bits each
  – One bit records whether the global prediction was correct or wrong last time
  – The other bit records the same for the local prediction
• Chooses Local only if local was correct and global was wrong
[Figure: Local: the IP is looked up in 256 histories organized in 4 ways, each entry holding a 6-bit tag + 10-bit history, and the history indexes the local counters. Global: the GHR indexes the global counters. A chooser drives the MUX that selects between the local and global predictions.]

Pentium® M
• Combines 3 predictors
  – Bimodal, Global and Loop predictor
• The loop predictor analyzes branches to see if they have loop behavior
  – Moving in one direction (taken or NT) a fixed number of times
  – Ending with a single movement in the opposite direction

Pentium® M – Indirect Branch Predictor
• The target of indirect branches is data dependent
  – Some indirect branches still have a single target at run time
  – Some have many targets
    • E.g., a case statement in a Java byte-code interpreter
• Indirect branches are heavily used in object-oriented code (C++, Java)
  → they became a growing source of branch mispredictions
• An indirect branch is resolved at execution → high misprediction penalty
• A dedicated indirect branch target predictor (iTA)
  – Chooses targets based on a global history (similar to the global predictor)
• Initially an indirect branch is allocated only in the target array (TA)
• If the target is mispredicted, allocate an entry in the iTA corresponding to the global history leading to this instance of the indirect branch
  – Monotonic indirect branches are still predicted by the TA
  – Data-dependent indirect branches allocate as many targets as needed

Indirect branch target prediction (cont.)
• The prediction from the iTA is used if
  – The TA indicates an indirect branch
  – The iTA hits for the current global history (XORed with the branch address)
[Figure: the branch IP accesses the Target Array, and the branch IP combined with the GHR history accesses the Indirect Target Predictor; a MUX selects the iTA's predicted target when the TA hits on an indirect branch and the iTA hits, otherwise the TA's predicted target]

Backup

AMD-K6
• BHT - Branch History Table: 2-level, 8,192-entry global predictor
  – A 13-bit global history (GHR) indexes a 2-bit-sat counter array
• 16-entry BTC - Branch Target Cache
  – Supplies the first 16 bytes of target instructions to the decoders when branches are predicted
  – Organized as 16 entries of 16 bytes
  – Avoids a bubble for correct predictions
• No Target Address Buffer: address ALUs calculate target addresses on-the-fly during decode
• 16-entry RAS - Return Address Stack
  – Caches the return addresses

HP PA-8000
• 256 × 3-bit branch history table (BHT)
• Instead of 2-bit-sat counters, stores the results of the last three iterations of each branch
• The prediction is based on a majority vote of the three bits
• Offers a similar level of hysteresis and accuracy as 2-bit-sat, but is easier to update (shift results vs. read-modify-write)
• The BHT is only updated as branch instructions are retired
  – Prevents corrupting the history information with speculative executions of the branch
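The majority-vote scheme can be sketched for one BHT entry as follows. The function names and the encoding (1 = taken, newest outcome in bit 0) are assumptions for illustration.

```python
def predict_majority(history):
    """Predict taken when at least 2 of the last 3 outcomes were taken."""
    return bin(history & 0b111).count("1") >= 2


def shift_in(history, taken):
    """Update is a plain shift: no read-modify-write of a counter is needed."""
    return ((history << 1) | int(taken)) & 0b111


history = 0b111                       # three taken outcomes in a row
history = shift_in(history, False)    # one not-taken glitch
after_glitch = predict_majority(history)
history = shift_in(history, False)    # a second not-taken outcome
after_two = predict_majority(history)
print(after_glitch, after_two)
```

A single outlier does not flip the prediction, which is the hysteresis the slide compares to a 2-bit saturating counter.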