COSC3330 Computer Architecture Lecture 14. Branch Prediction


1 COSC3330 Computer Architecture Lecture 14. Branch Prediction
Instructor: Weidong Shi (Larry), PhD Computer Science Department University of Houston

2 Topics
Out-of-Order Execution
Branch Prediction

3 Superscalar Terminology
Superscalar: able to issue more than one instruction per cycle
Superpipelined: deep, but not superscalar, pipeline
Issue width: number of instructions issued per cycle
Out-of-order: able to execute instructions out of program order
Register renaming: able to dynamically assign physical registers to instructions
Speculative execution: able to run instructions speculatively (branch predictions)

4 A Dynamic Superscalar Processor
IF → ID → RD (in order) → Dispatch Buffer → EX units: ALU, MEM1/MEM2, FP1/FP2/FP3, BR (out of order) → Reorder Buffer → WB (in order)

5 Remember the Toll Booth?
[Figure: cars at a toll booth with service times of 5 s and 30 s; one driver hands the toll-booth agent a $100 bill, and it takes a while to count the change. Second panel: a “4-issue” toll booth with lanes L1, L2, L3, L4.]
One at a time = 45 s; OOO (out of order) = 30 s
We’ll add the equivalent of the “shoulder” to the CPU: the Re-Order Buffer (ROB)

6 Re-Order Buffer (ROB)
Separates architected vs. physical registers
Tracks program order of all in-flight instructions
Enables in-order completion, or “commit”

7 Hardware Organization
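[Figure components: instruction buffers; register alias table (RAT); architected register file (ARF); ROB with a “head” pointer; reservation stations and ALUs.]
Reservation station fields (Add and Mult units): op, Qj, Qk, Vj, Vk
ROB entry fields: type, dest, value, fin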

8 Circular Ring Buffer
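The circular ring buffer can be sketched in Python as follows. This is a minimal illustration under our own assumptions: the `ROBEntry` fields loosely follow the slide-7 entry (type, dest, value, fin), and the names `ReorderBuffer`, `allocate`, and `retire_head` are hypothetical. Issue allocates at the tail and stalls when the ring is full; commit frees entries from the head, and both pointers wrap modulo the buffer size.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ROBEntry:
    op: str                      # instruction type, e.g. "add" or "store"
    dest: Optional[str]          # architected destination register, if any
    value: Optional[int] = None  # result, filled in when the result is written
    fin: bool = False            # ready/finished bit


class ReorderBuffer:
    """Fixed-size circular ring buffer: allocate at the tail, retire from the head."""

    def __init__(self, size: int) -> None:
        self.slots = [None] * size  # each slot holds a ROBEntry or None
        self.head = 0               # oldest in-flight instruction
        self.tail = 0               # next free slot
        self.count = 0

    def is_full(self) -> bool:
        return self.count == len(self.slots)

    def allocate(self, entry: ROBEntry) -> Optional[int]:
        """Return the entry's slot index, or None if issue must stall (ROB full)."""
        if self.is_full():
            return None
        idx = self.tail
        self.slots[idx] = entry
        self.tail = (self.tail + 1) % len(self.slots)  # wrap around the ring
        self.count += 1
        return idx

    def retire_head(self) -> ROBEntry:
        """Free the oldest entry; the caller checks its fin bit first."""
        if self.count == 0:
            raise IndexError("ROB is empty")
        entry = self.slots[self.head]
        self.slots[self.head] = None
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return entry


rob = ReorderBuffer(2)
print(rob.allocate(ROBEntry("add", "r1")))  # 0
print(rob.allocate(ROBEntry("mul", "r2")))  # 1
print(rob.allocate(ROBEntry("sub", "r3")))  # None: ROB full, so issue stalls
print(rob.retire_head().op)                 # add
print(rob.allocate(ROBEntry("sub", "r3")))  # 0: the tail wrapped around
```

Counting occupied entries separately avoids the usual ambiguity of `head == tail`, which otherwise means either "empty" or "full".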

9 Issue: stall if any needed resource is not available
Read the instruction from the instruction buffer
Check whether the needed resources are available: an appropriate RS entry and a ROB entry
Stall issue if any needed resource is not available
Read the RAT, read (available) sources, update the RAT
Write to the RS and the ROB

10 Exec
Same as before:
Wait for all operands to arrive
Compete to use a functional unit
Execute!

11 Write Result
Broadcast the result on the CDB (any dependents will grab the value)
Write the result back to your ROB entry
The ARF holds the “official” register state, which we will only update in program order
Mark the ready/finished bit in the ROB (note that this instruction has completed execution)

12 New: Commit
When an instruction is the oldest in the ROB (i.e., ROB-head points to it):
Write the result (if the ready/finished bit is set)
If a register-producing instruction: write to the architected register file
If a store: write to memory
Advance ROB-head to the next instruction
This is what the outside world sees, and it’s all in order
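A minimal sketch of the commit rule, under a simplified model of our own: the ROB is a deque whose left end is ROB-head, at most one instruction commits per cycle, and an instruction can commit no earlier than the cycle after it finishes. The function `run_commit` and the finish cycles are hypothetical. Instructions finish out of order, but only the head may update the ARF.

```python
from collections import deque


def run_commit(program, finish_cycle, arf):
    """In-order commit from a ROB modeled as a deque (left end = ROB-head).

    program: list of (name, dest_register, value) in program order.
    finish_cycle: cycle in which each instruction sets its finished bit
                  (these may be out of program order).
    arf: architected register file (dict), updated only at commit.
    """
    rob = deque(program)
    log = []
    cycle = 0
    while rob:
        cycle += 1
        name, dest, value = rob[0]          # only the oldest entry may commit
        if finish_cycle[name] < cycle:      # finished bit already set
            rob.popleft()                   # advance ROB-head
            arf[dest] = value               # outside world now sees the result
            log.append((cycle, name))
    return log


program = [("A", "r1", 10), ("B", "r2", 20), ("C", "r1", 30)]
finish = {"A": 3, "B": 1, "C": 2}           # B and C finish before A
arf = {}
print(run_commit(program, finish, arf))     # [(4, 'A'), (5, 'B'), (6, 'C')]
print(arf)                                  # {'r1': 30, 'r2': 20}
```

Although B and C finish first, the ARF changes in program order, so C’s later write to r1 correctly overwrites A’s value.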

13 Commit Illustrated
Make instruction execution “visible” to the outside world: “commit” the changes to the architected state
[Figure: ROB holding instructions A, B, C, …, K; as A, B, C, D, E reach the head in turn, each result is written back (WB) to the ARF, and the outside world “sees” A executed, B executed, C executed, D executed, E executed.]
Instructions execute out of program order, but the outside world still “believes” they execute in order

14 James E. Smith Eckert–Mauchly Award for fundamental contributions to high performance micro-architecture, including saturating counters for branch prediction, reorder buffers for precise exceptions, …

15 Loose Ends
Up to now: techniques for handling register-related dependencies
Register renaming for WAR, WAW
Tomasulo’s algorithm for scheduling RAW
Still need to address: control dependencies

16 Branch Prediction/Speculative Execution
When we hit a branch, guess whether it is taken (T) or not taken (NT)
Keep scheduling and executing instructions as if the branch didn’t even exist
Sometime later, if we messed up: just throw it all out and fetch the correct instructions
[Figure: control-flow graph with basic blocks A, B, C, D, Q, R (instructions such as ADD, LOAD, DIV, SUB, STORE, XOR, MUL); the branch ending block A is guessed T, and execution continues down the predicted path.]

17 Branches Kill!
Branches are very frequent: approx. 20% of all instructions
Cannot afford to wait until we know where a branch goes
Long pipelines: the branch outcome is known only after B cycles, and there is no scheduling past the branch until the outcome is known
Superscalars (e.g., 4-way): a branch every cycle or so!
One cycle of work, then bubbles for ~B cycles?
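To put rough numbers on “one cycle of work, then bubbles for ~B cycles”, here is a back-of-the-envelope model (our own simplification, not from the slides): fetch stops at every branch until it resolves, so each run of instructions ending in a branch costs its fetch cycles plus B bubble cycles. The function `ipc_without_prediction` and its parameters are hypothetical.

```python
import math


def ipc_without_prediction(issue_width: int, branch_every: int, resolve_cycles: int) -> float:
    """IPC if fetch stops at each branch until it resolves (hypothetical model).

    Each run of `branch_every` instructions, ending in a branch, needs
    ceil(branch_every / issue_width) fetch cycles, then `resolve_cycles`
    bubble cycles (the slide's B) before fetch continues.
    """
    work_cycles = math.ceil(branch_every / issue_width)
    return branch_every / (work_cycles + resolve_cycles)


print(ipc_without_prediction(1, 5, 0))             # 1.0: ideal scalar pipeline
print(round(ipc_without_prediction(4, 5, 0), 2))   # 2.5
print(round(ipc_without_prediction(4, 5, 10), 2))  # 0.42
```

With a branch every 5 instructions, a 4-wide machine with B = 10 manages only 5/12 ≈ 0.42 instructions per cycle in this model, less than half of an ideal scalar pipeline.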

18 Categorizing Branches
Source: H&P using Alpha

19 Surviving Branches: Prediction
Predict branches, and predict them well!
Fetch, decode, etc. on the predicted path
Option 1: no execution until the branch is resolved
Option 2: execute anyway (speculation)
Recover from mispredictions: restart fetch from the correct path
[Figure: control-flow graph with blocks A, B, C, D, Q, R and taken (T) / not-taken (NT) edges]

20 Branch Misprediction: Single Issue
[Figure: pipeline timeline over cycles 1 to 20 with stages PC, Next PC, Fetch, Drive, Alloc, Rename, Queue, Schedule, Dispatch, Reg File, Exec, Flags, Br Resolve; a single-issue processor detects the mispredict at Br Resolve.]

21 Branch Misprediction
[Figure: the same 20-cycle pipeline, single issue; on a mispredict, all subsequent (wrong-path) instructions are flushed and fetch restarts from the correct path.]

22 Branch Misprediction: Single Issue vs. 8-Issue
[Figure: the same pipeline for single issue and for an 8-issue superscalar processor (worst case), where each flushed stage can hold up to 8 instructions.]

23 Intel Quad Core

24 A9 (Apple A5)

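25 Importance of Branches
Instruction window for ILP: if the misprediction rate is 50% and 1 in 5 instructions is a branch, then the expected number of useful instructions that we can fetch is
$$5\left(1 + \tfrac{1}{2} + \left(\tfrac{1}{2}\right)^2 + \left(\tfrac{1}{2}\right)^3 + \cdots\right) = 10$$
If we halve the misprediction rate to 25%:
$$5\left(1 + \tfrac{3}{4} + \left(\tfrac{3}{4}\right)^2 + \left(\tfrac{3}{4}\right)^3 + \cdots\right) = 20$$
Halving the misprediction rate doubles the number of useful instructions from which we can try to extract ILP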
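Both sums are geometric with ratio q = 1 − (misprediction rate), so they have the closed form 5 / (misprediction rate). A small numeric check, using a hypothetical helper `useful_instructions` that can also sum a truncated series:

```python
def useful_instructions(mispredict_rate, insts_per_branch=5, terms=None):
    """Expected useful instructions: insts_per_branch * (1 + q + q**2 + ...)."""
    q = 1.0 - mispredict_rate                     # chance a branch is predicted correctly
    if terms is None:
        return insts_per_branch / mispredict_rate  # closed form of the series
    return insts_per_branch * sum(q ** k for k in range(terms))


print(useful_instructions(0.5))                        # 10.0
print(useful_instructions(0.25))                       # 20.0
print(round(useful_instructions(0.25, terms=100), 6))  # 20.0
```

Because the result is inversely proportional to the misprediction rate, every halving of that rate doubles the useful window.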

26 Branch Prediction
Need to know two things:
Whether the branch is taken or not (direction)
The target address if it is taken (target)
Direct jumps, function calls: direction known (always taken), target easy to compute
Conditional branches (typically PC-relative): direction difficult to predict, target easy to compute
Indirect jumps, function returns: direction known (always taken), target difficult to predict

27 Branch Prediction: Direction
Needed for conditional branches; most branches are of this type
Many, many kinds of predictors for this:
Static: fixed rule, or compiler annotation (e.g., “BEQL” is “branch if equal likely”)
Dynamic: hardware prediction
Dynamic prediction is usually history-based
Example: predict that the direction is the same as the last time this branch was executed

28 Why Is Branch Direction Predictable?
if (aa==2) aa = 0;
if (bb==2) bb = 0;
if (aa!=bb) { …. }
for (i=0; i<100; i++) { …. }

        addi r2, r0, 2
        bne  r10, r2, L_bb      # skip if aa != 2
        xor  r10, r10, r10      # aa = 0
L_bb:   bne  r11, r2, L_xx      # skip if bb != 2
        xor  r11, r11, r11      # bb = 0
L_xx:   beq  r10, r11, L_exit   # skip body if aa == bb
        …
L_exit: addi r10, r0, 100
        addi r1, r0, 0
L1:     …
        addi r1, r1, 1
        bne  r1, r10, L1

29 Static Branch Prediction
Uni-directional: always predict taken (or always not taken)
Backward taken, forward not taken (BTFN): needs offset information
Compiler hints with branch annotation
When will the information be available? Post-decode?
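The backward-taken/forward-not-taken rule needs only the sign of the branch offset. A one-line sketch, assuming the offset is the signed PC-relative displacement (target minus branch PC) available after decode; `btfn_predict` is a hypothetical name:

```python
def btfn_predict(offset: int) -> bool:
    """Backward taken, forward not taken: True means "predict taken"."""
    return offset < 0  # backward branches are typically loop-closing branches


print(btfn_predict(-16))  # True: e.g. "bne r1, r10, L1" jumping back to the loop top
print(btfn_predict(12))   # False: e.g. a forward "bne r10, r2, L_bb" skipping code
```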

30 FSM of the Simplest Predictor
A 2-state machine that changes its mind fast:
State 0: predict not taken; state 1: predict taken
If the branch is taken, go to state 1; if it is not taken, go to state 0

31 Example using 1-bit branch history table
for (i=0; i<4; i++) { …. }
        addi r10, r0, 4
        addi r1, r0, 0
L1:     …
        addi r1, r1, 1
        bne  r1, r10, L1
Pred:   1 1 1 1 1 0 1 1 1 1 0
Actual: T T T T NT T T T T NT T
60% accuracy
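The 1-bit entry for this branch can be simulated directly. A sketch, assuming the entry starts at 1 (predict taken); scoring the 10 predictions after the first one appears to be how the slide arrives at 60%. `simulate_1bit` is a hypothetical helper.

```python
def simulate_1bit(outcomes, init=1):
    """1-bit predictor: predict whatever this branch did last time.

    outcomes: actual directions (1 = taken, 0 = not taken).
    Returns the prediction made before each outcome.
    """
    state = init
    preds = []
    for actual in outcomes:
        preds.append(state)  # predict the stored bit
        state = actual       # then remember the latest outcome
    return preds


ACTUAL = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1]  # T T T T NT T T T T NT T
preds = simulate_1bit(ACTUAL)
hits = [p == a for p, a in zip(preds, ACTUAL)]
print(preds)                          # [1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
print(sum(hits[1:]) / len(hits[1:]))  # 0.6
```

Each loop exit costs two misses: one on the NT itself and one on the next T, because the single bit has already flipped.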

32 2-bit Saturating Up/Down Counter Predictor
MSB: direction bit; LSB: hysteresis bit
States: 11/ST (Strongly Taken), 10/WT (Weakly Taken), 01/WN (Weakly Not Taken), 00/SN (Strongly Not Taken)
Taken: count up toward ST; not taken: count down toward SN (saturating at both ends)
Predict taken in ST and WT; predict not taken in WN and SN

33 2-bit Counter Predictor (Another Scheme)
[Figure: an alternative FSM over the same four states, 00/SN, 01/WN, 10/WT, 11/ST, with different taken / not-taken transitions; predict not taken in SN and WN, predict taken in WT and ST.]

34 Example using 2-bit up/down counter
for (i=0; i<4; i++) { …. }
        addi r10, r0, 4
        addi r1, r0, 0
L1:     …
        addi r1, r1, 1
        bne  r1, r10, L1
Pred (counter state): 01 10 11 11 11 10 11 11 11 11 10
Actual:               T  T  T  T  NT T  T  T  T  NT T
80% accuracy
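The same trace with a 2-bit saturating counter that starts at 01 (WN), as in the slide’s first column. Counting the 10 predictions after the first (cold) one, as in the 1-bit example, gives the slide’s 80%. The helper names are hypothetical.

```python
def update_2bit(counter: int, taken: bool) -> int:
    """Saturating up/down counter: 0 = SN, 1 = WN, 2 = WT, 3 = ST."""
    return min(counter + 1, 3) if taken else max(counter - 1, 0)


def simulate_2bit(outcomes, init=1):
    """Return the counter state seen before each outcome."""
    state = init
    states = []
    for actual in outcomes:
        states.append(state)
        state = update_2bit(state, bool(actual))
    return states


ACTUAL = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1]  # T T T T NT T T T T NT T
states = simulate_2bit(ACTUAL)
preds = [int(s >= 2) for s in states]       # MSB is the direction bit
hits = [p == a for p, a in zip(preds, ACTUAL)]
print([format(s, "02b") for s in states])   # ['01', '10', '11', '11', '11', '10', '11', '11', '11', '11', '10']
print(sum(hits[1:]) / len(hits[1:]))        # 0.8
```

The hysteresis bit absorbs the single NT at each loop exit: the counter drops from 11 to 10 but still predicts taken for the next loop entry.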

35 Bimodal Branch Prediction
[Figure: N bits of the PC address index a table of 2^N entries, each holding a 2-bit counter; the counter supplies the prediction, and update logic uses the actual outcome to update the table entry.]
2^N entries addressed by an N-bit PC
Each entry keeps a counter (2-bit or more) for prediction
Counter update: the same as the 2-bit counter FSM
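A bimodal table is the 2-bit counter replicated per table entry. A minimal sketch, assuming 4-byte instructions so that the 2 always-zero PC LSBs are dropped before indexing; the class name and the initial counter value 01 are our own choices.

```python
class BimodalPredictor:
    """2**n_bits two-bit saturating counters, indexed by low-order PC bits."""

    def __init__(self, n_bits: int, init: int = 1) -> None:
        self.n_bits = n_bits
        self.table = [init] * (1 << n_bits)  # init=1 means 01 (weakly not taken)

    def index(self, pc: int) -> int:
        # Drop the 2 always-zero LSBs (4-byte instructions), keep the next n_bits.
        return (pc >> 2) & ((1 << self.n_bits) - 1)

    def predict(self, pc: int) -> bool:
        return self.table[self.index(pc)] >= 2  # MSB set: predict taken

    def update(self, pc: int, taken: bool) -> None:
        i = self.index(pc)
        c = self.table[i]
        self.table[i] = min(c + 1, 3) if taken else max(c - 1, 0)


bp = BimodalPredictor(n_bits=4)
bp.update(0x400, True)
bp.update(0x400, True)
print(bp.predict(0x400))                   # True: counter is now 11
print(bp.predict(0x404))                   # False: a different entry, still 01
print(bp.index(0x400) == bp.index(0x440))  # True: these two PCs alias
```

Two branches whose word addresses agree in the low N bits (here 0x400 and 0x440) share a counter: the aliasing problem that also affects the PHT.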

36 Global vs. Local Branch History
Local behavior: what is the predicted direction of Branch A, given the outcomes of previous instances of Branch A?
Global behavior: what is the predicted direction of Branch Z, given the outcomes of all* previous branches A, B, …, X, and Y?
* The number of previous branches tracked is limited by the history length

37 Branch Correlation
Code snippet:
if (aa==2)      // b1
    aa = 0;
if (bb==2)      // b2
    bb = 0;
if (aa!=bb) {   // b3
    …….
}
Paths through b1 and b2 (1 = T, 0 = NT):
A: 1-1 → aa=0, bb=0
B: 1-0 → aa=0, bb≠2
C: 0-1 → aa≠2, bb=0
D: 0-0 → aa≠2, bb≠2
Branch direction is not independent: it is correlated with the path taken
Example: on path 1-1, the direction of b3 can be known beforehand (aa == bb == 0)
Track the path using a 2-bit register
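The correlation claim can be checked exhaustively for small values of aa and bb. The sketch below (hypothetical helper `run_snippet`; 1 means the if-condition was true, following the slide’s path labels) records every b3 outcome seen on each (b1, b2) path:

```python
def run_snippet(aa: int, bb: int):
    """Run the b1/b2/b3 snippet; return (b1, b2, b3) with 1 = condition true."""
    b1 = int(aa == 2)
    if b1:
        aa = 0
    b2 = int(bb == 2)
    if b2:
        bb = 0
    b3 = int(aa != bb)
    return b1, b2, b3


# Collect every b3 outcome observed on each (b1, b2) path.
paths = {}
for aa in range(-3, 4):
    for bb in range(-3, 4):
        b1, b2, b3 = run_snippet(aa, bb)
        paths.setdefault((b1, b2), set()).add(b3)

print(paths[(1, 1)])  # {0}: aa == bb == 0, so b3's condition is always false
print(paths[(0, 0)])  # {0, 1}: here b3 still depends on the data
```

Only path 1-1 pins down b3 completely; on the other three paths, both outcomes occur for some inputs.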

38 Global Branch History Register
An N-bit shift register
Shift in branch outcomes: 1 → taken, 0 → not taken
First-in, first-out
A BHR can be global or local (per-address)
Example (code snippet: the same b1, b2, b3 example as slide 37; 3-bit BHR starting at 000):
Actual: T   T   NT
BHR:    001 011 110
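Shifting an outcome into the BHR is a shift, an OR, and a mask. A minimal sketch with a hypothetical `shift_in` helper that reproduces the 3-bit example:

```python
def shift_in(bhr: int, taken: bool, n_bits: int) -> int:
    """Shift the newest outcome into the LSB; the oldest bit falls off the top."""
    return ((bhr << 1) | int(taken)) & ((1 << n_bits) - 1)


bhr = 0b000
trace = []
for taken in (True, True, False):  # actual outcomes T, T, NT
    bhr = shift_in(bhr, taken, n_bits=3)
    trace.append(format(bhr, "03b"))
print(trace)  # ['001', '011', '110']
```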

39 Local Branch History Register
for (i=0; i<4; i++) { …. }
        addi r10, r0, 4
        addi r1, r0, 0
L1:     …
        addi r1, r1, 1
        bne  r1, r10, L1
Starting from 000:
Actual: T   T   T   T   NT  T   T   T   T   NT
BHR:    001 011 111 111 110 101 011 111 111 110
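A local (per-address) history table keeps one such shift register per branch. The sketch below assumes a dictionary keyed by the full PC (real hardware would index a finite table with PC bits) and a hypothetical loop-branch address; the class name is also hypothetical. It reproduces the history sequence above:

```python
from collections import defaultdict


class LocalHistoryTable:
    """One n_bits branch history register per branch address."""

    def __init__(self, n_bits: int) -> None:
        self.mask = (1 << n_bits) - 1
        self.bhr = defaultdict(int)  # pc -> history; every register starts at 0

    def update(self, pc: int, taken: bool) -> int:
        self.bhr[pc] = ((self.bhr[pc] << 1) | int(taken)) & self.mask
        return self.bhr[pc]


LOOP_PC = 0x40                            # hypothetical address of "bne r1, r10, L1"
lht = LocalHistoryTable(3)
actual = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]   # T T T T NT T T T T NT
history = [format(lht.update(LOOP_PC, bool(t)), "03b") for t in actual]
print(history)  # ['001', '011', '111', '111', '110', '101', '011', '111', '111', '110']
```

Other branches get their own registers, so their outcomes do not disturb the loop branch’s history.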

40 Two-Level Branch Predictor [YehPatt91,92,93]
Generalized correlated branch predictor
1st level keeps branch history in the Branch History Register (BHR)
2nd level segregates pattern history in the Pattern History Table (PHT)
[Figure: an N-bit BHR (Rc-k … Rc-1, shifted left on update) holds the branch history pattern, which indexes a PHT of 2^N entries (00…00 to 11…11); the selected entry’s current state gives the prediction, and FSM update logic updates that PHT entry using Rc, the actual branch outcome.]
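A minimal two-level sketch, assuming a single global BHR, a PHT of 2-bit counters all initialized to 01, and a 4-bit history (our choice, enough to tell the five positions of the loop trace T T T T NT apart). Class and method names are hypothetical.

```python
class TwoLevelPredictor:
    """Global BHR (first level) indexing a PHT of 2-bit counters (second level)."""

    def __init__(self, history_bits: int, init: int = 1) -> None:
        self.history_bits = history_bits
        self.bhr = 0
        self.pht = [init] * (1 << history_bits)  # 2**N entries

    def predict(self) -> bool:
        return self.pht[self.bhr] >= 2  # MSB of the selected counter

    def update(self, taken: bool) -> None:
        c = self.pht[self.bhr]
        self.pht[self.bhr] = min(c + 1, 3) if taken else max(c - 1, 0)
        mask = (1 << self.history_bits) - 1
        self.bhr = ((self.bhr << 1) | int(taken)) & mask  # shift left, insert Rc


pattern = [True, True, True, True, False] * 6  # the loop trace, 6 runs
p = TwoLevelPredictor(history_bits=4)
hits = []
for taken in pattern:
    hits.append(p.predict() == taken)
    p.update(taken)
print(sum(hits), len(hits))  # 23 30
print(all(hits[-10:]))       # True: the last two loop runs are predicted perfectly
```

After a short warm-up, each 4-bit history pattern maps to a counter trained on a single outcome, so even the loop exit is predicted; the plain 2-bit counter on the same trace still misses once per loop run.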

41 Correlated Branch Predictor [PanSoRahmeh’92]
(M,N) correlation scheme: M = shift register size (# bits), N = N-bit counter
[Figure: a 2-bit saturating counter scheme, where a hash of the branch PC (w bits) selects one of 2^w 2-bit counters, next to a (2,2) correlation scheme, where a 2-bit shift register (global branch history) also selects which of the branch’s 2-bit counters supplies the prediction.]
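A sketch of an (M,N) correlating predictor, applied to the b1/b2/b3 snippet from slide 37. Assumptions: the per-branch row is chosen with low PC bits rather than the slide’s unspecified hash, counters start just below the taken threshold, the branch addresses are hypothetical, and “taken” stands for the if-condition being true.

```python
class CorrelatingPredictor:
    """(M, N) correlating predictor: a row of 2**m n-bit counters per branch
    (row chosen by low PC bits); m bits of global history pick the counter."""

    def __init__(self, m: int, n: int, pc_bits: int) -> None:
        self.m, self.pc_bits = m, pc_bits
        self.max_count = (1 << n) - 1
        self.threshold = 1 << (n - 1)  # counter MSB set => predict taken
        self.ghr = 0                    # m-bit global history shift register
        init = self.threshold - 1       # just below "taken", e.g. 01 for n = 2
        self.table = [[init] * (1 << m) for _ in range(1 << pc_bits)]

    def _row(self, pc: int) -> list:
        return self.table[(pc >> 2) & ((1 << self.pc_bits) - 1)]

    def predict(self, pc: int) -> bool:
        return self._row(pc)[self.ghr] >= self.threshold

    def update(self, pc: int, taken: bool) -> None:
        row = self._row(pc)
        c = row[self.ghr]
        row[self.ghr] = min(c + 1, self.max_count) if taken else max(c - 1, 0)
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << self.m) - 1)


PC_B1, PC_B2, PC_B3 = 0x100, 0x104, 0x108  # hypothetical branch addresses
pred = CorrelatingPredictor(m=2, n=2, pc_bits=2)
b3_hits = []
for aa, bb in [(2, 2), (2, 5), (5, 2), (5, 5)] * 5:
    b1 = aa == 2
    pred.update(PC_B1, b1)  # b1 and b2 are not scored here
    if b1:
        aa = 0
    b2 = bb == 2
    pred.update(PC_B2, b2)
    if b2:
        bb = 0
    b3 = aa != bb
    b3_hits.append(pred.predict(PC_B3) == b3)
    pred.update(PC_B3, b3)
print(b3_hits[:4])       # [True, False, False, True]: first pass trains the counters
print(all(b3_hits[4:]))  # True
```

With (2,2), the counter used for b3 is selected by (b1, b2), so each path trains its own counter. A single 2-bit counter for b3 starting at 01 would instead cycle through its states on this trace and mispredict three of every four b3 instances.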

42 Pattern History Table
2^N entries addressed by an N-bit BHR
Each entry keeps a counter (2-bit or more) for prediction
Counter update: the same as the 2-bit counter
Can be initialized in alternating patterns (01, 10, 01, 10, …)
Alias (or interference) problem: different branches that produce the same history pattern share a PHT entry

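43 Two-Level Branch Prediction
The 2 LSBs of the PC are insignificant for 32-bit instructions (they are always 00 for word-aligned addresses)
[Figure: PC bits (PC = 0x…C, 2 LSBs dropped) and the BHR index the PHT; the selected counter’s MSB = 1, so predict taken.]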

44 PHT Indexing
[Figure: PHT index formed from branch address bits and global history bits, Gselect 4/4]
Tradeoff between more history bits and more address bits
Gselect 4/4 keeps only 4 global history bits: insufficient history
Too many bits needed in Gselect → sparse table entries

45 Gshare Branch Predictor [McFarling93]
Gselect 4/4: index the PHT by concatenating the low-order 4 bits of the branch address and of the global history
Gshare 8/8: index the PHT by {branch address ⊕ global history}, using 8 bits of each
Tradeoff between more history bits and address bits
Too many bits needed in Gselect → sparse table entries
Gshare → no need to give up global history bits
Ex: AMD Athlon, MIPS R12000, Sun MAJC, Broadcom SiByte’s SB-1
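The two index functions differ only in how PC and history bits are combined. A sketch with hypothetical helper names, assuming the 2 PC LSBs are dropped first:

```python
def gselect_index(pc: int, ghr: int, addr_bits: int, hist_bits: int) -> int:
    """Concatenate low-order PC bits with low-order global history bits."""
    pc_part = (pc >> 2) & ((1 << addr_bits) - 1)
    hist_part = ghr & ((1 << hist_bits) - 1)
    return (pc_part << hist_bits) | hist_part  # 2**(addr_bits + hist_bits) entries


def gshare_index(pc: int, ghr: int, bits: int) -> int:
    """XOR PC bits with global history bits; all `bits` history bits are used."""
    return ((pc >> 2) ^ ghr) & ((1 << bits) - 1)  # 2**bits entries


pc = 0x1234
print(gselect_index(pc, 0b10110110, 4, 4))  # 214
print(gshare_index(pc, 0b10110110, 8))      # 59
# Histories that differ only in their upper 4 bits:
print(gselect_index(pc, 0b10110110, 4, 4) == gselect_index(pc, 0b00000110, 4, 4))  # True
print(gshare_index(pc, 0b10110110, 8) == gshare_index(pc, 0b00000110, 8))          # False
```

Both tables have 256 entries, but two histories that differ only in their upper 4 bits collide under gselect 4/4 and stay apart under gshare 8/8.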

46 Gshare Branch Predictor
[Figure: PC address bits are XORed with the global BHR to index the PHT; the selected counter’s MSB = 0, so predict not taken.]

