COSC3330 Computer Architecture Lecture 14. Branch Prediction


COSC3330 Computer Architecture Lecture 14. Branch Prediction Instructor: Weidong Shi (Larry), PhD Computer Science Department University of Houston

Topic: Out-of-Order Execution, Branch Prediction

Superscalar Terminology. Superscalar: able to issue > 1 instruction / cycle. Superpipelined: deep, but not superscalar, pipeline. Issue width: number of instructions issued per cycle. Out-of-order: able to execute instructions out of program order. Register renaming: able to dynamically assign physical registers to instructions. Speculative execution: able to run instructions speculatively (branch predictions).

A Dynamic Superscalar Processor. [Figure: pipeline diagram. IF, ID, RD (in order) → Dispatch Buffer → EX units ALU, MEM1-MEM2, FP1-FP3, BR (out of order) → Reorder Buffer → WB (in order).]

Remember the Toll Booth? Service times of 5s and 30s (a driver hands the toll-booth agent a $100 bill; it takes a while to count the change). One-at-a-time = 45s; OOO (out of order) = 30s with a “4-issue” toll booth (lanes L1 to L4). We’ll add the equivalent of the “shoulder” to the CPU: the Re-Order Buffer (ROB).

Re-Order Buffer (ROB) Separates architected vs. physical registers Tracks program order of all in-flight instructions Enables in-order completion or “commit”

Hardware Organization. [Figure: Instruction Buffers, RAT, Architected Register File, ROB (with “head” pointer; fields: type, dest, value, fin), and Reservation Stations and ALUs for Add and Mult (fields: op, Qj, Qk, Vj, Vk).]

Circular Ring Buffer

Read inst from inst buffer. Check if resources are available: an appropriate RS entry and a ROB entry. Read RAT, read (available) sources, update RAT. Write to RS and ROB. Stall issue if any needed resource is not available. [Figure: Instruction Buffers, RAT, Architected Register File, ROB (type, dest, value, fin), and Reservation Stations and ALUs for Add and Mult (op, Qj, Qk, Vj, Vk).]

Exec. Same as before: wait for all operands to arrive, compete to use a functional unit, execute!

Write Result. Broadcast result on CDB (any dependents will grab the value). Write result back to your ROB entry; the ARF holds the “official” register state, which we will only update in program order. Mark ready/finished bit in ROB (note that this inst has completed execution).

New: Commit. When an inst is the oldest in the ROB (i.e. ROB-head points to it): write result (if ready/finished bit is set). If register-producing instruction: write to architected register file. If store: write to memory. Advance ROB-head to next instruction. This is what the outside world sees, and it’s all in-order.
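The issue / write-result / commit steps above can be sketched as a toy model. This is a minimal illustration, not the lecture's hardware: the class name `ReorderBuffer`, its method names, and the dictionary standing in for the ARF are all assumptions, and stores, reservation stations, and the RAT are left out.

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: entries are allocated in program order and committed from the head."""

    def __init__(self, size):
        self.size = size
        self.entries = deque()   # leftmost element is the ROB head (oldest inst)
        self.arf = {}            # stand-in for the architected register file

    def issue(self, name, dest):
        # Stall (return False) if no ROB entry is available.
        if len(self.entries) >= self.size:
            return False
        self.entries.append({"name": name, "dest": dest, "value": None, "fin": False})
        return True

    def write_result(self, name, value):
        # Execution may finish in any order; only mark the entry finished.
        for entry in self.entries:
            if entry["name"] == name:
                entry["value"], entry["fin"] = value, True
                return
        raise KeyError(name)

    def commit(self):
        # Retire finished instructions from the head only, so the ARF
        # is updated strictly in program order.
        committed = []
        while self.entries and self.entries[0]["fin"]:
            entry = self.entries.popleft()
            self.arf[entry["dest"]] = entry["value"]
            committed.append(entry["name"])
        return committed


rob = ReorderBuffer(size=4)
for name, dest in [("A", "r1"), ("B", "r2"), ("C", "r3")]:
    rob.issue(name, dest)

rob.write_result("C", 30)   # the youngest instruction finishes first
first = rob.commit()        # nothing commits: A (the head) is not finished
rob.write_result("A", 10)
second = rob.commit()       # A commits; unfinished B blocks C
rob.write_result("B", 20)
third = rob.commit()        # B, then C, commit in program order
print(first, second, third)
```

Even though C finishes first, the register state only ever changes in the order A, B, C, which is the property the commit stage exists to provide.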

Commit Illustrated. Make instruction execution “visible” to the outside world: “commit” the changes to the architected state. Instructions execute out of program order, but the outside world still “believes” it’s in-order. [Figure: ROB holding instructions A through K; results are written back (WB) to the ARF in order, so the outside world “sees” A executed, B executed, C executed, D executed, E executed.]

James E. Smith Eckert–Mauchly Award 1999 for fundamental contributions to high performance micro-architecture, including saturating counters for branch prediction, reorder buffers for precise exceptions, …

Loose Ends. Up to now: techniques for handling register-related dependencies (register renaming for WAR, WAW; Tomasulo’s algorithm for scheduling RAW). Still need to address: control dependencies.

Branch Prediction/Speculative Execution. When we hit a branch, guess if it’s T or NT. Keep scheduling and executing instructions as if the branch didn’t even exist. Sometime later, if we messed up, just throw it all out and fetch the correct instructions. [Figure: an instruction sequence containing branches (ADD, Branch, LOAD, DIV, ADD, Branch, SUB, STORE, …) with a guess of T, beside a control-flow graph in which A leads to B, and B branches to C, D, … (T) or Q, R, … (NT).]

Branches Kill! Branches are very frequent: approx. 20% of all instructions. We cannot afford to wait until we know where a branch goes. Long pipelines: branch outcome known after B cycles; no scheduling past the branch until the outcome is known. Superscalars (e.g. 4-way): a branch every cycle or so! One cycle of work, then bubbles for ~B cycles?

Categorizing Branches Source: H&P using Alpha

Surviving Branches: Prediction. Predict branches, and predict them well! Fetch, decode, etc. on the predicted path. Option 1: no execute until branch resolved. Option 2: execute anyway (speculation). Recover from mispredictions: restart fetch from the correct path. [Figure: control-flow graph in which A leads to B, and B branches to C, D, … (T) or Q, R, … (NT).]

Branch Misprediction (single issue). [Figure: pipeline stages numbered 1 to 20: PC, Next PC, Fetch, Drive, Alloc, Rename, Queue, Schedule, Dispatch, Reg File, Exec, Flags, Br Resolve; the mispredict is marked.]

Branch Misprediction (single issue): flush entailed instructions and refetch. [Figure: pipeline stages numbered 1 to 20: PC, Next PC, Fetch, Drive, Alloc, Rename, Queue, Schedule, Dispatch, Reg File, Exec, Flags, Br Resolve; the mispredict is marked.]

Branch Misprediction: single issue vs. 8-issue superscalar processor (worst case). [Figure: pipeline stages numbered 1 to 20: PC, Next PC, Fetch, Drive, Alloc, Rename, Queue, Schedule, Dispatch, Reg File, Exec, Flags, Br Resolve; the mispredict is marked.]

Intel Quad Core

A9 (Apple A5)

Importance of Branches: Instruction Window for ILP. If the misprediction rate equals 50%, and 1 in 5 insts is a branch, then the number of useful instructions that we can fetch is $5 \times \left(1 + \tfrac{1}{2} + (\tfrac{1}{2})^2 + (\tfrac{1}{2})^3 + \dots\right) = 10$. If we halve the miss rate down to 25%: $5 \times \left(1 + \tfrac{3}{4} + (\tfrac{3}{4})^2 + (\tfrac{3}{4})^3 + \dots\right) = 20$. Halving the miss rate doubles the number of useful instructions that we can try to extract ILP from.
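The arithmetic on this slide is a geometric series: with p the probability that a branch is predicted correctly, the sum 1 + p + p^2 + … equals 1 / (1 - p), so the window is (instructions per branch) / (misprediction rate). A small sketch, where the function name and its `terms` parameter are assumptions made for illustration:

```python
def useful_instructions(insts_per_branch, mispredict_rate, terms=None):
    """Expected number of correct-path instructions that can be fetched.

    Evaluates insts_per_branch * (1 + p + p^2 + ...) with p = 1 - mispredict_rate.
    With terms=None the closed form insts_per_branch / mispredict_rate is used;
    otherwise only the first `terms` terms of the series are summed.
    """
    p_correct = 1.0 - mispredict_rate
    if terms is None:
        return insts_per_branch / mispredict_rate
    return insts_per_branch * sum(p_correct ** k for k in range(terms))


print(useful_instructions(5, 0.50))   # 10.0
print(useful_instructions(5, 0.25))   # 20.0
```

Passing a finite `terms` value shows how quickly the partial sums approach the closed form.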

Branch Prediction. Need to know two things: whether the branch is taken or not (direction), and the target address if it is taken (target). Direct jumps, function calls: direction known (always taken), target easy to compute. Conditional branches (typically PC-relative): direction difficult to predict, target easy to compute. Indirect jumps, function returns: direction known (always taken), target difficult.

Branch Prediction: Direction. Needed for conditional branches; most branches are of this type. Many, many kinds of predictors for this. Static: fixed rule, or compiler annotation (e.g. “BEQL” is “branch if equal likely”). Dynamic: hardware prediction. Dynamic prediction is usually history-based. Example: predict that the direction is the same as the last time this branch was executed.

Why is Branch Direction Predictable?
C code:
    if (aa==2) aa = 0;
    if (bb==2) bb = 0;
    if (aa!=bb) ….
    for (i=0; i<100; i++) { …. }
Assembly:
        addi r2, r0, 2
        bne  r10, r2, L_bb
        xor  r10, r10, r10
    L_bb:
        bne  r11, r2, L_xx
        xor  r11, r11, r11
    L_xx:
        beq  r10, r11, L_exit
        …
    L_exit:
        addi r10, r0, 100
        add  r1, r0, r0
    L1:
        … …
        addi r1, r1, 1
        bne  r1, r10, L1

Static Branch Prediction. Uni-directional: always predict taken (or not taken). Backward taken, forward not taken: needs offset information. Compiler hints with branch annotation. When will the info be available? Post-decode?
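The “backward taken, forward not taken” rule needs only the sign of the branch offset, which is why the slide asks when that offset becomes available. A minimal sketch, assuming the branch PC and target PC are both known; the function name and the addresses are made up for illustration:

```python
def predict_btfn(branch_pc, target_pc):
    """Backward taken, forward not taken: True means 'predict taken'."""
    return target_pc < branch_pc


# Hypothetical addresses: a loop-closing branch jumps backward,
# an if-skip branch jumps forward.
loop_back = predict_btfn(branch_pc=0x1010, target_pc=0x1000)
skip_ahead = predict_btfn(branch_pc=0x1010, target_pc=0x1040)
print(loop_back, skip_ahead)
```

The rule works because loop-closing branches, which jump backward, are taken on every iteration except the last.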

FSM of the Simplest Predictor. A 2-state machine; changes its mind fast. State 0: predict not taken. State 1: predict taken. Move to state 1 if the branch is taken, to state 0 if the branch is not taken.

Example using 1-bit branch history table
    for (i=0; i<4; i++) { …. }

        addi r10, r0, 4
        add  r1, r0, r0
    L1:
        … …
        addi r1, r1, 1
        bne  r1, r10, L1

    Pred:    1  1  1  1  1   0  1  1  1  1   0   (next: 1)
    Actual:  T  T  T  T  NT  T  T  T  T  NT  T
60% accuracy
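The 60% figure can be checked by replaying the slide's outcome pattern (T T T T NT, repeated) through a last-outcome predictor. This is a sketch under stated assumptions: the function names are invented here, the predictor starts in state 1, and accuracy is measured after the first period so that the initial state does not matter.

```python
def simulate_one_bit(outcomes, state=1):
    """Last-outcome predictor: state 1 predicts taken, state 0 predicts not taken."""
    predictions = []
    for taken in outcomes:
        predictions.append(state)
        state = 1 if taken else 0     # remember only the latest outcome
    return predictions


def accuracy(predictions, outcomes):
    hits = sum(1 for p, taken in zip(predictions, outcomes) if bool(p) == taken)
    return hits / len(outcomes)


T, NT = True, False
period = [T, T, T, T, NT]             # outcome pattern shown on the slide
outcomes = period * 20
preds = simulate_one_bit(outcomes)
steady = accuracy(preds[5:], outcomes[5:])   # skip the first period (warm-up)
print(preds[:11], steady)
```

Each period costs two mispredictions: one on the not-taken exit, and one on the first taken branch afterwards, because the single bit has just flipped.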

2-bit Saturating Up/Down Counter Predictor. MSB: direction bit. LSB: hysteresis bit. States: 00/SN (Strongly Not Taken), 01/WN (Weakly Not Taken), 10/WT (Weakly Taken), 11/ST (Strongly Taken). A taken branch counts up toward 11/ST; a not-taken branch counts down toward 00/SN. States 00 and 01 predict not taken; states 10 and 11 predict taken.

2-bit Counter Predictor (Another Scheme). States: 00/SN (Strongly Not Taken), 01/WN (Weakly Not Taken), 10/WT (Weakly Taken), 11/ST (Strongly Taken). States 00 and 01 predict not taken; states 10 and 11 predict taken. [Figure: state-transition diagram with Taken and Not Taken arcs.]

Example using 2-bit up/down counter
    for (i=0; i<4; i++) { …. }

        addi r10, r0, 4
        add  r1, r0, r0
    L1:
        … …
        addi r1, r1, 1
        bne  r1, r10, L1

    Pred:    01  10  11  11  11  10  11  11  11  11  10   (next: 11)
    Actual:  T   T   T   T   NT  T   T   T   T   NT  T
States: 00/SN, 01/WN, 10/WT, 11/ST. 80% accuracy
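The state sequence and the 80% figure follow mechanically from the saturating-counter rule. A minimal sketch that replays the slide's outcome pattern, assuming the counter starts in 01 (weakly not taken) as in the table; the function name is an assumption:

```python
def simulate_two_bit(outcomes, state=0b01):
    """2-bit saturating up/down counter; the MSB of the state is the prediction."""
    states, predictions = [], []
    for taken in outcomes:
        states.append(state)
        predictions.append(state >= 0b10)      # 10 and 11 predict taken
        if taken:
            state = min(state + 1, 0b11)       # count up, saturate at 11/ST
        else:
            state = max(state - 1, 0b00)       # count down, saturate at 00/SN
    return states, predictions


T, NT = True, False
outcomes = [T, T, T, T, NT] * 20               # outcome pattern shown on the slide
states, predictions = simulate_two_bit(outcomes)
first_states = [format(s, "02b") for s in states[:11]]
late = list(zip(predictions[5:], outcomes[5:]))  # skip the first period (warm-up)
steady = sum(1 for p, taken in late if p == taken) / len(late)
print(first_states, steady)
```

The hysteresis bit is what helps: the single not-taken exit only moves the counter from 11 to 10, so the next taken branch is still predicted correctly.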

Bimodal Branch Prediction. 2^N entries addressed by the N-bit PC. Each entry keeps a counter (2-bit or more) for prediction. Counter update: the same as the 2-bit counter FSM. [Figure: N bits of the PC address index a table of 2^N entries (each entry has a 2-bit counter); the selected entry gives the prediction, and FSM update logic combines it with the actual outcome for the table update.]
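A bimodal table can be sketched in a few lines. The class below is an illustration only: its name, the choice to drop the 2 low PC bits (assuming 4-byte instructions), the initial counter value, and the branch addresses are all assumptions.

```python
class BimodalPredictor:
    """Table of 2^N two-bit saturating counters indexed by low PC bits."""

    def __init__(self, n_bits, init=0b01):
        self.mask = (1 << n_bits) - 1
        self.table = [init] * (1 << n_bits)

    def index(self, pc):
        # Assumption: 4-byte instructions, so the 2 LSBs of the PC are dropped.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        return self.table[self.index(pc)] >= 0b10   # MSB set means "taken"

    def update(self, pc, taken):
        i = self.index(pc)
        if taken:
            self.table[i] = min(self.table[i] + 1, 0b11)
        else:
            self.table[i] = max(self.table[i] - 1, 0b00)


bp = BimodalPredictor(n_bits=4)
LOOP_BRANCH, RARE_BRANCH = 0x400, 0x404      # hypothetical branch addresses
for _ in range(4):
    bp.update(LOOP_BRANCH, True)
    bp.update(RARE_BRANCH, False)

alias = LOOP_BRANCH + (16 << 2)              # same low 4 index bits as LOOP_BRANCH
print(bp.predict(LOOP_BRANCH), bp.predict(RARE_BRANCH), bp.predict(alias))
```

The last prediction shows aliasing: a branch that was never trained inherits the counter of another branch that maps to the same entry.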

Global vs. Local Branch History. Local behavior: what is the predicted direction of Branch A given the outcomes of previous instances of Branch A? Global behavior: what is the predicted direction of Branch Z given the outcomes of all* previous branches A, B, …, X and Y? (* The number of previous branches tracked is limited by the history length.)

Branch Correlation. Code snippet:
    if (aa==2)      // b1
        aa = 0;
    if (bb==2)      // b2
        bb = 0;
    if (aa!=bb) {   // b3
        …….
    }
Paths through b1 and b2 (1 = T, 0 = NT): A: 1-1 (aa=0, bb=0); B: 1-0 (aa=0, bb≠2); C: 0-1 (aa≠2, bb=0); D: 0-0 (aa≠2, bb≠2). Branch direction is not independent: it is correlated to the path taken. Example: on path 1-1, the direction of b3 can be surely known beforehand. Track the path using a 2-bit register.
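The correlation claim can be checked by brute force: run the snippet for a range of inputs and group the outcome of b3 by the path through b1 and b2. A minimal sketch; the helper name `run_snippet` and the input range 0 to 3 are assumptions, and an outcome of True here means the `if` condition held.

```python
def run_snippet(aa, bb):
    """Evaluate the three conditions of the slide's snippet for one input pair."""
    b1 = aa == 2
    if b1:
        aa = 0
    b2 = bb == 2
    if b2:
        bb = 0
    b3 = aa != bb
    return b1, b2, b3


# Group every observed b3 outcome by the (b1, b2) path that preceded it.
b3_by_path = {}
for aa in range(4):
    for bb in range(4):
        b1, b2, b3 = run_snippet(aa, bb)
        b3_by_path.setdefault((b1, b2), set()).add(b3)

for path, b3_outcomes in sorted(b3_by_path.items()):
    print(path, sorted(b3_outcomes))
```

Only the path where both earlier conditions held leaves a single possible outcome for b3, since both variables were just set to 0; the other paths still need a prediction.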

Global Branch History Register. An N-bit shift register: shift in branch outcomes (1 = taken, 0 = not taken), first-in first-out. A BHR can be global or local (per-address). Code snippet:
    if (aa==2)      // b1
        aa = 0;
    if (bb==2)      // b2
        bb = 0;
    if (aa!=bb) {   // b3
        …….
    }
Example with a 3-bit BHR starting at 000: actual T → 001; actual T → 011; actual NT → 110.
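The shift-register update is a one-liner: shift left, insert the new outcome at the LSB, and mask to N bits so the oldest outcome falls off. A sketch that replays the slide's three outcomes; the function name `shift_in` is an assumption:

```python
def shift_in(bhr, taken, n_bits):
    """Shift one branch outcome (1 = taken, 0 = not taken) into an n-bit BHR."""
    return ((bhr << 1) | int(taken)) & ((1 << n_bits) - 1)


bhr = 0b000
history = []
for taken in [True, True, False]:     # outcomes T, T, NT as on the slide
    bhr = shift_in(bhr, taken, n_bits=3)
    history.append(format(bhr, "03b"))
print(history)
```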

Local Branch History Register
    for (i=0; i<4; i++) { …. }

        addi r10, r0, 4
        add  r1, r0, r0
    L1:
        … …
        addi r1, r1, 1
        bne  r1, r10, L1

3-bit local BHR starting at 000, after each actual outcome: T → 001, T → 011, T → 111, T → 111, NT → 110, T → 101, T → 011, T → 111, T → 111, NT → 110.
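A local (per-address) history keeps one shift register per branch, so other branches do not disturb it. The sketch below uses a dictionary keyed by branch PC as a stand-in for a hardware table; the class name and both PCs are assumptions made for illustration.

```python
class LocalHistoryTable:
    """One n-bit history register per branch address."""

    def __init__(self, n_bits):
        self.n_bits = n_bits
        self.histories = {}   # branch PC -> local BHR

    def record(self, pc, taken):
        bhr = self.histories.get(pc, 0)
        bhr = ((bhr << 1) | int(taken)) & ((1 << self.n_bits) - 1)
        self.histories[pc] = bhr
        return bhr


T, NT = True, False
LOOP_BR, OTHER_BR = 0x40, 0x80            # hypothetical branch addresses
lht = LocalHistoryTable(n_bits=3)
trace = []
for taken in [T, T, T, T, NT, T, T, T, T, NT]:   # loop-branch outcomes from the slide
    trace.append(format(lht.record(LOOP_BR, taken), "03b"))
    lht.record(OTHER_BR, False)           # an interleaved branch with its own history
print(trace)
```

With a single global register the interleaved not-taken outcomes would have been shifted into the same history, which is the difference between the two organizations.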

Two-Level Branch Predictor [YehPatt91,92,93]. Generalized correlated branch predictor. The 1st level keeps branch history in the Branch History Register (BHR). The 2nd level segregates pattern history in the Pattern History Table (PHT). [Figure: an N-bit BHR (bits Rc-k … Rc-1, shifted left on update) indexes a PHT of 2^N entries (00…00 through 11…11); the selected entry’s current state gives the prediction, and FSM update logic uses the branch history pattern and Rc, the actual branch outcome, for the PHT update.]

Correlated Branch Predictor [PanSoRahmeh’92]. (M,N) correlation scheme: M is the shift register size (# bits); N means N-bit counters. [Figure: the 2-bit saturating counter scheme, in which a w-bit hash of the branch PC selects one of 2^w 2-bit counters to give the prediction, next to the (2,2) correlation scheme, in which the hashed branch PC selects a group of 2-bit counters and a 2-bit shift register (global branch history) selects among them for the subsequent branch direction.]

Pattern History Table. 2^N entries addressed by the N-bit BHR. Each entry keeps a counter (2-bit or more) for prediction. Counter update: the same as the 2-bit counter. Can be initialized in alternate patterns (01, 10, 01, 10, ...). Alias (or interference) problem.
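Putting the BHR and the PHT together gives a small two-level predictor. The sketch below uses one global BHR to index the PHT; the class name, the 4-bit history length, and the test pattern (T T T T NT repeated, as in the earlier loop examples) are assumptions for illustration.

```python
class TwoLevelGlobalPredictor:
    """An N-bit BHR indexes a PHT of 2^N two-bit saturating counters."""

    def __init__(self, n_bits):
        self.n_bits = n_bits
        self.bhr = 0
        # Alternate initial patterns 01, 10, 01, 10, ...
        self.pht = [0b01 if i % 2 == 0 else 0b10 for i in range(1 << n_bits)]

    def predict(self):
        return self.pht[self.bhr] >= 0b10          # MSB set means "taken"

    def update(self, taken):
        counter = self.pht[self.bhr]
        self.pht[self.bhr] = min(counter + 1, 0b11) if taken else max(counter - 1, 0b00)
        self.bhr = ((self.bhr << 1) | int(taken)) & ((1 << self.n_bits) - 1)


outcomes = [True, True, True, True, False] * 20
predictor = TwoLevelGlobalPredictor(n_bits=4)
hits = []
for taken in outcomes:
    hits.append(predictor.predict() == taken)
    predictor.update(taken)

late_accuracy = sum(hits[20:]) / len(hits[20:])    # measured after warm-up
print(late_accuracy)
```

With 4 bits of history, every position in the 5-branch period sees a distinct history pattern, so each PHT counter is trained on one consistent outcome and the loop exit becomes predictable.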

Two-Level Branch Prediction. PC = 0x4001000C; the 2 LSBs are insignificant for 32-bit instructions, so the PC contributes the bits 0011. BHR = 0110. Together they form the PHT index 00110110; that entry holds 10, so MSB = 1: predict taken. [Figure: PHT with entries 00000000 through 11111111.]

PHT Indexing. Tradeoff between more history bits and address bits. Too many bits needed in Gselect → sparse table entries. [Table: branch address and global history bit patterns with the resulting Gselect 4/4 index; surviving values: 00000000, 00000001, 11111111, 11110000, 10000000. Callout: insufficient history.]

Gshare Branch Predictor [McFarling93]. Gselect 4/4: index the PHT by concatenating the low-order 4 bits of the branch address and of the global history. Gshare 8/8: index the PHT by {branch address XOR global history}. Tradeoff between more history bits and address bits. Too many bits needed in Gselect → sparse table entries. Gshare → not to lose global history bits. Ex: AMD Athlon, MIPS R12000, Sun MAJC, Broadcom SiByte’s SB-1. [Table: branch address and global history bit patterns with the resulting Gselect 4/4 and Gshare 8/8 indices; surviving values: 00000000, 00000001, 11111111, 11110000, 10000000, 01111111.]
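The two indexing schemes differ only in how the address and history bits are combined. A minimal sketch; the function names are assumptions, and the four (address, history) input pairs are illustrative choices that are consistent with the bit patterns surviving on the slide, not a confirmed copy of its table.

```python
def gselect_index(branch_addr, history, addr_bits=4, hist_bits=4):
    """Concatenate the low address bits with the low history bits."""
    addr = branch_addr & ((1 << addr_bits) - 1)
    hist = history & ((1 << hist_bits) - 1)
    return (addr << hist_bits) | hist


def gshare_index(branch_addr, history, n_bits=8):
    """XOR the branch address with the global history."""
    return (branch_addr ^ history) & ((1 << n_bits) - 1)


cases = [                      # hypothetical (branch address, global history) pairs
    (0b00000000, 0b00000001),
    (0b00000000, 0b00000000),
    (0b11111111, 0b00000000),
    (0b11111111, 0b10000000),
]
rows = [
    (format(gselect_index(a, h), "08b"), format(gshare_index(a, h), "08b"))
    for a, h in cases
]
for (a, h), (gsel, gsh) in zip(cases, rows):
    print(format(a, "08b"), format(h, "08b"), gsel, gsh)
```

The last two cases collide under Gselect 4/4, because the differing history bit lies outside the 4 history bits it keeps, while Gshare 8/8 maps them to different PHT entries.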

Gshare Branch Predictor. [Figure: the PC address is XORed with the global BHR to index the PHT; the selected entry holds 00, so MSB = 0: predict not taken.]