Computer Structure Advanced Branch Prediction


Computer Structure Advanced Branch Prediction
Lihu Rappoport and Adi Yoaz

Adding a BTB to the Pipeline
- Look up the current IP in the I$ and in the BTB in parallel
- The I$ provides the instruction bytes
- The BTB provides the predicted target and direction
(Figure: the 5-stage pipeline - Fetch, Decode, Execute, Memory, WB - with a BTB added at Fetch alongside the instruction cache; on a predicted-taken branch the predicted target overrides the next sequential address.)

Adding a BTB to the Pipeline
- At Execute, verify the prediction:
  - Verify the direction
  - Verify the target (if taken)
- Issue a flush in case of a mismatch, along with the repair IP

Branches and Performance
- MPI (mispredictions per instruction):
    MPI = (# of incorrectly predicted branches) / (total # of instructions)
- MPI correlates well with performance. For example:
  - MPI = 1% (1 out of 100 instructions; with ~1 branch per 5 instructions, that is 1 out of 20 branches mispredicted)
  - IPC = 2 (IPC is the average number of instructions per cycle), flush penalty of 10 cycles
- We get:
  - MPI = 1% → a flush every 100 instructions → a flush every 50 cycles (since IPC = 2)
  - A 10-cycle flush penalty every 50 cycles → a 20% performance loss
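The arithmetic of the example can be checked directly (the parameters are the slide's example values):

```python
# Estimate the performance cost of branch mispredictions.
mpi = 0.01           # mispredictions per instruction (1%)
ipc = 2.0            # instructions per cycle, ignoring flushes
flush_penalty = 10   # cycles lost per misprediction

instructions = 1_000_000
base_cycles = instructions / ipc                   # 500,000 cycles
flush_cycles = instructions * mpi * flush_penalty  # 100,000 cycles
slowdown = flush_cycles / base_cycles
print(f"{slowdown:.0%} added cycles")              # 20% added cycles
```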

What/Who/When We Predict/Fix
- Target Array: predicts at Fetch that the instruction is a branch, its type (conditional, unconditional direct, unconditional indirect, call, return), and its target; at Decode, fix a TA miss and fix a wrong direct target; at Execute, allocate the TA
- Cond. Branch Predictor: predicts conditional taken/not-taken at Fetch; on a TA miss, try to fix at Decode; fix a wrong prediction at Execute
- Return Stack Buffer: predicts the return target at Fetch; fix a TA miss at Decode; fix a wrong prediction at Execute
- Indirect Target Array: predicts the indirect target (overriding the TA target) at Fetch; on a TA miss, try to fix at Decode; fix a wrong prediction at Execute
- Fixes at Decode cause a Decode flush; fixes at Execute cause an Execute flush

The Target Array
- The TA is accessed using the branch address (the branch IP)
- Implemented as an n-way set-associative cache
- The TA predicts:
  - That the instruction is a branch (a hit indicates a branch)
  - The predicted target
  - The branch type:
    - Conditional: jump to the target if predicted taken
    - Unconditional direct: take the target
    - Unconditional indirect: if the iTA hits, use its target
    - Return: get the target from the Return Stack Buffer
- The TA is allocated/updated at EXE
- Tags are usually partial, trading space for possible false hits
  - A few branches may alias to the same entry
  - This is not a correctness issue, only a performance issue

Conditional Branch Direction Prediction

One-Bit Predictor
- A bit array, indexed by the l.s. bits of the branch IP
- Prediction (at fetch): the bit holds the previous outcome of the branch
- Update (at execution): write the branch outcome into the bit
- Problem: a 1-bit predictor makes a double mistake in loops:
    Outcome:    0 0 0 0 0 1   0 0 0 0 0 1   0 0 0 0 0 1
    Prediction: ? 0 0 0 0 0   1 0 0 0 0 0   1 0 0 0 0 0
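The double-mistake behavior can be reproduced in a few lines (the table size and branch IP are illustrative):

```python
# Sketch of a 1-bit predictor: each entry simply remembers
# the last outcome of the branch that maps to it.
N = 16
bits = [0] * N

def predict(ip):
    return bits[ip % N]

def update(ip, outcome):
    bits[ip % N] = outcome

# The slide's loop pattern: not-taken 5 times, then taken once, repeated.
mispredicts = 0
for outcome in [0, 0, 0, 0, 0, 1] * 3:
    mispredicts += predict(0x40) != outcome
    update(0x40, outcome)
print(mispredicts)   # 5: after warm-up, two misses per loop execution
```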

Bimodal (2-bit) Predictor
- A 2-bit saturating counter avoids the double mistake of the 1-bit predictor (e.g., in loops): it needs "more evidence" to change its prediction
- Initial state: weakly-taken (most branches are taken)
- Update (at execution):
  - Branch was actually taken: increment the counter (saturate at 11)
  - Branch was actually not-taken: decrement the counter (saturate at 00)
- Predict according to the m.s. bit of the counter (0 - not-taken, 1 - taken)
- States: 00 strongly not-taken (SNT), 01 weakly not-taken (WNT), 10 weakly taken (WT), 11 strongly taken (ST); the two upper states predict taken, the two lower states predict not-taken
- Does not predict well branches with patterns like 010101…

Bimodal Predictor (cont.)
- The l.s. bits of the branch IP index an array of 2-bit saturating counters
- Prediction = m.s. bit of the selected counter
- The counter is updated with the branch outcome (at execution)
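The scheme above can be sketched as follows (array size and branch IP are illustrative):

```python
# Sketch of a bimodal predictor: one array of 2-bit saturating
# counters, indexed by the l.s. bits of the branch IP.
N = 1024
counters = [2] * N                   # init weakly-taken (binary 10)

def predict(ip):
    return counters[ip % N] >> 1     # m.s. bit: 1 = taken, 0 = not-taken

def update(ip, taken):
    i = ip % N
    if taken:
        counters[i] = min(counters[i] + 1, 3)   # saturate at 11
    else:
        counters[i] = max(counters[i] - 1, 0)   # saturate at 00

# A loop branch taken 4 times then not-taken, repeated 3 times:
mispredicts = 0
for outcome in [1, 1, 1, 1, 0] * 3:
    mispredicts += predict(0x40) != outcome
    update(0x40, outcome)
print(mispredicts)   # 3: only one miss per loop execution (at the exit)
```

Unlike the 1-bit predictor, the counter stays in a taken state after the single not-taken outcome, so re-entering the loop is predicted correctly.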

Bimodal Predictor - Example
Code:
    int n = 6
    Loop: ….
        br1: if (n/2) { … }
        br2: if ((n+1)/2) { … }
        n--;
        br3: JNZ n, Loop
- br1: pattern 1 0 1 0 1 0; counter 2 3 2 3 2 3; prediction 1 1 1 1 1 1
- br2: pattern 0 1 0 1 0 1; counter 2 1 2 1 2 1; prediction 1 0 1 0 1 0
- br3: pattern 1 1 1 1 1 0; counter 2 3 3 3 3 3; prediction 1 1 1 1 1 1

2-Level Branch Prediction
- Save the branch direction history in a Branch History Register (BHR): a shift register that saves the last n outcomes of the branch
- The history indexes an array of 2^n bits; the bit selected by a given history is the predicted branch direction following that history
- Predict at Fetch; update at EXE
- Example: predicting the pattern 0001 0001 0001 … with history length n = 3: the entry for history 000 holds 1, and the entries for histories 001, 010, and 100 hold 0 (the remaining histories do not occur in steady state)

Shortest History To Predict a Pattern
What is the shortest history needed to perfectly predict the pattern 10011 10011 … in steady state?
- A history of length 3 predicts correctly:
    100 → 1
    001 → 1
    011 → 1
    111 → 0
    110 → 0
- A history of length 2 is wrong twice per iteration:
    10 → 0
    00 → 1
    01 → 1
    11 → 1
    11 → 0   (the history 11 is followed by both 1 and 0)
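Both claims can be verified with a small simulation, using one prediction bit per history value as on the previous slides (warm-up iterations are excluded from the count):

```python
# Count steady-state mispredictions of a pattern-history table
# with a given history length on a repeating branch pattern.
def steady_state_misses(pattern, hist_len, warmup_reps=20):
    seq = pattern * (warmup_reps + 1)
    table = {}                        # history tuple -> predicted next bit
    hist = tuple(seq[:hist_len])
    misses = 0
    for i in range(hist_len, len(seq)):
        out = seq[i]
        if i >= len(pattern) * warmup_reps and table.get(hist, 0) != out:
            misses += 1               # count only the final repetition
        table[hist] = out             # record the outcome for this history
        hist = hist[1:] + (out,)
    return misses

p = [1, 0, 0, 1, 1]
print(steady_state_misses(p, 3))      # 0 misses per iteration
print(steady_state_misses(p, 2))      # 2 misses per iteration
```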

Local History Predictor
- There could be glitches in the pattern, e.g.: 00001 00001 00001 01001 00001 00001
- Use 2-bit saturating counters instead of 1 bit per history entry to tolerate such glitches
- The BHR is updated with the branch outcome; the prediction is the m.s. bit of the selected counter; the counter is updated with the branch outcome (at execution)
- The longer the history:
  - The longer the warm-up
  - The bigger (and sparser) the counter array

Speculative History Updates
- A deep pipeline means many cycles between fetch and branch resolution
- If the history is updated only at resolution (branch execute): future occurrences of the same branch may be fetched before the history is updated, and thus predicted with a stale history
- Therefore, update the history speculatively, according to the prediction
- As long as the predictions of previous branches are correct, the history used for predicting the next branch is correct as well

Speculative History Updates (cont.)
- If a branch is mispredicted:
  - The speculative history continues to be updated until the bad branch reaches execution
  - Branches following it use a wrong history for their prediction, but this does not matter, as they will be flushed anyhow
  - We do need to recover the branch history to its value before the bad branch
- Maintain a non-speculative copy of the history, updated at execute
- In case of a misprediction, copy the non-speculative history into the speculative history
- Summary: at Fetch, update the speculative BHR with the prediction; at Execution, update the counter and the BHR with the branch outcome; on a misprediction, copy the BHR into the speculative BHR

Speculative History Updates (cont.)
- The counter array is not updated speculatively:
  - A prediction can change only on a misprediction (the transitions 01→10 or 10→01), so there is little to gain
  - The counter array is too big to be recovered after a misprediction
  - It is too expensive to maintain both speculative and non-speculative copies of the counter array

Local Predictor: private counter arrays
- Hold BHRs and counter arrays for many branches in a History Cache: each entry holds a tag, a history, and its own array of 2-bit saturating counters
- The history is updated with the branch prediction / branch outcome; the prediction is the m.s. bit of the counter; the counter is updated with the branch outcome
- Predictor size: #BHRs × (tag_size + history_size + 2 × 2^history_size)
- Example: #BHRs = 1024, tag_size = 8, history_size = 6 → size = 1024 × (8 + 6 + 2×2^6) = 142 Kbit

Local Predictor: shared counter array
- A single counter array is shared by all BHRs: all BHRs index the same array
- Branches with similar patterns interfere with each other; the interference can be either constructive or destructive
- Predictor size: #BHRs × (tag_size + history_size) + 2 × 2^history_size
- Example: #BHRs = 1024, tag_size = 8, history_size = 6 → size = 1024 × (8 + 6) + 2×2^6 = 14.1 Kbit
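Both size formulas are easy to reproduce with the slides' example values:

```python
# Reproduce the predictor-size formulas from the two slides above.
n_bhrs, tag, hist = 1024, 8, 6

private = n_bhrs * (tag + hist + 2 * 2**hist)   # private counter arrays
shared = n_bhrs * (tag + hist) + 2 * 2**hist    # one shared counter array

print(private / 1024)   # 142.0  Kbit
print(shared / 1024)    # 14.125 Kbit (~14.1)
```

Sharing the counter array removes the dominant 2 × 2^history_size term from the per-branch cost, a ~10x size reduction at the price of inter-branch interference.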

Local Predictor: Lselect
- Lselect reduces inter-branch interference in the counter array: the counter index concatenates the h history bits with m l.s. bits of the branch IP
- The counter array becomes 2^m times bigger
- Branches with the same m IP l.s. bits still interfere with each other
- Predictor size: #BHRs × (tag_size + history_size) + 2 × 2^(history_size + m)

Local Predictor: Lshare
- Lshare reduces inter-branch interference in the counter array: the h history bits are XORed with h l.s. bits of the branch IP, mapping common patterns in different branches to different counters
- Predictor size: #BHRs × (tag_size + history_size) + 2 × 2^history_size

Global Predictor
- The behavior of some branches is highly correlated, e.g.:
    if (x < 1) . . .
    if (x > 1) . . .
- Use a single Global History Register (GHR) for all branches: the prediction of the 2nd if is influenced by the direction of the 1st if
- History interference between non-correlated branches might hurt prediction
- With only a single history, we can afford a long history
- Predictor size: history_size + 2 × 2^history_size

Global Predictor: Gshare
- Gshare combines the global history with the branch IP: the counter accessed is a function of both the global history and the specific branch being predicted/updated
- The counter thus learns: following this history, this specific branch is taken / not-taken
- This turns out to be extremely accurate and space-efficient
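A compact sketch of gshare (the array size and history width are illustrative):

```python
# Gshare: the branch IP XORed with a single global history register
# (GHR) indexes one shared array of 2-bit saturating counters.
HIST_BITS = 12
N = 1 << HIST_BITS
counters = [2] * N      # init weakly-taken
ghr = 0                 # global history register

def predict(ip):
    return counters[(ip ^ ghr) & (N - 1)] >> 1   # m.s. bit of the counter

def update(ip, taken):
    global ghr
    i = (ip ^ ghr) & (N - 1)    # same index the prediction used
    counters[i] = min(counters[i] + 1, 3) if taken else max(counters[i] - 1, 0)
    ghr = ((ghr << 1) | taken) & (N - 1)
```

After warm-up, a branch with the pattern 0101… that defeats the bimodal predictor is handled perfectly: its two alternating histories, XORed with the IP, select two different counters.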

XOR as a Hash Function
- XOR preserves the original history:
  - XORing the history with a specific IP is a 1-to-1 mapping of the history: if we know the IP, we can re-create the history (by XORing again)
  - We therefore get exactly the same prediction behavior as without XORing
  - AND, for example, does not have this quality: for every 0 bit in the IP, the resulting bit is 0, regardless of the respective history bit
- XORing with different IPs creates different projections of the history, distributing the information per IP
- XOR is a good distribution function: if the original histories have the same number of 0s and 1s, so will the projected histories (AND, again, does not have this quality: more bits in the projected histories become 0s)
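A small check of the two claims; the IP value is an arbitrary example:

```python
# XOR with a fixed IP is invertible (1-to-1); AND collapses histories.
ip = 0b101100

# Invertibility: XORing again recovers the original history.
assert (0b010011 ^ ip) ^ ip == 0b010011

xor_images = {h ^ ip for h in range(64)}   # all 6-bit histories
and_images = {h & ip for h in range(64)}
print(len(xor_images))   # 64: every history maps to a distinct index
print(len(and_images))   # 8: every 0 bit in the IP zeroes that history bit
```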

Why Gshare Works So Well (foil WIP)
- Updating a single history with all branches' outcomes might look like a mess, but…
- The number of branches active in the program at a given moment is usually small
- E.g., in a short loop with no if statements in the loop body, there is only one branch (the loop's branch), and Gshare behaves just like local history
- Assume 2 branches are active simultaneously: A and B
  - Some of the bits in the GHR are for A, and some for B
  - Due to the XOR with their IPs, A and B update different counters
  - For a given (local) history of branch A, it sometimes gets to

(Foil WIP)
- Assume the following code:
    For i=1 to 1000000 {    // jump A
      For j=1 to 4 {        // jump B
        If (cond) x++;      // jump C
      }
    }
- The local patterns:
  - A: 0000000000…  (A is taken only after i gets to 1000000)
  - B: 00001 00001 …
  - C: c1 c2 c3 c4  (some pattern, depending on cond)
- The global pattern (interleaving the outcomes of A, B, and C): 00 0000000100000

Chooser
- A chooser selects between 2 predictors for each branch: use the predictor that was more correct in the past
- The chooser array is an array of 2-bit saturating counters, indexed by the branch IP (similar to the bimodal array)
- Chooser update: +1 if the bimodal prediction was correct and the global prediction wrong; -1 if the bimodal prediction was wrong and the global prediction correct
- The chooser's output selects between the bimodal prediction (branch-IP-indexed counters) and the global prediction (GHR-indexed counters) to produce the final prediction
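A sketch of the chooser array; the size and the convention that the m.s. bit means "trust bimodal" are assumptions, the slide fixes only the +1/-1 update rule:

```python
# Chooser between a bimodal and a global prediction (both assumed to
# be produced elsewhere and passed in here).
N = 1024
chooser = [2] * N   # 2-bit saturating counters

def choose(ip, bimodal_pred, global_pred):
    # m.s. bit set -> bimodal has been the more correct predictor
    return bimodal_pred if chooser[ip % N] >> 1 else global_pred

def update_chooser(ip, bimodal_pred, global_pred, outcome):
    i = ip % N
    if bimodal_pred == outcome and global_pred != outcome:
        chooser[i] = min(chooser[i] + 1, 3)   # +1: only bimodal was right
    elif global_pred == outcome and bimodal_pred != outcome:
        chooser[i] = max(chooser[i] - 1, 0)   # -1: only global was right
    # if both predictors agree, the chooser is left unchanged
```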

Indirect Branch Target Prediction
- An indirect branch gets its target from a register, so it can have many targets (e.g., a case statement)
- Resolved only at execution → high misprediction penalty
- Common in object-oriented code (C++, Java)
- A history-based indirect branch target predictor:
  - The Indirect Target Array (iTA) is indexed by the branch IP combined with the same global history used for conditional jumps
  - Each jump takes multiple entries → more expensive than the TA → should be used only for jumps with multiple targets

Indirect Branch Target Prediction (cont.)
- Initially, allocate an indirect branch only in the Target Array (TA): many indirect branches have only one target
- If the TA mispredicts for an indirect jump: allocate an iTA entry with the current target, indexed by IP XOR global history
- The prediction from the iTA is used if the TA indicates an indirect jump and the iTA hits

Return Stack Buffer
- A return instruction is a special case of an indirect branch: each time, it may jump to a different target, determined by the location of the corresponding Call instruction
- Maintain a small stack of targets at fetch time:
  - When the Target Array predicts a Call: push the address of the instruction that follows the Call onto the stack (this is the predicted return address)
  - When the Target Array predicts a Return: pop a target from the stack and use it as the return address
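A minimal sketch of the mechanism; the stack depth and the 5-byte call-instruction length are made-up example values:

```python
# Return Stack Buffer maintained at fetch time.
class ReturnStackBuffer:
    def __init__(self, depth=16):
        self.depth = depth
        self.stack = []

    def on_predicted_call(self, call_ip, call_len=5):
        if len(self.stack) == self.depth:
            self.stack.pop(0)                  # drop the oldest on overflow
        self.stack.append(call_ip + call_len)  # push the return address

    def on_predicted_return(self):
        # pop the predicted return address (None if the stack ran empty)
        return self.stack.pop() if self.stack else None

rsb = ReturnStackBuffer()
rsb.on_predicted_call(0x100)             # return address 0x105
rsb.on_predicted_call(0x200)             # nested call, return address 0x205
print(hex(rsb.on_predicted_return()))    # 0x205
print(hex(rsb.on_predicted_return()))    # 0x105
```

The stack discipline is what makes this work: nested calls return in last-in, first-out order, which a plain target array cannot capture.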

Backup

Intel 486, Pentium®, Pentium® II
- 486: statically predicts Not Taken
- Pentium®: 2-bit saturating counters
- Pentium® II: 2-level, local histories, per-set counters
  - 4-way set associative: 512 entries in 128 sets
  - (Figure: each entry holds a tag, a 4-bit history, the target, and P/T/V bits; the prediction is the m.s. bit of the selected 2-bit counter; replacement is LRR; each set also stores a branch type: 00 cond, 01 ret, 10 call, 11 uncond.)

Alpha 21264 - LG Chooser
- Local predictor: a history cache of 1024 entries (256 sets × 4 ways), each entry holding a 6-bit tag and a 10-bit history, which indexes 1024 local counters
- A new local entry is allocated when the global predictor mispredicts
- Global predictor: 4096 2-bit counters, indexed by a 12-bit GHR
- Chooser: 4096 state machines of 2 bits each: one bit saves whether the global predictor was correct/wrong last time, and the other bit whether the local predictor was correct/wrong
- Chooses local only if local was correct and global was wrong