Download presentation
Presentation is loading. Please wait.
1
Computer Structure Advanced Branch Prediction
Lihu Rappoport and Adi Yoaz
2
Adding a BTB to the Pipeline
Decode Execute Memory Fetch WB calc. target IP Data Cache Register File ALU Instruction src1 Inst. Cache address Sign Ext. src1 data src2 data + 4 src2 dst data BTB Next seq. address Predicted target Repair target Flush and Repair Predicted direction target direction ≠ allocate 8 50 50 4 taken taken or 0 or 4 jcc and 50 sub 54 mul 58 add jcc I$ provides the instruction bytes Lookup current IP in I$ and in BTB in parallel BTB provides predicted target and direction
3
Adding a BTB to the Pipeline
Decode Execute Memory Fetch WB calc. target IP Data Cache Register File ALU Instruction src1 Inst. Cache address Sign Ext. src1 data src2 data + 4 src2 dst data BTB Next seq. address Predicted target Repair target Flush and Repair Predicted direction target direction ≠ allocate 50 taken 8 54 50 jcc or 42 0 or 4 jcc and 50 sub 54 mul 58 add sub
4
Adding a BTB to the Pipeline
Issue flush in case of mismatch Decode Execute Memory Fetch WB calc. target IP Data Cache Register File ALU Instruction src1 Inst. Cache address Sign Ext. src1 data src2 data + 4 src2 dst data BTB Next seq. address Predicted target Repair target Flush and Repair Predicted direction target direction ≠ allocate Along with the repair IP Verify target (if taken) Verify direction 50 taken 8 50 taken 58 54 sub jcc or 0 or 4 jcc and 50 sub 54 mul 58 add mul
5
Branches and Performance
MPI : misprediction-per-instruction: # of incorrectly predicted branches MPI = total # of instructions MPI correlates well with performance. For example: MPI = 1% (1 out of 100 out of 20 branches) IPC=2 (IPC is the average number of instructions per cycle), flush penalty of 10 cycles We get: MPI = 1% flush in every 100 instructions flush in every 50 cycles (since IPC=2), 10 cycles flush penalty every 50 cycles 20% in performance
6
What/Who/When We Predict/Fix
Fetch Decode Execute Target Array Branch type conditional unconditional direct unconditional indirect call return Branch target Fix TA miss Fix wrong direct target Allocate TA Cond. Branch Predictor Predict conditional T/NT on TA miss try to fix Fix Wrong prediction Return Stack Buffer Predict return target Fix TA miss Fix Wrong prediction Indirect Target Array Predict indirect target override TA target on TA miss try to fix Fix Wrong prediction Dec Flush Exe Flush
7
The Target Array The TA is accessed using the branch address (branch IP) Implemented as an n-way set associative cache The TA predicts the following Instruction is a branch Predicted target Branch type Conditional: jump to target if predict taken Unconditional direct: take target Unconditional direct: if iTA hits, use its target Return: get target from Return Stack Buffer The TA is allocated/updated at EXE Tags are usually partial Trade-off space, can get false hits Few branches aliased to same entry No correctness, only performance Branch IP tag target predicted hit / miss (indicates a branch) type
8
Conditional Branch Direction Prediction
9
One-Bit Predictor Bit Array Prediction (at fetch):
l.s. bits of Branch IP Prediction (at fetch): previous branch outcome Bit Array Update bit with branch outcome (at execution) Problem: 1-bit predictor has a double mistake in loops Branch Outcome Prediction ?
10
Bimodal (2-bit) Predictor
A 2-bit counter avoids the double mistake in glitches Need “more evidence” to change prediction Initial state: weakly-taken (most branches are taken) Update (at execution) Branch was actually taken: increment counter (saturate at 11) Branch was actually not-taken: decrement counter (saturate at 00) Predict according to m.s.bit of counter (0 – NT, 1 – taken) Does not predict well branches with patterns like … 00 SNT taken not-taken not- 01 WNT 10 WT 11 ST Predict taken Predict not-taken
11
Bimodal Predictor (cont.)
Prediction = msb of counter 2-bit-sat counter array Update counter with branch outcome (at execution) l.s. bits of branch IP
12
Bimodal Predictor - example
Br1 prediction Pattern: counter: Prediction: Br2 prediction Pattern: counter: Prediction: Br3 prediction Pattern: counter: Code: int n = 6 Loop: …. br1: if (n/2) { … } br2: if ((n+1)/2) { … } n--; br3: JNZ n, Loop
13
2-Level Branch Prediction
Save branch direction history in a Branch History Register A shift-register that saves the last n outcomes of the branch History points to an array of 2n bits Bit pointed by a history specifies the predicted branch direction following that history Example: predicting the pattern History length n=3 000 001 010 011 100 101 110 111 1 Update at EXE Predict at Fetch
14
Shortest History To Predict a Pattern
What is the shortest history needed to perfectly predict the pattern … in steady state ? A history of length 3 predicts correctly: 1 1 1 0 0 A history of length 2 is wrong twice per iteration 0 1 1 1 0
15
Local History Predictor
There could be glitches in the pattern Example: Use 2-bit saturating counters instead of 1 bit to record outcome: The longer the history Warm-Up is longer Counter array becomes very big (and sparse) history Update History with branch outcome prediction = msb of counter 2-bit-sat counter array Update counter with branch outcome (at execution) BHR
16
Speculative History Updates
Deep pipeline many cycles between fetch and branch resolution If history is updated only at resolution (branch execute) Future occurrences of the same branch may be fetched before history is updated, and predicted with a stale history Update history speculatively according to prediction As long as the prediction of previous branches is correct, the history used for predicting the next branch is correct as well history Update History with prediction prediction 2-bit-sat counter array Speculative BHR
17
Speculative History Updates
If a branch is mispredicted History continues to be updated until the bad branch gets to execution Branches following it use a wrong history for their prediction But this does not matter, as they will be anyhow flushed Need to recover the branch history to its value before the bad branch Maintain a non-speculative copy of the history, updated at execute In case of misprediction, copy non-speculative history into speculative history history 2-bit-sat counter array Speculative BHR BHR At Fetch: Update History with prediction prediction At mis-prediction: Copy BHR to speculative BHR At Execution: Update counter + BHR with branch outcome
18
Speculative History Updates
The counter array is not updated speculatively Prediction can change only on a misprediction state 01→10 or 10→01 Counter array is too big to be recovered following misprediction Too expensive to maintain both speculative and non-speculative copies of the counter array
19
Local Predictor: private counter arrays
Holding BHRs and counter arrays for many branches: History Cache 2-bit-sat counter arrays prediction = msb of counter tag history Branch IP Update history with branch prediction / branch outcome Update counter with branch outcome Predictor size: #BHRs × (tag_size + history_size + 2 × 2 history_size) Example: #BHRs = 1024; tag_size=8; history_size=6 size=1024 × ( ×26) = 142Kbit
20
Local Predictor: shared counter arrays
Using a single counter array shared by all BHR’s All BHR’s index the same array Branches with similar patterns interfere with each other Interference can be either constructive or destructive prediction = msb of counter Branch IP 2-bit-sat counter array History Cache tag history Predictor size: #BHRs × (tag_size + history_size) + 2 × 2 history_size Example: #BHRs = 1024; tag_size=8; history_size=6 size=1024 × (8 + 6) + 2×26 = 14.1Kbit
21
Local Predictor: Lselect
Lselect reduces inter-branch-interference in the counter array Counter array becomes m times bigger All Branches with the same m IP l.s.bits interfere with each other prediction = msb of counter Branch IP 2-bit-sat counter array History Cache h h+m m l.s.bits of IP tag history Predictor size: #BHRs × (tag_size + history_size) + 2 × 2 history_size + m
22
Local Predictor: Lshare
Lshare reduces inter-branch-interference in the counter array: maps common patterns in different branches to different counters h h l.s.bits of IP History Cache prediction = msb of counter Branch IP 2-bit-sat counter array tag history Predictor size: #BHRs × (tag_size + history_size) + 2 × 2 history_size
23
Global Predictor The behavior of some branches is highly correlated
if (x < 1) . . . if (x > 1) . . . Using a single Global History Register (GHR) for all branches The prediction of the 2nd if is influenced by the direction of the 1st if History interference between non-correlated might hurt prediction With only a single history, can afford a long history The predictor size: history_size + 2*2 history_size history prediction = msb of counter 2-bit-sat counter array GHR
24
Global Predictor: Gshare
Gshare combines global history information with the branch IP The counter accessed is a function of the global history and of the specific branch being predicted / updated: Following this history, for this specific branch, the branch is taken/NT This turns to be extremely accurate and space-efficient Branch IP history prediction = msb of counter 2-bit-sat counter array GHR
25
XOR as a Hash Function XOR preserves the original history
XORing the history with a specific IP, provides 1-1 projection of the history If we know the IP we can re-create the history (by XORing again) We get exactly the same prediction as without XORing AND for example does not have this quality: for every 0 bit in the IP, the resulting bit is 0, regardless of the respective history bit XORing with different IPs create different History projections Distributing the information per IP XOR is a good distribution function If the original histories have the same number of 0s and 1s, so will the projected histories AND for example does not have this quality: more bits in the projected histories will become 0’s
26
Why Gshare Works So Well
Foil WIP Updating a single history with all branches’ outcomes might look like a mess, but … The numbers of branches active in the program at a given moment is usually small E.g., in case of short loop with no if statements in the loop body There is only one branch in the loop (the loop’s branch) Gshare behaves just like local history Assume 2 branches are active simultaneously: A and B Some of the bits in the GHR are for A, and some for B Due to the XOR with their IP’s, A and B update different counters For a given (local) history of branch A, it sometimes gets to
27
Foil WIP Assume the following code For i=1 to 1000000 { // jump A
For j=1 to 4 { // jump B If (cond) x++; }} // jump C The local patterns A: … // A is taken only after i gets to B: … C: c1c2c3c4 // some pattern, depending on cond The Global Pattern
28
Chooser A chooser selects between 2 predictors for each branch
Use the predictor that was more correct in the past The chooser array is an array of 2-bit saturating counters, indexed by the Branch IP (similar to the Bimodal array) Updated by which predictor is more correct GHR Counter Arrays history Global prediction Final prediction Bimodal prediction Branch IP Chooser array +1 if Bimodal correct and Global wrong -1 if Bimodal wrong and Global correct
29
Indirect Branch Target Prediction
Indirect branch targets: target given in a register Can have many targets: e.g., a case statement Resolved at execution high misprediction penalty Used in object-oriented code (C++, Java) A history-based indirect branch target predictor Uses the same global history used for conditional jumps Each Jump takes multiple entries more expensive than TA should only be used for jumps with multiple targets Indirect Target Array Branch IP history GHR
30
Indirect Branch Target Prediction
Initially allocate indirect branch only in the Target Array (TA) Many indirect branches have only one target If the TA mispredicts for an indirect jump Allocate an iTA entry with the current target Indexed by IP Global History Prediction from the iTA is used if TA indicates an indirect jump and iTA hits Target Array Indirect Target Predictor Branch IP Predicted Target TA hit type = indirect iTA hit HIT Global history
31
Return Stack Buffer A return instruction is a special case of an indirect branch Each times it jumps to a different target The target is determined by the location of the corresponding Call instruction Maintain a small stack of targets at fetch time When the Target Array predicts a Call Push the address of the instruction which follows the Call into the stack (this is the predicted Return address) When the Target Array predicts a Return Pop a target from the stack and use it as the Return address
32
Backup
33
Intel 486, Pentium®, Pentium® II
486 Statically predict Not Taken Pentium® 2-bit saturating counters Pentium® II 2-level, local histories, per-set counters 4-way set associative: 512 entries in 128 sets IP Tag Hist 1001 Pred= msb of counter 9 15 Way 0 Way 1 Target Way 2 Way 3 4 32 counters 128 sets P T V 2 1 LRR Per-Set Branch Type 00- cond 01- ret 10- call 11- uncond
34
Alpha 264 – LG Chooser Local New entry on the Local stage is allocated on a global stage miss-prediction Chooser state-machines: 2 bit each: one bit saves last time global correct/wrong, and the other bit saves for the local correct/wrong Chooses Local only if local was correct and global was wrong Histories Counters 256 In each entry: 6 bit tag + 10 bit History 1024 IP 3 4 ways Global Counters 4096 GHR 2 12 Chooser Counters 4096 2
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.