1
Advanced Microarchitecture
Lecture 5: Advanced Fetch
2
Branch Predictions Can Be Wrong
How and when do we detect a misprediction? What do we do about it?
- Resteer fetch to the correct address
- Hunt down and squash instructions from the wrong path
3
Example Control Flow
[Figure: control-flow graph with a branch "br" in block A; the correct path and the predicted path diverge through blocks B, C, D, E, F, G.]
4
Simple Pipeline
[Figure: a four-stage pipeline, Fetch (IF), Decode (ID), Dispatch (DP), Execute (EX), with a taken branch flowing through; by the time the mispredict is detected at EX, fetch groups from blocks B, D, ... are already in the pipeline.]
Multiple speculatively fetched basic blocks may be in-flight at the same time!
5
In More Detail
IF: direction prediction, target prediction.
ID: we now know whether the branch is a return, an indirect jump, or a phantom branch (using the RAS and iBTB).
- Squash instructions in BP and I$ lookup
- Resteer BP to the new target from the RAS/iBTB
(iBTB = indirect branch target buffer: just another BTB, but perhaps indexed with some additional information, e.g., branch history, instead of only the PC.)
Whether or not you can detect an indirect-target misprediction at register-read time depends on datapath assumptions. To do so, you would have to route the predicted target to somewhere near the RF, add a comparator there, and route the appropriate signals back to the front-end. It is probably easier to unify it all at execute so that direction and target mispredictions share the same misprediction recovery logic.
DP: if indirect target, can potentially read the target from the RF.
- Squash instructions in BP, I$, and ID
- Resteer BP to the target from the RF
EX: detect wrong direction, or wrong target (indirect).
- Squash instructions in BP, I$, ID, and DP, plus RS and ROB
- Resteer BP to the correct next PC
6
Phantom Branches
May occur when performing multiple branch predictions per cycle: the predictor produces 4 predictions corresponding to the 4 possible branches in the fetch group (e.g., instructions A, B, C, D).
With multiple branch prediction and no pre-decoding, it is possible (due to aliasing in the predictor(s), partial tags, etc.) to predict a taken branch when no branch even exists in the current fetch group.
Fetch: ABCX... (C appears to be a taken branch)
After fetch, we discover C cannot be taken because it is not even a branch; this is a phantom branch. We should have fetched ABCDZ...
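The phantom-branch check above can be sketched in a few lines. This is my own toy model, not the slides' hardware: the fetch group, opcode names, and the resteer-to-fall-through policy are illustrative assumptions.

```python
# Sketch: after decode, verify that the slot the predictor claimed was a
# taken branch actually decodes to a branch. If not, it is a phantom
# branch: squash the wrong-path fetches and resteer to the fall-through.

def check_phantom(fetch_group, taken_slot):
    """fetch_group: list of (pc, opcode); taken_slot: index the predictor
    claimed holds a taken branch, or None. Returns a resteer PC, or None."""
    if taken_slot is None:
        return None  # nothing predicted taken, nothing to check
    pc, opcode = fetch_group[taken_slot]
    if not opcode.startswith("BR"):
        # Phantom branch: "taken" was predicted for a non-branch.
        # Resteer to the address after the fetch group (fall-through).
        return fetch_group[-1][0] + 4
    return None  # a real branch; the direction may still be wrong

group = [(0x100, "ADD"), (0x104, "XOR"), (0x108, "SUB"), (0x10C, "LD")]
print(hex(check_phantom(group, 2)))  # slot 2 is SUB, not a branch: 0x110
```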
7
Hardware Organization
[Figure: front-end datapath with next-PC selection, I$, ID, BPred, BTB, iBTB, and a RAS (push on call, pop on return); control signals include "is indirect", "is return", "unconditional branch", "actual target", and "no branch"; the sequential next PC is PC + sizeof(I$ line); final redirection comes from EX.]
Note that the Zesto simulator has all prediction structures in the fetch stage (the RAS and iBTB are used in parallel with the bpred and regular BTB, similar to assuming the presence of some sort of decode prediction). We're not entirely sure where each predictor is located in real pipelines, but it's not too hard to think about what is necessary to make each possibility work.
8
Recovery
Squash instructions in the front-end pipeline by converting them to nops.
[Figure: pipeline stages IF, ID, DS, EX holding fetch groups (e.g., WXYZ, QRST, KLMN, EFGH); on "mispred!", the younger groups are replaced with nops.]
The nops are filtered out before dispatch, so they do not take up RS and ROB entries.
But what about instructions that are already in the RS, ROB, and LSQ?
9
Wait for Drain
Squash the in-order front-end (as before), then:
- Stall dispatch (no new instructions enter the ROB or RS)
- Let the OOO engine execute as usual
- Let commit operate as usual, except: check for the mispredicted branch and do not commit any instructions after it
- Once the mispredicted branch has committed, any remaining instructions in the ROB, RS, and LSQ must be on the wrong path; flush the OOO engine and allow dispatch to continue
This is slow, but to the best of my knowledge this is how it is still done in the Intel family of processors (it was definitely the case for the original P-Pro, according to Bob Colwell's chapter in Shen and Lipasti's book).
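The commit-side behavior of wait-for-drain can be sketched as follows. This is a minimal model of my own (it ignores the concurrent execution of the OOO engine and models only what commit does with the in-order ROB contents):

```python
# Sketch: commit walks the ROB in order; once it retires the mispredicted
# branch, everything still in the ROB must be wrong-path and is flushed.

def drain_recover(rob):
    """rob: in-order list of (inst, is_mispredicted_branch) tuples.
    Returns (committed, flushed)."""
    committed, flushed = [], []
    draining = False
    for inst, is_mispred in rob:
        if draining:
            flushed.append(inst)      # wrong-path: never commits
        else:
            committed.append(inst)    # commits as usual
            if is_mispred:
                draining = True       # nothing after this may commit
    return committed, flushed

rob = [("LOAD", False), ("BR", True), ("XOR", False), ("SUB", False)]
print(drain_recover(rob))  # (['LOAD', 'BR'], ['XOR', 'SUB'])
```

Note that the slow part in real hardware is not this loop: it is that commit cannot reach the branch until everything older (e.g., a load missing to DRAM) has retired.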
10
Wait for Drain (2)
Simple to implement, but there is a performance degradation: what if a load ahead of the mispredicted branch has a cache miss and goes to main memory? Recovery must wait for the load to retire.
[Figure: timeline comparing the ideal case (correct-path instructions restart right after the branch resolves) against drain-and-wait (wrong-path "junk" occupies the machine until the mispredicted branch finally commits behind the long-latency load).]
11
Branch Tags/IDs/Colors
Each instruction fetched is assigned the "current branch tag"; each predicted branch causes a new branch tag to be allocated (and becomes the current tag).
The following slides just discuss possible ways in which one could try to implement faster recovery mechanisms. I'm not aware of any processors that have actually used these.
[Figure: ROB entries labeled with tags 1 1 1 1 1 2 2 2 2 2 2 2 4 4 4 7 7 7 7 7 5 3 3 3 3. Tags might not be in any particular order, which could just be due to how tags are recycled and reassigned.]
12
Branch Tags (2)
On a misprediction, you only broadcast the tags after the mispredicted branch; any instruction whose tag matches a broadcast tag is squashed.
[Figure: ROB with tags 1...2...4...7...5...3 and tag list 1 2 4 7 5 3; on "mispred!", tags 7, 5, and 3 are broadcast.]
13
Overkill for ROB / LSQ
The ROB and LSQ keep instructions in program order (more on this in a future lecture), so all instructions physically after the mispredicted branch should be squashed. Simple! (Even this "simple" squashing isn't entirely trivial: the buffers are organized as circular queues, so the circuitry has to properly handle wrap-around.)
Tagging/coloring is useful for the RS, though: instructions in the RS may be in arbitrary order, and there may be multiple sets of RS's (e.g., separate integer and FP reservation stations).
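The contrast between positional squashing (ROB/LSQ) and tag-matching squashing (RS) can be sketched as below; the entry layout is an assumption of mine:

```python
# Sketch: the ROB is in program order, so squashing is a positional chop;
# the RS is unordered, so every entry compares its tag against every
# broadcast tag.

def squash_rob(rob, branch_index):
    """Keep everything up to and including the mispredicted branch.
    (Real hardware must also handle circular-queue wrap-around.)"""
    return rob[: branch_index + 1]

def squash_rs(rs_entries, broadcast_tags):
    """RS entries are in arbitrary order: each one needs a comparator
    per broadcast tag, which is where the quadratic area comes from."""
    return [e for e in rs_entries if e["tag"] not in broadcast_tags]

rs = [{"op": "ADD", "tag": 1}, {"op": "MUL", "tag": 7}, {"op": "LD", "tag": 3}]
print(squash_rs(rs, {7, 5, 3}))   # only the tag-1 ADD survives
print(squash_rob(["A", "B", "C", "D"], 1))  # ['A', 'B']
```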
14
Hardware Complexity
[Figure: each RS entry compares its own tag ("my tag") against every broadcast tag ("invalidate tag 0", "invalidate tag 1", "invalidate tag 2", ...) and ORs the comparator outputs into a squash signal. The broadcast network's height grows with the number of branch tags, and each entry's comparator bank's width grows with it too.]
Overall area overhead is quadratic in the tag count.
15
Simplifications
For a ROB with n entries, there could potentially be n different branches, each requiring a unique tag. In practice, only a fraction of instructions are branches, so limit the hardware to k < n tags instead; if a (k+1)st branch is fetched, dispatch must stall until a tag has been deallocated.
Worst case: the oldest branch mispredicts, so all subsequent tags have to be broadcast; now suppose all subsequent instructions are branches (one tag each). That's a huge number of broadcasts.
Limits on the number of in-flight branches may also be needed due to constraints on other resources (e.g., state for recovering speculative branch history register(s) after mispredictions).
16
Simplifications (2)
For k tags, a misprediction of the oldest branch may require broadcasting all of them, resulting in O(k^2) overhead. Instead, limit the hardware to (for example) one broadcast per cycle; in this example, recovering tags 7, 5, and 3 takes three cycles instead of one before fetch resumes.
Note that it's not necessary to recover in a single cycle: to avoid any performance penalty, you only need to have recovered by the time the newly fetched correct-path instructions make it to the dispatch point (i.e., if the front-end takes 5 cycles from fetch to dispatch, then recovering in 2 cycles doesn't buy you anything more than recovering in 4).
17
Branch Predictor Latency
To provide a continuous stream of instructions, the branch predictor must make one prediction every cycle. Pipelining? Nope: if the current prediction is not-taken, the next PC is A; if taken, the next PC is B. This dependency between successive predictions limits predictor size and latency; a smaller predictor is less accurate, and a slower one costs clock frequency.
18
Ahead Prediction
Normally: PC1 -> PC2 -> PC3 -> PC4 -> PC5 -> ..., where each arrow is a prediction that takes a single cycle and PCi is predicted from PCi-1.
Instead: run two interleaved chains, PC1 -> PC3 -> PC5 -> ... and PC2 -> PC4 -> .... Now PCi is predicted from PCi-2, so each prediction can take two cycles instead of one. In general, the predictor can be k-ahead pipelined.
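The two interleaved chains can be modeled in a few lines. This is a toy functional model of my own, assuming a `predict` function that maps a PC directly to the PC two fetches later:

```python
# Sketch: 2-ahead prediction as two interleaved chains, each predicting
# PC[i+2] from PC[i], so an individual prediction has two cycles to finish.

def ahead_predict(start_pcs, predict, n):
    """start_pcs: [PC1, PC2], the seeds of the two chains.
    predict: maps a PC to the PC two fetches later.
    Returns the first n fetch addresses."""
    chain = list(start_pcs)
    out = []
    for i in range(n):
        out.append(chain[i % 2])                 # fetch from this chain
        chain[i % 2] = predict(chain[i % 2])     # start PC(i+2) prediction
    return out

# With a trivial "fall through two 16-byte fetch groups" predictor, the
# two chains interleave back into the sequential fetch stream:
pcs = ahead_predict([0x00, 0x10], lambda pc: pc + 0x20, 6)
print([hex(p) for p in pcs])  # ['0x0', '0x10', '0x20', '0x30', '0x40', '0x50']
```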
19
Ahead Prediction Timing
[Figure: a 2-cycle ahead-pipelined branch predictor. In cycle k the fetch address is PCi while the predictor works on PCi+2; in cycle k+1 fetch is at PCi+1 while the predictor works on PCi+3; in cycle k+2 fetch is at PCi+2 while the predictor works on PCi+4.]
20
Ahead Prediction Misprediction
When a mispredict is detected, the new PC (NPC) is sent to the front-end. The address before NPC is the PC of the mispredicted branch, so the predictor's two chains must be reseeded:
Cycle k: mispredict detected
Cycle k+1: NPC to the I$; the mispredicted branch's PC to the predictor
Cycle k+2: I$ bubble; NPC to the predictor, producing the next-next PC (N2PC)
Cycle k+3: N2PC to the I$, and N2PC to the predictor
21
Overriding Branch Predictors
Use two branch predictors:
- The 1st has single-cycle latency (fast, medium accuracy)
- The 2nd has multi-cycle latency, but is more accurate
The second predictor can override the 1st prediction if it disagrees. Idea: it is better to pay a small number of bubbles (the difference between the 1st and 2nd predictors' latencies) than to pay for a full branch misprediction (full pipeline flush, 20+ cycles of delay).
22
Overriding Predictors (2)
[Figure: a fast 1st predictor, a 2-cycle pipelined I$, and a slower 2nd predictor. The fast predictor produces A, then B, then C while fetch proceeds; two cycles later the slow predictor produces A'. If A = A' (both predictors agree), done. If A != A', flush A, B, and C and restart fetch with A'.]
23
Benefit of Overriding Predictors
Assume a 1-cycle predictor with 80% accuracy, a 3-cycle predictor with 95% accuracy, and a misprediction penalty of 20 cycles. Fetch bubbles per branch (lower is better):
1-cycle pred only: 0.2 x 20 = 4
3-cycle pred only: 0.95 x 3 + 0.05 x 20 = 3.85
Overriding config: 0.8 x 0.95 x 0 + 0.2 x 0.95 x 3 + 0.2 x 0.05 x 20 + 0.8 x 0.05 x 23 = 1.69
(The four terms: both predictors correct; fast wrong but slow correct, costing a 3-cycle override; both wrong, costing the full 20-cycle mispredict; fast correct but slow wrong, costing an override plus a mispredict, 23 cycles.)
Worst case, the branch mispredict penalty is worse than without overriding predictors!
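The arithmetic above can be checked directly. The penalty model (3 bubbles per override, 20 per full mispredict, 23 when the slow predictor wrongly overrides a correct fast prediction) follows the slide's numbers:

```python
# Expected fetch bubbles per branch for three front-end configurations.

def bubbles_fast_only(acc_fast, mispredict=20):
    # Correct fast predictions cost nothing; mispredicts cost the flush.
    return (1 - acc_fast) * mispredict

def bubbles_slow_only(acc_slow, latency=3, mispredict=20):
    # Every prediction pays the slow predictor's latency.
    return acc_slow * latency + (1 - acc_slow) * mispredict

def bubbles_overriding(acc_fast, acc_slow, latency=3, mispredict=20):
    return (acc_fast * acc_slow * 0                              # both right
            + (1 - acc_fast) * acc_slow * latency                # override fixes it
            + (1 - acc_fast) * (1 - acc_slow) * mispredict       # both wrong
            + acc_fast * (1 - acc_slow) * (mispredict + latency))  # bad override

print(round(bubbles_fast_only(0.80), 2))          # 4.0
print(round(bubbles_slow_only(0.95), 2))          # 3.85
print(round(bubbles_overriding(0.80, 0.95), 2))   # 1.69
```

The last term is the "worst case" from the slide: the slow predictor overrides a correct fast prediction, paying both the override bubbles and the eventual full misprediction.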
24
Speculative Branch Update
The ideal branch prediction problem: given a PC, predict the branch outcome; given the actual outcome, update/train the predictor; repeat.
The actual problem: streams of predictions and updates proceed in parallel.
[Figure: timeline of predictions for branches A through G overlapping with the (much later) updates for the same branches.]
25
Speculative Branch Update (2)
The BHR update cannot be delayed until branch retirement without consequences. The BHR can't be updated until commit because the outcome isn't known until then, so branches B through E are all predicted with the same stale BHR value (011010); branch F would also likely use a stale BHR value unless the update from A happens at the very start of the cycle.
[Figure: predictions A through G each reading BHR 011010 until A's retirement finally updates it to 110101.]
26
Speculative Branch Update (3)
Update the branch history using the predictions themselves (speculative update). If the predictions are correct, then the BHR is correct; this effectively simulates alternating lookup and update with respect to the BHR.
So what if there's a misprediction? Checkpoint and recover.
27
Recovery of Speculative BHR
[Figure: bpred lookup reads a speculative BHR, while a separate retirement BHR is updated at bpred-update time, at commit. On "mispredict!", recovery happens during commit/retirement of the mispredicted branch by copying the retirement BHR into the speculative BHR.]
28
Execution-Time Recovery
Commit-time recovery may substantially delay branch misprediction recovery. Instead, have every branch checkpoint the BHR at the time it is predicted; on a mispredict, recover the speculative BHR from this checkpoint.
[Figure: a load misses the cache and goes to DRAM; a younger branch has executed (and mispredicted) but cannot recover until the load retires. The example shows how waiting until commit to recover can cost a lot of time.]
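The checkpoint-and-recover scheme can be sketched as a small class. The interface (a checkpoint per in-flight branch, restored at execution time) is my own modeling choice; real hardware would bound the number of checkpoints:

```python
# Sketch: speculative BHR with per-branch checkpoints, so recovery can
# happen when the branch executes rather than when it commits.

class SpecBHR:
    def __init__(self, bits=6):
        self.mask = (1 << bits) - 1
        self.bhr = 0
        self.checkpoints = {}  # branch id -> BHR value before its update

    def predict(self, branch_id, predicted_taken):
        # Checkpoint the BHR as it was when this branch predicted, then
        # speculatively shift in the predicted outcome.
        self.checkpoints[branch_id] = self.bhr
        self.bhr = ((self.bhr << 1) | int(predicted_taken)) & self.mask

    def mispredict(self, branch_id, actual_taken):
        # Execution-time recovery: restore this branch's checkpoint and
        # shift in the now-known actual outcome. No waiting for commit.
        self.bhr = ((self.checkpoints[branch_id] << 1)
                    | int(actual_taken)) & self.mask

h = SpecBHR()
h.predict("A", True)       # speculative BHR: 000001
h.predict("B", True)       # speculative BHR: 000011
h.mispredict("A", False)   # restore 000000, shift in actual 0
print(format(h.bhr, "06b"))  # 000000
```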
29
Traces
A "trace" is a dynamic stream of instructions.
[Figure: a static layout of basic blocks A through L, with example traces, the observed paths through the program, drawn over it.]
Done with recovery; moving on to other topics.
30
Trace Cache
The idea is to cache dynamic traces instead of static instructions. Fetching the trace A B C D E F G H I J from the I$ takes 5 cycles (one fetch group at a time: A B C D, then E F G, then H I J, ...); on a T$ hit, the whole trace A B C D E F G H I J is delivered in 1 cycle.
31
Hardware Organization
[Figure: the fetch address indexes the trace cache, the I$, the BPred, and the BTB in parallel. Hit logic checks the trace cache's tags, etc.; on a hit, the T$ supplies instructions, otherwise the I$ path supplies them through merge logic and BTB logic (mask, exchange, shift) into an instruction latch feeding the decoder. A line-fill buffer with fill control builds new traces.]
32
Tags, etc.
Each trace cache line's tag entry holds: the tag, the number of branches, the branch mask, the fall-through address, and the target address of the final branch. Example for fetch address A: tag = A, 3 branches, mask = 11,1 (branches 1 and 2 are both taken in this trace, and the trace ends in a branch), fall-through = X, target = Y.
33
Hit Logic, Next Address Selection
On a fetch of A: matching the tag (A) gives a match of the 1st block; ANDing in a comparison of the multiple-branch predictor's predictions against the branch mask gives a match of the remaining block(s); both together signal a trace hit. The next fetch address is then selected between the fall-through address (X) and the target (Y) according to the final branch's prediction. Making multiple branch predictions is discussed on the next slide.
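The hit logic can be sketched as below. The field names and the exact partial-hit rule are my assumptions, based on the slide's tag entry (tag A, 3 branches, mask 11,1, fall-through X, target Y):

```python
# Sketch: trace-cache hit logic. A full hit requires the tag to match and
# the predicted directions of the intermediate branches to match the
# stored branch mask; the final branch picks fall-through vs. target.

def trace_lookup(entry, fetch_pc, predictions):
    """predictions: predicted directions for the trace's branches,
    oldest first. Returns (status, next_fetch_pc)."""
    if entry["tag"] != fetch_pc:
        return None, None                        # trace miss
    inner = entry["mask"][: entry["n_br"] - 1]   # intermediate branches
    if list(predictions[: len(inner)]) != inner:
        return "partial", None                   # only a prefix matches
    last_taken = predictions[entry["n_br"] - 1]  # trace-ending branch
    next_pc = entry["target"] if last_taken else entry["fallthru"]
    return "hit", next_pc

# Slide example: tag A, 3 branches, intermediate branches both taken.
e = {"tag": "A", "n_br": 3, "mask": [1, 1], "fallthru": "X", "target": "Y"}
print(trace_lookup(e, "A", [1, 1, 0]))  # ('hit', 'X')
print(trace_lookup(e, "A", [1, 0, 1]))  # ('partial', None)
```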
34
Generating Multiple Predictions
Serialized access (predict, update the BHR, predict again) would be incredibly slow. Instead, make three predictions in parallel from the same BHR. The predictor must then be BHR-based only (no PC bits!), since the PCs of the intermediate branches are not yet known at prediction time.
35
Associativity
Set-associativity:
- Benefit: reduced miss rate
- Cost: access time, replacement complexity
Path/trace-associativity (e.g., storing both ABC and ABD):
- Benefit: possibly reduced miss rate, less trace-thrashing
- Cost: access time, replacement complexity, code duplication (the blocks A and B are stored twice)
36
Indexing the T$
Index the T$ with BHR bits in addition to the fetch address. This works if the path after AB consistently correlates with the path before AB (e.g., one history leads to trace ABC, another to ABD). It provides a benefit similar to path-associativity.
37
Trace Fill Unit Placement
Build trace at fetch: instructions coming from the I$ (on their way to decode) also enter a trace construction buffer, and the trace is stored into the T$ when complete.
Build trace at retire: instructions coming from the ROB at retirement enter a trace construction buffer, and the trace is stored into the T$ when complete.
38
Trace Fill Unit Placement (2)
At fetch:
- Speculative traces (built from branch predictions, not yet verified)
- Construction buffer management: while building ABC, a detected mispredict may mean the trace should be ABD; C must be found in the buffer, cleaned out, and D inserted
At retire:
- Non-speculative: all traces are "correct"
- No interaction with the branch predictor, and a simpler construction buffer
- Slower response time: the time from fetching ABC to retiring ABC may be long, and until retirement ABC is not in the T$, so fetch must use the I$
39
Trace Selection
Some traces may have poor temporal locality. Example: after A, one path is taken 97% of the time and the other only 3%; storing the rare trace ACD evicts ABD (assuming no path-associativity) but likely won't be useful. Alternative: use a trace filtering mechanism, at the cost of extra hardware.
40
Statistical Filtering [PACT 2005]
For each trace, insert with probability p < 1.0. Example: p = 0.05 (5% chance of insertion per trace), with a hot trace ABC seen 50 times and a cold trace XYZ seen twice.
Probability of ABC getting inserted: 1.0 - P(never inserted) = 1.0 - (0.95)^50 = 92.3% (a good chance that ABC gets into the T$).
Probability of XYZ getting inserted: 1.0 - (0.95)^2 = 9.75% (not so likely).
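The filtering arithmetic above is easy to verify: with per-occurrence insertion probability p, a trace seen n times enters the T$ with probability 1 - (1 - p)^n.

```python
# Statistical filtering: probability that a trace seen n times is
# eventually inserted, given per-occurrence insertion probability p.

def p_inserted(p, n):
    return 1.0 - (1.0 - p) ** n

p = 0.05
print(round(p_inserted(p, 50) * 100, 1))  # 92.3  (hot trace ABC, seen 50x)
print(round(p_inserted(p, 2) * 100, 2))   # 9.75  (cold trace XYZ, seen 2x)
```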
41
Partial Matches
If the T$ holds ABC but the branch predictor says the path is ABD, a partial hit can still supply the matching prefix (AB, or just A) instead of falling back to the I$.
- Benefit: more instructions delivered per fetch
- Cost: more complex hit logic, squashing logic, and storage for the targets of intermediate branches
42
Netburst (P4) Trace Cache
[Figure: the front-end BTB, iTLB, and prefetcher feed the decoder from the L2 cache; decoded instructions fill the trace cache (with its own trace BTB), which feeds rename, execute, etc.]
No I$!! The trace cache holds decoded instructions, and prediction is trace-based (predict the next trace, not the next PC).
43
Trace Prediction
Each trace has a unique identifier, analogous to but different from a conventional PC: effectively the starting PC plus the intra-trace branch directions. The trace predictor takes a trace-id as input and outputs a predicted next-trace-id. The trace cache is indexed with the trace-id, and the tag match is against the trace-id as well.
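One way to picture a trace-id is as the starting PC with the intra-trace branch directions packed into the low bits. The packing scheme and the predictor's table structure below are illustrative assumptions of mine, not the P4's actual format:

```python
# Sketch: a trace-id formed from (starting PC, intra-trace branch
# directions), and a next-trace predictor keyed by trace-id.

def make_trace_id(start_pc, directions):
    """Pack branch directions (oldest first) below the shifted PC."""
    bits = 0
    for d in directions:
        bits = (bits << 1) | int(d)
    return (start_pc << len(directions)) | bits

class TracePredictor:
    def __init__(self):
        self.table = {}  # trace-id -> predicted next trace-id

    def train(self, trace_id, next_trace_id):
        self.table[trace_id] = next_trace_id

    def predict(self, trace_id):
        return self.table.get(trace_id)  # None on a predictor miss

t1 = make_trace_id(0x400, [1, 1, 0])  # trace at 0x400: taken, taken, not-taken
t2 = make_trace_id(0x480, [0, 1])
tp = TracePredictor()
tp.train(t1, t2)
print(tp.predict(t1) == t2)  # True
```

Two traces starting at the same PC but with different internal branch directions get different ids, which is exactly why the id (rather than the PC) must be used for both the prediction and the T$ tag match.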
44
No I$, Decoded Trace Cache
No I$ means a T$ miss must pay the latency of an L2 access: a severe performance penalty for applications with poor trace locality. Storing decoded instructions removes the decode logic from the branch misprediction penalty:
Without T$: Misp -> Fetch -> Fetch -> Dec -> Dec -> Ren -> Disp -> Exec
With T$: Misp -> T$ -> T$ -> Ren -> Disp -> Exec (shorter mispredict penalty)