1
Advanced Microarchitecture
Lecture 5: Advanced Fetch
2
Branch Predictions Can Be Wrong
How and when do we detect a misprediction? What do we do about it?
- Resteer fetch to the correct address
- Hunt down and squash instructions from the wrong path
3
Example Control Flow
[Figure: control-flow graph with a branch "br" in block A; the correct path and the predicted path diverge through blocks B, C, D, E, F, G.]
4
Simple Pipeline
[Figure: a four-stage pipeline, Fetch (IF), Decode (ID), Dispatch (DP), Execute (EX), with a taken branch flowing through; by the time the mispredict is detected at EX, fetch groups from blocks B, D, ... are already in the pipeline.]
Multiple speculatively fetched basic blocks may be in-flight at the same time!
5
In More Detail
IF: direction prediction, target prediction.
ID: we now know whether the branch is a return, an indirect jump, or a phantom branch (using the RAS and iBTB).
- Squash instructions in BP and I$ lookup
- Resteer BP to the new target from the RAS/iBTB
(iBTB = indirect branch target buffer: just another BTB, but perhaps indexed with some additional information, e.g., branch history, instead of only the PC.)
Whether or not you can detect an indirect-target misprediction at register-read time depends on datapath assumptions. To do so, you would have to route the predicted target to somewhere near the RF, add a comparator there, and route the appropriate signals back to the front-end. It is probably easier to unify it all at execute so that direction and target mispredictions share the same misprediction recovery logic.
DP: if indirect target, can potentially read the target from the RF.
- Squash instructions in BP, I$, and ID
- Resteer BP to the target from the RF
EX: detect wrong direction, or wrong target (indirect).
- Squash instructions in BP, I$, ID, and DP, plus RS and ROB
- Resteer BP to the correct next PC
6
Phantom Branches
May occur when performing multiple branch predictions per cycle: the predictor produces 4 predictions corresponding to the 4 possible branches in the fetch group (e.g., instructions A, B, C, D).
With multiple branch prediction and no pre-decoding, it is possible (due to aliasing in the predictor(s), partial tags, etc.) to predict a taken branch when no branch even exists in the current fetch group.
Fetch: ABCX... (C appears to be a taken branch)
After fetch, we discover C cannot be taken because it is not even a branch; this is a phantom branch. We should have fetched ABCDZ...
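The phantom-branch check above can be sketched in a few lines. This is my own toy model, not the slides' hardware: the fetch group, opcode names, and the resteer-to-fall-through policy are illustrative assumptions.

```python
# Sketch: after decode, verify that the slot the predictor claimed was a
# taken branch actually decodes to a branch. If not, it is a phantom
# branch: squash the wrong-path fetches and resteer to the fall-through.

def check_phantom(fetch_group, taken_slot):
    """fetch_group: list of (pc, opcode); taken_slot: index the predictor
    claimed holds a taken branch, or None. Returns a resteer PC, or None."""
    if taken_slot is None:
        return None  # nothing predicted taken, nothing to check
    pc, opcode = fetch_group[taken_slot]
    if not opcode.startswith("BR"):
        # Phantom branch: "taken" was predicted for a non-branch.
        # Resteer to the address after the fetch group (fall-through).
        return fetch_group[-1][0] + 4
    return None  # a real branch; the direction may still be wrong

group = [(0x100, "ADD"), (0x104, "XOR"), (0x108, "SUB"), (0x10C, "LD")]
print(hex(check_phantom(group, 2)))  # slot 2 is SUB, not a branch: 0x110
```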
7
Hardware Organization
[Figure: front-end datapath with next-PC selection, I$, ID, BPred, BTB, iBTB, and a RAS (push on call, pop on return); control signals include "is indirect", "is return", "unconditional branch", "actual target", and "no branch"; the sequential next PC is PC + sizeof(I$ line); final redirection comes from EX.]
Note that the Zesto simulator has all prediction structures in the fetch stage (the RAS and iBTB are used in parallel with the bpred and regular BTB, similar to assuming the presence of some sort of decode prediction). We're not entirely sure where each predictor is located in real pipelines, but it's not too hard to think about what is necessary to make each possibility work.
8
Recovery
Squash instructions in the front-end pipeline by converting them to nops.
[Figure: pipeline stages IF, ID, DS, EX holding fetch groups (e.g., WXYZ, QRST, KLMN, EFGH); on "mispred!", the younger groups are replaced with nops.]
The nops are filtered out before dispatch, so they do not take up RS and ROB entries.
But what about instructions that are already in the RS, ROB, and LSQ?
9
Wait for Drain
Squash the in-order front-end (as before), then:
- Stall dispatch (no new instructions enter the ROB or RS)
- Let the OOO engine execute as usual
- Let commit operate as usual, except: check for the mispredicted branch and do not commit any instructions after it
- Once the mispredicted branch has committed, any remaining instructions in the ROB, RS, and LSQ must be on the wrong path; flush the OOO engine and allow dispatch to continue
This is slow, but to the best of my knowledge this is how it is still done in the Intel family of processors (it was definitely the case for the original P-Pro, according to Bob Colwell's chapter in Shen and Lipasti's book).
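The commit-side behavior of wait-for-drain can be sketched as follows. This is a minimal model of my own (it ignores the concurrent execution of the OOO engine and models only what commit does with the in-order ROB contents):

```python
# Sketch: commit walks the ROB in order; once it retires the mispredicted
# branch, everything still in the ROB must be wrong-path and is flushed.

def drain_recover(rob):
    """rob: in-order list of (inst, is_mispredicted_branch) tuples.
    Returns (committed, flushed)."""
    committed, flushed = [], []
    draining = False
    for inst, is_mispred in rob:
        if draining:
            flushed.append(inst)      # wrong-path: never commits
        else:
            committed.append(inst)    # commits as usual
            if is_mispred:
                draining = True       # nothing after this may commit
    return committed, flushed

rob = [("LOAD", False), ("BR", True), ("XOR", False), ("SUB", False)]
print(drain_recover(rob))  # (['LOAD', 'BR'], ['XOR', 'SUB'])
```

Note that the slow part in real hardware is not this loop: it is that commit cannot reach the branch until everything older (e.g., a load missing to DRAM) has retired.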
10
Wait for Drain (2)
Simple to implement, but there is a performance degradation: what if a load ahead of the mispredicted branch has a cache miss and goes to main memory? Recovery must wait for the load to retire.
[Figure: timeline comparing the ideal case (correct-path instructions restart right after the branch resolves) against drain-and-wait (wrong-path "junk" occupies the machine until the mispredicted branch finally commits behind the long-latency load).]
11
Branch Tags/IDs/Colors
Each instruction fetched is assigned the "current branch tag"; each predicted branch causes a new branch tag to be allocated (and becomes the current tag).
The following slides just discuss possible ways in which one could try to implement faster recovery mechanisms. I'm not aware of any processors that have actually used these.
[Figure: ROB entries labeled with tags 1 1 1 1 1 2 2 2 2 2 2 2 4 4 4 7 7 7 7 7 5 3 3 3 3. Tags might not be in any particular order, which could just be due to how tags are recycled and reassigned.]
12
Branch Tags (2)
On a misprediction, you only broadcast the tags after the mispredicted branch; any instruction whose tag matches a broadcast tag is squashed.
[Figure: ROB with tags 1...2...4...7...5...3 and tag list 1 2 4 7 5 3; on "mispred!", tags 7, 5, and 3 are broadcast.]
13
Overkill for ROB / LSQ
The ROB and LSQ keep instructions in program order (more on this in a future lecture), so all instructions physically after the mispredicted branch should be squashed. Simple! (Even this "simple" squashing isn't entirely trivial: the buffers are organized as circular queues, so the circuitry has to properly handle wrap-around.)
Tagging/coloring is useful for the RS, though: instructions in the RS may be in arbitrary order, and there may be multiple sets of RS's (e.g., separate integer and FP reservation stations).
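The contrast between positional squashing (ROB/LSQ) and tag-matching squashing (RS) can be sketched as below; the entry layout is an assumption of mine:

```python
# Sketch: the ROB is in program order, so squashing is a positional chop;
# the RS is unordered, so every entry compares its tag against every
# broadcast tag.

def squash_rob(rob, branch_index):
    """Keep everything up to and including the mispredicted branch.
    (Real hardware must also handle circular-queue wrap-around.)"""
    return rob[: branch_index + 1]

def squash_rs(rs_entries, broadcast_tags):
    """RS entries are in arbitrary order: each one needs a comparator
    per broadcast tag, which is where the quadratic area comes from."""
    return [e for e in rs_entries if e["tag"] not in broadcast_tags]

rs = [{"op": "ADD", "tag": 1}, {"op": "MUL", "tag": 7}, {"op": "LD", "tag": 3}]
print(squash_rs(rs, {7, 5, 3}))   # only the tag-1 ADD survives
print(squash_rob(["A", "B", "C", "D"], 1))  # ['A', 'B']
```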
14
Hardware Complexity
[Figure: each RS entry compares its own tag ("my tag") against every broadcast tag ("invalidate tag 0", "invalidate tag 1", "invalidate tag 2", ...) and ORs the comparator outputs into a squash signal. The broadcast network's height grows with the number of branch tags, and each entry's comparator bank's width grows with it too.]
Overall area overhead is quadratic in the tag count.
15
Simplifications
For a ROB with n entries, there could potentially be n different branches, each requiring a unique tag. In practice, only a fraction of instructions are branches, so limit the hardware to k < n tags instead; if a (k+1)st branch is fetched, dispatch must stall until a tag has been deallocated.
Worst case: the oldest branch mispredicts, so all subsequent tags have to be broadcast; now suppose all subsequent instructions are branches (one tag each). That's a huge number of broadcasts.
Limits on the number of in-flight branches may also be needed due to constraints on other resources (e.g., state for recovering speculative branch history register(s) after mispredictions).
16
Simplifications (2)
For k tags, a misprediction of the oldest branch may require broadcasting all of them, resulting in O(k^2) overhead. Instead, limit the hardware to (for example) one broadcast per cycle; in this example, recovering tags 7, 5, and 3 takes three cycles instead of one before fetch resumes.
Note that it's not necessary to recover in a single cycle: to avoid any performance penalty, you only need to have recovered by the time the newly fetched correct-path instructions make it to the dispatch point (i.e., if the front-end takes 5 cycles from fetch to dispatch, then recovering in 2 cycles doesn't buy you anything more than recovering in 4).
17
Branch Predictor Latency
To provide a continuous stream of instructions, the branch predictor must make one prediction every cycle. Pipelining? Nope: if the current prediction is not-taken, the next PC is A; if taken, the next PC is B. This dependency between successive predictions limits predictor size and latency; a smaller predictor is less accurate, and a slower one costs clock frequency.
18
Ahead Prediction
Normally: PC1 -> PC2 -> PC3 -> PC4 -> PC5 -> ..., where each arrow is a prediction that takes a single cycle and PCi is predicted from PCi-1.
Instead: run two interleaved chains, PC1 -> PC3 -> PC5 -> ... and PC2 -> PC4 -> .... Now PCi is predicted from PCi-2, so each prediction can take two cycles instead of one. In general, the predictor can be k-ahead pipelined.
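The two interleaved chains can be modeled in a few lines. This is a toy functional model of my own, assuming a `predict` function that maps a PC directly to the PC two fetches later:

```python
# Sketch: 2-ahead prediction as two interleaved chains, each predicting
# PC[i+2] from PC[i], so an individual prediction has two cycles to finish.

def ahead_predict(start_pcs, predict, n):
    """start_pcs: [PC1, PC2], the seeds of the two chains.
    predict: maps a PC to the PC two fetches later.
    Returns the first n fetch addresses."""
    chain = list(start_pcs)
    out = []
    for i in range(n):
        out.append(chain[i % 2])                 # fetch from this chain
        chain[i % 2] = predict(chain[i % 2])     # start PC(i+2) prediction
    return out

# With a trivial "fall through two 16-byte fetch groups" predictor, the
# two chains interleave back into the sequential fetch stream:
pcs = ahead_predict([0x00, 0x10], lambda pc: pc + 0x20, 6)
print([hex(p) for p in pcs])  # ['0x0', '0x10', '0x20', '0x30', '0x40', '0x50']
```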
19
Ahead Prediction Timing
[Figure: a 2-cycle ahead-pipelined branch predictor. In cycle k the fetch address is PCi while the predictor works on PCi+2; in cycle k+1 fetch is at PCi+1 while the predictor works on PCi+3; in cycle k+2 fetch is at PCi+2 while the predictor works on PCi+4.]
20
Ahead Prediction Misprediction
When a mispredict is detected, the new PC (NPC) is sent to the front-end. The address before NPC is the PC of the mispredicted branch, so the predictor's two chains must be reseeded:
Cycle k: mispredict detected
Cycle k+1: NPC to the I$; the mispredicted branch's PC to the predictor
Cycle k+2: I$ bubble; NPC to the predictor, producing the next-next PC (N2PC)
Cycle k+3: N2PC to the I$, and N2PC to the predictor
21
Overriding Branch Predictors
Use two branch predictors:
- The 1st has single-cycle latency (fast, medium accuracy)
- The 2nd has multi-cycle latency, but is more accurate
The second predictor can override the 1st prediction if it disagrees. Idea: it is better to pay a small number of bubbles (the difference between the 1st and 2nd predictors' latencies) than to pay for a full branch misprediction (full pipeline flush, 20+ cycles of delay).
22
Overriding Predictors (2)
[Figure: a fast 1st predictor, a 2-cycle pipelined I$, and a slower 2nd predictor. The fast predictor produces A, then B, then C while fetch proceeds; two cycles later the slow predictor produces A'. If A = A' (both predictors agree), done. If A != A', flush A, B, and C and restart fetch with A'.]
23
Benefit of Overriding Predictors
Assume a 1-cycle predictor with 80% accuracy, a 3-cycle predictor with 95% accuracy, and a misprediction penalty of 20 cycles. Fetch bubbles per branch (lower is better):
1-cycle pred only: 0.2 x 20 = 4
3-cycle pred only: 0.95 x 3 + 0.05 x 20 = 3.85
Overriding config: 0.8 x 0.95 x 0 + 0.2 x 0.95 x 3 + 0.2 x 0.05 x 20 + 0.8 x 0.05 x 23 = 1.69
(The four terms: both predictors correct; fast wrong but slow correct, costing a 3-cycle override; both wrong, costing the full 20-cycle mispredict; fast correct but slow wrong, costing an override plus a mispredict, 23 cycles.)
Worst case, the branch mispredict penalty is worse than without overriding predictors!
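The arithmetic above can be checked directly. The penalty model (3 bubbles per override, 20 per full mispredict, 23 when the slow predictor wrongly overrides a correct fast prediction) follows the slide's numbers:

```python
# Expected fetch bubbles per branch for three front-end configurations.

def bubbles_fast_only(acc_fast, mispredict=20):
    # Correct fast predictions cost nothing; mispredicts cost the flush.
    return (1 - acc_fast) * mispredict

def bubbles_slow_only(acc_slow, latency=3, mispredict=20):
    # Every prediction pays the slow predictor's latency.
    return acc_slow * latency + (1 - acc_slow) * mispredict

def bubbles_overriding(acc_fast, acc_slow, latency=3, mispredict=20):
    return (acc_fast * acc_slow * 0                              # both right
            + (1 - acc_fast) * acc_slow * latency                # override fixes it
            + (1 - acc_fast) * (1 - acc_slow) * mispredict       # both wrong
            + acc_fast * (1 - acc_slow) * (mispredict + latency))  # bad override

print(round(bubbles_fast_only(0.80), 2))          # 4.0
print(round(bubbles_slow_only(0.95), 2))          # 3.85
print(round(bubbles_overriding(0.80, 0.95), 2))   # 1.69
```

The last term is the "worst case" from the slide: the slow predictor overrides a correct fast prediction, paying both the override bubbles and the eventual full misprediction.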
24
Speculative Branch Update
The ideal branch prediction problem: given a PC, predict the branch outcome; given the actual outcome, update/train the predictor; repeat.
The actual problem: streams of predictions and updates proceed in parallel.
[Figure: timeline of predictions for branches A through G overlapping with the (much later) updates for the same branches.]
25
Speculative Branch Update (2)
The BHR update cannot be delayed until branch retirement without consequences. The BHR can't be updated until commit because the outcome isn't known until then, so branches B through E are all predicted with the same stale BHR value (011010); branch F would also likely use a stale BHR value unless the update from A happens at the very start of the cycle.
[Figure: predictions A through G each reading BHR 011010 until A's retirement finally updates it to 110101.]
26
Speculative Branch Update (3)
Update the branch history using the predictions themselves (speculative update). If the predictions are correct, then the BHR is correct; this effectively simulates alternating lookup and update with respect to the BHR.
So what if there's a misprediction? Checkpoint and recover.
27
Recovery of Speculative BHR
[Figure: bpred lookup reads a speculative BHR, while a separate retirement BHR is updated at bpred-update time, at commit. On "mispredict!", recovery happens during commit/retirement of the mispredicted branch by copying the retirement BHR into the speculative BHR.]
28
Execution-Time Recovery
Commit-time recovery may substantially delay branch misprediction recovery. Instead, have every branch checkpoint the BHR at the time it is predicted; on a mispredict, recover the speculative BHR from this checkpoint.
[Figure: a load misses the cache and goes to DRAM; a younger branch has executed (and mispredicted) but cannot recover until the load retires. The example shows how waiting until commit to recover can cost a lot of time.]
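The checkpoint-and-recover scheme can be sketched as a small class. The interface (a checkpoint per in-flight branch, restored at execution time) is my own modeling choice; real hardware would bound the number of checkpoints:

```python
# Sketch: speculative BHR with per-branch checkpoints, so recovery can
# happen when the branch executes rather than when it commits.

class SpecBHR:
    def __init__(self, bits=6):
        self.mask = (1 << bits) - 1
        self.bhr = 0
        self.checkpoints = {}  # branch id -> BHR value before its update

    def predict(self, branch_id, predicted_taken):
        # Checkpoint the BHR as it was when this branch predicted, then
        # speculatively shift in the predicted outcome.
        self.checkpoints[branch_id] = self.bhr
        self.bhr = ((self.bhr << 1) | int(predicted_taken)) & self.mask

    def mispredict(self, branch_id, actual_taken):
        # Execution-time recovery: restore this branch's checkpoint and
        # shift in the now-known actual outcome. No waiting for commit.
        self.bhr = ((self.checkpoints[branch_id] << 1)
                    | int(actual_taken)) & self.mask

h = SpecBHR()
h.predict("A", True)       # speculative BHR: 000001
h.predict("B", True)       # speculative BHR: 000011
h.mispredict("A", False)   # restore 000000, shift in actual 0
print(format(h.bhr, "06b"))  # 000000
```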
29
Traces
A "trace" is a dynamic stream of instructions.
[Figure: a static layout of basic blocks A through L, with example traces, the observed paths through the program, drawn over it.]
Done with recovery; moving on to other topics.
30
Trace Cache
The idea is to cache dynamic traces instead of static instructions. Fetching the trace A B C D E F G H I J from the I$ takes 5 cycles (one fetch group at a time: A B C D, then E F G, then H I J, ...); on a T$ hit, the whole trace A B C D E F G H I J is delivered in 1 cycle.
31
Hardware Organization
[Figure: the fetch address indexes the trace cache, the I$, the BPred, and the BTB in parallel. Hit logic checks the trace cache's tags, etc.; on a hit, the T$ supplies instructions, otherwise the I$ path supplies them through merge logic and BTB logic (mask, exchange, shift) into an instruction latch feeding the decoder. A line-fill buffer with fill control builds new traces.]
32
Tags, etc.
Each trace cache line's tag entry holds: the tag, the number of branches, the branch mask, the fall-through address, and the target address of the final branch. Example for fetch address A: tag = A, 3 branches, mask = 11,1 (branches 1 and 2 are both taken in this trace, and the trace ends in a branch), fall-through = X, target = Y.
33
Hit Logic, Next Address Selection
On a fetch of A: matching the tag (A) gives a match of the 1st block; ANDing in a comparison of the multiple-branch predictor's predictions against the branch mask gives a match of the remaining block(s); both together signal a trace hit. The next fetch address is then selected between the fall-through address (X) and the target (Y) according to the final branch's prediction. Making multiple branch predictions is discussed on the next slide.
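The hit logic can be sketched as below. The field names and the exact partial-hit rule are my assumptions, based on the slide's tag entry (tag A, 3 branches, mask 11,1, fall-through X, target Y):

```python
# Sketch: trace-cache hit logic. A full hit requires the tag to match and
# the predicted directions of the intermediate branches to match the
# stored branch mask; the final branch picks fall-through vs. target.

def trace_lookup(entry, fetch_pc, predictions):
    """predictions: predicted directions for the trace's branches,
    oldest first. Returns (status, next_fetch_pc)."""
    if entry["tag"] != fetch_pc:
        return None, None                        # trace miss
    inner = entry["mask"][: entry["n_br"] - 1]   # intermediate branches
    if list(predictions[: len(inner)]) != inner:
        return "partial", None                   # only a prefix matches
    last_taken = predictions[entry["n_br"] - 1]  # trace-ending branch
    next_pc = entry["target"] if last_taken else entry["fallthru"]
    return "hit", next_pc

# Slide example: tag A, 3 branches, intermediate branches both taken.
e = {"tag": "A", "n_br": 3, "mask": [1, 1], "fallthru": "X", "target": "Y"}
print(trace_lookup(e, "A", [1, 1, 0]))  # ('hit', 'X')
print(trace_lookup(e, "A", [1, 0, 1]))  # ('partial', None)
```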
34
Generating Multiple Predictions
Serialized access (predict, update the BHR, predict again) would be incredibly slow. Instead, make three predictions in parallel from the same BHR. The predictor must then be BHR-based only (no PC bits!), since the PCs of the intermediate branches are not yet known at prediction time.
35
Associativity
Set-associativity:
- Benefit: reduced miss rate
- Cost: access time, replacement complexity
Path/trace-associativity (e.g., storing both ABC and ABD):
- Benefit: possibly reduced miss rate, less trace-thrashing
- Cost: access time, replacement complexity, code duplication (the blocks A and B are stored twice)
36
Indexing the T$
Index the T$ with BHR bits in addition to the fetch address. This works if the path after AB consistently correlates with the path before AB (e.g., one history leads to trace ABC, another to ABD). It provides a benefit similar to path-associativity.
37
Trace Fill Unit Placement
Build trace at fetch: instructions coming from the I$ (on their way to decode) also enter a trace construction buffer, and the trace is stored into the T$ when complete.
Build trace at retire: instructions coming from the ROB at retirement enter a trace construction buffer, and the trace is stored into the T$ when complete.
38
Trace Fill Unit Placement (2)
At fetch:
- Speculative traces (built from branch predictions, not yet verified)
- Construction buffer management: while building ABC, a detected mispredict may mean the trace should be ABD; C must be found in the buffer, cleaned out, and D inserted
At retire:
- Non-speculative: all traces are "correct"
- No interaction with the branch predictor, and a simpler construction buffer
- Slower response time: the time from fetching ABC to retiring ABC may be long, and until retirement ABC is not in the T$, so fetch must use the I$
39
Trace Selection
Some traces may have poor temporal locality. Example: after A, one path is taken 97% of the time and the other only 3%; storing the rare trace ACD evicts ABD (assuming no path-associativity) but likely won't be useful. Alternative: use a trace filtering mechanism, at the cost of extra hardware.
40
Statistical Filtering [PACT 2005]
For each trace, insert with probability p < 1.0. Example: p = 0.05 (5% chance of insertion per trace), with a hot trace ABC seen 50 times and a cold trace XYZ seen twice.
Probability of ABC getting inserted: 1.0 - P(never inserted) = 1.0 - (0.95)^50 = 92.3% (a good chance that ABC gets into the T$).
Probability of XYZ getting inserted: 1.0 - (0.95)^2 = 9.75% (not so likely).
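The filtering arithmetic above is easy to verify: with per-occurrence insertion probability p, a trace seen n times enters the T$ with probability 1 - (1 - p)^n.

```python
# Statistical filtering: probability that a trace seen n times is
# eventually inserted, given per-occurrence insertion probability p.

def p_inserted(p, n):
    return 1.0 - (1.0 - p) ** n

p = 0.05
print(round(p_inserted(p, 50) * 100, 1))  # 92.3  (hot trace ABC, seen 50x)
print(round(p_inserted(p, 2) * 100, 2))   # 9.75  (cold trace XYZ, seen 2x)
```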
41
Partial Matches
If the T$ holds ABC but the branch predictor says the path is ABD, a partial hit can still supply the matching prefix (AB, or just A) instead of falling back to the I$.
- Benefit: more instructions delivered per fetch
- Cost: more complex hit logic, squashing logic, and storage for the targets of intermediate branches
42
Netburst (P4) Trace Cache
[Figure: the front-end BTB, iTLB, and prefetcher feed the decoder from the L2 cache; decoded instructions fill the trace cache (with its own trace BTB), which feeds rename, execute, etc.]
No I$!! The trace cache holds decoded instructions, and prediction is trace-based (predict the next trace, not the next PC).
43
Trace Prediction
Each trace has a unique identifier, analogous to but different from a conventional PC: effectively the starting PC plus the intra-trace branch directions. The trace predictor takes a trace-id as input and outputs a predicted next-trace-id. The trace cache is indexed with the trace-id, and the tag match is against the trace-id as well.
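One way to picture a trace-id is as the starting PC with the intra-trace branch directions packed into the low bits. The packing scheme and the predictor's table structure below are illustrative assumptions of mine, not the P4's actual format:

```python
# Sketch: a trace-id formed from (starting PC, intra-trace branch
# directions), and a next-trace predictor keyed by trace-id.

def make_trace_id(start_pc, directions):
    """Pack branch directions (oldest first) below the shifted PC."""
    bits = 0
    for d in directions:
        bits = (bits << 1) | int(d)
    return (start_pc << len(directions)) | bits

class TracePredictor:
    def __init__(self):
        self.table = {}  # trace-id -> predicted next trace-id

    def train(self, trace_id, next_trace_id):
        self.table[trace_id] = next_trace_id

    def predict(self, trace_id):
        return self.table.get(trace_id)  # None on a predictor miss

t1 = make_trace_id(0x400, [1, 1, 0])  # trace at 0x400: taken, taken, not-taken
t2 = make_trace_id(0x480, [0, 1])
tp = TracePredictor()
tp.train(t1, t2)
print(tp.predict(t1) == t2)  # True
```

Two traces starting at the same PC but with different internal branch directions get different ids, which is exactly why the id (rather than the PC) must be used for both the prediction and the T$ tag match.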
44
No I$, Decoded Trace Cache
No I$ means a T$ miss must pay the latency of an L2 access: a severe performance penalty for applications with poor trace locality. Storing decoded instructions removes the decode logic from the branch misprediction penalty:
Without T$: Misp -> Fetch -> Fetch -> Dec -> Dec -> Ren -> Disp -> Exec
With T$: Misp -> T$ -> T$ -> Ren -> Disp -> Exec (shorter mispredict penalty)