Advanced Microarchitecture


Advanced Microarchitecture Lecture 5: Advanced Fetch

Branch Predictions Can Be Wrong
How/when do we detect a misprediction, and what do we do about it?
- Resteer fetch to the correct address
- Hunt down and squash instructions from the wrong path

Example Control Flow
[Figure: control-flow graph with basic blocks A through G and a branch "br"; the correct path and the predicted path diverge at the branch.]

Simple Pipeline
[Figure: four-stage pipeline (Fetch (IF), Decode (ID), Dispatch (DP), Execute (EX)) with speculatively fetched blocks A, B, D in flight when the misprediction is detected at EX.]
Multiple speculatively fetched basic blocks may be in-flight at the same time!

In More Detail
IF: direction prediction, target prediction.
ID: we now know whether the branch is a return, an indirect jump, or a phantom branch.
- Squash instructions in BP and I$-lookup; resteer BP to the new target from the RAS/iBTB
- iBTB = indirect branch target buffer: just another BTB, but perhaps indexed with some additional information (e.g., branch history) instead of only the PC
DP: if the branch is an indirect jump, we can potentially read its target from the RF.
- Squash instructions in BP, I$ and ID; resteer BP to the target from the RF
- Whether you can detect an indirect-target misprediction at register-read time depends on datapath assumptions: you would have to route the predicted target to somewhere near the RF, add a comparator there, and route the resteer signals back to the front-end. It's probably easier to unify it all at execute so that direction and target mispredictions share the same misprediction recovery logic.
EX: detect a wrong direction, or a wrong target (indirect).
- Squash instructions in BP, I$, ID and DP, plus RS and ROB; resteer BP to the correct next PC
(A sketch of the growing squash scope follows.)
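A minimal sketch of the stage-by-stage squash scope, assuming a hypothetical `pipeline` object; the structure names are mine, not from the slides:

```python
# Hypothetical sketch: the later a bad prediction is caught, the more
# pipeline state must be squashed before fetch is resteered.
SQUASH_SCOPE = {
    "ID": ["BP", "I$"],                           # return/indirect/phantom caught at decode
    "DP": ["BP", "I$", "ID"],                     # indirect target read from the RF
    "EX": ["BP", "I$", "ID", "DP", "RS", "ROB"],  # wrong direction or wrong target
}

def resteer(detect_stage, correct_target, pipeline):
    for stage in SQUASH_SCOPE[detect_stage]:
        pipeline.squash(stage)              # kill wrong-path instructions there
    pipeline.redirect_fetch(correct_target)
```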

Phantom Branches
May occur when performing multiple branch predictions per cycle.
[Figure: the PC indexes the BPred, producing 4 predictions (N, N, T, T) corresponding to the 4 possible branches in the fetch group A B C D; the I$ fetch returns BR, XOR, ADD, ...]
With multiple branch prediction and no pre-decoding, it's possible (due to aliasing in the predictor(s), partial tags, etc.) to predict a taken branch where a branch does not even exist in the current fetch group.
- Fetch: A B C X ... (C appears to be a taken branch)
- After fetch, we discover C cannot be taken because it is not even a branch! This is a phantom branch.
- Should have fetched: A B C D Z ...
(A sketch of the post-fetch check follows.)
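A minimal sketch of the check, assuming a hypothetical `is_branch` predicate supplied by (pre)decode; the function and its names are illustrative only:

```python
def check_phantom(fetch_group, pred_taken_slot, is_branch):
    """fetch_group: instructions just fetched; pred_taken_slot: slot index
    the predictor claimed holds a taken branch (None if nothing predicted
    taken); is_branch: predicate from (pre)decode."""
    if pred_taken_slot is not None and not is_branch(fetch_group[pred_taken_slot]):
        # Phantom branch: aliasing/partial tags made the predictor see a
        # taken branch where none exists; squash the redirected fetch and
        # refetch the sequential fall-through path instead.
        return "phantom: squash and refetch fall-through"
    return "ok"
```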

Hardware Organization
[Figure: fetch datapath. NPC/PC feed the I$ and the BPred/BTB; decode produces "is indir", "is retn", "uncond br", "actual target", and "no branch" signals; a RAS (push on call, pop on return), an iBTB, and PC + sizeof(I$-line) feed the next-PC mux, with a resteer from EX on a control mismatch.]
Note that the Zesto simulator has all prediction structures in the fetch stage (the RAS and iBTB are used in parallel with the bpred and regular BTB, similar to assuming the presence of some sort of decode prediction). We're not entirely sure where each predictor is located in real pipelines, but it's not too hard to think about what is necessary to make each possibility work.

Recovery
Squashing instructions in the front-end pipeline:
[Figure: stages IF/ID/DS/EX holding fetch groups WXYZ, QRST, KLMN, EFGH; on "mispred!" the wrong-path groups are replaced with nops.]
What about instructions that are already in the RS, ROB, LSQ?
- nops are filtered out, so there is no need for them to take up RS and ROB entries

Wait for Drain
- Squash the in-order front-end (as before)
- Stall dispatch (no new instructions → ROB, RS)
- Let the OOO engine execute as usual
- Let commit operate as usual, except: check for the mispredicted branch; no instruction after it may commit
- Once the mispredicted branch has committed, any remaining instructions in the ROB, RS, and LSQ must be on the wrong path: flush the OOO engine and allow dispatch to continue
This is slow, but to the best of my knowledge this is how it is still done in the Intel family of processors (it definitely was the case for the original P-Pro, according to Bob Colwell's chapter in Shen and Lipasti's book). (Sketched below.)
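A minimal sketch of the drain sequence against a hypothetical `core` API; none of these method names come from real hardware or the slides:

```python
def wait_for_drain(core, mispredicted_branch, correct_target):
    """Sketch of drain-based misprediction recovery."""
    core.squash_frontend()              # kill the in-order front-end (as before)
    core.stall_dispatch()               # no new instructions enter ROB/RS
    while core.rob_head() is not mispredicted_branch:
        core.cycle()                    # OOO engine executes and commits as usual
    core.commit(mispredicted_branch)    # retire the branch itself
    core.flush_ooo()                    # everything left in ROB/RS/LSQ is wrong-path
    core.redirect_fetch(correct_target)
    core.resume_dispatch()              # allow dispatch to continue
```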

Wait for Drain (2)
- Simple to implement!
- Performance degradation: what if a load ahead of the mispredicted branch has a cache miss and goes to main memory? Draining waits for it.
[Figure: "Ideal" vs. "Drain & Wait" timelines. Ideal recovery refetches XOR, SUB, ST as soon as the mispredicted branch resolves; drain-and-wait sits on junk wrong-path slots until LOAD, ADD, BR have all committed.]

Branch Tags/IDs/Colors
- Each instruction fetched is assigned the "current branch tag"
- Each predicted branch causes a new branch tag to be allocated (and becomes the current tag)
[Figure: ROB entries tagged 1 1 1 1 1, 2 2 2 2 2 2 2, 4 4 4, 7 7 7 7 7, 5, 3 3 3 3. Tags need not be in any particular numeric order; the ordering is just a result of how tags are recycled and reassigned.]
The following slides discuss possible ways one could try to implement faster recovery mechanisms. I'm not aware of any processors that have actually used these.

Branch Tags (2)
[Figure: on "mispred!" of the branch holding tag 7, the tag list 1 2 4 7 5 3 is split; only the tags allocated after the mispredicted branch (5 and 3) are broadcast, and ROB entries carrying those tags are squashed.]
You only broadcast the tags after the mispredicted branch. (A sketch follows.)
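A minimal sketch of tag allocation and broadcast, with a hypothetical `rs` object standing in for the reservation stations; all names are mine:

```python
class BranchTagRecovery:
    """Sketch of tag-based wrong-path squashing."""
    def __init__(self):
        self.tag_list = []                # allocated branch tags, oldest first,
                                          # e.g. [1, 2, 4, 7, 5, 3]; the values
                                          # are whatever tag recycling produced

    def allocate(self, tag):
        self.tag_list.append(tag)         # a new branch was predicted

    def mispredict(self, bad_tag, rs):
        pos = self.tag_list.index(bad_tag)
        doomed = self.tag_list[pos + 1:]  # only tags AFTER the bad branch
        self.tag_list = self.tag_list[:pos + 1]
        for tag in doomed:                # one broadcast per doomed tag; the
            rs.invalidate_matching(tag)   # "one broadcast per cycle" variant
                                          # on a later slide simply serializes
                                          # this loop over multiple cycles
```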

Overkill for ROB / LSQ
- The ROB and LSQ keep instructions in program order (more on this in a future lecture)
- All instructions physically after the mispredicted branch should be squashed ... simple!
- Even this "simple" squashing isn't necessarily trivial: the buffers are organized as circular queues, so the circuitry has to properly handle wrap-around
- Some sort of tagging/coloring is still useful for the RS: instructions in the RS may be in arbitrary order, and there may be multiple sets of RSs (e.g., separate integer and FP RSs)

Hardware Complexity
[Figure: each RS entry compares "my tag" against every broadcast tag ("invalidate tag 0", "invalidate tag 1", "invalidate tag 2", ...); any match asserts squash. The height of the broadcast network grows with the number of branch tags, and so does the width of each entry's comparator bank.]
Overall area overhead is quadratic in the tag count.

Simplifications
- For a ROB with n entries, there could potentially be n different branches in flight, each requiring a unique tag
- In practice, only a fraction of instructions are branches, so limit to k < n tags instead
- If a (k+1)st branch is fetched, dispatch must stall until a tag has been deallocated
- Worst case: the oldest branch mispredicts, so all subsequent tags have to be broadcast; now suppose all subsequent instructions are branches (one tag each)! That's a huge number of broadcasts.
- Limits on the number of in-flight branches may be needed anyway, due to constraints on other resources (e.g., state for recovering speculative branch history register(s) after mispredictions)

Simplifications (2)
- For k tags, we may need to broadcast all of them if the oldest branch mispredicts, resulting in O(k^2) overhead
- Simplification: limit to only one (for example) broadcast per cycle
[Figure: tags 7, 5, 3 broadcast one per cycle; fetch resumes after the last broadcast.]
In this example, recovery takes three cycles instead of one. Note that single-cycle recovery is not necessary: to avoid any performance penalty, you only need to have recovered by the time the newly fetched correct-path instructions reach the dispatch point (i.e., if the front-end takes 5 cycles from fetch to dispatch, then recovering in 2 cycles doesn't buy you anything more than recovering in 4).

Branch Predictor Latency
- To provide a continuous stream of instructions, the branch predictor must make one prediction every cycle
- Pipelining the predictor? Nope: if the current prediction is not-taken, the next PC is A; if taken, the next PC is B. A dependency exists between successive predictions.
- This limits predictor size/latency: a smaller predictor is less accurate, or you pay a clock-frequency penalty

Ahead Prediction
Normally: PC1 → PC2 → PC3 → PC4 → PC5 → ..., where each "→" is a prediction that takes a single cycle; PCi is predicted from PCi-1.
Instead: PC1 → PC3 → PC5 → ... and, in parallel, PC2 → PC4 → ...
PCi is predicted from PCi-2, so each prediction can take two cycles instead of one. In general, the predictor can be k-ahead pipelined. (A sketch follows.)
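A minimal sketch of the k-ahead dependence structure, assuming a hypothetical `predict` function that maps an address to the address expected k fetches later; this only illustrates the dataflow, not real timing:

```python
def ahead_predicted_stream(predict, primed_pcs, n, k=2):
    """k-ahead pipelining sketch: PC_i is produced from PC_{i-k}, so each
    individual prediction may take k cycles. `primed_pcs` supplies the
    first k fetch addresses needed to fill the predictor pipeline."""
    pcs = list(primed_pcs)                # need k known PCs to prime the pipe
    for i in range(k, n):
        pcs.append(predict(pcs[i - k]))   # PC_i predicted from PC_{i-k}
    return pcs
```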

Ahead Prediction Timing
[Figure: timing of a 2-cycle ahead-pipelined branch predictor. In cycle k the fetch address is PCi while the predictor is finishing the prediction of PCi+1 (started from PCi-1) and starting PCi+3 (from PCi+1); each cycle the whole window slides forward by one.]

Ahead Prediction Misprediction
On a mispredict, the new PC (NPC) is sent to the front-end; the address before NPC is the PC of the mispredicted branch.
[Timeline:
Cycle k: mispredict detected (PC → PCwrong)
Cycle k+1: NPC → I$, PC → predictor
Cycle k+2: I$ bubble; NPC → predictor, and the predictor delivers PC → next-next PC (N2PC)
Cycle k+3: N2PC → I$, N2PC → predictor]

Overriding Branch Predictors
- Use two branch predictors: the 1st has single-cycle latency (fast, medium accuracy); the 2nd has multi-cycle latency but is more accurate
- The second predictor can override the 1st prediction if it disagrees
- Idea: better to pay a small number of bubbles (the difference between the 1st and 2nd predictors' latencies) than to pay for a full branch misprediction (full pipeline flush, 20+ cycles of delay)

Overriding Predictors (2)
[Figure: the fast 1st predictor produces A, then B, then C while the 2-cycle pipelined I$ fetches them; the slower 2nd predictor produces A' two cycles later (then B', C'). If A = A' (both predictors agree), done; if A != A', flush A, B and C and restart fetch with A'.]

Benefit of Overriding Predictors
Assume a 1-cycle predictor with 80% accuracy, a 3-cycle predictor with 95% accuracy, and a misprediction penalty of 20 cycles. Fetch bubbles per branch (lower is better):
- 1-cycle predictor only: 0.8×0 + 0.2×20 = 4
- 3-cycle predictor only: 0.95×3 + 0.05×20 = 3.85
- Overriding config: 0.8×0.95×0 + 0.2×0.95×3 + 0.2×0.05×20 + 0.8×0.05×23 = 1.69
Worst case, the branch misprediction penalty (23 cycles, when the slower predictor wrongly overrides a correct fast prediction) is worse than without the overriding predictor! (The arithmetic is reproduced in the sketch below.)
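A short script reproducing the bubble arithmetic above; the probabilities and penalties come from the slide, while the term grouping reflects my reading of it:

```python
# Expected fetch bubbles per branch for each configuration.
fast_acc, slow_acc = 0.80, 0.95     # 1-cycle and 3-cycle predictor accuracy
override_delay, penalty = 3, 20     # override bubbles, mispredict penalty

only_fast = (1 - fast_acc) * penalty                              # 0.2*20 = 4.0
only_slow = slow_acc * override_delay + (1 - slow_acc) * penalty  # = 3.85
overriding = ((1 - fast_acc) * slow_acc * override_delay          # slow fixes fast
              + (1 - fast_acc) * (1 - slow_acc) * penalty         # both wrong
              + fast_acc * (1 - slow_acc) * (penalty + override_delay))
                                                                  # wrong override: 23
print(only_fast, only_slow, overriding)   # 4.0 3.85 ~1.69
```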

Speculative Branch Update
Ideal branch prediction problem: given a PC, predict the branch outcome; given the actual outcome, update/train the predictor; repeat.
Actual problem: streams of predictions and updates proceed in parallel.
[Figure: timeline with predictions A B C D E F G overlapping the much-later updates for A B C ...]

Speculative Branch Update (2)
The BHR update cannot be delayed until branch retirement:
[Figure: predictions A through G in flight while updates trail far behind; branches B through E are all predicted with the same stale BHR value 011010, which only becomes 110101 once A's update lands.]
- We can't update the BHR until commit, because the outcome isn't known until then
- Branch F would also likely use a stale BHR value, unless the update from A happens at the very start of the cycle

Speculative Branch Update (3)
- Update the branch history using the predictions themselves: a speculative update
- If the predictions are correct, then the BHR is correct; this effectively simulates alternating lookup and update with respect to the BHR
- So what if there's a misprediction? Checkpoint and recover (next slides)

Recovery of Speculative BHR
[Figure: a speculative BHR (e.g., 0110100100100...) feeds BPred lookup, while a separate retirement BHR is updated at commit (BPred update). On a mispredict, the speculative BHR is recovered from the retirement BHR during commit/retirement of the mispredicted branch.]

Execution-Time Recovery
- Commit-time recovery may substantially delay branch misprediction recovery
[Figure: a load that misses the cache all the way to DRAM sits at the head of the ROB; a younger branch has already executed and is known to be mispredicted, but commit-time recovery can't begin until the load retires. Waiting until commit can cost a lot of time.]
- Instead, have every branch checkpoint the BHR at the time it is predicted
- On a mispredict, recover the speculative BHR from that branch's checkpoint (sketched below)
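A minimal sketch of a speculatively updated BHR with per-branch checkpoints; the class and its interface are hypothetical, and outcomes are assumed to be 0/1 integers:

```python
class SpeculativeBHR:
    """Speculatively updated branch history with a checkpoint taken per
    branch, enabling execution-time (rather than commit-time) recovery."""
    def __init__(self, bits=12):
        self.bhr = 0
        self.mask = (1 << bits) - 1
        self.checkpoint = {}                     # branch id -> BHR before update

    def on_predict(self, branch_id, pred_taken):
        self.checkpoint[branch_id] = self.bhr    # checkpoint at prediction time
        self.bhr = ((self.bhr << 1) | pred_taken) & self.mask

    def on_mispredict(self, branch_id, actual_taken):
        # Recover from this branch's checkpoint and shift in the real
        # outcome; no need to wait for the branch to retire.
        self.bhr = ((self.checkpoint[branch_id] << 1) | actual_taken) & self.mask
```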

(Done with recovery; moving on to other topics.)
Traces
A "Trace" is a dynamic stream of instructions.
[Figure: static layout of basic blocks A through L versus example traces, i.e., the observed paths through the program: A B C H I J K L G, and A B C D.]

Trace Cache
The idea is to cache dynamic traces instead of static instructions.
[Figure: fetching the trace A B C D E F G H I J from the I$ takes 5 cycles (one block per cycle: ABCD, EFG, HIJ, ...); a trace cache holding the whole trace supplies it in a single 1-cycle T$ fetch.]

Hardware Organization
[Figure: the fetch address indexes both the trace cache (instructions plus tag, etc., feeding hit logic) and the conventional I$/BPred/BTB path (mask, exchange, shift). A line-fill buffer with fill control, merge logic, and BTB logic constructs new traces; the selected instructions are latched and sent to the decoder.]

Tags, etc.
Each trace-cache entry stores: Tag | # Br. | Branch Mask | Fall-thru Addr | Target Addr
Example entry for fetch address A: Tag = A, # Br. = 3, Mask = 11,1, Fall-thru = X, Target = Y
- The "11" means branches 1 and 2 are both taken in this trace
- The ",1" means the trace ends in a branch

Hit Logic, Next Address Selection
Fetch A: trace hit = (tag matches A) AND (the multi-bpred's predictions match the trace's branch mask, for the first block and the remaining blocks).
Next fetch address: a mux selects between the trace's fall-thru (X) and target (Y) addresses based on the prediction for the trace-ending branch.
(Making multiple branch predictions is discussed on the next slide; a sketch of the hit logic follows.)
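A minimal sketch of the entry fields and the hit check, assuming `preds` supplies at least one prediction per branch in the trace; the field and function names are mine:

```python
from dataclasses import dataclass

@dataclass
class TraceEntry:              # the fields from the "Tags, etc." slide
    tag: int                   # starting fetch address
    n_br: int                  # number of branches in the trace
    dir_mask: tuple            # directions of the interior branches, e.g. (1, 1)
    ends_in_branch: bool       # the ",1" part of the mask
    fall_thru: int             # next fetch address if the final branch falls through
    target: int                # next fetch address if the final branch is taken

def trace_hit(entry, fetch_pc, preds):
    """preds: this cycle's multiple branch predictions (next slide)."""
    if entry.tag != fetch_pc:
        return None                            # tag mismatch -> miss
    k = len(entry.dir_mask)
    if tuple(preds[:k]) != entry.dir_mask:
        return None                            # predictions disagree with trace
    if entry.ends_in_branch:                   # select the next fetch address
        return entry.target if preds[k] else entry.fall_thru
    return entry.fall_thru
```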

Generating Multiple Predictions
- Serialized access (BHR → BPred → BPred → BPred, each prediction feeding the next history): incredibly slow
- Instead, make three predictions in parallel
- The predictor must then be BHR-based only (no PC bits!), since the PCs of the later branches in the group aren't known until after the fetch
(A sketch follows.)
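A minimal sketch of BHR-only multi-prediction. The parallel-subtree reading is my inference of how "three predictions in parallel" can be built: hardware reads all 1 + 2 + 4 candidate entries at once and lets the earlier predictions steer selection muxes, whereas this software sketch simply indexes directly:

```python
def predict_three(table, bhr, bits):
    """Three predictions from a table indexed purely by branch history
    (no PC bits). `table` holds 0/1 taken bits."""
    mask = (1 << bits) - 1
    p1 = table[bhr & mask]
    p2 = table[((bhr << 1) | p1) & mask]              # mux selected by p1
    p3 = table[((bhr << 2) | (p1 << 1) | p2) & mask]  # mux selected by p1, p2
    return p1, p2, p3
```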

Associativity
Set-Associativity:
- Benefit: reduced miss rate
- Cost: access time, replacement complexity
Path/Trace-Associativity (e.g., storing both ABC and ABD):
- Benefit: possibly reduced miss rate, less trace-thrashing
- Cost: access time, replacement complexity, code duplication

Indexing
Index the T$ with the starting address (A) combined with BHR bits, so that traces ABC and ABD map to different sets.
- Works if the path after AB consistently correlates with the path before AB
- Provides a benefit similar to path-associativity

Trace Fill Unit Placement
Build the trace at Fetch: [figure: instructions headed from the I$ to decode also enter a trace construction buffer, which stores into the T$ when the trace is complete]
Build the trace at Retire: [figure: instructions retiring from the ROB enter the trace construction buffer, which stores into the T$ when the trace is complete]

Trace Fill Unit Placement (2)
At Fetch:
- Speculative traces (built from branch predictions, not yet verified)
- Construction buffer management: while building ABC, a detected mispredict means the trace should have been ABD; we need to find C in the buffer, clean it out, and then insert D
At Retire:
- Non-speculative: all traces are "correct"
- No interaction with the branch predictor; simpler construction buffer
- Slower response time: the time from fetching ABC to retiring ABC may be long, and until retirement ABC is not in the T$, so fetch must use the I$

Trace Selection
Some traces may have poor temporal locality. Example: after A, path B is taken 97% of the time and path C only 3%; storing ACD evicts ABD (assuming no path-associativity) but likely won't be useful.
Alternative: use a trace filtering mechanism (extra hardware required).

Statistical Filtering [PACT 2005]
For each trace occurrence, insert it into the T$ with probability p < 1.0. Example: p = 0.05 (5% chance of insertion per occurrence); a hot trace ABC is seen 50 times, a cold trace XYZ twice.
- Probability of ABC getting inserted: 1 − P(never inserted) = 1 − (1 − 0.05)^50 = 1 − 0.95^50 ≈ 92.3% (a good chance that ABC gets into the T$)
- Probability of XYZ getting inserted: 1 − (1 − 0.05)^2 = 1 − 0.95^2 ≈ 9.75% (not so likely)
(Computed in the sketch below.)
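A short script reproducing the filtering probabilities; the numbers come from the slide, the helper name is mine:

```python
def p_in_cache(p, occurrences):
    """Probability a trace has been inserted at least once after
    `occurrences` sightings, with per-sighting insertion probability p."""
    return 1.0 - (1.0 - p) ** occurrences

print(p_in_cache(0.05, 50))   # hot trace ABC:  ~0.923
print(p_in_cache(0.05, 2))    # cold trace XYZ: ~0.0975
```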

Partial Matches
Fetch A: the trace cache holds ABC, but the branch predictor now says the path is ABD. A partial hit supplies just AB from the T$ (ABC → AB).
- Benefit: more instructions fetched than falling back to the I$
- Cost: more complex "hit" logic, squashing logic for the unused tail, and targets must be stored for the intermediate branches

Netburst (P4) Trace Cache
- There is no I$ at all! The front-end BTB, iTLB and prefetcher feed the decoder straight from the L2 cache
- Decoded instructions fill the Trace $ (which has its own BTB), and the Trace $ feeds rename, execute, etc.
- Trace-based prediction: predict the next trace, not the next PC

Trace Prediction
- Each trace has a unique identifier, analogous to but different from a conventional PC: effectively the starting PC plus the intra-trace branch directions
- The trace predictor takes a trace-id as input and outputs a predicted next-trace-id
- The trace cache is indexed with the trace-id, and the tag match is against the trace-id as well
(A sketch of such an identifier follows.)
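A minimal sketch of forming such an identifier; this particular packing is an assumed encoding for illustration, not Intel's actual scheme:

```python
def trace_id(start_pc, branch_dirs):
    """Identify a trace by its starting PC plus the intra-trace branch
    directions (0/1 per branch), packed into one value."""
    tid = start_pc
    for taken in branch_dirs:
        tid = (tid << 1) | taken
    return tid

# Same starting PC, different interior path -> different trace-ids:
assert trace_id(0x400, (1, 1)) != trace_id(0x400, (1, 0))
```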

No I$, Decoded Trace Cache
- No I$ means a T$ miss must pay the latency of an L2 access: a severe performance penalty for applications with poor trace locality
- Storing decoded instructions removes the decode logic from the branch misprediction penalty:
[Figure: with an I$, the mispredict penalty spans Fetch, Fetch, Dec, Dec, Ren, Disp, Exec; with a decoded T$, it spans only T$, T$, Ren, Disp, Exec.]