Advanced Microarchitecture

Presentation transcript:

Advanced Microarchitecture Lecture 3: Superscalar Fetch

Fetch Rate is an ILP Upper Bound
To sustain an execution rate of N IPC, you must be able to sustain a fetch rate of N IPC! Over the long term, you cannot burn 2000 calories a day while consuming only 1500 calories a day: you will starve. This also suggests that you don't need to fetch N instructions every single cycle, just N per cycle on average. ("I'm not fat! I just have a lot of calorie buffers...")

Impediments to "Perfect" Fetch
A machine with superscalar degree N would ideally fetch N instructions every cycle. This doesn't happen, due to:
- Instruction cache organization
- Branches
- The interaction between the two

Instruction Cache Organization
To fetch N instructions per cycle from the I$, we need:
- A physical I$ row wide enough to store N instructions
- The ability to access an entire row at the same time
[Figure: I$ organization with an address decoder and cache lines of four instructions each, one tag per line]
Alternative: do multiple fetches per cycle. Not good: that increases cycle time/latency by too much.

Fetch Operation
Each cycle, the PC of the next instruction to fetch is used to access an I$ line. The N instructions specified by this PC and the next N-1 sequential addresses form a fetch group. The fetch group might not be aligned with the row structure of the I$.

Fragmentation via Misalignment
If PC = xxx01001 and N = 4, the ideal fetch group is xxx01001 through xxx01100 (inclusive).
[Figure: I$ rows indexed 000-111, four instruction slots (offsets 00-11) per row; the fetch group starts at offset 01 of row 010 and spills past the end of that row]
Since we can only access one line per cycle, we fetch only 3 instructions (instead of N = 4).

Fetch Rate Computation
Assume N = 4 and that the fetch group starts at a random offset within the line. Then:
fetch rate = 1/4 x 4 + 1/4 x 3 + 1/4 x 2 + 1/4 x 1 = 2.5 instructions per cycle
This is just to demonstrate how to analytically estimate fetch rates; a sketch follows.
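To make the analysis concrete, here is a minimal Python sketch of the slide's computation, assuming the fetch group's start offset is uniformly distributed over the line and fetch cannot cross a line boundary (function and parameter names are mine, not from the slides):

```python
def expected_fetch_rate(n, insts_per_line):
    """Expected instructions fetched per cycle for an N-wide fetch group
    whose start offset is uniform over the cache line, with no fetching
    across a line boundary and no branches modeled."""
    total = 0
    for offset in range(insts_per_line):
        # At this start offset, we get the rest of the line, capped at N.
        total += min(n, insts_per_line - offset)
    return total / insts_per_line

print(expected_fetch_rate(4, 4))  # 2.5, matching the slide
```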

Reduces Fetch Bandwidth
It now takes two cycles to fetch N instructions: fetch bandwidth is halved!
[Figure: cycle 1 fetches the last three instructions of the line starting at xxx01001; cycle 2 fetches the next line starting at xxx01100]
The reduction may not be as bad as a full halving: just because you fetched only K < N instructions during cycle 1 does not limit you to fetching only N - K instructions in cycle 2.

Reducing Fetch Fragmentation
Make |fetch group| != |row width|, e.g., cache lines that hold twice as many instructions as the fetch group.
[Figure: an I$ whose rows hold eight instructions each, feeding a four-instruction fetch group]
If the start of the fetch group is N or more slots from the end of the cache line, then N instructions can be delivered. This approach is not terribly practical, though: you either have to read out twice as many instructions (2x the bitlines), or you need special logic to enable one wordline for some columns and another wordline for the others.

May Require Extra Hardware
[Figure: a misaligned fetch group read from a wide cache line passes through a rotator to produce an aligned fetch group]
Arbitrary rotation is not cheap to implement! Remember that each lane carries a full instruction, which may be 32 bits wide.

Fetch Rate Computation
Let N = 4 and cache line size = 8 instructions. Then:
fetch rate = 5/8 x 4 + 1/8 x 3 + 1/8 x 2 + 1/8 x 1 = 3.25 instructions per cycle
This is the same analysis as before, simply assuming that the cache line is twice as wide.
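The same model, as a small self-contained check (names are illustrative):

```python
# N = 4 fetch group, 8-instruction cache line, start offset uniform:
n, insts_per_line = 4, 8
rate = sum(min(n, insts_per_line - off)
           for off in range(insts_per_line)) / insts_per_line
print(rate)  # 3.25, matching the slide
```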

Fragmentation via Branches
Even if the fetch group is aligned, and/or the cache line is larger than the fetch group, taken branches disrupt fetch.
[Figure: a fetch group containing a taken branch in its second slot; the instructions after the branch are fetched but discarded (marked X)]

Fetch Rate Computation
Let N = 4, with a branch every 5 instructions on average. Assume:
- Branches are always taken
- A branch target may start at any offset in a cache row
- So there is a 25% chance of the fetch group starting at each of the four offsets, and a 20% chance of each instruction being a branch

Fetch Rate Computation (2)
Weighting each possible start-of-fetch-group offset by 1/4:
fetch rate = 1/4 x 1
           + 1/4 x (0.2 x 1 + 0.8 x 2)
           + 1/4 x (0.2 x 1 + 0.8 x (0.2 x 2 + 0.8 x 3))
           + 1/4 x (0.2 x 1 + 0.8 x (0.2 x 2 + 0.8 x (0.2 x 3 + 0.8 x 4)))
           = 2.048 instructions fetched per cycle
Easy exercise: estimate the fetch rate with different taken probabilities and cache line widths (the sketch below makes this a one-liner).
This is a simplified analysis: it doesn't account for the higher probability of the fetch group being aligned when the previous fetch group contained no branch.
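A Python sketch of this computation, implementing only the slide's simplified model (all names are mine):

```python
def expected_with_branches(n, p_taken):
    """Expected fetch rate when each instruction is a taken branch with
    probability p_taken, the fetch group starts at a uniform offset in an
    n-wide line, and fetch stops at a taken branch or the end of the line."""
    def expected(m):
        # Expected instructions fetched when at most m slots remain.
        total, p_reach = 0.0, 1.0
        for j in range(1, m):
            total += p_reach * p_taken * j  # taken branch in slot j ends fetch
            p_reach *= 1.0 - p_taken        # slot j was not a taken branch
        return total + p_reach * m          # ran all the way to the line end
    return sum(expected(m) for m in range(1, n + 1)) / n

print(expected_with_branches(4, 0.2))  # ~2.048, matching the slide
```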

Instruction Buffer Network (ex. IBM RS/6000)
[Figure: a four-column banked I$ (instructions A0/B0 ... A15/B15, four per row), per-column "T logic" on the row index, and an instruction buffer network that rotates the outputs; fetch PC = B1010]
The address B1010 breaks down as (B)(10)(10): tag B, row 10, column offset 10. So we want to fetch from addresses B1010, B1011, B1100, and B1101.
Each column's T-logic compares the fetch offset (the last "10", i.e., 2) to its own column index. The 3rd column (index 2, holding the first instruction to fetch) sees an offset less than or equal to its index, so it does not modify the row selection (the first "10", i.e., row 2); the 4th column (index 3) likewise leaves the row index alone. The 1st column (index 0) sees the offset 2 exceed its index 0, so its T-logic increments the row index to select the next row (row 3) instead; the 2nd column behaves the same way. Half of the instructions thus come from row 2 (columns 2 and 3) and the other half from row 3 (columns 0 and 1). The buffer network then rotates the column outputs (B12, B13, B10, B11 in column order) back into program order (B10, B11, B12, B13). At the very end, a tag check is still performed with the upper bits of the original address.
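A small Python sketch of the T-logic row selection and final rotation, under my reading of the slide (the function name and layout encoding are mine):

```python
def banked_fetch(row, offset, num_columns=4):
    """RS/6000-style banked fetch: each column compares the start offset
    to its own index; columns below the offset wrap to the next row.
    Returns (row, column) pairs in program order after rotation."""
    row_for_col = [row + (1 if col < offset else 0)
                   for col in range(num_columns)]
    # The first instruction lives in column `offset`; rotate into order.
    cols_in_order = [c % num_columns
                     for c in range(offset, offset + num_columns)]
    return [(row_for_col[c], c) for c in cols_in_order]

# The slide's example, address (B)(10)(10): row 2, offset 2.
print(banked_fetch(2, 2))  # [(2, 2), (2, 3), (3, 0), (3, 1)] -> B10..B13
```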

Types of Branches
Direction: conditional vs. unconditional
Target:
- PC-encoded: PC-relative or absolute offset
- Computed: target derived from a register
Must resolve both direction and target to determine the next fetch group.

Prediction
Generally, hardware predictors are used for both direction and target. The direction predictor simply predicts whether a branch is taken or not-taken; target prediction needs to predict an actual address. This lecture does not discuss how to predict the direction of branches (T vs. NT); the exact algorithms are covered next lecture.

Where Are the Branches?
Before we can predict a branch, we need to know that we have a branch to predict! Where's the branch in this fetch group?
[Figure: a PC indexing the I$, which returns only raw instruction bits]
The main point: if all we have is a PC, we don't know where any branches are (or whether they even exist), since we haven't yet fetched the instructions, let alone decoded them.

Simplistic Fetch Engine
[Figure: the fetch PC accesses the I$; every fetched instruction goes through a predecoder (PD); a mux picks the first branch's PC, which drives the direction and target predictors, whose outputs (target or PC + sizeof(inst)) form the next fetch PC]
PD = predecoder: it does only enough decode work to determine which instructions are branches. The mux selects the first branch in the fetch group (there may be multiple branches). Chaining I$ access, predecode, and prediction into one cycle means huge latency: clock frequency plummets.

Branch Identification
Instead, predecode branches on fill from the L2: store 1 bit per instruction in the I$, set if the instruction is a branch. The partial-decode logic is removed from the fetch loop.
Note: sizeof(inst) may not be known before decode (e.g., x86).
... still a long latency (the I$ access itself is sometimes > 1 cycle).

Line Granularity
Predict the next fetch group independent of the exact location of branches in the current fetch group: if there's only one branch in a fetch group, does it really matter where it is?
[Figure: one predictor entry per fetch group vs. one predictor entry per instruction]
The obvious challenge is if a fetch group contains more than one branch; in such a situation, having only one predictor entry per group (rather than per instruction) will lead to aliasing problems, potentially for both direction and target prediction. This is discussed more on the next slide.

Predicting by Line
Better! Latency is determined by the branch predictor, not the I$: the critical path no longer goes through the I$.
[Figure: the cache line address feeds the direction and target predictors directly, in parallel with an adder computing line address + sizeof($-line); the line contains two branches, br1 (target X) and br2 (target Y)]
With two branches in one fetch group/cache line, prediction gets harder:
br1  br2  | correct dir pred | correct target pred
N    N    | N                | --
N    T    | T                | Y
T    --   | T                | X
This is still challenging: we may need to choose between multiple targets for the same cache line.

Multiple Branch Prediction
[Figure: the PC minus its LSBs indexes the direction and target predictors, producing a direction (e.g., N, N, N, T) and a target (addr0-addr3) for every slot in the cache line; selection logic scans for the first "T", and the fall-through path adds sizeof($-line) to the PC]
I.e., try to make predictions for all of the branches within the cache line at the same time. A sketch of the selection logic follows.
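A sketch of the "scan for the first T" selection, assuming per-slot direction and target predictions for one cache line (the names and the 4-byte instruction size are assumptions):

```python
def next_fetch_pc(line_pc, dirs, targets, insts_per_line=4, inst_size=4):
    """Choose the next fetch PC from per-slot predictions: the first
    predicted-taken slot supplies its target; otherwise fall through
    to the start of the next sequential cache line."""
    for taken, target in zip(dirs, targets):
        if taken:
            return target                         # first "T" wins
    return line_pc + insts_per_line * inst_size   # + sizeof($-line)

# The slide's example: predictions N, N, N, T with targets addr0..addr3.
print(hex(next_fetch_pc(0x1000, [False, False, False, True],
                        [0x2000, 0x3000, 0x4000, 0x5000])))  # 0x5000
```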

Direction Prediction
Details next lecture. Direction prediction is over 90% accurate today for integer applications, and higher still for FP applications.

Target Prediction
For PC-relative branches:
- If not-taken: next address = branch address + sizeof(inst)
- If taken: next address = branch address + SEXT(offset)
Neither sizeof(inst) nor the offset changes (not counting self-modifying code). Indirect branches are not discussed here, although they deserve a mention. A sketch of the taken-target computation follows.
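A hedged sketch of the target computation, using a generic byte-offset encoding; real ISAs differ in base point and scaling (e.g., MIPS adds a word-scaled offset to PC + 4), and all names here are mine:

```python
def sext(field, bits):
    """Sign-extend a `bits`-wide two's-complement field to a Python int."""
    sign = 1 << (bits - 1)
    return (field ^ sign) - sign

def predicted_next_pc(branch_pc, taken, offset_field,
                      offset_bits=16, inst_size=4):
    # Not-taken: fall through. Taken: branch PC + sign-extended offset.
    if not taken:
        return branch_pc + inst_size
    return branch_pc + sext(offset_field, offset_bits)

print(hex(predicted_next_pc(0x1000, True, 0xFFF8)))   # 0xff8 (backward)
print(hex(predicted_next_pc(0x1000, False, 0xFFF8)))  # 0x1004 (fall-through)
```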

Taken Targets Only
We only need to predict taken-branch targets: a taken branch's target is the same every time, so the prediction is really just a "cache" of previously seen targets.
[Figure: the PC feeds the target predictor in parallel with an adder computing PC + sizeof(inst) for the fall-through path]
Be careful about whether you add sizeof(inst) or sizeof(cacheline) to the PC (and really it's the PC of the start of the cache line if you're adding sizeof(cacheline)).

Branch Target Buffer (BTB)
[Figure: the branch PC indexes a table whose entries hold a valid bit (V), the branch instruction address (BIA) as the tag, and the branch target address (BTA); a comparator on the BIA signals "Hit?", and on a hit the BTA supplies the next fetch PC]
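A minimal direct-mapped BTB sketch matching the slide's V/BIA/BTA fields (the sizes and names are illustrative, not from a real design):

```python
class BTB:
    """Direct-mapped branch target buffer: each entry holds a valid bit,
    the branch instruction address as a tag (BIA), and the branch
    target address (BTA)."""
    def __init__(self, num_entries=1024, inst_size=4):
        self.entries = [None] * num_entries  # None == valid bit clear
        self.num_entries, self.inst_size = num_entries, inst_size

    def _split(self, pc):
        word = pc // self.inst_size          # drop the offset bits
        return word % self.num_entries, word // self.num_entries

    def predict(self, pc):
        """Return the predicted target on a hit, or None on a BTB miss."""
        index, tag = self._split(pc)
        entry = self.entries[index]
        return entry[1] if entry and entry[0] == tag else None

    def update(self, pc, target):
        """Install the taken target once the branch resolves."""
        index, tag = self._split(pc)
        self.entries[index] = (tag, target)

btb = BTB()
btb.update(0xFC34, 0x1000)        # CALL printf from the later example
print(hex(btb.predict(0xFC34)))   # 0x1000: hit, becomes the next fetch PC
print(btb.predict(0xFD08))        # None: BTB miss
```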

Set-Associative BTB
[Figure: the PC indexes several ways in parallel, each entry holding a valid bit, tag, and target; per-way comparators match the tag, and the hitting way's target becomes the next PC]

Cutting Corners
Branch prediction may be wrong anyway, and the processor already has ways to detect mispredictions. Therefore tweaks that make the BTB more or less "wrong" don't change the correctness of processor operation; they may only affect performance.

Partial Tags
Rather than storing a branch's full address as the BTB tag (00000000cfff981x, 00000000cfff982x, 00000000cfff984x), store only the low-order bits (f981, f982, f984).
[Figure: the same three BTB entries with full tags on the left and partial tags on the right; targets 00000000cfff9704, 00000000cfff9830, 00000000cfff9900]
This may lead to false hits: the unrelated branch at 000001111beef9810 (shown in red on the slide) matches the partial tag f981 and receives the wrong predicted target.
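A tiny demonstration of a partial-tag false hit using the slide's addresses (the tag width and bit slicing are my assumptions):

```python
def partial_tag(pc, tag_bits=16, offset_bits=4):
    """Keep only the low `tag_bits` of the tag, after dropping the bits
    used to index the BTB; cheaper to store and compare."""
    return (pc >> offset_bits) & ((1 << tag_bits) - 1)

a = 0x00000000CFFF9810   # the branch the entry was allocated for
b = 0x000001111BEEF9810  # a different branch (the slide's red address)
print(hex(partial_tag(a)))               # 0xf981
print(partial_tag(a) == partial_tag(b))  # True: false hit, wrong target used
```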

PC-offset Encoding
Branch targets are usually close by, so the upper bits of the target's address are usually identical to those of the branch's own PC. Rather than storing the full target (e.g., 00000000cfff9900), store only its low-order bits (e.g., ff9900) and splice them onto the upper bits of the fetch PC: looking up 00000000cfff984c yields the reconstructed target 00000000cf|ff9900.
If the target is too far away, or the original PC is close to a "roll-over" point, then the reconstructed target will be wrong and the branch mispredicted.
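A sketch of reconstructing the full target from the stored low bits, using the slide's numbers (the 24-bit stored width is my assumption):

```python
def splice_target(fetch_pc, stored_low, low_bits=24):
    """Rebuild the predicted target by splicing the BTB's stored low bits
    onto the upper bits of the current fetch PC."""
    mask = (1 << low_bits) - 1
    return (fetch_pc & ~mask) | (stored_low & mask)

pc = 0x00000000CFFF984C
print(hex(splice_target(pc, 0xFF9900)))  # 0xcfff9900: correct nearby target
# A far-away target (upper bits differing from the PC's) would be rebuilt
# incorrectly and caught later as a misprediction.
```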

BTB Miss?
What if the direction predictor says "taken" but the target predictor (BTB) misses?
- Could default to the fall-through PC (as if the direction predictor had said NT), but we know that's likely to be wrong!
- Stall fetch until the target is known ... and when is that?
  - PC-relative: after decode, we can compute the target
  - Indirect: must wait until register read/execute

Stall on BTB Miss
[Figure: the PC probes the BTB ("???" on a miss) while the direction predictor says taken; the fetched instructions reach decode, the target is computed as PC + displacement, and the result is fed back as the next PC to unstall fetch]
This basically just points out that fetch stalls until some point after decode.

BTB Miss Timing
[Figure: in cycle i, the current PC starts its I$ access and the BTB lookup misses; the instructions decode in a later cycle, the target is computed (PC + displacement), and only then does the next PC start its I$ access, with fetch stalled in between]
The same thing in a little more detail: as the branch moves through stage 1 (BTB miss), stage 2 (I$ access), stage 3 (decode), and stage 4 (rename) over cycles i through i+4, the fetch stages behind it stall and nops are injected until the computed target resteers fetch.

Decode-time Correction
The BTB may provide an incorrect target (for example, if partial tags result in a false hit): fetch continues down the path of "foo" even though the branch actually goes to "bar".
[Figure: the BTB supplies target foo; decode recomputes the target as PC + displacement and discovers the mismatch]
Later, we discover the predicted target was wrong; flush the wrong-path instructions and resteer fetch. The penalty is similar to a BTB miss, and 3 cycles of bubbles is better than 20+.

What about Indirect Jumps?
[Figure: the BTB misses ("???") on an indirect branch whose target comes from R5]
The target may take a very long time to compute, since it is the result of another instruction. Two options:
- Stall until R5 is ready and the branch executes: this may be a while, e.g., if "Load R5 = 0[R3]" misses to main memory. Stalling guarantees that you get nowhere.
- Fetch down the NT path. Why? It gets you ahead in the cases where the direction prediction was wrong (this may not be very frequent, but if it's non-zero, then you're ahead of the game). From a power perspective, though, speculating down the NT path may not be as good, since the common case is that the direction prediction is in fact correct.

Subroutine Calls: No Problem!
[Figure: printf starts at P: 0x1000; call sites A: 0xFC34, B: 0xFD08, and C: 0xFFB0 each allocate their own BTB entry (tags FC3, FD0, FFB), all with target 0x1000]
A regular BTB handles subroutine calls without a problem: each call site just ends up allocating a separate BTB entry, even though every entry contains the same target.

Subroutine Returns
[Figure: printf (P: 0x1000) begins with "ST $RA -> [$sp]" and ends with "0x1B98: LD $tmp <- [$sp]; 0x1B9C: RETN $tmp"; the return's single BTB entry (tag 1B9) holds target 0xFC38 = A', so after a call from B: 0xFD08 the predicted return is still A' rather than B': 0xFD0C, marked wrong (X)]
This demonstrates that a regular BTB cannot provide accurate return-address prediction when the function is called from multiple sites: the one entry for the return can only remember one return address at a time.

Return Address Stack (RAS)
Keep track of the call stack in hardware.
[Figure: the call at A: 0xFC34 pushes its return address FC38 onto the RAS while the BTB predicts the call's target; printf (P: 0x1000) stores $RA, and the return at 0x1B9C pops FC38, correctly resuming at A': 0xFC38]
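A minimal RAS sketch: push the return address on a predicted call, pop on a predicted return (the depth and names are illustrative; this version wraps on overflow, anticipating the next slide):

```python
class ReturnAddressStack:
    """Small circular return-address stack. Push on call, pop on return;
    on overflow, wrap around and overwrite the oldest entry."""
    def __init__(self, depth=8):
        self.slots = [0] * depth
        self.depth = depth
        self.top = 0                 # index of the next free slot

    def push(self, return_pc):       # predicted CALL: PC + sizeof(inst)
        self.slots[self.top % self.depth] = return_pc
        self.top += 1

    def pop(self):                   # predicted RETN: next fetch PC
        self.top -= 1
        return self.slots[self.top % self.depth]

ras = ReturnAddressStack()
ras.push(0xFC38)            # CALL printf at A: 0xFC34 pushes A' = 0xFC38
print(hex(ras.pop()))       # 0xfc38: the return resumes at A'
```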

Overflow
What should happen when the call depth exceeds the size of the RAS?
[Figure: a full four-entry RAS (64B0 at the top of stack, then 421C, 48C8, 7300) receiving one more push, FC90, from the call at 64AC]
1. Wrap around and overwrite: will lead to an eventual misprediction, but only after four pops.
2. Do not modify the RAS: will lead to a misprediction on the very next pop.
Option 1 is probably the best. Another interesting example to consider is a simple recursive function: f() calls f() with a very deep level of nesting. If the recursion always happens from the same spot in f(), then the return addresses are all the same, so even though the RAS overflows, it keeps getting overwritten with the same return address, as in the sketch below.
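The recursion case from the notes, reusing the ReturnAddressStack sketch above: deep recursion from a single call site overwrites wrapped entries with identical values, so pops still predict correctly (the addresses are the slide's):

```python
ras = ReturnAddressStack(depth=4)
for _ in range(100):            # 100-deep recursion, far beyond 4 entries
    ras.push(0x64B0)            # every frame's return address is the same
correct = all(ras.pop() == 0x64B0 for _ in range(100))
print(correct)  # True: wrap-around overwrote entries with identical values
```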

How Can You Tell It's a Return?
- Pre-decode bit in the BTB (return = 1, else = 0)
- Wait until after decode: initially use the BTB's target prediction; after decode, when you know it's a return, treat it like a BTB miss or BTB misprediction. This costs a few bubbles, but it's simpler, and still better than a full pipeline flush.