
1 Advanced Microarchitecture
Lecture 3: Superscalar Fetch

2 Fetch Rate is an ILP Upper Bound
To sustain an execution rate of N IPC, you must be able to sustain a fetch rate of N IPC! Over the long term, you cannot burn 2000 calories a day while only consuming 1500 calories a day. You will starve! This also suggests that you don't need to fetch N instructions every cycle, just on average. ("I'm not fat! I just have a lot of calorie buffers…")

3 Impediments to “Perfect” Fetch
A machine with superscalar degree N will ideally fetch N instructions every cycle. This doesn't happen, due to instruction cache organization, branches, and the interaction between the two.

4 Instruction Cache Organization
To fetch N instructions per cycle from the I$, the physical organization of the I$ row must be wide enough to store N instructions, and we must be able to access the entire row at the same time. [Diagram: an address decoder selects one cache line; each line holds a tag plus four instructions.] Alternative: do multiple fetch accesses per cycle. Not good: this increases the cycle time (or fetch latency) by too much.

5 Fetch Operation
Each cycle, the PC of the next instruction to fetch is used to access an I$ line. The N instructions specified by this PC and the next N-1 sequential addresses form a fetch group. The fetch group might not be aligned with the row structure of the I$.

6 Fragmentation via Misalignment
If PC = xxx01001 and N=4, the ideal fetch group is xxx01001 through xxx01100 (inclusive). [Diagram: the fetch group straddles the boundary between I$ rows 010 and 011.] Since we can only access one line per cycle, we fetch only 3 instructions (instead of N=4).

7 Fetch Rate Computation
Assume N=4, and assume the fetch group starts at a random location. Then fetch rate = ¼ x 4 + ¼ x 3 + ¼ x 2 + ¼ x 1 = 2.5 instructions per cycle. This is just to demonstrate how to analytically estimate fetch rates; a quick sketch of the computation follows.
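A minimal sketch (the function name is my own, not from the slides) of this expected-fetch-rate computation: the fetch group starts at a uniformly random offset within a cache line, and fetch stops at the end of the line.

    # Expected instructions fetched per cycle when the fetch group
    # starts at a random offset and cannot cross a cache-line boundary.
    def expected_fetch_rate(n, line_size):
        total = 0
        for offset in range(line_size):          # each start offset equally likely
            total += min(n, line_size - offset)  # slots left before end of line
        return total / line_size

    print(expected_fetch_rate(4, 4))  # 2.5, matching this slide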

8 Reduces Fetch Bandwidth
It now takes two cycles to fetch the N instructions: halved fetch bandwidth! [Diagram: cycle 1 fetches the tail of row 010; cycle 2 fetches the remaining instructions from row 011.] The reduction may not be as bad as a full halving, though: just because you fetched only K < N instructions during cycle 1 does not limit you to fetching only N-K instructions in cycle 2.

9 Reducing Fetch Fragmentation
Make |fetch group| != |row width|. [Diagram: each cache line holds two fetch groups' worth of instructions.] If the start of the fetch group is N or more instructions from the end of the cache line, then N instructions can be delivered. This approach is not terribly practical, because you either have to read out twice as many instructions (2x the bitlines), or you need special logic to enable one wordline for some columns and another wordline for the others.

10 May Require Extra Hardware
[Diagram: multiple cache rows feed a rotator that produces an aligned fetch group.] Arbitrary rotation is not cheap to implement! Remember that each line into the rotator carries a full instruction, which may be 32 bits wide.

11 Fetch Rate Computation
Let N=4 and cache line size = 8. Then fetch rate = 5/8 x 4 + 1/8 x 3 + 1/8 x 2 + 1/8 x 1 = 3.25 instructions per cycle. This is the same analysis as before, simply assuming a cache line twice as wide as the fetch group.
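Continuing the hypothetical expected_fetch_rate sketch from slide 7, the wider line is just a different argument:

    print(expected_fetch_rate(4, 8))  # 3.25, matching this slide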

12 Fragmentation via Branches
Even if the fetch group is aligned, and/or the cache line is larger than the fetch group, taken branches disrupt fetch. [Diagram: a taken branch in the middle of a fetch group; the slots after the branch, and before its target, are wasted.]

13 Fetch Rate Computation
Let N=4, with a branch every 5 instructions on average. Assume the branch is always taken, and that the branch target may start at any offset in a cache row. So: 25% chance of the fetch group starting at each location, and 20% chance for each instruction to be a branch.

14 Fetch Rate Computation (2)
Fetch group starting one slot from the end of the row: ¼ x 1 instruction.
Two slots from the end: ¼ x (0.2 x 1 + 0.8 x 2).
Three slots from the end: ¼ x (0.2 x 1 + 0.8 x (0.2 x 2 + 0.8 x 3)).
At the start of the row: ¼ x (0.2 x 1 + 0.8 x (0.2 x 2 + 0.8 x (0.2 x 3 + 0.8 x 4))).
Sum = 2.048 instructions fetched per cycle.
Easy exercise: estimate the fetch rate with different taken probabilities and cache-line widths (see the sketch below).
This is a simplified analysis: it doesn't account for the higher probability of the fetch group being aligned when the previous fetch group contained no branches.
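A minimal sketch (the function name and structure are my own, not from the slides) that generalizes this estimate to arbitrary taken probabilities and line widths, per the exercise above:

    # Expected fetch rate when each instruction is a taken branch with
    # probability p_taken, which ends the fetch group early.
    def expected_fetch_rate_with_branches(n, line_size, p_taken=0.2):
        total = 0.0
        for offset in range(line_size):
            slots = min(n, line_size - offset)   # slots before end of line
            e = 0.0
            survive = 1.0                        # P(no taken branch so far)
            for k in range(1, slots + 1):
                if k == slots:
                    e += survive * k             # reached the end of the group
                else:
                    e += survive * p_taken * k   # k-th instruction branches away
                    survive *= (1 - p_taken)
            total += e
        return total / line_size

    print(expected_fetch_rate_with_branches(4, 4))  # ~2.048, matching this slide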

15 Instruction Buffer Network
Example: the IBM RS/6000. [Diagram: a four-column banked I$ with per-column T-logic, feeding an instruction buffer network; PC = B1010.] The address can be broken down as (B)(10)(10): tag B, row 10, column offset 10. So we want to fetch from addresses B1010, B1011, B1100, and B1101. In column 2 (the first instruction to fetch), the T-logic compares the offset (the last 10, i.e. 2) to its own column index; since the offset is less than or equal to its index, the T-logic does not modify the row selection (the first 10). In column 3, the T-logic similarly compares the offset of 2 to its column index of 3 and leaves the row index alone. In column 0, the offset of 2 is greater than the column index of 0, so the T-logic increments the row index to select the next row instead (row 3); column 1 behaves the same way. This results in half of the instructions coming from row 2 (columns 2 and 3) and the other half from row 3 (columns 0 and 1); the buffer network then rotates them into program order (B10, B11, B12, B13). At the very end, a tag check is still performed with the upper bits of the original address.
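A minimal sketch (names assumed) of the per-column T-logic decision and the final rotation, under the simplifying assumption that each column holds one instruction slot per row:

    # Each column independently picks its row, so a misaligned fetch
    # group can be read in a single banked access.
    def t_logic_rows(row, offset, num_columns=4):
        return [row + 1 if col < offset else row for col in range(num_columns)]

    # The buffer network then rotates the columns into program order.
    def rotate_to_program_order(columns, offset):
        return columns[offset:] + columns[:offset]

    # PC = B1010 in the example: row 2, offset 2.
    print(t_logic_rows(2, 2))  # [3, 3, 2, 2]: columns 0-1 read row 3, columns 2-3 read row 2
    print(rotate_to_program_order(['B12', 'B13', 'B10', 'B11'], 2))
    # ['B10', 'B11', 'B12', 'B13']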

16 Types of Branches
Direction: conditional vs. unconditional.
Target: PC-encoded (PC-relative or absolute offset) vs. computed (target derived from a register).
Must resolve both direction and target to determine the next fetch group.

17 Prediction
Generally we use hardware predictors for both direction and target. The direction predictor simply predicts whether a branch is taken or not-taken; target prediction needs to predict an actual address. This lecture does not discuss how to predict the direction of branches (T vs. NT); the exact algorithms are covered next lecture.

18 Where Are the Branches?
Before we can predict a branch, we need to know that we have a branch to predict! Where's the branch in this fetch group? [Diagram: a PC indexing the I$.] The main point: if all we have is a PC, we don't know where any branches are (or whether they even exist), since we haven't yet fetched the instructions, let alone decoded them.

19 Simplistic Fetch Engine
[Diagram: the fetch PC indexes the I$; a predecoder (PD) per fetched instruction finds the branches; a mux selects the first branch's PC, which feeds the direction and target predictors; the predicted next PC (or PC + sizeof(inst)) feeds back to fetch.] Huge latency! Clock frequency plummets. PD = predecoder (it only does enough decode work to determine which instructions are branches). The mux selects the first branch in the fetch group (there may be multiple branches).

20 Branch Identification
Predecode branches on fill from L2: store 1 bit per instruction in the I$, set if the instruction is a branch. The partial-decode logic is removed from the fetch loop, and the branch's PC plus sizeof(inst) feeds the target and direction predictors directly. Note: sizeof(inst) may not be known before decode (e.g., x86). … This is still a long latency (the I$ access itself sometimes takes more than 1 cycle).

21 Line Granularity
Predict the next fetch group independent of the exact location of branches in the current fetch group. If there's only one branch in a fetch group, does it really matter where it is? [Diagram: one predictor entry per fetch group vs. one predictor entry per instruction.] The obvious challenge is if a fetch group contains more than one branch; in that situation, having only one predictor entry per group (rather than per instruction) will lead to aliasing problems, potentially for both direction and target prediction. This is discussed more on the next slide.

22 Predicting by Line
Better! The latency is now determined by the branch predictor, not the I$: the critical path no longer goes through the I$. [Diagram: the cache-line address feeds the direction and target predictors directly, in parallel with the I$ access; sizeof($-line) is added to form the fall-through address.] This is still challenging, though: we may need to choose between multiple targets for the same cache line. With two branches br1 (target X) and br2 (target Y) in one fetch group/cache line, the correct per-line predictions are:

    br1  br2  | correct dir pred | correct target pred
    N    N    | N                | --
    N    T    | T                | Y
    T    --   | T                | X

23 Multiple Branch Prediction
[Diagram: the PC minus its LSBs indexes the target and direction predictors, producing addr0-addr3 plus a direction bit per slot; logic scans for the first "T" and selects the corresponding target, else PC + sizeof($-line).] I.e., we try to make predictions for all of the branches within the cache line at the same time. (See the sketch below.)
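A minimal sketch (names and parameters assumed, not from the slides) of this "scan for the first T" selection logic:

    # Pick the next fetch PC from per-slot direction predictions (dirs)
    # and predicted targets; slots before the group's start are ignored.
    def next_fetch_pc(pc, line_size_bytes, dirs, targets, start_slot):
        for i in range(start_slot, len(dirs)):
            if dirs[i]:                 # first predicted-taken branch wins
                return targets[i]
        # no taken branch predicted: fall through to the next cache line
        return (pc & ~(line_size_bytes - 1)) + line_size_bytes

    # Example: slot 2 holds the first predicted-taken branch.
    print(hex(next_fetch_pc(0x1008, 16,
                            [False, False, True, True],
                            [0, 0, 0x2000, 0x3000],
                            start_slot=2)))  # 0x2000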

24 Direction Prediction
Details next lecture. Direction prediction is over 90% accurate today for integer applications, and higher still for FP applications.

25 Target Prediction
For PC-relative branches: if not-taken, next address = branch address + sizeof(inst); if taken, next address = branch address + SEXT(offset). Sizeof(inst) doesn't change, and the offset doesn't change (not counting self-modifying code). Indirect branches are not discussed here, although they deserve mention. A sketch of the computation follows.
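A minimal sketch of the taken/not-taken target computation; the 4-byte instruction size and 16-bit offset field are illustrative assumptions, not from the slides:

    def sext16(value):
        """Sign-extend a 16-bit offset field to a Python int."""
        return value - 0x10000 if value & 0x8000 else value

    def next_pc(branch_pc, offset_field, taken, inst_size=4):
        if taken:
            return branch_pc + sext16(offset_field)  # branch addr + SEXT(offset)
        return branch_pc + inst_size                 # fall through

    print(hex(next_pc(0x1000, 0xFFF0, taken=True)))   # 0xff0: backward branch
    print(hex(next_pc(0x1000, 0xFFF0, taken=False)))  # 0x1004: fall-through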

26 Taken Targets Only
We only need to predict taken-branch targets: the taken-branch target is the same every time, so the prediction is really just a "cache" of previously seen targets. [Diagram: the PC feeds the target predictor in parallel with a PC + sizeof(inst) adder for the fall-through path.] Be careful about whether you add sizeof(inst) or sizeof(cacheline) to the PC (and really it's the PC of the start of the cache line if you're adding sizeof(cacheline)).

27 Branch Target Buffer (BTB)
[Diagram: the branch PC indexes a table; each entry holds a valid bit (V), a branch instruction address (BIA) used as the tag, and a branch target address (BTA). On a valid tag match ("hit?"), the BTA becomes the next fetch PC.]
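A minimal sketch (entry count, index function, and names are my assumptions) of a direct-mapped BTB with full tags:

    # Each entry holds (tag, target); None means the valid bit is clear.
    class BTB:
        def __init__(self, num_entries=1024):
            self.num_entries = num_entries
            self.entries = [None] * num_entries

        def _index(self, pc):
            return (pc >> 2) % self.num_entries  # drop byte offset, take low bits

        def lookup(self, pc):
            entry = self.entries[self._index(pc)]
            if entry and entry[0] == pc:   # tag check against the full PC
                return entry[1]            # hit: predicted next fetch PC
            return None                    # miss

        def update(self, pc, target):      # allocate/overwrite on a taken branch
            self.entries[self._index(pc)] = (pc, target)

    btb = BTB()
    btb.update(0xFC34, 0x1000)
    print(hex(btb.lookup(0xFC34)))  # 0x1000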

28 Set-Associative BTB
[Diagram: the PC indexes one set of a multi-way table; each way holds a valid bit, tag, and target; the tags are compared in parallel and the matching way's target is selected as the next PC.]

29 Cutting Corners
Branch predictions may be wrong, but the processor has ways to detect mispredictions. So tweaks that make the BTB more or less "wrong" don't change the correctness of processor operation; they may only affect performance.

30 Partial Tags
[Diagram: full tags store the whole branch address (entries tagged cfff981, cfff982, cfff984 for branches at 00000000cfff9810, 00000000cfff9824, 00000000cfff984c); partial tags store only the low bits (f981, f982, f984).] Partial tags may lead to false hits, as shown by the red address in the diagram: a branch at beef9810 matches the partial tag f981 installed for cfff9810.
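A minimal sketch of the aliasing, following the slide's layout: the tag drops the PC's low 4 offset bits, and only the low 16 bits of that tag are stored and compared.

    def partial_tag(pc):
        return (pc >> 4) & 0xFFFF  # keep only the low 16 tag bits

    print(hex(partial_tag(0xCFFF9810)))  # 0xf981
    print(hex(partial_tag(0xBEEF9810)))  # 0xf981 -- same partial tag: false hit!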

31 PC-offset Encoding
[Diagram: instead of full targets (cfff9704, cfff9830, cfff9900), entries store only the target's low bits (ff9704, ff9830, ff9900); the predicted target concatenates the upper bits of the fetching PC with the stored low bits, e.g. PC 00000000cfff984c + stored ff9900 → cf|ff9900 = cfff9900.] Branch targets are usually close by, so the upper bits of the target's address are usually identical to those of the original PC. If the target is too far away, or the original PC is close to a "roll-over" point, then the target will be mispredicted.
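A minimal sketch, assuming a 24-bit stored-target field to match the slide's ff9900-style entries:

    TARGET_BITS = 24  # assumed width of the stored low bits

    def predict_target(pc, stored_low_bits):
        upper = pc & ~((1 << TARGET_BITS) - 1)  # reuse PC bits above the field
        return upper | stored_low_bits

    print(hex(predict_target(0xCFFF984C, 0xFF9900)))  # 0xcfff9900, as in the slide
    # If the real target crossed a 16 MB boundary, the reused upper
    # bits would be wrong and the target would be mispredicted.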

32 BTB Miss?
Suppose the Dir-Pred says "taken" but the Target-Pred (BTB) misses. We could default to the fall-through PC (as if the Dir-Pred had said NT), but we know that's likely to be wrong! Instead, stall fetch until the target is known… and when is that? PC-relative: after decode, we can compute the target. Indirect: we must wait until register read/execute.

33 Stall on BTB Miss
[Diagram: PC → I$ → decode; decode extracts the displacement, which is added to the PC to produce the next PC and unstall fetch; the direction predictor said "T" but the BTB had no target.] Basically, fetch stalls until some point after decode.

34 BTB Miss Timing
The same thing in a little more detail. [Pipeline diagram: in cycle i, the current PC starts its I$ access and the BTB lookup misses; the instruction reaches decode in cycle i+2; the target (PC + displacement) is available in cycle i+3, when the next PC starts its I$ access. Meanwhile fetch stalls and NOPs are injected into rename.]

35 Decode-time Correction
[Diagram: the BTB predicts target "foo", so fetch continues down the path of foo; decode later computes the real target, "bar".] The BTB may provide an incorrect target (for example, if partial tags result in a false hit). Later, we discover the predicted target was wrong, flush the wrong-path instructions, and resteer. The penalty is similar to a BTB miss: 3 cycles of bubbles, which is better than 20+ for a full pipeline flush.

36 What about Indirect Jumps?
[Diagram: decode discovers "get target from R5"; the BTB has no answer.] The point here is that it may take a very long time to compute the target, since the target is the result of another instruction. Stall until R5 is ready and the branch executes? That may be a while, e.g. if "Load R5 = 0[R3]" misses to main memory, and stalling guarantees that you get nowhere. Alternatively, fetch down the NT path. Why? It gets you ahead in the cases where the direction prediction was wrong (this may not be very frequent, but if it's non-zero, then you're ahead of the game). From a power perspective, though, speculating down the NT path may not be as good, since the common case is that the direction prediction is in fact correct.

37 Subroutine Calls: No Problem!
[Diagram: printf starts at P: 0x1000. Three call sites, A: 0xFC34, B: 0xFD08, and C: 0xFFB0, each execute CALL printf, and the BTB holds a valid entry for each (tags FC3, FD0, FFB, all with target 0x1000).] A regular BTB handles subroutine calls without a problem: each call site just ends up allocating a separate BTB entry, even though every entry contains the same target.

38 Subroutine Returns
[Diagram: printf at P: 0x1000 begins with ST $RA -> [$sp] and ends with LD $tmp <- [$sp] at 0x1B98 and RETN $tmp at 0x1B9C. The BTB entry for the return (tag 1B9) holds target 0xFC38, the return address for call site A: 0xFC34 (A': 0xFC38: CMP $ret, 0). When the call instead comes from B: 0xFD08, the prediction is wrong: the return should go to B': 0xFD0C.] This demonstrates that a regular BTB cannot accurately predict return addresses when the function is called from multiple sites.

39 Return Address Stack (RAS)
Keep track of the call stack. [Diagram: when A: 0xFC34 executes CALL printf, the return address FC38 is pushed onto the RAS (the BTB still predicts the call's target, 0x1000); when RETN executes at 0x1B9C in printf, FC38 is popped off the RAS and fetch resteers to A': 0xFC38.]

40 Overflow
What should we do when the function-call depth exceeds the size of the RAS? [Diagram: a full four-entry RAS holding 64B0, 421C, 48C8, 7300 when 64AC: CALL printf pushes FC90.] Option 1: wrap around and overwrite; this will lead to an eventual misprediction, but only after four pops. Option 2: do not modify the RAS; this will lead to a misprediction on the very next pop. Option 1 is probably the best (see the sketch below). Another interesting example is a simple recursive function: f() calls f() with a very deep level of nesting; if the recursion always happens from the same spot in f(), then the return addresses are always the same, so even though the RAS overflows, it keeps getting overwritten with the same return address (and keeps predicting correctly).
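A minimal sketch (size and names are my assumptions) of a RAS that wraps around and overwrites on overflow, i.e. option 1 above:

    class RAS:
        def __init__(self, size=4):
            self.stack = [0] * size
            self.top = 0  # index of the next push slot

        def push(self, return_addr):          # on a predicted CALL
            self.stack[self.top % len(self.stack)] = return_addr
            self.top += 1                     # overflow silently wraps around

        def pop(self):                        # on a predicted RETN
            self.top -= 1
            return self.stack[self.top % len(self.stack)]

    ras = RAS(size=4)
    for ra in [0x7300, 0x48C8, 0x421C, 0x64B0, 0xFC90]:  # five calls, four entries
        ras.push(ra)
    print(hex(ras.pop()))  # 0xfc90: correct
    # The next three pops are also correct; the fifth pop mispredicts,
    # because the overwritten 0x7300 is gone -- exactly "after four pops".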

41 How Can You Tell It’s a Return?
Option 1: a predecode bit in the BTB (return=1, else=0). Option 2: wait until after decode: initially use the BTB's target prediction, then, after decode, when you know it's a return, treat it like a BTB miss or BTB misprediction. This costs a few bubbles, but it's simpler and still better than a full pipeline flush.

