Published by Justin O'Brien. Modified over 6 years ago.
1
Case Studies
MAINAK CS422
2
MIPS R10000
3
Overview
Mid 90s: one of the first dynamic out-of-order issue superscalar RISC microprocessors
6.8 M transistors on a 298 mm² die (0.35 μm CMOS)
  Of the 6.8 M transistors, 4.4 M are devoted to the L1 instruction and data caches
Fetches, decodes, and renames 4 instructions every cycle
64-bit registers: the data path width is 64 bits
On-chip 32 KB L1 instruction and data caches, 2-way set associative
Off-chip L2 cache of variable size (512 KB to 16 MB), 2-way pseudo set associative, line size 64 or 128 bytes
4
Stage 1: Fetch
Instructions are slightly pre-decoded when the cache line is brought into the Icache (4 extra FU bits)
  Simplifies the decode stage
Processor fetches four sequential instructions every cycle from the Icache
The iTLB has eight entries, fully associative, backed by a larger unified TLB
No BTB
  So the fetcher really cannot do anything about branches other than fetching sequentially
Fetched instructions are put in an eight-entry instruction buffer for the decoder to consume
5
Stage 2: Decode/Rename
Decodes and renames four instructions every cycle
The targets of branches, unconditional jumps, and subroutine calls (jal and jalr) are computed in this stage
Unconditional jumps are not fed into the pipeline; the decoder modifies the fetcher PC directly
Conditional branches look up a bimodal predictor to predict the branch direction (taken or not taken) and modify the fetch PC accordingly; returns look up a RAS
6
Branch prediction
Branches are predicted and unconditional jumps are computed in stage 2
  There is always a one-cycle bubble (four instructions)
In case of a branch misprediction (detected later), the processor may need to roll back and restart fetching from the correct target
  Need to checkpoint (i.e. save) the register map right after the branch is renamed (needed to restore it on a misprediction)
  The processor supports at most four register map checkpoints; these are stored in a structure called the branch stack (really a FIFO queue, not a stack)
  Can support up to four in-flight branches
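The checkpointing scheme above can be sketched roughly as follows. This is a minimal illustration, not the R10000 hardware: the class name `BranchStack`, the dictionary-based register map, and the string branch ids are all assumptions made for the example. Entries are kept until retirement, and a misprediction discards only the younger checkpoints.

```python
from collections import deque


class BranchStack:
    """Hypothetical model of a four-entry register map checkpoint queue."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = deque()  # oldest in-flight branch on the left

    def can_allocate(self):
        # If this is False, the decoder has to stall on the next branch.
        return len(self.entries) < self.capacity

    def checkpoint(self, branch_id, map_table, alt_target):
        if not self.can_allocate():
            raise RuntimeError("branch stack full: decode must stall")
        # Save a copy of the map (register names, not register values).
        self.entries.append((branch_id, dict(map_table), alt_target))

    def mispredict(self, branch_id):
        # Drop checkpoints of younger branches, then return the saved map
        # and the correct target of the mispredicted branch.
        ids = [entry[0] for entry in self.entries]
        pos = ids.index(branch_id)
        while len(self.entries) > pos + 1:
            self.entries.pop()
        _, saved_map, target = self.entries[pos]
        return dict(saved_map), target

    def retire_oldest(self):
        # Called when the oldest branch retires; frees its entry.
        return self.entries.popleft()
```

A decoder model would call `checkpoint` right after renaming each branch and `retire_oldest` when that branch reaches the head of the active list.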
7
Branch predictor
The predictor is an array of 512 two-bit saturating counters
  Can count up to 3; if already 3, an increment does not have any effect (remains at 3)
  Similarly, if the count is 0, a decrement does not have any effect (remains at 0)
The array is indexed by PC[11:3]
  Ignore the lower 3 bits, take the next 9 bits
The outcome is the count at that index of the predictor
  If count >= 2 then predict taken; else not taken
Very simple algorithm; prediction accuracy of 85+% on most benchmarks; works fine for short pipes
Commonly known as the bimodal branch predictor
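The predictor described above is small enough to model directly. The following is a sketch under stated assumptions: the class name `BimodalPredictor` is made up for the example, and the initial counter value of 1 is an assumption (the slides do not give the reset state).

```python
class BimodalPredictor:
    """512 two-bit saturating counters indexed by PC[11:3]."""

    def __init__(self, entries=512, initial=1):
        # The initial counter value is an assumption for illustration.
        self.entries = entries
        self.counters = [initial] * entries

    def index(self, pc):
        # Ignore the lower 3 bits, take the next 9 bits (for 512 entries).
        return (pc >> 3) % self.entries

    def predict(self, pc):
        # Predict taken when the counter is 2 or 3.
        return self.counters[self.index(pc)] >= 2

    def update(self, pc, taken):
        # Saturating increment on taken, saturating decrement on not taken.
        i = self.index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

Because only nine PC bits are used, two branches whose addresses differ by a multiple of 4096 bytes share a counter, which is one source of mispredictions in this scheme.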
8
Branch predictor
The branch predictor is updated when a conditional branch retires (in-order update because retirement is in-order)
  At retirement we know the correct outcome of the branch
  At this point the branch stack entry is also freed
One branch stack entry contains the entire register map (not the register values), the target of the branch, and a few control bits
Decoder assigns a 4-bit branch mask to every instruction
  What is the use of it?
9
Conditional instructions
Want to get rid of branches completely
  if (!A) { a = b; }
  cmovz r2, r3, r1 (with A in r1, a in r2, b in r3)
  The move executes only if r1 is zero
Converts control dependences to data dependences
  Essentially we have more time to get r1 ready: instead of in the fetcher we now need it in the EX stage
Also known as if-conversion
  Useful in compiling y=abs(x): if (x<0) {y=-x;} else {y=x;}
Very useful in getting rid of hard-to-predict branches
However, eliminating branches that guard a large piece of code may require too many conditional moves
Conditional moves are supported by most processors today
How does it interact with register renaming?
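The if-conversion of y=abs(x) can be illustrated with a small model of the conditional move. This is only a sketch of the semantics given on the slide (the move happens when the condition register is zero); the helper names `cmovz` and `abs_if_converted` are hypothetical, and the Python conditional expressions stand in for a compare instruction and the hardware select.

```python
def cmovz(dest_old, src, cond):
    # The destination gets src only if cond is zero; otherwise it keeps
    # its old value. Note that the old destination value is an input.
    return src if cond == 0 else dest_old


def abs_if_converted(x):
    y = x                            # else-path value: y = x
    neg = -x                         # then-path value, computed unconditionally
    nonneg = 0 if x < 0 else 1       # condition flag: zero exactly when x < 0
    y = cmovz(y, neg, nonneg)        # y = -x only when x < 0
    return y
```

Both paths are computed and the condition only selects between them, so the control dependence on x<0 has become a data dependence on `nonneg`.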
10
Register renaming
Takes place in the second pipeline stage
As we have discussed, every destination is assigned a new physical register from the free list
The sources are renamed using the existing map
The map table is updated with the newly renamed destination
For every destination physical register, a busy bit is set high to signify that the value in this register is not yet ready; this bit is cleared after the instruction completes execution (but before retirement)
The integer and floating-point instructions are assigned registers from two separate free lists
  The integer and fp register files are separate (each has 64 registers)
Complications with mult/div
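The renaming steps above can be sketched for a single register file. This is a simplified model, not the actual R10000 logic: the class name `Renamer`, the identity initial mapping, and renaming one instruction at a time (instead of four per cycle) are assumptions made to keep the example short.

```python
from collections import deque


class Renamer:
    """Toy renamer: 32 logical registers mapped onto 64 physical ones."""

    def __init__(self, num_logical=32, num_physical=64):
        # Identity initial mapping is an assumption for illustration.
        self.map_table = list(range(num_logical))
        self.free_list = deque(range(num_logical, num_physical))
        self.busy = [False] * num_physical

    def rename(self, dest, srcs):
        if not self.free_list:
            return None  # out of physical registers: stage 2 stalls
        # Sources use the existing map, read before the map is updated.
        phys_srcs = [self.map_table[s] for s in srcs]
        old_dest = self.map_table[dest]
        new_dest = self.free_list.popleft()
        self.map_table[dest] = new_dest
        self.busy[new_dest] = True  # value not ready yet
        return new_dest, phys_srcs, old_dest

    def complete(self, phys_dest):
        # Cleared when execution completes, before retirement.
        self.busy[phys_dest] = False

    def retire(self, old_dest):
        # The previous mapping of the destination is freed at retirement.
        self.free_list.append(old_dest)
```

For example, renaming `add r3, r4, r5` followed by `add r6, r4, r3` makes the second instruction read the new physical register of r3, which is how the true dependence is preserved.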
11
Register renaming
Structure of the map table
  A multiported RAM
  16 read ports, 4 write ports
  Third operand is a condition bit
  Renamer uses 24 5-bit comparators to resolve dependencies
  What are these dependencies?
Organization of the free list
  Four-way banked, each bank is an 8-entry FIFO; each bank's read pointer gives you a free register id
12
Preparing to issue
Finally, during the second stage every instruction is assigned an active list entry
  The active list is a 32-entry FIFO queue that keeps track of all in-flight instructions (at most 32) in order
  Each entry contains various information about the allocated instruction, such as the physical destination register number
  Organization?
Also, each instruction is assigned to one of three issue queues depending on its type
  Integer queue: holds integer ALU instructions
  Floating-point queue: holds FPU instructions
  Address queue: holds the memory operations
Therefore, stage 2 may stall if the processor runs out of active list entries, physical registers, or issue queue entries
13
Stage 3: Issue
Three issue queue selection logics work in parallel
  Integer and fp queue issue logics are similar
Integer issue logic
  Integer queue contains 16 entries (can hold at most 16 instructions); collapsible CAM
  Search for ready-to-issue instructions among these 16
  Issue at most two instructions to two ALUs
  Back-to-back integer instruction issue
Address queue
  Slightly more complicated, but a FIFO CAM
  When a load or a store is issued the address is still not known
  To simplify matters, R10000 issues loads/stores in order (we have seen the problems associated with out-of-order load/store issue)
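The integer issue logic above can be approximated in a few lines. This is a sketch under several assumptions that the slides do not state: entries are scanned oldest first, an instruction that can run on either ALU greedily takes ALU1 when it is free, and the function name `select_integer_issue` and the dictionary fields are hypothetical. The ALU restrictions (branch and shift on ALU1, multiply/divide on ALU2) follow the functional-unit description elsewhere in these slides.

```python
def select_integer_issue(queue, busy):
    """Pick at most two ready instructions for the two integer ALUs.

    queue: list of dicts, oldest first, with keys 'name', 'srcs' (physical
           register numbers) and 'unit' ('alu1', 'alu2' or 'any').
    busy:  list of busy bits indexed by physical register number.
    """
    free_alus = {"alu1", "alu2"}
    issued = []
    for entry in queue:
        if not free_alus:
            break  # both ALUs already taken this cycle
        if any(busy[s] for s in entry["srcs"]):
            continue  # an operand is not ready yet
        if entry["unit"] == "any":
            alu = sorted(free_alus)[0]  # assumed tie-break: lowest free ALU
        elif entry["unit"] in free_alus:
            alu = entry["unit"]
        else:
            continue  # the required ALU is already taken
        free_alus.remove(alu)
        issued.append((entry["name"], alu))
    return issued
```

With this greedy tie-break an older add can take ALU1 and block a younger shift for one cycle; a real selection circuit may resolve such conflicts differently.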
14
Address queue
Load retry unit
  Two situations require retrying a load
    A data cache miss
    A memory address conflict (what is this?)
Two 16x16 matrices track address dependence information
  Rows and columns are AQ entries
  The first matrix avoids unnecessary thrashing by allocating one way in a set to the oldest conflicting AQ entry
  The second matrix records load/store dependency at 64-bit granularity and carries out load forwarding
A returning refill snoops the AQ and wakes up all matching instructions; matching load entries retry through a dedicated cache port
15
Load-dependents
The loads take two cycles to execute
  During the first cycle the address is computed
  During the second cycle the dTLB and data cache are accessed
Ideally I want to issue an instruction dependent on the load so that it can pick up the load value from the bypass just in time
  Assume that a load issues in cycle 0, computes its address in cycle 1, and looks up the cache in cycle 2
  I want to issue the dependent in cycle 2 so that it can pick up the load value just before executing in cycle 3
  Thus the load looks up the cache in parallel with the issuing of the dependent; the dependent is issued even before it is known whether the load will hit in the cache; this is called load hit speculation (re-execute later if the load misses)
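The cycle arithmetic above can be written out as a small helper. The speculative case follows the slide's numbers; the non-speculative case (waiting one more cycle until the hit is known before issuing the dependent) is an assumed baseline added only for comparison, and the function name `load_use_schedule` is hypothetical.

```python
def load_use_schedule(load_issue_cycle=0, speculate=True):
    """Return the cycle numbers for a load and one dependent instruction."""
    addr_cycle = load_issue_cycle + 1    # address computation
    cache_cycle = load_issue_cycle + 2   # dTLB and data cache lookup
    if speculate:
        # Dependent issues while the load is still looking up the cache.
        dep_issue = cache_cycle
    else:
        # Assumed alternative: issue only after the hit is known.
        dep_issue = cache_cycle + 1
    return {
        "load_issue": load_issue_cycle,
        "address": addr_cycle,
        "cache": cache_cycle,
        "dependent_issue": dep_issue,
        "dependent_execute": dep_issue + 1,
    }
```

Under these assumptions, speculation lets the dependent execute in cycle 3 instead of cycle 4, at the cost of re-executing it when the load misses.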
16
Functional units
Right after an instruction is issued it reads its source operands (identified by physical register numbers) from the register file (integer or fp, depending on instruction type)
From stage 4 onwards the instructions execute
Two ALUs
  Branch and shift can execute on ALU1, multiply/divide can execute on ALU2, all other instructions can execute on either ALU
  ALU1 is responsible for triggering rollback in case of branch misprediction (marks all instructions after the branch as squashed, restores the register map from the correct branch stack entry, sets the fetch PC to the correct target)
Four FPUs: one dedicated to fp multiply, one to fp divide, one to fp square root; most of the other instructions execute on the remaining FPU
LSU (load/store unit)
  Two address calculation ALUs (the result of one is selected)
  dTLB is fully associative with 64 entries and translates 44-bit VA to 40-bit PA
  PA is used to match dcache tags (virtually indexed, physically tagged)
17
Register file
Seven read ports in integer file
  Two read ports for each ALU, two read ports for AGU
  One read port shared between store, jr/jalr, move to floating-point file
Three write ports in integer file
  One for each ALU
  One shared by load, jal/jalr, move from floating-point file
64-bit predicate vector attached to integer file
  Needed for CMOVZ
Five read ports in floating-point file
  Two each for adder and multiplier, one shared between store and move
Three write ports in floating-point file
  One each for adder and multiplier, one shared between load and move
18
Result writeback
As soon as an instruction completes execution, the result is written back to the destination physical register
  No need to wait until retirement, since the renamer has guaranteed that this physical destination is associated with a unique instruction in the pipeline
The results are also launched on the bypass network (from register file write ports to inputs of ALU/FPU/AGU)
  This guarantees that dependents can be issued back-to-back and still receive the correct value
  add r3, r4, r5; add r6, r4, r3; (can be issued in consecutive cycles, although the second add will read a wrong value of r3 from the register file)
19
Retirement or commit
Immediately after instructions finish execution they may not be able to leave the pipe
  In-order retirement is necessary for precise exceptions
  When an instruction comes to the head of the active list it can retire
R10k can retire up to 4 instructions every cycle
Retirement involves
  Updating the branch predictor and freeing the branch stack entry if it is a branch instruction
  Moving the store value from the address queue entry to the L1 data cache if it is a store instruction
  Freeing the old destination physical register and updating the register free list
  Freeing the address queue entry if it is a load/store
  And, finally, freeing the active list entry itself
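In-order retirement from the active list can be sketched as below. Only two of the listed actions are modeled (freeing the old destination register and the active list entry); the function name `retire_cycle` and the dictionary fields `name`, `done`, and `old_dest` are assumptions made for the example.

```python
from collections import deque


def retire_cycle(active_list, free_list, width=4):
    """Retire completed instructions from the head of the active list.

    active_list: deque of dicts in program order, oldest on the left.
    free_list:   deque of free physical register numbers.
    """
    retired = []
    # Stop at the first instruction that has not finished executing, so
    # younger completed instructions wait behind it (precise exceptions).
    while active_list and len(retired) < width and active_list[0]["done"]:
        entry = active_list.popleft()        # free the active list entry
        free_list.append(entry["old_dest"])  # free the old physical register
        retired.append(entry["name"])
    return retired
```

Calling this once per simulated cycle retires at most `width` instructions, and none at all while the oldest instruction is still executing.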
20
Memory hierarchy
On-chip L1 instruction and data caches
  Both 2-way set associative, 32 KB
  Data cache has a 32-byte line size while instruction cache has a 64-byte line size
  Both caches are virtually indexed and physically tagged
  With a 4 KB page size, the data cache runs into a synonym problem (upper 2 bits of the index)
    Uses the complete PPN as tag
    Keeps the upper two bits of the index in the L2 tag (how does it help?)
  Data cache has four ports: refill, issued address, load retry, store graduation
  Reads data RAM from both ways speculatively, selects one or zero based on tag RAM outcome
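The "upper 2 bits in index" figure follows from the cache geometry given above. The helper below is a small sketch of that arithmetic; the function name `cache_index_bits` is hypothetical, and all sizes are assumed to be powers of two.

```python
def cache_index_bits(cache_bytes, ways, line_bytes, page_bytes):
    """Return (offset_bits, index_bits, virtual_index_bits).

    virtual_index_bits counts the index bits that lie above the page
    offset, i.e. bits that come from the virtual page number.
    """
    way_bytes = cache_bytes // ways
    offset_bits = line_bytes.bit_length() - 1              # log2(line size)
    index_bits = (way_bytes // line_bytes).bit_length() - 1  # log2(sets)
    page_offset_bits = page_bytes.bit_length() - 1         # log2(page size)
    virtual_index_bits = max(0, offset_bits + index_bits - page_offset_bits)
    return offset_bits, index_bits, virtual_index_bits
```

For the 32 KB, 2-way data cache with 32-byte lines and 4 KB pages, each way covers 16 KB, so the line offset plus set index spans 14 address bits while the page offset covers only 12; the remaining 2 index bits are virtual, which is where synonyms can arise.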
21
Memory hierarchy
Off-chip L2 cache
  2-way pseudo set associative, 512 KB to 16 MB
  Why pseudo associativity?
  An MRU way selection RAM is maintained on-chip
  In the first cycle, the 16 data bytes of the selected way are read in parallel with the tag
  In the next cycle, the next 16 data bytes of the selected way are read in parallel with the tag of the alternate way (achieved by toggling one extra address bit)
  As the tags arrive on-chip they are compared
  Hit on first tag returns critical data in the same cycle so that the processor can continue while the remaining bytes fill
  Hit on second tag must initiate new L2 data read cycles and wait until they complete
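The MRU-based way selection can be modeled at a high level as follows. This sketch ignores the cycle-by-cycle data transfer and keeps only the lookup order; the class name `PseudoAssocL2`, the result strings, and the choice to fill the non-MRU way on a miss are assumptions, not details taken from the slides.

```python
class PseudoAssocL2:
    """Toy 2-way pseudo-associative cache with an MRU way predictor."""

    def __init__(self, num_sets):
        self.tags = [[None, None] for _ in range(num_sets)]
        self.mru = [0] * num_sets  # models the on-chip MRU way selection RAM

    def access(self, set_index, tag):
        first = self.mru[set_index]
        if self.tags[set_index][first] == tag:
            return "fast hit"      # predicted way matched
        second = 1 - first
        if self.tags[set_index][second] == tag:
            self.mru[set_index] = second
            return "slow hit"      # alternate way matched: extra read cycles
        # Miss: replace the non-MRU way (assumed replacement policy).
        self.tags[set_index][second] = tag
        self.mru[set_index] = second
        return "miss"
```

Only one way's data is read at a time, so the external data path stays as narrow as that of a direct-mapped cache, and a correct MRU prediction avoids the extra read cycles.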