Lecture 7: Register Renaming
Read-After-Write (RAW)
A: R1 = R2 + R3
B: R4 = R1 * R4
Write-After-Read (WAR)
A: R1 = R3 / R4
B: R3 = R2 * R4
Write-After-Write (WAW)
A: R1 = R2 + R3
B: R1 = R3 * R4
[Figure: contents of R1-R4 for each pair when executed in the order A, B and in the order B, A]
Register Data Dependencies (this lecture)
–Output dependence (WAW), denoted o
–Anti-dependence (WAR), denoted a
–True dependence (RAW), denoted t
–Why is RAR not a dependency?
Memory Data Dependencies (later lecture)
Control Dependencies (earlier lectures)
Structural Dependencies
–Instruction must wait until some “structure” is available
  Ex: Divider, ROB entry, Branch color/tag, etc.
WAR dependencies are from reusing registers
Original code:
A: R1 = R3 / R4
B: R3 = R2 * R4
With B writing R5 instead of R3, the WAR dependency disappears:
A: R1 = R3 / R4
B: R5 = R2 * R4
With no dependencies, reordering still produces the correct results
[Figure: contents of R1-R5 for both execution orders, before and after the change]
WAW dependencies are also from reusing registers
Original code:
A: R1 = R2 + R3
B: R1 = R3 * R4
With A writing R5 instead of R1, the WAW dependency disappears:
A: R5 = R2 + R3
B: R1 = R3 * R4
Same solution works
[Figure: contents of R1-R5 for both execution orders, before and after the change]
Finite number of registers
–At some point, you’re forced to overwrite somewhere
–Most RISC: 32 registers, x86: only 8, x86-64: 16
Loops, Code Reuse
–If you write a value to R1 in a loop body, then R1 will be reused every iteration, which induces many false dependencies
–Loop unrolling can help a little
  Will run out of registers at some point anyway
  Trade off with code bloat
–Short function calls can result in similar register reuse
  Inlining can help a little
Add more registers to the ISA? BAD!!!
–Changing the ISA can break binary compatibility
  x86-64 mostly doesn’t break compatibility, but it’s a hack
–All code must be recompiled
–Does not address register overwriting due to code reuse from loops and function calls
–Not a scalable solution
Processor has more registers than specified by the ISA
–Temporarily map ISA registers (“logical” or “architected” registers) to the physical registers to avoid overwrites
Components:
–mapping mechanism
–physical registers
  allocated vs. free registers
  allocation/deallocation mechanism
–state maintenance (commit, mispredictions, etc.)
[Figure: architected registers R0-R7 mapped onto physical registers T0 to Tn-1]
Original Code          Renamed Code
R2 = R1 + R3           T1 = R1 + R3
R4 = R2 - R6           R4 = T1 - R6
…                      …
R2 = R7 / R5           T20 = R7 / R5
BEQ R2, #1             BEQ T20, #1
…                      …
R2 = R4 * R1           T7 = R4 * R1
R6 = Load [R2]         R6 = Load [T7]
The original code has WAW and WAR dependencies from reusing R2; the renamed code has no false dependencies!
Renaming one instruction, Dest = Src1 op Src2:
–Src1 and Src2 are looked up in the mapping mechanism, giving Tag S1 and Tag S2
–Tag D is taken from the unmapped physical registers
–The mapping mechanism records Dest = Tag D
–Renamed instruction: Tag D = Tag S1 op Tag S2
Repeat for each instruction
Lookup Table (the RAT)
–One entry per architected register
–Entry stores physical location of most recent version of the logical register
–Most recent version may be in the physical register file (PRF) or in the architected register file (ARF)
Renaming example: all RAT entries (R0-R7) start empty, so every architected register is read from the ARF
Original          Renamed           Free PRegs before renaming
R1 = R2 + R3      T13 = R2 + R3     T13, T14, T9, T7
R5 = R4 - R1      T14 = R4 - T13    T14, T9, T7
R1 = R1 * R5      T9 = T13 * T14    T9, T7
R2 = R5 / R1      T7 = T14 / T9     T7
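The one-instruction-at-a-time renaming in this example can be sketched in a few lines of Python. This is an illustrative model only: the function name `rename_sequential` and the tuple layout `(dest, op, src1, src2)` are assumptions made for the sketch, and an architected register that has no RAT entry is simply named by itself to stand for "read it from the ARF".

```python
def rename_sequential(instrs, free_pregs, arch_regs):
    """Rename instructions one at a time (hypothetical helper, not from the slides).

    instrs:     list of (dest, op, src1, src2) using architected names
    free_pregs: physical register tags, handed out in list order
    arch_regs:  all architected register names
    """
    # An entry equal to its own name means "most recent version is in the ARF".
    rat = {r: r for r in arch_regs}
    free = list(free_pregs)
    renamed = []
    for dest, op, src1, src2 in instrs:
        # 1. Look up the sources BEFORE touching the destination mapping.
        tag1, tag2 = rat[src1], rat[src2]
        # 2. Allocate a fresh physical register for the destination.
        tag_d = free.pop(0)
        # 3. Record the new mapping so later readers see this version.
        rat[dest] = tag_d
        renamed.append((tag_d, op, tag1, tag2))
    return renamed, rat, free


program = [
    ("R1", "+", "R2", "R3"),
    ("R5", "-", "R4", "R1"),
    ("R1", "*", "R1", "R5"),
    ("R2", "/", "R5", "R1"),
]
regs = [f"R{i}" for i in range(8)]
renamed, rat, free = rename_sequential(program, ["T13", "T14", "T9", "T7"], regs)
for tag_d, op, t1, t2 in renamed:
    print(f"{tag_d} = {t1} {op} {t2}")
```

Reading the sources before updating the destination matters for an instruction such as R1 = R1 * R5, which must read the old version of R1 and produce a new one.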
Renaming a group of instructions in the same cycle (destination tags come from the free register pool: T10, T31, T19, T6)
Original            Source tags from RAT    Destination tag
R1 = R2 + R3        T16, T23                T10
R4 = R5 - R7        T39, T7                 T31
R3 = R0 / R2        T14, T16                T19
R5 = Ld 12[R6]      T5, X                   T6
Don’t rename immediates (the X above)
For N-wide superscalar: 2N RAT read-ports, N RAT write-ports
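Same group, but the third instruction now reads R1, which the first instruction writes:
R1 = R2 + R3
R4 = R5 - R7
R3 = R0 / R1
R5 = Ld 12[R6]
–The RAT lookup for R1 returns the mapping from before this group: this is the wrong version of R1
–The third instruction should be using the version of R1 written by the first instruction, whose tag comes from the free register pool (T10, T31, T19, T6)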
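What a parallel renamer must reproduce (RAT before the group: R1 mapped to T34, R2 mapped to T16)
Original          Tags read from RAT    Result of sequential renaming
R1 = R2 + R1      T16, T34              T10 = T16 + T34
R2 = R1 - R2      T34, T16              T31 = T10 - T16
R1 = R2 / R1      T16, T34              T19 = T31 / T10
R1 = R2 >> R1     T16, T34              T6 = T31 >> T19
Destination tags come from the free register pool: T10, T31, T19, T6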
Intra-Group Dependency Checker
–Inst 0 to Inst 3 each supply Src L, Src R and Dest
–The RAT lookups and the new tags from the free register pool feed the checker, which produces the final source tags T 0L, T 0R, T 1L, T 1R, T 2L, T 2R, T 3L, T 3R
–No check is needed for src 0L and src 0R, since the 1st inst in a group has no earlier insts to be dependent on
–Similarly, src 1L and src 1R cannot be dependent on dst 1, dst 2 or dst 3
Intra-group dependency check logic (4-wide)
–src 0L, src 0R: no comparators
–src 1L, src 1R: each compared (=) against dst 0
–src 2L, src 2R: each compared against dst 0 and dst 1
–src 3L, src 3R: each compared against dst 0, dst 1 and dst 2
–Each comparator drives a mux that selects between the RAT output (R 1L, R 1R, …) and the tag of the matching dst, giving the final tag (T 1L, T 1R, …); the muxes for one source form a chain
N-wide rename has O(N) gate delay?
Total number of comparisons: $2 \cdot \frac{n(n-1)}{2} = n^2 - n = O(n^2)$
[Figure: 8-wide example for src 7R; the comparisons against the earlier dsts are combined in parallel rather than in a chain to select T 7R over the RAT output R 7R]
Gate delay reduced down to $O(\log_2 N)$
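Updating the RAT when several instructions in a group write the same register:
R1 = R2 + R1
R2 = R1 - R2
R1 = R2 / R1
R1 = R2 >> R1
Only the mapping for R1 from the last instruction should be written into the RAT
–use dst 0 if dst 0 != dst 1, dst 0 != dst 2 and dst 0 != dst 3
–use dst 1 if dst 1 != dst 2 and dst 1 != dst 3
–use dst 2 if dst 2 != dst 3
–always use dst 3
Condition: use mapping if instruction is last writer to the register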
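The group rename can be sketched in software as follows: every RAT read sees the state from before the group, the intra-group check overrides a source with the tag of the most recent earlier writer, and only the last writer of each register updates the RAT. The function name `rename_group` and the tuple layout are assumptions for this sketch; the loops model comparators that work in parallel in hardware.

```python
def rename_group(group, rat, free_tags):
    """Rename a group of (dest, op, src_l, src_r) as if done in one cycle.

    Hypothetical helper: RAT reads all see the pre-group mappings, so the
    result does not depend on updating the table between instructions.
    """
    tags = list(free_tags[:len(group)])   # one new tag per instruction
    renamed = []
    for i, (dest, op, src_l, src_r) in enumerate(group):
        final = []
        for src in (src_l, src_r):
            tag = rat[src]                # default: mapping read from the RAT
            for j in range(i):            # compare against earlier dsts only
                if group[j][0] == src:
                    tag = tags[j]         # later matches override earlier ones
            final.append(tag)
        renamed.append((tags[i], op, final[0], final[1]))

    new_rat = dict(rat)
    for i, inst in enumerate(group):
        dest = inst[0]
        # Write the mapping only if no later instruction writes the same register.
        if all(group[k][0] != dest for k in range(i + 1, len(group))):
            new_rat[dest] = tags[i]
    return renamed, new_rat


group = [
    ("R1", "+", "R2", "R1"),
    ("R2", "-", "R1", "R2"),
    ("R1", "/", "R2", "R1"),
    ("R1", ">>", "R2", "R1"),
]
renamed, new_rat = rename_group(group, {"R1": "T34", "R2": "T16"},
                                ["T10", "T31", "T19", "T6"])
for tag_d, op, left, right in renamed:
    print(f"{tag_d} = {left} {op} {right}")
```

With the RAT and free pool taken from the sequential-renaming slide, this should give the same tags as renaming the four instructions one after another.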
Commit with an ARF (example: the RAT maps R3 to T42 in the PRF)
–Architected register file contains the committed/non-speculative processor state
–When an instruction commits, it updates the ARF with the new value
–The ARF now contains the correct value; update the RAT so that R3 refers to the ARF
–T42 is no longer needed, return it to the physical register free pool
Commit when a newer writer exists (example: the committing instruction wrote R3 into T42, but the RAT maps R3 to T17)
–Update ARF as usual
–Deallocate physical register (T42)
–Don’t touch the RAT! (Someone else is the most recent writer to R3)
At some point in the future, the newer writer of R3 commits
–Update ARF and deallocate physical register (T17)
–This instruction was the most recent writer, now update the RAT
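The two commit cases can be captured in one small routine. This is a sketch under assumed data structures (plain dictionaries for the ARF, PRF and RAT, a list for the free pool, and `None` in the RAT meaning "the value is in the ARF"); none of these names come from the slides.

```python
def commit(dest, tag, arf, prf, rat, free_pool):
    """Commit an instruction that wrote architected register `dest` into `tag`."""
    arf[dest] = prf[tag]          # update ARF as usual
    if rat.get(dest) == tag:
        # This instruction is still the most recent writer:
        # the newest value now lives in the ARF.
        rat[dest] = None
    # Otherwise a newer writer owns the RAT entry: don't touch it.
    free_pool.append(tag)         # the physical register is no longer needed


arf = {"R3": 0}
prf = {"T42": 5, "T17": 9}
rat = {"R3": "T17"}               # T17 holds the newest version of R3
free_pool = []

commit("R3", "T42", arf, prf, rat, free_pool)   # older writer commits
print(arf, rat, free_pool)
commit("R3", "T17", arf, prf, rat, free_pool)   # newer writer commits
print(arf, rat, free_pool)
```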
Physical register file unified with the ROB
–Each ROB entry holds an instruction (inst) and its result (data), so the ROB doubles as the PRF
–Instructions are kept in program order, with the oldest at ROB_head and new entries added at ROB_tail
Free registers = all entries from ROB_tail to ROB_head – 1 (wrapping around)
Instructions allocated into ROB in-order, so physical registers also allocated in same order
–dst i = T [ROB_tail]
–dst i+1 = T [ (ROB_tail + 1) % ROB_size ]
–dst i+2 = T [ (ROB_tail + 2) % ROB_size ]
–…
–dst i+N-1 = T [ (ROB_tail + N-1) % ROB_size ]
No need to explicitly manage free pool
–just increment ROB_tail as physical registers are allocated, increment ROB_head as registers are deallocated
Inefficiency: allocate registers to all instructions
–Branches, stores (and some other insts) don’t need physical registers
Asymmetric datapath
–sometimes read values from ARF, sometimes from the PRF
–requires both structures to be heavily ported
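A minimal sketch of ROB-based allocation, assuming the convention that ROB_tail is where new entries are allocated and ROB_head is the oldest entry. The class name `RobPrf` and the explicit `count` field (used to tell a full ROB from an empty one) are assumptions of the sketch.

```python
class RobPrf:
    """ROB whose entry index doubles as the physical register tag."""

    def __init__(self, size):
        self.size = size
        self.head = 0     # oldest in-flight entry
        self.tail = 0     # next entry to allocate
        self.count = 0    # entries currently in use

    def allocate(self, n):
        """Allocate n consecutive entries; return their tags, or None to stall."""
        if self.count + n > self.size:
            return None
        tags = [(self.tail + k) % self.size for k in range(n)]
        self.tail = (self.tail + n) % self.size
        self.count += n
        return tags

    def commit(self, n=1):
        """Retire the n oldest entries, which frees their registers."""
        if n > self.count:
            raise ValueError("cannot commit more entries than are in flight")
        self.head = (self.head + n) % self.size
        self.count -= n


rob = RobPrf(4)
print(rob.allocate(3))   # tags for a 3-instruction group
rob.commit(2)            # two oldest instructions retire
print(rob.allocate(3))   # allocation wraps around the end of the ROB
```

The free pool is implicit: it is whatever lies between the tail and the head, so no separate free list is needed.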
Combine both ARF and PRF into a single register file
–Before, ARF and PRF could be the same hardware structure, but they have distinct name spaces
  e.g., ARF (R0-R7) mapped to T0-T7 and PRF mapped to T8-T99
–For a unified RF, the committed R0 could be mapped anywhere (T0-T99)
Need some way to track the “committed” state
Speculative RAT and Committed RAT (one entry each for R0-R7)
–The speculative RAT tracks the locations of the most recent version of each architected register
–The committed RAT along with the pointed at registers implement the logical equivalent of the ARF
–Both RATs may point to the same physical location (R0 and R5 in the figure): the most recent writer has also committed
Unified register file example
–Register File: T0-T23; initially both the Speculative RAT and the Committed RAT map R0-R7 to T0-T7
–Free Pool: T8-T23
Original (ROB entry)      Renamed
A: R1 = R2 + R4           T8 = T2 + T4
B: R4 = R2 - R7           T9 = T2 - T7
C: R2 = R1 * R4           T10 = T8 * T9
D: R1 = R1 + #1           T11 = T8 + #1
E: R7 = R4 / R1           T1 = T9 / T11
–When A and B commit, the Committed RAT is updated and the previously committed registers for R1 and R4 (T1 and T4) return to the free pool
–E is allocated T1, a register that was freed by an earlier commit
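The interplay of the two RATs can be sketched as below. The helper names, the dictionary-based RATs, and the list used as a LIFO free pool are assumptions for illustration; the commit timing (A commits before E is renamed, B commits afterwards) is also assumed, chosen so that E receives T1 as in the example.

```python
def rename(dest, srcs, spec_rat, free_pool):
    """Rename one instruction against the speculative RAT."""
    src_tags = [spec_rat[s] for s in srcs]
    tag = free_pool.pop()            # top of stack (LIFO free pool)
    spec_rat[dest] = tag
    return tag, src_tags


def commit(dest, tag, committed_rat, free_pool):
    """Commit: the previously committed version of dest becomes dead."""
    stale = committed_rat[dest]
    committed_rat[dest] = tag
    free_pool.append(stale)


spec_rat = {f"R{i}": f"T{i}" for i in range(8)}
committed_rat = dict(spec_rat)
free_pool = [f"T{i}" for i in range(23, 7, -1)]   # T8 ends up on top

a = rename("R1", ["R2", "R4"], spec_rat, free_pool)   # A: R1 = R2 + R4
b = rename("R4", ["R2", "R7"], spec_rat, free_pool)   # B: R4 = R2 - R7
c = rename("R2", ["R1", "R4"], spec_rat, free_pool)   # C: R2 = R1 * R4
d = rename("R1", ["R1"], spec_rat, free_pool)         # D: R1 = R1 + #1
commit("R1", a[0], committed_rat, free_pool)          # A commits, frees T1
e = rename("R7", ["R4", "R1"], spec_rat, free_pool)   # E: R7 = R4 / R1
commit("R4", b[0], committed_rat, free_pool)          # B commits, frees T4
print(a, b, c, d, e)
```

Note that D's commit would later free T8, the register A wrote, which is exactly the "overwriter commits" rule for deallocation.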
Previous example showed a stack data structure (LIFO) for the free pool
[Figure: stack holding T9, T1, T34, T25, T23, T17, T8; 3 regs are allocated to the 4-wide rename from the top-of-stack (TOS) while T28 and T13 are pushed from commit]
Stack HW is complex due to need to simultaneously read and write the top-of-stack
A queue structure (FIFO) is easier to implement
–independent reading/writing of head and tail
[Figure: queue holding T9, T1, T34, T25, T23, T17, T8; 3 regs allocated at Pool Head, 2 regs (T13, T28) deallocated into Pool Tail]
Corner case still exists when pool is empty
–Either stall rename for one cycle or need more complex HW to bypass dealloc’d registers to the renamer
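A FIFO free pool is easy to model with `collections.deque` from the Python standard library. The class name `FreePool` and the stall-by-returning-`None` convention are assumptions of this sketch; the bypass alternative mentioned above is not modeled.

```python
from collections import deque


class FreePool:
    """FIFO free pool: allocate from the head, deallocate into the tail."""

    def __init__(self, tags):
        self.queue = deque(tags)

    def allocate(self, n):
        """Return n tags from the pool head, or None if rename must stall."""
        if len(self.queue) < n:
            return None
        return [self.queue.popleft() for _ in range(n)]

    def release(self, tags):
        """Append tags freed at commit to the pool tail."""
        self.queue.extend(tags)


pool = FreePool(["T9", "T1", "T34", "T25", "T23", "T17", "T8"])
print(pool.allocate(3))        # 3 regs allocated from the head
pool.release(["T13", "T28"])   # 2 regs deallocated into the tail
print(list(pool.queue))
```

Because allocation only touches the head and deallocation only touches the tail, the two can proceed in the same cycle without interfering, which is the point of preferring a queue over a stack.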
State during speculation (figure: an instruction stream containing a branch, br, with the ARF and the RAT)
–ARF state corresponds to state prior to oldest non-committed instruction
–As instructions are processed, the RAT corresponds to the register mapping after the most recently renamed instruction
–On a branch misprediction, wrong-path instructions are flushed from the machine
–The RAT is left with an invalid set of mappings corresponding to the wrong-path instruction state
Recovery by draining the pipeline
–Correct-path instructions arrive from fetch (foo), but can’t be renamed because the RAT is wrong
–Allow all remaining instructions to execute and commit; ARF corresponds to last committed instruction
–ARF now corresponds to the state right before the next instruction to be renamed (foo)
–Reset RAT so that all mappings refer to the ARF
–Resume renaming the new correct-path instructions from fetch
Pros: Very simple to implement
Cons: Performance loss due to stalls
Recovery with RAT checkpoints
At each branch, make a copy of the RAT (register mapping at the time of the branch) and keep it as a checkpoint, taking a checkpoint from the Checkpoint Free Pool
On a misprediction:
1. flush wrong-path instructions
2. deallocate RAT checkpoints
3. recover RAT from checkpoint
4. resume renaming (foo)
No need to stall front-end (?)
–need to “flash copy” RAT both for making checkpoints and recovering
–need some way to “hunt down” wrong-path checkpoints for deallocation
  can “walk” the ROB, but this may take more than one cycle, which may introduce stalls; still faster than stall-and-drain
More hardware
–need one checkpoint per branch
–what if the code has nothing but branches?
  worst case needs one checkpoint per ROB entry
  can assign one checkpoint per branch color
–stall front-end when out of branch colors/checkpoints
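Checkpoint-based recovery can be sketched as follows. The class and method names are hypothetical, checkpoints are kept in a list ordered oldest to youngest, and a dictionary copy stands in for the hardware "flash copy"; with that ordering, the wrong-path checkpoints are simply everything younger than the mispredicted branch.

```python
class CheckpointedRat:
    """RAT with a bounded number of per-branch checkpoints (illustrative)."""

    def __init__(self, rat, max_checkpoints):
        self.rat = dict(rat)
        self.max_checkpoints = max_checkpoints
        self.checkpoints = []            # (branch_id, RAT copy), oldest first

    def take_checkpoint(self, branch_id):
        """Copy the RAT at a branch; return False if the front-end must stall."""
        if len(self.checkpoints) == self.max_checkpoints:
            return False
        self.checkpoints.append((branch_id, dict(self.rat)))
        return True

    def recover(self, branch_id):
        """On a misprediction, restore the RAT saved at branch_id."""
        index = next(i for i, (b, _) in enumerate(self.checkpoints)
                     if b == branch_id)
        self.rat = dict(self.checkpoints[index][1])
        # Deallocate this checkpoint and every younger (wrong-path) one.
        del self.checkpoints[index:]

    def resolve(self, branch_id):
        """A correctly predicted branch releases its checkpoint."""
        self.checkpoints = [(b, c) for b, c in self.checkpoints
                            if b != branch_id]


cr = CheckpointedRat({"R1": "T1"}, max_checkpoints=2)
cr.take_checkpoint("b0")
cr.rat["R1"] = "T5"              # renaming continues past b0
cr.take_checkpoint("b1")
cr.rat["R1"] = "T6"
cr.recover("b0")                 # b0 was mispredicted
print(cr.rat, cr.checkpoints)
```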
Each register-writing ROB entry tracks two physical registers
1. Its allocated destination register
2. The previous physical register mapping for its architected register
Example
–R1 mapped to T23
–Rename new instruction X, which overwrites R1
  R1 now mapped to T19
  X also records an “undo mapping” of T23
–Recovery: walk ROB backwards applying the undo mappings
Lower overhead: don’t need full copies of the RAT
Slower?: need to walk the ROB
Flexibility: can recover to any instruction; not just branches
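The undo-mapping scheme can be sketched like this. The ROB is modeled as a list of dictionaries with assumed field names (`dest`, `new`, `undo`), and the sketch also returns each squashed instruction's destination register to the free pool during the walk, which is an added assumption rather than something this slide states.

```python
def rename(dest, rat, free_pool, rob):
    """Rename a register-writing instruction and log its undo mapping."""
    new_tag = free_pool.pop(0)
    rob.append({"dest": dest, "new": new_tag, "undo": rat[dest]})
    rat[dest] = new_tag
    return new_tag


def recover(rob, rat, free_pool, keep):
    """Squash all but the oldest `keep` ROB entries, youngest first."""
    while len(rob) > keep:
        entry = rob.pop()                    # walk the ROB backwards
        rat[entry["dest"]] = entry["undo"]   # apply the undo mapping
        free_pool.append(entry["new"])       # wrong-path register is free again


rat = {"R1": "T23"}
free_pool = ["T19", "T20"]
rob = []
rename("R1", rat, free_pool, rob)   # X: R1 now mapped to T19, undo = T23
rename("R1", rat, free_pool, rob)   # Y: R1 now mapped to T20, undo = T19
recover(rob, rat, free_pool, keep=0)
print(rat, free_pool)
```

Walking from youngest to oldest matters: undoing Y first restores T19, and undoing X then restores T23, so the RAT ends up exactly as it was before X.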
For ROB-based PRF, deallocation is simple:
–ROB_tail reset to point right after the mispredicted branch
For unified RF, allocated registers may be anywhere in the register file
–Some sort of ROB walk still required to deallocate the wrong-path PRegs; do at same time with checkpoint deallocation
[Figure: ROB containing br, st, br entries, with the PReg Free Pool and the Committed RAT]
RAT: highly ported SRAM
–3N ports: 2N read, 1N write
–1 entry per architected register: includes int, FP, MMX/SSE, lo/hi (MIPS), control registers, FP status, predicate registers (IA64), flags (x86), etc.
–Each entry is $\log_2 |PRF|$ bits wide, plus 1 valid bit when RF not unified (!valid means the register is in the ARF)
–Typical N = 3, 4; |ARF| = |PRF| = 100±
–Only bytes, but 9-12 ports
–SRAM latency typically quadratic w.r.t. #ports
Dep Check Logic
–Almost full pairwise dependency checks: $O(N^2)$ comparisons
Pipelining rename (stages REN1, REN2)
–SRAM lookup easily pipelined
–Dependency check is just combinational logic; easily pipelined
–Group ABCD enters REN1 and moves to REN2 while the next group, EFGH, enters REN1
What if there’s a dependency between groups?
–ABCD haven’t updated the RAT when EFGH reads the RAT
Similar to intra-group dependency checking, now must perform inter-group dependency checking
–The RAT lookup gives the register mappings for EFGH if no dependencies
–Overrides if dependency exists between ABCD and EFGH
[Figure: groups ABCD and EFGH in stages REN1 and REN2]
[Figure: cycle time as the original renaming logic is split across more pipeline stages, with added overhead due to pipelined rename. Labeled design points: 1ns/cycle, 1GHz; 0.5ns/cycle, 2GHz; 0.32ns/cycle, 3.14GHz]
More stages
–higher branch mispredict penalty
–a lot more implementation complexity
  dep check with previous group, prev-prev group, etc.
  pipeline control logic, latching overhead
  more circuits (area, power), more design effort
Higher frequency
–more performance if pipeline not overly exposed
  need sufficiently high branch prediction accuracy
–power goes up even more ($P = \frac{1}{2} C V^2 f$)
  This is on top of the extra power for the extra circuits
  Extra logic effectively increases the C term
How big should the physical register file be?
–ROB-based: PRF entries == ROB entries
–Unified: ???
Should have one register per instruction
–How to count instructions?
–Every instruction from rename to retire
  instructions in fetch/decode stages haven’t been renamed, and therefore don’t need physical registers
–Not every instruction needs a register (branches, stores)
How many instructions does this add up to?
–N × Stages(Rename to Dispatch) + ROB_size
–Less those expected to not need destinations
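As a quick arithmetic sketch of this estimate: the function below computes N × Stages(Rename to Dispatch) + ROB_size and then scales it by the fraction of instructions expected to need a destination. Modeling "less those expected to not need destinations" as a single fraction is an assumption of the sketch, and all parameter names and the sample numbers are hypothetical.

```python
import math


def unified_prf_estimate(width, rename_to_dispatch_stages, rob_size,
                         no_dest_fraction=0.0):
    """Estimate in-flight physical registers for a unified register file.

    no_dest_fraction: assumed share of instructions (branches, stores, ...)
    that do not need a destination register.
    """
    in_flight = width * rename_to_dispatch_stages + rob_size
    return math.ceil(in_flight * (1.0 - no_dest_fraction))


# Hypothetical machine: 4-wide, 2 stages from rename to dispatch, 128-entry ROB.
print(unified_prf_estimate(4, 2, 128))          # every instruction has a dest
print(unified_prf_estimate(4, 2, 128, 0.25))    # a quarter have no dest
```

This counts only registers for in-flight instructions; a unified register file also has to hold the committed version of each architected register.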
Physical register lifetime along the pipeline (IF, ID, REN, Disp, RS, ROB, Commit)
1. No register allocated
2. Register allocated, but contents are bogus
3. Register contains valid data
4. Overwriter commits; register has stale value; deallocate
Region 3 is the only time a physical storage location is really needed
–Actually, only needed until last consumer reads the value
PRF needs to be large enough for all instructions in Region 2, but none of the registers will contain anything useful!