Physical Register Inlining (PRI)
Mikko H. Lipasti (1), Brian Mestan (2), and Erika Gunadi (1)
(1) Department of Electrical and Computer Engineering, University of Wisconsin-Madison
(2) IBM Microelectronics, IBM Corporation, Austin, TX
http://www.ece.wisc.edu/~pharm
Demand for Large Register Files
[Figure: pipeline stages Fetch, Dcd, Rnm, Sched, Disp, RF, Exe, Retire, Commit]
- Instruction window
- Deeper pipeline
- Increasing pressure on the register file
- Lots of attention / prior work
Challenges with Scaling Register Files
- Additional pipe stages needed for access
- Increases branch misprediction penalty
- Increases scheduling misprediction penalty
- Requires additional bypass logic
- Further increases pipeline depth
- Increases the demand for more registers
Physical Register Lifetime
[Figure: physical register lifetime for the width4 and width8 configurations]
- Physical registers are managed inefficiently
Prior Work
- Register file caching [Swenson et al. 1988, Zalamea et al. 2000, Postiff et al. 2001, Cruz et al. 2000, Borch et al. 2002]
- Late allocation [Gonzalez et al. 1998, Monreal et al. 1999]
- Efficient management
- Early deallocation [Moudgill et al. 1993]
- Program semantics [Martin et al. 1997, Lo et al. 1999]
- Checkpointing [Martinez et al. 2002, Akkary et al. 2003]
- Value-based optimizations [Jourdan et al. 1998]
Early Deallocation
- Moudgill et al. 1993
- Focused on “last read to release”
- Avoids waiting for the next writer to commit
- Deallocates a register as soon as it is:
  - Complete (complete flag)
  - Unmapped (unmap flag)
  - Free of outstanding readers (reference counter)
- Still requires the next writer to enter the window
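The three release conditions above can be sketched as a per-register predicate. This is a minimal illustration under stated assumptions, not the mechanism as published by Moudgill et al.; the names `PhysRegState`, `complete`, `unmapped`, `readers`, and `can_release` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PhysRegState:
    """Hypothetical per-register bookkeeping for early deallocation."""
    complete: bool = False  # complete flag: the producer has written its result
    unmapped: bool = False  # unmap flag: a later writer renamed the same logical register
    readers: int = 0        # reference counter: consumers that have not read the value yet


def can_release(reg: PhysRegState) -> bool:
    """A register may be freed once it is complete, unmapped, and has no readers left."""
    return reg.complete and reg.unmapped and reg.readers == 0
```

In this sketch `unmapped` can only become true after the next writer has been renamed, which is the limitation the last bullet points out.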
Physical Register Inlining
- Exploits narrow operands: a sizable fraction of operands can be stored in fewer than 8 bits [Canal et al. 2000]
- Often fewer bits than are needed to specify a physical register
- Stores the value instead of the pointer
- Stores narrow values in the map table
- Reduces physical register lifetime
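One way to picture "store the value instead of the pointer" is a map-table entry with a flag that says how to interpret its payload. The sketch below is illustrative only; `MapEntry`, `inlined`, `field`, and `read_operand` are hypothetical names, and the register file is modeled as a plain list.

```python
from dataclasses import dataclass


@dataclass
class MapEntry:
    """Hypothetical map-table entry: either a register pointer or an inlined value."""
    inlined: bool  # True: `field` is the value itself; False: `field` is a physical register number
    field: int


def read_operand(entry: MapEntry, prf: list) -> int:
    """Return the operand value, reading the physical register file only when needed."""
    if entry.inlined:
        return entry.field       # narrow value stored directly in the map table
    return prf[entry.field]      # ordinary case: follow the pointer


prf = [0] * 64                   # toy register file (size chosen for illustration)
prf[12] = 100000                 # wide value: must stay in the register file
wide = MapEntry(inlined=False, field=12)
narrow = MapEntry(inlined=True, field=5)   # the value 5 fits in the pointer field
```

Because the inlined entry never touches `prf`, the physical register that would have held the value 5 can be returned to the free list.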
Operand Significance
[Figure: operand significance]
- The paper also has an FP graph; FP exploits 0.0/1.0 (54%)
Outline
- Motivation
- Prior Work
- Physical Register Inlining
- Experiments
- Quick Microarchitectural Review
- Modifications Needed
- PRI + early deallocation
- Experiments
- Conclusions
Microarchitectural Review: Register Rename/Map Tables
- Maps logical names to physical names
- Removes false name dependences
- Two common types: RAM and CAM
- CAM map is positional
  - Not suitable for storing values
[Figure: RAM map and CAM map organizations; labels: Logical reg #, Phys reg #, V]
Microarchitectural Review: Allocating and Freeing Physical Registers
- Allocates a physical register at decode; the map table entry is updated
- Releases the physical register when the next writer is committed
- Checkpoint and recovery of the register map
  - Optimization to reduce branch misprediction penalty
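The conventional allocate/release policy described above can be sketched with a map table and a free list. This is a simplified model, assuming a RAM-style map and no checkpoints; `Renamer`, `rename_dest`, and `commit` are hypothetical names.

```python
from collections import deque


class Renamer:
    """Hypothetical RAM-style map table with a free list of physical registers."""

    def __init__(self, num_logical: int, num_physical: int):
        # Logical register i starts out mapped to physical register i.
        self.map = list(range(num_logical))
        self.free = deque(range(num_logical, num_physical))

    def rename_dest(self, logical: int):
        """Allocate a new physical register for a writer of `logical`.

        Returns (new_preg, prev_preg); prev_preg is kept (for example in the
        ROB) so it can be released when this writer commits.
        """
        new_preg = self.free.popleft()   # IndexError if none are free (a real pipeline would stall)
        prev_preg = self.map[logical]
        self.map[logical] = new_preg
        return new_preg, prev_preg

    def commit(self, prev_preg: int) -> None:
        """The next writer committed, so the previous mapping can be released."""
        self.free.append(prev_preg)
```

The previous register stays allocated from the moment the next writer is renamed until that writer commits, which is the lifetime PRI and early deallocation try to shorten.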
Modifications to Data Flow
[Figure: pipeline Fetch, Dcd, Rnm, Queue, Sched, Disp, RF, Exe, Retire, Commit, with Map, Payload RAM, ALU, and a "Narrow?" check]
- Execution stage must allow both operands to be read from payload RAM
  - Already supports one immediate operand
- Sign extension between payload RAM and the ALU input
- Narrow checking logic to verify whether the operands are narrow
- Narrow datapath back to the map table
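The narrow check and the sign extension mentioned above can be illustrated with two small helpers. This is a sketch assuming two's complement values and a 7-bit field (the integer PRI width listed in the machine model); `is_narrow` and `sign_extend` are hypothetical names.

```python
def is_narrow(value: int, bits: int = 7) -> bool:
    """True if the signed integer fits in `bits` bits of two's complement."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return lo <= value <= hi


def sign_extend(field: int, bits: int = 7) -> int:
    """Recover the signed value from the low `bits` bits stored in the narrow field."""
    field &= (1 << bits) - 1     # keep only the stored bits
    sign = 1 << (bits - 1)       # position of the sign bit
    return (field ^ sign) - sign


stored = -3 & 0x7F               # what a 7-bit field would hold for the value -3
restored = sign_extend(stored)   # value presented to the ALU input
```

With 7 bits the representable range is -64 to 63, so any result outside that range keeps its physical register.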
Modifications to Map Table
- Registers are freed from both the retire/wb stage and the commit stage
- Tolerant of duplicate deallocations of the same physical register
  - Once as narrow, again when the next writer commits
- Map entries need to be writable from the rename stage and the retire/wb stage
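One simple way to tolerate duplicate deallocations is to keep the free list as one bit per register, so that freeing twice is idempotent. The sketch below is an assumption about how this could be done, not the design from the talk; `FreeBitVector` and its methods are hypothetical. It deliberately does not address a register being reallocated between the two frees, which the slide does not discuss.

```python
class FreeBitVector:
    """Hypothetical free list kept as one bit per physical register."""

    def __init__(self, num_physical: int):
        self.free_bits = [False] * num_physical  # all registers start allocated in this sketch

    def release(self, preg: int) -> None:
        # Setting an already-set bit changes nothing, so a second release is harmless.
        self.free_bits[preg] = True

    def allocate(self) -> int:
        preg = self.free_bits.index(True)  # raises ValueError when nothing is free
        self.free_bits[preg] = False
        return preg

    def count_free(self) -> int:
        return sum(self.free_bits)
```

A queue-based free list would instead hold the same register number twice after a duplicate release and could hand it to two different writers.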
Stale Pointer Problem
[Figure: structures that hold physical register pointers; labels: MAP, Checkpoints, PRF copy, ROB, IssueQ]
- Deallocating physical registers early makes these pointers stale
- Equivalent to the garbage collection issue
- Two choices:
  - Delay deallocation until the pointers are no longer valid (refcount)
  - Update all pointers (ideal IPC)
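The first choice, delaying deallocation with a reference count, can be sketched as follows. This is a minimal model under stated assumptions; `RefCountedRegister` and its methods are hypothetical, and each "pointer" stands for one copy of the register number held in a structure such as the ROB, issue queue, or a checkpoint.

```python
class RefCountedRegister:
    """Hypothetical tracker for one physical register's outstanding pointers."""

    def __init__(self):
        self.pointers = 0            # copies of this register number still held elsewhere
        self.free_requested = False  # set once the value was inlined
        self.freed = False

    def add_pointer(self) -> None:
        self.pointers += 1

    def drop_pointer(self) -> None:
        self.pointers -= 1
        self._maybe_free()

    def request_free(self) -> None:
        """Called when the value was inlined and the register is no longer needed."""
        self.free_requested = True
        self._maybe_free()

    def _maybe_free(self) -> None:
        # Deallocation is delayed until no structure still points at the register.
        if self.free_requested and self.pointers == 0:
            self.freed = True
```

The register stays allocated between `request_free` and the last `drop_pointer`, which is the cost of this option compared with updating every pointer immediately.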
Map Table Checkpoints Problem
- Map table checkpoints need to be updated when a narrow operand is written
- Lazy update
  - Complex, but not cycle-time critical
- Checkpoint reference counting
  - Similar to Akkary et al.
  - Delays deallocation, reduces IPC benefit slightly
Example of WAR Violation
    Load p1 <= MEM[p7]
    And  p2 <= p3 & p4    (narrow)
    Add  p5 <= p1 + p2
    Or   p2 <= p8 | p9    (WAR violation against the Add's read of p2)
- Rare, but frequent enough to affect performance
- Must have an efficient solution
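The hazard in this example can be replayed with a toy register file. The sketch assumes the Or instruction computes a bitwise OR, that the Add is still waiting on the load when the Or executes, and it uses made-up register contents; `run_sequence` and `free_p2_early` are hypothetical names.

```python
def run_sequence(free_p2_early: bool) -> int:
    """Return the value the Add computes for p5 in the four-instruction example."""
    prf = {"p3": 0x07, "p4": 0x03, "p8": 0xF0, "p9": 0x0F}  # assumed contents

    prf["p2"] = prf["p3"] & prf["p4"]      # And: result 0x03 is narrow
    if free_p2_early:
        # p2 was inlined and freed, then handed to the Or, which executes
        # while the Add is still waiting for the load to return.
        prf["p2"] = prf["p8"] | prf["p9"]  # Or overwrites p2 with 0xFF

    prf["p1"] = 10                         # Load finally completes (value assumed)
    return prf["p1"] + prf["p2"]           # Add reads p1 and whatever p2 now holds


correct = run_sequence(free_p2_early=False)
broken = run_sequence(free_p2_early=True)
```

Without early reuse the Add sees the And's result; with it, the Add silently consumes the Or's result through its stale pointer to p2.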
Rename Table WAW Hazards
[Figure: instructions r3 = r1 + r2 (renamed p4 = p1 + p2) and r3 = r1 & r2 (renamed p5 = p1 & p2) moving through Fetch, Decode, Execute, Retire, Commit; the MAP entry for r3 and the ROB (Dst) field step through p3, p4, p5; a narrow result triggers the WAW hazard]
- WAW hazards
  - A narrow value is written to a map entry that has been remapped
- Must ensure that the map entry has not been remapped
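The required check can be sketched as a guarded write: the narrow result is inlined only if the map entry still points at the register of the instruction that produced it. This is an illustrative model, not the hardware mechanism from the talk; `try_inline` and the tuple encoding of map entries are hypothetical.

```python
def try_inline(map_table: dict, logical: str, producer_preg: str, value: int) -> bool:
    """Write a narrow result into the map table only if the entry was not remapped.

    `map_table` maps a logical register name to ("ptr", preg) or ("val", value).
    """
    if map_table.get(logical) == ("ptr", producer_preg):
        map_table[logical] = ("val", value)   # entry still belongs to this producer
        return True
    return False                              # a younger writer owns the entry now


map_table = {"r3": ("ptr", "p5")}             # r3 was already remapped from p4 to p5
stale_write = try_inline(map_table, "r3", "p4", 3)   # older producer p4 has a narrow result
fresh_write = try_inline(map_table, "r3", "p5", 1)   # current producer p5 has a narrow result
```

The write from p4 is dropped because r3 now belongs to p5, so the younger mapping is never overwritten by an older value.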
Integrating PRI with Early Deallocation
- Not all operands are narrow
- Reduces register lifetime further
- Adds unmap flags and complete flags [Moudgill et al. 1993]
[Figure: width4 configuration; series: baseline, PRI, PRI+ER]
Machine Model
- 4-wide fetch, issue, commit
- 512-entry ROB, 256-entry LSQ
- 32-entry scheduler
- 64 physical registers
- Speculative scheduling with selective recovery
- Combined bimodal branch predictor
- 32KB IL1, 32KB DL1, 512KB L2
- 7 bits PRI for integer, 1 bit PRI for FP
Speed Up for Integer Benchmarks
- PRI (checkpoint + reference counting) performs substantially better than previous work
- The reference + checkpoint counting scheme performs close to the ideal case (ideal + lazy)
- Combining PRI and ER increases performance further
PRF Occupancy for Int. Benchmarks
- PRI reduces register file pressure more than previous work (ER)
- Combining PRI and ER reduces the pressure further
Speed Up for FP Benchmarks
- Ammp benchmark: physical registers are not the performance bottleneck
- Art benchmark: a lot of narrow operands to exploit
- Wupwise benchmark: few narrow operands
Conclusion
- PRI can lead to substantial performance improvement for both integer and FP benchmarks
- Ideal update of stale pointers provides only marginal benefit
- Reference + checkpoint counting is the best choice
Future Work
- Interaction of PRI with delayed register allocation (virtual physical registers) [Gonzalez et al. 1998]
- Interaction of PRI with software-based techniques to deallocate dead registers
  - PRI enables a binary-compatible mechanism for the compiler to tell the hardware that a register is dead
  - The compiler can simply insert a load immediate of a narrow value into any register that appears dead
Questions? Thank you