Physical Register Inlining (PRI)


1 Physical Register Inlining (PRI)
Mikko H. Lipasti (1), Brian Mestan (2), and Erika Gunadi (1)
(1) Department of Electrical and Computer Engineering, University of Wisconsin-Madison
(2) IBM Microelectronics, IBM Corporation, Austin, TX

2 Demand for Large Register Files
[Figure: pipeline diagram with stages Fetch, Dcd, Rnm, Sched, Disp, RF, Exe, Retire, Commit, annotated "Instruction Window" and "Deeper Pipeline"]
Increasing pressure on the register file
Lots of attention / prior work

3 Challenges with Scaling Register Files
Additional pipe stages needed for access
  Increases branch misprediction penalty
  Increases scheduling misprediction penalty
Requires additional bypass logic
Further increases pipeline depth
Increases the demand for more registers

4 Physical Register Lifetime
[Figure: physical register lifetime, with panels for width4 and width8]
Managed inefficiently

5 Prior Work
Register file caching [Swenson et al. 1988, Zalamea et al. 2000, Postiff et al. 2001, Cruz et al. 2000, Borch et al. 2002]
Late allocation [Gonzalez et al. 1998, Monreal et al. 1999]
Efficient management
  Early deallocation [Moudgill et al. 1993]
  Program semantics [Martin et al. 1997, Lo et al. 1999]
  Checkpointing [Martinez et al. 2002, Akkary et al. 2003]
  Value-based optimizations [Jourdan et al. 1998]

6 Early Deallocation Moudgill et al. 1993
Focused on “last read to release”
Avoid waiting for the next writer to commit
Deallocate registers as soon as:
  Complete (complete flag)
  Unmapped (unmap flag)
  No outstanding readers (reference counter)
Still requires next writer to enter the window
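The three release conditions listed on this slide can be sketched as a small predicate. This is only an illustration under assumed names (`PhysRegState`, `can_free`, and the field names are hypothetical, not taken from Moudgill et al.).

```python
from dataclasses import dataclass


@dataclass
class PhysRegState:
    """Per-register bookkeeping (hypothetical field names)."""
    complete: bool = False  # complete flag: the producer has written its result
    unmapped: bool = False  # unmap flag: a later writer renamed the same logical register
    readers: int = 0        # reference counter: consumers that have not read yet

    def can_free(self) -> bool:
        # All three conditions from the slide must hold at the same time.
        return self.complete and self.unmapped and self.readers == 0


reg = PhysRegState()
reg.readers += 1   # a consumer is renamed and will read this register
reg.complete = True
reg.unmapped = True
still_busy = reg.can_free()   # the consumer has not read yet
reg.readers -= 1              # the last read happens
now_free = reg.can_free()
```

Note that `unmapped` only becomes true once the next writer has entered the window, which is the remaining limitation the slide points out.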

7 Physical Register Inlining
Exploits narrow operands: a sizable fraction of operands can be stored in less than 8 bits [Canal et al. 2000]
Often fewer bits than needed to specify a physical register
Store the value instead of the pointer
  Stores narrow values in the map table
  Reduces physical register lifetime
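A minimal sketch of "store the value instead of the pointer", assuming a map entry with a one-bit tag that says how to interpret its payload. `MapEntry`, `pointer_bits`, `read_operand`, and the 128-register file are all hypothetical choices for illustration, not the layout used in the paper.

```python
from dataclasses import dataclass


@dataclass
class MapEntry:
    inlined: bool  # True: payload is the operand value itself
    payload: int   # a physical register number, or a narrow value


def pointer_bits(num_phys_regs: int) -> int:
    """Bits needed to name one of num_phys_regs physical registers."""
    return (num_phys_regs - 1).bit_length()


def read_operand(entry: MapEntry, prf: list) -> int:
    # An inlined entry needs no register file access at all.
    return entry.payload if entry.inlined else prf[entry.payload]


prf = [0] * 128                # assumed register file size
prf[17] = 1_000_000            # a wide value has to live in the register file
wide = MapEntry(inlined=False, payload=17)
narrow = MapEntry(inlined=True, payload=5)   # fits in the pointer field
```

With 128 registers the pointer field is 7 bits wide, so any value that fits in 7 bits can occupy the field directly and the physical register can be released.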

8 Operand Significance
Also have an FP graph in the paper, which exploits 0.0/1.0 (54%)

9 Outline
Motivation
Prior Work
Physical Register Inlining
  Quick Microarchitectural Review
  Modifications Needed
  PRI + early deallocation
Experiments
Conclusions

10 Microarchitectural Review
Register Rename/Map Tables
  Maps logical names to physical names
  Removes false name dependences
Two common types: RAM and CAM
  CAM map is positional
  Not suitable for storing values
[Figure: RAM map (indexed by logical reg #, entries hold phys reg #) and CAM map (one entry per phys reg #, entries hold a valid bit V and logical reg #)]
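How a RAM-style map removes false name dependences can be sketched with a few lines of renaming. The function name `rename`, the tuple encoding of instructions, and the identity initial mapping are assumptions made for this example; stalling on an empty free list is not modeled.

```python
def rename(instrs, num_logical, num_physical):
    """Rename (dst, src1, src2) logical register triples with a RAM map.

    The map is indexed by logical register number and holds a physical
    register number.
    """
    map_table = list(range(num_logical))                 # logical i -> physical i
    free_list = list(range(num_logical, num_physical))   # remaining registers
    renamed = []
    for dst, src1, src2 in instrs:
        p1, p2 = map_table[src1], map_table[src2]  # read sources first
        pd = free_list.pop(0)                      # fresh destination register
        map_table[dst] = pd
        renamed.append((pd, p1, p2))
    return renamed


# r3 is written twice (WAW) and read in between (WAR).
program = [(3, 1, 2), (4, 3, 1), (3, 5, 6)]
result = rename(program, num_logical=8, num_physical=16)
```

In `result` the two writes to r3 land in different physical registers, so only the true dependence from the first instruction to the second remains.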

11 Microarchitectural Review
Allocating and Freeing Physical Registers
  Allocates physical register at decode; map table entry is updated
  Releases physical register when next writer is committed
Checkpoint and Recovery of Register Map
  Optimization to reduce branch misprediction penalty
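The conventional lifetime described here (allocate at decode, release when the next writer commits) can be sketched as follows. The class `Renamer` and the choice to carry the previous mapping in the ROB entry are assumptions for illustration; checkpointing and recovery are left out.

```python
from collections import deque


class Renamer:
    def __init__(self, num_logical, num_physical):
        self.map_table = list(range(num_logical))
        self.free_list = deque(range(num_logical, num_physical))
        self.rob = deque()  # entries: (logical dst, new phys reg, previous phys reg)

    def decode(self, dst):
        """Allocate a physical register and update the map table entry."""
        new = self.free_list.popleft()
        prev = self.map_table[dst]
        self.map_table[dst] = new
        self.rob.append((dst, new, prev))
        return new

    def commit(self):
        """Commit the oldest writer and release the register it superseded."""
        _, _, prev = self.rob.popleft()
        self.free_list.append(prev)
        return prev


r = Renamer(num_logical=4, num_physical=6)
first = r.decode(2)     # first writer of logical register 2
second = r.decode(2)    # next writer of the same logical register
freed_a = r.commit()    # frees the register that was mapped before `first`
freed_b = r.commit()    # only now is `first`'s register released
```

The register allocated for `first` stays occupied until the second writer commits, however early its last reader finished. That gap is what early deallocation and PRI target.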

12 Modifications to Data Flow
[Figure: pipeline (Fetch, Dcd, Rnm, Queue, Sched, Disp, RF, Exe, Retire, Commit) with the map table, payload RAM, ALU, and a "Narrow?" check]
Execution stage must allow both operands to be read from payload RAM
  Already supports one immediate operand
Sign extension between payload RAM and the ALU input
Narrow checking logic to verify if the operands are narrow
Narrow datapath back to the map table
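The narrow check and the sign extension on the way to the ALU can be sketched with plain integer arithmetic. The default of 7 bits follows the integer setting on the machine model slide; the function names and the signed two's complement interpretation are assumptions of this sketch.

```python
def is_narrow(value: int, bits: int = 7) -> bool:
    """True if the signed value fits in `bits` bits of two's complement."""
    limit = 1 << (bits - 1)
    return -limit <= value < limit


def to_field(value: int, bits: int = 7) -> int:
    """Truncate to the `bits`-wide field kept in the map table / payload RAM."""
    return value & ((1 << bits) - 1)


def sign_extend(field: int, bits: int = 7) -> int:
    """Recover the full-width signed value from the stored field."""
    sign = 1 << (bits - 1)
    return (field ^ sign) - sign


# Every narrow value survives the round trip through the narrow field.
round_trip_ok = all(
    sign_extend(to_field(v)) == v for v in range(-64, 64)
)
```

Values outside the range, such as 64 or -65 with 7 bits, fail `is_narrow` and keep using a physical register.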

13 Modifications to Map Table
Registers freed from the retire/wb stage and the commit stage
  Tolerant of duplicate deallocations of the same physical register
  Once as narrow, again when the next writer commits
Map entries need to be writable from the rename stage and the retire/wb stage

14 Stale Pointer Problem
[Figure labels: MAP, Checkpoints, PRF copy, ROB, IssueQ]
Deallocating physical registers early makes these pointers stale
Equivalent to the garbage collection issue
Two choices
  Delay deallocation until the pointers are no longer valid (refcount)
  Update all pointers (ideal IPC)
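The first choice, delaying deallocation with a reference count, can be sketched as a deferred free. The class and method names are hypothetical, and which structures hold counted pointers is left open here.

```python
class RefCountedRegs:
    def __init__(self, num_physical):
        self.refs = [0] * num_physical            # outstanding pointers per register
        self.pending_free = [False] * num_physical
        self.free_list = []

    def add_ref(self, preg):
        """Some structure (for example an issue queue entry) now points at preg."""
        self.refs[preg] += 1

    def drop_ref(self, preg):
        """A pointer to preg was consumed or discarded."""
        self.refs[preg] -= 1
        self._maybe_free(preg)

    def request_free(self, preg):
        """The value turned out to be narrow, so preg is no longer needed."""
        self.pending_free[preg] = True
        self._maybe_free(preg)

    def _maybe_free(self, preg):
        # Release only when no pointer could still be dereferenced.
        if self.pending_free[preg] and self.refs[preg] == 0:
            self.pending_free[preg] = False
            self.free_list.append(preg)


regs = RefCountedRegs(8)
regs.add_ref(2)
regs.add_ref(2)
regs.request_free(2)                 # deferred: two pointers are still live
free_before = list(regs.free_list)
regs.drop_ref(2)
regs.drop_ref(2)                     # last pointer gone, register is released
free_after = list(regs.free_list)
```

The cost is that the register stays allocated a little longer than in the ideal scheme that rewrites every pointer.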

15 Map table checkpoints problem
Map table checkpoints need to be updated when a narrow operand is written
Lazy update
  Complex, but not cycle-time critical
Checkpoint reference counting
  Similar to Akkary et al.
  Delays deallocation, reduces IPC benefit slightly

16 Example of WAR Violation
Load p1 <= MEM[p7]
And  p2 <= p3 & p4    (narrow)
Add  p5 <= p1 + p2
Or   p2 <= p8 | p9
WAR violation on p2 between the Add and the Or
Rare, but frequent enough to affect performance
Must have an efficient solution

17 Rename Table WAW Hazards
[Figure: pipeline (Fetch, Decode, Execute, Retire, Commit) with r3 = r1 + r2 renamed to p4 = p1 + p2 and r3 = r1 & r2 renamed to p5 = p1 & p2, showing the MAP entry for r3 and the ROB destination fields; a narrow result written back to the map entry is marked "WAW!"]
WAW hazards
  Writes narrow value to a remapped map entry
  Must ensure that the map entry has not been remapped
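One way to guard against this hazard is to compare the map entry against the producer's own physical register before inlining. This is a sketch under assumptions: the tagged-tuple entry format and the function `write_back_narrow` are hypothetical, and the slides do not say how the check is implemented in hardware.

```python
def write_back_narrow(map_table, logical, producer_preg, value):
    """Inline `value` only if the entry still names the producing register.

    Entries are ("ptr", phys_reg) or ("val", narrow_value).
    Returns True if the value was inlined.
    """
    if map_table[logical] == ("ptr", producer_preg):
        map_table[logical] = ("val", value)
        return True
    # A younger writer has remapped the logical register: leave it alone.
    return False


# r3 still maps to p4 when p4's narrow result arrives: safe to inline.
table_a = {3: ("ptr", 4)}
inlined_a = write_back_narrow(table_a, logical=3, producer_preg=4, value=5)

# r3 was remapped to p5 before p4's narrow result arrived: the write is dropped.
table_b = {3: ("ptr", 5)}
inlined_b = write_back_narrow(table_b, logical=3, producer_preg=4, value=5)
```

In the second case the entry keeps pointing at p5, so the younger mapping is never overwritten by the older narrow value.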

18 Integrating PRI with Early Deallocation
Not all operands are narrow
Reduces register lifetime further
Adds unmap flags and complete flags [Moudgill et al. 1993]
[Figure: width4 comparison of baseline, PRI, and PRI+ER]

19 Machine Model
4-wide fetch, issue, commit
512-entry ROB, 256-entry LSQ
32-entry scheduler
64 physical registers
Speculative scheduling with selective recovery
Combined bimodal branch predictor
32KB IL1, 32KB DL1, 512KB L2
7 bits PRI for integer, 1 bit PRI for FP

20 Speed Up for Integer Benchmarks
PRI (checkpoint + reference counting) performs substantially better than previous work
The reference + checkpoint counting scheme performs close to the ideal case (ideal + lazy)
Combining PRI and ER increases performance further

21 PRF Occupancy for Int. Benchmarks
PRI reduces register file pressure more than the previous work (ER)
Combining PRI and ER reduces the pressure further

22 Speed Up for FP Benchmark
Ammp benchmark -> physical registers are not the performance bottleneck
Art benchmark -> a lot of narrow operands to exploit
Wupwise benchmark -> few narrow operands

23 Conclusion
PRI can lead to substantial performance improvement for both integer and FP benchmarks
Ideal update of stale pointers provides marginal benefit
Reference + checkpoint counting is the best choice

24 Future Work
Interaction of PRI with delayed register allocation (virtual physical registers) [Gonzalez et al. 1998]
Interaction of PRI with software-based techniques to deallocate dead registers
  PRI enables a binary-compatible mechanism for the compiler to communicate the fact that a register is dead to the hardware
  Compiler can simply insert a load immediate of a narrow value to any register that seems dead

25 Questions? Thank you

26 Machine Model

