Physical Register Inlining (PRI)

Mikko H. Lipasti (1), Brian Mestan (2), and Erika Gunadi (1)
(1) Department of Electrical and Computer Engineering, University of Wisconsin-Madison
(2) IBM Microelectronics, IBM Corporation, Austin, TX
http://www.ece.wisc.edu/~pharm

Demand for Large Register Files
[Figure: pipeline stages Fetch, Dcd, Rnm, Sched, Disp, RF, Exe, Retire, Commit]
- Instruction window
- Deeper pipeline
- Increasing pressure on the register file
- Lots of attention / prior work

Challenges with Scaling Register Files
- Additional pipe stages needed for access
- Increases branch misprediction penalty
- Increases scheduling misprediction penalty
- Requires additional bypass logic
- Further increases pipeline depth
- Increases the demand for more registers

Physical Register Lifetime
[Figure: physical register lifetime for the width4 and width8 configurations]
- Physical registers are managed inefficiently

Prior Work
- Register file caching [Swenson et al. 1988, Zalamea et al. 2000, Postiff et al. 2001, Cruz et al. 2000, Borch et al. 2002]
- Late allocation [Gonzalez et al. 1998, Monreal et al. 1999]
- Efficient management
- Early deallocation [Moudgill et al. 1993]
- Program semantics [Martin et al. 1997, Lo et al. 1999]
- Checkpointing [Martinez et al. 2002, Akkary et al. 2003]
- Value-based optimizations [Jourdan et al. 1998]

Early Deallocation (Moudgill et al. 1993)
- Focused on "last read to release"
- Avoids waiting for the next writer to commit
- Deallocates a register as soon as it is:
  - Complete (complete flag)
  - Unmapped (unmap flag)
  - Free of outstanding readers (reference counter)
- Still requires the next writer to enter the window
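
The three release conditions above can be sketched as a single predicate. This is a minimal illustration only; the `PhysReg` record and the `can_release_early` name are hypothetical and not taken from the slides or from Moudgill et al.

```python
from dataclasses import dataclass


@dataclass
class PhysReg:
    """Hypothetical per-register bookkeeping for early deallocation."""
    complete: bool = False  # complete flag: the producer has written its result
    unmapped: bool = False  # unmap flag: a later writer has renamed the logical register
    readers: int = 0        # reference counter: consumers that have not read the value yet


def can_release_early(reg: PhysReg) -> bool:
    # All three conditions must hold before the register returns to the free list.
    return reg.complete and reg.unmapped and reg.readers == 0
```

Note that `unmapped` only becomes true once the next writer has been renamed, which is why the scheme still depends on that writer entering the window.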

Physical Register Inlining
- Exploits narrow operands: a sizable fraction of operands can be stored in fewer than 8 bits [Canal et al. 2000]
- Often fewer bits than are needed to specify a physical register
- Stores the value instead of the pointer
- Stores narrow values in the map table
- Reduces physical register lifetime
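
A minimal sketch of the idea, assuming a map entry that holds either a physical register number or an inlined value. The names `MapEntry`, `is_narrow`, and `inline_if_narrow` are hypothetical, and the 7-bit width is borrowed from the integer setting on the Machine Model slide.

```python
from dataclasses import dataclass
from typing import List

NARROW_BITS = 7  # assumed inline width (integer setting from the Machine Model slide)


@dataclass
class MapEntry:
    inlined: bool  # True: payload is the value itself; False: payload is a physical register number
    payload: int


def is_narrow(value: int, bits: int = NARROW_BITS) -> bool:
    """True if value fits in a signed two's-complement field of `bits` bits."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def inline_if_narrow(entry: MapEntry, value: int, free_list: List[int]) -> bool:
    """At writeback, replace the pointer with the value and release the register."""
    if entry.inlined or not is_narrow(value):
        return False
    free_list.append(entry.payload)  # the physical register is freed early
    entry.inlined, entry.payload = True, value
    return True
```

With these assumptions, a result of 5 written to a register mapped at p12 leaves the map entry holding the value 5 and returns p12 to the free list, while a result of 1000 leaves the pointer in place.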

Operand Significance
- The paper also has an FP graph; FP inlining exploits the values 0.0 and 1.0 (54%)

Outline
- Motivation
- Prior Work
- Physical Register Inlining
- Experiments
- Quick Microarchitectural Review
- Modifications Needed
- PRI + early deallocation
- Experiments
- Conclusions

Microarchitectural Review: Register Rename/Map Tables
- Maps logical names to physical names
- Removes false name dependences
- Two common types: RAM and CAM
- CAM map is positional, so it is not suitable for storing values
[Figure: RAM map and CAM map organizations, relating logical register numbers to physical register numbers]

Microarchitectural Review
- Allocating and freeing physical registers
  - Allocates a physical register at decode; the map table entry is updated
  - Releases the physical register when the next writer is committed
- Checkpoint and recovery of the register map
  - Optimization to reduce branch misprediction penalty

Modifications to Data Flow
[Figure: pipeline stages Fetch, Dcd, Rnm, Queue, Sched, Disp, RF, Exe, Retire, Commit, with the map table, payload RAM, ALU, and narrow-check logic]
- Execution stage must allow both operands to be read from the payload RAM
  - Already supports one immediate operand
- Sign extension between the payload RAM and the ALU input
- Narrow-checking logic to verify whether the operands are narrow
- Narrow datapath back to the map table
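
The narrow check and the sign extension can be sketched at the bit level. This is only an illustration under assumed parameters: a 64-bit datapath and a 7-bit inline field, with `is_narrow_word` and `sign_extend` as hypothetical names. Words are modeled as unsigned 64-bit patterns.

```python
WORD_BITS = 64   # assumed datapath width
NARROW_BITS = 7  # assumed inline field width


def is_narrow_word(word: int, bits: int = NARROW_BITS) -> bool:
    """Narrow check: every bit above the low (bits - 1) bits equals the sign bit."""
    upper = word >> (bits - 1)                    # sign bit plus all higher bits
    all_ones = (1 << (WORD_BITS - bits + 1)) - 1  # pattern when those bits are all 1
    return upper == 0 or upper == all_ones


def sign_extend(field: int, bits: int = NARROW_BITS) -> int:
    """Widen a `bits`-wide payload RAM field to a WORD_BITS-wide ALU input."""
    sign = 1 << (bits - 1)
    field &= (1 << bits) - 1
    return ((field ^ sign) - sign) & ((1 << WORD_BITS) - 1)
```

In hardware the check amounts to testing whether the upper bits are all zeros or all ones, so it can run alongside writeback rather than on the critical ALU path.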

Modifications to Map Table
- Registers are freed from the retire/wb stage and from the commit stage
- Must tolerate duplicate deallocations of the same physical register
  - Once as narrow, and again when the next writer commits
- Map entries need to be writable from the rename stage and from the retire/wb stage
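
One way to make deallocation tolerant of duplicates is a per-register free bit, so that a second release of an already free register is ignored. The `FreeList` class below is a hypothetical sketch, not the mechanism described in the paper, and it deliberately ignores the case where the register is reallocated between the two releases, which a real design would have to handle separately.

```python
from typing import List, Optional


class FreeList:
    """Hypothetical free list using one bit per physical register."""

    def __init__(self, num_regs: int) -> None:
        self.is_free: List[bool] = [False] * num_regs

    def release(self, preg: int) -> bool:
        """Return True only if this call actually freed the register."""
        if self.is_free[preg]:
            return False  # duplicate deallocation: already free, nothing to do
        self.is_free[preg] = True
        return True

    def allocate(self) -> Optional[int]:
        """Hand out the lowest-numbered free register, or None if none is free."""
        for preg, free in enumerate(self.is_free):
            if free:
                self.is_free[preg] = False
                return preg
        return None
```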

Stale Pointer Problem
[Figure: structures that hold physical register pointers: MAP, Checkpoints, PRF copy, ROB, IssueQ]
- Deallocating physical registers early makes these pointers stale
- Equivalent to the garbage collection issue
- Two choices:
  - Delay deallocation until no valid pointers remain (refcount)
  - Update all pointers (ideal IPC)
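
The first choice, delaying deallocation until no pointer remains, can be sketched with a per-register reference count. The `RefCountedRegs` class and its method names are hypothetical; the slides do not specify how the counters are organized.

```python
from typing import List, Set


class RefCountedRegs:
    """Hypothetical reference-counted release of physical registers."""

    def __init__(self, num_regs: int) -> None:
        self.refs: List[int] = [0] * num_regs  # live pointers per register
        self.pending_free: Set[int] = set()    # release requested but still referenced
        self.free: List[int] = []              # registers actually available

    def add_ref(self, preg: int) -> None:
        self.refs[preg] += 1

    def drop_ref(self, preg: int) -> None:
        self.refs[preg] -= 1
        self._maybe_free(preg)

    def request_free(self, preg: int) -> None:
        """Called when the value has been inlined; the release may be deferred."""
        self.pending_free.add(preg)
        self._maybe_free(preg)

    def _maybe_free(self, preg: int) -> None:
        if preg in self.pending_free and self.refs[preg] == 0:
            self.pending_free.discard(preg)
            self.free.append(preg)
```

The second choice (updating every pointer in place) would free the register immediately, which is why the slides treat it as the ideal-IPC case.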

Map Table Checkpoints Problem
- Map table checkpoints need to be updated when a narrow operand is written
- Lazy update
  - Complex, but not cycle-time critical
- Checkpoint reference counting
  - Similar to Akkary et al.
  - Delays deallocation, reduces IPC benefit slightly

Example of WAR Violation
    Load p1 <= MEM[p7]
    And  p2 <= p3 & p4    (narrow)
    Add  p5 <= p1 + p2
    Or   p2 <= p8 & p9    (WAR violation against the Add's read of p2)
- Rare, but frequent enough to affect performance
- Must have an efficient solution

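Rename Table WAW Hazards
[Figure: two writers of r3 in flight across Fetch, Decode, Execute, Retire, Commit: r3 = r1 + r2 renamed to p4 = p1 + p2, and r3 = r1 & r2 renamed to p5 = p1 & p2; the MAP entry for r3 and the ROB destination fields step through p3, p4, p5; a narrow result triggers the WAW hazard]
- WAW hazard: a narrow value is written to a map entry that has already been remapped
- Must ensure that the map entry has not been remapped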
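
The check that the map entry has not been remapped can be sketched as a guard on the inlining write. The function name `writeback_inline`, the dictionary-based map table, and the `("ptr", n)` / `("val", v)` entry encoding are all assumptions made for illustration.

```python
from typing import Dict, Tuple

Entry = Tuple[str, int]  # ("ptr", physical register number) or ("val", inlined value)


def writeback_inline(map_table: Dict[str, Entry], logical: str,
                     preg: int, value: int, bits: int = 7) -> bool:
    """Inline `value` only if `logical` still maps to the writer's own register."""
    still_mapped = map_table.get(logical) == ("ptr", preg)
    narrow = -(1 << (bits - 1)) <= value < (1 << (bits - 1))
    if still_mapped and narrow:
        map_table[logical] = ("val", value)
        return True
    return False  # entry was remapped by a younger writer, or value is too wide
```

In the slide's scenario, r3 has already been remapped from p4 to p5 when the older writer produces its narrow result, so the guard rejects the write and the younger mapping survives.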

Integrating PRI with Early Deallocation
- Not all operands are narrow
- Reduces register lifetime further
- Adds unmap flags and complete flags [Moudgill et al. 1993]
[Figure: width4 comparison of baseline, PRI, and PRI+ER]

Machine Model
- 4-wide fetch, issue, commit
- 512-entry ROB, 256-entry LSQ
- 32-entry scheduler
- 64 physical registers
- Speculative scheduling with selective recovery
- Combined bimodal branch predictor
- 32KB IL1, 32KB DL1, 512KB L2
- 7 bits PRI for integer, 1 bit PRI for FP

Speedup for Integer Benchmarks
- PRI (checkpoint + reference counting) performs substantially better than previous work
- The reference + checkpoint counting scheme performs close to the ideal case (ideal + lazy)
- Combining PRI and ER increases performance further

PRF Occupancy for Integer Benchmarks
- PRI reduces register file pressure more than the previous work (ER)
- Combining PRI and ER reduces the pressure further

Speedup for FP Benchmarks
- Ammp: physical registers are not the performance bottleneck
- Art: many narrow operands to exploit
- Wupwise: few narrow operands

Conclusion
- PRI can lead to substantial performance improvement for both integer and FP benchmarks
- Ideal update of stale pointers provides only marginal benefit
- Reference + checkpoint counting is the best choice

Future Work
- Interaction of PRI with delayed register allocation (virtual physical registers) [Gonzalez et al. 1998]
- Interaction of PRI with software-based techniques to deallocate dead registers
  - PRI enables a binary-compatible mechanism for the compiler to tell the hardware that a register is dead
  - The compiler can simply insert a load immediate of a narrow value into any register that seems dead

Questions? Thank you
