PATMOS’02
Energy-Efficient Design of the Reorder Buffer*
Dmitry Ponomarev, Gurhan Kucuk, Kanad Ghose
Department of Computer Science, State University of New York, Binghamton, NY
Workshop on Power and Timing Modeling, Optimization and Simulation, Sevilla, Spain, September 12, 2002
*Supported in part by DARPA through the PAC-C program and NSF
Presentation Outline
- ROB complexities and sources of power dissipation
- Low-power ROB design:
  - Dynamic ROB resizing
  - Use of energy-efficient comparators
  - Use of zero-byte encoding
- Results
- Concluding remarks
What This Work Is All About
- In some of today’s processors, physical registers are implemented as Reorder Buffer (ROB) slots
  - Example: Pentium III
- Consequence: the ROB is a complex, multi-ported structure that dissipates a non-trivial fraction of the total chip power
- Main goal of this work: reduce the power dissipation of the ROB without sacrificing performance
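Superscalar Datapath
[Figure: superscalar datapath with Fetch (F1, F2), Decode/Dispatch (D1, D2), instruction dispatch into the issue queue (IQ), instruction issue to function units FU1, FU2, ..., FUm (EX), the load/store queue (LSQ) and D-cache, the ROB, the Architectural Register File (ARF), and the result/status forwarding buses.]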
ROB Structures and Complexities
- The Reorder Buffer (ROB) is used for:
  - Supporting precise interrupts
  - Maintaining speculative register values
- A large number of read and write ports is required. For a W-way CPU:
  - W write ports to set up entries
  - W read ports for instruction commitment
  - 2W read ports for reading the source operands
  - W write ports for writing the results
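The port requirements listed above add up to 5W ports for a W-way CPU. The following minimal sketch just tallies them; the function name `rob_port_counts` and the dictionary keys are hypothetical labels chosen for illustration.

```python
def rob_port_counts(w: int) -> dict:
    """Tally the ROB ports needed by a W-way CPU, following the slide's list."""
    if w < 1:
        raise ValueError("issue width W must be at least 1")
    ports = {
        "write_setup": w,       # W write ports to set up entries at dispatch
        "read_commit": w,       # W read ports for instruction commitment
        "read_sources": 2 * w,  # 2W read ports for the source operands
        "write_results": w,     # W write ports for writing the results
    }
    # The total is computed before the "total" key is added, so it is 5W.
    ports["total"] = sum(ports.values())
    return ports


if __name__ == "__main__":
    print(rob_port_counts(4))
```

For a 4-way machine this gives 20 ports in total, 8 of which are source-operand read ports.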
Sources of ROB Power Dissipation
- Establishment of ROB entries for dispatched instructions
- Readout of the valid sources from the ROB, including the associative search
- Writing the results into the ROB slots
- Instruction commitment
- Clearing the ROB on mispredictions (a small contribution)
Sources of ROB Power Dissipation
[Figure: chart of the ROB power breakdown, with shares of 21.5%, 34.5%, 35.7% and 8.1%]
ROBs in Modern CPUs: Summary
- 80 entries or more in current implementations
- 5W ports for a W-way CPU
- A large fraction of the total chip power is dissipated within the ROB (27% according to Folegnani and Gonzalez, ISCA’01)
- It is therefore important to explore mechanisms for minimizing ROB power
What Do We Propose?
Three relatively independent techniques to reduce the power dissipation within the ROB:
- Dynamic ROB resizing
- Use of energy-efficient comparators
- Use of zero-byte encoding
ROB Usage in Superscalar Datapath: Example (fpppp)
- Main idea: where the ROB is underutilized, parts of it can be turned off to save power
Incremental ROB Allocation/Deallocation
- The ROB is implemented as a set of independent partitions
- Each partition is a register file, complete with its own sensing and precharge/write logic, multiple ports and through buses
- All partitions have associative addressing logic
Partitioned ROB Organization
[Figure: three ROB partitions (Partition 1, Partition 2, Partition 3), each with a precharger array, input/output drivers, a bypass switch array, an associative part and a non-associative part. The legend marks bitlines or address lines within a partition, through lines, and bypass switches.]
Sampling and Downsizing Strategies
- Downsizing decisions are taken at the end of each update period
- Update periods have a fixed duration of UP cycles
- Within an update period, multiple samples of the ROB occupancy are taken at regular intervals of SP cycles
[Figure: timeline of an update period of UP cycles divided into sample periods of SP cycles]
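The sampling scheme above can be sketched as follows. The slide only states when samples are taken and when the decision is made. The decision rule used here (release every partition that the average sampled occupancy leaves entirely unused) is an assumption for illustration, and the names `sample_occupancy` and `partitions_after_downsizing` are hypothetical.

```python
from statistics import mean


def sample_occupancy(trace, sp):
    """Pick one occupancy sample every SP cycles from a per-cycle trace."""
    return trace[sp - 1::sp]


def partitions_after_downsizing(samples, active_partitions, partition_size):
    """Return the number of active partitions for the next update period.

    Assumed rule (not stated on the slide): compare the average sampled
    occupancy with the allocated size and release each partition that the
    average leaves completely unused.
    """
    if not samples:
        return active_partitions
    avg_occupancy = mean(samples)
    allocated = active_partitions * partition_size
    unused_partitions = int((allocated - avg_occupancy) // partition_size)
    # Never shrink below one partition, and never grow in this step.
    return max(1, active_partitions - max(0, unused_partitions))


if __name__ == "__main__":
    # One update period with UP=16 and SP=4 gives UP/SP = 4 samples.
    trace = [10, 10, 11, 10, 12, 12, 12, 12, 9, 9, 9, 9, 13, 13, 13, 13]
    samples = sample_occupancy(trace, sp=4)
    print(samples, partitions_after_downsizing(samples, 4, 8))
```

With four 8-entry partitions and an average occupancy of 11, this sketch keeps two partitions (16 entries) active.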
A Resizing Example (SP=4, UP=16)
[Figure, animated over several slides: actual occupancy versus allocated entries over one update period, with sample periods SP marked; the final frame labels samples 1 to 4 and their average (Avg.).]
Upsizing Strategy
- Count the number of cycles in which dispatch blocks because the ROB is full
- If the counter exceeds OT (Overflow Threshold), add one partition
  - Upsizing is more aggressive than downsizing, which reduces the performance hit
- Reset the overflow counter to 0 at the beginning of each new UP (Update Period)
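A minimal behavioral sketch of the upsizing rule follows. The class name `RobResizer` and its fields are hypothetical, and clearing the counter right after a partition is added is an assumption; the slide only states that the counter is reset at the start of each update period.

```python
class RobResizer:
    """Cycle-level model of the overflow-driven upsizing rule."""

    def __init__(self, total_partitions, active_partitions, ot, up):
        self.total = total_partitions    # partitions physically present
        self.active = active_partitions  # partitions currently turned on
        self.ot = ot                     # Overflow Threshold
        self.up = up                     # Update Period, in cycles
        self.cycle = 0
        self.overflow_count = 0

    def tick(self, dispatch_blocked_rob_full):
        """Advance one cycle and return the number of active partitions."""
        self.cycle += 1
        if dispatch_blocked_rob_full:
            self.overflow_count += 1
        # Upsize as soon as the counter exceeds OT, without waiting for
        # the end of the update period.
        if self.overflow_count > self.ot and self.active < self.total:
            self.active += 1
            self.overflow_count = 0  # assumption: restart counting after upsizing
        # A new update period starts: reset the overflow counter.
        if self.cycle % self.up == 0:
            self.overflow_count = 0
        return self.active


if __name__ == "__main__":
    resizer = RobResizer(total_partitions=4, active_partitions=2, ot=4, up=16)
    print([resizer.tick(True) for _ in range(5)])
```

With OT=4, the fifth blocked cycle within an update period pushes the counter past the threshold and turns on one more partition.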
A Resizing Example (SP=4, UP=16, OT=4)
[Figure, animated over several slides: actual occupancy versus allocated entries, with a count stepping from 1 to 4 alongside the OT label.]
Summary of the Control Strategy
- Only three parameters are used for control:
  - OT (Overflow Threshold)
  - UP (Update Period)
  - SP (Sample Period)
- Less than 1% power overhead for the control logic
- Advantages:
  - A desired power/performance tradeoff can be achieved easily by adjusting OT and UP
  - Cycle-by-cycle monitoring is avoided; monitoring is done once every SP cycles
The Use of Energy-Efficient Comparators
- Traditional pull-down comparators dissipate energy (through discharging the output node) on a mismatch in any bit position
- If mismatches are much more frequent than matches, this is energy-inefficient
- We proposed a number of dissipate-on-match comparator designs (Kucuk et al., ISLPED’01 and Ergin et al., ICCD’02)
The Use of Energy-Efficient Comparators
- If associative addressing is used within the ROB, the architectural register ids are compared
Number of bits matching (% of total cases):
                   2 LSBs   4 LSBs   All 6 bits
Avg. SPECint 95    23%      14%      12%
Avg. SPECfp 95     26%      16%      11%
Avg. all SPEC 95   25%      15%      11.5%
The Use of Energy-Efficient Comparators
- As the distribution shows, mismatches occur more frequently than matches when architectural register addresses are compared within the ROB
- To exploit this, we used the design of Kucuk et al., ISLPED’01:
  - Two-stage domino logic
  - The first stage compares the 4 LSBs; unless they match (only about 15% of the cases on average, per the distribution), no dissipation occurs
  - Significant energy reduction results
- Other designs can be used to speed things up by avoiding domino-style logic (Ergin et al., ICCD’02)
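The difference between the two comparator styles can be illustrated by counting which comparisons cause dissipation at all. This is a behavioral sketch only: it counts dissipation events, not energy, and the function names are hypothetical.

```python
def traditional_dissipates(a, b):
    """Pull-down comparator: discharges on a mismatch in any bit position."""
    return a != b


def two_stage_dissipates(a, b, lsb_bits=4):
    """Two-stage style: nothing is dissipated unless the first-stage LSBs match."""
    mask = (1 << lsb_bits) - 1
    return (a & mask) == (b & mask)


def dissipation_counts(pairs):
    """Return (traditional, two_stage) counts of dissipating comparisons."""
    pairs = list(pairs)
    traditional = sum(traditional_dissipates(a, b) for a, b in pairs)
    two_stage = sum(two_stage_dissipates(a, b) for a, b in pairs)
    return traditional, two_stage


if __name__ == "__main__":
    # 6-bit register ids: a full match, a match in the 4 LSBs only,
    # and two mismatches within the 4 LSBs.
    pairs = [(5, 5), (5, 21), (5, 6), (5, 40)]
    print(dissipation_counts(pairs))
```

In this small example the traditional comparator dissipates on three of the four comparisons, while the two-stage design dissipates only on the two whose 4 LSBs match.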
The Use of Zero-Byte Encoding
- A large percentage of the bytes travelling on the result, commit and dispatch buses contain all zeroes
- This can be exploited by not writing such bytes into the ROB and not reading them from the ROB
- A separate bit (Zero Indicator Bit, ZIB) is used to distinguish such bytes
- If a byte contains all zeroes, only the ZIB is read and written instead of 8 bits
- The circuits are similar to those presented in Ghose et al. (Koolchips, 2000) and Zhang et al. (MICRO’00)
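A software sketch of zero-byte encoding is given below. It models only the bookkeeping, not the circuits. The function names are hypothetical, and the accounting in `bits_accessed` assumes that a nonzero byte costs its ZIB plus 8 data bits, which the slide does not spell out.

```python
def zib_encode(value, width_bytes=4):
    """Split a non-negative value into per-byte ZIBs and the nonzero bytes."""
    data = value.to_bytes(width_bytes, "little")
    zibs = [byte == 0 for byte in data]            # True marks an all-zero byte
    stored = [byte for byte in data if byte != 0]  # only these bytes are written
    return zibs, stored


def zib_decode(zibs, stored):
    """Rebuild the value: zero bytes come from the ZIBs, the rest from storage."""
    remaining = iter(stored)
    data = bytes(0 if is_zero else next(remaining) for is_zero in zibs)
    return int.from_bytes(data, "little")


def bits_accessed(zibs):
    """Bits read or written: 1 for a zero byte, 1 + 8 for a nonzero byte (assumed)."""
    return sum(1 if is_zero else 9 for is_zero in zibs)


if __name__ == "__main__":
    zibs, stored = zib_encode(300)  # 0x0000012C: two nonzero bytes, two zero bytes
    print(zibs, stored, zib_decode(zibs, stored), bits_accessed(zibs))
```

For the 32-bit value 300, only two data bytes are stored and 20 bits are accessed instead of 32 under this accounting.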
Percentage of Bytes Containing All 0’s: Results
- On average across all SPEC 95 benchmarks:
  - In sources being read from the ROB: 43%
  - In the result values written into the ROB and committed from the ROB: 41.5%
[Figure: chart of the percentage of bytes containing all 0’s]
Experimental Setup (AccuPower, DATE’02)
[Figure: tool flow. Compiled SPEC benchmarks and datapath specs feed the microarchitectural simulator, which produces performance stats as well as transition counts and context information. VLSI layout data yields a SPICE deck, and SPICE produces measures of energy per transition. The energy/power estimator combines these to produce power/energy stats.]
Summary of the Results (SPEC 95 averages)
- Dynamic ROB resizing (UP=2048 cycles, SP=32 cycles):
  - IPC drop of 0.06% for OT=128
  - IPC drop of 3.14% for OT=2048
  - Power savings range from 56% to 63%
Summary of the Results (SPEC 95 averages)
- Comparators: 41% comparator power savings and 13% overall ROB power savings
- Zero-byte encoding: 17% power savings
- Three techniques combined: 70-76% power savings with negligible impact on performance
Concluding Remarks
- Significant power reduction within the ROB can be realized by:
  - Dynamic ROB resizing
  - Use of dissipate-on-match comparators
  - Use of zero-byte encoding
- Combined power savings are in the range of 70-76%, with very small impact on performance
- Finally, all three techniques increase the ROB complexity. Can the ROB complexity be reduced?
Concluding Remarks
- Yes! With a small IPC drop, we can completely eliminate the ROB read ports needed for reading the source operand values
  - These account for 2W of the 5W ROB ports
  - Details are in Kucuk, Ponomarev and Ghose, ICS’02
- Combined, the techniques presented here and the solution of ICS’02 can make the case for reconsidering the architecture that integrates the physical register file and the ROB as a choice for implementing future high-performance microprocessors