PATMOS’02 Energy-Efficient Design of the Reorder Buffer* *supported in part by DARPA through the PAC-C program and NSF Dmitry Ponomarev, Gurhan Kucuk,

Energy-Efficient Design of the Reorder Buffer*
Dmitry Ponomarev, Gurhan Kucuk, Kanad Ghose
Department of Computer Science, State University of New York, Binghamton, NY
PATMOS’02: Workshop on Power and Timing Modeling, Optimization and Simulation, Sevilla, Spain, September 12, 2002
*Supported in part by DARPA through the PAC-C program and NSF

Presentation Outline
- ROB complexities and sources of power dissipation
- Low-power ROB design:
  - Dynamic ROB resizing
  - Use of energy-efficient comparators
  - Use of zero-byte encoding
- Results
- Concluding remarks

What This Work is All About
- In some of today’s processors, physical registers are implemented as Reorder Buffer (ROB) slots. Example: Pentium III.
- Consequence: the ROB is a complex, multi-ported structure that dissipates a non-trivial fraction of the total chip power.
- Main goal of this work: reduce the power dissipation of the ROB without sacrificing performance.

Superscalar Datapath
[Figure: datapath block diagram. Labeled blocks: Fetch (F1, F2), Decode/Dispatch (D1, D2), IQ, function units (FU1, FU2, ..., FUm), EX, LSQ, D-cache, ROB, ARF (Architectural Register File). Labeled paths: instruction dispatch, instruction issue, result/status forwarding buses.]

ROB Structures and Complexities
- The Reorder Buffer (ROB) is used for:
  - Supporting precise interrupts
  - Maintaining speculative register values
- A large number of read and write ports is required. For a W-way CPU:
  - W write ports to set up entries
  - W read ports for instruction commitment
  - 2W read ports for reading the source operands
  - W write ports for writing the results
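The port breakdown above adds up to 5W ports. As a small illustrative sketch (the function name `rob_ports` and the dictionary keys are hypothetical, chosen only for this example), the count for a given issue width can be tallied as follows:

```python
def rob_ports(width):
    """Tally ROB ports for a W-way CPU using the per-purpose counts from the slide."""
    ports = {
        "write_setup": width,        # W write ports to set up entries at dispatch
        "read_commit": width,        # W read ports for instruction commitment
        "read_sources": 2 * width,   # 2W read ports for source operands
        "write_results": width,      # W write ports for writing results
    }
    # The total is computed before the "total" key is added, so it is not double counted.
    ports["total"] = sum(ports.values())
    return ports


if __name__ == "__main__":
    print(rob_ports(4))
```

For a 4-way machine this gives 12 read ports and 8 write ports, 20 in total.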

Sources of ROB Power Dissipation
- Establishment of ROB entries for dispatched instructions
- Readout of the valid sources from the ROB, including the associative search
- Writing the results into the ROB slots
- Instruction commitment
- Clearing the ROB on mispredictions (this is small)

[Figure: Sources of ROB Power Dissipation. Chart segment values: 21.5%, 34.5%, 35.7%, 8.1% (segment labels are not preserved in the transcript).]

ROBs in Modern CPUs: Summary
- 80 entries or more in current implementations
- 5W ports for a W-way CPU
- A large fraction of the total chip power is dissipated within the ROB (27% according to Folegnani and Gonzalez, ISCA’01)
- It is therefore important to explore mechanisms for minimizing ROB power

What Do We Propose?
Three relatively independent techniques to reduce the power dissipation within the ROB:
- Dynamic ROB resizing
- Use of energy-efficient comparators
- Use of zero-byte encoding

ROB Usage in Superscalar Datapath: Example (fpppp)
Main idea: when the ROB is underutilized, parts of it can be turned off to save power.

Incremental ROB Allocation/Deallocation
- The ROB is implemented as a set of independent partitions
- Each partition is a register file, complete with its own sensing and precharge/write logic, multiple ports and through buses
- All partitions have associative addressing logic

Partitioned ROB Organization
[Figure: three partitions (Partition 1, Partition 2, Partition 3), each with a precharger array, input/output drivers, a bypass switch array, an associative part and a non-associative part. Labeled elements: bitlines or address lines within a partition, through lines, bypass switches.]

Sampling and Downsizing Strategies
- Downsizing decisions are taken at the end of an update period
- Update periods have a fixed duration of UP cycles
- Within an update period, multiple samples of the occupancies are taken at regular intervals of SP cycles
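The sampling scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the slides state when samples and decisions are taken, but not the exact downsizing test, so the rule used here (release one partition when the allocated size exceeds the average sampled occupancy by at least one partition) is an assumption, as are the class name `DownsizeMonitor` and the partition size used in the example.

```python
class DownsizeMonitor:
    """Samples ROB occupancy every SP cycles and decides on downsizing every UP cycles."""

    def __init__(self, partition_size, allocated, sp=4, up=16):
        self.partition_size = partition_size  # entries per partition (assumed value)
        self.allocated = allocated            # currently allocated ROB entries
        self.sp = sp                          # sample period, in cycles
        self.up = up                          # update period, in cycles
        self.samples = []

    def tick(self, cycle, occupancy):
        """Call once per cycle; cycles are counted from 1."""
        if cycle % self.sp == 0:
            self.samples.append(occupancy)
        if cycle % self.up == 0:
            self._end_of_update_period()

    def _end_of_update_period(self):
        if self.samples:
            average = sum(self.samples) / len(self.samples)
            slack = self.allocated - average
            # Assumed rule: give back one partition if at least a full
            # partition was unused on average, but never drop below one partition.
            if slack >= self.partition_size and self.allocated > self.partition_size:
                self.allocated -= self.partition_size
        self.samples = []


if __name__ == "__main__":
    monitor = DownsizeMonitor(partition_size=8, allocated=32)
    for cycle in range(1, 17):
        monitor.tick(cycle, occupancy=10)
    print(monitor.allocated)
```

Sampling once every SP cycles rather than every cycle keeps the monitoring logic cheap, which matches the motivation given later in the control-strategy summary.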

[Figure: A Resizing Example (SP=4, UP=16). Animated plot of actual occupancy against allocated entries, with four occupancy samples (1 to 4) taken SP cycles apart within one update period and their average (Avg.).]

Upsizing Strategy
- Count the number of cycles in which dispatch blocks because the ROB is full
- If the counter exceeds OT (Overflow Threshold), add one partition
- Upsizing is more aggressive than downsizing, which reduces the hit on performance
- Reset the overflow counter to 0 at the beginning of a new UP (Update Period)
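The upsizing rule can be sketched in the same style. Again this is only an illustration under stated assumptions: the class name `UpsizeMonitor`, the `max_entries` cap, and the choice to restart the overflow count right after a partition is added are not taken from the slides.

```python
class UpsizeMonitor:
    """Counts ROB-full dispatch stalls and adds a partition when the count exceeds OT."""

    def __init__(self, partition_size, allocated, max_entries, ot=4, up=16):
        self.partition_size = partition_size  # entries per partition (assumed value)
        self.allocated = allocated            # currently allocated ROB entries
        self.max_entries = max_entries        # physical ROB size (assumed cap)
        self.ot = ot                          # overflow threshold, in cycles
        self.up = up                          # update period, in cycles
        self.overflow = 0

    def tick(self, cycle, dispatch_blocked_rob_full):
        """Call once per cycle; cycles are counted from 1."""
        if dispatch_blocked_rob_full:
            self.overflow += 1
            if self.overflow > self.ot and self.allocated < self.max_entries:
                # Upsize immediately instead of waiting for the period to end.
                self.allocated = min(self.allocated + self.partition_size,
                                     self.max_entries)
                self.overflow = 0  # assumption: restart the count after upsizing
        if cycle % self.up == 0:
            # A new update period starts on the next cycle: reset the counter.
            self.overflow = 0


if __name__ == "__main__":
    monitor = UpsizeMonitor(partition_size=8, allocated=16, max_entries=32)
    for cycle in range(1, 17):
        monitor.tick(cycle, dispatch_blocked_rob_full=(cycle <= 5))
    print(monitor.allocated)
```

Because the check runs on every blocked cycle, a partition can be added in the middle of an update period, whereas downsizing waits for the period to end. That asymmetry is what makes upsizing the more aggressive of the two.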

[Figure: A Resizing Example (SP=4, UP=16, OT=4). Animated plot of actual occupancy against allocated entries, with the overflow counter stepping from 1 to 4 until it reaches OT within an update period.]

Summary of the Control Strategy
- Only three parameters are used for control:
  - OT (Overflow Threshold)
  - UP (Update Period)
  - SP (Sample Period)
- Less than 1% power overhead for the control logic
- Advantages:
  - A desired power/performance tradeoff can easily be achieved by adjusting OT and UP
  - Monitoring on a cycle-by-cycle basis is avoided: it is done once every SP cycles

The Use of Energy-Efficient Comparators
- Traditional pull-down comparators dissipate energy (through discharging the output node) on a mismatch in any bit position. If mismatches are much more frequent than matches, this is energy-inefficient.
- We proposed a number of dissipate-on-match comparator designs (Kucuk et al., ISLPED’01 and Ergin et al., ICCD’02).

The Use of Energy-Efficient Comparators
If associative addressing is used within the ROB, the architectural register ids are compared.

Number of bits matching (% of total cases):

                    2 LSBs   4 LSBs   All 6 bits
Avg. SPECint 95     23%      14%      12%
Avg. SPECfp 95      26%      16%      11%
Avg. all SPEC 95    25%      15%      11.5%

The Use of Energy-Efficient Comparators
- As seen from the distribution, mismatches occur more frequently than matches when the architectural register addresses are compared within the ROB
- To exploit this, we used the design of Kucuk et al., ISLPED’01:
  - Two-stage domino logic
  - The first stage compares the 4 LSBs. Unless they match (only 12% of the cases), no dissipation occurs
  - Significant energy reduction results!
- Other designs can be used to speed things up by avoiding domino-style logic (Ergin et al., ICCD’02)
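The behavior of the two-stage scheme can be illustrated with a purely functional model. The sketch below is not a circuit description and does not estimate energy; it only counts how many comparisons would trigger dissipation under each policy. The function names and the sample operand pairs are hypothetical.

```python
LSB_MASK = 0b1111  # the first stage looks only at the 4 least significant bits


def two_stage_compare(a, b):
    """Return (match, first_stage_matched) for two 6-bit register ids."""
    first_stage_matched = (a & LSB_MASK) == (b & LSB_MASK)
    if not first_stage_matched:
        # The second stage is never evaluated, so nothing dissipates.
        return False, False
    return a == b, True


def count_dissipating(pairs):
    """Count comparisons that dissipate under each policy.

    Traditional pull-down comparator: dissipates on any mismatch.
    Two-stage design: modeled as dissipating only when the 4 LSBs match.
    """
    traditional = sum(1 for a, b in pairs if a != b)
    two_stage = sum(1 for a, b in pairs if two_stage_compare(a, b)[1])
    return traditional, two_stage


if __name__ == "__main__":
    pairs = [(0b000101, 0b000101), (0b010101, 0b000101),
             (0b000101, 0b000110), (0b000011, 0b111100)]
    print(count_dissipating(pairs))
```

When most comparisons mismatch in the low-order bits, the two-stage count stays small while the traditional count grows with every mismatch.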

The Use of Zero-Byte Encoding
- A large percentage of bytes travelling on the result, commit and dispatch buses contain all zeroes
- This can be exploited by not writing such bytes into the ROB and not reading them from the ROB
- A separate bit (Zero Indicator Bit, ZIB) is used to distinguish such bytes
- If a byte contains all zeroes, only the ZIB is read and written instead of 8 bits
- Circuits are similar to those presented in Ghose et al. (Koolchips, 2000) and Zhang et al. (MICRO’00)
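A software analogue of zero-byte encoding is sketched below. It is an illustration only: the 32-bit value width, the little-endian byte order, and the function names are assumptions, and the bit count is a simple proxy for storage accesses rather than a power estimate.

```python
def zib_encode(value, width_bytes=4):
    """Split a value into per-byte ZIBs and the list of non-zero bytes.

    zib[i] is 1 when byte i is all zeroes; such bytes are not stored.
    """
    raw = value.to_bytes(width_bytes, "little")
    zib = [1 if byte == 0 else 0 for byte in raw]
    stored = [byte for byte in raw if byte != 0]
    return zib, stored


def zib_decode(zib, stored):
    """Rebuild the original value from the ZIBs and the stored bytes."""
    remaining = iter(stored)
    raw = bytes(0 if is_zero else next(remaining) for is_zero in zib)
    return int.from_bytes(raw, "little")


def bits_accessed(zib):
    """One ZIB per byte is always accessed, plus 8 data bits per non-zero byte."""
    return sum(1 if is_zero else 9 for is_zero in zib)


if __name__ == "__main__":
    zib, stored = zib_encode(0x00000123)
    print(zib, stored, zib_decode(zib, stored), bits_accessed(zib))
```

For the value 0x00000123 the two upper bytes are zero, so 20 bits are touched instead of 32. A value with no zero bytes costs 36 bits, which is why the scheme pays off only when zero bytes are common.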

Percentage of Bytes Containing All 0’s: Results
On average across all SPEC 95 benchmarks:
- In sources being read from the ROB: 43%
- In the result values written into the ROB and committed from the ROB: 41.5%

[Figure: Percentage of Bytes Containing All 0’s: Results.]

Experimental Setup (AccuPower, DATE’02)
[Figure: tool flow. Compiled SPEC benchmarks and datapath specs feed the microarchitectural simulator, which produces performance stats plus transition counts and context information. VLSI layout data yields a SPICE deck; SPICE produces measures of energy per transition. The energy/power estimator combines both to give power/energy stats.]

Summary of the Results (SPEC 95 averages)
Dynamic ROB resizing (UP=2048 cycles, SP=32 cycles):
- IPC drop of 0.06% for OT=128
- IPC drop of 3.14% for OT=2048
- Power savings range from 56% to 63%

Summary of the Results (SPEC 95 averages)
- Comparators: 41% comparator power savings and 13% overall ROB power savings
- Zero-byte encoding: 17% power savings
- Three techniques combined: 70-76% power savings with negligible impact on performance

Concluding Remarks
- Significant power reduction within the ROB can be realized by:
  - Dynamic ROB resizing
  - Use of dissipate-on-match comparators
  - Use of zero-byte encoding
- Combined power savings are in the range of 70-76%, with very small impact on performance
- Finally, all three techniques increase the ROB complexity. Can the ROB complexity be reduced?

Concluding Remarks
- Yes! With a small IPC drop, we can totally eliminate the ROB read ports needed for reading the source operand values. These account for 2W of the 5W ROB ports.
- Details are in Kucuk, Ponomarev and Ghose, ICS’02.
- Combined, the techniques presented here and the solution of ICS’02 can make the case for reconsidering the architecture that integrates the physical register file and the ROB as a choice for implementing future high-performance microprocessors.