Low-Complexity Reorder Buffer Architecture (ICS’02). Gurhan Kucuk, Dmitry Ponomarev, Kanad Ghose. Supported in part by DARPA through the PAC-C program and NSF.

Presentation transcript:

ICS’02 1 Low-Complexity Reorder Buffer Architecture (supported in part by DARPA through the PAC-C program and NSF). Gurhan Kucuk, Dmitry Ponomarev, Kanad Ghose, Department of Computer Science, State University of New York, Binghamton, NY. Annual ACM International Conference on Supercomputing (ICS’02), June 24th, 2002.

ICS’02 2 Outline: ROB complexities; motivation for the low-complexity ROB; low-complexity ROB design; results; concluding remarks.

ICS’02 3 What This Work is All About: Complex, richly-ported ROBs are common in modern superscalar datapaths. The number of ports grows further when results are held within ROB slots (example: Pentium III). Reducing ROB complexity is important for reducing power and improving performance: the ROB dissipates a non-trivial fraction of the total chip power, and ROB accesses stretch over several cycles. Goal of this work: reduce the complexity and power dissipation of the ROB without sacrificing performance.

ICS’02 4 Pentium III-like Superscalar Datapath [Figure: datapath with Fetch (F1, F2), Decode/Dispatch (D1, D2), instruction dispatch into the IQ, instruction issue to function units FU1 ... FUm (EX), D-cache, LSQ, ROB, Architectural Register File (ARF), and result/status forwarding buses]

ICS’02 5 ROB Port Requirements for a W-way CPU. Writeback: W write ports to write results. Dispatch/Issue: 2W read ports to read the source operands. Decode/Dispatch: W write ports to set up entries. Commit: W read ports for instruction commitment.

ICS’02 6 ROB Port Requirements for a W-way CPU. Writeback: W write ports to write results. Dispatch/Issue: 2W read ports to read the source operands. Decode/Dispatch: 1 W-wide write port to set up entries. Commit: 1 W-wide read port for instruction commitment.
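The port breakdown on the two slides above can be tallied with a short sketch. The function name `rob_ports` and the flag `wide_setup_commit` are illustrative names, not from the presentation; the sketch simply adds up the per-stage counts the slides list and treats one W-wide port as a single port.

```python
def rob_ports(width: int, wide_setup_commit: bool = False) -> dict:
    """Tally ROB ports for a W-way machine, following the slides' breakdown."""
    ports = {
        "writeback_write": width,      # W write ports to write results
        "source_read": 2 * width,      # 2W read ports for source operands
        # Entry setup and commit use either W ports or one W-wide port each.
        "setup_write": 1 if wide_setup_commit else width,
        "commit_read": 1 if wide_setup_commit else width,
    }
    ports["total"] = sum(ports.values())
    return ports


if __name__ == "__main__":
    print(rob_ports(4))                          # slide 5 organization
    print(rob_ports(4, wide_setup_commit=True))  # slide 6 organization
```

For W = 4 the source-operand read ports account for 8 of the ports in either organization, which is why they are the target of the rest of the talk.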

ICS’02 7 Where are the Source Values Coming From? [Figure: datapath with Fetch (F1, F2), Decode/Dispatch (D1, D2), IQ, function units FU1 ... FUm (EX), D-cache, LSQ, ROB, Architectural Register File (ARF), and result/status forwarding buses; three source-operand paths are labeled 1, 2, 3]

ICS’02 8 Where are the Source Values Coming From? [Figure: breakdown of source-operand origins for a 96-entry ROB, 4-way processor, SPEC2K benchmarks; the three shares shown are 62%, 32%, and 6%]

ICS’02 9 How Efficiently are the Ports Used? Writeback: W write ports to write results. Dispatch/Issue: 2W read ports to read the source operands. Decode/Dispatch: W write ports to set up entries. Commit: W read ports for instruction commitment. [Annotation on the slide: 6%]

ICS’02 10 Approaches to Reducing ROB Complexity: (1) Reduce the number of read ports for reading out the source operand values. (2) More radical (and better): completely eliminate the read ports for reading source operand values!

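ICS’02 11 Reducing the Number of Read Ports [Figure: performance drop (%) per benchmark. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg. Average IPC drops shown: 3.5% and 1.0%]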

ICS’02 12 Problems with Retaining Fewer Source Read Ports on the ROB: Arbitration is needed for the small number of ports. Additional logic is needed to block the instructions that could not get a port. A switching network is needed to route the operands to the correct destinations. The multi-cycle access still remains in the critical path of the Dispatch/Issue logic.

ICS’02 13-15 Our Solution: Elimination of Read Ports [Figure, three animation steps: datapath with Fetch (F1, F2), Decode/Dispatch (D1, D2), IQ, function units FU1 ... FUm (EX), D-cache, LSQ, ROB, Architectural Register File (ARF), and result/status forwarding buses. The first two steps show source-operand paths labeled 1, 2, 3; in the last step only the labels 1 and 3 remain]

ICS’02 16 Comparison of ROB Bitcells (0.18µ, TSMC) [Figure: layout of a 32-ported SRAM bitcell next to the layout of a 16-ported SRAM bitcell]. Area reduction: 71%. Shorter bitlines and wordlines.

ICS’02 17 Our Solution: Elimination of Read Ports [Figure: datapath with Fetch (F1, F2), Decode/Dispatch (D1, D2), IQ, function units FU1 ... FUm (EX), D-cache, LSQ, ROB, Architectural Register File (ARF), and result/status forwarding buses]. Area reduction: 45%.

ICS’02 18 Eliminating/Reducing the Number of Read Ports: Effects on Power Dissipation. Power is reduced because of: shorter bitlines and wordlines; lower capacitive loading; fewer decoders; fewer drivers and sense amps.

ICS’02 19 Completely Eliminating the Source Read Ports on the ROB. The problem: the issue of instructions that require a value stored in the ROB will stall. Solution: forward the value to the waiting instruction at the time the value is committed: LATE FORWARDING.

ICS’02 20-21 Late Forwarding: Use the Normal Forwarding Buses! [Figure, two animation steps: datapath with Fetch (F1, F2), Decode/Dispatch (D1, D2), IQ, function units FU1 ... FUm (EX), D-cache, LSQ, ROB, and Architectural Register File (ARF), with the result/status forwarding buses highlighted]

ICS’02 22 Optimizing Late Forwarding. PROBLEM: If Late Forwarding is done for every result that is committed, additional forwarding buses are needed to avoid degrading performance. SOLUTION: Selective Late Forwarding (SLF). SLF requires an additional bit in the ROB. That bit is set by the dispatched instructions that require Late Forwarding. No additional forwarding buses are needed, since SLF traffic is very small.
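A minimal sketch of the commit-side behavior described above, under stated assumptions: the names `RobEntry`, `slf`, `commit`, and the list standing in for the forwarding bus are all hypothetical, and the bus is modeled as a simple log of (ROB index, value) broadcasts rather than real hardware.

```python
from dataclasses import dataclass


@dataclass
class RobEntry:
    dest_arch_reg: int
    value: int
    slf: bool = False  # set at dispatch by a consumer that needs late forwarding


def commit(entry: RobEntry, rob_index: int, arf: dict, forwarding_bus: list) -> None:
    """Commit one result; broadcast it only if its SLF bit was set."""
    arf[entry.dest_arch_reg] = entry.value
    if entry.slf:
        # Tag the broadcast with the ROB index so waiting instructions can match it.
        forwarding_bus.append((rob_index, entry.value))


if __name__ == "__main__":
    arf, bus = {}, []
    commit(RobEntry(dest_arch_reg=2, value=7, slf=True), 12, arf, bus)
    commit(RobEntry(dest_arch_reg=3, value=43), 13, arf, bus)
    print(arf, bus)
```

Only the first commit places anything on the bus, which mirrors the slide's point that selective late forwarding adds little traffic compared with forwarding every committed result.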

ICS’02 23 Late Forwarding: Use the Normal Forwarding Buses! [Figure: datapath with Fetch (F1, F2), Decode/Dispatch (D1, D2), IQ, function units FU1 ... FUm (EX), D-cache, LSQ, ROB, and Architectural Register File (ARF), with the result/status forwarding buses highlighted]. Only 3.5% of the traffic is from SELECTIVE LATE FORWARDING.

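ICS’02 24 Performance Drop of Simplified ROB [Figure: performance drop (%) per benchmark. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg. Average IPC drops shown: 9.6%, 3.5%, and 1.0%; additional chart labels: 37% and 17%]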

ICS’02 25 IPC Penalty: Source Value Not Accessible within the ROB [Figure: lifetime of a result value, from result generation time (forwarding), through the period the value is within the ROB, to late forwarding/commitment, after which the value is within the ARF]

ICS’02 26 Improving IPC with No Read Ports: Cache recently generated values in a set of RETENTION LATCHES (RL). Retention Latches are SMALL and FAST: only 8 to 16 latches are needed in the set, and the entire set has 1 or 2 read ports.

ICS’02 27-28 Datapath with the Retention Latches [Figure, two animation steps: datapath with Fetch (F1, F2), Decode/Dispatch (D1, D2), IQ, function units FU1 ... FUm (EX), D-cache, LSQ, ROB, Architectural Register File (ARF), and result/status forwarding buses; the second step adds the RETENTION LATCHES]

ICS’02 29 The Structure of the Retention Latch Set [Figure: a set of 8 or 16 latches, each with an L-ported CAM field (key = ROB_slot_id), a status field, and a result value. Inputs: L ROB slot addresses (L = 1 or 2) and W write ports for writing up to W results in parallel. Outputs: L recently-written results (L = 1 or 2 works great)]

ICS’02 30 Retention Latch Management Strategies. FIFO: 8-entry RL, 42% hit rate; 16-entry RL, 55% hit rate. LRU: 8-entry RL, 56% hit rate; 16-entry RL, 62% hit rate. Random replacement: worse performance than FIFO.
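The two replacement policies can be contrasted with a small software model. This is a sketch only: the class name `RetentionLatches` and its methods are hypothetical, an `OrderedDict` stands in for the CAM keyed by ROB slot id, and port limits are not modeled.

```python
from collections import OrderedDict


class RetentionLatches:
    """Small cache of recent results keyed by ROB slot id (FIFO or LRU)."""

    def __init__(self, size: int = 8, policy: str = "FIFO"):
        if policy not in ("FIFO", "LRU"):
            raise ValueError("policy must be 'FIFO' or 'LRU'")
        self.size = size
        self.policy = policy
        self.entries = OrderedDict()  # rob_slot_id -> value, oldest first

    def write(self, rob_slot_id: int, value: int) -> None:
        if rob_slot_id in self.entries:
            del self.entries[rob_slot_id]      # re-insert as the newest entry
        elif len(self.entries) >= self.size:
            self.entries.popitem(last=False)   # evict the oldest / least recent
        self.entries[rob_slot_id] = value

    def lookup(self, rob_slot_id: int):
        """Return the cached value, or None on a miss."""
        if rob_slot_id not in self.entries:
            return None
        if self.policy == "LRU":
            self.entries.move_to_end(rob_slot_id)  # a hit refreshes recency
        return self.entries[rob_slot_id]

    def invalidate(self, rob_slot_id: int) -> None:
        """Drop one entry, e.g. when its ROB slot commits."""
        self.entries.pop(rob_slot_id, None)


if __name__ == "__main__":
    for policy in ("FIFO", "LRU"):
        rl = RetentionLatches(size=2, policy=policy)
        rl.write(1, 10)
        rl.write(2, 20)
        rl.lookup(1)       # touching slot 1 only matters under LRU
        rl.write(3, 30)    # forces one eviction
        print(policy, list(rl.entries))
```

With two latches, FIFO evicts slot 1 despite the recent hit, while LRU evicts slot 2 instead. That difference in which value survives is the mechanism behind the differing hit rates, although the numbers on the slide come from the authors' simulations, not from this model.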

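ICS’02 31 Hit Ratios to Retention Latches [Figure: hit ratios per benchmark. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg. Average hit ratios shown: 42%, 55%, 56%, and 62%]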

ICS’02 32 Accessing Retention Latch Entries: The ROB index is used as a unique key in the Retention Latches to search for result values. Keys must remain unique even in the presence of: (1) reuse of a ROB slot, which is not a problem for FIFO; for LRU, simply flush the RL entry at commit time; and (2) branch mispredictions.

ICS’02 33 Handling Branch Mispredictions. Selective RL flushing: retention latch entries that are in the mispredicted path are flushed; uses branch tags; complicated implementation. Complete RL flushing: all retention latch entries are flushed; very simple implementation; performance drop is only 1.5% compared to selective flushing.
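The two flushing options can be sketched as follows. The encoding is an assumption made for illustration: each latch entry carries a hypothetical branch tag equal to the sequence number of the most recent branch preceding its producer, tags grow in program order, and wraparound is ignored. The slides do not specify how branch tags are actually encoded.

```python
def complete_flush(latches: dict) -> None:
    """Complete RL flushing: drop every entry on a misprediction."""
    latches.clear()


def selective_flush(latches: dict, mispredicted_branch: int) -> None:
    """Selective RL flushing: drop only entries produced after the branch.

    latches maps rob_slot_id -> (value, branch_tag).
    """
    wrong_path = [slot for slot, (_, tag) in latches.items()
                  if tag >= mispredicted_branch]
    for slot in wrong_path:
        del latches[slot]


if __name__ == "__main__":
    latches = {12: (7, 1), 13: (9, 2), 14: (4, 3)}
    selective_flush(latches, mispredicted_branch=2)
    print(latches)   # only the entry older than branch 2 survives
    complete_flush(latches)
    print(latches)
```

The selective version needs a tag per entry and a comparison per latch, whereas the complete version is a single clear, which matches the slide's trade-off between implementation complexity and a small performance difference.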

ICS’02 34 Misprediction Handling: Performance [Figure: IPC for selective versus complete RL flushing. Average IPC drop: 1.5%]

ICS’02 35 Scenario 1: Traditional Design [Figure: instruction ADD R1, R2, R3 with ROB index 5. Simplified IDB entry #1 holds Src1 arch. = 2 and Src2 arch. = 3; Src1 valid, Src1 value, Src2 valid, and Src2 value are still unknown (?)]

ICS’02 36 Scenario 1: Traditional Design [Figure: the Rename Table is added, with columns Arch., a ROB/ARF location flag, and ROB#/Phys.; the IDB entry fields are still unknown]

ICS’02 37 Scenario 1: Traditional Design [Figure: the ROB is added, with columns ROB#/Phys., Phys. valid, and Phys. value; Src1 maps to ROB entry 12, which holds valid = 1 and value = 7]

ICS’02 38 Scenario 1: Traditional Design [Figure: the ROB read completes; the IDB entry now holds Src1 valid = 1 and Src1 value = 7]

ICS’02 39 Scenario 1: Traditional Design [Figure: alternative case; ROB entry 12 holds valid = 0 and an unknown value]

ICS’02 40 Scenario 1: Traditional Design [Figure: the IDB entry is written with Src1 valid = 0; Src1 value remains unknown]

ICS’02 41 Scenario 1: Traditional Design [Figure: Src2 is looked up in the ARF, where architectural register 3 holds the value 43; the IDB entry holds Src1 valid = 1 and Src1 value = 7]

ICS’02 42 Scenario 1: Traditional Design [Figure: the ARF read completes; the IDB entry now also holds Src2 value = 43]

ICS’02 43 Scenario 2: Simplified ROB with RLs [Figure: instruction ADD R1, R2, R3 with ROB index 5. Simplified IDB entry #1 holds Src1 arch. = 2 and Src2 arch. = 3; Src1 valid, Src1 value, Src2 valid, and Src2 value are still unknown (?)]

ICS’02 44 Scenario 2: Simplified ROB with RLs [Figure: the Rename Table is added, with columns Arch., a ROB/ARF location flag, and ROB#/Phys.; the IDB entry fields are still unknown]

ICS’02 45 Scenario 2: Simplified ROB with RLs [Figure: the Retention Latches are added, with columns ROB#/Phys. and Phys. value; Src1 maps to ROB index 12, and the latch for index 12 holds the value 7]

ICS’02 46 Scenario 2: Simplified ROB with RLs [Figure: the retention latch lookup hits; the IDB entry now holds Src1 valid = 1 and Src1 value = 7]

ICS’02 47 Scenario 2: Simplified ROB with RLs [Figure: alternative case; the retention latch lookup for ROB index 12 is a MISS]

ICS’02 48 Scenario 2: Simplified ROB with RLs [Figure: after the MISS, the IDB entry is written with Src1 valid = 0. The ROB is shown with columns ROB#/Phys., Phys. valid, Phys. value, and SLF; for entry 12 the valid and value fields are X (don't care) and the SLF bit is 0]

ICS’02 49 Scenario 2: Simplified ROB with RLs [Figure: the SLF bit of ROB entry 12 is set to 1; Src1 valid stays 0]

ICS’02 50 Scenario 2: Simplified ROB with RLs [Figure: Src2 is looked up in the ARF, where architectural register 3 holds the value 43; the IDB entry holds Src1 valid = 1 and Src1 value = 7]

ICS’02 51 Scenario 2: Simplified ROB with RLs [Figure: the ARF read completes; the IDB entry now also holds Src2 value = 43]
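The Scenario 2 walkthrough can be summarized as a dispatch-time operand lookup. This is a minimal sketch under stated assumptions: `Mapping`, `read_source`, and the dictionary-based tables are hypothetical names, and the SLF bit is set on every retention latch miss, as the slides depict, without modeling whether the producer has already written its result.

```python
from dataclasses import dataclass


@dataclass
class Mapping:
    in_arf: bool          # True: the committed value lives in the ARF
    rob_index: int = -1   # meaningful only when in_arf is False


def read_source(arch_reg, rename_table, arf, retention_latches, slf_bits):
    """Return (valid, value) for one source operand at dispatch."""
    m = rename_table[arch_reg]
    if m.in_arf:
        return True, arf[arch_reg]                   # read from the ARF
    if m.rob_index in retention_latches:
        return True, retention_latches[m.rob_index]  # RL hit
    slf_bits[m.rob_index] = True                     # RL miss: ask for late forwarding
    return False, None


if __name__ == "__main__":
    # ADD R1, R2, R3: Src1 = R2 (renamed to ROB index 12), Src2 = R3 (in the ARF).
    rename_table = {2: Mapping(in_arf=False, rob_index=12), 3: Mapping(in_arf=True)}
    arf = {3: 43}
    slf_bits = {}
    print(read_source(2, rename_table, arf, {12: 7}, slf_bits))  # RL hit
    print(read_source(3, rename_table, arf, {12: 7}, slf_bits))  # ARF read
    print(read_source(2, rename_table, arf, {}, slf_bits))       # RL miss
    print(slf_bits)
```

No path in this lookup reads a value out of the ROB itself, which is the point of the design: the ROB only has to record the SLF request.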

ICS’02 52 Experimental Setup: the AccuPower (DATE’02) [Figure: tool flow. Compiled SPEC benchmarks and datapath specs feed a microarchitectural simulator (rooted in SimpleScalar), which produces performance stats as well as transition counts and context information. VLSI layout data yields a SPICE deck, and SPICE produces measures of energy per transition. The energy/power estimator combines these to produce power/energy stats]

ICS’02 53 Configuration of the Simulated System. Machine width: 4-way. Issue Queue: 32 entries. Reorder Buffer: 96 entries. Load/Store Queue: 32 entries. Simulated the execution of SPEC2000 benchmarks.

ICS’02 54 Assumed Timings. Timing of the baseline model: D1, Rename Table lookup for ROB index; D2 and D3, source operand read from the ROB. Timing of the simplified ROB: D1, Rename Table lookup for ROB index; D2, associative lookup of the operand from the retention latches using the ROB index as a key (smaller delay: few latches).

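ICS’02 55 Experimental Results: Effect on Performance [Figure: performance drop (%) per benchmark. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Average IPC drops shown: 0.1%, -1.6%, -1.0%, and -2.3%]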

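ICS’02 56 Experimental Results: Effect on Performance [Figure: performance drop (%) per benchmark. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Average IPC drops shown: 3.3%, 1.7%, 2.3%, and 1.0%]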

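ICS’02 57 Experimental Results: Effect on Power [Figure: power savings (%) per benchmark. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg. Average savings shown: 30%, 23.4%, 22.2%, 21%, and 20.2%]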

ICS’02 58 Summary of Results: Significantly reduced ROB complexity and power dissipation: 45% area reduction; 20% to 30% power reduction across SPEC 2000 benchmarks. Actual IPC improvements: 1.6% to 2.3% gain across SPEC benchmarks. IPC gains come from the 1-cycle access to the RL (vs. the 2 cycles that would be needed for ROB access).

ICS’02 59 Related Work: Value-Aging Buffer (Hu & Martonosi, PACS 2000); Forwarding Buffer and Clustered Register Cache (Borch et al., HPCA’02); Multiple Register Banks (Cruz et al., ISCA’00 & Balasubramonian et al., MICRO’01). See paper for discussions.

ICS’02 60 Conclusions: Typical source operand location statistics can be successfully exploited to reduce ROB complexity. Significant reduction in ROB area and power: no ROB ports are needed for reading source operands. IPC gains are possible because a small, low-ported set of Retention Latches supplies cached operand values in a single cycle.
