ICCD’03 1 Distributed Reorder Buffer Schemes for Low Power * *supported in part by DARPA through the PAC-C program and NSF Gurhan Kucuk, Oguz Ergin, Dmitry.


ICCD’03 1 Distributed Reorder Buffer Schemes for Low Power* (*supported in part by DARPA through the PAC-C program and NSF). Gurhan Kucuk, Oguz Ergin, Dmitry Ponomarev, Kanad Ghose, Department of Computer Science, State University of New York, Binghamton, NY. International Conference on Computer Design (ICCD’03), October 14th, 2003

ICCD’03 2 Outline
– Reorder Buffer (ROB) complexities
– Motivation for the low-complexity ROB
– Low-complexity ROB designs: Fully Distributed ROB; Retention Latches (RLs) revisited (ICS’02); Combined Scheme
– Results
– Concluding remarks

ICCD’03 3 P6-style Superscalar Datapath [Figure: datapath with Fetch (F1, F2), Decode/Dispatch (D1, D2), instruction dispatch into the issue queue (IQ), instruction issue to function units FU1, FU2, ..., FUm (EX), the ROB, the Architectural Register File (ARF), and the result/status forwarding buses]

ICCD’03 4 PPC 620-style Superscalar Datapath [Figure: the same datapath with an RB block shown in addition to the ROB]

ICCD’03 5 ROB Port Requirements for a W-way CPU
– Writeback: W write ports to write results
– Dispatch/Issue: 2W read ports to read the source operands
– Decode/Dispatch: W write ports to set up entries
– Commit: W read ports for instruction commitment
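The port counts on this slide can be tallied for a concrete machine width. The following is only a sketch: the function name and dictionary keys are illustrative and not taken from the slides, and it simply adds up the four per-stage port counts listed above.

```python
def rob_port_requirements(width: int) -> dict:
    """Port counts of a centralized ROB in a W-way machine, as listed on the slide."""
    return {
        "writeback_write": width,       # W write ports to write results
        "source_read": 2 * width,       # 2W read ports to read source operands
        "dispatch_write": width,        # W write ports to set up entries
        "commit_read": width,           # W read ports for commitment
    }


# Example: the 4-way machine used later in the experiments.
ports = rob_port_requirements(4)
total_ports = sum(ports.values())
print(ports, total_ports)
```

For W = 4 the source-operand read ports alone account for 8 of the 20 ports, which is why the later slides concentrate on removing them.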

ICCD’03 6 What This Work is All About
– ROB complexity reduction is important for reducing power and improving performance: the ROB dissipates a non-trivial fraction of the total chip power, and ROB accesses stretch over several cycles
– Goal of this work: reduce the complexity and power dissipation of the ROB without sacrificing performance

ICCD’03 7 Comparison of ROB Bitcells (0.18µ, TSMC) [Figure: layout of a 32-ported SRAM bitcell next to the layout of a 16-ported SRAM bitcell] Area reduction: 71%. Shorter bitlines and wordlines.

ICCD’03 8 P6-style Superscalar Datapath [Figure: same datapath as slide 3]

ICCD’03 9 Reorder Buffer Distribution [Figure: the datapath with the ROB storage split into ROB Components (ROBCs) ROBC 1, ROBC 2, ..., ROBC m placed at the function units; the centralized ROB holds pointers to entries within the ROBCs]

ICCD’03 10 Impact of Distributing the ROB
– Each ROBC is effectively a small Rename Buffer: smaller read/write access energy and faster access time
– Distributing physical storage in this manner allows FUs to use shorter buses to write their respective ROBCs: lower energy dissipation on the wires (we have NOT accounted for energy savings from using shorter wires)
– Fits in naturally with a multi-clustered datapath design

ICCD’03 11-14 Problems with the earlier Multi-banked RF Schemes, and some good news!
– Port conflicts result in performance penalty. Good news: totally avoid write port conflicts; minimize read port conflicts at commitment; totally avoid source read port conflicts
– Interconnection network is more complex. Good news: completely remove source read ports

ICCD’03 15 ROBCs Assigned to Each Function Unit [Figure: a centralized ROB with entries 1..n, each holding an (FU_id, offset) pointer, next to the distributed ROBCs ROBC #1, ROBC #2, ..., ROBC #m attached to FU #1, FU #2, ..., FU #m]

ICCD’03 16 Good News: Write port conflicts are avoided [Figure: same layout as slide 15; each FU writes its own ROBC through 1 write port]

ICCD’03 17-26 Round Robin Scheduling at Dispatch Time [Animation: a centralized ROB with (FU_id, offset) pointer fields next to four distributed ROBCs, Int ADD ROBC #1 to #4, each with entries 1 and 2. ADD is dispatched first: an entry is reserved in Int ADD ROBC #1 and the centralized ROB entry for ADD records the pointer 11 (FU_id 1, offset 1). SUB is dispatched next: an entry is reserved in Int ADD ROBC #2 and its pointer is 21. AND follows: an entry is reserved in the next Int ADD ROBC and its pointer is recorded in the centralized ROB.]
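The animation above can be mimicked with a small model. This is a minimal sketch, not the authors' implementation: the class and method names are hypothetical, each ROBC is treated as a circular buffer with 1-based offsets, and skipping a full ROBC in favor of the next one is an assumption (the slides only show the case where every ROBC has free entries).

```python
from collections import deque


class DistributedROB:
    """Centralized FIFO of (opcode, fu_id, offset) pointers plus per-FU ROBCs."""

    def __init__(self, robc_sizes):
        self.sizes = dict(robc_sizes)                 # fu_id -> number of entries
        self.used = {fu: 0 for fu in robc_sizes}      # allocated entries per ROBC
        self.next_offset = {fu: 1 for fu in robc_sizes}
        self.central = deque()                        # program-order pointer FIFO
        self.rr = {}                                  # round-robin index per FU group

    def dispatch(self, opcode, candidate_fus):
        """Reserve an entry in the next ROBC of the group, round-robin."""
        key = tuple(candidate_fus)
        start = self.rr.get(key, 0)
        for i in range(len(candidate_fus)):
            idx = (start + i) % len(candidate_fus)
            fu = candidate_fus[idx]
            if self.used[fu] < self.sizes[fu]:
                offset = self.next_offset[fu]
                self.next_offset[fu] = offset % self.sizes[fu] + 1
                self.used[fu] += 1
                self.central.append((opcode, fu, offset))
                self.rr[key] = (idx + 1) % len(candidate_fus)
                return fu, offset
        return None                                   # all ROBCs full: dispatch blocks


# Four Int ADD ROBCs (1..4) and one Int MUL/DIV ROBC (5), two entries each.
rob = DistributedROB({1: 2, 2: 2, 3: 2, 4: 2, 5: 2})
INT_ADD, INT_MUL = [1, 2, 3, 4], [5]
pointers = [rob.dispatch(op, fus) for op, fus in
            [("ADD", INT_ADD), ("SUB", INT_ADD), ("AND", INT_ADD),
             ("MUL", INT_MUL), ("DIV", INT_MUL)]]
print(pointers)
```

Because each function unit writes results only into its own ROBC, one write port per ROBC is enough, which is the "good news" on slide 16.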

ICCD’03 27 Good News: Avoiding Read Port Conflicts [Figure: ADD, SUB, and AND hold reserved entries in different ROBCs (pointers 11, 21, 31 in the centralized ROB), so each is read for commitment through its own ROBC's 1 read port]

ICCD’03 28-33 Round Robin Scheduling at Dispatch Time [Animation: the same layout with the Int MUL/DIV ROBC #5 (entries 1 and 2). MUL is dispatched: entry 1 of ROBC #5 is reserved and the centralized ROB records the pointer 51. DIV is dispatched next: there is only one Int MUL/DIV ROBC, so entry 2 of ROBC #5 is reserved and the pointer is 52.]

ICCD’03 34 Read Port Conflicts at Commitment [Figure: MUL (pointer 51) and DIV (pointer 52) both hold entries in the Int MUL/DIV ROBC #5, which has 1 read port to commitment] CONFLICT: if MUL and DIV want to commit in the same cycle
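A commit-stage conflict of this kind can be illustrated with a short sketch. The function name, the tuple layout, and the in-order stop-at-first-conflict policy are assumptions made for illustration; the slide states only that each ROBC has one read port for commitment.

```python
def commit_cycle(central_rob, width=4):
    """Commit up to `width` completed instructions in program order,
    using at most one commit read port per ROBC in this cycle."""
    committed = []
    ports_used = set()                       # ROBCs already read this cycle
    while central_rob and len(committed) < width:
        opcode, fu_id, done = central_rob[0]
        if not done or fu_id in ports_used:
            break                            # not finished, or read port conflict
        ports_used.add(fu_id)
        committed.append(central_rob.pop(0)[0])
    return committed


# AND sits in ROBC #3; MUL and DIV share the single Int MUL/DIV ROBC #5.
pending = [("AND", 3, True), ("MUL", 5, True), ("DIV", 5, True)]
first = commit_cycle(pending)
second = commit_cycle(pending)
print(first, second)
```

DIV slips to the next cycle even though the machine could commit four instructions, which is the (small) commitment penalty the deck quantifies later.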

ICCD’03 35-37 Distributed ROB Design 1: with source read ports. Ports on each ROBC:
– Writeback: 1 write port to write results
– Dispatch/Issue: 1 read port to read the source operands
– Commit: 1 read port for instruction commitment

ICCD’03 38 Experimental Setup: the AccuPower (DATE’02) [Figure: tool flow. Compiled SPEC benchmarks and datapath specs feed a microarchitectural simulator (rooted in SimpleScalar), which produces performance stats together with transition counts and context information. VLSI layout data yields a SPICE deck, and SPICE produces measures of energy per transition. The energy/power estimator combines these into power/energy stats.]

ICCD’03 39 Configuration of the Simulated System
Machine width: 4-way
Issue Queue: 32 entries
Reorder Buffer: 96 entries
Load/Store Queue: 32 entries
Simulated the execution of SPEC2000 benchmarks

ICCD’03 40-42 Peak/Average demands on the number of ROBC entries [Table: peak and average demand per ROBC type (Int ADD #1, #2, #3, #4; Int MUL/DIV; FP ADD #1, #2, #3, #4; FP MUL/DIV; Load) for the SPEC 2000 Integer Average, SPEC 2000 FP Average, and SPEC 2000 Average; numeric values not recoverable] Number of entries assigned to each ROBC: 8_4_4_4_16 configuration = 72 entries

ICCD’03 43-44 Percentage of cycles when dispatch blocks for 8_4_4_4_16 [Table: per ROBC type (Int ADD #1, #2, #3, #4; Int MUL/DIV; FP ADD #1, #2, #3, #4; FP MUL/DIV; Load) for the SPEC 2000 Integer Average, SPEC 2000 FP Average, and SPEC 2000 Average; numeric values not recoverable] Number of entries assigned to each ROBC = 72 entries. Average IPC drop with the 8_4_4_4_16 configuration = 4.8%

ICCD’03 45 Reducing performance penalty: 12_6_4_6_20 Configuration [Table: per ROBC type (Int ADD #1, #2, #3, #4; Int MUL/DIV; FP ADD #1, #2, #3, #4; FP MUL/DIV; Load) for the SPEC 2000 Integer Average, SPEC 2000 FP Average, and SPEC 2000 Average; numeric values not recoverable] Number of entries assigned to each ROBC: 12_6_4_6_20 configuration = 96 entries
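The configuration names can be checked against the stated totals. The sketch below assumes that the five fields give the per-ROBC entry counts in the column order of the table (Int ADD, Int MUL/DIV, FP ADD, FP MUL/DIV, Load) and that there are four Int ADD and four FP ADD ROBCs and one of each other type; that reading is an assumption, but it reproduces both the 72-entry and the 96-entry totals.

```python
# Number of ROBCs of each type, in the assumed field order of the config name.
ROBC_COUNTS = {
    "Int ADD": 4,
    "Int MUL/DIV": 1,
    "FP ADD": 4,
    "FP MUL/DIV": 1,
    "Load": 1,
}


def total_entries(config: str) -> int:
    """Total ROBC entries for a name such as '8_4_4_4_16'."""
    sizes = [int(field) for field in config.split("_")]
    if len(sizes) != len(ROBC_COUNTS):
        raise ValueError("expected one size per ROBC type")
    return sum(count * size for count, size in zip(ROBC_COUNTS.values(), sizes))


small = total_entries("8_4_4_4_16")    # 4*8 + 4 + 4*4 + 4 + 16
large = total_entries("12_6_4_6_20")   # 4*12 + 6 + 4*4 + 6 + 20
print(small, large)
```

Under this reading, the larger configuration matches the 96 entries of the baseline centralized ROB while keeping every individual ROBC small.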

ICCD’03 46 Performance Results for 12_6_4_6_20 Configuration [Chart: IPC per benchmark: gap, gcc, gzip, parser, perl, twolf, vortex, vpr, Int Avg.; applu, art, mesa, mgrid, swim, wupwise, FP Avg.] Average IPC drop with the 12_6_4_6_20 configuration = 2.4%

ICCD’03 47 Distributed ROB Design 1: with source read ports. Ports on each ROBC: Writeback, 1 write port to write results; Dispatch/Issue, 1 read port to read the source operands; Commit, 1 read port for instruction commitment

ICCD’03 48-49 Eliminating All Source Read Ports. The Dispatch/Issue read port is removed, leaving on each ROBC: Writeback, 1 write port to write results; Commit, 1 read port for instruction commitment

ICCD’03 50 Where are the Source Values Coming From? [Figure: the P6-style datapath of slide 3 with three sources of operand values labeled 1, 2, 3]

ICCD’03 51 Where are the Source Values Coming From? 96-entry ROB, 4-way processor, SPEC2K benchmarks [Chart: breakdown of source values by origin: 62%, 32%, 6%]

ICCD’03 52 How Efficiently are the Ports Used? ROB ports: Writeback, W write ports to write results; Dispatch/Issue, 2W read ports to read the source operands; Decode/Dispatch, W write ports to set up entries; Commit, W read ports for instruction commitment. [Annotation on the figure: 6%]

ICCD’03 53-55 Our Solution: Elimination of Read Ports [Figure: the P6-style datapath with the three sources of operand values labeled 1, 2, 3; on the last build only labels 1 and 3 remain]

ICCD’03 56 Distributed Reorder Buffer Scheme [Figure: the datapath with ROBC 1, ROBC 2, ..., ROBC m at the function units; the centralized ROB holds pointers to entries within the ROBCs]

ICCD’03 57-58 Elimination of Source Read Ports [Figure: the same distributed-ROB datapath as slide 56]

ICCD’03 59 Completely Eliminating the Source Read Ports on the ROBCs
– The Problem: issue of instructions that require a value stored in a ROBC will stall
– Solution: forward the value to the waiting instruction at the time the value is committed: LATE FORWARDING
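Late forwarding can be pictured as a second broadcast of the result at commit time. The sketch below is only illustrative: the class, the function, the tag format (an FU_id/offset pair), and the register names are all hypothetical, and the real design reuses the normal forwarding buses rather than a Python dictionary.

```python
class IssueQueue:
    """Tracks instructions that are waiting for a result tag."""

    def __init__(self):
        self.waiting = {}                      # result tag -> waiting instructions

    def wait_on(self, instr, tag):
        self.waiting.setdefault(tag, []).append(instr)

    def broadcast(self, tag, value):
        # A value appears on the forwarding bus: wake every waiter on this tag.
        return [(instr, value) for instr in self.waiting.pop(tag, [])]


def commit(tag, dest_reg, value, arf, iq):
    """Commit one ROBC entry: update the ARF and late-forward the value."""
    arf[dest_reg] = value
    return iq.broadcast(tag, value)


iq = IssueQueue()
arf = {}
# SUB was dispatched after the producer's normal forwarding had already happened,
# and the ROBC has no source read port, so it waits for the commit-time broadcast.
iq.wait_on("SUB r4, r3, r1", tag=(1, 1))
woken = commit((1, 1), "r3", 42, arf, iq)
print(woken, arf)
```

The cost of this scheme is the wait between dispatch of the consumer and commitment of the producer, which the following slides measure as an IPC drop.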

ICCD’03 60-61 Late Forwarding: Use the Normal Forwarding Buses! [Figure: the distributed-ROB datapath of slide 56 with a Late Forwarding path from the ROBCs onto the result/status forwarding buses]

ICCD’03 62 Performance Drop of Simplified ROBC Design [Chart: performance drop % per benchmark: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg.; applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.; annotated bar values: 37%, 17%] Average IPC drop: 9.6%

ICCD’03 63 IPC Penalty: Source Value Not Accessible within the ROBC [Figure: lifetime of a result value, from result generation time and forwarding, through the period when the value is within a ROBC, to late forwarding/commitment, after which the value is within the ARF]

ICCD’03 64 Improving IPC with No Read Ports
– Cache recently generated values in a set of RETENTION LATCHES (RLs)
– Retention Latches are SMALL and FAST: only 8 to 16 latches are needed in the set, and the entire set has 1 or 2 read ports
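A set of retention latches behaves like a very small cache of recent results. The following sketch assumes FIFO replacement and a default of eight latches, both of which match a later slide ("Eight, 2-ported FIFO RLs"); the class name, the tag format, and the use of `None` for a miss are illustrative choices.

```python
from collections import OrderedDict


class RetentionLatches:
    """Small FIFO-managed set of latches holding recently produced results."""

    def __init__(self, size=8):
        self.size = size
        self.latches = OrderedDict()           # result tag -> value, oldest first

    def write(self, tag, value):
        if tag not in self.latches and len(self.latches) >= self.size:
            self.latches.popitem(last=False)   # evict the oldest value (FIFO)
        self.latches[tag] = value

    def read(self, tag):
        """Return the cached value, or None if it must come from late forwarding."""
        return self.latches.get(tag)


rls = RetentionLatches(size=8)
for n in range(9):                             # the ninth write evicts the first
    rls.write(("fu1", n), n * 10)
hit = rls.read(("fu1", 8))
miss = rls.read(("fu1", 0))
print(hit, miss)
```

A hit lets a newly dispatched instruction pick up its operand immediately; a miss falls back to waiting for the value to be forwarded at commitment.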

ICCD’03 65-66 Adding Retention Latches into the Picture [Figure: the distributed-ROB datapath with late forwarding, now with a set of RETENTION LATCHES added; the centralized ROB holds pointers to entries within the ROBCs]

ICCD’03 67 Eliminating All Source Read Ports. Ports on each ROBC: Writeback, 1 write port to write results; Commit, 1 read port for instruction commitment

ICCD’03 68 Distributed ROB Design 2: with Retention Latches. Ports on each ROBC: Writeback, 1 write port to write results; Commit, 1 read port for instruction commitment. Eight 2-ported FIFO RLs

ICCD’03 69 Performance Results for 12_6_4_6_20 Configuration [Chart: IPC per benchmark: gap, gcc, gzip, parser, perl, twolf, vortex, vpr, Int Avg.; applu, art, mesa, mgrid, swim, wupwise, FP Avg.] Average IPC drop with the 12_6_4_6_20 configuration = 2.4%

ICCD’03 70 Performance Results for 12_6_4_6_20 Configuration [Chart: IPC per benchmark, same benchmarks] Average IPC drop with the 12_6_4_6_20 configuration = 1.7%

ICCD’03 71 Performance Results for 12_6_4_6_20 Configuration [Chart: IPC per benchmark, same benchmarks] Average IPC drop with the 12_6_4_6_20 configuration = 3.8%

ICCD’03 72 Power Results for 12_6_4_6_20 Configuration [Chart: power savings % per benchmark: gap, gcc, gzip, parser, perl, twolf, vortex, vpr, Int Avg.; applu, art, mesa, mgrid, swim, wupwise, FP Avg.] Power savings: 49%, 47%, 23%

ICCD’03 73 Power Results for 12_6_4_6_20 Configuration (compared to the baseline case with 64-entry Rename Buffers) [Chart: power savings % per benchmark, same benchmarks] Power savings: 39%, 37%, 20%

ICCD’03 74 Summary of Results
– Low performance degradation: 1.7% IPC drop on average (compared to a 2-cycle ROB); 3.8% IPC drop on average (compared to a 1-cycle ROB)
– ROB power savings: as high as 49% (compared to the P6-style datapath: 96-entry ROB); as high as 39% (compared to the Rename Buffer design: 96-entry ROB, 64-entry RB)

ICCD’03 75 Conclusions
– We introduced a conflict-free distributed Reorder Buffer design
– ROB power savings of as high as 49% are realized with only a small (1.7%) performance penalty
– ROB complexity is drastically reduced by distributing the ROB into multiple banks and by reducing the port requirements to no more than 2 ports for each ROB component

ICCD’03 76 ~ Thank You~


ICCD’03 78 Related Work
– Replicated (Kessler, IEEE Micro) and distributed (Canal et al., HPCA’00 and Farkas et al., MICRO’97) RFs in a clustered organization
– Multiple Register Banks (Cruz et al., ISCA’00 and Balasubramonian et al., MICRO’01)
– Multiple Register Banks with an additional pipeline stage to avoid complex arbitration logic (Tseng et al., ISCA’03)
– Multiple Register Banks without write port conflicts (Wallace et al., PACT’96)

ICCD’03 79 ROB Port Requirements for a W-way CPU. Writeback: W write ports to write results. Dispatch/Issue: 2W read ports to read the source operands. Decode/Dispatch: W write ports to set up entries. Commit: W read ports for instruction commitment

ICCD’03 80 ROB Port Requirements for a W-way CPU. Writeback: W write ports to write results. Dispatch/Issue: 2W read ports to read the source operands. Decode/Dispatch: 1 W-wide write port to set up entries. Commit: 1 W-wide read port for instruction commitment

ICCD’03 81 Reducing ROB Power and Complexity [Figure: the ROB shown together with its physical registers (Phys. regs.)]

ICCD’03 82 Distribution [Figure: the centralized ROB next to physical-register storage split by function unit: Int ADD 1-4, Int MUL, FP ADD 1-4, FP MUL, LOAD] Smaller structures: shorter bitlines, lower capacitive loading, etc. LESS POWER DISSIPATION!

ICCD’03 83 Dedicate FUs to ROBCs [Figure: same layout as slide 82] Fewer ports: much smaller structures. LESS POWER DISSIPATION! + LESS COMPLEXITY!

ICCD’03 84 Fully Distributed Reorder Buffer Scheme [Figure: same layout, with the per-FU structures labeled ROBCs] Fewer ports: much smaller structures. LESS POWER DISSIPATION! + LESS COMPLEXITY!

ICCD’03 85-86 Fully Distributed Reorder Buffer Scheme
– Distributed ROB Components (ROBCs) are assigned to each Function Unit. No write port conflicts at the writeback stage and minimal read port conflicts at commitment: negligible performance penalty. Each ROBC can be tailored to the needs of its FU: no overcommitment of resources, less complexity
– The FIFO structure that maintains pointers to the ROBCs remains centralized

ICCD’03 87-88 Fully Distributed Reorder Buffer Scheme [Figure: a centralized ROB with entries 1..n, each holding an (FU_id, offset) pointer, next to the distributed ROBCs ROBC #1, ROBC #2, ..., ROBC #m]


ICCD’03 90 Results for the Scheme with Retention Latches [Chart: power savings % per benchmark: gap, gcc, gzip, parser, perl, twolf, vortex, vpr, Int Avg.; applu, art, mesa, mgrid, swim, wupwise, FP Avg.] Power savings: 23%