Low-Complexity Reorder Buffer Architecture*
Gurhan Kucuk, Dmitry Ponomarev, Kanad Ghose
Department of Computer Science, State University of New York, Binghamton, NY
Annual ACM International Conference on Supercomputing (ICS’02), June 24, 2002
*Supported in part by DARPA through the PAC-C program and NSF
Outline
- ROB complexities
- Motivation for the low-complexity ROB
- Low-complexity ROB design
- Results
- Concluding remarks
What This Work is All About
- Complex, richly-ported ROBs are common in modern superscalar datapaths
- The number of ports grows further when results are held within ROB slots (example: Pentium III)
- ROB complexity reduction is important for reducing power and improving performance
- The ROB dissipates a non-trivial fraction of the total chip power
- ROB accesses stretch over several cycles
- Goal of this work: reduce the complexity and power dissipation of the ROB without sacrificing performance
Pentium III-like Superscalar Datapath
[Figure: datapath diagram showing Fetch (F1, F2), Decode/Dispatch (D1, D2), instruction dispatch into the IQ, instruction issue to the function units (FU1, FU2, ..., FUm; EX), the LSQ and D-cache, the ROB, the Architectural Register File (ARF), and the result/status forwarding buses]
ROB Port Requirements for a W-way CPU
- Writeback: W write ports to write results
- Dispatch/Issue: 2W read ports to read the source operands
- Decode/Dispatch: W write ports to set up entries
- Commit: W read ports for instruction commitment
ROB Port Requirements for a W-way CPU
- Writeback: W write ports to write results
- Dispatch/Issue: 2W read ports to read the source operands
- Decode/Dispatch: 1 W-wide write port to set up entries
- Commit: 1 W-wide read port for instruction commitment
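The port budget on the two slides above can be tallied with a few lines of arithmetic. This is only an illustrative sketch: the function name, the dictionary keys, and the choice to count a W-wide port as a single port are assumptions, not taken from the slides.

```python
def rob_ports(w, wide_dispatch_commit=False):
    """Tally ROB ports for a W-way machine, following the slides' breakdown.

    With wide_dispatch_commit=True, dispatch and commit each use one
    W-wide port (counted here as a single port) instead of W ports.
    """
    ports = {
        "writeback_write": w,      # W write ports to write results
        "source_read": 2 * w,      # 2W read ports for source operands
        "dispatch_write": 1 if wide_dispatch_commit else w,
        "commit_read": 1 if wide_dispatch_commit else w,
    }
    ports["total"] = sum(ports.values())
    return ports


four_way = rob_ports(4)
four_way_wide = rob_ports(4, wide_dispatch_commit=True)
print(four_way["total"], four_way_wide["total"])
```

In both variants the source-operand read ports are the largest single group, which is why the rest of the talk targets them.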
Where are the Source Values Coming From?
[Figure: the Pentium III-like datapath diagram with three possible sources of operand values marked 1, 2, and 3]
Where are the Source Values Coming From?
- Configuration: 96-entry ROB, 4-way processor, SPEC2K benchmarks
[Figure: chart of source operand origins; labeled shares: 62%, 32%, 6%]
How Efficiently are the Ports Used?
- Writeback: W write ports to write results
- Dispatch/Issue: 2W read ports to read the source operands
- Decode/Dispatch: W write ports to set up entries
- Commit: W read ports for instruction commitment
- Slide annotation: 6%
Approaches to Reducing ROB Complexity
- Reduce the number of read ports for reading out the source operand values
- More radical (and better): completely eliminate the read ports for reading source operand values!
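Reducing the Number of Read Ports
[Figure: bar chart of performance drop (%) per benchmark. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg. Average IPC drop labels: 3.5%, 1.0%]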
Problems with Retaining Fewer Source Read Ports on the ROB
- Need arbitration for the small number of ports
- Additional logic is needed to block the instructions that could not get a port
- Need a switching network to route the operands to the correct destinations
- Multi-cycle access still remains in the critical path of the Dispatch/Issue logic
Our Solution: Elimination of Read Ports
[Figure: three-step build of the datapath diagram. The first two steps mark the operand sources 1, 2, and 3; the last step shows only markers 1 and 3]
Comparison of ROB Bitcells (0.18µ, TSMC)
[Figure: layout of a 32-ported SRAM bitcell next to the layout of a 16-ported SRAM bitcell]
- Area reduction: 71%
- Shorter bitlines and wordlines
Our Solution: Elimination of Read Ports
[Figure: datapath diagram]
- Area reduction: 45%
Eliminating/Reducing the Number of Read Ports: Effects on Power Dissipation
Power is reduced because of:
- shorter bitlines and wordlines
- lower capacitive loading
- fewer decoders
- fewer drivers and sense amps
Completely Eliminating the Source Read Ports on the ROB
- The problem: the issue of instructions that require a value stored in the ROB will stall
- Solution: forward the value to the waiting instruction at the time the value is committed (LATE FORWARDING)
Late Forwarding: Use the Normal Forwarding Buses!
[Figure: two-step build of the datapath diagram with the result/status forwarding buses labeled]
Optimizing Late Forwarding
- PROBLEM: if Late Forwarding is done for every committed result, additional forwarding buses are needed to avoid degrading performance
- SOLUTION: Selective Late Forwarding (SLF)
- SLF requires an additional bit in the ROB
- That bit is set by dispatched instructions that require Late Forwarding
- No additional forwarding buses are needed, since SLF traffic is very small
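The SLF mechanism described on this slide can be sketched as a small behavioral model. Everything below is a hypothetical illustration: the class names, method names, and the list standing in for the forwarding buses are assumptions made for the sketch and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass
class RobEntry:
    value: int = 0
    done: bool = False   # result has been written back
    slf: bool = False    # Selective Late Forwarding bit


class Rob:
    def __init__(self, size):
        self.entries = [RobEntry() for _ in range(size)]
        # (rob_index, value) pairs driven onto the forwarding buses at commit
        self.forwarded = []

    def request_late_forward(self, idx):
        """Called at dispatch by a consumer that could not obtain the value."""
        self.entries[idx].slf = True

    def writeback(self, idx, value):
        entry = self.entries[idx]
        entry.value, entry.done = value, True

    def commit(self, idx):
        """Commit entry idx; forward its value only if the SLF bit is set."""
        entry = self.entries[idx]
        if not entry.done:
            raise RuntimeError("cannot commit an unfinished instruction")
        if entry.slf:
            self.forwarded.append((idx, entry.value))
        value = entry.value
        self.entries[idx] = RobEntry()   # free the slot
        return value


rob = Rob(8)
rob.writeback(3, 7)
rob.writeback(4, 9)
rob.request_late_forward(3)   # some dispatched consumer needs slot 3's value
committed = [rob.commit(3), rob.commit(4)]
print(rob.forwarded)
```

Only slot 3 generates forwarding traffic at commit; slot 4 commits silently, which is the reason SLF does not need extra buses.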
Late Forwarding: Use the Normal Forwarding Buses!
[Figure: datapath diagram with the result/status forwarding buses labeled]
- Only 3.5% of the traffic is from SELECTIVE LATE FORWARDING
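Performance Drop of Simplified ROB
[Figure: bar chart of performance drop (%) per benchmark. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg. Average IPC drop labels: 9.6%, 3.5%, 1.0%. Two further bar labels: 37%, 17%]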
IPC Penalty: Source Value Not Accessible within the ROB
[Figure: timeline of the lifetime of a result value: result generation time and forwarding, the period when the value is within the ROB, late forwarding/commitment, and the period when the value is within the ARF]
Improving IPC with No Read Ports
- Cache recently generated values in a set of RETENTION LATCHES (RL)
- Retention Latches are SMALL and FAST
- Only 8 to 16 latches are needed in the set
- The entire set has 1 or 2 read ports
Datapath with the Retention Latches
[Figure: two-step build of the datapath diagram; the second step adds the RETENTION LATCHES]
The Structure of the Retention Latch Set
- Size: 8 or 16 latches
- Fields: Status, Result Values
- L-ported CAM field (key = ROB_slot_id)
- Inputs: L ROB slot addresses (L = 1 or 2)
- W write ports for writing up to W results in parallel
- Outputs: L recently written results (L = 1 or 2 works great)
Retention Latch Management Strategies
- FIFO: 8-entry RL: 42% hit rate; 16-entry RL: 55% hit rate
- LRU: 8-entry RL: 56% hit rate; 16-entry RL: 62% hit rate
- Random replacement: worse performance than FIFO
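To make the difference between the FIFO and LRU strategies concrete, here is a minimal behavioral sketch of a retention latch set keyed by ROB index. The class and method names are hypothetical, and the model ignores port limits and timing; it only shows which entry each policy evicts.

```python
from collections import OrderedDict


class RetentionLatches:
    """Tiny model of a retention latch set keyed by ROB index."""

    def __init__(self, size=8, policy="FIFO"):
        if policy not in ("FIFO", "LRU"):
            raise ValueError("policy must be 'FIFO' or 'LRU'")
        self.size = size
        self.policy = policy
        self.latches = OrderedDict()   # rob_index -> value, eviction victim first

    def write(self, rob_index, value):
        """Insert a newly produced result, evicting one entry if the set is full."""
        if rob_index in self.latches:
            del self.latches[rob_index]
        elif len(self.latches) >= self.size:
            self.latches.popitem(last=False)
        self.latches[rob_index] = value

    def lookup(self, rob_index):
        """Return the cached value, or None on a miss."""
        if rob_index not in self.latches:
            return None
        if self.policy == "LRU":
            self.latches.move_to_end(rob_index)   # a read refreshes the entry
        return self.latches[rob_index]


def demo(policy):
    rl = RetentionLatches(size=2, policy=policy)
    rl.write(1, 10)
    rl.write(2, 20)
    rl.lookup(1)       # touch entry 1
    rl.write(3, 30)    # forces one eviction
    return rl.lookup(1), rl.lookup(2), rl.lookup(3)


fifo_result = demo("FIFO")
lru_result = demo("LRU")
print(fifo_result, lru_result)
```

Under FIFO the read of entry 1 does not protect it, so it is evicted first; under LRU the same read keeps it alive and entry 2 is evicted instead.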
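Hit Ratios to Retention Latches
[Figure: bar chart of hit ratios per benchmark. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg. Average hit ratio labels: 42%, 55%, 56%, 62%]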
Accessing Retention Latch Entries
- The ROB index is used as a unique key to search the Retention Latches for result values
- Unique keys must be maintained even in the presence of:
- Reuse of a ROB slot: not a problem for FIFO; for LRU, simply flush the RL entry at commit time
- Branch mispredictions
Handling Branch Mispredictions
- Selective RL flushing: retention latch entries that are on the mispredicted path are flushed; uses branch tags; complicated implementation
- Complete RL flushing: all retention latch entries are flushed; very simple implementation; performance drop is only 1.5% compared to selective flushing
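The two flushing options can be contrasted in a few lines. This sketch assumes, purely for illustration, that each retention latch entry carries a single branch tag and that a misprediction squashes a known set of tags; the slides do not specify the tag format, and the function names are hypothetical.

```python
def complete_flush(latches):
    """Complete RL flushing: every entry is dropped on a misprediction."""
    return {}


def selective_flush(latches, squashed_tags):
    """Selective RL flushing: drop only entries tagged with a squashed branch."""
    return {
        rob_index: (value, tag)
        for rob_index, (value, tag) in latches.items()
        if tag not in squashed_tags
    }


# rob_index -> (value, branch_tag); tag 0 stands for the correct path here
latches = {12: (7, 0), 13: (9, 1), 14: (4, 2)}

after_selective = selective_flush(latches, squashed_tags={1, 2})
after_complete = complete_flush(latches)
print(after_selective, after_complete)
```

Selective flushing keeps the still-valid entry for ROB slot 12 at the cost of tag storage and comparison; complete flushing needs neither, which matches the slide's trade-off.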
Misprediction Handling: Performance
[Figure: IPC chart; average IPC drop label: 1.5%]
Scenario 1: Traditional Design
Instruction: ADD R1, R2, R3
[Figure sequence: eight-step build showing a simplified IDB entry, the Rename Table, the ROB, and the ARF. Recoverable content:]
- Simplified IDB entry #1 at the start: Instruction = ADD, Src1 reg. = 2, Src2 reg. = 3, ROB index = 5; Src1 valid, Src1 value, Src2 valid, and Src2 value are unknown (?)
- The Rename Table (columns: Arch., ROB#/Phys., plus a ROB/ARF indicator) is looked up for the source registers
- Src1, value available: the ROB entry at ROB# 12 holds Phys. valid = 1 and Phys. value = 7; the IDB entry gets Src1 valid = 1 and Src1 value = 7
- Src1, value not available: the ROB entry at ROB# 12 holds Phys. valid = 0 and an unknown value; the IDB entry gets Src1 valid = 0 and Src1 value stays unknown
- Src2: the ARF entry for architectural register 3 holds the value 43; the IDB entry gets Src2 value = 43
Scenario 2: Simplified ROB with RLs
Instruction: ADD R1, R2, R3
[Figure sequence: nine-step build showing a simplified IDB entry, the Rename Table, the Retention Latches, the ROB, and the ARF. Recoverable content:]
- Simplified IDB entry #1 at the start: Instruction = ADD, Src1 reg. = 2, Src2 reg. = 3, ROB index = 5; Src1 valid, Src1 value, Src2 valid, and Src2 value are unknown (?)
- The Rename Table (columns: Arch., ROB#/Phys., plus a ROB/ARF indicator) is looked up for the source registers
- Src1, RL hit: the Retention Latches hold an entry with ROB# 12 and Phys. value = 7; the IDB entry gets Src1 valid = 1 and Src1 value = 7
- Src1, RL miss: the Retention Latches have no matching entry (MISS); the IDB entry gets Src1 valid = 0; the Phys. valid and Phys. value fields of ROB entry 12 are marked X (don't care); the SLF bit of that ROB entry changes from 0 to 1
- Src2: the ARF entry for architectural register 3 holds the value 43; the IDB entry gets Src2 value = 43
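The Scenario 2 walkthrough can be replayed in code. The sketch below is a simplified reading of the slides, not the authors' implementation: the function name, the use of dictionaries for the Rename Table, ARF, Retention Latches, and SLF bits, and the convention that a register missing from the Rename Table lives in the ARF are all assumptions. The concrete numbers (ROB# 12, values 7 and 43) are the ones shown on the slides.

```python
def read_source(arch_reg, rename_table, arf, retention_latches, slf_bits):
    """Return (valid, value) for one source operand at dispatch time.

    rename_table maps an architectural register to the ROB index of its
    in-flight producer; a missing key means the value is in the ARF.
    """
    rob_index = rename_table.get(arch_reg)
    if rob_index is None:
        return True, arf[arch_reg]                   # value read from the ARF
    if rob_index in retention_latches:
        return True, retention_latches[rob_index]    # RL hit
    slf_bits[rob_index] = 1                          # RL miss: request late forwarding
    return False, None


rename_table = {2: 12}   # R2 is renamed to ROB slot 12
arf = {3: 43}            # R3 is read from the ARF
slf_bits = {12: 0}

hit = read_source(2, rename_table, arf, {12: 7}, slf_bits)
from_arf = read_source(3, rename_table, arf, {12: 7}, slf_bits)
slf_before_miss = slf_bits[12]
miss = read_source(2, rename_table, arf, {}, slf_bits)
print(hit, from_arf, miss, slf_bits)
```

Note that the ROB itself is never read for a source value in this path, which is what allows its source read ports to be removed.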
Experimental Setup: AccuPower (DATE’02)
[Figure: tool flow diagram. Inputs: compiled SPEC benchmarks, datapath specs, VLSI layout data. Components: microarchitectural simulator (rooted in SimpleScalar), SPICE deck and SPICE, energy/power estimator. Intermediate data: transition counts and context information, SPICE measures of energy per transition. Outputs: performance stats, power/energy stats]
Configuration of the Simulated System
- Machine width: 4-way
- Issue Queue: 32 entries
- Reorder Buffer: 96 entries
- Load/Store Queue: 32 entries
- Simulated the execution of SPEC2000 benchmarks
Assumed Timings
- Timing of the baseline model: D1: Rename Table lookup for ROB index; D2 and D3: source operand read from the ROB
- Timing of the simplified ROB: D1: Rename Table lookup for ROB index; D2: associative lookup of the operand from the retention latches using the ROB index as a key (smaller delay: few latches)
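Experimental Results: Effect on Performance
[Figure: bar chart of performance drop (%) per benchmark. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg. Average IPC drop labels: 0.1%, -1.6%, -1.0%, -2.3%]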
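Experimental Results: Effect on Performance
[Figure: bar chart of performance drop (%) per benchmark. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg. Average IPC drop labels: 3.3%, 1.7%, 2.3%, 1.0%]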
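Experimental Results: Effect on Power
[Figure: bar chart of power savings (%) per benchmark. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg. Average savings labels: 30%, 23.4%, 22.2%, 21%, 20.2%]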
Summary of Results
- Significantly reduced ROB complexity and power dissipation
- 45% area reduction
- 20% to 30% power reduction across SPEC 2000 benchmarks
- Actual IPC improvements: 1.6% to 2.3% gain across SPEC benchmarks
- IPC gains come from the 1-cycle access to the RL (vs. the 2 cycles that would be needed for ROB access)
Related Work
- Value-Aging Buffer (Hu & Martonosi, PACS 2000)
- Forwarding Buffer and Clustered Register Cache (Borch et al., HPCA’02)
- Multiple Register Banks (Cruz et al., ISCA’00; Balasubramonian et al., MICRO’01)
- See the paper for discussion
Conclusions
- Typical source operand location statistics can be successfully exploited to reduce ROB complexity
- Significant reduction in ROB area and power: no ROB ports are needed for reading source operands
- IPC gains are possible because a small, low-ported Retention Latch set supplies cached operand values in a single cycle