1
Reducing Reorder Buffer Complexity Through Selective Operand Caching*
Gurhan Kucuk, Dmitry Ponomarev, Oguz Ergin, Kanad Ghose
Department of Computer Science, State University of New York, Binghamton, NY 13902-6000
http://www.cs.binghamton.edu/~lowpower
International Symposium on Low Power Electronics and Design (ISLPED’03), August 26th, 2003
*Supported in part by DARPA through the PAC-C program and NSF
2
Outline
  Reorder Buffer (ROB) complexities
  Motivation for the low-complexity ROB
  Low-complexity ROB (ICS’02)
  Improving the design using short-lived values
  Results
  Concluding remarks
3
P6 Style Superscalar Datapath
[Figure: datapath with Fetch (F1, F2), Decode/Dispatch (D1, D2), instruction dispatch into the issue queue (IQ), instruction issue to the function units (FU1, FU2, ..., FUm) in EX, LSQ and D-cache, ROB, Architectural Register File (ARF), and the result/status forwarding buses]
4
ROB Port Requirements for a W-way CPU
  Writeback: W write ports to write results
  Dispatch/Issue: 2W read ports to read the source operands
  Decode/Dispatch: W write ports to set up entries
  Commit: W read ports for instruction commitment
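The port breakdown on this slide can be tallied with a few lines of Python. This is only an illustrative sketch: the function names and the `keep_source_reads` flag are hypothetical, and the counts follow the slide's W / 2W / W / W breakdown rather than any measured design.

```python
def rob_port_requirements(width: int) -> dict:
    """Port counts for a W-way machine, following the slide's breakdown."""
    return {
        "writeback_write": width,     # W write ports to write results
        "source_read": 2 * width,     # 2W read ports for source operands
        "dispatch_write": width,      # W write ports to set up entries
        "commit_read": width,         # W read ports for commitment
    }


def total_ports(width: int, keep_source_reads: bool = True) -> int:
    """Total ROB ports, optionally without the 2W source-operand read ports."""
    ports = rob_port_requirements(width)
    if not keep_source_reads:
        ports.pop("source_read")
    return sum(ports.values())
```

Under this breakdown, calling `total_ports(4)` and `total_ports(4, keep_source_reads=False)` shows how much of the port budget of a 4-way machine goes to source-operand reads.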
5
Where are the Source Values Coming From?
[Figure: the P6-style datapath (IQ, function units, ARF, ROB, LSQ, D-cache, forwarding buses) with three source-operand paths labeled 1, 2, and 3]
6
Where are the Source Values Coming From?
96-entry ROB, 4-way processor, SPEC2K benchmarks
[Chart: breakdown of source operand origins into three categories: 62%, 32%, and 6%]
7
How Efficiently are the Ports Used?
  Writeback: W write ports to write results
  Dispatch/Issue: 2W read ports to read the source operands
  Decode/Dispatch: W write ports to set up entries
  Commit: W read ports for instruction commitment
  Annotation on the slide: 6%
8
Our Solution: Elimination of Read Ports
[Figure: the P6-style datapath with the three source-operand paths labeled 1, 2, and 3]
10
Our Solution: Elimination of Read Ports
[Figure: the same datapath with only the source-operand paths labeled 1 and 3 remaining]
11
Comparison of ROB Bitcells (0.18 µm, TSMC)
[Figure: layout of a 32-ported SRAM bitcell next to the layout of a 16-ported SRAM bitcell]
  Area reduction: 71%
  Shorter bitlines and wordlines
12
Completely Eliminating the Source Read Ports on the ROB
  Problem: an instruction that requires a value stored in the ROB cannot be issued and will stall.
  Solution: forward the value to the waiting instruction at the time the value is committed (LATE FORWARDING).
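A toy model may help make late forwarding concrete. The sketch below is an assumption-laden illustration, not the authors' design: `IQEntry`, `broadcast`, and `commit` are hypothetical names, and a ROB entry is reduced to a `(tag, arch_reg, value)` tuple. The point it shows is that the value is driven onto the forwarding bus a second time, at commit, so a consumer that missed the original broadcast can still capture it.

```python
from dataclasses import dataclass, field


@dataclass
class IQEntry:
    name: str
    waiting_tags: set                       # source tags not yet captured
    operands: dict = field(default_factory=dict)

    def ready(self) -> bool:
        return not self.waiting_tags


def broadcast(issue_queue, tag, value):
    """Drive (tag, value) on a forwarding bus; matching entries capture it."""
    for entry in issue_queue:
        if tag in entry.waiting_tags:
            entry.waiting_tags.discard(tag)
            entry.operands[tag] = value


def commit(rob_entry, arf, issue_queue):
    """Commit one ROB entry: update the ARF and late-forward the value."""
    tag, arch_reg, value = rob_entry
    arf[arch_reg] = value
    broadcast(issue_queue, tag, value)      # late forwarding at commit time
```

For example, an entry created as `IQEntry("SUB", {31})` stays not ready until `commit((31, 1, 42), arf, issue_queue)` runs, after which it holds the operand for tag 31.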
13
Late Forwarding: Use the Normal Forwarding Buses!
[Figure: the P6-style datapath (IQ, function units, ARF, ROB, LSQ, D-cache) with the result/status forwarding buses highlighted]
15
Improving Performance
  Cache recently generated values in a set of RETENTION LATCHES (RL)
  Retention latches are SMALL and FAST
  Only 8 to 16 latches are needed in the set
  The entire set has 1 or 2 read ports
16
Datapath with the Retention Latches
[Figure: the datapath before the retention latches are added: IQ, function units (FU1, FU2, ..., FUm), ARF, ROB, LSQ, D-cache, and result/status forwarding buses]
17
Datapath with the Retention Latches
[Figure: the same datapath with a RETENTION LATCHES block added alongside the ROB]
18
Retention Latch Management Strategies
  Policy    8-entry RL hit rate    16-entry RL hit rate
  FIFO      42%                    55%
  LRU       56%                    62%
  Random replacement: worse performance than FIFO
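The difference between the FIFO and LRU policies listed above can be sketched in software. This is a behavioral model only, under stated assumptions: the class name, its methods, and the use of physical register tags as keys are hypothetical, and real retention latches are a small hardware structure rather than a dictionary.

```python
from collections import OrderedDict


class RetentionLatches:
    """Small tag -> value store with FIFO or LRU replacement."""

    def __init__(self, size=8, policy="FIFO"):
        if policy not in ("FIFO", "LRU"):
            raise ValueError("policy must be 'FIFO' or 'LRU'")
        self.size = size
        self.policy = policy
        self.entries = OrderedDict()   # first item is the next victim
        self.lookups = 0
        self.hits = 0

    def write(self, tag, value):
        """Insert a result, evicting the victim if the set is full."""
        if tag not in self.entries and len(self.entries) >= self.size:
            self.entries.popitem(last=False)
        self.entries[tag] = value

    def read(self, tag):
        """Return the cached value, or None on a miss."""
        self.lookups += 1
        if tag not in self.entries:
            return None
        self.hits += 1
        if self.policy == "LRU":
            self.entries.move_to_end(tag)   # a read refreshes recency
        return self.entries[tag]
```

With FIFO, an entry is evicted in insertion order no matter how often it is read; with LRU, a read moves the entry to the back of the eviction order, which is consistent with the higher hit rates the slide reports for LRU.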
19
Advantages of Using Retention Latches
  Reduces energy dissipation in the ROB (avoids creating a localized hot spot)
  Reduces associated performance losses
  Reduces ROB complexity (smaller floor plan, easier validation)
20
Improving Retention Latch Management
  PROBLEM: every generated result is written into the RLs unconditionally, whether or not it could ever be read from them.
  CONSEQUENCE: the RL array is not used efficiently, and the performance loss is still noticeable.
  SOLUTION: identify the values that will never be read after the cycle in which they are generated, and avoid writing them into the RLs.
21
Short-Lived Values
  Our definition: a value is short-lived if its destination register has been renamed again by the time the result is generated.
  Short-lived values are identified one cycle before the result writeback.
  Original code           After the RENAMER
  LOAD R1, R2, 100        LOAD P31, R2, 100
  SUB  R5, R1, R3         SUB  P32, P31, R3
  ADD  R1, R5, R4         ADD  P33, P32, R4
22
Key Idea: Do not cache short-lived values
  AVOID WRITING SHORT-LIVED VALUES INTO THE RETENTION LATCHES
  Reasons:
    Short-lived values are forwarded directly to all potential consumers in the issue queue
    No instruction will ever consume a short-lived value from the retention latches
  Results: increased RL hit ratios and better overall performance
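The effect of skipping short-lived values can be seen on a tiny, made-up trace. The sketch below assumes a FIFO set of latches and a hypothetical event format (`("write", tag, short_lived)` and `("read", tag)`); neither the trace nor the function name comes from the presentation.

```python
from collections import deque


def count_hits(trace, size, skip_short_lived):
    """Replay a trace against FIFO latches and count read hits."""
    latches = deque(maxlen=size)   # appending to a full deque drops the oldest tag
    hits = 0
    for event in trace:
        if event[0] == "write":
            _, tag, short_lived = event
            if skip_short_lived and short_lived:
                continue           # the key idea: do not cache this value
            latches.append(tag)
        else:
            _, tag = event
            if tag in latches:
                hits += 1
    return hits


# Hypothetical trace: one long-lived value followed by two short-lived ones.
trace = [
    ("write", 1, False),
    ("write", 2, True),
    ("write", 3, True),
    ("read", 1),
]
```

With two latches, writing everything pushes value 1 out before it is read, while skipping the short-lived writes keeps it available. That is the intuition behind the slide's claim of higher hit ratios for the same number of latches.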
23
The Good News: 80%+ of the Values are Short-Lived
96-entry ROB, 4-way processor
[Chart: percentage of short-lived values; vertical axis in %]
24
Identifying Short-Lived Values
  Maintain the bit-vector Renamed, set by the renamer at the time of renaming.
  Original code           After renaming
  LOAD R1, R2, 100        LOAD P31, R2, 100
  SUB  R5, R1, R3         SUB  P32, P31, R3
  ADD  R1, R5, R4         ADD  P33, P32, R4
  Arch. Reg    Phys. Reg    Location (0 = ROB, 1 = ARF)
  0            0            1
  1            31           0
  2            2            1
  3            3            1
  4            4            1
  5            32           0
  Renamed bit-vector as drawn on the slide: Renamed[31] = 1
25
Identifying Short-Lived Values
  Maintain the bit-vector Renamed, set by the renamer at the time of renaming.
  Original code           After renaming
  LOAD R1, R2, 100        LOAD P31, R2, 100
  SUB  R5, R1, R3         SUB  P32, P31, R3
  ADD  R1, R5, R4         ADD  P33, P32, R4
  Arch. Reg    Phys. Reg    Location (0 = ROB, 1 = ARF)
  0            0            1
  1            33           0
  2            2            1
  3            3            1
  4            4            1
  5            32           0
  Renamed bit-vector as drawn on the slide: Renamed[31] = 1
26
Identifying Short-Lived Values
  The Renamed bit is checked one cycle before writeback.
  The value produced by LOAD is short-lived because Renamed[31] = 1.
  Original code           After renaming
  LOAD R1, R2, 100        LOAD P31, R2, 100
  SUB  R5, R1, R3         SUB  P32, P31, R3
  ADD  R1, R5, R4         ADD  P33, P32, R4
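The bookkeeping on these slides can be replayed with a small renamer model. This is a sketch under assumptions: the `Renamer` class and its fields are hypothetical, source-operand renaming is omitted, and the Renamed bit is set only when the previous mapping of the destination still points into the ROB, which is one reading of the slides rather than a confirmed detail.

```python
class Renamer:
    """Destination renaming with a Renamed bit per physical register."""

    def __init__(self, num_arch_regs, first_free_phys):
        # Initially every architectural register maps to itself in the ARF.
        self.map = {r: r for r in range(num_arch_regs)}
        self.in_rob = {r: False for r in range(num_arch_regs)}
        self.renamed = {}              # phys reg -> Renamed bit
        self.next_phys = first_free_phys

    def rename_dest(self, arch_reg):
        """Allocate a new physical register for arch_reg's next value."""
        if self.in_rob[arch_reg]:
            # The older in-flight value has been superseded.
            self.renamed[self.map[arch_reg]] = True
        phys = self.next_phys
        self.next_phys += 1
        self.map[arch_reg] = phys
        self.in_rob[arch_reg] = True
        self.renamed[phys] = False
        return phys

    def is_short_lived(self, phys):
        """Check made one cycle before the result is written back."""
        return self.renamed.get(phys, False)
```

Replaying the slide's sequence with `Renamer(6, 31)`, the LOAD gets P31, the SUB gets P32, and renaming R1 again for the ADD allocates P33 and sets Renamed[31], so the LOAD's result would be treated as short-lived if it is produced after that point.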
27
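Hit Ratios to Retention Latches
[Chart: RL hit ratios per benchmark. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.]
Average hit ratio values annotated on the chart: 46% and 73%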
28
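Experimental Results: Effect on Performance
[Chart: IPC per benchmark. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.]
Average IPC drop values annotated on the chart: 1.7%, 1.7%, 0.5%, 1.1%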
29
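Experimental Results: Effect on ROB Power
[Chart: energy (pJ) per benchmark. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.]
Average savings values annotated on the chart: 15.9%, 13.7%, 15.0%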
30
Conclusions
  We proposed a mechanism to further improve the performance and reduce the complexity of a processor that uses retention latches and eliminates the ROB source read ports
  The idea is to avoid caching the short-lived result values in the retention latches
  Both the retention latch hit ratio and the overall performance improved
  Alternatively, fewer retention latches can be used with the same performance
31
THANK YOU!
LOW POWER RESEARCH GROUP
Department of Computer Science, State University of New York, Binghamton, NY 13902-6000
http://www.cs.binghamton.edu/~lowpower
International Symposium on Low Power Electronics and Design (ISLPED’03), August 27th, 2003
*Supported in part by DARPA through the PAC-C program and NSF