1
ICCD’03 1 Distributed Reorder Buffer Schemes for Low Power*
Gurhan Kucuk, Oguz Ergin, Dmitry Ponomarev, Kanad Ghose
Department of Computer Science, State University of New York, Binghamton, NY 13902-6000
http://www.cs.binghamton.edu/~lowpower
21st International Conference on Computer Design (ICCD’03), October 14th, 2003
*Supported in part by DARPA through the PAC-C program and NSF
2
ICCD’03 2 Outline
– Reorder Buffer (ROB) complexities
– Motivation for the low-complexity ROB
– Low-complexity ROB designs
    Fully Distributed ROB
    Retention Latches (RLs) revisited (ICS’02)
    Combined Scheme
– Results
– Concluding remarks
3
ICCD’03 3 P6-style Superscalar Datapath
[Figure: pipeline with Fetch (F1, F2), Decode/Dispatch (D1, D2), instruction dispatch into the IQ, instruction issue to Function Units FU1, FU2, ..., FUm (EX), the ROB, the Architectural Register File (ARF), and the result/status forwarding buses]
4
ICCD’03 4 PPC 620-style Superscalar Datapath
[Figure: pipeline with Fetch (F1, F2), Decode/Dispatch (D1, D2), instruction dispatch into the IQ, instruction issue to Function Units FU1, FU2, ..., FUm (EX), the ROB with a separate RB, the Architectural Register File (ARF), and the result/status forwarding buses]
5
ICCD’03 5 ROB Port Requirements for a W-way CPU
    Decode/Dispatch: W write ports to set up entries
    Dispatch/Issue: 2W read ports to read the source operands
    Writeback: W write ports to write results
    Commit: W read ports for instruction commitment
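The port counts on this slide can be tallied for a concrete machine width. The sketch below is only an illustration of that arithmetic; the function name and dictionary keys are hypothetical and do not come from the presentation.

```python
def rob_port_requirements(width: int) -> dict[str, int]:
    """Port counts of a centralized ROB in a W-way machine, as listed on the slide."""
    return {
        "dispatch_write": width,     # W write ports to set up entries
        "source_read": 2 * width,    # 2W read ports to read the source operands
        "writeback_write": width,    # W write ports to write results
        "commit_read": width,        # W read ports for instruction commitment
    }


ports = rob_port_requirements(4)     # the 4-way width used later in the talk
print(ports, "total:", sum(ports.values()))
```

For a 4-way machine the four stages add up to 5W = 20 ports, of which the source-operand read ports account for 8.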
6
ICCD’03 6 What This Work is All About
– ROB complexity reduction is important for reducing power and improving performance
    ROB dissipates a non-trivial fraction of the total chip power
    ROB accesses stretch over several cycles
– Goal of this work: reduce the complexity and power dissipation of the ROB without sacrificing performance
7
ICCD’03 7 Comparison of ROB Bitcells (0.18µ, TSMC)
[Figure: layout of a 32-ported SRAM bitcell next to the layout of a 16-ported SRAM bitcell]
Area reduction: 71%
Shorter bitlines and wordlines
8
ICCD’03 8 P6-style Superscalar Datapath
[Figure: pipeline with Fetch (F1, F2), Decode/Dispatch (D1, D2), instruction dispatch into the IQ, instruction issue to Function Units FU1, FU2, ..., FUm (EX), the ROB, the Architectural Register File (ARF), and the result/status forwarding buses]
9
ICCD’03 9 Reorder Buffer Distribution
[Figure: the same datapath with ROB Components (ROBCs) ROBC 1, ROBC 2, ..., ROBC m placed at the function units FU1, FU2, ..., FUm; the ROB holds pointers to entries within the ROBCs]
10
ICCD’03 10 Impact of Distributing the ROB
– Each ROBC is effectively a small Rename Buffer
    Smaller read/write access energy
    Faster access time
– Distributing physical storage in this manner allows FUs to use shorter buses to write their respective ROBCs
    Lower energy dissipation on the wires (we have NOT accounted for energy savings from using shorter wires)
– Fits in naturally with a multi-clustered datapath design
11
ICCD’03 11 Problems with the earlier Multi-banked RF Schemes
– Port conflicts result in performance penalty
– Interconnection network is more complex
14
ICCD’03 14 Problems with the earlier Multi-banked RF Schemes, and some good news! (built up over slides 12 to 14)
– Port conflicts result in performance penalty
    Totally avoid write port conflicts
    Minimize read port conflicts at commitment
    Totally avoid source read port conflicts
– Interconnection network is more complex
    Completely remove source read ports
15
ICCD’03 15 ROBCs Assigned to Each Function Unit
[Figure: a centralized ROB with entries 1, 2, 3, 4, ..., n, each holding an (FU_id, offset) pointer such as 11, 21 or m1, next to the distributed ROBCs #1, #2, ..., #m, one per function unit FU #1, FU #2, ..., FU #m]
16
ICCD’03 16 Good News: Write port conflicts are avoided
[Figure: the centralized ROB of (FU_id, offset) pointers and the distributed ROBCs #1, #2, ..., #m; each function unit FU #1, FU #2, ..., FU #m writes only its own ROBC through 1 write port]
17
ICCD’03 17-26 Round Robin Scheduling at Dispatch Time
[Animation over slides 17 to 26: a centralized ROB with entries 1, 2, 3, 4, 5, ..., n, each holding an (FU_id, offset) pointer, and four distributed Int ADD ROBCs (#1 to #4), each drawn with entries 1 and 2]
    ADD: an entry is reserved in Int ADD ROBC #1; the centralized ROB entry for ADD holds the pointer 11 (FU_id 1, offset 1)
    SUB: an entry is reserved in Int ADD ROBC #2; the centralized ROB entry for SUB holds the pointer 21
    AND: an entry is reserved in Int ADD ROBC #3; the centralized ROB entry for AND holds the pointer 31
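The dispatch-time walk-through above can be mimicked in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: the class and method names are hypothetical, offsets are handed out by simple occupancy counting (no entries are freed here), and skipping a full ROBC in favor of the next one of the same type is an assumption the slides do not spell out.

```python
class DistributedROB:
    """Centralized FIFO of (fu_id, offset) pointers plus per-FU ROBC occupancy."""

    def __init__(self, robcs_by_type):
        # robcs_by_type maps an instruction class to a list of (fu_id, capacity).
        self.robcs_by_type = robcs_by_type
        self.cursor = {t: 0 for t in robcs_by_type}          # round-robin position per class
        self.occupancy = {fu_id: 0
                          for robcs in robcs_by_type.values()
                          for fu_id, _ in robcs}
        self.central = []                                    # pointers in program order

    def dispatch(self, fu_type):
        """Reserve a ROBC entry; return (fu_id, offset), or None if dispatch blocks."""
        robcs = self.robcs_by_type[fu_type]
        start = self.cursor[fu_type]
        for step in range(len(robcs)):
            idx = (start + step) % len(robcs)
            fu_id, capacity = robcs[idx]
            if self.occupancy[fu_id] < capacity:
                self.occupancy[fu_id] += 1
                pointer = (fu_id, self.occupancy[fu_id])     # 1-based offset, as on the slides
                self.central.append(pointer)
                self.cursor[fu_type] = (idx + 1) % len(robcs)
                return pointer
        return None                                          # every ROBC of this type is full


rob = DistributedROB({
    "int_add": [(1, 2), (2, 2), (3, 2), (4, 2)],   # Int ADD ROBCs #1 to #4, two entries each
    "int_muldiv": [(5, 2)],                        # Int MUL/DIV ROBC #5
})
for op in ("int_add", "int_add", "int_add"):       # ADD, SUB, AND
    print(rob.dispatch(op))
```

With the three integer instructions of the animation, the sketch hands out the pointers (1, 1), (2, 1) and (3, 1), matching the 11, 21, 31 pattern drawn in the centralized ROB.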
27
ICCD’03 27 Good News: Avoiding Read Port Conflicts
[Figure: the centralized ROB of (FU_id, offset) pointers and the distributed ROBCs; ADD (pointer 11), SUB (pointer 21) and AND (pointer 31) reside in different ROBCs, and each ROBC has 1 read port to commitment]
28
ICCD’03 28-33 Round Robin Scheduling at Dispatch Time
[Animation over slides 28 to 33: the centralized ROB already holds the pointers for ADD, SUB and AND; a single Int MUL/DIV ROBC #5 is drawn with entries 1 and 2]
    MUL: an entry is reserved in Int MUL/DIV ROBC #5; the centralized ROB entry for MUL holds the pointer 51
    DIV: the next entry is reserved in the same ROBC #5; the centralized ROB entry for DIV holds the pointer 52
34
ICCD’03 34 Read Port Conflicts at Commitment
[Figure: MUL (pointer 51) and DIV (pointer 52) both reside in Int MUL/DIV ROBC #5, which has 1 read port to commitment]
CONFLICT: if MUL and DIV want to commit in the same cycle
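The commitment conflict can be illustrated with a small model of one commit cycle. The sketch assumes in-order commitment of up to `width` instructions per cycle and exactly one commit read port per ROBC; the function name and the `done` set are hypothetical helpers for illustration only.

```python
def commit_one_cycle(central, done, width=4):
    """Commit up to `width` of the oldest completed entries.

    central: list of (fu_id, offset) pointers in program order (mutated in place).
    done:    set of pointers whose results have been written back.
    Each ROBC has a single read port, so it can supply one entry per cycle.
    """
    ports_used = set()
    committed = []
    while central and len(committed) < width:
        fu_id, offset = central[0]
        if (fu_id, offset) not in done:
            break                 # oldest instruction has not completed yet
        if fu_id in ports_used:
            break                 # read port conflict: this ROBC was already read this cycle
        ports_used.add(fu_id)
        committed.append(central.pop(0))
    return committed


# MUL and DIV share ROBC #5, followed by an instruction in ROBC #1.
central = [(5, 1), (5, 2), (1, 1)]
done = set(central)
print(commit_one_cycle(central, done))   # first cycle
print(commit_one_cycle(central, done))   # second cycle
```

In this model MUL commits alone in the first cycle, because DIV needs the same read port, and DIV commits together with the following instruction in the next cycle.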
35
ICCD’03 35-37 Distributed ROB Design 1: with source read ports
Ports on each ROBC (built up over slides 35 to 37):
    Writeback: 1 write port to write results
    Dispatch/Issue: 1 read port to read the source operands
    Commit: 1 read port for instruction commitment
38
ICCD’03 38 Experimental Setup: the AccuPower (DATE’02)
[Figure: tool flow with the following blocks and labels]
    Inputs: compiled SPEC benchmarks, datapath specs, VLSI layout data
    Microarchitectural Simulator (rooted in SimpleScalar): performance stats; transition counts, context information
    SPICE deck, SPICE: SPICE measures of energy per transition
    Energy/Power Estimator: power/energy stats
39
ICCD’03 39 Configuration of the Simulated System
    Machine width: 4-way
    Issue Queue: 32 entries
    Reorder Buffer: 96 entries
    Load/Store Queue: 32 entries
Simulated the execution of SPEC2000 benchmarks
40
ICCD’03 40-42 Peak/Average demands on the number of ROBC entries
ROBC type                 SPEC 2000 Integer Avg.   SPEC 2000 FP Avg.   SPEC 2000 Avg.   Entries assigned
                          peak     avg.            peak     avg.       peak     avg.    to each ROBC
Int ADD #1, #2, #3, #4    16.9     4.4             14.2     4.9        15.7     4.6     8
Int MUL/DIV               4.1      0.1             3.2      0.8        3.7      0.4     4
FP ADD #1, #2, #3, #4     1.6      0.04            3.8      0.6        2.6      0.3     4
FP MUL/DIV                3.8      0.04            6.7      1.1        5.0      0.5     4
Load                      28.6     9.3             23.5     7.5        26.4     8.5     16
Total: 8+8+8+8+4+4+4+4+4+4+16 = 72 entries (the 8_4_4_4_16 configuration)
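The configuration names encode the entries given to each ROBC type, and the totals follow from the number of ROBCs of each type (four Int ADD, one Int MUL/DIV, four FP ADD, one FP MUL/DIV, one Load). A short sketch of that arithmetic, with a hypothetical helper name:

```python
# ROBCs per type, in the order used by the configuration names:
# Int ADD, Int MUL/DIV, FP ADD, FP MUL/DIV, Load.
ROBC_COUNTS = (4, 1, 4, 1, 1)


def total_entries(config: str) -> int:
    """Total ROBC entries for a name such as '8_4_4_4_16'."""
    per_robc = [int(field) for field in config.split("_")]
    if len(per_robc) != len(ROBC_COUNTS):
        raise ValueError(f"expected {len(ROBC_COUNTS)} fields, got {len(per_robc)}")
    return sum(entries * count for entries, count in zip(per_robc, ROBC_COUNTS))


print(total_entries("8_4_4_4_16"))    # 72
print(total_entries("12_6_4_6_20"))   # 96
```

The second call uses the larger 12_6_4_6_20 configuration from the talk, whose total matches the 96-entry centralized ROB of the simulated baseline.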
43
ICCD’03 43-44 Percentage of cycles when dispatch blocks for 8_4_4_4_16
ROBC type                 SPEC 2000 Integer Avg.   SPEC 2000 FP Avg.   SPEC 2000 Avg.   Entries assigned to each ROBC
Int ADD #1, #2, #3, #4    0.9                      1.5                 1.2              8
Int MUL/DIV               0.1                      1.0                 0.5              4
FP ADD #1, #2, #3, #4     0                        0.1                 0                4
FP MUL/DIV                0                        0.8                 0.4              4
Load                      5.2                      1.9                 3.8              16
Total: 8+8+8+8+4+4+4+4+4+4+16 = 72 entries
Average IPC drop with the 8_4_4_4_16 configuration = 4.8%
45
ICCD’03 45 Reducing performance penalty: 12_6_4_6_20 Configuration
ROBC type                 SPEC 2000 Integer Avg.   SPEC 2000 FP Avg.   SPEC 2000 Avg.   Entries assigned to each ROBC
Int ADD #1, #2, #3, #4    0.9                      1.5                 1.2              12
Int MUL/DIV               0.1                      1.0                 0.5              6
FP ADD #1, #2, #3, #4     0                        0.1                 0                4
FP MUL/DIV                0                        0.8                 0.4              6
Load                      5.2                      1.9                 3.8              20
(The percentages are identical to those shown for the 8_4_4_4_16 configuration.)
Total: 12+12+12+12+6+4+4+4+4+6+20 = 96 entries (the 12_6_4_6_20 configuration)
46
ICCD’03 46 Performance Results for 12_6_4_6_20 Configuration
[Figure: bar chart of IPC per benchmark: gap, gcc, gzip, parser, perl, twolf, vortex, vpr, Int Avg., applu, art, mesa, mgrid, swim, wupwise, FP Avg.]
Average IPC drop with the 12_6_4_6_20 configuration = 2.4%
47
ICCD’03 47 Distributed ROB Design 1: with source read ports
    Writeback: 1 write port to write results
    Dispatch/Issue: 1 read port to read the source operands
    Commit: 1 read port for instruction commitment
48
ICCD’03 48-49 Eliminating All Source Read Ports
    Writeback: 1 write port to write results
    Commit: 1 read port for instruction commitment
    (the Dispatch/Issue read port for source operands, still shown on slide 48, is gone on slide 49)
50
ICCD’03 50 Where are the Source Values Coming From?
[Figure: the datapath (Fetch F1/F2, Decode/Dispatch D1/D2, IQ, function units FU1 ... FUm, ROB, ARF, result/status forwarding buses) with three sources of operand values marked 1, 2 and 3]
51
ICCD’03 51 Where are the Source Values Coming From?
96-entry ROB, 4-way processor, SPEC2K benchmarks
[Figure: chart with shares of 62%, 32% and 6%]
52
ICCD’03 52 How Efficiently are the Ports Used?
    Decode/Dispatch: W write ports to set up entries
    Dispatch/Issue: 2W read ports to read the source operands
    Writeback: W write ports to write results
    Commit: W read ports for instruction commitment
Slide annotation: 6%
53
ICCD’03 53-55 Our Solution: Elimination of Read Ports
[Animation over slides 53 to 55: the datapath (Fetch F1/F2, Decode/Dispatch D1/D2, IQ, function units FU1 ... FUm, ROB, ARF, result/status forwarding buses) with the operand sources marked 1, 2 and 3; on slide 55 only the markers 1 and 3 remain]
56
ICCD’03 56 Distributed Reorder Buffer Scheme
[Figure: the datapath with ROBC 1, ROBC 2, ..., ROBC m at the function units FU1, FU2, ..., FUm; the ROB holds pointers to entries within the ROBCs]
57
ICCD’03 57-58 Elimination of Source Read Ports
[Figure: the same distributed datapath; the ROB holds pointers to entries within the ROBCs]
59
ICCD’03 59 Completely Eliminating the Source Read Ports on the ROBCs
– The Problem: an instruction that requires a value stored in a ROBC cannot issue and will stall
– Solution: forward the value to the waiting instruction at the time the value is committed: LATE FORWARDING
60
ICCD’03 60-61 Late Forwarding: Use the Normal Forwarding Buses!
[Figure: the distributed datapath with ROBC 1, ROBC 2, ..., ROBC m; the ROB holds pointers to entries within the ROBCs; on slide 61 a path labeled Late Forwarding is added onto the result/status forwarding buses]
62
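ICCD’03 62 Performance Drop of Simplified ROBC Design
[Figure: bar chart of performance drop % per benchmark: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg., applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.; the values 37% and 17% are called out on the chart]
Average IPC drop: 9.6%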
63
ICCD’03 63 IPC Penalty: Source Value Not Accessible within the ROBC
[Figure: lifetime of a result value on a timeline: result generation time (forwarding), value within a ROBC, late forwarding/commitment, value within ARF]
64
ICCD’03 64 Improving IPC with No Read Ports
– Cache recently generated values in a set of RETENTION LATCHES (RLs)
– Retention Latches are SMALL and FAST
    Only 8 to 16 latches needed in the set
    Entire set has 1 or 2 read ports
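How a small set of retention latches could stand in for the removed source read ports is sketched below. This is an illustrative model, not the authors' design: the class, the `lookup_source` helper, the use of a result tag as the key, and the dictionary standing in for the ARF are all assumptions. The FIFO replacement and the default of eight latches follow the "Eight, 2-ported FIFO RLs" design named later in the talk.

```python
from collections import OrderedDict


class RetentionLatches:
    """Small FIFO cache of recently generated result values."""

    def __init__(self, size=8):
        self.size = size
        self.latches = OrderedDict()          # tag -> value, oldest first

    def write(self, tag, value):
        """Record a newly generated result, evicting the oldest one if full."""
        if tag not in self.latches and len(self.latches) >= self.size:
            self.latches.popitem(last=False)  # FIFO replacement
        self.latches[tag] = value

    def read(self, tag):
        return self.latches.get(tag)


def lookup_source(tag, latches, arf):
    """Where a dispatching instruction finds an already generated source value.

    arf is a dict of committed values keyed by the same (hypothetical) tag.
    """
    value = latches.read(tag)
    if value is not None:
        return ("retention latch", value)
    if tag in arf:
        return ("ARF", arf[tag])
    # Value sits only in a ROBC, which has no source read port.
    return ("wait for late forwarding", None)


rl = RetentionLatches(size=2)
for tag, value in (("t1", 10), ("t2", 20), ("t3", 30)):
    rl.write(tag, value)                      # t1 is evicted when t3 arrives
print(lookup_source("t3", rl, arf={}))
print(lookup_source("t1", rl, arf={}))
```

A value that has left the latches but is not yet committed is the only case that still stalls until late forwarding, which is the case the latches are meant to make rare.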
65
ICCD’03 65-66 Adding Retention Latches into the Picture
[Figure: the distributed datapath with ROBC 1, ROBC 2, ..., ROBC m, the ROB holding pointers to entries within the ROBCs, and the Late Forwarding path; on slide 66 a block labeled RETENTION LATCHES is added]
67
ICCD’03 67 Eliminating All Source Read Ports
    Writeback: 1 write port to write results
    Commit: 1 read port for instruction commitment
68
ICCD’03 68 Distributed ROB Design 2: with Retention Latches
    Writeback: 1 write port to write results
    Commit: 1 read port for instruction commitment
    Eight 2-ported FIFO RLs
69
ICCD’03 69 Performance Results for 12_6_4_6_20 Configuration
[Figure: bar chart of IPC per benchmark: gap, gcc, gzip, parser, perl, twolf, vortex, vpr, Int Avg., applu, art, mesa, mgrid, swim, wupwise, FP Avg.]
Average IPC drop with the 12_6_4_6_20 configuration = 2.4%
70
ICCD’03 70 Performance Results for 12_6_4_6_20 Configuration
[Figure: bar chart of IPC for the same benchmarks]
Average IPC drop with the 12_6_4_6_20 configuration = 1.7%
71
ICCD’03 71 Performance Results for 12_6_4_6_20 Configuration
[Figure: bar chart of IPC for the same benchmarks]
Average IPC drop with the 12_6_4_6_20 configuration = 3.8%
72
ICCD’03 72 Power Results for 12_6_4_6_20 Configuration
[Figure: bar chart of power savings % per benchmark: gap, gcc, gzip, parser, perl, twolf, vortex, vpr, Int Avg., applu, art, mesa, mgrid, swim, wupwise, FP Avg.]
Power savings called out on the slide: 49%, 47%, 23%
73
ICCD’03 73 Power Results for 12_6_4_6_20 Configuration (compared to the baseline case with 64-entry Rename Buffers)
[Figure: bar chart of power savings % for the same benchmarks]
Power savings called out on the slide: 39%, 37%, 20%
74
ICCD’03 74 Summary of Results
– Low performance degradation:
    1.7% IPC drop on average (compared to a 2-cycle ROB)
    3.8% IPC drop on average (compared to a 1-cycle ROB)
– ROB power savings:
    as high as 49% (compared to the P6-style datapath: 96-entry ROB)
    as high as 39% (compared to the Rename Buffer design: 96-entry ROB, 64-entry RB)
75
ICCD’03 75 Conclusions
– We introduced a conflict-free distributed Reorder Buffer design
– ROB power savings as high as 49% are realized with only a small (1.7%) performance penalty
– ROB complexity is drastically reduced by
    Distributing the ROB into multiple banks
    Reducing the port requirements to no more than 2 ports for each ROB component
76
ICCD’03 76 ~ Thank You~
78
ICCD’03 78 Related Work
– Replicated (Kessler, IEEE Micro) and distributed (Canal et al., HPCA’00 and Farkas et al., MICRO’97) RFs in a clustered organization
– Multiple Register Banks (Cruz et al., ISCA’00 and Balasubramonian et al., MICRO’01)
– Multiple Register Banks with an additional pipeline stage to avoid complex arbitration logic (Tseng et al., ISCA’03)
– Multiple Register Banks without write port conflicts (Wallace et al., PACT’96)
79
ICCD’03 79 ROB Port Requirements for a W-way CPU
    Decode/Dispatch: W write ports to set up entries
    Dispatch/Issue: 2W read ports to read the source operands
    Writeback: W write ports to write results
    Commit: W read ports for instruction commitment
80
ICCD’03 80 ROB Port Requirements for a W-way CPU
    Decode/Dispatch: 1 W-wide write port to set up entries
    Dispatch/Issue: 2W read ports to read the source operands
    Writeback: W write ports to write results
    Commit: 1 W-wide read port for instruction commitment
81
ICCD’03 81 Reducing ROB Power and Complexity
[Figure: the ROB and its physical registers (Phys. regs.)]
82
ICCD’03 82 Distribution
[Figure: the centralized ROB and its physical registers, split into components for LOAD, Int MUL, Int ADD 1 to 4, FP MUL and FP ADD 1 to 4]
Smaller structures: shorter bitlines, lower capacitive loading, etc. LESS POWER DISSIPATION!
83
ICCD’03 83 Dedicate FUs to ROBCs
[Figure: the centralized ROB with the components for LOAD, Int MUL, Int ADD 1 to 4, FP MUL and FP ADD 1 to 4]
Fewer ports: much smaller structures. LESS POWER DISSIPATION! + LESS COMPLEXITY!
84
ICCD’03 84 Fully Distributed Reorder Buffer Scheme
[Figure: the centralized ROB with the components for LOAD, Int MUL, Int ADD 1 to 4, FP MUL and FP ADD 1 to 4, now labeled ROBCs]
Fewer ports: much smaller structures. LESS POWER DISSIPATION! + LESS COMPLEXITY!
85
ICCD’03 85 Fully Distributed Reorder Buffer Scheme
86
ICCD’03 86 Fully Distributed Reorder Buffer Scheme
– Distributed ROB Components (ROBCs) are assigned to each Function Unit
    No write port conflicts at the writeback stage, and minimal read port conflicts at commitment: negligible performance penalty
    Each ROBC can be tailored to the needs of its FU: no overcommitment of resources, less complexity
– The FIFO structure that maintains pointers to the ROBCs remains centralized
87
ICCD’03 87-88 Fully Distributed Reorder Buffer Scheme
[Figure: a centralized ROB with entries 1, 2, 3, 4, ..., n, each holding an (FU_id, offset) pointer such as 11, 21 or m1, next to the distributed ROBCs #1, #2, ..., #m]
90
ICCD’03 90 Results for the Scheme with Retention Latches
[Figure: bar chart of power savings % per benchmark: gap, gcc, gzip, parser, perl, twolf, vortex, vpr, Int Avg., applu, art, mesa, mgrid, swim, wupwise, FP Avg.]
Power savings called out on the slide: 23%