Microarchitectural Techniques to Reduce Interconnect Power in Clustered Processors
Karthik Ramani, Naveen Muralimanohar, Rajeev Balasubramonian
University of Utah
Power and Temperature-Aware Microarchitecture, June 20th, 2004
Slide 2: Motivation
- Wire delays do not scale as well as their transistor counterparts, so future processors will be communication bound.
- Increased use of interconnects brings an increase in power dissipation.
- 50% of dynamic power goes to interconnect switching (Magen et al., SLIP '04).
- The MIT Raw processor's on-chip network consumes 36% of total chip power (Wang et al., 2003).
Slide 3: Interconnect Power
- Reducing power generally means an increase in latency.
- Dynamic power = a·C·V²·f (activity factor a, switched capacitance C, supply voltage V, clock frequency f).
- Methods: frequency scaling, voltage scaling, reducing the size of repeaters, reducing the number of repeaters.
A toy calculation of the dynamic-power relation follows.
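To make the tradeoff concrete, here is a minimal sketch of the dynamic-power equation; the activity factor, capacitance, voltage, and frequency values are illustrative placeholders, not figures from the talk.

```python
# Dynamic power of a switching wire: P = a * C * V^2 * f.
# All numeric values below are illustrative, not from the presentation.

def dynamic_power(a: float, c: float, v: float, f: float) -> float:
    """Activity factor a, switched capacitance c (F), supply voltage v (V),
    clock frequency f (Hz)."""
    return a * c * v ** 2 * f

base = dynamic_power(a=0.15, c=1e-9, v=1.2, f=2e9)
scaled = dynamic_power(a=0.15, c=1e-9, v=1.2 * 0.8, f=2e9 * 0.8)
print(f"baseline: {base:.3f} W, V and f scaled by 0.8: {scaled:.3f} W")
# Power is quadratic in V and linear in f, so scaling both by 0.8 cuts
# dynamic power to 0.8**3 (about 51%) of the baseline, at added latency.
```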
Slide 4: Power-Delay Tradeoff
- Conventional interconnect design is performance oriented: low latency, but high power dissipation.
- Power can be reduced by tolerating some delay penalty, either by reducing repeater size or by decreasing the number of repeaters; in both cases latency increases.
[Figure: wire and repeater schematics illustrating the two techniques.]
Slide 5: Power Reduction
[Figure from Banerjee et al., IEEE Transactions on Electron Devices, 2002.]
Slide 6: Impact of Power-Centric Design
- Delay-optimized case: wires optimized for delay.
- Power-optimized case: wires optimized for power.
- Performance difference between the two cases: 20%.
Slide 7: Heterogeneous Interconnects
- Proposed design: implement wires with varied characteristics.
- Delay-optimized interconnect.
- Power-optimized interconnect: latencies twice those of the delay-optimal wires, with an 80% reduction in power (by focusing on repeaters alone).
Slide 8: Outline
- Motivation & proposed solution
- Base architecture
- Interconnect transfers
- Results
- Conclusion & future work
Slide 9: Architecture for Evaluation
- A dynamically scheduled clustered model with 16 clusters, four FUs per cluster.
- Hierarchical interconnects: crossbar (1-cycle latency) and ring interconnect (4-cycle latency).
- Centralized front-end: I-cache & D-cache, LSQ, branch predictor.
[Figure: block diagram of the clustered processor.]
A simplified latency model for this hierarchy follows.
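The slide gives only the per-hop latencies, so the topology below (four groups of four clusters, each group sharing a crossbar, groups linked by a ring) is an assumption used to illustrate how inter-cluster latency accumulates; slide 10's 2-10 cycle range suggests the real design charges at least one extra cycle per transfer.

```python
# A simplified latency model for the hierarchical interconnect of slide 9.
# Assumed topology: 16 clusters in four groups of four; a 1-cycle crossbar
# inside each group, 4-cycle ring hops between neighbouring groups.

CLUSTERS_PER_GROUP = 4
NUM_GROUPS = 4          # 16 clusters total
XBAR_LATENCY = 1        # cycles (from the slide)
RING_HOP_LATENCY = 4    # cycles (from the slide)

def transfer_latency(src: int, dst: int) -> int:
    """Cycles to move a value from cluster src to cluster dst."""
    sg, dg = src // CLUSTERS_PER_GROUP, dst // CLUSTERS_PER_GROUP
    if sg == dg:
        return XBAR_LATENCY               # same group: one crossbar pass
    hops = min((dg - sg) % NUM_GROUPS, (sg - dg) % NUM_GROUPS)
    return XBAR_LATENCY + hops * RING_HOP_LATENCY + XBAR_LATENCY

print(transfer_latency(0, 3))    # same group: 1 cycle
print(transfer_latency(0, 15))   # one ring hop away: 6 cycles
print(transfer_latency(0, 10))   # two ring hops away: 10 cycles
```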
Slide 10: Simulator Parameters
- Simplescalar, with contention modeled in detail.
- 15-entry out-of-order issue queue in each cluster (int & fp each).
- 30 physical registers (int & fp each).
- In-flight window: 480 instructions.
- Inter-cluster latencies: 2-10 cycles delay-optimized, 4-20 cycles power-optimized.
Slide 11: Interconnect Transfers: Types
- Bypassed register values
- Ready register values
- Address transfers
- Store values
- Load values
Slide 12: Bypassed Register Values
- Operands produced in one cluster that are immediately required by another cluster.
- Criticality is based on two factors: the operand's arrival time at the cluster, and the actual issue time of the sourcing instruction.
- Criticality changes at runtime, so a dynamic predictor is needed.
[Figure: pipeline example: the producing instruction completes execution at cycle 120; the consumer instruction was dispatched at cycle 100.]
Slide 13: The Data Criticality Predictor
- A table indexed by the low-order bits of the instruction address, updated dynamically to indicate the criticality of data.
- The difference between arrival time and usage time is calculated for each operand of an instruction:
  - difference < threshold: critical;
  - difference > threshold: non-critical.
A minimal sketch of this predictor follows.
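Here is a minimal sketch of the predictor, assuming a direct-mapped table and a single threshold; the table size, update policy, and threshold value are not given on the slide and are made up here.

```python
# Data criticality predictor (slide 13): a table indexed by the low-order
# bits of the instruction address, trained on the gap between an operand's
# arrival and its use. Sizes and threshold are assumptions.

TABLE_BITS = 10
THRESHOLD = 5  # cycles; illustrative only

criticality_table = [True] * (1 << TABLE_BITS)  # start assuming critical

def _index(pc: int) -> int:
    return pc & ((1 << TABLE_BITS) - 1)   # low-order bits of the address

def update(pc: int, arrival_cycle: int, issue_cycle: int) -> None:
    """Train on an observed operand: a small arrival-to-use gap means the
    transfer was critical; a large gap means it was not."""
    gap = issue_cycle - arrival_cycle
    criticality_table[_index(pc)] = gap < THRESHOLD

def is_critical(pc: int) -> bool:
    """Predict whether this instruction's operand transfers are critical."""
    return criticality_table[_index(pc)]

update(pc=0x4BFC, arrival_cycle=120, issue_cycle=122)
print(is_critical(0x4BFC))  # True: a 2-cycle gap is below the threshold
```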
Slide 14: Ready Register Values
- Source operands that are already available at the time of dispatch.
- Premise: there is significant latency between dispatch and issue, so these transfers are latency tolerant and can use the power-optimized wires.
[Figure: pipeline example: the operand is ready at cycle 90; the consumer instruction is dispatched at cycle 100.]
Slide 15: Load & Store Data
- Store data: often non-critical and latency insensitive, so it uses the power-optimized network.
  - Impact of delayed stores (rare cases): dependent loads have to wait, and commit stalls if the store is at the head of the reorder buffer.
- Load data: critical! Often on the critical path; latency sensitive, so it uses the fast network.
Slide 16: Address Prediction
- 51% of effective address transfers are covered by high-confidence predictions; the computed address then travels only to verify the prediction.
[Figure: load/store path with an address predictor (AP) alongside the LSQ, register file, FU, and L1 cache.]
A sketch of one possible predictor follows.
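The talk does not say which address predictor is used, only that 51% of transfers are predicted with high confidence; the stride scheme and saturating-counter confidence below are one plausible sketch, not the authors' design.

```python
# A per-instruction stride address predictor with a confidence counter.
# The predictor organisation is an assumption; only the idea of sending
# high-confidence predicted addresses on slow wires comes from the talk.

from collections import defaultdict

class AddrEntry:
    def __init__(self) -> None:
        self.last_addr = 0
        self.stride = 0
        self.confidence = 0  # saturating counter, 0..3

table = defaultdict(AddrEntry)
HIGH_CONFIDENCE = 3

def predict(pc: int) -> tuple[int, bool]:
    """Return (predicted address, is the prediction high confidence?)."""
    e = table[pc]
    return e.last_addr + e.stride, e.confidence >= HIGH_CONFIDENCE

def train(pc: int, actual_addr: int) -> None:
    """Once the real effective address is computed, update the entry."""
    e = table[pc]
    stride = actual_addr - e.last_addr
    if stride == e.stride:
        e.confidence = min(e.confidence + 1, 3)
    else:
        e.stride, e.confidence = stride, 0
    e.last_addr = actual_addr

# With a high-confidence prediction the cache access starts early, and the
# computed address follows on the power-optimized wires for verification.
```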
Slide 17: Summary of Transfers

Critical:
- Load values
- Effective address transfers (unpredicted)
- Bypassed register values predicted critical

Non-Critical:
- Store values
- Effective address transfers (predicted)
- Bypassed register values predicted non-critical
- Ready register values

The routing decision this table implies is sketched below.
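Slide 17's table translates directly into a per-transfer routing decision; the function and parameter names in this sketch are mine, not the authors'.

```python
# Routing decision implied by slide 17: pick the wire set for a transfer.

FAST, SLOW = "delay-optimized", "power-optimized"

def choose_network(kind: str, *, predicted_critical: bool = True,
                   address_predicted: bool = False) -> str:
    if kind == "load_value":
        return FAST                      # loads sit on the critical path
    if kind in ("store_value", "ready_register"):
        return SLOW                      # latency tolerant
    if kind == "effective_address":
        # A high-confidence predicted address is sent only for verification.
        return SLOW if address_predicted else FAST
    if kind == "bypassed_register":
        return FAST if predicted_critical else SLOW
    raise ValueError(f"unknown transfer kind: {kind}")

print(choose_network("effective_address", address_predicted=True))    # slow
print(choose_network("bypassed_register", predicted_critical=False))  # slow
```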
Slide 18: Outline
- Motivation & proposed solution
- Base architecture
- Interconnect transfers
- Simulation results
- Conclusion
Slide 19: Methodology
Three cases for simulation:
- High-performance case: a clustered model with only delay-optimized wires.
- Low-power case: a clustered model with only power-optimized wires.
- Criticality-based case: a clustered model using heterogeneous wires.
Slide 20: Results
- Performance loss of the criticality-based case relative to the high-performance case: 2.5%.
- Performance loss of the low-power case relative to the high-performance case: 20%.
Slide 21: Results
[Chart: % non-critical transfers and % IPC loss.]
Slide 22: Summary of Non-Critical Interconnect Transfers
[Chart: breakdown of transfers into predicted effective addresses, unpredicted addresses, bypassed non-critical, bypassed critical, ready register values, store values, and load values.]
Slide 23: Result Summary
- Two kinds of non-critical transfers: data that are not immediately used (38%) and verification of address predictions (13%).
- Criticality-based case: 49% of all data transfers go through the power-optimized wires, with a performance penalty of only 2.5%.
- Potential energy savings of around 50% in the interconnects.
Slide 24: Related Work
- Several heuristics for data criticality: Tune et al. [HPCA-7], Srinivasan et al. [ISCA-28].
- Redirecting instructions to units based on criticality: Seng et al. [MICRO 2001].
- Heterogeneous cache banks: Balasubramonian et al. [MICRO 2003].
- An analytical model for designing interconnect under a given delay penalty: Banerjee and Mehrotra [IEEE Trans. Electron Devices 2002].
Slide 25: Future Work
- Other metrics for data criticality prediction (e.g., low-confidence branches).
- Applying the heterogeneous interconnect elsewhere in the microprocessor (caches, etc.).
- Other configurations of the heterogeneous interconnect.
Slide 26: Conclusion
- A single interconnect optimized for either delay or power alone is not enough.
- A heterogeneous interconnect alleviates this problem.
- The criticality predictor efficiently identifies non-critical data: 49% of transfers go on the non-critical network, with a performance loss of only 2.5%.
Slide 27: Questions?
Thank you.