Microarchitectural Techniques to Reduce Interconnect Power in Clustered Processors
Karthik Ramani, Naveen Muralimanohar, Rajeev Balasubramonian
University of Utah
Power and Temperature-Aware Microarchitecture, June 20th, 2004
Slide 2: Motivation
- Wire delays do not scale as well as their transistor counterparts, so future processors will be communication bound.
- Increased use of interconnects brings an increase in power dissipation.
- 50% of dynamic power goes to interconnect switching (Magen et al., SLIP '04).
- The MIT Raw processor's on-chip network consumes 36% of total chip power (Wang et al., 2003).
Slide 3: Interconnect Power
- Reducing power generally means an increase in latency.
- Dynamic power = a·C·V²·f (activity factor a, switched capacitance C, supply voltage V, clock frequency f).
- Methods: frequency scaling, voltage scaling, reducing the size of repeaters, reducing the number of repeaters.
A toy calculation of the dynamic-power relation follows.
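To make the tradeoff concrete, here is a minimal sketch of the dynamic-power equation; the activity factor, capacitance, voltage, and frequency values are illustrative placeholders, not figures from the talk.

```python
# Dynamic power of a switching wire: P = a * C * V^2 * f.
# All numeric values below are illustrative, not from the presentation.

def dynamic_power(a: float, c: float, v: float, f: float) -> float:
    """Activity factor a, switched capacitance c (F), supply voltage v (V),
    clock frequency f (Hz)."""
    return a * c * v ** 2 * f

base = dynamic_power(a=0.15, c=1e-9, v=1.2, f=2e9)
scaled = dynamic_power(a=0.15, c=1e-9, v=1.2 * 0.8, f=2e9 * 0.8)
print(f"baseline: {base:.3f} W, V and f scaled by 0.8: {scaled:.3f} W")
# Power is quadratic in V and linear in f, so scaling both by 0.8 cuts
# dynamic power to 0.8**3 (about 51%) of the baseline, at added latency.
```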
Slide 4: Power-Delay Tradeoff
- Conventional interconnect design is performance oriented: low latency, but high power dissipation.
- Power can be reduced by tolerating some delay penalty, either by reducing repeater size or by decreasing the number of repeaters; in both cases latency increases.
[Figure: wire and repeater schematics illustrating the two techniques.]
Slide 5: Power Reduction
[Figure from Banerjee et al., IEEE Transactions on Electron Devices, 2002.]
Slide 6: Impact of Power-Centric Design
- Delay-optimized case: wires optimized for delay.
- Power-optimized case: wires optimized for power.
- Performance difference between the two cases: 20%.
Slide 7: Heterogeneous Interconnects
- Proposed design: implement wires with varied characteristics.
- Delay-optimized interconnect.
- Power-optimized interconnect: latencies twice those of the delay-optimal wires, with an 80% reduction in power (by focusing on repeaters alone).
Slide 8: Outline
- Motivation & proposed solution
- Base architecture
- Interconnect transfers
- Results
- Conclusion & future work
Slide 9: Architecture for Evaluation
- A dynamically scheduled clustered model with 16 clusters, four FUs per cluster.
- Hierarchical interconnects: crossbar (1-cycle latency) and ring interconnect (4-cycle latency).
- Centralized front-end: I-cache & D-cache, LSQ, branch predictor.
[Figure: block diagram of the clustered processor.]
A simplified latency model for this hierarchy follows.
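The slide gives only the per-hop latencies, so the topology below (four groups of four clusters, each group sharing a crossbar, groups linked by a ring) is an assumption used to illustrate how inter-cluster latency accumulates; slide 10's 2-10 cycle range suggests the real design charges at least one extra cycle per transfer.

```python
# A simplified latency model for the hierarchical interconnect of slide 9.
# Assumed topology: 16 clusters in four groups of four; a 1-cycle crossbar
# inside each group, 4-cycle ring hops between neighbouring groups.

CLUSTERS_PER_GROUP = 4
NUM_GROUPS = 4          # 16 clusters total
XBAR_LATENCY = 1        # cycles (from the slide)
RING_HOP_LATENCY = 4    # cycles (from the slide)

def transfer_latency(src: int, dst: int) -> int:
    """Cycles to move a value from cluster src to cluster dst."""
    sg, dg = src // CLUSTERS_PER_GROUP, dst // CLUSTERS_PER_GROUP
    if sg == dg:
        return XBAR_LATENCY               # same group: one crossbar pass
    hops = min((dg - sg) % NUM_GROUPS, (sg - dg) % NUM_GROUPS)
    return XBAR_LATENCY + hops * RING_HOP_LATENCY + XBAR_LATENCY

print(transfer_latency(0, 3))    # same group: 1 cycle
print(transfer_latency(0, 15))   # one ring hop away: 6 cycles
print(transfer_latency(0, 10))   # two ring hops away: 10 cycles
```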
Slide 10: Simulator Parameters
- Simplescalar, with contention modeled in detail.
- 15-entry out-of-order issue queue in each cluster (int & fp each).
- 30 physical registers (int & fp each).
- In-flight window: 480 instructions.
- Inter-cluster latencies: 2-10 cycles delay-optimized, 4-20 cycles power-optimized.
Slide 11: Interconnect Transfers: Types
- Bypassed register values
- Ready register values
- Address transfers
- Store values
- Load values
Slide 12: Bypassed Register Values
- Operands produced in one cluster that are immediately required by another cluster.
- Criticality is based on two factors: the operand's arrival time at the cluster, and the actual issue time of the sourcing instruction.
- Criticality changes at runtime, so a dynamic predictor is needed.
[Figure: pipeline example: the producing instruction completes execution at cycle 120; the consumer instruction was dispatched at cycle 100.]
Slide 13: The Data Criticality Predictor
- A table indexed by the low-order bits of the instruction address, updated dynamically to indicate the criticality of data.
- The difference between arrival time and usage time is calculated for each operand of an instruction:
  - difference < threshold: critical;
  - difference > threshold: non-critical.
A minimal sketch of this predictor follows.
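Here is a minimal sketch of the predictor, assuming a direct-mapped table and a single threshold; the table size, update policy, and threshold value are not given on the slide and are made up here.

```python
# Data criticality predictor (slide 13): a table indexed by the low-order
# bits of the instruction address, trained on the gap between an operand's
# arrival and its use. Sizes and threshold are assumptions.

TABLE_BITS = 10
THRESHOLD = 5  # cycles; illustrative only

criticality_table = [True] * (1 << TABLE_BITS)  # start assuming critical

def _index(pc: int) -> int:
    return pc & ((1 << TABLE_BITS) - 1)   # low-order bits of the address

def update(pc: int, arrival_cycle: int, issue_cycle: int) -> None:
    """Train on an observed operand: a small arrival-to-use gap means the
    transfer was critical; a large gap means it was not."""
    gap = issue_cycle - arrival_cycle
    criticality_table[_index(pc)] = gap < THRESHOLD

def is_critical(pc: int) -> bool:
    """Predict whether this instruction's operand transfers are critical."""
    return criticality_table[_index(pc)]

update(pc=0x4BFC, arrival_cycle=120, issue_cycle=122)
print(is_critical(0x4BFC))  # True: a 2-cycle gap is below the threshold
```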
Slide 14: Ready Register Values
- Source operands that are already available at the time of dispatch.
- Premise: there is significant latency between dispatch and issue, so these transfers are latency tolerant and can use the power-optimized wires.
[Figure: pipeline example: the operand is ready at cycle 90; the consumer instruction is dispatched at cycle 100.]
Slide 15: Load & Store Data
- Store data: often non-critical and latency insensitive, so it uses the power-optimized network.
  - Impact of delayed stores (rare cases): dependent loads have to wait, and commit stalls if the store is at the head of the reorder buffer.
- Load data: critical! Often on the critical path; latency sensitive, so it uses the fast network.
Slide 16: Address Prediction
- 51% of effective address transfers are covered by high-confidence predictions; the computed address then travels only to verify the prediction.
[Figure: load/store path with an address predictor (AP) alongside the LSQ, register file, FU, and L1 cache.]
A sketch of one possible predictor follows.
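The talk does not say which address predictor is used, only that 51% of transfers are predicted with high confidence; the stride scheme and saturating-counter confidence below are one plausible sketch, not the authors' design.

```python
# A per-instruction stride address predictor with a confidence counter.
# The predictor organisation is an assumption; only the idea of sending
# high-confidence predicted addresses on slow wires comes from the talk.

from collections import defaultdict

class AddrEntry:
    def __init__(self) -> None:
        self.last_addr = 0
        self.stride = 0
        self.confidence = 0  # saturating counter, 0..3

table = defaultdict(AddrEntry)
HIGH_CONFIDENCE = 3

def predict(pc: int) -> tuple[int, bool]:
    """Return (predicted address, is the prediction high confidence?)."""
    e = table[pc]
    return e.last_addr + e.stride, e.confidence >= HIGH_CONFIDENCE

def train(pc: int, actual_addr: int) -> None:
    """Once the real effective address is computed, update the entry."""
    e = table[pc]
    stride = actual_addr - e.last_addr
    if stride == e.stride:
        e.confidence = min(e.confidence + 1, 3)
    else:
        e.stride, e.confidence = stride, 0
    e.last_addr = actual_addr

# With a high-confidence prediction the cache access starts early, and the
# computed address follows on the power-optimized wires for verification.
```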
Slide 17: Summary of Transfers

Critical:
- Load values
- Effective address transfers (unpredicted)
- Bypassed register values predicted critical

Non-Critical:
- Store values
- Effective address transfers (predicted)
- Bypassed register values predicted non-critical
- Ready register values

The routing decision this table implies is sketched below.
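Slide 17's table translates directly into a per-transfer routing decision; the function and parameter names in this sketch are mine, not the authors'.

```python
# Routing decision implied by slide 17: pick the wire set for a transfer.

FAST, SLOW = "delay-optimized", "power-optimized"

def choose_network(kind: str, *, predicted_critical: bool = True,
                   address_predicted: bool = False) -> str:
    if kind == "load_value":
        return FAST                      # loads sit on the critical path
    if kind in ("store_value", "ready_register"):
        return SLOW                      # latency tolerant
    if kind == "effective_address":
        # A high-confidence predicted address is sent only for verification.
        return SLOW if address_predicted else FAST
    if kind == "bypassed_register":
        return FAST if predicted_critical else SLOW
    raise ValueError(f"unknown transfer kind: {kind}")

print(choose_network("effective_address", address_predicted=True))    # slow
print(choose_network("bypassed_register", predicted_critical=False))  # slow
```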
Slide 18: Outline
- Motivation & proposed solution
- Base architecture
- Interconnect transfers
- Simulation results
- Conclusion
Slide 19: Methodology
Three cases for simulation:
- High-performance case: a clustered model with only delay-optimized wires.
- Low-power case: a clustered model with only power-optimized wires.
- Criticality-based case: a clustered model using heterogeneous wires.
Slide 20: Results
- Performance loss of the criticality-based case relative to the high-performance case: 2.5%.
- Performance loss of the low-power case relative to the high-performance case: 20%.
Slide 21: Results
[Chart: % non-critical transfers and % IPC loss.]
Slide 22: Summary of Non-Critical Interconnect Transfers
[Chart: breakdown of transfers into predicted effective addresses, unpredicted addresses, bypassed non-critical, bypassed critical, ready register values, store values, and load values.]
Slide 23: Result Summary
- Two kinds of non-critical transfers: data that are not immediately used (38%) and verification of address predictions (13%).
- Criticality-based case: 49% of all data transfers go through the power-optimized wires, with a performance penalty of only 2.5%.
- Potential energy savings of around 50% in the interconnects.
Slide 24: Related Work
- Several heuristics for data criticality: Tune et al. [HPCA-7], Srinivasan et al. [ISCA-28].
- Redirecting instructions to units based on criticality: Seng et al. [MICRO 2001].
- Heterogeneous cache banks: Balasubramonian et al. [MICRO 2003].
- An analytical model for designing interconnect under a given delay penalty: Banerjee and Mehrotra [IEEE Trans. Electron Devices 2002].
Slide 25: Future Work
- Other metrics for data criticality prediction (e.g., low-confidence branches).
- Applying the heterogeneous interconnect elsewhere in the microprocessor (caches, etc.).
- Other configurations of the heterogeneous interconnect.
Slide 26: Conclusion
- A single interconnect optimized for either delay or power alone is not enough.
- A heterogeneous interconnect alleviates this problem.
- The criticality predictor efficiently identifies non-critical data: 49% of transfers go on the non-critical network, with a performance loss of only 2.5%.
Slide 27: Questions?
Thank you.