
1 Interconnections in Multi-core Architectures: Understanding Mechanisms, Overheads and Scaling Rakesh Kumar (UCSD) Victor Zyuban (IBM) Dean Tullsen (UCSD)

2 A Naive Methodology for Multi-core Design
[Figure: eight cores P0-P7, each with a private L2 bank L2_0-L2_7, placed side by side]
Multi-core-oblivious multi-core design: a clean, easy way to design.

3 Goal of this Research: Holistic design of multi-core architectures
- The naïve methodology is inefficient
- Prior work demonstrated this inefficiency for cores and proposed alternatives:
  - Single-ISA Heterogeneous Multi-core Architectures for Power [MICRO'03]
  - Single-ISA Heterogeneous Multi-core Architectures for Performance [ISCA'04]
  - Conjoined-core Chip Multiprocessing [MICRO'04]
- What about interconnects? How much can interconnects impact processor architecture? Do they need to be co-designed with caches and cores?

4 Contributions
- We model the implementation of several interconnection mechanisms and topologies: quantify various overheads, highlight various tradeoffs, and study the scaling of the overheads
- We show that several common architectural beliefs do not hold when interconnection overheads are properly accounted for
- We show that one cannot design a good interconnect in isolation from the CPU cores and memory design
- We propose a novel interconnection architecture which exploits behaviors identified by this research

5 Talk Outline
- Interconnection models
  - Shared Bus Fabric (SBF)
  - Point-to-point links
  - Crossbar
- Modeling area, power and latency
- Evaluation methodology
- SBF and crossbar results
- Novel architecture

6 Shared Bus Fabric (SBF)
- The on-chip equivalent of the system bus for snoop-based shared-memory multiprocessors
  - We assume a MESI-like snoopy write-invalidate protocol with write-back L2s
- The SBF needs to support several coherence transactions (request, snoop, response, data transfer, invalidates, etc.)
  - It also needs to arbitrate access to the corresponding buses

7 Shared Bus Fabric (SBF)
[Figure: SBF connecting the cores (incl. I$/D$) and L2 banks via address, snoop, response and data buses (AB, SB, RB, DB), with address and data arbiters (A-arb, D-arb) and book-keeping logic]
Components: arbiters, queues, pipelined unidirectional buses, and control wires (mux controls, flow control, request/grant signals).
Details about latencies, overheads, etc. are in the paper.

8 Point-to-Point Link (P2PL)
- If there are multiple SBFs in the system, a point-to-point link connects two SBFs. It needs queues and arbiters similar to those of an SBF
- Multiple SBFs might be required in the system: to increase bandwidth, to decrease signal latencies, or to ease floorplanning

9 Crossbar Interconnection System
- If two or more cores share an L2, as many present CMPs do, a crossbar provides a high-bandwidth connection between the cores and the cache banks

10 Crossbar Interconnection System
[Figure: crossbar between cores and L2 banks, with one address bus (AB) and one data-out bus (DoutB) per core, and one data-in bus (DinB) per bank]
- The address buses carry loads, stores, prefetches and TLB misses; the DoutBs carry data write-backs; the DinBs carry data reloads and invalidate addresses
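To make the wiring implications concrete, here is a minimal sketch (not from the paper; the bus widths are assumptions) of how the number of crossbar wiring tracks grows with the core and bank counts:

```python
# Count the wiring tracks a cores-to-banks crossbar of this style needs:
# one address bus (AB) and one data-out bus (DoutB) per core, and one
# data-in bus (DinB) per L2 bank. The bus widths are illustrative assumptions.

def crossbar_tracks(n_cores: int, n_banks: int,
                    addr_bits: int = 48,    # assumed address-bus width
                    data_bits: int = 128    # assumed data-bus width
                    ) -> int:
    ab = n_cores * addr_bits     # loads, stores, prefetches, TLB misses
    dout = n_cores * data_bits   # data write-backs
    din = n_banks * data_bits    # data reloads and invalidate addresses
    return ab + dout + din

if __name__ == "__main__":
    for cores, banks in [(4, 4), (8, 8), (16, 16)]:
        print(f"{cores} cores, {banks} banks -> {crossbar_tracks(cores, banks)} tracks")
```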

11 Talk Outline
- Interconnection models
  - Shared Bus Fabric (SBF)
  - Point-to-point links
  - Crossbar
- Modeling area, power and latency
- Evaluation methodology
- SBF and crossbar results
- Novel architecture

12 Wiring Area Overhead
[Figure: wires routed over a memory array, with repeaters and latches inserted along their length]

13 Wiring Area Overhead (65nm)

Metal Plane | Effective Pitch (um) | Repeater Spacing (mm) | Repeater Width (um) | Latch Spacing (mm) | Latch Height (um)
1X          | 0.5                  | 0.4                   | -                   | 1.5                | 120
2X          | 1.0                  | 0.8                   | -                   | 3.0                | 60
4X          | 2.0                  | 1.6                   | -                   | 5.0                | 30
8X          | 4.0                  | 3.2                   | -                   | 8.0                | 15

Overheads change based on the metal plane(s) the interconnects are mapped to.
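As a rough illustration of how these parameters translate into die area, the sketch below estimates the footprint of one bus mapped to each plane; the per-repeater and per-latch cell areas are assumed values, not figures from the slide:

```python
# Rough area of one pipelined, unidirectional bus (mm^2) on a given metal plane.
# Pitch and spacing values come from the 65nm table above; the per-repeater and
# per-latch cell areas are illustrative assumptions (the repeater-width column
# was not reproduced here).

PLANES = {  # plane: (effective pitch um, repeater spacing mm, latch spacing mm)
    "1X": (0.5, 0.4, 1.5),
    "2X": (1.0, 0.8, 3.0),
    "4X": (2.0, 1.6, 5.0),
    "8X": (4.0, 3.2, 8.0),
}

def bus_area_mm2(plane: str, width_bits: int, length_mm: float,
                 repeater_um2: float = 2.0,  # assumed area per repeater
                 latch_um2: float = 30.0     # assumed area per latch
                 ) -> float:
    pitch_um, rep_spacing_mm, latch_spacing_mm = PLANES[plane]
    # Wiring tracks: one track per bit over the whole bus length.
    wire_mm2 = width_bits * (pitch_um / 1000.0) * length_mm
    # Repeaters and latches are inserted at regular intervals on every bit line.
    n_rep = width_bits * int(length_mm / rep_spacing_mm)
    n_latch = width_bits * int(length_mm / latch_spacing_mm)
    return wire_mm2 + (n_rep * repeater_um2 + n_latch * latch_um2) * 1e-6

if __name__ == "__main__":
    # e.g. a 128-bit data bus spanning 20 mm, mapped to each plane
    for plane in PLANES:
        print(plane, round(bus_area_mm2(plane, 128, 20.0), 3), "mm^2")
```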

14 Wiring Power Overhead
- Dynamic dissipation in wires, repeaters and latches
  - Wire capacitance 0.02 pF/mm, frequency 2.5 GHz, supply 1.1 V
  - Repeater capacitance is 30% of the wire capacitance
  - Dynamic power of 0.05 mW per latch
- Leakage in repeaters and latches
  - Channel and gate leakage values are in the paper
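A back-of-envelope version of this dynamic-power model, using the capacitance, frequency and voltage figures above (the switching activity factor is an assumption):

```python
# Dynamic power of a single bus wire, using the per-mm capacitance, clock
# frequency and supply voltage from the slide. The switching activity
# factor is an illustrative assumption.

WIRE_CAP_PF_PER_MM = 0.02    # wire capacitance (pF/mm)
FREQ_HZ = 2.5e9              # clock frequency
VDD = 1.1                    # supply voltage (V)
REPEATER_CAP_FRACTION = 0.3  # repeater capacitance as a fraction of wire capacitance
LATCH_DYN_MW = 0.05          # dynamic power per latch (mW)

def wire_dynamic_mw(length_mm: float, n_latches: int, activity: float = 0.5) -> float:
    """Dynamic power (mW) of one wire of the given length."""
    cap_f = WIRE_CAP_PF_PER_MM * 1e-12 * length_mm * (1 + REPEATER_CAP_FRACTION)
    switching_mw = activity * cap_f * VDD ** 2 * FREQ_HZ * 1e3
    return switching_mw + n_latches * LATCH_DYN_MW

if __name__ == "__main__":
    # A 10 mm wire on the 2X plane (one latch every 3 mm -> about 3 latches)
    print(round(wire_dynamic_mw(10.0, n_latches=3), 3), "mW per wire")
```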

15 Wiring Latency Overhead
- Latency of signals traveling through latches
- Latency also from the travel of control signals between a central arbiter and the interfaces corresponding to the request/data queues
- Hence, latencies depend on the location of the particular core, cache or arbiter
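A minimal sketch of how these pieces compose, assuming one latch-to-latch hop per cycle and a request/grant round trip to a central arbiter; the arbitration cycle count is an assumption:

```python
import math

# Latch spacing per metal plane at 65nm, from the table two slides back (mm).
LATCH_SPACING_MM = {"1X": 1.5, "2X": 3.0, "4X": 5.0, "8X": 8.0}

def wire_latency_cycles(distance_mm: float, plane: str) -> int:
    """Cycles to traverse distance_mm, assuming one latch-to-latch hop per cycle."""
    return math.ceil(distance_mm / LATCH_SPACING_MM[plane])

def request_latency_cycles(to_arbiter_mm: float, to_target_mm: float,
                           plane: str, arbitration_cycles: int = 3) -> int:
    """Request/grant round trip to a central arbiter, then the bus transfer.
    The 3-cycle arbitration cost is an illustrative assumption."""
    round_trip = 2 * wire_latency_cycles(to_arbiter_mm, plane)
    return round_trip + arbitration_cycles + wire_latency_cycles(to_target_mm, plane)

if __name__ == "__main__":
    # A core 12 mm from the arbiter and 18 mm from its target, on 4X wires
    print(request_latency_cycles(12.0, 18.0, "4X"), "cycles")
```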

16 Interconnect-Related Logic Overhead
- Arbiters, muxes and queues constitute interconnect-related logic
- Area and power overhead is primarily due to the queues, assumed to be implemented using latches
- Performance overhead is due to wait time in the queues and arbitration latencies
  - Arbitration overhead increases with the number of connected units
  - Latching is required between different stages of arbitration
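One simple way to see the scaling, assuming arbitration is organized as a tree of fixed fan-in arbiters with latching between stages (the fan-in and per-stage cost are assumptions, not values from the paper):

```python
import math

def arbitration_cycles(n_units: int, fan_in: int = 4, cycles_per_stage: int = 1) -> int:
    """Cycles through a tree of fixed fan-in arbiters with a latch between stages.
    The fan-in and per-stage cost are illustrative assumptions."""
    stages, remaining = 0, n_units
    while remaining > 1:
        remaining = math.ceil(remaining / fan_in)  # one more arbitration stage
        stages += 1
    return stages * cycles_per_stage

if __name__ == "__main__":
    for n in (4, 8, 16, 32):
        print(n, "connected units ->", arbitration_cycles(n), "arbitration cycles")
```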

17 Talk Outline
- Interconnection models
  - Shared Bus Fabric (SBF)
  - Point-to-point links
  - Crossbar
- Modeling area, power and latency
- Evaluation methodology
- SBF and crossbar results
- Novel architecture

18 Modeling Multi-core Architectures
- Stripped-down POWER4-like cores, 10 mm^2 and 10 W each
- Evaluated 4-, 8- and 16-core multiprocessors occupying roughly 400 mm^2
- A CMP consists of cores, L2 banks, memory controllers, DMA controllers and non-cacheable units
- Weak consistency model, MESI-like coherence
- All studies done for 65nm

19 Floorplans for 4-, 8- and 16-core Processors (assuming private caches)
[Figure: floorplans showing cores, L2 data and tag arrays, SBFs, memory controllers (MC), IOX and NCU blocks, and a P2P link]
Note that there are two SBFs for the 16-core processor.

20 Performance Modeling
- Used a combination of detailed functional simulation and queuing simulation
- Functional simulator
  - Input: SMP traces (TPC-C, TPC-W, TPC-H, NotesBench, etc.)
  - Output: coherence statistics for the modeled memory/interconnection system
- Queuing simulator
  - Input: coherence statistics, interconnection latencies, and the CPI of the modeled core assuming an infinite L2
  - Output: system CPI
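This is not the authors' queuing simulator, but a minimal sketch of how such inputs can be combined: start from the core CPI with an infinite L2 and add per-instruction stall cycles for each transaction type, using its rate, its interconnect latency, and a simple M/M/1 estimate of queuing delay; the numbers and the M/M/1 approximation are illustrative assumptions:

```python
# Sketch of turning coherence statistics plus interconnect latencies into a
# system CPI. The transaction mix, the cycle counts and the M/M/1 queuing
# approximation are all illustrative assumptions, not the paper's model.

def mm1_wait(service_cycles: float, utilization: float) -> float:
    """Mean queuing delay of an M/M/1 server (requires utilization < 1)."""
    return service_cycles * utilization / (1.0 - utilization)

def system_cpi(cpi_infinite_l2: float, transactions: dict,
               bus_utilization: float) -> float:
    """Add per-instruction interconnect stall cycles to the base CPI."""
    cpi = cpi_infinite_l2
    for rate_per_instr, wire_latency, bus_occupancy in transactions.values():
        queue_delay = mm1_wait(bus_occupancy, bus_utilization)
        cpi += rate_per_instr * (wire_latency + queue_delay)
    return cpi

if __name__ == "__main__":
    # name -> (per-instruction rate, wire latency in cycles, bus occupancy in cycles)
    tx = {
        "l2_miss":   (0.010, 120.0, 8.0),
        "snoop":     (0.004, 40.0, 2.0),
        "writeback": (0.005, 30.0, 8.0),
    }
    print(round(system_cpi(1.2, tx, bus_utilization=0.4), 3))
```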

21 Results…finally!

22 SBF: Wiring Area Overhead
- Area overhead can be significant: 7-13% of die area, sufficient to place 3-5 extra cores or 4-6 MB of extra cache!
- Co-design is needed: more cores, more cache, or more interconnect bandwidth?
- We observed a scenario where decreasing bandwidth improved performance

23 SBF: Wiring Area Overhead
- Control overhead is 37-63% of the total overhead
- This constrains how much area can be reduced with narrower buses

24 SBF: Wiring Area Overhead
- Argues against lightweight cores, which do not amortize the incremental cost they add to the interconnect

25 SBF: Power Overhead
- Power overhead can be significant for a large number of cores

26 SBF: Power Overhead
- Power due to queues is greater than that due to wires!
- A good interconnect architecture should have efficient queuing and flow control

27 SBF: Performance
- Interconnect overhead can be significant: 10-26%!
- The interconnect accounts for over half the latency to the L2 cache

28 Shared Caches and Crossbar
- Results for the 8-core processor with 2-way, 4-way and full sharing of the L2 cache
- Results are shown for two cases:
  - The crossbar sits between the cores and the L2: easy interfacing, but all wiring tracks result in area overhead
  - The crossbar is routed over the L2: interfacing is difficult, and the area overhead is only due to reduced cache density

29 Crossbar: Area Overhead
- 11-46% of the die for a 2X implementation!
- Sufficient to put 4 more cores, even for 4-way sharing!

30 Crossbar: Area Overhead (contd.)
- What is the point of cache sharing? Cores get the effect of having more cache space
- But we have to reduce the size of the shared cache to accommodate the crossbar
  - Is a larger cache through sharing an illusion, or can we really get larger caches by making them private and reclaiming the area used by the crossbar?
- In other words, does sharing have any benefit in such scenarios?

31 Crossbar: Performance Overhead

32 Accompanying Grain of Salt
- A simplified interconnection model is assumed
- Systems with memory scouts, etc. may have different memory-system requirements
- Non-Uniform Caches (NUCA) might improve performance
- Etc.
- However, the results do show that a shared cache is significantly less desirable for future technologies

33 What have we learned so far? (in terms of bottlenecks)

34 Interconnection Bottlenecks (and possible solutions)
- Long wires result in long latencies: see if wires can be shortened
- Centralized arbitration: see if arbitration can be distributed
- Overheads get worse with the number of modules connected to a bus: see if the number of modules connected to a bus can be decreased

35 A Hierarchical Interconnect
[Figure: eight cores with their NCUs and L2 data arrays, together with the MC and IOX blocks, all connected to a single SBF]

36 A Hierarchical Interconnect
[Figure: the same cores, NCUs, L2 data arrays, MC and IOX blocks, now split across two SBFs connected by a P2PL]
- A local and a remote SBF: smaller average-case latency, longer worst-case latency

37 A Hierarchical Interconnect (contd.)
- Threads need to be mapped intelligently, to increase the hit rate in the caches connected to the local SBF
- In some cases, even random mapping results in better performance, e.g. for the 8-core processor shown
- More research needs to be done on hierarchical interconnects; more description is in the paper

38 Conclusions
- Design choices for interconnects have a significant effect on the rest of the chip; they should be co-designed with the cores and caches
- Interconnection power and performance overheads can be almost as much logic-dominated as wire-dominated: don't think only about wires; arbitration, queuing and flow control are important
- Some common architectural beliefs (e.g. shared L2 caches) may not hold when interconnection overheads are accounted for; we should do careful interconnect modeling for our CMP research proposals
- A hierarchical bus structure can negate some of the interconnection performance cost

