
1 Interconnections in Multi-core Architectures: Understanding Mechanisms, Overheads and Scaling Rakesh Kumar (UCSD) Victor Zyuban (IBM) Dean Tullsen (UCSD)

2 A Naive Methodology for Multi-core Design
[Figure: eight cores P0-P7, each with a private L2 bank L2_0-L2_7, placed side by side]
Multi-core-oblivious multi-core design: a clean, easy way to design.

3 Goal of this Research: Holistic design of multi-core architectures
- The naïve methodology is inefficient
- Prior work demonstrated this inefficiency for cores and proposed alternatives:
  - Single-ISA Heterogeneous Multi-core Architectures for Power [MICRO'03]
  - Single-ISA Heterogeneous Multi-core Architectures for Performance [ISCA'04]
  - Conjoined-core Chip Multiprocessing [MICRO'04]
- What about interconnects? How much can interconnects impact processor architecture? Do they need to be co-designed with caches and cores?

4 Contributions
- We model the implementation of several interconnection mechanisms and topologies: quantify various overheads, highlight various tradeoffs, and study the scaling of the overheads
- We show that several common architectural beliefs do not hold when interconnection overheads are properly accounted for
- We show that one cannot design a good interconnect in isolation from the CPU cores and memory design
- We propose a novel interconnection architecture which exploits behaviors identified by this research

5 Talk Outline
- Interconnection models
  - Shared Bus Fabric (SBF)
  - Point-to-point links
  - Crossbar
- Modeling area, power and latency
- Evaluation methodology
- SBF and crossbar results
- Novel architecture

6 Shared Bus Fabric (SBF)
- The on-chip equivalent of the system bus for snoop-based shared-memory multiprocessors
  - We assume a MESI-like snoopy write-invalidate protocol with write-back L2s
- The SBF needs to support several coherence transactions (request, snoop, response, data transfer, invalidates, etc.)
  - It also needs to arbitrate access to the corresponding buses

7 Shared Bus Fabric (SBF)
[Figure: SBF connecting the cores (incl. I$/D$) and L2 banks via address, snoop, response and data buses (AB, SB, RB, DB), with address and data arbiters (A-arb, D-arb) and book-keeping logic]
Components: arbiters, queues, pipelined unidirectional buses, and control wires (mux controls, flow control, request/grant signals).
Details about latencies, overheads, etc. are in the paper.

8 Point-to-Point Link (P2PL)
- If there are multiple SBFs in the system, a point-to-point link connects two SBFs. It needs queues and arbiters similar to those of an SBF
- Multiple SBFs might be required in the system: to increase bandwidth, to decrease signal latencies, or to ease floorplanning

9 Crossbar Interconnection System
- If two or more cores share an L2, as many present CMPs do, a crossbar provides a high-bandwidth connection between the cores and the cache banks

10 Crossbar Interconnection System
[Figure: crossbar between cores and L2 banks, with one address bus (AB) and one data-out bus (DoutB) per core, and one data-in bus (DinB) per bank]
- The address buses carry loads, stores, prefetches and TLB misses; the DoutBs carry data write-backs; the DinBs carry data reloads and invalidate addresses
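To make the wiring implications concrete, here is a minimal sketch (not from the paper; the bus widths are assumptions) of how the number of crossbar wiring tracks grows with the core and bank counts:

```python
# Count the wiring tracks a cores-to-banks crossbar of this style needs:
# one address bus (AB) and one data-out bus (DoutB) per core, and one
# data-in bus (DinB) per L2 bank. The bus widths are illustrative assumptions.

def crossbar_tracks(n_cores: int, n_banks: int,
                    addr_bits: int = 48,    # assumed address-bus width
                    data_bits: int = 128    # assumed data-bus width
                    ) -> int:
    ab = n_cores * addr_bits     # loads, stores, prefetches, TLB misses
    dout = n_cores * data_bits   # data write-backs
    din = n_banks * data_bits    # data reloads and invalidate addresses
    return ab + dout + din

if __name__ == "__main__":
    for cores, banks in [(4, 4), (8, 8), (16, 16)]:
        print(f"{cores} cores, {banks} banks -> {crossbar_tracks(cores, banks)} tracks")
```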

11 Talk Outline
- Interconnection models
  - Shared Bus Fabric (SBF)
  - Point-to-point links
  - Crossbar
- Modeling area, power and latency
- Evaluation methodology
- SBF and crossbar results
- Novel architecture

12 Wiring Area Overhead
[Figure: wires routed over a memory array, with repeaters and latches inserted along their length]

13 Wiring Area Overhead (65nm)

Metal Plane | Effective Pitch (um) | Repeater Spacing (mm) | Repeater Width (um) | Latch Spacing (mm) | Latch Height (um)
1X          | 0.5                  | 0.4                   | -                   | 1.5                | 120
2X          | 1.0                  | 0.8                   | -                   | 3.0                | 60
4X          | 2.0                  | 1.6                   | -                   | 5.0                | 30
8X          | 4.0                  | 3.2                   | -                   | 8.0                | 15

Overheads change based on the metal plane(s) the interconnects are mapped to.
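As a rough illustration of how these parameters translate into die area, the sketch below estimates the footprint of one bus mapped to each plane; the per-repeater and per-latch cell areas are assumed values, not figures from the slide:

```python
# Rough area of one pipelined, unidirectional bus (mm^2) on a given metal plane.
# Pitch and spacing values come from the 65nm table above; the per-repeater and
# per-latch cell areas are illustrative assumptions (the repeater-width column
# was not reproduced here).

PLANES = {  # plane: (effective pitch um, repeater spacing mm, latch spacing mm)
    "1X": (0.5, 0.4, 1.5),
    "2X": (1.0, 0.8, 3.0),
    "4X": (2.0, 1.6, 5.0),
    "8X": (4.0, 3.2, 8.0),
}

def bus_area_mm2(plane: str, width_bits: int, length_mm: float,
                 repeater_um2: float = 2.0,  # assumed area per repeater
                 latch_um2: float = 30.0     # assumed area per latch
                 ) -> float:
    pitch_um, rep_spacing_mm, latch_spacing_mm = PLANES[plane]
    # Wiring tracks: one track per bit over the whole bus length.
    wire_mm2 = width_bits * (pitch_um / 1000.0) * length_mm
    # Repeaters and latches are inserted at regular intervals on every bit line.
    n_rep = width_bits * int(length_mm / rep_spacing_mm)
    n_latch = width_bits * int(length_mm / latch_spacing_mm)
    return wire_mm2 + (n_rep * repeater_um2 + n_latch * latch_um2) * 1e-6

if __name__ == "__main__":
    # e.g. a 128-bit data bus spanning 20 mm, mapped to each plane
    for plane in PLANES:
        print(plane, round(bus_area_mm2(plane, 128, 20.0), 3), "mm^2")
```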

14 Wiring Power Overhead
- Dynamic dissipation in wires, repeaters and latches
  - Wire capacitance 0.02 pF/mm, frequency 2.5 GHz, supply 1.1 V
  - Repeater capacitance is 30% of the wire capacitance
  - Dynamic power of 0.05 mW per latch
- Leakage in repeaters and latches
  - Channel and gate leakage values are in the paper
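A back-of-envelope version of this dynamic-power model, using the capacitance, frequency and voltage figures above (the switching activity factor is an assumption):

```python
# Dynamic power of a single bus wire, using the per-mm capacitance, clock
# frequency and supply voltage from the slide. The switching activity
# factor is an illustrative assumption.

WIRE_CAP_PF_PER_MM = 0.02    # wire capacitance (pF/mm)
FREQ_HZ = 2.5e9              # clock frequency
VDD = 1.1                    # supply voltage (V)
REPEATER_CAP_FRACTION = 0.3  # repeater capacitance as a fraction of wire capacitance
LATCH_DYN_MW = 0.05          # dynamic power per latch (mW)

def wire_dynamic_mw(length_mm: float, n_latches: int, activity: float = 0.5) -> float:
    """Dynamic power (mW) of one wire of the given length."""
    cap_f = WIRE_CAP_PF_PER_MM * 1e-12 * length_mm * (1 + REPEATER_CAP_FRACTION)
    switching_mw = activity * cap_f * VDD ** 2 * FREQ_HZ * 1e3
    return switching_mw + n_latches * LATCH_DYN_MW

if __name__ == "__main__":
    # A 10 mm wire on the 2X plane (one latch every 3 mm -> about 3 latches)
    print(round(wire_dynamic_mw(10.0, n_latches=3), 3), "mW per wire")
```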

15 Wiring Latency Overhead
- Latency of signals traveling through latches
- Latency also from the travel of control signals between a central arbiter and the interfaces corresponding to the request/data queues
- Hence, latencies depend on the location of the particular core, cache or arbiter
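A minimal sketch of how these pieces compose, assuming one latch-to-latch hop per cycle and a request/grant round trip to a central arbiter; the arbitration cycle count is an assumption:

```python
import math

# Latch spacing per metal plane at 65nm, from the table two slides back (mm).
LATCH_SPACING_MM = {"1X": 1.5, "2X": 3.0, "4X": 5.0, "8X": 8.0}

def wire_latency_cycles(distance_mm: float, plane: str) -> int:
    """Cycles to traverse distance_mm, assuming one latch-to-latch hop per cycle."""
    return math.ceil(distance_mm / LATCH_SPACING_MM[plane])

def request_latency_cycles(to_arbiter_mm: float, to_target_mm: float,
                           plane: str, arbitration_cycles: int = 3) -> int:
    """Request/grant round trip to a central arbiter, then the bus transfer.
    The 3-cycle arbitration cost is an illustrative assumption."""
    round_trip = 2 * wire_latency_cycles(to_arbiter_mm, plane)
    return round_trip + arbitration_cycles + wire_latency_cycles(to_target_mm, plane)

if __name__ == "__main__":
    # A core 12 mm from the arbiter and 18 mm from its target, on 4X wires
    print(request_latency_cycles(12.0, 18.0, "4X"), "cycles")
```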

16 Interconnect-Related Logic Overhead
- Arbiters, muxes and queues constitute interconnect-related logic
- Area and power overhead is primarily due to the queues, assumed to be implemented using latches
- Performance overhead is due to wait time in the queues and arbitration latencies
  - Arbitration overhead increases with the number of connected units
  - Latching is required between different stages of arbitration
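One simple way to see the scaling, assuming arbitration is organized as a tree of fixed fan-in arbiters with latching between stages (the fan-in and per-stage cost are assumptions, not values from the paper):

```python
import math

def arbitration_cycles(n_units: int, fan_in: int = 4, cycles_per_stage: int = 1) -> int:
    """Cycles through a tree of fixed fan-in arbiters with a latch between stages.
    The fan-in and per-stage cost are illustrative assumptions."""
    stages, remaining = 0, n_units
    while remaining > 1:
        remaining = math.ceil(remaining / fan_in)  # one more arbitration stage
        stages += 1
    return stages * cycles_per_stage

if __name__ == "__main__":
    for n in (4, 8, 16, 32):
        print(n, "connected units ->", arbitration_cycles(n), "arbitration cycles")
```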

17 Talk Outline
- Interconnection models
  - Shared Bus Fabric (SBF)
  - Point-to-point links
  - Crossbar
- Modeling area, power and latency
- Evaluation methodology
- SBF and crossbar results
- Novel architecture

18 Modeling Multi-core Architectures
- Stripped-down POWER4-like cores, 10 mm^2 and 10 W each
- Evaluated 4-, 8- and 16-core multiprocessors occupying roughly 400 mm^2
- A CMP consists of cores, L2 banks, memory controllers, DMA controllers and non-cacheable units
- Weak consistency model, MESI-like coherence
- All studies done for 65nm

19 Floorplans for 4-, 8- and 16-core Processors (assuming private caches)
[Figure: floorplans showing cores, L2 data and tag arrays, SBFs, memory controllers (MC), IOX and NCU blocks, and a P2P link]
Note that there are two SBFs for the 16-core processor.

20 Performance Modeling
- Used a combination of detailed functional simulation and queuing simulation
- Functional simulator
  - Input: SMP traces (TPC-C, TPC-W, TPC-H, NotesBench, etc.)
  - Output: coherence statistics for the modeled memory/interconnection system
- Queuing simulator
  - Input: coherence statistics, interconnection latencies, and the CPI of the modeled core assuming an infinite L2
  - Output: system CPI
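This is not the authors' queuing simulator, but a minimal sketch of how such inputs can be combined: start from the core CPI with an infinite L2 and add per-instruction stall cycles for each transaction type, using its rate, its interconnect latency, and a simple M/M/1 estimate of queuing delay; the numbers and the M/M/1 approximation are illustrative assumptions:

```python
# Sketch of turning coherence statistics plus interconnect latencies into a
# system CPI. The transaction mix, the cycle counts and the M/M/1 queuing
# approximation are all illustrative assumptions, not the paper's model.

def mm1_wait(service_cycles: float, utilization: float) -> float:
    """Mean queuing delay of an M/M/1 server (requires utilization < 1)."""
    return service_cycles * utilization / (1.0 - utilization)

def system_cpi(cpi_infinite_l2: float, transactions: dict,
               bus_utilization: float) -> float:
    """Add per-instruction interconnect stall cycles to the base CPI."""
    cpi = cpi_infinite_l2
    for rate_per_instr, wire_latency, bus_occupancy in transactions.values():
        queue_delay = mm1_wait(bus_occupancy, bus_utilization)
        cpi += rate_per_instr * (wire_latency + queue_delay)
    return cpi

if __name__ == "__main__":
    # name -> (per-instruction rate, wire latency in cycles, bus occupancy in cycles)
    tx = {
        "l2_miss":   (0.010, 120.0, 8.0),
        "snoop":     (0.004, 40.0, 2.0),
        "writeback": (0.005, 30.0, 8.0),
    }
    print(round(system_cpi(1.2, tx, bus_utilization=0.4), 3))
```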

21 Results…finally!

22 SBF: Wiring Area Overhead
- Area overhead can be significant: 7-13% of die area, sufficient to place 3-5 extra cores or 4-6 MB of extra cache!
- Co-design is needed: more cores, more cache, or more interconnect bandwidth?
- We observed a scenario where decreasing bandwidth improved performance

23 SBF: Wiring Area Overhead
- Control overhead is 37-63% of the total overhead
- This constrains how much area can be reduced with narrower buses

24 SBF: Wiring Area Overhead
- Argues against lightweight cores, which do not amortize the incremental cost they add to the interconnect

25 SBF: Power Overhead
- Power overhead can be significant for a large number of cores

26 SBF: Power Overhead
- Power due to queues is greater than that due to wires!
- A good interconnect architecture should have efficient queuing and flow control

27 SBF: Performance
- Interconnect overhead can be significant: 10-26%!
- The interconnect accounts for over half the latency to the L2 cache

28 Shared Caches and Crossbar
- Results for the 8-core processor with 2-way, 4-way and full sharing of the L2 cache
- Results are shown for two cases:
  - The crossbar sits between the cores and the L2: easy interfacing, but all wiring tracks result in area overhead
  - The crossbar is routed over the L2: interfacing is difficult, and the area overhead is only due to reduced cache density

29 Crossbar: Area Overhead
- 11-46% of the die for a 2X implementation!
- Sufficient to put 4 more cores, even for 4-way sharing!

30 Crossbar: Area Overhead (contd.)
- What is the point of cache sharing? Cores get the effect of having more cache space
- But we have to reduce the size of the shared cache to accommodate the crossbar
  - Is a larger cache through sharing an illusion, or can we really get larger caches by making them private and reclaiming the area used by the crossbar?
- In other words, does sharing have any benefit in such scenarios?

31 Crossbar: Performance Overhead

32 Accompanying Grain of Salt
- A simplified interconnection model is assumed
- Systems with memory scouts, etc. may have different memory-system requirements
- Non-Uniform Caches (NUCA) might improve performance
- Etc.
- However, the results do show that a shared cache is significantly less desirable for future technologies

33 What have we learned so far? (in terms of bottlenecks)

34 Interconnection Bottlenecks (and possible solutions)
- Long wires result in long latencies: see if wires can be shortened
- Centralized arbitration: see if arbitration can be distributed
- Overheads get worse with the number of modules connected to a bus: see if the number of modules connected to a bus can be decreased

35 A Hierarchical Interconnect
[Figure: eight cores with their NCUs and L2 data arrays, together with the MC and IOX blocks, all connected to a single SBF]

36 A Hierarchical Interconnect
[Figure: the same cores, NCUs, L2 data arrays, MC and IOX blocks, now split across two SBFs connected by a P2PL]
- A local and a remote SBF: smaller average-case latency, longer worst-case latency

37 A Hierarchical Interconnect (contd.)
- Threads need to be mapped intelligently, to increase the hit rate in the caches connected to the local SBF
- In some cases, even random mapping results in better performance, e.g. for the 8-core processor shown
- More research needs to be done on hierarchical interconnects; more description is in the paper

38 Conclusions
- Design choices for interconnects have a significant effect on the rest of the chip; they should be co-designed with the cores and caches
- Interconnection power and performance overheads can be almost as much logic-dominated as wire-dominated: don't think only about wires; arbitration, queuing and flow control are important
- Some common architectural beliefs (e.g. shared L2 caches) may not hold when interconnection overheads are accounted for; we should do careful interconnect modeling for our CMP research proposals
- A hierarchical bus structure can negate some of the interconnection performance cost

