1
Interconnections in Multi-core Architectures: Understanding Mechanisms, Overheads and Scaling
Rakesh Kumar (UCSD), Victor Zyuban (IBM), Dean Tullsen (UCSD)
2
A Naive Methodology for Multi-core Design
(figure: eight cores P0-P7, each paired with its own L2 bank L2_0-L2_7)
- Multi-core oblivious multi-core design!
- Clean, easy way to design
3
Goal of this Research
- Holistic design of multi-core architectures
- Naïve methodology is inefficient
- Demonstrated inefficiency for cores and proposed alternatives:
  - Single-ISA Heterogeneous Multi-core Architectures for Power [MICRO03]
  - Single-ISA Heterogeneous Multi-core Architectures for Performance [ISCA04]
  - Conjoined-core Chip Multiprocessing [MICRO04]
- What about interconnects?
  - How much can interconnects impact processor architecture?
  - Need to be co-designed with caches and cores?
4
Contributions
- We model the implementation of several interconnection mechanisms and topologies
  - Quantify various overheads
  - Highlight various tradeoffs
  - Study the scaling of overheads
- We show that several common architectural beliefs do not hold when interconnection overheads are properly accounted for
- We show that one cannot design a good interconnect in isolation from the CPU cores and memory design
- We propose a novel interconnection architecture which exploits behaviors identified by this research
5
Talk Outline
- Interconnection Models
  - Shared Bus Fabric (SBF)
  - Point-to-point links
  - Crossbar
- Modeling area, power and latency
- Evaluation Methodology
- SBF and Crossbar results
- Novel architecture
6
Shared Bus Fabric (SBF)
- On-chip equivalent of the system bus for snoop-based shared-memory multiprocessors
- We assume a MESI-like snoopy write-invalidate protocol with write-back L2s
- SBF needs to support several coherence transactions (request, snoop, response, data transfer, invalidates, etc.)
- Also needs to arbitrate access to the corresponding buses
7
Shared Bus Fabric (SBF) Book- keeping AB SB RB DB D-arbA-arb L2 Core (incl. I$/D$) Arbiters Queues Buses (pipelined, unidirectional) Control Wires (Mux controls, flow-control, request/grant signals) Details about latencies, overheads etc. in the paper
8
Point-to-point Link (P2PL)
- If there are multiple SBFs in the system, a point-to-point link connects two SBFs
- Needs queues and arbiters similar to an SBF
- Multiple SBFs might be required in the system:
  - To increase bandwidth
  - To decrease signal latencies
  - To ease floorplanning
9
Crossbar Interconnection System
- If two or more cores share an L2, as many present CMPs do, a crossbar provides a high-bandwidth connection
10
Crossbar Interconnection System Core L2 bank AB (one per core) DoutB(one per core) DinB(one per bank) Loads, stores,prefetches, TLB misses Data Writebacks Data reloads, invalidate addresses
11
Talk Outline
- Interconnection Models
  - Shared Bus Fabric (SBF)
  - Point-to-point links
  - Crossbar
- Modeling area, power and latency
- Evaluation Methodology
- SBF and Crossbar results
- Novel architecture
12
Wiring Area Overhead
(figure: repeaters and latches placed along wires routed over a memory array)
13
Wiring Area Overhead (65nm)

Metal Plane   Effective Pitch (um)   Repeater Spacing (mm)   Repeater Width (um)   Latch Spacing (mm)   Latch Height (um)
1X            0.5                    0.4                                           1.5                  120
2X            1.0                    0.8                                           3.0                  60
4X            2.0                    1.6                                           5.0                  30
8X            4.0                    3.2                                           8.0                  15

- Overheads change based on the metal plane(s) the interconnects are mapped to
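As an illustration of how these parameters turn into area, here is a back-of-the-envelope sketch (my own reading of the table, not the paper's model): a bus occupies one pitch per wire over its length, plus the footprint of the latches and repeaters inserted along it. The repeater width is left as a parameter since its value is not given above, and the example bus width and length are made up.

```python
# Back-of-the-envelope sketch of a wiring area model implied by this table
# (a rough reading, not the paper's exact equations).
def bus_area_mm2(n_wires, length_mm, pitch_um, latch_spacing_mm, latch_height_um,
                 repeater_spacing_mm=None, repeater_width_um=0.0):
    # Wiring tracks: every wire occupies one pitch across the full bus length.
    track_area = n_wires * pitch_um * 1e-3 * length_mm           # mm^2
    # Latches: one per wire every latch_spacing; footprint ~ latch_height x pitch.
    n_latches = n_wires * int(length_mm / latch_spacing_mm)
    latch_area = n_latches * latch_height_um * pitch_um * 1e-6   # mm^2
    # Repeaters: one per wire every repeater_spacing (width is a placeholder).
    rep_area = 0.0
    if repeater_spacing_mm:
        n_reps = n_wires * int(length_mm / repeater_spacing_mm)
        rep_area = n_reps * repeater_width_um * pitch_um * 1e-6  # mm^2
    return track_area + latch_area + rep_area

# Example: a 128-bit bus spanning 20 mm on the 4X plane (2.0 um pitch).
print(bus_area_mm2(128, 20, 2.0, 5.0, 30, repeater_spacing_mm=1.6))
```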
14
Wiring Power Overhead
- Dynamic dissipation in wires, repeaters and latches
  - Wire capacitance 0.02 pF/mm, frequency 2.5 GHz, 1.1 V
  - Repeater capacitance 30% of wire capacitance
  - Dynamic power 0.05 mW per latch
- Leakage in repeaters and latches
  - Channel and gate leakage values in the paper
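The constants above plug directly into the standard switching-power expression. The sketch below uses the slide's numbers where given; the activity factor, bus width, length, and latch count are placeholders, and leakage is omitted since those values appear only in the paper.

```python
# Rough sketch of the dynamic-power terms from this slide (activity factor and
# the example bus dimensions are placeholders; the slide's constants are used where given).
C_WIRE_PF_PER_MM = 0.02     # wire capacitance
REPEATER_CAP_FRACTION = 0.30
FREQ_HZ = 2.5e9
VDD = 1.1
LATCH_DYN_MW = 0.05         # dynamic power per latch

def bus_dynamic_power_mw(n_wires, length_mm, n_latches_per_wire, activity=0.5):
    c_wire_f = C_WIRE_PF_PER_MM * 1e-12 * length_mm          # F per wire
    c_total_f = c_wire_f * (1.0 + REPEATER_CAP_FRACTION)     # wire + repeaters
    p_wire_w = activity * c_total_f * VDD ** 2 * FREQ_HZ     # per wire, switching
    p_latch_mw = n_latches_per_wire * LATCH_DYN_MW           # per wire, latches
    return n_wires * (p_wire_w * 1e3 + p_latch_mw)           # mW for the whole bus

# Example: 128-bit bus, 20 mm long, latched every 5 mm (4 latches per wire).
print(bus_dynamic_power_mw(128, 20, 4))
```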
15
Wiring Latency Overhead
- Latency of signals traveling through latches
- Latency also from the travel of control between a central arbiter and the interfaces corresponding to the request/data queues
- Hence, latencies depend on the location of the particular core, cache, or arbiter
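A hedged sketch of how these pieces might combine into a per-request latency: the signal advances one latch-to-latch segment per cycle, and the control round trip to a central arbiter depends on how far the requesting unit sits from it. The one-segment-per-cycle assumption, the distances, and the fixed arbitration cycles are illustrative choices, not the paper's model.

```python
import math

# Sketch of how per-request latency might be assembled from the pieces the slide
# lists (distances and the one-cycle-per-latch-segment assumption are illustrative).
def wire_cycles(distance_mm, latch_spacing_mm):
    # A signal advances one latch-to-latch segment per clock.
    return math.ceil(distance_mm / latch_spacing_mm)

def request_latency_cycles(unit_to_arbiter_mm, arbiter_to_target_mm,
                           latch_spacing_mm, arbitration_cycles=2):
    request = wire_cycles(unit_to_arbiter_mm, latch_spacing_mm)   # request reaches arbiter
    grant   = request                                             # grant travels back
    transfer = wire_cycles(arbiter_to_target_mm, latch_spacing_mm)
    return request + arbitration_cycles + grant + transfer

# A core far from the arbiter sees a longer round trip than a nearby one.
print(request_latency_cycles(12.0, 8.0, 3.0), request_latency_cycles(3.0, 8.0, 3.0))
```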
16
Interconnect-Related Logic Overhead
- Arbiters, muxes and queues constitute interconnect-related logic
- Area and power overhead primarily due to queues
  - Assumed to be implemented using latches
- Performance overhead due to wait time in the queues and arbitration latencies
  - Arbitration overhead increases with the number of connected units
  - Latching required between different stages of arbitration
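The two performance effects named here can be illustrated with a toy model: arbitration organized in latched stages whose count grows with the number of connected units, and queue wait approximated with an M/M/1 queue. Both the stage fan-in and the M/M/1 simplification are stand-ins, not the modeling used in the paper.

```python
# Illustrative model of the logic overheads listed on this slide (the fan-in per
# arbitration stage and the M/M/1 queue are simplifications, not the paper's model).
def arbitration_cycles(n_units, fanin_per_stage=4, cycles_per_stage=1):
    # More connected units -> more arbitration stages, with latching between stages.
    stages, capacity = 1, fanin_per_stage
    while capacity < n_units:
        stages += 1
        capacity *= fanin_per_stage
    return stages * cycles_per_stage

def mean_queue_wait_cycles(arrival_rate, service_rate):
    # M/M/1 mean waiting time (in cycles) as a stand-in for wait time in the request queues.
    assert arrival_rate < service_rate, "queue must be stable"
    return arrival_rate / (service_rate * (service_rate - arrival_rate))

for n in (4, 8, 16):
    print(n, "units ->", arbitration_cycles(n), "arbitration cycle(s)")
print("queue wait:", mean_queue_wait_cycles(0.05, 0.125), "cycles")
```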
17
Talk Outline
- Interconnection Models
  - Shared Bus Fabric (SBF)
  - Point-to-point links
  - Crossbar
- Modeling area, power and latency
- Evaluation Methodology
- SBF and Crossbar results
- Novel architecture
18
Modeling Multi-core Architectures
- Stripped-down version of POWER4-like cores, 10 mm^2, 10 W each
- Evaluated 4, 8 and 16-core multiprocessors occupying roughly 400 mm^2
- A CMP consists of cores, L2 banks, memory controllers, DMA controllers and non-cacheable units
- Weak consistency model, MESI-like coherence
- All studies done for 65nm
19
Floorplans for 4, 8 and 16-core processors [assuming private caches]
(figure: floorplans showing cores, L2 data, L2 tags, SBFs, P2P links, non-cacheable units (NCU), memory controllers (MC) and IOX units)
- Note that there are two SBFs for the 16-core processor
20
Performance Modeling
- Used a combination of detailed functional simulation and queuing simulation
- Functional simulator
  - Input: SMP traces (TPC-C, TPC-W, TPC-H, Notesbench, etc.)
  - Output: Coherence statistics for the modeled memory/interconnection system
- Queuing simulator
  - Input: Coherence statistics, interconnection latencies, CPI of the modeled core assuming infinite L2
  - Output: System CPI
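A heavily simplified sketch of the second step: charge each coherence/memory event observed by the functional simulator with an average interconnect-inclusive latency and add the resulting stall component to the infinite-L2 CPI. The event categories, rates, and latencies below are made-up placeholders; the real queuing simulator is far more detailed.

```python
# Very simplified version of the two-step flow described on this slide
# (the "coherence statistics" and latency numbers are illustrative, not measured).
def system_cpi(cpi_inf_l2, events_per_instr, latency_cycles):
    """events_per_instr: memory/coherence event rates per instruction.
    latency_cycles: average interconnect-inclusive latency charged to each event."""
    stall = sum(rate * latency_cycles[ev] for ev, rate in events_per_instr.items())
    return cpi_inf_l2 + stall

coherence_stats = {          # per instruction (made-up numbers)
    "l2_miss_to_memory": 0.002,
    "cache_to_cache_transfer": 0.001,
    "snoop_upgrade": 0.0005,
}
latencies = {                # average cycles, including interconnect queuing
    "l2_miss_to_memory": 400,
    "cache_to_cache_transfer": 120,
    "snoop_upgrade": 60,
}
print(system_cpi(1.0, coherence_stats, latencies))   # e.g. 1.0 -> ~1.95
```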
21
Results…finally!
22
SBF: Wiring Area Overhead
- Area overhead can be significant: 7-13% of die area
- Sufficient to place 3-5 extra cores, 4-6 MB of extra cache!
- Co-design needed: more cores, more cache or more interconnect bandwidth?
- Observed a scenario where decreasing bandwidth improved performance
23
SBF: Wiring Area Overhead
- Control overhead is 37%-63% of the total overhead
- Constrains how much area can be reduced with narrower buses
24
SBF: Wiring Area Overhead
- Argues against lightweight cores: they do not amortize the incremental cost to the interconnect
25
SBF: Power Overhead
- Power overhead can be significant for a large number of cores
26
SBF: Power Overhead
- Power due to queues exceeds that due to wires!
- A good interconnect architecture should have efficient queuing and flow-control
27
SBF: Performance
- Interconnect overhead can be significant: 10-26%!
- Interconnect accounts for over half the latency to the L2 cache
28
Shared Caches and Crossbar
- Results for the 8-core processor
- 2-way, 4-way and full sharing of the L2 cache
- Results are shown for two cases:
  - When the crossbar sits between the cores and the L2: easy interfacing, but all wiring tracks result in area overhead
  - When the crossbar is routed over the L2: interfacing is difficult, but area overhead is only due to reduced cache density
29
Crossbar: Area Overhead
- 11-46% of the die for a 2X implementation!
- Sufficient to put 4 more cores, even for 4-way sharing!
30
Crossbar: Area Overhead (contd)
- What is the point of cache sharing?
  - Cores get the effect of having more cache space
  - But we have to reduce the size of the shared cache to accommodate the crossbar
- Are larger caches through sharing an illusion, or can we really have larger caches by making them private and reclaiming the area used by the crossbar?
- In other words, does sharing have any benefit in such scenarios?
31
Crossbar: Performance Overhead
32
Accompanying grain of salt
- Simplified interconnection model assumed
- Systems with memory scouts etc. may have different memory system requirements
- Non-Uniform Caches (NUCA) might improve performance
- Etc., etc.
- However, the results do show that a shared cache is significantly less desirable for future technologies
33
What have we learned so far? (in terms of bottlenecks)
34
Interconnection bottlenecks (and possible solutions)
- Long wires result in long latencies
  - See if wires can be shortened
- Centralized arbitration
  - See if arbitration can be distributed
- Overheads get worse with the number of modules connected to a bus
  - See if the number of modules connected to a bus can be decreased
35
A Hierarchical Interconnect
(figure: an 8-core layout with four groups of cores, NCUs and L2 data banks, all connected by a single SBF along with the IOX and memory controller (MC) units)
36
A Hierarchical Interconnect
- A local and a remote SBF (smaller average-case latency, longer worst-case latency)
(figure: the same 8-core layout, with the core/NCU/L2 groups on a local SBF connected through a P2P link (P2PL) to a remote SBF carrying the IOX and memory controller units)
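A small sketch of the average-vs-worst-case tradeoff stated above: traffic that stays on the local SBF is fast, while traffic that must cross the P2P link to the remote SBF pays extra. All cycle counts and the local-traffic fraction are invented purely to illustrate the point (and why thread mapping matters, as the next slide discusses).

```python
# Sketch of the average/worst-case latency tradeoff mentioned on this slide
# (all latency numbers are placeholders chosen only to illustrate the point).
def avg_latency(local_fraction, local_cycles, p2pl_cycles, remote_cycles):
    remote = local_cycles + p2pl_cycles + remote_cycles   # must cross the P2P link too
    return local_fraction * local_cycles + (1 - local_fraction) * remote

flat_bus = 30                                   # single long SBF
hierarchical = avg_latency(0.8, 15, 6, 15)      # mostly-local traffic wins...
hierarchical_bad = avg_latency(0.3, 15, 6, 15)  # ...poorly mapped threads lose
print(flat_bus, hierarchical, hierarchical_bad)
```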
37
A Hierarchical Interconnect (contd)
- The threads need to be mapped intelligently
  - To increase the hit rate in caches connected to the local SBF
- For some cases, even random mapping results in better performance
  - E.g. for the 8-core processor shown
- More research needs to be done for hierarchical interconnects
- More description in the paper
38
Conclusions
- Design choices for interconnects have a significant effect on the rest of the chip
  - Should be co-designed with cores and caches
- Interconnection power and performance overheads can be almost as much logic-dominated as wire-dominated
  - Don't think about wires only: arbitration, queuing and flow-control are important
- Some common architectural beliefs (e.g. shared L2 caches) may not hold when interconnection overheads are accounted for
  - We should do careful interconnect modeling for our CMP research proposals
- A hierarchical bus structure can negate some of the interconnection performance cost