Interconnections in Multi-core Architectures: Understanding Mechanisms, Overheads and Scaling Rakesh Kumar (UCSD) Victor Zyuban (IBM) Dean Tullsen (UCSD)


A Naive Methodology for Multi-core Design
[Figure: floorplan with eight cores P0-P7, each paired with a private L2 bank L2_0-L2_7]
Multi-core oblivious multi-core design! Clean, easy way to design

Goal of this Research
 Holistic design of multi-core architectures – the naïve methodology is inefficient
 We demonstrated the inefficiency for cores and proposed alternatives: Single-ISA Heterogeneous Multi-core Architectures for Power [MICRO'03], Single-ISA Heterogeneous Multi-core Architectures for Performance [ISCA'04], Conjoined-core Chip Multiprocessing [MICRO'04]
 What about interconnects? How much can interconnects impact processor architecture? Do they need to be co-designed with caches and cores?

Contributions
 We model the implementation of several interconnection mechanisms and topologies: quantify various overheads, highlight various tradeoffs, study the scaling of overheads
 We show that several common architectural beliefs do not hold when interconnection overheads are properly accounted for
 We show that one cannot design a good interconnect in isolation from the CPU cores and memory design
 We propose a novel interconnection architecture which exploits behaviors identified by this research

Talk Outline  Interconnection Models  Shared Bus Fabric (SBF)  Point-to-point links  Crossbar  Modeling area, power and latency  Evaluation Methodology  SBF and Crossbar results  Novel architecture

Shared Bus Fabric (SBF)
 On-chip equivalent of the system bus of snoop-based shared-memory multiprocessors; we assume a MESI-like snoopy write-invalidate protocol with write-back L2s (a protocol sketch follows below)
 The SBF needs to support several coherence transactions (request, snoop, response, data transfer, invalidates, etc.) and also needs to arbitrate access to the corresponding buses
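To make the transaction types above concrete, here is a minimal, hedged sketch of a textbook MESI write-invalidate state machine in Python; it illustrates the kinds of bus requests, snoops and invalidates the SBF must carry, and is not the paper's exact protocol implementation.

```python
# Textbook MESI write-invalidate transitions (a sketch, not the paper's
# exact protocol): (current_state, event) -> (next_state, bus_action).
MESI = {
    ("I", "cpu_read"):        ("S", "bus_read"),        # may go to E if no sharers
    ("I", "cpu_write"):       ("M", "bus_read_excl"),
    ("S", "cpu_write"):       ("M", "bus_invalidate"),
    ("E", "cpu_write"):       ("M", None),              # silent upgrade
    ("E", "snoop_read"):      ("S", None),
    ("S", "snoop_read_excl"): ("I", None),
    ("M", "snoop_read"):      ("S", "writeback_data"),  # supply dirty data
    ("M", "snoop_read_excl"): ("I", "writeback_data"),
}

def next_state(state, event):
    """Return (next_state, bus_action); unknown events leave the line unchanged."""
    return MESI.get((state, event), (state, None))

print(next_state("M", "snoop_read"))   # -> ('S', 'writeback_data')
```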

Shared Bus Fabric (SBF)
[Figure: cores (incl. I$/D$) and L2 banks attached to the address, snoop, response and data buses (AB, SB, RB, DB), with address and data arbiters (A-arb, D-arb) and book-keeping logic]
Components: arbiters, queues, buses (pipelined, unidirectional), control wires (mux controls, flow control, request/grant signals)
Details about latencies, overheads etc. in the paper

Point-to-point Link (P2PL)
 If there are multiple SBFs in the system, a point-to-point link connects two SBFs; it needs queues and arbiters similar to those of an SBF
 Multiple SBFs might be required in the system: to increase bandwidth, to decrease signal latencies, to ease floorplanning

Crossbar Interconnection System
 If two or more cores share an L2, as many present CMPs do, a crossbar provides a high-bandwidth connection between the cores and the L2 banks

Crossbar Interconnection System
[Figure: cores connected to L2 banks through an address bus AB (one per core), a data-out bus DoutB (one per core) and a data-in bus DinB (one per bank)]
AB carries loads, stores, prefetches and TLB misses; DoutB carries data writebacks; DinB carries data reloads and invalidate addresses
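A rough sketch of why crossbar wiring grows so quickly with sharing degree: every core adds an address and a data-out bus, every bank adds a data-in bus, and each bus needs a full set of wiring tracks across the structure it spans. The bus widths, wire pitch and span below are illustrative assumptions, not the paper's parameters.

```python
# Crossbar wiring-area back-of-envelope (illustrative parameters only).
def crossbar_wire_area_mm2(n_cores, n_banks, span_mm,
                           addr_bits=64,     # assumed address-bus width
                           data_bits=256,    # assumed data-bus width
                           pitch_um=1.0):    # assumed wire pitch
    # one AB + one DoutB per core, one DinB per bank
    tracks = n_cores * (addr_bits + data_bits) + n_banks * data_bits
    return tracks * pitch_um * 1e-3 * span_mm   # track pitch (mm) x length (mm)

# 8 cores fully sharing 8 L2 banks across a 20 mm span
print(f"{crossbar_wire_area_mm2(8, 8, 20):.0f} mm^2")   # ~92 mm^2
```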

Talk Outline  Interconnection Models  Shared Bus Fabric (SBF)  Point-to-point links  Crossbar  Modeling area, power and latency  Evaluation Methodology  SBF and Crossbar results  Novel architecture

Wiring Area Overhead
[Figure: bus wires routed over a memory array, with repeaters and latches inserted periodically along the wires]

Wiring Area Overhead (65nm)
[Table: effective pitch (um), repeater spacing (mm), repeater width (um), latch spacing (mm) and latch height (um) for each metal plane (1X, 2X, 4X, 8X); the numeric entries are not preserved here]
Overheads change based on the metal plane(s) the interconnects are mapped to
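As a hedged illustration of how those table parameters translate into area, the sketch below adds up wire tracks at the plane's effective pitch plus repeater and latch area dropped along the bus; the numeric defaults are placeholders standing in for the omitted 65nm entries, not the paper's values.

```python
# Bus wiring-area estimate from per-plane parameters (placeholder values).
def bus_area_mm2(length_mm, width_bits,
                 pitch_um=1.0,              # effective pitch (placeholder)
                 repeater_spacing_mm=1.0, repeater_width_um=2.0,
                 latch_spacing_mm=5.0, latch_height_um=10.0):
    wire_area = length_mm * width_bits * pitch_um * 1e-3          # wire tracks
    n_repeaters = width_bits * int(length_mm // repeater_spacing_mm)
    n_latches = width_bits * int(length_mm // latch_spacing_mm)
    repeater_area = n_repeaters * repeater_width_um * pitch_um * 1e-6
    latch_area = n_latches * latch_height_um * pitch_um * 1e-6
    return wire_area + repeater_area + latch_area                 # mm^2

# Example: a 20 mm long, 128-bit bus on a hypothetical 1X-like plane
print(f"{bus_area_mm2(20, 128):.2f} mm^2")
```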

Wiring Power Overhead
 Dynamic dissipation in wires, repeaters and latches: wire capacitance 0.02 pF/mm, frequency 2.5 GHz, 1.1 V; repeater capacitance 30% of wire capacitance; dynamic power 0.05 mW per latch
 Leakage in repeaters and latches: channel and gate leakage values in the paper
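A back-of-envelope dynamic-power calculation using the per-unit numbers quoted above; the switching activity factor and latch spacing are illustrative assumptions rather than values from the paper.

```python
# Dynamic power of a repeated, latched bus wire (slide numbers + assumptions).
C_WIRE_PER_MM = 0.02e-12    # F/mm (from the slide)
REPEATER_FACTOR = 0.30      # repeater cap = 30% of wire cap (from the slide)
FREQ_HZ = 2.5e9             # from the slide
VDD = 1.1                   # volts (from the slide)
P_LATCH_W = 0.05e-3         # dynamic power per latch (from the slide)

def bus_dynamic_power_mw(length_mm, width_bits,
                         activity=0.5,           # assumed switching factor
                         latch_spacing_mm=5.0):  # assumed latch spacing
    cap_per_bit = C_WIRE_PER_MM * (1 + REPEATER_FACTOR) * length_mm
    p_wires = activity * cap_per_bit * VDD**2 * FREQ_HZ * width_bits
    n_latches = width_bits * int(length_mm // latch_spacing_mm)
    return (p_wires + n_latches * P_LATCH_W) * 1e3   # watts -> mW

# Example: a 20 mm, 128-bit data bus
print(f"{bus_dynamic_power_mw(20, 128):.0f} mW")   # ~126 mW under these assumptions
```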

Wiring Latency Overhead
 Latency of signals traveling through latches
 Latency also from the travel of control signals between a central arbiter and the interfaces of the request/data queues
 Hence, latency depends on the location of the particular core, cache or arbiter
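A minimal sketch of how this distance-dependent latency might be counted: one cycle per latch crossed, plus a request/grant round trip to a central arbiter. Latch spacing, arbiter distance and arbitration delay are illustrative assumptions.

```python
import math

def wire_latency_cycles(distance_mm, latch_spacing_mm=5.0):
    """Cycles spent purely in latches along a pipelined wire (assumed spacing)."""
    return math.ceil(distance_mm / latch_spacing_mm)

def transaction_latency_cycles(core_to_arbiter_mm, core_to_target_mm,
                               arbitration_cycles=3):   # assumed arbiter delay
    # request/grant round trip to the arbiter, then the actual transfer
    return (2 * wire_latency_cycles(core_to_arbiter_mm)
            + arbitration_cycles
            + wire_latency_cycles(core_to_target_mm))

# A core 12 mm from the arbiter accessing an L2 bank 18 mm away
print(transaction_latency_cycles(12, 18))   # -> 2*3 + 3 + 4 = 13 cycles
```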

Interconnect-Related Logic Overhead
 Arbiters, muxes and queues constitute interconnect-related logic
 Area and power overhead primarily due to queues (assumed to be implemented using latches)
 Performance overhead due to wait time in the queues and arbitration latencies: arbitration overhead increases with the number of connected units, and latching is required between different stages of arbitration
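To illustrate the last point, here is a hedged sketch of how multi-stage arbitration latency could grow with the number of connected units when requesters are arbitrated in groups with a latch between levels; the group size and per-stage delay are assumptions, not the paper's design.

```python
def arbitration_cycles(n_units, group_size=4, cycles_per_stage=1):
    """Cycles for a hierarchical arbiter over n_units requesters (assumed parameters)."""
    stages, remaining = 0, n_units
    while remaining > 1:
        remaining = (remaining + group_size - 1) // group_size   # one arbitration level
        stages += 1
    stages = max(stages, 1)
    # cycles_per_stage per level, plus one latch between consecutive levels
    return stages * cycles_per_stage + (stages - 1)

for n in (4, 8, 16, 32):
    print(n, "units ->", arbitration_cycles(n), "cycles")
```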

Talk Outline  Interconnection Models  Shared Bus Fabric (SBF)  Point-to-point links  Crossbar  Modeling area, power and latency  Evaluation Methodology  SBF and Crossbar results  Novel architecture

Modeling Multi-core Architectures
 Stripped-down Power4-like cores, 10 mm^2 and 10 W each
 Evaluated 4-, 8- and 16-core multiprocessors occupying roughly 400 mm^2
 A CMP consists of cores, L2 banks, memory controllers, DMA controllers and non-cacheable units
 Weak consistency model, MESI-like coherence
 All studies done at 65 nm
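A quick area-budget check for the 16-core configuration above; the L2 density figure is an illustrative assumption (roughly consistent with the area/cache equivalences quoted later in the deck), not a number from the paper.

```python
# Die-area budget for the 16-core CMP (assumed L2 density).
DIE_MM2, CORE_MM2, N_CORES = 400, 10, 16
MM2_PER_MB_L2 = 10                       # assumed 65nm L2 density

core_area = N_CORES * CORE_MM2           # 160 mm^2
remaining = DIE_MM2 - core_area          # 240 mm^2 for cache, controllers, interconnect
print(f"cores: {core_area} mm^2, remaining: {remaining} mm^2")
print(f"~{remaining / MM2_PER_MB_L2:.0f} MB of L2 if all remaining area were cache")
```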

Floorplans for 4, 8 and 16 core processors [assuming private caches]
[Figure: die floorplans showing the cores, L2 data and L2 tag arrays, SBFs, memory controllers (MC), non-cacheable units (NCU), I/O blocks (IOX) and a P2P link]
Note that there are two SBFs for the 16-core processor

Performance Modeling
 Used a combination of detailed functional simulation and queuing simulation
 Functional simulator – Input: SMP traces (TPC-C, TPC-W, TPC-H, NotesBench, etc.); Output: coherence statistics for the modeled memory/interconnection system
 Queuing simulator – Input: coherence statistics, interconnection latencies, CPI of the modeled core assuming an infinite L2; Output: system CPI
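A minimal sketch of the kind of composition the queuing simulator performs: start from the core CPI with an infinite L2 and add the cycles contributed by coherence transactions crossing the interconnect. The transaction rates, latencies and queuing delay below are illustrative, not measured values from the paper.

```python
def system_cpi(cpi_infinite_l2, transactions_per_instr, latencies, queue_delay):
    """transactions_per_instr and latencies are dicts keyed by transaction type."""
    penalty = sum(rate * (latencies[t] + queue_delay)
                  for t, rate in transactions_per_instr.items())
    return cpi_infinite_l2 + penalty

cpi = system_cpi(
    cpi_infinite_l2=1.2,                                      # assumed core CPI
    transactions_per_instr={"l2_miss": 0.01, "snoop": 0.02},  # assumed rates
    latencies={"l2_miss": 200, "snoop": 40},                  # assumed cycles
    queue_delay=5)                                            # assumed queuing delay
print(f"{cpi:.2f} CPI")   # -> 4.15 CPI under these assumptions
```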

Results…finally!

SBF: Wiring Area Overhead
Area overhead can be significant – 7-13% of the die area, sufficient to place 3-5 extra cores or 4-6 MB of extra cache!
Co-design needed: more cores, more cache, or more interconnect bandwidth? We observed a scenario where decreasing bandwidth improved performance

SBF: Wiring Area Overhead
Control overhead is 37%-63% of the total overhead, which constrains how much area can be reduced with narrower buses

SBF: Wiring Area Overhead
Argues against lightweight cores – they do not amortize the incremental cost of the interconnect

SBF: Power Overhead
Power overhead can be significant for a large number of cores

SBF: Power Overhead
Power due to queues exceeds that due to wires! A good interconnect architecture should have efficient queuing and flow control

SBF: Performance
Interconnect overhead can be significant – 10-26%! The interconnect accounts for over half of the latency to the L2 cache

Shared Caches and Crossbar
 Results for the 8-core processor with 2-way, 4-way and full sharing of the L2 cache
 Results are shown for two cases: when the crossbar sits between the cores and the L2 (easy interfacing, but all wiring tracks contribute to area overhead), and when the crossbar is routed over the L2 (interfacing is difficult, but the area overhead is only due to reduced cache density)

Crossbar: Area Overhead
11-46% of the die for a 2X implementation! Sufficient to put in 4 more cores, even for 4-way sharing!

Crossbar: Area Overhead (contd)
 What is the point of cache sharing? Cores get the effect of having more cache space
 But we have to reduce the size of the shared cache to accommodate the crossbar: are larger caches through sharing an illusion, or can we really have larger caches by making them private and reclaiming the area used by the crossbar?
 In other words, does sharing have any benefit in such scenarios?

Crossbar: Performance Overhead

Accompanying grain of salt
 A simplified interconnection model is assumed
 Systems with memory scouts etc. may have different memory system requirements
 Non-Uniform Cache Architectures (NUCA) might improve performance
 Etc., etc.
 However, the results do show that a shared cache is significantly less desirable for future technologies

What have we learned so far? (in terms of bottlenecks)

Interconnection bottlenecks (and possible solutions)
 Long wires result in long latencies – see if wires can be shortened
 Centralized arbitration – see if arbitration can be distributed
 Overheads get worse with the number of modules connected to a bus – see if the number of modules connected to a bus can be decreased

A Hierarchical Interconnect
[Figure: the starting organization – all core/NCU pairs and L2 data arrays, together with the memory controllers (MC) and I/O blocks (IOX), attached to a single SBF]

A Hierarchical Interconnect
A local and a remote SBF (smaller average-case latency, longer worst-case latency)
[Figure: the core/NCU pairs and L2 data arrays split across two SBFs connected by a P2P link, with the memory controllers (MC) and I/O blocks (IOX) on one of them]

A Hierarchical Interconnect (contd)
 Threads need to be mapped intelligently, to increase the hit rate in the caches connected to the local SBF (a simple mapping sketch follows below)
 In some cases even random mapping results in better performance, e.g. for the 8-core processor shown
 More research needs to be done on hierarchical interconnects; more description in the paper
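A hedged sketch of one way the "intelligent mapping" could work: greedily group the threads that share the most data onto cores attached to the same local SBF, so their coherence traffic stays local. The sharing matrix, cluster size and greedy heuristic are all illustrative assumptions; the paper does not prescribe this particular algorithm.

```python
def map_threads(sharing, cores_per_cluster=4):
    """Greedy clustering: sharing[i][j] is the sharing intensity between threads i and j."""
    unplaced, clusters = set(range(len(sharing))), []
    while unplaced:
        # seed each cluster with the heaviest remaining sharer
        seed = max(unplaced, key=lambda t: sum(sharing[t]))
        cluster = [seed]
        unplaced.remove(seed)
        while len(cluster) < cores_per_cluster and unplaced:
            # add the thread that shares most with the current cluster
            best = max(unplaced, key=lambda t: sum(sharing[t][c] for c in cluster))
            cluster.append(best)
            unplaced.remove(best)
        clusters.append(cluster)
    return clusters   # one list of thread ids per local SBF

# 8 threads forming two sharing groups, {0..3} and {4..7}
S = [[3 if i // 4 == j // 4 and i != j else 1 for j in range(8)] for i in range(8)]
print(map_threads(S))   # e.g. [[0, 1, 2, 3], [4, 5, 6, 7]]
```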

Conclusions
 Design choices for interconnects have a significant effect on the rest of the chip; they should be co-designed with cores and caches
 Interconnection power and performance overheads can be as much logic-dominated as wire-dominated: don't think about wires only – arbitration, queuing and flow control are important
 Some common architectural beliefs (e.g. shared L2 caches) may not hold when interconnection overheads are accounted for; we should do careful interconnect modeling in our CMP research proposals
 A hierarchical bus structure can negate some of the interconnection performance cost