Manycore Network Interfaces for In-Memory Rack-Scale Computing Alexandros Daglis, Stanko Novakovic, Edouard Bugnion, Babak Falsafi, Boris Grot.

In-Memory Computing for High Performance
Tight latency constraints → keep data in memory. Need a large memory pool to accommodate all the data.

Nodes Frequently Access Remote Memory
Graph serving requires fine-grained access; graph analytics requires bulk access. Need fast access to both small and large objects in remote memory.

Rack-Scale Systems: Fast Access to Big Memory
Large memory capacity, low latency, high bandwidth: a vast memory pool in a small form factor. On-chip integrated, cache-coherent NIs and a high-performance inter-node interconnect form a NUMA environment. Examples: QPI, Scale-Out NUMA [Novakovic et al., '14]. Goals: low latency for fine-grained transfers, high bandwidth for bulk transfers.

Remote Memory Access in Rack-Scale Systems
Integrated NIs interact with cores via memory-mapped queues (latency-critical) and directly access the memory hierarchy to transfer data (bandwidth-critical). [Figure: a manycore chip's grid of cores, with the NI sitting between the cores and the network.]

Network Interface Design for Manycore Chips
NI placement and design are key for remote access performance, and obvious NI designs suffer from poor latency or bandwidth. This talk: a low-latency, high-bandwidth NI design for manycore chips. Contributions: an NI design space exploration for manycore chips; seamless integration of the NI into the chip's coherence domain; a novel Split NI design, optimizing for both latency and bandwidth.

Outline
Overview, Background, NI Design Space for Manycore Chips, Methodology & Results, Conclusion.

User-level Remote Memory Access
RDMA-like queue-pair (QP) model: cores and NIs communicate through cacheable memory-mapped queues, a Work Queue (WQ) and a Completion Queue (CQ). The core writes the WQ and polls the CQ; the NI polls the WQ and accesses remote memory directly over the inter-node network. soNUMA: remote memory access latency ≈ 4x of local access.
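The WQ/CQ handshake above can be sketched as a toy model. The entry fields and method names here are illustrative, not soNUMA's actual queue format:

```python
from collections import deque

class QueuePair:
    """Toy model of the RDMA-like queue-pair (QP) interface: the core
    posts Work Queue (WQ) entries and polls the Completion Queue (CQ)."""
    def __init__(self):
        self.wq = deque()   # work queue: core -> NI
        self.cq = deque()   # completion queue: NI -> core

    def post_read(self, node, addr, length):
        # Core side: write a WQ entry describing the remote read.
        self.wq.append({"op": "read", "node": node, "addr": addr, "len": length})

    def ni_step(self, remote_memory):
        # NI side: consume one WQ entry, fetch the data, write the CQ.
        if self.wq:
            e = self.wq.popleft()
            data = remote_memory[e["node"]][e["addr"]:e["addr"] + e["len"]]
            self.cq.append({"op": e["op"], "data": data})

    def poll_cq(self):
        # Core side: poll for a completion (None if nothing has arrived).
        return self.cq.popleft() if self.cq else None

# Usage: one remote read through the queue pair.
mem = {1: bytes(range(64))}          # emulated remote node 1
qp = QueuePair()
qp.post_read(node=1, addr=16, length=8)
qp.ni_step(mem)
print(qp.poll_cq()["data"])          # bytes 16..23 of node 1's memory
```

The point of the model is the decoupling: the core only ever touches the two cacheable queues, never the network directly.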

Implications of Manycore Chips on Remote Access
NI capabilities have to match the chip's communication demands: more cores → higher request rate. The straightforward approach is to scale the NI across the chip edge, close to the network pins and with access to the NoC's full bisection bandwidth. Caveat: a large average core-to-NI distance, so on-chip interactions have a significant impact on end-to-end latency.
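A back-of-the-envelope sketch of why the edge placement inflates latency, assuming (hypothetically) an 8x8 mesh with one NI per row attached at the left edge:

```python
# Average NoC hop count from each core to its row's edge NI.
# The 8x8 mesh and one-NI-per-row placement are assumptions of this
# sketch, not the paper's exact configuration.
N = 8
# A core at column x reaches the NI at its row's edge in x + 1 hops.
hops = [x + 1 for y in range(N) for x in range(N)]
avg = sum(hops) / len(hops)
print(avg)   # 4.5 hops on average, vs. ~1 for a collocated per-core NI
```

Every QP interaction pays this distance twice (request and response), which is what makes core-to-NI placement matter.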

Edge NI 4-Cache-Block Remote Read
The core writes the WQ; the NI reads the WQ and unrolls the request into cache-block-sized outgoing requests; incoming replies arrive; once the data transfer is completed, the QP interactions repeat to write the CQ, which the core then reads. QP interactions account for up to 50% of end-to-end latency. Bandwidth ✓, latency ✗.
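The unrolling step the NI performs can be sketched as follows. The 64-byte block size and the align-down handling are assumptions of this sketch:

```python
CACHE_BLOCK = 64  # bytes (assumed block size)

def unroll(addr, length, block=CACHE_BLOCK):
    """Split one remote read into cache-block-sized, block-aligned
    sub-requests, as the NI does before injecting them into the network."""
    reqs = []
    end = addr + length
    cur = addr - (addr % block)      # align down to a block boundary
    while cur < end:
        reqs.append((cur, block))
        cur += block
    return reqs

print(unroll(0, 256))   # a 4-cache-block read becomes 4 sub-requests
```

Each sub-request then travels and completes independently, which is why the NI must track when all of them have returned before writing the CQ.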

NI Design Space
[Figure: design space plotted as bandwidth vs. latency, showing the Edge NI relative to the target.]

Reducing the Latency of QP Interactions
Collocate NI logic per core, to localize all interactions. Still left with the coherence directory indirection.

Reducing the Latency of QP Interactions
Attach an NI cache at the core's L1 backside, similar to a write-back buffer: transparent to the coherence mechanism, with no modifications to the core's IP block. With QP interactions localized, the latency of multiple NoC traversals is avoided.

Per-Core NI 4-Cache-Block Remote Read
All QP interactions are local; data handling and outgoing requests originate at the core's own NI.

Per-Core NI 4-Cache-Block Remote Read
Incoming replies are received at the per-core NI; once all reply packets have arrived, the payload is written into the LLC and the CQ is written locally. Latency is minimized, but bandwidth is misused for large requests. Bandwidth ✗, latency ✓.
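A rough way to see the misuse, under the assumption that with a per-core NI each reply payload byte crosses the NoC roughly twice (chip edge to the core's NI, then NI to the LLC), while an edge NI deposits it into the LLC directly:

```python
# Back-of-the-envelope sketch (assumed traffic pattern, not measured data).
payload = 4 * 64                       # 4-cache-block read, in bytes
edge_traffic = payload                 # edge -> LLC: one payload traversal
per_core_traffic = 2 * payload         # edge -> core NI -> LLC: two traversals
print(per_core_traffic / edge_traffic) # 2.0x NoC payload traffic
```

The larger the transfer, the more this duplicated payload movement dominates, which is why large requests are better terminated at the edge.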

NI Design Space
[Figure: bandwidth-vs-latency design space, now showing both the Edge NI and the Per-core NI relative to the target.]

How to Have Your Cake and Eat It Too
Insight: QP interactions are best handled at the request's initiation location, while data handling is best handled at the chip's edge. Solution: the two can be decoupled and physically separated. The NI frontend handles QP interactions; the NI backend handles data and network packet handling; the two communicate via NoC packets. The Split NI addresses the shortcomings of both the Edge NI and the Per-core NI.
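The frontend/backend decoupling can be sketched as two halves exchanging plain messages over the NoC. All names, queues, and message formats here are illustrative, not the actual Split NI protocol:

```python
from collections import deque

# Toy split-NI model: a per-core frontend owns the QP interactions
# (latency-critical), an edge backend does the data/packet handling
# (bandwidth-critical), and the two exchange ordinary NoC messages.
noc_fwd, noc_bwd = deque(), deque()   # frontend -> backend, backend -> frontend
remote = {1: bytes(range(256))}       # emulated remote node 1

def frontend_post(req):
    # QP entry is digested locally at the core, then forwarded as a message.
    noc_fwd.append(req)

def backend_step():
    if noc_fwd:
        r = noc_fwd.popleft()
        data = remote[r["node"]][r["addr"]:r["addr"] + r["len"]]
        # The payload is deposited into the LLC at the edge; only a small
        # completion message travels back across the mesh to the frontend.
        noc_bwd.append({"done": True, "len": len(data)})

def frontend_poll():
    # Frontend writes the CQ locally once the completion message arrives.
    return noc_bwd.popleft() if noc_bwd else None

frontend_post({"node": 1, "addr": 0, "len": 256})
backend_step()
print(frontend_poll())   # {'done': True, 'len': 256}
```

The design choice the sketch captures: bulk payload never crosses the mesh to the core's tile, yet every core-visible QP interaction stays one hop away.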

Split NI 4-Cache-Block Remote Read
All QP interactions are local at the frontend, while the backend at the network router handles the data, outgoing requests, and incoming replies; once all reply packets are received, the CQ is written locally. Both low-latency small transfers and high-bandwidth large transfers. Bandwidth ✓, latency ✓.

NI Design Space
[Figure: bandwidth-vs-latency design space with the Edge NI, Per-core NI, and Split NI plotted against the target.]

Methodology
Case study with Scale-Out NUMA [Novakovic et al., '14], a state-of-the-art rack-scale architecture based on a QP model. Cycle-accurate simulation with Flexus: a single tiled 64-core chip is simulated, with the remote ends emulated; shared block-interleaved NUCA LLC with a distributed directory; mesh-based on-chip interconnect; 50ns DRAM latency. Remote Read microbenchmarks, with data exceeding LLC capacity.

Application Bandwidth
Split NI maximizes useful utilization of NoC bandwidth. [Figure: application bandwidth against the peak available bandwidth of 256 GBps.]

Latency Results for Single Network Hop
Split NI: remote memory access latency within 13% of ideal. [Figure: latency breakdown, with a network roundtrip latency of 70ns.]

Conclusion
NI design for manycore chips is crucial to remote memory access performance. Fast core-NI interaction is critical for low latency, while large transfers are best handled at the chip's edge, even in NUMA machines. The Per-Core NI minimizes the cost of core-NI interaction via seamless core-NI collocation, transparent to the chip's coherence. The Split NI delivers low-latency, high-bandwidth remote memory access with a non-intrusive design: no modifications to the core's IP block.

Thank you! Questions?