Memory-centric System Interconnect Design with Hybrid Memory Cubes
Gwangsun Kim, John Kim (Korea Advanced Institute of Science and Technology)
Jung Ho Ahn, Jaeha Kim (Seoul National University)

Memory Wall
- Core count: Moore's law, 2x every 18 months.
- Pin count: ITRS roadmap, about 10% per year.
- Core-count growth >> memory bandwidth growth.
- Memory bandwidth can continue to be the bottleneck.
- Capacity, energy, and other issues as well. [Lim et al., ISCA'09]
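To make the gap concrete, here is a back-of-the-envelope sketch using only the growth rates quoted above (the 10-year horizon is an arbitrary choice for illustration):

    # Back-of-the-envelope sketch using the growth rates above:
    # core count doubles every 18 months, pin bandwidth grows ~10% per year.
    years = 10
    core_growth = 2 ** (12 * years / 18)   # ~100x more cores after 10 years
    pin_growth = 1.10 ** years             # ~2.6x more pin bandwidth
    print(f"per-core off-chip bandwidth shrinks ~{core_growth / pin_growth:.0f}x in {years} years")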

Hybrid Memory Cubes (HMCs)
- A solution to the memory bandwidth and energy challenges.
- An HMC stacks DRAM layers on a logic layer connected by TSVs; the logic layer holds the memory controllers, an internal interconnect, and high-speed I/O links to the processor.
- The HMC provides routing capability: an HMC is a router.
- Communication uses packetized high-level messages.
- Question: how should multiple CPUs and HMCs be interconnected?
Ref.: "Hybrid Memory Cube Specification 1.0," Hybrid Memory Cube Consortium.
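As a concrete picture of what "packetized high-level messages" means, here is a simplified request packet; the field names are invented for illustration and do not follow the HMC specification's actual packet format:

    # Simplified illustration of a packetized memory request (not the HMC
    # specification's real format; all field names here are made up).
    from dataclasses import dataclass

    @dataclass
    class MemRequestPacket:
        dest_cube: int    # which HMC the request targets
        command: str      # a high-level command such as a 64-byte read or write
        address: int      # DRAM address within the target cube
        tag: int          # matches the response back to the requester

    req = MemRequestPacket(dest_cube=3, command="READ64", address=0x1F4000, tag=42)
    print(req)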

Memory Network
(Figure: multiple CPUs interconnected through a network of HMCs, forming the memory network.)

Interconnection Networks
- On-chip networks: MIT RAW
- Supercomputers: Cray X1
- Router fabrics: Avici TSR
- I/O systems: Myrinet/InfiniBand
- Memory: this work

How Is It Different? (large-scale interconnection networks vs. the memory network)
- Nodes vs. routers: # nodes ≥ # routers, vs. # nodes < # routers (HMCs).
- Network organization: concentration, vs. distribution.
- Important bandwidth: bisection bandwidth, vs. CPU bandwidth.
- Cost: channels.
- Other memory-network differences: 1) the intra-HMC network, 2) the "routers" (HMCs) themselves generate traffic.

Conventional System Interconnect
- Intel QuickPath Interconnect / AMD HyperTransport.
- Different interfaces to memory and to other processors: a shared parallel bus to DRAM, and high-speed point-to-point links between CPUs.
(Figure: four CPUs, each with its own DRAM bus, connected to one another by P2P links.)

Adopting the Conventional Design Approach
- With HMCs, the CPU can use the same link interface for both memory and other CPUs.
- However, CPU bandwidth is statically partitioned between the two.
(Figure: four CPUs, each attached to its own HMCs over the same kind of links used for CPU-to-CPU connections.)

Bandwidth Usage Ratio Can Vary
- Ratio of QPI (coherence) traffic to local DRAM traffic for SPLASH-2, measured on a real quad-socket Intel Xeon system.
- The coherence-to-memory traffic ratio differs by roughly 2x across workloads.
- We propose the memory-centric network to achieve flexible CPU bandwidth utilization.

Contents  Background/Motivation  Design space exploration  Challenges and solutions  Evaluation  Conclusions

Leveraging the Routing Capability of the HMC
- Conventional design: CPU links are statically split between local HMC traffic and CPU-to-CPU (coherence) traffic, so each class can use only about 50% of the CPU bandwidth.
- Memory-centric design: all CPU links attach to HMCs, and coherence packets are routed to other CPUs through the memory network, so up to 100% of the CPU bandwidth can serve whichever traffic dominates.
- CPU bandwidth can be flexibly utilized for different traffic patterns.
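A minimal sketch of this bandwidth argument (the link count and per-link bandwidth are assumed values for illustration, not figures from the paper):

    # Achievable CPU injection bandwidth: links statically split between memory
    # and coherence traffic vs. links shared through the memory network.
    LINKS, BW_PER_LINK = 8, 10.0  # GB/s per link (assumed)

    def achievable_throughput(mem_fraction):
        total = LINKS * BW_PER_LINK
        # Static partitioning: half the links serve memory, half serve other CPUs,
        # so the more demanding traffic class limits the total injection rate.
        static = (total / 2) / max(mem_fraction, 1.0 - mem_fraction)
        # Memory-centric: all links are shared by both traffic classes.
        shared = total
        return static, shared

    for f in (0.5, 0.7, 0.9):
        s, m = achievable_throughput(f)
        print(f"memory fraction {f:.1f}: static {s:.0f} GB/s, shared {m:.0f} GB/s")

With a balanced mix the static split is harmless, but as the mix skews toward one class (as the SPLASH-2 measurements above suggest it does), the shared organization sustains up to twice the injection bandwidth.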

System Interconnect Design Space
- Processor-centric network (PCN): the network interconnects the CPUs, and HMCs attach to individual CPUs.
- Memory-centric network (MCN): the network interconnects the HMCs, and CPUs attach to the memory network.
- Hybrid network: combines both, with direct CPU-to-CPU links in addition to the memory network.

Interconnection Networks 101
(Figure: average packet latency vs. offered load, showing the zero-load latency at low load and the saturation throughput at high load.)
- Latency is addressed by the distributor-based network and the pass-thru microarchitecture.
- Throughput is addressed by the distributor-based network and adaptive (non-minimal) routing.
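For reference, the standard zero-load latency decomposition that the later slides target (a sketch; the numbers below are placeholders, not values from the paper):

    # Zero-load latency = hops * per-hop latency + serialization time.
    def zero_load_latency(hops, t_router_ns, t_serdes_ns, packet_bits, link_gbps):
        serialization_ns = packet_bits / link_gbps  # link_gbps in Gbit/s = bits per ns
        return hops * (t_router_ns + t_serdes_ns) + serialization_ns

    # Fewer hops (distributor-based topology) and a cheaper hop (pass-thru path)
    # both reduce the zero-load latency.
    print(zero_load_latency(hops=5, t_router_ns=4.0, t_serdes_ns=3.2, packet_bits=512, link_gbps=160))
    print(zero_load_latency(hops=3, t_router_ns=1.0, t_serdes_ns=0.0, packet_bits=512, link_gbps=160))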

Memory-centric Network Design Issues
- Key observation: # routers (HMCs) ≥ # CPUs.
- Directly applying standard topologies (mesh, flattened butterfly [ISCA'07], dragonfly [ISCA'08]) leads to a large network diameter (e.g., 5 hops across CPU groups in the dragonfly) and leaves CPU bandwidth underutilized.

Network Design Techniques
- Baseline: each CPU attaches to the network at a single point.
- Concentration: multiple CPUs share one network attachment point.
- Distribution: each CPU's channels are spread across multiple HMCs in the network.

Distributor-based Dragonfly (Distributor-based Network)
- Distribute each CPU's channels to multiple HMCs:
  - Better utilize CPU channel bandwidth.
  - Reduce the network diameter (e.g., from 5 hops in the baseline dragonfly [ISCA'08] to 3 hops).
- Problem: per-hop latency can be high.
  - Per-hop latency = SerDes latency + intra-HMC network latency.
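A minimal sketch of the channel-distribution idea; the counts and the round-robin assignment are illustrative assumptions, not the exact wiring of the distributor-based dragonfly:

    # Spread each CPU's channels across many HMC groups instead of concentrating
    # them on a single attachment point (illustrative wiring only).
    NUM_CPUS, CHANNELS_PER_CPU, NUM_HMC_GROUPS = 4, 8, 8

    def distributed_attachment():
        cpu_links = {}
        for cpu in range(NUM_CPUS):
            # Each channel of this CPU enters the memory network at a different
            # HMC group, so more of the network is reachable in few hops.
            cpu_links[cpu] = [(cpu + ch) % NUM_HMC_GROUPS for ch in range(CHANNELS_PER_CPU)]
        return cpu_links

    print(distributed_attachment())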

Reducing Latency: Pass-thru Microarchitecture
- Reduce per-hop latency for CPU-to-CPU packets.
- Place two I/O ports near each other and provide a pass-thru path between them, avoiding serialization/deserialization.
(Figure: input port A and output port B; besides the normal datapath through the deserializer (DES) and serializer (SER) running at the 5 GHz Rx/Tx clocks, a pass-thru path connects the two ports directly. The HMC diagram also shows the stacked DRAM, memory controllers, and channels to the I/O ports.)
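A rough per-hop latency comparison of what the pass-thru path saves (the latency values are placeholders for illustration, not measurements from the paper):

    # Normal path pays deserialize + intra-HMC routing + serialize per hop,
    # while the pass-thru path forwards between adjacent I/O ports directly.
    T_DES_NS, T_INTRA_HMC_NS, T_SER_NS = 3.2, 4.0, 3.2  # normal-path components (assumed)
    T_PASS_THRU_NS = 1.0                                 # direct port-to-port forwarding (assumed)

    normal_hop = T_DES_NS + T_INTRA_HMC_NS + T_SER_NS
    passthru_hop = T_PASS_THRU_NS
    print(f"normal hop: {normal_hop:.1f} ns, pass-thru hop: {passthru_hop:.1f} ns")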

Leveraging Adaptive Routing
- The memory network provides non-minimal paths.
- Hotspots can occur among HMCs; adaptive routing can route around them and improve throughput.
(Figure: the minimal path passes through a hotspot HMC, while a non-minimal path routes around it through other HMCs.)
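A sketch of a congestion-based adaptive decision in the spirit of UGAL-style routing; the exact rule used in the paper may differ:

    # Weigh queue occupancy by hop count and pick the cheaper path.
    def choose_path(min_hops, min_queue, nonmin_hops, nonmin_queue):
        minimal_cost = min_hops * min_queue          # estimated delay via the minimal path
        nonminimal_cost = nonmin_hops * nonmin_queue  # estimated delay via the detour
        return "minimal" if minimal_cost <= nonminimal_cost else "non-minimal"

    # Example: the minimal path crosses a hotspot HMC with a long queue,
    # so the longer but lightly loaded path is chosen.
    print(choose_path(min_hops=2, min_queue=30, nonmin_hops=4, nonmin_queue=5))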

Methodology  Workload –Synthetic traffic: request-reply pattern –Real workload: SPLASH-2  Performance –Cycle-accurate Pin-based simulator  Energy: –McPAT (CPU) + CACTI-3DD (DRAM) + Network energy  Configuration : –4CPU-64HMC system –CPU: 64 Out-of-Order cores –HMC: 4 GB, 8 layers x 16 vaults

Evaluated Configurations
- PCN: PCN with minimal routing.
- PCN+passthru: PCN with minimal routing and pass-thru enabled.
- Hybrid: hybrid network with minimal routing.
- Hybrid+adaptive: hybrid network with adaptive routing.
- MCN: MCN with minimal routing.
- MCN+passthru: MCN with minimal routing and pass-thru enabled.
These are representative configurations for this talk; a more thorough evaluation can be found in the paper.

Synthetic Traffic Result (CPU-to-Local-HMC)
- Each CPU sends requests to its directly connected HMCs.
- MCN provides significantly higher throughput (about 50% higher).
- The latency advantage depends on traffic load: at low load PCN+passthru is better, while at high load MCN is better.
(Figure: average transaction latency in ns vs. offered load.)

Synthetic Traffic Result (CPU-to-CPU)
- CPUs send requests to other CPUs.
- Pass-thru reduced latency for MCN by 27%.
- Throughput ordering: PCN < MCN+passthru < Hybrid+adaptive (gains of 20% and 62% marked in the figure).
- At low load PCN and Hybrid are better; at high load MCN is better.
(Figure: average transaction latency in ns vs. offered load.)

Real Workload Result: Performance
- Impact of the memory-centric network:
  - Latency-sensitive workloads see degraded performance.
  - Bandwidth-intensive workloads see improved performance.
- The hybrid network with adaptive routing provided comparable performance.
(Figure: normalized runtime, with per-workload differences between 7% and 33% marked on the bars.)

Real Workload Result: Energy
- MCN has more links than PCN, which increases power.
- However, the larger reduction in runtime yields a net energy reduction of 5.3%.
- MCN+passthru used 12% less energy than Hybrid+adaptive.
(Figure: normalized energy.)

Conclusions  Hybrid Memory Cubes (HMC) enable new opportunities for a “memory network” in system interconnect.  Distributor-based network proposed to reduce network diameter and efficiently utilize processor bandwidth  To improve network performance: –Latency : Pass-through uarch to minimize per-hop latency –Throughput : Exploit adaptive (non-minimal) routing  Intra-HMC network is another network that needs to be properly considered.