Multi-GPU System Design with Memory Networks

Presentation transcript:

Multi-GPU System Design with Memory Networks
Gwangsun Kim, Minseok Lee, Jiyun Jeong, John Kim
Department of Computer Science, Korea Advanced Institute of Science and Technology

Single-GPU Programming Pattern
[Diagram: data is copied from host memory to device memory, processed by the GPU, then copied back.]
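For reference, a minimal sketch of this pattern in CUDA; the kernel, sizes, and variable names are illustrative, not from the talk, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

// Illustrative kernel: double every element in place.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h_data = (float *)malloc(bytes);                     // host memory
    for (int i = 0; i < n; i++) h_data[i] = (float)i;

    float *d_data;
    cudaMalloc(&d_data, bytes);                                 // device memory
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // data in
    scale<<<(n + 255) / 256, 256>>>(d_data, n);                 // compute
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // data out

    cudaFree(d_data);
    free(h_data);
    return 0;
}
```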

Multi-GPU Programming Pattern
Problems:
1. Programming can be challenging.
2. Inter-GPU communication cost is high.
How should the data be placed across the device memories?
A. Split the data among the GPUs.
B. Duplicate the data on every GPU.
[Diagram: chunks 1-7 of the host-memory data being either split across or duplicated in the GPUs' device memories.]
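To make the programming burden concrete, here is a hedged sketch of option A (split) with today's explicit CUDA APIs; the even chunking, the 16-GPU cap, and the kernel are assumptions for illustration.

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 22;
    float *h_data = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) h_data[i] = (float)i;

    int ngpus;
    cudaGetDeviceCount(&ngpus);
    if (ngpus > 16) ngpus = 16;               // cap for this sketch
    const int chunk = n / ngpus;              // assumes n divides evenly
    float *d_data[16];

    // Option A (split): each GPU owns one contiguous chunk.
    for (int g = 0; g < ngpus; g++) {
        cudaSetDevice(g);
        cudaMalloc(&d_data[g], chunk * sizeof(float));
        cudaMemcpy(d_data[g], h_data + g * chunk, chunk * sizeof(float),
                   cudaMemcpyHostToDevice);
        scale<<<(chunk + 255) / 256, 256>>>(d_data[g], chunk);
    }
    // Any access to another GPU's chunk now needs an explicit transfer
    // over PCIe -- the costly inter-GPU communication the slide notes.
    for (int g = 0; g < ngpus; g++) {
        cudaSetDevice(g);
        cudaMemcpy(h_data + g * chunk, d_data[g], chunk * sizeof(float),
                   cudaMemcpyDeviceToHost);
        cudaFree(d_data[g]);
    }
    free(h_data);
    return 0;
}
```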

Hybrid Memory Cube (HMC)
[Diagram: DRAM layers stacked on a logic layer. The logic layer holds one vault controller per vault, connected through an intra-HMC network to I/O ports; packets enter and leave over high-speed links.]

Memory Network
Memory networks were proposed for multi-CPU systems [Kim et al., PACT'13].
[Diagram: CPUs and GPUs attached to a network of HMCs; each HMC routes packets through its logic layer between vault controllers, the intra-HMC network, I/O ports, and high-speed links.]

Related Work
- NVLink for the Nvidia Pascal architecture. Drawback: some processor bandwidth is dedicated to NVLink.
- SLI (Nvidia) and Crossfire (AMD): graphics only.
- Unified virtual addressing (UVA) from Nvidia: easy access to other GPUs' memory, but with restrictions on memory allocation.
[Diagram: CPU, IO hub, PCIe switches, and GPUs with memory, connected by NVLink.]
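For context, the UVA/peer-access path mentioned above looks roughly like this with the CUDA runtime API; the device IDs are illustrative.

```cuda
#include <cuda_runtime.h>

int main() {
    // With UVA, one virtual address space spans the host and all devices,
    // so a pointer value identifies which memory it points into.
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);  // can GPU 0 reach GPU 1?
    if (canAccess) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);       // GPU 0 may now dereference
                                                // pointers into GPU 1 memory
    }
    // Accesses still cross PCIe, and allocations must be made per device --
    // the allocation restriction the slide refers to.
    return 0;
}
```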

Contents
- Motivation
- Related work
- Inter-GPU communication: scalable kernel execution (SKE), GPU memory network (GMN) design
- CPU-GPU communication: unified memory network (UMN), overlay network architecture
- Evaluation
- Conclusion

GPU Memory Network Advantage
With a GPU memory network, inter-GPU traffic moves at memory bandwidth (288 GB/s) rather than PCIe bandwidth (15.75 GB/s), roughly 18x more, and the GPUs' separate physical address spaces become a single unified physical address space; PCIe is kept only as an optional path.

Scalable Kernel Execution (SKE)
- Executes an unmodified kernel on multiple GPUs: the original kernel targets a single virtual GPU, and the partitioned kernel runs across the physical GPUs.
- Requires GPUs to support partial execution of a kernel.
- Prior approaches rely on source transformation [Kim et al., PPoPP'11], [Lee et al., PACT'13], [Cabezas et al., PACT'14].
[Diagram: single-GPU execution versus multi-GPU execution with SKE.]

Scalable Kernel Execution Implementation
- The unmodified single-GPU application enqueues the original kernel's metadata into a virtual GPU command queue.
- The SKE runtime splits the 1D grid of thread blocks into block ranges (one per GPU) and enqueues the original kernel metadata plus a block range into each physical GPU's command queue (GPU0, GPU1, ...).
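Below is a software approximation of that block-range partitioning, assuming a 1D grid and using managed memory as a stand-in for the shared address space the memory network provides; SKE itself does this transparently, without the explicit block-offset parameter made visible here.

```cuda
#include <cuda_runtime.h>
#include <algorithm>

// Partial execution of a kernel: each launch covers only a block range.
// The firstBlock offset reconstructs the global block ID; SKE's runtime
// would supply this range without modifying the kernel source.
__global__ void scale_partial(float *data, int n, int firstBlock) {
    int block = firstBlock + blockIdx.x;
    int i = block * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 22, threads = 256;
    const int totalBlocks = (n + threads - 1) / threads;

    float *data;
    cudaMallocManaged(&data, n * sizeof(float));      // shared-address-space stand-in;
    for (int i = 0; i < n; i++) data[i] = 1.0f;       // assumes concurrent managed access

    int ngpus;
    cudaGetDeviceCount(&ngpus);
    const int perGpu = (totalBlocks + ngpus - 1) / ngpus;  // block range per GPU
    for (int g = 0; g < ngpus; g++) {
        int first = g * perGpu;
        int count = std::min(perGpu, totalBlocks - first);
        if (count <= 0) break;
        cudaSetDevice(g);
        scale_partial<<<count, threads>>>(data, n, first);
    }
    for (int g = 0; g < ngpus; g++) {
        cudaSetDevice(g);
        cudaDeviceSynchronize();
    }
    cudaFree(data);
    return 0;
}
```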

Memory Address Space Organization
The GPU virtual address space is interleaved across the GPUs' memory at fine (cache-line) granularity, which keeps the memory network load-balanced.
[Diagram: consecutive cache lines 0-5 rotating across GPUs; with page-granularity placement (page A → GPU X, page B → GPU Y, page C → GPU Z), requests may take minimal or non-minimal paths through the network.]
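An illustrative address-to-GPU mapping for such fine-grained interleaving; the 64-byte line size and the plain modulo function are assumptions for the sketch, not necessarily the paper's exact mapping.

```cuda
#include <cstdint>

// Consecutive cache lines rotate across the GPUs' local memory stacks,
// spreading traffic evenly over the memory network.
constexpr uintptr_t LINE_BYTES = 64;   // assumed cache-line granularity

int homeGpu(uintptr_t addr, int ngpus) {
    return static_cast<int>((addr / LINE_BYTES) % ngpus);
}
```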

Multi-GPU Memory Network Topology
- 2D flattened butterfly without concentration (FBFLY) [ISCA'07].
- Distributor-based flattened butterfly (dFBFLY) [PACT'13]: load-balanced GPU channels.
- Sliced flattened butterfly (sFBFLY): additionally removes path diversity among local HMCs.
[Diagram: GPU/HMC topologies for FBFLY, dFBFLY, and sFBFLY, annotated 50%, 43%, and 33%.]

Contents
- Motivation
- Related work
- Inter-GPU communication: scalable kernel execution (SKE), GPU memory network (GMN) design
- CPU-GPU communication: unified memory network (UMN), overlay network architecture
- Evaluation
- Conclusion

Data Transfer Overhead
Problems:
1. CPU-GPU communication bandwidth over PCIe is low.
2. Data transfers (memory copies) between host memory and device memory add overhead.
[Diagram: data chunks crossing the low-bandwidth PCIe link between host memory and device memory.]
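A small sketch of how that PCIe copy cost can be measured with CUDA events; the 256 MiB buffer is an arbitrary choice, and the reported bandwidth will vary by system.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 256ull << 20;      // 256 MiB test buffer
    float *h, *d;
    cudaMallocHost(&h, bytes);              // pinned host memory
    cudaMalloc(&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // the PCIe crossing
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```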

Unified Memory Network
- Removes the PCIe bottleneck between the CPU and the GPUs.
- Eliminates memory copies between the CPU and the GPUs entirely.
[Diagram: the IO hub and PCIe switches of a conventional system are replaced by a unified memory network connecting the CPU and GPUs to the same HMCs.]

Overlay Network Architecture
- CPUs are latency-sensitive; GPUs are bandwidth-sensitive.
- CPU traffic is served by an overlay of off-chip links and on-chip pass-thru paths [PACT'13, FB-DIMM spec.], giving it a low-latency route through the network.

Methodology
- GPGPU-Sim version 3.2; SKE is assumed for the evaluation.
- Configuration: 4 HMCs per CPU/GPU; 8 bidirectional channels per CPU/GPU/HMC; PCIe bandwidth 15.75 GB/s with 600 ns latency; each HMC has 4 GB, 8 layers, 16 vaults, 16 banks/vault, and FR-FCFS scheduling.
- A 1-CPU, 4-GPU system is assumed unless otherwise mentioned.

Abbreviation | Configuration
PCIe | PCIe-based system with memcpy
GMN  | GPU memory network-based system with memcpy
UMN  | Unified memory network-based system (no copy)

SKE Performance with Different Designs
[Chart: results for selected compute-intensive and data-intensive workloads; lower is better. An 82% reduction is annotated.]

Impact of Removing Path Diversity between Local HMCs
[Chart: lower is better. Annotated results: 14% higher, 9% lower, and <1% difference.]

Scalability
[Chart: speedup versus the number of GPUs; higher is better. Compute-intensive workloads scale up to 13.5x; some workloads plateau because the input size is not large enough.]

Conclusion
We addressed two critical problems in multi-GPU systems with memory networks:
- Inter-GPU communication: a GPU memory network improves bandwidth, and scalable kernel execution improves programmability.
- CPU-GPU communication: a unified memory network eliminates data transfers, and an overlay network architecture serves both latency-sensitive CPU traffic and bandwidth-sensitive GPU traffic.
Our proposed designs improve both the performance and the programmability of multi-GPU systems.