Pipeline parallelism and Multi-GPU Programming

CS 179 Lecture 14: Pipeline Parallelism and Multi-GPU Programming

Last time: CUDA streams. [Timeline diagram: HD 0, kernel 0, DH 0; HD 1, kernel 1, DH 1; HD 2, kernel 2 -- host-to-device copies, kernels, and device-to-host copies issued across streams.]

Today: pipeline parallelism and programming multiple GPUs. The general theme of the week is using all of your computational resources in parallel. So far we've been parallelizing solely within the GPU, but there are also other, higher-level types of parallelism to consider.

Data Parallelism (figures from James Reinders). Example: converting all characters in an array to upper-case. There are no dependencies between elements, so this is what we've been doing with CUDA: a SIMD model in which all threads do the same computation and the computations are independent.
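As a concrete sketch of that upper-case example (the kernel name and launch configuration here are illustrative, not from the slides), a data-parallel kernel assigns one thread per character:

__global__ void to_upper(char *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Each thread touches exactly one element; no thread depends on another.
    if (i < n && data[i] >= 'a' && data[i] <= 'z')
        data[i] -= 'a' - 'A';
}

// Launch with enough blocks to cover all n characters:
// to_upper<<<(n + 255) / 256, 256>>>(d_data, n);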

Task Parallelism: threads compute different, independent tasks. This is not normally what we do on the GPU, but it is something we could do if we're careful to avoid thread divergence.

Pipeline Parallelism: parallelizing over one iteration. This is quite different from the previous forms of parallelism: there is one computation, but we break it up into multiple parts, and each thread handles one part. It is useful for parallelizing things that don't seem easily parallelizable (but it does require multiple stages of computation). It requires low-overhead queues (the producer-consumer problem from CS 24). This is what we've been talking about with CUDA streams and asynchronous functions.

Pipeline parallelism on the GPU:

while (1) {
    cudaMemcpy(d_in, h_in, input_size, cudaMemcpyHostToDevice);
    kernel<<<grid, block>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, output_size, cudaMemcpyDeviceToHost);
}

This is a 3-stage pipeline: host-to-device copy, kernel, device-to-host copy.
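A minimal sketch of how those three stages can be overlapped with streams and asynchronous copies (the buffer arrays, num_chunks, grid, block, and the kernel are placeholders; h_in/h_out must be pinned with cudaMallocHost for the async copies to actually overlap):

#define NSTREAMS 2

cudaStream_t streams[NSTREAMS];
for (int i = 0; i < NSTREAMS; i++)
    cudaStreamCreate(&streams[i]);

for (int chunk = 0; chunk < num_chunks; chunk++) {
    int s = chunk % NSTREAMS;
    // Each stream works on its own device buffers, so copies and kernels
    // from different chunks can run concurrently on the copy engines and SMs.
    cudaMemcpyAsync(d_in[s], h_in[chunk], input_size,
                    cudaMemcpyHostToDevice, streams[s]);
    kernel<<<grid, block, 0, streams[s]>>>(d_in[s], d_out[s]);
    cudaMemcpyAsync(h_out[chunk], d_out[s], output_size,
                    cudaMemcpyDeviceToHost, streams[s]);
}
cudaDeviceSynchronize();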

Pipeline parallelism in action. [Timeline diagram: stage 0 (HD copies), stage 1 (kernels), and stage 2 (DH copies) for successive chunks overlap over time. Stage 0 is computed by the HD copy engine, stage 1 by the SMs, and stage 2 by the DH copy engine.] Note that we have a separate piece of hardware for each stage of the pipeline. We could run multiple kernels concurrently, but there is no real advantage because they would still be sharing hardware.

Pipeline analysis: what are the latency and throughput of a pipeline? (Hint: analyze with respect to the latency and throughput of each stage of the pipeline.)

Pipeline analysis:
Stage throughput = 1 / (stage latency)
Pipeline throughput = minimum of the stage throughputs
Pipeline latency = sum of the stage latencies
All of this assumes a stage can handle one packet of data at a time.

Pipeline throughput analysis:
pipeline throughput = minimum of the stage throughputs
                    = minimum over stages of (1 / stage latency)
                    = 1 / (maximum stage latency)
These are equal because we assume each stage can only handle one packet of data at a time.
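A worked example (the numbers are illustrative, not from the lecture): with stage latencies of 2 ms (HD copy), 5 ms (kernel), and 3 ms (DH copy), the pipeline throughput is 1 / (5 ms) = 200 chunks per second, while the latency for any single chunk is 2 + 5 + 3 = 10 ms.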

Multiple GPUs: you can put multiple GPUs in a single computer, and CUDA provides interfaces to dispatch work to more than one GPU. (haru has 3x GTX 570.)

Simple interface:
cudaGetDeviceCount - how many CUDA-capable GPUs are there?
cudaSetDevice(int i) - execute future commands on GPU i.
NVIDIA refers to multiple GPUs as "peers".
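A minimal sketch of that interface (MAX_GPUS, chunk_size, grid, block, and the kernel are placeholders):

int device_count = 0;
cudaGetDeviceCount(&device_count);

float *d_buf[MAX_GPUS];                       // one buffer pointer per GPU
for (int dev = 0; dev < device_count; dev++) {
    cudaSetDevice(dev);                       // subsequent CUDA calls target GPU dev
    cudaMalloc(&d_buf[dev], chunk_size);      // each GPU gets its own allocation
    kernel<<<grid, block>>>(d_buf[dev]);      // this launch runs on GPU dev
}

// Wait for every GPU to finish.
for (int dev = 0; dev < device_count; dev++) {
    cudaSetDevice(dev);
    cudaDeviceSynchronize();
}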

Data movement: thanks to unified virtual addressing, you can just use cudaMemcpy with cudaMemcpyDefault to move data between GPUs. It is actually possible to DMA from one GPU to another and skip the host entirely. Note that such a memcpy breaks concurrency on both GPUs.
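A sketch of moving data between two GPUs this way (pointer names and nbytes are placeholders; d_buf0 is assumed to live on device 0 and d_buf1 on device 1):

// With unified virtual addressing, CUDA infers the direction from the
// pointers, so cudaMemcpyDefault works for GPU-to-GPU copies.
cudaMemcpy(d_buf1, d_buf0, nbytes, cudaMemcpyDefault);

// Equivalently, the explicit peer-copy API names both devices:
// cudaMemcpyPeer(d_buf1, 1, d_buf0, 0, nbytes);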

Data access: depending on the hardware and motherboard layout, peers can have the ability to directly access each other's memory over PCI-E. cudaDeviceCanAccessPeer tells you whether access is possible; cudaDeviceEnablePeerAccess enables peer access.

Peer access example (peer access is asymmetric):

// allow device 0 to access device 1 memory
cudaSetDevice(0);
cudaDeviceEnablePeerAccess(1, 0);

// allow device 1 to access device 0 memory
cudaSetDevice(1);
cudaDeviceEnablePeerAccess(0, 0);

The second argument to cudaDeviceEnablePeerAccess is a flag that must be 0 as of right now.

Peer access use cases and an alternative: peer access use cases are similar to using pinned host memory (both involve all accesses going over PCI-E). A simpler alternative is to use managed memory (cudaMallocManaged), which is also accessible on the host. You could imagine pipeline parallelism between GPUs where you want to use peer access for streaming purposes.
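A minimal sketch of the managed-memory alternative (n, grid, block, and the kernel are placeholders):

float *data;
// One allocation that is visible to the host and to every GPU;
// the CUDA runtime migrates the data on demand.
cudaMallocManaged(&data, n * sizeof(float));

for (int i = 0; i < n; i++)
    data[i] = 0.0f;                 // initialize on the host

cudaSetDevice(0);
kernel<<<grid, block>>>(data, n);   // GPU 0 works on it
cudaDeviceSynchronize();

cudaSetDevice(1);
kernel<<<grid, block>>>(data, n);   // then GPU 1 can, too
cudaDeviceSynchronize();

cudaFree(data);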

GPU/GPU synchronization. Problem: synchronize 2 GPUs without synchronizing the full system (all GPUs + CPU). Solution: cudaStreamWaitEvent. Record an event on one GPU and have the other GPU's stream synchronize with it (but not with the CPU).
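A sketch of this pattern (producer_kernel, consumer_kernel, and the buffers are placeholders; stream0 is assumed to be a stream created on device 0 and stream1 a stream created on device 1):

cudaEvent_t done;

// GPU 0: do some work and record an event when it finishes.
cudaSetDevice(0);
cudaEventCreateWithFlags(&done, cudaEventDisableTiming);
producer_kernel<<<grid, block, 0, stream0>>>(d_buf0);
cudaEventRecord(done, stream0);

// GPU 1: make its stream wait on GPU 0's event.
// The host never blocks; the dependency is enforced on the device side.
cudaSetDevice(1);
cudaStreamWaitEvent(stream1, done, 0);
consumer_kernel<<<grid, block, 0, stream1>>>(d_buf1);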

Driving multiple GPUs: two common options are a single-threaded process, or one thread per GPU.

How many threads?
Single thread per process. Pros: simple. Cons: you constantly have to call cudaSetDevice.
One thread per GPU. Pros: call cudaSetDevice once per thread, plays nicely with MPI, and you can use multiple CPU cores for computation. Cons: more complex.
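A minimal sketch of the one-thread-per-GPU option using C++ threads (the per-GPU work is elided and would be filled in for a real application):

#include <cuda_runtime.h>
#include <thread>
#include <vector>

void worker(int dev) {
    cudaSetDevice(dev);                  // called once, at thread start
    // ... allocate buffers, launch kernels, copy results for GPU dev ...
    cudaDeviceSynchronize();
}

int main() {
    int device_count = 0;
    cudaGetDeviceCount(&device_count);

    std::vector<std::thread> threads;
    for (int dev = 0; dev < device_count; dev++)
        threads.emplace_back(worker, dev);   // one host thread drives one GPU
    for (auto &t : threads)
        t.join();
    return 0;
}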

cuBLAS-XT: NVIDIA's cuBLAS-XT (Basic Linear Algebra Subprograms) library takes advantage of the sort of full-system parallelism we've been talking about. Input: arbitrarily sized matrices in host memory. Output: the matrix product in host memory. Programming multiple GPUs is almost like programming a distributed system: you want to minimize communication. Worth exploring for course projects.
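A sketch of the cuBLAS-XT calling pattern for a single-precision matrix product (A, B, C, m, n, k, and the leading dimensions are placeholders; error checking is omitted):

#include <cublasXt.h>

cublasXtHandle_t handle;
cublasXtCreate(&handle);

// Tell cuBLAS-XT which GPUs to use; it tiles the host matrices
// and streams the tiles to the devices for you.
int devices[2] = {0, 1};
cublasXtDeviceSelect(handle, 2, devices);

float alpha = 1.0f, beta = 0.0f;
// A, B, C are ordinary host pointers (column-major: m x k, k x n, m x n).
cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
              m, n, k,
              &alpha, A, lda,
              B, ldb,
              &beta, C, ldc);

cublasXtDestroy(handle);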

GPUs across multiple machines (speaking of distributed systems and clusters): this is more or less the same as doing scientific computation on a cluster without GPUs. MPI is commonly used; MPI is a standard API for communicating data between distributed processes. For some networking hardware (such as InfiniBand), it's possible to DMA data straight from the network adapter to the GPU. This is expensive territory!
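A heavily simplified sketch of the one-MPI-rank-per-GPU pattern (N, the compute step, and the communication pattern are placeholders; passing device pointers directly to MPI calls assumes a CUDA-aware MPI build, otherwise you would stage through host memory):

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    cudaSetDevice(rank % ngpus);        // bind this rank to one local GPU

    float *d_buf;
    cudaMalloc(&d_buf, N * sizeof(float));
    // ... compute on d_buf ...

    // Hand results to a neighboring rank.
    if (rank + 1 < nranks)
        MPI_Send(d_buf, N, MPI_FLOAT, rank + 1, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}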

Conclusion: pipeline parallelism is a great way to think about utilizing all available hardware. Multiple GPUs can increase throughput through either data or pipeline parallelism. Both parallelizing and distributing algorithms require careful thought about dependencies (or, equivalently, synchronization and communication).