© 2010 Michael Boyer1 Harnessing the Power of GPUs for Non-Graphics Applications Michael Boyer Department of Computer Science University of Virginia Advisor: Kevin Skadron

© 2010 Michael Boyer2 Outline GPU architecture Programming GPUs using CUDA Case study: Leukocyte Tracking Current work: CPU-GPU Task Sharing

© 2010 Michael Boyer3 Graphics Processors Graphics Processing Units (GPUs) are designed specifically for graphics rendering applications Courtesy of GameSpot

© 2010 Michael Boyer4 Graphics Applications Graphics applications involve applying the same operation to many pieces of data Application characteristics: – Massively parallel – Only aggregate performance matters

© 2010 Michael Boyer5 CPU vs. GPU: Architectural Difference 1 [Diagram: a CPU core with fetch/decode, execute, register file, out-of-order logic, branch predictor, data cache, and memory pre-fetcher, next to a GPU core with only fetch/decode, execute, and register file] Avoid structures that only improve single-thread performance

© 2010 Michael Boyer6 CPU vs. GPU: Architectural Difference 2 [Diagram: the GPU core's single fetch/decode unit feeding a group of execution units and register files shared by a thread group, contrasted with the full CPU core] Amortize the overhead of control logic across multiple execution units (SIMD processing)

© 2010 Michael Boyer7 CPU vs. GPU: Architectural Difference 3 [Diagram: one GPU core multiplexing four thread groups across its execution units, contrasted with the full CPU core] Use multiple groups of threads to keep execution units busy and hide memory latency

© 2010 Michael Boyer8 CPU vs. GPU: Architectural Difference 4 [Diagram: a CPU with a few large cores next to a GPU with 30 small cores] Replicate cores to leverage more parallelism

© 2010 Michael Boyer9 CPU vs. GPU: Architectural Differences Summary: take advantage of abundant parallelism – Lots of threads, so focus on aggregate performance – Parallelism in space: SIMD processing in each core, and many independent SIMD cores across the chip – Parallelism in time: multiple SIMD groups in each core

© 2010 Michael Boyer10 CPU vs. GPU: Peak Performance
Product: Intel Xeon W5590 (Nehalem) [CPU] vs. AMD Radeon HD 5870 [GPU]
Throughput (GFLOPs): 107 (CPU) vs. 2,720 (GPU)
Memory Bandwidth (GB/s):
Cost: $1,700 (CPU) vs. $450 (GPU)
Note that these are peak numbers. What we really care about is performance on real-world applications.

© 2010 Michael Boyer11 General-Purpose Computing on GPUs Lots of recent interest in using GPUs to run non- graphics applications (GPGPU) Why GPUs? Why now? – Recent increases in performance via parallelism – Recent increases in programmability – Ubiquity in multiple market segments Old approach: graphics languages New approach: GPGPU languages – OpenCL, CUDA

© 2010 Michael Boyer12 CUDA Programming model for running general-purpose applications on NVIDIA GPUs Extension to the C programming language GPU is a co-processor: – Main program runs on the CPU – Large computations (kernels) are offloaded to the GPU – CPU and GPU have separate memory, so data must be transferred back and forth

© 2010 Michael Boyer13 CUDA: Typical Program Structure
void function(…) {
    Allocate memory on the GPU
    Transfer input data to the GPU
    Launch kernel on the GPU
    Transfer output data to the CPU
}
__global__ void kernel(…) {
    Code executed on the GPU goes here…
}
[Diagram: CPU with CPU memory connected to GPU with GPU memory]

© 2010 Michael Boyer14 CUDA: Typical Program Transformation
for (i = 0; i < N; i++) {
    Process array element i
}
Body of loop becomes body of kernel:
__global__ void kernel(…) {
    Determine this thread’s value of i
    Process array element i
}
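For illustration only (this pattern is not on the slide), a common way to write the transformed kernel is with a grid-stride loop, so a fixed number of threads can cover any N; the kernel name and the element operation are placeholders:

// Hypothetical sketch of the loop-to-kernel transformation using a grid-stride loop.
// Each thread starts at its global index and strides by the total number of threads,
// so the same kernel works even when N exceeds the number of launched threads.
__global__ void process_elements(float *data, int N) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < N;
         i += gridDim.x * blockDim.x) {
        data[i] = data[i] * 2.0f;   // "Process array element i" (placeholder operation)
    }
}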

© 2010 Michael Boyer15 CUDA Kernel Scalar program invoked across many threads – Typically one thread per data element Overall computation decomposed into a grid of thread blocks – Thread blocks are independent and cannot communicate (with some exceptions) – Threads within the same block can communicate [Diagram: a grid of thread blocks 1 through 5]

© 2010 Michael Boyer16 Simple Example: Vector Addition C = A + B [Diagram: element-wise addition of vectors A and B producing C]

© 2010 Michael Boyer17 C Code
float *CPU_add_vectors(float *A, float *B, int N) {
    // Allocate memory for the result
    float *C = (float *) malloc(N * sizeof(float));
    // Compute the sum
    for (int i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
    // Return the result
    return C;
}

© 2010 Michael Boyer18 CUDA Kernel
// GPU kernel that computes the vector sum C = A + B
// (each thread computes a single value of the result)
__global__ void kernel(float *A, float *B, float *C, int N) {
    // Determine which element this thread is computing
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    // Compute a single element of the result vector
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}

© 2010 Michael Boyer19 CUDA Host Code
float *GPU_add_vectors(float *A_CPU, float *B_CPU, int N) {
    // Allocate GPU memory for the inputs and the result
    int vector_size = N * sizeof(float);
    float *A_GPU, *B_GPU, *C_GPU;
    cudaMalloc((void **) &A_GPU, vector_size);
    cudaMalloc((void **) &B_GPU, vector_size);
    cudaMalloc((void **) &C_GPU, vector_size);
    // Transfer the input vectors to GPU memory
    cudaMemcpy(A_GPU, A_CPU, vector_size, cudaMemcpyHostToDevice);
    cudaMemcpy(B_GPU, B_CPU, vector_size, cudaMemcpyHostToDevice);
    // Execute the kernel to compute the vector sum on the GPU
    int num_blocks = ceil((double) N / (double) THREADS_PER_BLOCK);
    kernel<<<num_blocks, THREADS_PER_BLOCK>>>(A_GPU, B_GPU, C_GPU, N);
    // Transfer the result vector from the GPU to the CPU
    float *C_CPU = (float *) malloc(vector_size);
    cudaMemcpy(C_CPU, C_GPU, vector_size, cudaMemcpyDeviceToHost);
    return C_CPU;
}
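As a side note not on the original slide, the CUDA runtime calls above all return error codes; a small helper macro (a minimal sketch, assuming only the standard CUDA runtime API and the C standard library) makes it easy to surface failures:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap a CUDA runtime call and abort with a readable message if it fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Example usage with the allocations and copies from the slide:
//   CUDA_CHECK(cudaMalloc((void **) &A_GPU, vector_size));
//   CUDA_CHECK(cudaMemcpy(A_GPU, A_CPU, vector_size, cudaMemcpyHostToDevice));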

© 2010 Michael Boyer20 Example Program Output
./vector_add 50,000,000
GPU:
    Transfer to GPU:   sec
    Kernel execution:  sec
    Transfer from GPU: sec
    Total:             sec
CPU:                   sec
Execution: GPU outperformed CPU by 27.2x
Overall: CPU outperformed GPU by 2.97x
Vector addition does not do enough work per memory operation to justify offload!

© 2010 Michael Boyer21 Case Study: Leukocyte Tracking

© 2010 Michael Boyer22 Leukocyte Tracking Important for evaluating inflammatory drugs Velocity measured by tracking leukocytes through multiple frames Current approaches: – Manual analysis: 1 minute video in tens of hours – MATLAB: 1 minute video in 5 hours

© 2010 Michael Boyer23 Goal: Leverage CUDA and a GPU to accelerate leukocyte tracking to near real-time speeds

© 2010 Michael Boyer24 Acceleration 1. Translation: convert MATLAB code to C 2. Parallelization: – OpenMP for multi-core CPU – CUDA for GPU Experimental setup: – CPU: 3.2 GHz quad-core Intel Core 2 Extreme X9770 – GPU: NVIDIA GeForce GTX 280 (PCIe 2.0)

© 2010 Michael Boyer25 Tracking Algorithm Inputs: Video frame Location of cells in previous frame Output: Location of cells in current frame For each cell: – Extract sub-image near cell’s old location – Compute MGVF matrix over sub-image (→ 99.8% of runtime) – Evolve active contour using MGVF matrix

© 2010 Michael Boyer26 Computing the MGVF Matrix MGVF = Motion Gradient Vector Flow The MGVF matrix is approximated via an iterative solution procedure [Figure: sub-image near cell and the corresponding MGVF matrix]

© 2010 Michael Boyer27 MGVF Pseudo-code
MGVF = normalized sub-image gradient
do {
    Compute eight matrices based on current MGVF
    Compute Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
} while (! converged)

© 2010 Michael Boyer28 Naïve CUDA Implementation Kernel is called ~50,000 times per frame Amount of work per call is small Runtime dominated by CUDA overheads: – Memory allocation, memory copying, kernel call overhead

© 2010 Michael Boyer29 Kernel Overhead Kernel calls are not cheap! – Overhead of one kernel call: 9 microseconds – Overhead of one CPU function: 3 nanoseconds – Kernel call is 3,000 times more expensive Heaviside kernel: – 27% of kernel runtime due to computation – 73% of kernel runtime due to kernel overhead

© 2010 Michael Boyer30 Lesson 1: Reduce Kernel Overhead Increase amount of work per kernel call – Decrease total number of kernel calls – Amortize overhead of each kernel call across more computation
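To make the lesson concrete, here is a hedged sketch, not taken from the leukocyte code, of fusing three tiny per-element kernels into one so that a single launch amortizes the call overhead across all three operations; all names are illustrative:

// Before: three tiny kernels mean three launches (and three rounds of
// launch overhead) per pass over the data.
__global__ void scale_k(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}
__global__ void offset_k(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}
__global__ void square_k(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= x[i];
}

// After: one fused kernel does the same work in a single launch,
// amortizing the call overhead across all three operations.
__global__ void fused_k(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i] * 2.0f + 1.0f;
        x[i] = v * v;
    }
}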

© 2010 Michael Boyer31 Larger Kernel Implementation
MGVF = normalized sub-image gradient
do {
    Compute eight matrices based on current MGVF
    Compute Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
} while (! converged)

© 2010 Michael Boyer32 Larger Kernel Implementation

© 2010 Michael Boyer33 Memory Allocation Overhead

© 2010 Michael Boyer34 Lesson 2: Reduce Memory Management Overhead Reduce the number of memory allocations – Allocate memory once and reuse it throughout the application – If memory size is not known a priori, estimate and only re-allocate if estimate is too small
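One way to express this, as a minimal sketch with illustrative names rather than the actual tracking code, is a device buffer that is allocated once and re-allocated only when a request exceeds its current capacity:

#include <cstddef>
#include <cuda_runtime.h>

// Reusable device buffer: allocated once, re-allocated only if the
// requested size exceeds the current capacity.
static void  *g_buffer   = NULL;
static size_t g_capacity = 0;

void *get_device_buffer(size_t bytes) {
    if (bytes > g_capacity) {
        cudaFree(g_buffer);              // no-op if g_buffer is still NULL
        cudaMalloc(&g_buffer, bytes);    // grow to the new size
        g_capacity = bytes;
    }
    return g_buffer;                     // otherwise reuse the existing allocation
}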

© 2010 Michael Boyer35 Reduced Allocation Implementation

© 2010 Michael Boyer36 Memory Transfer Overhead

© 2010 Michael Boyer37 Lesson 3: Reduce Memory Transfer Overhead If the CPU operates on values produced by the GPU: – Move the operation to the GPU – May improve performance even if the operation itself is slower on the GPU [Timeline diagram: performing the operation on the GPU vs. transferring the values to the CPU, performing the operation there, and transferring them back]

© 2010 Michael Boyer38 GPU Reduction Implementation
MGVF = normalized sub-image gradient
do {
    Compute eight matrices based on current MGVF
    Compute Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
} while (! converged)
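For illustration, a generic shared-memory tree reduction of the kind typically used for such a convergence check is sketched below (this is the standard pattern, not the dissertation's exact kernel); it assumes the block size is a power of two equal to BLOCK_SIZE:

#define BLOCK_SIZE 256

// Generic block-level sum reduction: each block reduces BLOCK_SIZE elements
// of 'in' into one partial sum in 'out'. Repeat (or follow with a tiny final
// pass) until a single value remains on the GPU.
__global__ void reduce_sum(const float *in, float *out, int n) {
    __shared__ float partial[BLOCK_SIZE];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;

    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            partial[tid] += partial[tid + stride];
        }
        __syncthreads();
    }

    if (tid == 0) {
        out[blockIdx.x] = partial[0];
    }
}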

© 2010 Michael Boyer39 GPU Reduction Implementation

© 2010 Michael Boyer40 Persistent Thread Block
MGVF = normalized sub-image gradient
do {
    Compute eight matrices based on current MGVF
    Compute Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
} while (! converged)

© 2010 Michael Boyer41 Persistent Thread Block Problem: need a global memory fence – Multiple thread blocks compute the MGVF matrix – Thread blocks cannot communicate with each other – So each iteration requires a separate kernel call Solution: compute entire matrix in one thread block – Arbitrary number of iterations can be computed in a single kernel call
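A hedged sketch of the pattern, simplified to a single block updating a matrix with a placeholder rule: the block loops until convergence inside one kernel launch, and __syncthreads() is the only fence it needs:

// One persistent thread block iterates to convergence in a single launch.
// Because all cooperating threads live in the same block, __syncthreads()
// is a sufficient fence and no per-iteration kernel call is required.
__global__ void iterate_to_convergence(float *matrix, int n,
                                       float threshold, int max_iters) {
    __shared__ int not_converged;

    for (int iter = 0; iter < max_iters; iter++) {
        // Reset the shared flag for this iteration
        if (threadIdx.x == 0) not_converged = 0;
        __syncthreads();

        // Each thread updates its strided subset of elements
        for (int i = threadIdx.x; i < n; i += blockDim.x) {
            float old_val = matrix[i];
            float new_val = 0.5f * (old_val + 1.0f);   // placeholder update rule
            matrix[i] = new_val;
            if (fabsf(new_val - old_val) > threshold) not_converged = 1;
        }
        __syncthreads();

        // Every thread reads the same flag value here, so they exit together
        int done = !not_converged;
        __syncthreads();   // ensure all reads finish before the next reset
        if (done) break;
    }
}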

© 2010 Michael Boyer42 Persistent Thread Block: Example [Diagram: the MGVF matrix under the canonical CUDA approach (1-to-1 mapping between threads and data elements) vs. under a single persistent thread block]

© 2010 Michael Boyer43 Persistent Thread Block: Example [Diagram: with the canonical CUDA approach, one cell's computation is spread across every SM; with the persistent thread block approach, each cell is assigned to a single SM so many cells are processed concurrently] SM = Streaming Multiprocessor (GPU core)

© 2010 Michael Boyer44 Lesson 4: Avoid Global Memory Fences Confine dependent computations to a single thread block – Execute an iterative algorithm until convergence in a single kernel call – Only efficient if there are multiple independent computations

© 2010 Michael Boyer45 Persistent Thread Block Implementation 27x

© 2010 Michael Boyer46 Absolute Performance

© 2010 Michael Boyer47 Video Example

© 2010 Michael Boyer48 Conclusions CUDA overheads can be significant bottlenecks CUDA provides enormous performance improvements for leukocyte tracking – 200x over MATLAB – 27x over OpenMP Processing time reduced from > 4.5 hours to < 1.5 minutes Real-time analysis feasible in near future M. Boyer, D. Tarjan, S. T. Acton, and K. Skadron. "Accelerating Leukocyte Tracking using CUDA: A Case Study in Leveraging Manycore Coprocessors." In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), May 2009.

© 2010 Michael Boyer49 Current work: CPU-GPU Task Sharing

© 2010 Michael Boyer50 CPU-GPU Task Sharing The offloading decision is generally considered to be binary: run on the CPU or on the GPU?

© 2010 Michael Boyer51 CPU-GPU Task Sharing The offload decision does not need to be binary! – Dividing a task between the CPU and GPU can provide improved performance over either device alone [Diagram: a single task split between the CPU and the GPU]
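As a purely illustrative sketch of the idea, separate from the proposed translation framework, a data-parallel task can be split by a tunable ratio: enqueue the GPU's share asynchronously, process the CPU's share on the host, then synchronize; the names, the 256-thread block size, and the placeholder operation are assumptions:

// Illustrative CPU-GPU sharing of an element-wise task.
// 'alpha' is the fraction of elements given to the GPU (tuned per platform).
__global__ void process_gpu(float *d_data, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) d_data[i] *= 2.0f;            // placeholder work
}

void process_shared(float *h_data, float *d_data, int n, float alpha) {
    int n_gpu = (int)(alpha * n);                // first n_gpu elements go to the GPU

    if (n_gpu > 0) {
        // GPU share: copy and launch; both calls return before the work finishes
        // (true copy/compute overlap would additionally require pinned host memory)
        cudaMemcpyAsync(d_data, h_data, n_gpu * sizeof(float),
                        cudaMemcpyHostToDevice);
        int blocks = (n_gpu + 255) / 256;
        process_gpu<<<blocks, 256>>>(d_data, n_gpu);
    }

    // CPU share runs concurrently with the GPU's share
    for (int i = n_gpu; i < n; i++) h_data[i] *= 2.0f;

    if (n_gpu > 0) {
        // Blocking copy: waits for the kernel, then brings the GPU results back
        cudaMemcpy(h_data, d_data, n_gpu * sizeof(float), cudaMemcpyDeviceToHost);
    }
}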

© 2010 Michael Boyer52 Theoretical Performance

© 2010 Michael Boyer53 Research Goal 1. Given an input program written in CUDA or OpenCL, automatically generate a program that can execute on the CPU and GPU concurrently 2. Automatically determine best division of work: – When beneficial, share work between CPU and GPU – Otherwise, execute on CPU or GPU exclusively – Optimal decision can change at runtime: with different inputs, with contention

© 2010 Michael Boyer54 Proposed System [Flow diagram: OpenCL code → Source-to-source Translation Framework → Modified OpenCL code → OpenCL Compiler → CPU/GPU binary] Transform all GPU memory allocations, memory transfers, and kernel launches into a form supporting concurrent CPU-GPU execution

© 2010 Michael Boyer55 Potential Problems One version of the kernel for multiple devices – Optimizations for GPU may hurt performance on CPU and vice versa Possible (but rare) for thread blocks to communicate with each other – Do we try to support this? Statically predicting data access patterns can be hard (or even impossible for some applications)

© 2010 Michael Boyer56 Data Sharing If we cannot predict data access patterns statically, then the CPU and the GPU must have a consistent view of memory [Diagram: 1) computation on both devices, 2) a full data transfer to reconcile their memories]

© 2010 Michael Boyer57 Data Sharing (2) If we can predict data access patterns statically, then we can minimize the data transfer overhead [Diagram: 1) computation on both devices, 2) transfer of only the data each device actually needs]

© 2010 Michael Boyer58 Preliminary Results (HotSpot)

© 2010 Michael Boyer59 Conclusions GPUs are designed to provide good performance on graphics workloads – But they have evolved to support any workload with abundant parallelism GPUs can provide large performance improvements – But we need to take into account the overheads involved to fully take advantage Allowing the CPU and GPU to work together can provide an even larger performance improvement

© 2010 Michael Boyer60 Acknowledgements Funding provided by: – NSF grant IIS – SRC grant – NVIDIA research grant – GRC AMD/Mahboob Kahn Ph.D. fellowship Equipment donated by NVIDIA

© 2010 Michael Boyer61 BACKUP

© 2010 Michael Boyer62 3D Rendering APIs High-level abstractions for rendering geometry [Pipeline: Graphics Application → Vertex Program → Rasterization → Fragment Program → Display] Courtesy of D. Luebke, NVIDIA

© 2010 Michael Boyer63 CUDA: Abstractions 1. Kernel function – Mapped onto a grid of thread blocks 2. Scratchpad memory – For sharing data within a thread block 3. Barrier synchronization – For synchronizing within a thread block

© 2010 Michael Boyer64 Kernel Function
__global__ void kernel(int *in, int *out) {
    // Determine this thread's index
    int i = threadIdx.x;
    // Add one to the input value
    out[i] = in[i] + 1;
}

© 2010 Michael Boyer65 Grid of Thread Blocks Grid: 2-dimensional ≤ 4.3 billion blocks Thread block: 3-dimensional ≤ 512 threads

© 2010 Michael Boyer66 Launching a Kernel
int num_threads = ...;
int threads_per_block = 256;
// Determine how many thread blocks are needed
// (using either of the two methods shown below)
int num_blocks = (int) ceil((float) num_threads / threads_per_block);
// or equivalently:
// int num_blocks = (num_threads + threads_per_block - 1) / threads_per_block;
// Make structures for grid and thread block dimensions
dim3 grid(num_blocks, 1);
dim3 thread_block(threads_per_block, 1, 1);
// Launch the kernel
kernel<<<grid, thread_block>>>(in, out);

© 2010 Michael Boyer67 Scratchpad Memory Each multiprocessor has 16 KB of software-controlled shared memory Variables declared “__shared__” get mapped into this memory Values can only be shared among threads within the same thread block

© 2010 Michael Boyer68 Scratchpad Memory: Example
__global__ void kernel() {
    int i = threadIdx.x;
    // Compute some function
    int v = foo(i);
    // Write the value into shared memory
    __shared__ int values[THREADS_PER_BLOCK];
    values[i] = v;
    // Use the shared values...
}

© 2010 Michael Boyer69 Barrier Synchronization __syncthreads() function Each thread waits for all other threads in the thread block All values written by every thread are now visible to all other threads

© 2010 Michael Boyer70 Barrier Synchronization: Example
__global__ void kernel(float *out, int N) {
    int i = threadIdx.x;
    __shared__ int values[THREADS_PER_BLOCK];
    values[i] = foo(i);
    // Wait to ensure all values have been written
    __syncthreads();
    // Compute average of two values
    out[i] = (values[i] + values[(i + 1) % N]) / 2.0f;
}

© 2010 Michael Boyer71 CUDA Overheads Driver initialization: 0.14 seconds Kernel launch: 13 μs GPU memory allocation and deallocation: orders of magnitude slower than on CPU Memory transfer: 15 μs + 1 ms/MB

© 2010 Michael Boyer72 Acceleration using CUDA [Diagram: the CPU-side program (allocate GPU memory, transfer input data, launch kernel, transfer results, free GPU memory) alongside the CUDA kernel running on the GPU] Step 1: Determine which code to offload to the GPU as a CUDA kernel Step 2: Write the CPU-side CUDA code Step 3: Write and optimize the GPU kernel

© 2010 Michael Boyer73 Performance Issues Branch divergence Memory coalescing Key concept: Warp – Group of threads that execute concurrently – In current hardware, warp size is 32 threads

© 2010 Michael Boyer74 Branch Divergence Remember: hardware is SIMD What if threads in the same warp follow two different paths? Solution: entire warp executes both paths – Unneeded values are simply ignored – Performance can suffer with many divergent branches
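A small illustrative pair of kernels, not from the slides, contrasts an intra-warp branch with a warp-aligned one; the two kernels deliberately assign work differently and are only meant to show the control-flow difference:

// Divergent: threads within one 32-thread warp take different paths,
// so the warp executes both the if and the else serially.
__global__ void divergent(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0) data[i] = data[i] * 2.0f;
    else            data[i] = data[i] + 1.0f;
}

// Warp-aligned: all 32 threads of a warp share the same (i / 32) parity,
// so every warp follows a single path and no divergence occurs.
__global__ void warp_aligned(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / 32) % 2 == 0) data[i] = data[i] * 2.0f;
    else                   data[i] = data[i] + 1.0f;
}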

© 2010 Michael Boyer75 Memory Coalescing Threads in the same half-warp access memory together If all threads access successive memory locations: – All of the accesses are combined (coalesced) – Result: significantly improved memory performance Otherwise: – Each thread accesses memory separately – Result: significantly reduced memory performance
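For illustration, the difference usually comes down to the index expression; in this sketch (not from the slides), consecutive threads touching consecutive addresses coalesce, while a strided pattern does not:

// Coalesced: thread k reads element k, so a half-warp touches
// 16 consecutive floats and the accesses combine into few transactions.
__global__ void coalesced_copy(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Non-coalesced: thread k reads element k * stride, so the threads of a
// half-warp hit widely separated addresses and each access is serviced
// separately, sharply reducing effective bandwidth.
__global__ void strided_copy(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}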

© 2010 Michael Boyer76 Memory Coalescing: Examples [Diagram: coalesced accesses vs. a non-coalesced access pattern]

© 2010 Michael Boyer77 Parallelization Granularity [Diagram: CPU with CPU memory and GPU with GPU memory]

© 2010 Michael Boyer78 Kernel Overhead Revisited Overhead depends on calling pattern: – One at a time (synchronous): 9 microseconds – Back-to-back (asynchronous): 3 microseconds [Timeline diagram: synchronous kernel calls separated by memory transfers and implicit synchronization vs. asynchronous back-to-back kernel calls]

© 2010 Michael Boyer79 Lesson 1 Revisited: Reduce Kernel Overhead Increase amount of work per kernel call – Decrease total number of kernel calls – Amortize overhead of each kernel call across more computation Launch kernels back-to-back – Kernel calls are asynchronous: avoid explicit or implicit synchronization between kernel calls – Overlap kernel execution on the GPU with driver access on the CPU
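An illustrative sketch of the back-to-back pattern, with placeholder kernel and function names: launches in the same stream return immediately on the host and run in order on the GPU, so the CPU can enqueue every iteration and synchronize only when it needs the result:

// Hypothetical iterative computation: n_iter dependent steps on the GPU.
__global__ void step(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 0.5f * (data[i] + 1.0f);   // placeholder update
}

void run_iterations(float *d_data, int n, int n_iter) {
    int blocks = (n + 255) / 256;

    // Launches are asynchronous: each call returns immediately, and the
    // GPU executes them in order because they share the default stream.
    for (int it = 0; it < n_iter; it++) {
        step<<<blocks, 256>>>(d_data, n);
    }

    // Synchronize only once, when the CPU needs the final result.
    cudaDeviceSynchronize();
}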