CUDA Programming David Monismith CS599 Based on notes from the Udacity Parallel Programming (cs344) Course.


CUDA Guarantees All threads within a block are guaranteed to run on the same SM at the same time. All blocks in a kernel will finish before any block from the next kernel starts.

Memory Model Every thread has its own local memory (e.g., its local variables). Threads within a block have access to a per-block shared memory. All threads can read from and write to global memory. CPU memory is separate from GPU memory and is called host memory.

Synchronization Warning: threads can access and modify each other’s results in shared and global memory. What happens if a thread modifies data that another thread is still using? We need tools to synchronize memory accesses and thread operations.

Barriers As in MPI and OpenMP, once a thread reaches a barrier it must wait until all other threads reach that barrier; then all threads may continue. See the next slide for example code.

Barriers Why barriers are needed: in the code below, arr[idx-1] may be read before the neighboring thread has written it, so the result is a race condition.
int idx = threadIdx.x;
__shared__ int arr[128];
arr[idx] = threadIdx.x;
if(idx > 0 && idx <= 127)
    arr[idx] = arr[idx-1];

Barriers Continued The code should be rewritten so that every thread finishes writing its element before any thread reads its neighbor’s element, and so that __syncthreads() is reached by all threads in the block (a barrier inside a divergent branch is not allowed):
int idx = threadIdx.x;
__shared__ int arr[128];
arr[idx] = threadIdx.x;
__syncthreads();                 // all writes to arr have completed
int temp = (idx > 0) ? arr[idx-1] : arr[0];
__syncthreads();                 // all reads of arr have completed
if(idx > 0)
    arr[idx] = temp;

__syncthreads() __syncthreads() creates a barrier for all threads within a block. Implicit barriers also exist between kernel calls: all blocks of one kernel finish before any block of the next kernel starts. So, CUDA provides a hierarchy of computation, memory, and synchronization primitives.
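For example (a minimal sketch; the kernel names, data, and launch configuration are illustrative, not taken from the course): two kernels launched on the default stream execute in order, so the second kernel can safely read what the first one wrote.
__global__ void step1(int * data) { data[threadIdx.x] = threadIdx.x; }
__global__ void step2(int * data) { data[threadIdx.x] *= 2; }
int main(int argc, char ** argv)
{
    int * devData;
    cudaMalloc((void **) &devData, sizeof(int) * 128);
    step1<<<1, 128>>>(devData);   // launched on the default stream
    step2<<<1, 128>>>(devData);   // will not start until every block of step1 has finished
    cudaDeviceSynchronize();      // the host waits for all GPU work to complete
    cudaFree(devData);
    return 0;
}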

Efficient CUDA Programs High Level Strategies:
–Modern GPUs can perform about 3 trillion math operations per second (3 TFLOPS).
–Maximize arithmetic intensity: the amount of math performed per unit of memory accessed.
–Maximize the number of useful compute operations per thread.
–Minimize the time spent on memory access per thread.
A short sketch of arithmetic intensity follows.
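This is a minimal sketch (the kernel, array names, and launch configuration are made up for illustration): load a value from global memory once into a register, then perform several math operations on the register instead of re-reading global memory.
__global__ void polyEval(const double * x, double * y)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    double v = x[idx];                            // one global-memory read per thread
    y[idx] = ((3.0 * v + 2.0) * v + 1.0) * v;     // several math operations per memory access
}
int main(int argc, char ** argv)
{
    double * x, * y;
    cudaMalloc((void **) &x, sizeof(double) * 256);
    cudaMalloc((void **) &y, sizeof(double) * 256);
    polyEval<<<1, 256>>>(x, y);
    cudaDeviceSynchronize();
    cudaFree(x);
    cudaFree(y);
    return 0;
}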

Minimize Time Spent on Memory Move frequently accessed data to shared memory.
Memory speed:
–Local > Shared >> Global >> Host
–Local memory lives in registers and the L1 cache.
Local Memory Example
__global__ void locMemEx(double f)
{
    double local_f;   // a local variable, private to each thread (held in a register)
    local_f = f;
}
int main(int argc, char ** argv)
{
    locMemEx<<<1, 128>>>(10.2);   // launch configuration assumed: one block of 128 threads
    cudaDeviceSynchronize();
    return 0;
}

Global Memory Example
#include <stdlib.h>
//Global memory
__global__ void globalMemEx(double * myArr)
{
    //myArr points to global memory; each thread reads and writes its own element
    myArr[threadIdx.x] = myArr[threadIdx.x];
}
int main(int argc, char ** argv)
{
    int i;
    double * myHostArr = (double *) malloc(sizeof(double)*256);
    double * devArr;
    cudaMalloc((void **) &devArr, sizeof(double)*256);
    for(i = 0; i < 256; i++)
        myHostArr[i] = i;
    cudaMemcpy((void *) devArr, (void *) myHostArr, sizeof(double)*256, cudaMemcpyHostToDevice);
    globalMemEx<<<1, 256>>>(devArr);   // launch configuration assumed: one block of 256 threads
    cudaMemcpy((void *) myHostArr, (void *) devArr, sizeof(double)*256, cudaMemcpyDeviceToHost);
    return 0;
}

Shared Memory Example
__global__ void shmemEx(double * arr)
{
    int i, idx = threadIdx.x;
    double avg, sum = 0.0;
    __shared__ double shArr[256];      // shared by all threads in the block
    shArr[idx] = arr[idx];             // each thread copies one element from global to shared memory
    __syncthreads();                   // wait until every element has been copied
    for(i = 0; i < idx; i++)
    {
        sum += shArr[i];
    }
    avg = sum / (idx + 1.0);
    if(arr[idx] > avg)
        arr[idx] = avg;
    //This statement does not affect the results because it modifies shared
    //memory, which is discarded when the block finishes.
    shArr[idx] += shArr[idx];
}

Code from Main Function
shmemEx<<<1, 256>>>(devArr);   // launch configuration assumed: one block of 256 threads
cudaMemcpy((void *) hostArr, (void *) devArr, sizeof(double)*256, cudaMemcpyDeviceToHost);

Memory Access We want threads to make contiguous memory accesses. The GPU is most efficient when the threads of a warp read or write neighboring locations in the same region of memory at the same time. When a thread accesses global memory, the hardware fetches an entire chunk of memory, not just the single data item:
–Contiguous accesses are good.
–Strided accesses are not so good.
–Random accesses are bad.
In an in-class exercise, we will draw pictures of each type of memory access. A small code sketch of the first two patterns follows.
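In this sketch (kernels and sizes are illustrative), coalescedCopy has neighboring threads read neighboring addresses, so a warp’s reads combine into a few memory transactions; stridedCopy has neighboring threads read addresses that are stride elements apart, so many more transactions are needed.
__global__ void coalescedCopy(const double * in, double * out)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    out[idx] = in[idx];               // contiguous: neighboring threads touch neighboring addresses
}
__global__ void stridedCopy(const double * in, double * out, int stride)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    out[idx] = in[idx * stride];      // strided: neighboring threads touch addresses far apart
}
int main(int argc, char ** argv)
{
    const int n = 1024, stride = 32;
    double * in, * out;
    cudaMalloc((void **) &in,  sizeof(double) * n * stride);   // large enough for the strided reads
    cudaMalloc((void **) &out, sizeof(double) * n);
    coalescedCopy<<<n / 256, 256>>>(in, out);
    stridedCopy<<<n / 256, 256>>>(in, out, stride);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}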

Memory Conflicts Many threads may try to modify the same memory location, e.g., 1,000,000 threads updating 10 array elements. Solve this with atomic operations:
–atomicAdd()
–atomicMin()
–atomicXor()
–atomicCAS() - compare and swap
An atomicAdd() sketch follows.
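As a sketch of the “many threads, few elements” case (the kernel, bin count, and launch configuration are made up for illustration): roughly 10^6 threads add into 10 array elements, so the increments must be atomic or updates will be lost.
__global__ void countByBin(int * bins, int numBins)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    atomicAdd(&bins[idx % numBins], 1);    // many threads hit the same bin; a plain += would lose updates
}
int main(int argc, char ** argv)
{
    int * bins;
    cudaMalloc((void **) &bins, sizeof(int) * 10);
    cudaMemset(bins, 0, sizeof(int) * 10);
    countByBin<<<1000, 1024>>>(bins, 10);  // about 10^6 threads updating 10 elements
    cudaDeviceSynchronize();
    cudaFree(bins);
    return 0;
}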

Atomic Limitations
–Only certain operations and data types are supported: no mod or exponentiation, and mostly integer types.
–Any atomic operation can be implemented with atomicCAS(), though doing so is quite complicated (see the sketch below).
–There are still no ordering constraints, and because floating point arithmetic is non-associative, e.g., (a + b) + c != a + (b + c), floating-point results can vary from run to run.
–Atomics serialize memory accesses, which makes atomic operations very slow.
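For example, an atomic add for doubles can be built from atomicCAS() using the standard compare-and-swap retry loop (a sketch of the well-known pattern; the kernel, array, and launch configuration are illustrative, and recent GPUs also provide atomicAdd() for double directly).
__device__ double atomicAddDouble(double * address, double val)
{
    unsigned long long int * addr = (unsigned long long int *) address;
    unsigned long long int old = *addr, assumed;
    do {
        assumed = old;
        // Try to swap in the new value; atomicCAS returns the value that was actually there.
        old = atomicCAS(addr, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);   // retry if another thread changed the value in the meantime
    return __longlong_as_double(old);
}
__global__ void sumAll(double * total, const double * vals)
{
    atomicAddDouble(total, vals[threadIdx.x]);
}
int main(int argc, char ** argv)
{
    double * total, * vals;
    cudaMalloc((void **) &total, sizeof(double));
    cudaMalloc((void **) &vals, sizeof(double) * 256);
    cudaMemset(total, 0, sizeof(double));
    sumAll<<<1, 256>>>(total, vals);
    cudaDeviceSynchronize();
    cudaFree(total);
    cudaFree(vals);
    return 0;
}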

In Class Exercise Try each of the following:
–10^6 threads incrementing 10^6 elements
–10^5 threads atomically incrementing 10^5 elements
–10^6 threads incrementing 1000 elements
–10^6 threads atomically incrementing 1000 elements
–10^7 threads atomically incrementing 1000 elements
Time your results. A timing sketch using CUDA events follows.
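One way to time each run is with CUDA events (a sketch of a main-function fragment; incrementKernel, devArr, numElems, numBlocks, and threadsPerBlock are placeholders for whichever version of the exercise you are timing):
cudaEvent_t start, stop;
float elapsedMs;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);                                             // mark the start on the GPU timeline
incrementKernel<<<numBlocks, threadsPerBlock>>>(devArr, numElems);  // placeholder kernel launch
cudaEventRecord(stop);                                              // mark the end on the GPU timeline
cudaEventSynchronize(stop);                                         // wait until the stop event has occurred
cudaEventElapsedTime(&elapsedMs, start, stop);                      // elapsed time in milliseconds
printf("Elapsed time: %f ms\n", elapsedMs);
cudaEventDestroy(start);
cudaEventDestroy(stop);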

Avoiding Control Structures In CUDA we want to avoid thread divergence because the threads of a warp execute the same kernel instruction at the same time.
–Threads that take one branch are forced to wait while the other threads execute a different path (e.g., one thread needs to execute the else branch while another executes the if branch).
This means we should avoid if statements in GPU code whenever possible; when the branches are simple, the same effect can often be achieved with straight-line code, as sketched below.
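A minimal sketch of replacing a branch with straight-line code (the clamping kernels and launch configuration are made up for illustration): both kernels cap each element at a limit, but the second expresses it without an if, so every thread in the warp executes the same instructions.
__global__ void clampWithBranch(double * arr, double limit)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (arr[idx] > limit)               // threads in the same warp may take different paths here
        arr[idx] = limit;
}
__global__ void clampBranchless(double * arr, double limit)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    arr[idx] = fmin(arr[idx], limit);   // same instructions for every thread, so no divergence
}
int main(int argc, char ** argv)
{
    double * devArr;
    cudaMalloc((void **) &devArr, sizeof(double) * 256);
    clampWithBranch<<<1, 256>>>(devArr, 1.0);
    clampBranchless<<<1, 256>>>(devArr, 1.0);
    cudaDeviceSynchronize();
    cudaFree(devArr);
    return 0;
}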

Divergence Divergence, in terms of threads, means that threads within a warp do different things. It can happen in both loops and if statements, and it occurs often when loops run for different numbers of iterations in different threads. Keep in mind that the other threads in the warp have to wait until all divergent threads finish.
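A minimal sketch of loop divergence (illustrative kernel and launch configuration; note that the for loop in the shared memory example earlier has the same property): the loop below runs threadIdx.x iterations, so within a warp the threads with small indices finish early and then sit idle until the thread with the largest index completes.
__global__ void divergentLoop(const double * in, double * out)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    double sum = 0.0;
    for (int i = 0; i < threadIdx.x; i++)   // trip count differs from thread to thread
        sum += in[i];
    out[idx] = sum;                         // the whole warp waits for its slowest thread
}
int main(int argc, char ** argv)
{
    double * in, * out;
    cudaMalloc((void **) &in,  sizeof(double) * 256);
    cudaMalloc((void **) &out, sizeof(double) * 256);
    divergentLoop<<<1, 256>>>(in, out);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}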