More on threads, shared memory, synchronization

cuPrintf
A library function for CUDA developers; copy the files from /opt/cuPrintf into your source code folder.

    #include "cuPrintf.cu"

    __global__ void testKernel(int val) {
        cuPrintf("Value is: %d\n", val);
    }

    int main() {
        cudaPrintfInit();
        testKernel<<<2, 3>>>(10);
        cudaPrintfDisplay(stdout, true);
        cudaPrintfEnd();
        return 0;
    }
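Side note (not from the slides): on devices of compute capability 2.0 or later, the standard printf can be called directly inside a kernel, so cuPrintf is not needed there; device output is flushed when the host synchronizes, e.g. via cudaDeviceSynchronize().

    __global__ void testKernel(int val) {
        printf("Value is: %d\n", val);   // device-side printf (compute capability >= 2.0)
    }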

Handling Arbitrarily Long Vectors
- The limit is 512 threads per block and 65,535 blocks per grid dimension, so a one-thread-per-element launch fails if the vector has N elements and N / 512 > 65,535, i.e. N > 65,535 * 512 = 33,553,920 elements.
- That is pretty big, but with up to 4 GB of device memory we have capacity for far longer vectors.
- Solution: assign a range of data values to each thread instead of having each thread operate on only one value.
- Next slide: an easy-to-code solution.
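To make the limit concrete, here is the naive launch configuration (an illustrative sketch, not from the slides; the kernel name is hypothetical):

    int threadsPerBlock = 512;
    int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;   // one thread per element
    // gridDim.x is capped at 65,535 on this hardware, so the launch below is invalid
    // once blocks > 65,535, i.e. once N > 65,535 * 512 = 33,553,920:
    // addOnePerElement<<<blocks, threadsPerBlock>>>(dev_a, dev_b, dev_c);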

Approach
- Use a fixed number of blocks and a fixed number of threads per block, ideally a block size that is a multiple of the warp size, e.g. 128 or 256 threads per block.
- Each thread processes elements with a stride equal to the total number of threads.
- Example with 2 blocks, 3 threads per block, and a 10-element vector:

    Index:   0    1    2    3    4    5    6    7    8    9
    Thread:  B0T0 B0T1 B0T2 B1T0 B1T1 B1T2 B0T0 B0T1 B0T2 B1T0

- Each thread starts work at index (blockIdx.x * blockDim.x) + threadIdx.x.
- e.g. B1T0 starts working at (1 * 3) + 0 = index 3; the next item it works on is at index 3 + TotalThreads = 3 + 6 = 9, where TotalThreads = blockDim.x * gridDim.x.

Vector Add Kernel for Arbitrarily Long Vectors

    #define N (100 * 1024)   // Length of vector

    __global__ void add(int *a, int *b, int *c) {
        int tid = threadIdx.x + (blockIdx.x * blockDim.x);
        while (tid < N) {
            c[tid] = a[tid] + b[tid];
            tid += blockDim.x * gridDim.x;   // stride = total number of threads
        }
    }

In main, pick some number of blocks (fewer than N) and a threads-per-block count that fills up warps:

    add<<<128, 128>>>(dev_a, dev_b, dev_c);   // 16384 total threads
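A minimal host-side sketch for driving this kernel (not from the slides; it assumes the N and add() defined above, and error checking is omitted):

    #include <stdlib.h>   // malloc/free

    int main() {
        int size = N * sizeof(int);
        int *a = (int *) malloc(size), *b = (int *) malloc(size), *c = (int *) malloc(size);
        int *dev_a, *dev_b, *dev_c;

        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }   // sample inputs

        cudaMalloc((void **) &dev_a, size);
        cudaMalloc((void **) &dev_b, size);
        cudaMalloc((void **) &dev_c, size);
        cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

        add<<<128, 128>>>(dev_a, dev_b, dev_c);   // 16384 threads stride over the N elements

        cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);

        cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
        free(a); free(b); free(c);
        return 0;
    }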

G80 Implementation of CUDA Memories
Each thread can:
- Read/write per-thread registers
- Read/write per-thread local memory
- Read/write per-block shared memory
- Read/write per-grid global memory
- Read-only per-grid constant memory

Shared memory:
- Shared only among the threads in a block
- Is on chip, not in DRAM, so it is fast to access
- Useful as a software-managed cache or scratchpad
- Accesses must be synchronized if the same values are shared among threads

(Diagram: the host alongside the device grid; global and constant memory are shared by all blocks, each block has its own shared memory, and each thread has its own registers.)

Global, constant, and texture memory spaces are persistent across kernels called by the same application.
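A minimal sketch (not from the slides; the kernel, variable names, and launch size are made up) showing where each of these memory spaces appears in code:

    __constant__ float scaleFactor;              // per-grid constant memory; the host sets it
                                                 // with cudaMemcpyToSymbol(scaleFactor, ...)

    __global__ void memorySpaces(float *out) {   // out points into per-grid global memory
        __shared__ float tile[256];              // per-block shared memory (assumes <= 256 threads/block)

        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        float x = idx * scaleFactor;             // local scalar lives in a per-thread register

        tile[threadIdx.x] = x;                   // write shared memory ...
        __syncthreads();                         // ... and synchronize before other threads read it

        out[idx] = tile[(threadIdx.x + 1) % blockDim.x];   // read a neighbour's value, write global memory
    }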

Shared Memory Example: Dot Product
- The book does a more complex version of this in its matrix multiply example.
- (x1, x2, x3, x4) · (y1, y2, y3, y4) = x1y1 + x2y2 + x3y3 + x4y4
- When we did this in matrix multiply, one thread performed this entire computation for a row and column.
- Obvious parallelism idea: have thread 0 compute x1y1, thread 1 compute x2y2, etc.
- We then have to store the individual products somewhere and add up all of the intermediate sums: use shared memory.

Shared Memory Dot Product
Example: vectors A and B with 10 elements, 2 blocks, 4 threads per block.

    A, B index: 0    1    2    3    4    5    6    7    8    9
    Thread:     B0T0 B0T1 B0T2 B0T3 B1T0 B1T1 B1T2 B1T3 B0T0 B0T1

- B0T0 computes A[0]*B[0] + A[8]*B[8]
- B0T1 computes A[1]*B[1] + A[9]*B[9]
- Etc.; this will be easier later with threadsPerBlock a power of 2.
- Store each thread's result in a per-block shared memory array: __shared__ float cache[threadsPerBlock];

    B0 cache: [0] B0T0 sum   [1] B0T1 sum   [2] B0T2 sum   [3] B0T3 sum
    B1 cache: [0] B1T0 sum   [1] B1T1 sum   [2] B1T2 sum   [3] B1T3 sum

Kernel Test Code

    #include "stdio.h"

    #define N 10
    const int THREADS_PER_BLOCK = 4;   // Has to be an int, not a #define; power of 2
    const int NUM_BLOCKS = 2;          // Has to be an int, not a #define

    __global__ void dot(float *a, float *b, float *c) {
        __shared__ float cache[THREADS_PER_BLOCK];
        int tid = threadIdx.x + (blockIdx.x * blockDim.x);
        int cacheIndex = threadIdx.x;

        float temp = 0;
        while (tid < N) {
            temp += a[tid] * b[tid];
            tid += blockDim.x * gridDim.x;   // THREADS_PER_BLOCK * NUM_BLOCKS
        }
        cache[cacheIndex] = temp;

        if ((blockIdx.x == 0) && (threadIdx.x == 0))
            *c = cache[cacheIndex];   // For a test, only send back the result of one thread
    }

Main Test Code

    int main() {
        float a[N], b[N], c[NUM_BLOCKS];   // We'll see why c[NUM_BLOCKS] shortly
        float *dev_a, *dev_b, *dev_c;

        cudaMalloc((void **) &dev_a, N*sizeof(float));
        cudaMalloc((void **) &dev_b, N*sizeof(float));
        cudaMalloc((void **) &dev_c, NUM_BLOCKS*sizeof(float));

        // Fill arrays
        for (int i = 0; i < N; i++) {
            a[i] = (float) i;
            b[i] = (float) i;
        }

        // Copy data from host to device
        cudaMemcpy(dev_a, a, N*sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dev_b, b, N*sizeof(float), cudaMemcpyHostToDevice);

        dot<<<NUM_BLOCKS,THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c);

        // Copy data from device to host
        cudaMemcpy(c, dev_c, NUM_BLOCKS*sizeof(float), cudaMemcpyDeviceToHost);

        // Output results
        printf("%f\n", c[0]);

        // <cudaFree and return 0 would go here>
    }

(With a[i] = b[i] = i, block 0 / thread 0 accumulates 0*0 + 8*8 = 64, so this test prints 64.000000.)

Accumulating Sums
- At this point each block has its per-thread sums in cache[], and these still have to be added together.
- The easy solution is to copy them all back to the host and let the host add them up: an O(n) operation.
- If n is small this is the fastest way to go.
- But we can do pairwise adds in logarithmic time: a common parallel algorithm called reduction.
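To see the difference (an illustrative count, not from the slides), consider a block of threadsPerBlock = 256 partial sums:

    serial sum:          256 - 1   = 255 dependent additions
    pairwise reduction:  log2(256) = 8 parallel steps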

Summation Reduction
Example with an 8-entry cache (entries labelled by their starting index):

    start:        [0]               [1]        [2]   [3]   [4] [5] [6] [7]
    after i = 4:  [0+4]             [1+5]      [2+6] [3+7] [4] [5] [6] [7]
    after i = 2:  [0+4+2+6]         [1+5+3+7]  [2+6] [3+7] [4] [5] [6] [7]
    after i = 1:  [0+4+2+6+1+5+3+7] [1+5+3+7]  [2+6] [3+7] [4] [5] [6] [7]   (i becomes 0, loop ends)

We have to wait for all working threads to finish adding before starting the next iteration.

Summation Reduction Code
At the end of the kernel, after storing temp into cache[cacheIndex]:

    __syncthreads();   // make sure every thread's partial sum is in cache[] before reducing
    int i = blockDim.x / 2;
    while (i > 0) {
        if (cacheIndex < i)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();   // wait for all adds in this pass before halving i
        i /= 2;
    }

Summation Reduction Code
We still need to sum the values computed by each block. Since there are (most likely) not many of these, we just return one value per block to the host and let the host add them up sequentially:

    __syncthreads();   // all partial sums must be in cache[] before reducing
    int i = blockDim.x / 2;
    while (i > 0) {
        if (cacheIndex < i)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
        i /= 2;
    }

    if (cacheIndex == 0)                    // We're thread 0 in this block
        c[blockIdx.x] = cache[cacheIndex];  // Save this block's sum in the per-block array

Main

    dot<<<NUM_BLOCKS,THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c);

    // Copy data from device to host
    cudaMemcpy(c, dev_c, NUM_BLOCKS*sizeof(float), cudaMemcpyDeviceToHost);

    // Sum the per-block results and output the final answer
    float sum = 0;
    for (int i = 0; i < NUM_BLOCKS; i++) {
        sum += c[i];
    }
    printf("The dot product is %f\n", sum);

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;

(Sanity check: with a[i] = b[i] = i and N = 10, the result is 0*0 + 1*1 + ... + 9*9 = 285, so the program prints "The dot product is 285.000000".)

Thread Divergence
- When control flow differs among threads, this is called thread divergence.
- Under normal circumstances, divergent branches simply result in some threads remaining idle while the others execute the instructions in the branch:

    int i = blockDim.x / 2;
    while (i > 0) {
        if (cacheIndex < i)                  // only threads with cacheIndex < i do the add
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
        i /= 2;
    }
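As a separate illustration (a minimal made-up kernel, not from the slides), here is a branch where threads in the same warp take different paths, so the hardware executes both paths one after the other with the inactive threads masked off:

    __global__ void divergentBranch(int *out) {
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        if (tid % 2 == 0)                // even-numbered threads take this path ...
            out[tid] = 2 * tid;
        else                             // ... odd-numbered threads take this one
            out[tid] = 2 * tid + 1;
    }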

Optimization Attempt
- In the reduction, only some of the threads (at most half of the block) update entries in the shared memory cache.
- What if we only waited for the threads that are actually writing to shared memory?

    int i = blockDim.x / 2;
    while (i > 0) {
        if (cacheIndex < i) {
            cache[cacheIndex] += cache[cacheIndex + i];
            __syncthreads();   // moved inside the if
        }
        i /= 2;
    }

- This won't work: __syncthreads() waits until ALL threads in the block reach that point, and the threads that skip the if never get there, so the block stalls. The barrier has to stay outside the conditional, as in the earlier version.

Summary
- Some arithmetic on the block and thread indices is needed to map a block's threads to the elements each should compute.
- Shared memory is fast but only accessible by threads in the same block.
- __syncthreads() is necessary when multiple threads access the same shared memory, and it must be used with care (never inside a divergent branch).