Lecture 19: Revisiting Strides, CUDA Threads… Topics: strides through memory; practical performance considerations; readings. November 7, 2012. CSCE 513 Advanced Computer Architecture

– 2 – CSCE 513 Fall 2012 Overview Last Time: intro to CUDA/GPU programming. Readings for today: Stanford course (iTunes); book (online) by David Kirk/NVIDIA and Wen-mei W. Hwu, Chapters 1-3; new OpenMP examples from SC 2008 (linked Tuesday); NVIDIA CUDA example.

– 3 – CSCE 513 Fall 2012 NVIDIA Developer Zone: CUDA Toolkit Downloads (C/C++ compiler, CUDA-GDB, Visual Profiler, CUDA Memcheck, GPU-accelerated libraries, other tools & documentation); Developer Drivers Downloads; GPU Computing SDK Downloads.

– 4 – CSCE 513 Fall 2012 Stanford CS 193G: sp2010/wiki/TutorialPrerequisites. Vincent Natol, “Kudos for CUDA,” HPC Wire (2010). Patterson, David A.; Hennessy, John L. Computer Architecture: A Quantitative Approach (The Morgan Kaufmann Series in Computer Architecture and Design). Elsevier Science, Kindle Edition (reference).

– 5 – CSCE 513 Fall 2012 Lessons from the Graphics Pipeline. Throughput is paramount: must paint every pixel within the frame time; scalability. Create, run, & retire lots of threads very rapidly: measured 14.8 Gthread/s on an increment() kernel. Use multithreading to hide latency: 1 stalled thread is OK if 100 are ready to run.
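The increment() kernel referenced above is not shown on the slide; the following is a minimal sketch of what such a kernel might look like (the signature and the bounds check are assumptions):

__global__ void increment(int *data, int n)
{
  // each thread bumps one element; millions of such tiny threads
  // can be created, run, and retired per launch
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  if (i < n)
    data[i] += 1;
}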

– 6 – CSCE 513 Fall 2012 Why is this different from a CPU? Different goals produce different designs: the GPU assumes the workload is highly parallel; the CPU must be good at everything, parallel or not. CPU: minimize latency experienced by 1 thread; big on-chip caches; sophisticated control logic. GPU: maximize throughput of all threads; # threads in flight limited by resources => lots of resources (registers, bandwidth, etc.); multithreading can hide latency => skip the big caches; share control logic across many threads.

– 7 – CSCE 513 Fall 2012 NVIDIA GPU Architecture: Fermi GF100 [block diagram: host interface, GigaThread scheduler, L2 cache, and multiple DRAM interfaces surrounding the SMs]

– 8 – CSCE 513 Fall 2012 SM (Streaming Multiprocessor) [diagram: instruction cache, dual warp scheduler/dispatch, register file, 32 cores, 16 load/store units, 4 special function units, interconnect network, 64K configurable cache/shared memory, uniform cache] 32 CUDA cores per SM (512 total); each core executes an identical instruction or sleeps; 24 active warps limit. 8x peak FP64 performance; 50% of peak FP32 performance. Direct load/store to memory: usual linear sequence of bytes; high bandwidth (hundreds of GB/sec). 64KB of fast, on-chip RAM: software- or hardware-managed; shared amongst CUDA cores; enables thread communication.

– 9 – CSCE 513 Fall 2012 Key Architectural Ideas SIMT (Single Instruction Multiple Thread) execution: threads run in groups of 32 called warps; threads in a warp share the instruction unit (IU); HW automatically handles divergence. Hardware multithreading: HW resource allocation & thread scheduling; HW relies on threads to hide latency. Threads have all resources needed to run: any warp not waiting for something can run; context switching is (basically) free. [SM diagram as on the previous slide]

– 10 – CSCE 513 Fall 2012 C for CUDA Philosophy: provide a minimal set of extensions necessary to expose power.
Function qualifiers:
__global__ void my_kernel() { }
__device__ float my_device_func() { }
Variable qualifiers:
__constant__ float my_constant_array[32];
__shared__   float my_shared_array[32];
Execution configuration:
dim3 grid_dim(100, 50);  // 5000 thread blocks
dim3 block_dim(4, 8, 8); // 256 threads per block
my_kernel<<< grid_dim, block_dim >>>(...); // Launch kernel
Built-in variables and functions valid in device code:
dim3 gridDim;   // Grid dimension
dim3 blockDim;  // Block dimension
dim3 blockIdx;  // Block index
dim3 threadIdx; // Thread index
void __syncthreads(); // Thread synchronization

– 11 – CSCE 513 Fall 2012 Example: vector_addition (Device Code highlighted)
// compute vector sum c = a + b
// each thread performs one pair-wise addition
__global__ void vector_add(float* A, float* B, float* C)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  C[i] = A[i] + B[i];
}

int main()
{
  // elided initialization code
  // Run N/256 blocks of 256 threads each
  vector_add<<< N/256, 256 >>>(d_A, d_B, d_C);
}

– 12 – CSCE 513 Fall 2012 Example: vector_addition (Host Code highlighted)
// compute vector sum c = a + b
// each thread performs one pair-wise addition
__global__ void vector_add(float* A, float* B, float* C)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  C[i] = A[i] + B[i];
}

int main()
{
  // elided initialization code
  // launch N/256 blocks of 256 threads each
  vector_add<<< N/256, 256 >>>(d_A, d_B, d_C);
}

– 13 – CSCE 513 Fall 2012 Example: Initialization code for vector_addition
// allocate and initialize host (CPU) memory
float *h_A = …, *h_B = …;

// allocate device (GPU) memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float) );
cudaMalloc( (void**) &d_B, N * sizeof(float) );
cudaMalloc( (void**) &d_C, N * sizeof(float) );

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

// launch N/256 blocks of 256 threads each
vector_add<<< N/256, 256 >>>(d_A, d_B, d_C);
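The slide stops at the kernel launch. A sketch of the remaining steps, not shown on the slide (the host result array h_C is an assumption):

// copy the result back to the host
float *h_C = (float*) malloc(N * sizeof(float));
cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost );

// free host and device memory
free(h_C);
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);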

– 14 – CSCE 513 Fall 2012 CUDA Programming Model Parallel code (a kernel) is launched and executed on a device by many threads. Launches are hierarchical: threads are grouped into blocks; blocks are grouped into grids. Familiar serial code is written for a thread. Each thread is free to execute a unique code path. Built-in thread and block ID variables.

– 15 – CSCE 513 Fall 2012 DAXPY example in text
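The DAXPY code itself appears only as an image on the slide; a sketch along the lines of the textbook example (the variable names and the 256-thread block size are assumptions):

// DAXPY: y = a*x + y, one element per CUDA thread
__global__ void daxpy(int n, double a, double *x, double *y)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = a * x[i] + y[i];
}

// Host side: launch enough 256-thread blocks to cover n elements
// daxpy<<< (n + 255) / 256, 256 >>>(n, 2.0, d_x, d_y);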

– 16 – CSCE 513 Fall 2012

– 17 – CSCE 513 Fall 2012 High Level View [diagram: GPU (SMs with SMEM and global memory) connected to the CPU and chipset over PCIe]

– 18 – CSCE 513 Fall 2012 Blocks of threads run on an SM [diagram: a thread with its registers and memory maps to a streaming processor; a thread block with its per-block shared memory (SMEM) maps to a streaming multiprocessor]

– 19 – CSCE 513 Fall 2012 Whole grid runs on GPU Many blocks of threads... [diagram: many SMs, each with its own SMEM, sharing global memory]

– 20 – CSCE 513 Fall 2012 Thread Hierarchy Threads launched for a parallel section are partitioned into thread blocks. Grid = all blocks for a given launch. A thread block is a group of threads that can: synchronize their execution; communicate via shared memory (see the sketch below).
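As an illustration of both properties, a sketch (not from the slides; the kernel name and the 256-thread block size are assumptions) in which the threads of one block share data through shared memory and synchronize before reading it:

__global__ void shift_left(const float *in, float *out)
{
  __shared__ float tile[256];          // one element per thread in the block

  int i = threadIdx.x + blockDim.x * blockIdx.x;
  tile[threadIdx.x] = in[i];           // each thread stages one value

  __syncthreads();                     // all writes to tile are now visible

  if (threadIdx.x < blockDim.x - 1)
    out[i] = tile[threadIdx.x + 1];    // read a value written by a neighbour
}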

– 21 – CSCE 513 Fall 2012 Memory Model [diagram: sequential kernels (Kernel 0, Kernel 1, ...) all reading and writing the same per-device global memory]

– 22 – CSCE 513 Fall 2012 IDs and Dimensions Threads: 3D IDs, unique within a block. Blocks: 2D IDs, unique within a grid. Dimensions set at launch; can be unique for each grid. Built-in variables: threadIdx, blockIdx, blockDim, gridDim. [diagram: a grid of 3x2 blocks; Block (1,1) expanded into a 5x3 arrangement of threads]

– 23 – CSCE 513 Fall 2012 Kernel with 2D Indexing
__global__ void kernel( int *a, int dimx, int dimy )
{
  int ix = blockIdx.x*blockDim.x + threadIdx.x;
  int iy = blockIdx.y*blockDim.y + threadIdx.y;
  int idx = iy*dimx + ix;
  a[idx] = a[idx]+1;
}

– 24 – CSCE 513 Fall 2012
__global__ void kernel( int *a, int dimx, int dimy )
{
  int ix = blockIdx.x*blockDim.x + threadIdx.x;
  int iy = blockIdx.y*blockDim.y + threadIdx.y;
  int idx = iy*dimx + ix;
  a[idx] = a[idx]+1;
}

int main()
{
  int dimx = 16;
  int dimy = 16;
  int num_bytes = dimx*dimy*sizeof(int);

  int *d_a=0, *h_a=0; // device and host pointers
  h_a = (int*)malloc(num_bytes);
  cudaMalloc( (void**)&d_a, num_bytes );

  if( 0==h_a || 0==d_a ) {
    printf("couldn't allocate memory\n");
    return 1;
  }

  cudaMemset( d_a, 0, num_bytes );

  dim3 grid, block;
  block.x = 4;
  block.y = 4;
  grid.x = dimx / block.x;
  grid.y = dimy / block.y;

  kernel<<< grid, block >>>( d_a, dimx, dimy );

  cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

  for(int row=0; row<dimy; row++) {
    for(int col=0; col<dimx; col++)
      printf("%d ", h_a[row*dimx+col] );
    printf("\n");
  }

  free( h_a );
  cudaFree( d_a );
  return 0;
}

– 25 – CSCE 513 Fall 2012

– 26 – CSCE 513 Fall 2012 The Problem How do you do global communication? Finish a grid and start a new one

– 27 – CSCE 513 Fall 2012 Global Communication Finish a kernel and start a new one. All writes from all threads complete before a kernel finishes.
step1<<< grid1, blk1 >>>(...);
// The system ensures that all
// writes from step1 complete.
step2<<< grid2, blk2 >>>(...);

– 28 – CSCE 513 Fall 2012 Global Communication Would need to decompose kernels into before and after parts

– 29 – CSCE 513 Fall 2012 Race Conditions Or, write to a predefined memory location Race condition! Updates can be lost

– 30 – CSCE 513 Fall 2012 Race Conditions
// vector[0] was equal to 0
threadId 0:     vector[0] += 5;  ...  a = vector[0];
threadId 1917:  vector[0] += 1;  ...  a = vector[0];
What is the value of a in thread 0? What is the value of a in thread 1917?

– 31 – CSCE 513 Fall 2012 Race Conditions Thread 0 could have finished execution before 1917 started Or the other way around Or both are executing at the same time Answer: not defined by the programming model, can be arbitrary

– 32 – CSCE 513 Fall 2012 Atomics CUDA provides atomic operations to deal with this problem

– 33 – CSCE 513 Fall 2012 Atomics An atomic operation guarantees that only a single thread has access to a piece of memory while the operation completes. The name atomic comes from the fact that it is uninterruptible. No dropped data, but ordering is still arbitrary. Different types of atomic instructions: atomic{Add, Sub, Exch, Min, Max, Inc, Dec, CAS, And, Or, Xor}. More types on Fermi.
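A minimal sketch of the difference (assumed kernel, not from the slides): the plain read-modify-write can lose updates when many threads hit the same address, while atomicAdd cannot, although the order of the updates remains arbitrary.

__global__ void count_hits(int *counter)
{
  // *counter += 1;          // racy: updates can be lost
  atomicAdd(counter, 1);     // safe: every increment is applied
}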

– 34 – CSCE 513 Fall 2012 Example: Histogram
// Determine frequency of colors in a picture
// colors have already been converted into ints
// Each thread looks at one pixel and increments
// a counter atomically
__global__ void histogram(int* colors, int* buckets)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  int c = colors[i];
  atomicAdd(&buckets[c], 1);
}

– 35 – CSCE 513 Fall 2012 Example: Workqueue
// For algorithms where the amount of work per item
// is highly non-uniform, it often makes sense
// to continuously grab work from a queue
__global__ void workq(int* work_q, int* q_counter,
                      int* output, int queue_max)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  int q_index = atomicInc(q_counter, queue_max);
  int result = do_work(work_q[q_index]);
  output[i] = result;
}

– 36 – CSCE 513 Fall 2012 Atomics Atomics are slower than normal load/store You can have the whole machine queuing on a single location in memory Atomics unavailable on G80!

– 37 – CSCE 513 Fall 2012 Example: Global Min/Max (Naive)
// If you require the maximum across all threads
// in a grid, you could do it with a single global
// maximum value, but it will be VERY slow
__global__ void global_max(int* values, int* gl_max)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  int val = values[i];
  atomicMax(gl_max, val);
}

– 38 – CSCE 513 Fall 2012 Example: Global Min/Max (Better)
// introduce intermediate maximum results, so that
// most threads do not try to update the global max
__global__ void global_max(int* values, int* max,
                           int* regional_maxes,
                           int num_regions)
{
  // i and val as before …
  int region = i % num_regions;
  if (atomicMax(&regional_maxes[region], val) < val)
  {
    atomicMax(max, val);
  }
}

– 39 – CSCE 513 Fall 2012 Global Min/Max A single value causes a serial bottleneck. Create a hierarchy of values for more parallelism (a rough block-level sketch follows). Performance will still be slow, so use judiciously. See next lecture for an even better version!
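A rough sketch of the hierarchical idea (assumptions: 256 threads per block and a power-of-two block size; the full reduction treatment is left to the next lecture): each block finds its own maximum in shared memory and then issues a single atomicMax, so the global location sees one update per block instead of one per thread.

__global__ void block_max(const int* values, int* gl_max)
{
  __shared__ int smax[256];
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  smax[threadIdx.x] = values[i];
  __syncthreads();

  // tree reduction within the block
  for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
  {
    if (threadIdx.x < stride)
      smax[threadIdx.x] = max(smax[threadIdx.x], smax[threadIdx.x + stride]);
    __syncthreads();
  }

  if (threadIdx.x == 0)
    atomicMax(gl_max, smax[0]);    // one atomic per block
}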

– 40 – CSCE 513 Fall 2012 Summary Can’t use normal load/store for inter-thread communication because of race conditions. Use atomic instructions for sparse and/or unpredictable global communication. See the next lectures on shared memory and scan for other communication patterns. Decompose data (very limited use of a single global sum/max/min/etc.) for more parallelism.

– 41 – CSCE 513 Fall 2012 How an SM executes threads Overview of how a Streaming Multiprocessor works: SIMT execution; divergence.

– 42 – CSCE 513 Fall 2012 Scheduling Blocks onto SMs [diagram: thread blocks 5, 27, 61, and 2001 waiting to be assigned to a streaming multiprocessor] HW schedules thread blocks onto available SMs. No guarantee of ordering among thread blocks. HW will schedule a thread block as soon as a previous thread block finishes.

– 43 – CSCE 513 Fall 2012 Warps [diagram: one control unit driving a row of ALUs] A warp = 32 threads launched together. Usually, they execute together as well.

– 44 – CSCE 513 Fall 2012 Mapping of Thread Blocks Each thread block is mapped to one or more warps. The hardware schedules each warp independently. [diagram: Thread Block N (128 threads) split into warps TB N W1 through TB N W4]
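A small sketch (an assumption, not from the slide) of how a thread in a 1D block can work out which warp and lane it occupies, given the warp size of 32:

__global__ void warp_info(int *warp_of, int *lane_of)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  warp_of[i] = threadIdx.x / 32;   // which warp within the block
  lane_of[i] = threadIdx.x % 32;   // which lane within that warp
}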

– 45 – CSCE 513 Fall 2012 Thread Scheduling Example The SM implements zero-overhead warp scheduling. At any time, only one of the warps is executed by the SM.* Warps whose next instruction has its inputs ready for consumption are eligible for execution. Eligible warps are selected for execution on a prioritized scheduling policy. All threads in a warp execute the same instruction when selected.

– 46 – CSCE 513 Fall 2012 Control Flow Divergence What happens if you have the following code?
if(foo(threadIdx.x))
{
  do_A();
}
else
{
  do_B();
}

– 47 – CSCE 513 Fall 2012 Control Flow Divergence [diagram from Fung et al., MICRO ’07: a warp splits at a branch, executes Path A and Path B serially, then reconverges]

– 48 – CSCE 513 Fall 2012 Control Flow Divergence Nested branches are handled as well:
if(foo(threadIdx.x))
{
  if(bar(threadIdx.x))
    do_A();
  else
    do_B();
}
else
  do_C();

– 49 – CSCE 513 Fall 2012 Control Flow Divergence [diagram: nested branches serialize Path A, Path B, and Path C within the warp]

– 50 – CSCE 513 Fall 2012 Control Flow Divergence You don’t have to worry about divergence for correctness (*) You might have to think about it for performance Depends on your branch conditions

– 51 – CSCE 513 Fall 2012 Control Flow Divergence Performance drops off with the degree of divergence (a warp-aligned sketch follows):
switch(threadIdx.x % N)
{
  case 0: ...
  case 1: ...
}
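One common way to limit the damage, sketched here under the assumption of a 32-thread warp (the kernel name is hypothetical): branch on the warp index rather than the thread index, so every warp takes a single path even though different warps do different work.

__global__ void warp_aligned(int *out)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  if ((threadIdx.x / 32) % 2 == 0)   // the whole warp agrees on the branch
    out[i] = 1;
  else
    out[i] = 2;
}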

– 52 – CSCE 513 Fall 2012 Divergence

– 53 – CSCE 513 Fall 2012 Atomics atomicAdd returns the previous value at a certain address Useful for grabbing variable amounts of data from a list
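A sketch of that pattern (the kernel and variable names are assumptions): each thread reserves a contiguous region of the output by atomically advancing a shared write cursor; the old value returned by atomicAdd marks where that thread's region begins.

__global__ void append_items(int *out, int *out_count, int count)
{
  int start = atomicAdd(out_count, count);  // returns the previous value
  for (int k = 0; k < count; ++k)
    out[start + k] = k;                     // fill the reserved slots
}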

– 54 – CSCE 513 Fall 2012 Compare and Swap
// Semantics of CAS (the hardware performs this atomically;
// "address" stands in for the memory word being updated)
int compare_and_swap(int* address, int oldval, int newval)
{
  int old_val = *address;
  if(old_val == oldval)
    *address = newval;
  return old_val;
}

– 55 – CSCE 513 Fall 2012 Compare and Swap Most general type of atomic Can emulate all others with CAS
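For example, an atomic add can be emulated with atomicCAS (a sketch of the standard retry loop; the function name is an assumption): keep retrying until the word was not changed by another thread between the read and the swap.

__device__ int my_atomic_add(int *address, int val)
{
  int old = *address, assumed;
  do
  {
    assumed = old;
    old = atomicCAS(address, assumed, assumed + val);
  } while (old != assumed);
  return old;   // previous value, matching atomicAdd's convention
}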

– 56 – CSCE 513 Fall 2012 Locks Use very judiciously. Always include a max_iter in your spinloop! Decompose your data and your locks.
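A sketch of such a bounded spinlock built on atomicCAS (the names and the lock convention, 0 = free, are assumptions; this is not from the slides): the loop gives up after max_iter attempts rather than spinning forever.

__device__ bool try_lock(int *lock, int max_iter)
{
  for (int i = 0; i < max_iter; ++i)
    if (atomicCAS(lock, 0, 1) == 0)   // grabbed the lock
      return true;
  return false;                       // gave up after max_iter attempts
}

__device__ void unlock(int *lock)
{
  atomicExch(lock, 0);                // release
}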