CUDA C/C++ Basics
Part 3 – Shared Memory and Synchronization
REU 2016
By Bhaskar Bhattacharya
Adapted from Cyril Zeller, NVIDIA Corporation
© NVIDIA Corporation 2011
Things to Note: MCA Project Topics
Introduction to CUDA C/C++
What will you learn in this session?
- Shared memory
- More about threads
- __syncthreads()
1D Stencil
Consider applying a 1D stencil to a 1D array of elements:
- Each output element is the sum of the input elements within a radius
- If the radius is 3, each output element is the sum of 7 input elements
[Figure: input array "in" mapped to output array "out"]
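For reference, the same stencil on the CPU is just a pair of nested loops. Below is a minimal serial sketch (not from the original slides; RADIUS, the padding convention, and the function name are our assumptions):

    #define RADIUS 3

    // Serial 1D stencil for reference. Each out[i] sums the 2*RADIUS + 1
    // neighbors of in[i]; the caller is assumed to pad "in" so that
    // in[i - RADIUS] .. in[i + RADIUS] are always valid.
    void stencil_1d_cpu(const int *in, int *out, int n) {
        for (int i = 0; i < n; i++) {
            int result = 0;
            for (int offset = -RADIUS; offset <= RADIUS; offset++)
                result += in[i + offset];
            out[i] = result;
        }
    }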
Implementing Within a Block
- Each thread processes one output element: blockDim.x elements per block
- Input elements are read several times: with radius 3, each input element is read seven times
[Figure: threads 0–8 reading overlapping windows of "in" (plus a radius of elements on each side) and each writing one element of "out"]
Sharing Data Between Threads
Terminology: within a block, threads share data via shared memory.
- Extremely fast on-chip memory; in contrast, device memory is referred to as global memory
- Like a user-managed cache
- Declare using __shared__; allocated per block
- Data is not visible to threads in other blocks
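As a quick illustration before the stencil kernel: shared memory can be declared statically inside a kernel, or sized at launch time through the third <<<>>> parameter. A minimal sketch (the kernel names and sizes here are our own assumptions):

    // Static shared memory: size fixed at compile time; one copy per block.
    __global__ void uses_static_shared(void) {
        __shared__ int tile[256];
        tile[threadIdx.x] = threadIdx.x;   // visible to all threads in this block
    }

    // Dynamic shared memory: size supplied as the third launch parameter.
    __global__ void uses_dynamic_shared(void) {
        extern __shared__ int tile[];
        tile[threadIdx.x] = threadIdx.x;
    }

    // Launches:
    //   uses_static_shared<<<blocks, 256>>>();
    //   uses_dynamic_shared<<<blocks, 256, 256 * sizeof(int)>>>();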
Remember the CUDA grid, block, and thread layout.
Implementing With Shared Memory
Cache data in shared memory:
- Read (blockDim.x + 2 * radius) input elements from global memory to shared memory
- Compute blockDim.x output elements
- Write blockDim.x output elements to global memory
Each block needs a halo of radius elements at each boundary.
[Figure: "in" with a halo on the left and right of the blockDim.x output elements written to "out"]
Stencil Kernel
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }
Stencil Kernel
    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}
Data Race!
The stencil example will not work as written. Suppose thread 15 reads the halo before thread 0 has fetched it:

    ...
    temp[lindex] = in[gindex];              // thread 15 stores at temp[18]
    if (threadIdx.x < RADIUS) {             // skipped for thread 15, since threadIdx.x >= RADIUS
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];    // thread 15 loads from temp[19], possibly before thread 0 has written it
    ...
__syncthreads()
void __syncthreads();
- Synchronizes all threads within a block
- Used to prevent RAW / WAR / WAW hazards
- All threads must reach the barrier; in conditional code, the condition must be uniform across the block
Stencil Kernel
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();
Stencil Kernel
    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}
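With the kernel in place, a host-side driver might look like the sketch below. This is an illustration, not part of the original slides: the array size N, the halo-padding scheme, and passing d_in + RADIUS so the left-halo reads stay in bounds are all our assumptions.

    #include <stdlib.h>
    #include <cuda_runtime.h>

    #define RADIUS 3
    #define BLOCK_SIZE 16
    #define N (BLOCK_SIZE * 64)   // number of output elements (assumed)

    int main(void) {
        int size_in  = (N + 2 * RADIUS) * sizeof(int);   // input padded with halos
        int size_out = N * sizeof(int);
        int *in, *out, *d_in, *d_out;

        in  = (int *)malloc(size_in);
        out = (int *)malloc(size_out);
        for (int i = 0; i < N + 2 * RADIUS; i++) in[i] = 1;

        cudaMalloc((void **)&d_in, size_in);
        cudaMalloc((void **)&d_out, size_out);
        cudaMemcpy(d_in, in, size_in, cudaMemcpyHostToDevice);

        // Offset by RADIUS so in[gindex - RADIUS] stays in bounds for block 0
        stencil_1d<<<N / BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out);

        cudaMemcpy(out, d_out, size_out, cudaMemcpyDeviceToHost);   // implicit sync

        cudaFree(d_in); cudaFree(d_out);
        free(in); free(out);
        return 0;
    }

With every input set to 1, each output element should come back as 2 * RADIUS + 1 = 7.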
Some more stuff to know
- Coordinating host and device
- Error checking
- Device management
Coordinating Host & Device
Kernel launches are asynchronous:
- Control returns to the CPU immediately
- The CPU needs to synchronize before consuming the results
cudaMemcpy()
- Blocks the CPU until the copy is complete
- The copy begins when all preceding CUDA calls have completed
cudaMemcpyAsync()
- Asynchronous; does not block the CPU
cudaDeviceSynchronize()
- Blocks the CPU until all preceding CUDA calls have completed
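As a concrete example of these rules, the sketch below relies on the blocking cudaMemcpy() for synchronization (the kernel and variable names are our own, not from the slides):

    #include <cuda_runtime.h>

    __global__ void scale(float *x, float s, int n) {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if (i < n) x[i] *= s;
    }

    void run(float *host, int n) {
        float *d_x;
        size_t size = n * sizeof(float);
        cudaMalloc((void **)&d_x, size);
        cudaMemcpy(d_x, host, size, cudaMemcpyHostToDevice);

        scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);   // returns immediately

        // This copy blocks the CPU and begins only after the preceding
        // kernel has completed, so no explicit synchronization is needed.
        cudaMemcpy(host, d_x, size, cudaMemcpyDeviceToHost);
        cudaFree(d_x);
    }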
Reporting Errors
All CUDA API calls return an error code (cudaError_t):
- An error in the API call itself, OR
- An error in an earlier asynchronous operation (e.g. a kernel launch)
Get the error code for the last error:
    cudaError_t cudaGetLastError(void)
Get a string describing the error:
    const char *cudaGetErrorString(cudaError_t)
    printf("%s\n", cudaGetErrorString(cudaGetLastError()));
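A common convenience is to wrap calls in a checking macro like the one below (a sketch; the macro name CUDA_CHECK is our own, not part of the CUDA API):

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    // Hypothetical helper: checks the result of any CUDA runtime call.
    #define CUDA_CHECK(call)                                              \
        do {                                                              \
            cudaError_t err = (call);                                     \
            if (err != cudaSuccess) {                                     \
                fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                        __FILE__, __LINE__, cudaGetErrorString(err));     \
                exit(EXIT_FAILURE);                                       \
            }                                                             \
        } while (0)

    // Usage:
    //   CUDA_CHECK(cudaMalloc((void **)&d_in, size));
    //   kernel<<<blocks, threads>>>(...);
    //   CUDA_CHECK(cudaGetLastError());        // catches launch errors
    //   CUDA_CHECK(cudaDeviceSynchronize());   // catches async kernel errors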
Device Management
Applications can query and select GPUs:
- cudaGetDeviceCount(int *count)
- cudaSetDevice(int device)
- cudaGetDevice(int *device)
- cudaGetDeviceProperties(cudaDeviceProp *prop, int device)
Multiple host threads can share a device, and a single host thread can manage multiple devices:
- cudaSetDevice(i) to select the current device
- cudaMemcpy(…) for peer-to-peer copies
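Putting the query calls together, a small enumeration sketch (the properties printed here are a selection we chose for illustration):

    #include <stdio.h>
    #include <cuda_runtime.h>

    // Enumerate available GPUs and print a few properties of each.
    int main(void) {
        int count = 0;
        cudaGetDeviceCount(&count);

        for (int i = 0; i < count; i++) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printf("Device %d: %s, %d multiprocessors, %zu bytes shared mem/block\n",
                   i, prop.name, prop.multiProcessorCount, prop.sharedMemPerBlock);
        }

        cudaSetDevice(0);   // select device 0 as the current device
        return 0;
    }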
CUDA Review
What have we learned?
- Write and launch CUDA C/C++ kernels: __global__, <<<>>>, blockIdx, threadIdx, blockDim
- Manage GPU memory: cudaMalloc(), cudaMemcpy(), cudaFree()
- Manage communication and synchronization: __shared__, __syncthreads(), cudaMemcpy() vs cudaMemcpyAsync(), cudaDeviceSynchronize()
CUDA Review topics
Launching parallel threads:
- Launch N blocks with M threads per block with kernel<<<N,M>>>(…);
- Use blockIdx.x to access the block index within the grid
- Use threadIdx.x to access the thread index within the block
- Allocate elements to threads: int index = threadIdx.x + blockIdx.x * blockDim.x;
CUDA Review topics
- Use __shared__ to declare a variable/array in shared memory
  - Data is shared between threads in a block
  - Not visible to threads in other blocks
- Use __syncthreads() as a barrier to prevent data hazards
Topics we skipped
We skipped some details; you can learn more from:
- The CUDA Programming Guide
- CUDA Zone – tools, training, webinars, and more: http://developer.nvidia.com/cuda
Questions?