
1 Today's lecture
– 2-dimensional indexing
– Color format
– Thread synchronization within for-loops
– Shared memory tiling
– Review example programs
– Using printf in CUDA code

2 2-dimensional indexing
Threads and blocks have built-in IDs that distinguish threads from one another, direct their control flow, and determine how they access data. CUDA supports up to 3-dimensional built-in index variables for blocks and threads:
– (blockIdx.x, blockIdx.y, blockIdx.z)
– (threadIdx.x, threadIdx.y, threadIdx.z)
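For example, a kernel can combine the block and thread indices to form a global 2D pixel coordinate. A minimal sketch follows; the kernel name, output buffer, and image dimensions are illustrative assumptions, not from the slides.

__global__ void index2D(unsigned char *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // global column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // global row

    if (x < width && y < height)    // guard against partial blocks at the edges
        out[y * width + x] = 255;   // row-major 1D offset into the image
}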

3 Launching kernels with 2- or 3-dimensional indexing
Launching a kernel with 2- or 3-dimensional indexing for threads, blocks, or both requires the dim3 data type. dim3 is an integer vector type based on uint3; any component left unspecified is initialized to 1.
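A sketch of a 2D launch configuration, reusing the index2D kernel from the previous sketch. The block size here is an arbitrary example, and out, width, and height are assumed to be set up by the host:

dim3 threadsPerBlock(16, 16);  // 16 x 16 = 256 threads per block; .z defaults to 1
dim3 numBlocks((width  + threadsPerBlock.x - 1) / threadsPerBlock.x,   // ceiling division so the
               (height + threadsPerBlock.y - 1) / threadsPerBlock.y);  // grid covers the whole image
index2D<<<numBlocks, threadsPerBlock>>>(out, width, height);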

4 Color Format - ARGB
8 bits for Alpha, followed by 8 bits for Red, followed by 8 bits for Green, followed by 8 bits for Blue: 32 bits per pixel in total.
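With that layout, a pixel stored in a 32-bit word can be unpacked with shifts and masks. A minimal sketch; the function name is illustrative:

__host__ __device__ void unpackARGB(unsigned int pixel,
                                    unsigned char &a, unsigned char &r,
                                    unsigned char &g, unsigned char &b)
{
    a = (pixel >> 24) & 0xFF;  // bits 31..24: Alpha
    r = (pixel >> 16) & 0xFF;  // bits 23..16: Red
    g = (pixel >>  8) & 0xFF;  // bits 15..8:  Green
    b =  pixel        & 0xFF;  // bits 7..0:   Blue
}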

5 Synchronization
A __syncthreads() call must be reached by all threads in a block. If only some threads in a block reach a __syncthreads() statement and others do not, the behavior is undefined. __syncthreads() calls within for-loops in a kernel therefore require that no thread leave the loop earlier than the others.
– for (int a = threadIdx.x; a < length; a += blockDim.x) should be avoided when __syncthreads() is called within the loop body, because threads with different threadIdx.x values can exit after different numbers of iterations. Instead use for (int a = 0; a < length; a += blockDim.x), so every thread executes the same number of iterations, and place conditionals around the inner code, as shown on the next slide.

6 How to use __syncthreads() within a for-loop iterating over an array
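A sketch of the pattern slide 5 describes (the kernel name and the per-element work are illustrative): every thread runs the same number of iterations, so all threads reach every __syncthreads(), and threads whose index falls past the end of the array simply skip the guarded work.

__global__ void processArray(float *data, int length)
{
    for (int a = 0; a < length; a += blockDim.x)   // same trip count for all threads
    {
        int i = a + threadIdx.x;
        if (i < length)        // conditional around the inner code
            data[i] *= 2.0f;   // illustrative per-element work
        __syncthreads();       // reached by ALL threads on every iteration
    }
}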

7 Shared Memory Tiling with Borders
Common approaches to using shared memory in image processing:
1) Global memory input -> load into a shared memory tile for fast access to the input data -> threads compute the output -> store directly to global memory output.
2) Global memory input -> load into a shared memory tile for fast access to the input data -> threads compute into a shared memory output tile -> __syncthreads(), then perform further rounds of computation on the shared memory output tile -> store the results to global memory.
Usually a shared memory tile is used simply to speed up reads of input data that is read multiple times by threads accessing the same region.
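A minimal sketch of approach 1, with a trivial copy standing in for the real computation (TILE_DIM, the kernel name, and the pixel type are assumptions):

#define TILE_DIM 16

__global__ void copyWithTile(const unsigned char *in, unsigned char *out,
                             int width, int height)
{
    __shared__ unsigned char tile[TILE_DIM][TILE_DIM];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;

    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // stage input in shared memory
    __syncthreads();   // the whole tile is loaded before any thread reads it

    if (x < width && y < height)
        out[y * width + x] = tile[threadIdx.y][threadIdx.x];  // "compute" and store to global output
}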

8 Loading to shared memory when the tile is larger than the number of threads
The entire grid in the figure represents the shared memory tile. The red squares represent threads mapped to their assigned pixels. The blue squares represent the border pixels; they are needed because image processing algorithms such as edge detection require each thread to read pixels adjacent to the pixel it is mapped to.

9 Four iterations to load the tile
– Iteration 1: all threads load, starting from the top corner.
– Iteration 2: threads with threadIdx.x equal to 0 or 1 and threadIdx.y equal to any value from 0 to 7 load from global memory into their respective red squares in the shared memory tile.
– Iteration 3: threads with threadIdx.x equal to any value from 0 to 7 and threadIdx.y equal to 0 or 1 load from global memory into their respective red squares in the shared memory tile.
– Iteration 4: threads with threadIdx.x equal to 0 or 1 and threadIdx.y equal to 0 or 1 load from global memory into their respective red squares in the shared memory tile.
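This iteration pattern corresponds to each thread striding across a tile that is two pixels wider and taller than the thread block. A hedged sketch, assuming an 8x8 thread block, a 1-pixel border on each side (so a 10x10 tile), and clamp-to-edge handling at the image boundary; the names and the clamping policy are assumptions:

#define BLOCK  8
#define BORDER 1
#define TILE   (BLOCK + 2 * BORDER)   // 10

__global__ void loadTileWithBorder(const unsigned char *in, int width, int height)
{
    __shared__ unsigned char tile[TILE][TILE];

    // Top-left corner of the tile in image coordinates (may fall off the image).
    int originX = blockIdx.x * BLOCK - BORDER;
    int originY = blockIdx.y * BLOCK - BORDER;

    // Each thread strides across the tile; with TILE = 10 and BLOCK = 8
    // this takes exactly the four iterations described above.
    for (int ty = threadIdx.y; ty < TILE; ty += BLOCK)
        for (int tx = threadIdx.x; tx < TILE; tx += BLOCK)
        {
            int gx = min(max(originX + tx, 0), width  - 1);  // clamp to the image edges
            int gy = min(max(originY + ty, 0), height - 1);
            tile[ty][tx] = in[gy * width + gx];
        }

    __syncthreads();   // tile, including the border, is ready; computation would follow
}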

10 Printf in CUDA code
We'll discuss this in class.
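In the meantime, a minimal sketch of device-side printf (supported on devices of compute capability 2.0 and later; output is buffered on the device and flushed at synchronization points such as cudaDeviceSynchronize()):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void hello()
{
    // Each thread prints its own block and thread index.
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main()
{
    hello<<<2, 4>>>();         // 2 blocks of 4 threads each
    cudaDeviceSynchronize();   // wait for the kernel and flush the printf buffer
    return 0;
}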

