1 CS 6068 Parallel Computing, Fall 2015
Lecture 3 – Sept 14: Data Parallelism, CUDA Programming, Parallel Communication Patterns
Prof. Fred Annexstein @proffreda
Office Hours: 12:30-1:30 pm MW or by appointment

2 Parallel CUDA Applications
• Deep Learning: the ImageNet Large Scale Visual Recognition Challenge
• High-Performance MATLAB
• Video-creation services, 3D visualizations, game streaming
• Epidemic forecasting, climate modeling, cosmology

3 Examples of Physical Reality Behind GPU Processing

4 CUDA – CPU/GPU Integrated Host+Device C Program
Serial or modestly parallel parts run as host C code; highly parallel parts run as device SPMD kernel C code. Execution alternates between the two:
Serial Code (host)
Parallel Kernel (device): KernelA<<< nBlk, nTid >>>(args);   executed by thread grid 1
Serial Code (host)
Parallel Kernel (device): KernelB<<< nBlk, nTid >>>(args);   executed by thread grid 2

5 Arrays of Threads for Executing Kernel-level Parallelism
A CUDA kernel is executed by an array of threads.
• All threads run the same code (SPMD).
• Each thread has an ID that it uses to compute memory addresses and make control decisions:
float x = input[threadID];
float y = func(x);
output[threadID] = y;
(Figure: an array of threads labeled by threadID, each running this code on its own element.)

6 Thread Blocks: Scalable Cooperation
Divide the monolithic thread array into multiple blocks.
• Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization.
• Threads in different blocks cannot cooperate.
(Figure: Thread Block 0, Thread Block 1, ..., Thread Block N - 1, each an array of threads labeled by threadID and each running the same body: float x = input[threadID]; float y = func(x); output[threadID] = y;)
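A minimal sketch (not from the slides) of block-level cooperation: each thread stages one element into shared memory, a barrier guarantees the whole tile is loaded, and then each thread reads another thread's element. The kernel name and the block size of 128 are assumptions.

__global__ void reverse_each_block(int* data)
{
    __shared__ int tile[128];                         // one slot per thread in the block
    unsigned int local  = threadIdx.x;
    unsigned int global = blockIdx.x * blockDim.x + threadIdx.x;
    tile[local] = data[global];                       // stage this thread's element
    __syncthreads();                                  // barrier: the whole tile is now loaded
    data[global] = tile[blockDim.x - 1 - local];      // read a different thread's element
}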

7 Multi-dimensional Block IDs and Thread IDs
Each thread uses its IDs to decide what data to work on:
• Block ID: 1D or 2D
• Thread ID: 1D, 2D, or 3D
This simplifies memory addressing when processing multidimensional data, such as image processing or solving PDEs on volumes.
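As an illustration (not from the slides), a kernel over a 2D image can turn 2D block and thread IDs into row and column indices; img, width, and height are hypothetical names.

__global__ void brighten(unsigned char* img, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // x dimension indexes columns
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // y dimension indexes rows
    if (col < width && row < height)                   // guard the image border
        img[row * width + col] = min(255, img[row * width + col] + 10);
}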

8 CUDA Memory Model Overview
Global memory:
• Main means of communicating R/W data between host and device
• Contents visible to all threads
• Long-latency access, high bandwidth
Focus on global memory for now; constant and texture memory are alternative off-chip memories that will be discussed later.
(Figure: the host connected to device global memory; the grid holds Block (0, 0) and Block (1, 0), each with its own shared memory and per-thread registers for Thread (0, 0) and Thread (1, 0).)

9 CUDA Device Memory Allocation
cudaMalloc()
• Allocates an object in device global memory
• Requires two parameters: the address of a pointer to the allocated object, and the size of the allocated object
cudaFree()
• Frees an object from device global memory, given a pointer to the freed object
(Figure: the same host/grid memory diagram as the previous slide.)

10 CUDA Device Memory Allocation (cont.)
Code example: allocate a 64 * 64 single-precision float array and attach the allocated storage to Md. The suffix "d" is often used to indicate a device (GPU) data structure, as opposed to a host (CPU) data structure.

int TILE_WIDTH = 64;
float* Md;
int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);
cudaMalloc((void**)&Md, size);   // cast &Md to a generic void** pointer
                                 // the return value reports errors from the API call
cudaFree(Md);
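The error-reporting comment above can be made concrete with a small illustrative check, reusing Md and size from the example (the message text is just a placeholder; printf requires <cstdio>):

cudaError_t err = cudaMalloc((void**)&Md, size);
if (err != cudaSuccess)
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));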

11 CUDA Host-Device Data Transfer
cudaMemcpy(): memory data transfer. Requires four parameters:
• Pointer to destination
• Pointer to source
• Number of bytes copied
• Type of transfer: Host to Host, Host to Device, Device to Host, or Device to Device
Asynchronous transfers are now possible as well (see the sketch below).
(Figure: the same host/grid memory diagram as before.)
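A minimal sketch of an asynchronous transfer (not from the slides; d_dst, h_src, and numbytes are hypothetical names, and the host buffer should be page-locked, e.g. allocated with cudaMallocHost, for the copy to actually overlap with other work):

cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(d_dst, h_src, numbytes, cudaMemcpyHostToDevice, stream);  // returns to the host immediately
// ... kernels launched in the same stream run after the copy completes ...
cudaStreamSynchronize(stream);   // block the host until all work in this stream is done
cudaStreamDestroy(stream);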

12 CUDA Host-Device Data Transfer (cont.)
Code example: transfer a 64 * 64 single-precision float array. M is in host memory and Md is in device memory; cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants.

cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);

13 CUDA Code Basic Design Pattern
In a standard CUDA application, several steps typically occur:
0) Prepare parameters in host memory
1) Allocate memory on the device (cudaMalloc())
2) Copy data from host to device (cudaMemcpy())
3) Perform some calculations (invoke a kernel function)
4) Copy data from device to host (cudaMemcpy())
5) Free the allocated device memory (cudaFree())
6) Rinse and repeat

14 CUDA Code for the Basic Pattern
int main()
{
    const unsigned int X = ;
    const unsigned int numbytes = X * sizeof(int);
    int *hostArray = (int*)malloc(numbytes);   // Allocate 1 Megabyte
    int *deviceArray;
    cudaMalloc((void**)&deviceArray, numbytes);
    memset(hostArray, 0, numbytes);
    cudaMemcpy(deviceArray, hostArray, numbytes, cudaMemcpyHostToDevice);

    // Typically a kernel call appears here
    int blockSize = 128;
    int gridSize = X / blockSize;
    kernel<<<gridSize, blockSize>>>(deviceArray);

    cudaMemcpy(hostArray, deviceArray, numbytes, cudaMemcpyDeviceToHost);
    cudaFree(deviceArray);
}

15 Example CUDA Kernel Code
__global__ void square(int* nums)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    nums[x] = nums[x] * nums[x];
}
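This kernel assumes the array length is an exact multiple of the launch size. A hedged variant (n is a hypothetical parameter) guards against threads that fall past the end of the array:

__global__ void square_guarded(int* nums, unsigned int n)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x < n)                         // extra threads in the last block do nothing
        nums[x] = nums[x] * nums[x];
}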

16
#include <iostream>
using namespace std;

__global__ void square(int* nums)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    nums[x] = nums[x] * nums[x];
}

int main()
{
    const unsigned int X = ;
    const unsigned int numbytes = X * sizeof(int);
    int *hostArray = (int*)malloc(numbytes);   // Allocate 1 Megabyte
    for (unsigned i = 0; i < X; i++) hostArray[i] = i;
    int *deviceArray;
    cudaMalloc((void**)&deviceArray, numbytes);
    cudaMemcpy(deviceArray, hostArray, numbytes, cudaMemcpyHostToDevice);

    int blockSize = 128;
    int gridSize = X / blockSize;
    square<<<gridSize, blockSize>>>(deviceArray);

    cudaMemcpy(hostArray, deviceArray, numbytes, cudaMemcpyDeviceToHost);
    for (unsigned i = 0; i < 10; i++) cout << hostArray[i] << endl;
    cudaFree(deviceArray);
}

$ nvcc main.cu

17 Part 2: Parallel Programming Theory
• Data Parallel Programming Model
• Common Communication Patterns: Map, Scatter, Gather, Stencil
• Work and Step Complexity Analysis
• Basic Parallel Operations and Algorithms: Reduce, Scan, Histogram
• Algorithmic Language Descriptions: PRAM and Recursion

18 Data Parallel Programming Model
GPU/CUDA model: arrays of lightweight parallel threads are deployed by invoking a kernel function.
• Kernel calls specify the control and size of thread blocks: blockDim and gridDim.
• Indexing of threads: threadIdx and blockIdx.
• Memory management issues complicate the model: global memory, shared memory, device and host arrays.

19 The Map Operation
A basic function on arrays that takes another function as an argument: map applies its function argument to each array element. It is built natively into Python and other functional PLs:
>>> map(add1, [1, 2, 3])
[2, 3, 4]
In CUDA, a map is code that assigns each thread in an array the same task on its own element of the data (see the sketch below).
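A minimal CUDA sketch of the add1 map (not from the slides; in, out, and n are hypothetical names):

__global__ void add1_map(const int* in, int* out, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] + 1;            // each thread applies the function to one element
}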

20 Scatter and Gather Communication Operations
Scatter is a data communication operation that moves or permutes data based on an array of location addresses.
Gather is an operation in which each thread performs a computation using data read from multiple locations.
(CUDA sketches of both patterns follow below.)
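Hedged CUDA sketches of the two patterns (not from the slides; the index array idx and the length n are hypothetical, and the scatter assumes idx is a permutation so no two threads write to the same slot):

__global__ void scatter(const int* in, int* out, const int* idx, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[idx[i]] = in[i];           // each thread writes its element to a computed location
}

__global__ void gather(const int* in, int* out, const int* idx, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[idx[i]];           // each thread reads its input from a computed location
}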

21 Stencil Communication Operation
A fixed-size stencil window is centered on every array element, and each output element is computed from the inputs covered by the window.
Applications often apply a small 2D stencil to a large 2D image array.
This is a common class of applications related to convolution filtering (see the 1D sketch below).
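A 1D sketch of a 3-point averaging stencil (not from the slides; n is hypothetical and boundary elements are simply skipped); the 2D image case is analogous:

__global__ void stencil3(const float* in, float* out, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= 1 && i < n - 1)                           // interior elements only
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}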

22 Work and Step Complexity
Work complexity is defined as the total number of operations executed by a computation. We can assume only one processor, since adding processors cannot lower the work.
Step (or depth) complexity is defined as the longest chain of sequential dependencies in the computation. We can assume an infinite number of parallel processors, since using fewer can only increase the number of steps.

23 Work and Step Complexity Example
Consider, for example, summing 16 numbers using a balanced binary tree. The work required by this computation is 15 operations (the 15 additions). With 1 step we reduce the problem to 8 partial sums, with 2 steps to 4, with 3 steps to 2, and with 4 steps to 1. So the step complexity is 4, since the longest chain of dependencies is the depth of the summation tree. In general, summing n numbers using a balanced tree pattern requires n - 1 work and log n steps.

24 Example: Quicksort
Quicksort is an algorithm that is not hard to parallelize. Recall the method used by quicksort:
• The two recursive calls can be executed in parallel.
• All the pivot comparisons can be done in parallel. Why?

25 A Recursive Language Description of Quicksort
def qsort(list):
    if list == []:
        return []
    else:
        pivot = list[0]
        lesser = [x for x in list[1:] if x < pivot]
        greater = [x for x in list[1:] if x >= pivot]
        return qsort(lesser) + [pivot] + qsort(greater)

26 Complexity Analysis of Quicksort
Informally, let us consider the work and step complexity of quicksort.
Work complexity =
Step complexity =

27 The Reduce Operation
sum_reduce(x) is a function that takes a list and returns the sum of its elements. It can be described as a simple recursive function, related to the binary-tree reduction seen above. Write a recursive (Python) function for summing:

def sum_reduce(x):
    if len(x) == 1:
        return x[0]
    else:
        mid = len(x) // 2
        sumleft = sum_reduce(x[:mid])
        sumright = sum_reduce(x[mid:])
        return sumleft + sumright

28 Work and Step Complexity of Reduce using Recurrence Relations
W(n) = 2 * W(n/2) + 1;  W(1) = 0
S(n) = S(n/2) + 1;  S(1) = 0
Solving these (for n a power of two) gives W(n) = n - 1 and S(n) = log2 n, matching the binary-tree summation example.

29 Parallel Reduction Kernel in CUDA
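A minimal sketch of a shared-memory tree reduction of the kind this slide refers to (the names and the power-of-two block size are assumptions; each block produces one partial sum):

__global__ void sum_reduce_kernel(const int* in, int* blockSums)
{
    extern __shared__ int sdata[];                     // sized at launch: blockDim.x ints
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = in[i];
    __syncthreads();

    // Halve the number of active threads each step: log2(blockDim.x) steps in total.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        blockSums[blockIdx.x] = sdata[0];              // one partial sum per block
}

A launch such as sum_reduce_kernel<<<gridSize, blockSize, blockSize * sizeof(int)>>>(deviceIn, devicePartials) leaves gridSize partial sums, which can be reduced again with a second launch or summed on the host.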


31 Udacity cs344 https://www.udacity.com/course/cs344

32 Homework #1

