CS 6068 Parallel Computing, Fall 2015
Lecture 3 – Sept 14: Data Parallelism, CUDA Programming, Parallel Communication Patterns
Prof. Fred Annexstein @proffreda
fred.annexstein@uc.edu
Office Hours: 12:30-1:30 pm MW or by appointment

Parallel CUDA Applications
http://devblogs.nvidia.com/parallelforall/
- Deep Learning: ImageNet Large Scale Visual Recognition Challenge
- High-Performance MATLAB
- Video-creation services, 3D visualizations, game streaming
- Epidemic forecasting, climate modeling, cosmology

Examples of Physical Reality Behind GPU Processing

CUDA – CPU/GPU: an integrated host+device application in one C program
Serial or modestly parallel parts run in host C code.
Highly parallel parts run in device SPMD kernel C code.
(Figure: execution alternates between the two. Serial code (host) ... parallel kernel (device): KernelA<<< nBlk, nTid >>>(args) launches thread grid 1; serial code (host) ... parallel kernel (device): KernelB<<< nBlk, nTid >>>(args) launches thread grid 2.)

Arrays of Threads for Executing Kernel-level Parallelism
A CUDA kernel is executed by an array of threads.
All threads run the same code (SPMD).
Each thread has an ID that it uses to compute memory addresses and make control decisions:
float x = input[threadID];
float y = func(x);
output[threadID] = y;
(Figure: an array of threads with threadID 0..7, each executing the code above on its own element.)

Thread Blocks: Scalable Cooperation
Divide the monolithic thread array into multiple blocks.
Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization.
Threads in different blocks cannot cooperate.
(Figure: Thread Block 0, Thread Block 1, ..., Thread Block N-1; each block holds threads with threadID 0..7, all running the same code: float x = input[threadID]; float y = func(x); output[threadID] = y;)

Multi-dimensional Block IDs and Thread IDs
Each thread uses its IDs to decide what data to work on:
Block ID: 1D or 2D
Thread ID: 1D, 2D, or 3D
This simplifies memory addressing when processing multidimensional data, e.g. image processing and solving PDEs on volumes.
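As an illustrative sketch (the kernel name and image layout are assumptions, not from the slides), a kernel over a 2-D image can compute its global coordinates from the block and thread IDs:

__global__ void brighten(unsigned char* img, int width, int height, int delta)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // global column index
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // global row index
    if (col < width && row < height) {                 // guard against partial blocks
        int idx = row * width + col;                   // flatten 2-D coords to a 1-D offset
        img[idx] = min(img[idx] + delta, 255);         // example per-pixel operation
    }
}

A matching launch would use 2-D blocks and a 2-D grid, for example: dim3 block(16, 16); dim3 grid((width + 15) / 16, (height + 15) / 16); brighten<<<grid, block>>>(d_img, width, height, 10);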

CUDA Memory Model Overview
Global memory:
- the main means of communicating R/W data between host and device
- contents visible to all threads
- long-latency access, high bandwidth
Focus on global memory for now; constant and texture memory are alternative off-chip memories that will be discussed later.
(Figure: a grid of blocks, each with its own shared memory and per-thread registers; the host and all blocks access global memory.)

CUDA Device Memory Allocation
cudaMalloc() allocates an object in device global memory. It requires two parameters:
- the address of a pointer to the allocated object
- the size of the allocated object in bytes
cudaFree() frees an object from device global memory, given a pointer to the object.

CUDA Device Memory Allocation (cont.)
Code example: allocate a 64 * 64 single-precision float array and attach the allocated storage to Md. ("d" is often used to mark a device (GPU) data structure, as opposed to a host (CPU) data structure.)

int TILE_WIDTH = 64;
float* Md;
int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);
cudaMalloc((void**)&Md, size);  // cast &Md to a generic void** pointer;
                                // the return value reports errors from the API call
...
cudaFree(Md);

CUDA Host-Device Data Transfer
cudaMemcpy() performs memory data transfer. It requires four parameters:
- pointer to destination
- pointer to source
- number of bytes copied
- type of transfer: host to host, host to device, device to host, or device to device
Asynchronous transfers are now possible as well.

CUDA Host-Device Data Transfer (cont.)
Code example: transfer a 64 * 64 single-precision float array. M is in host memory and Md is in device memory; cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants.

cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);
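Not shown on the original slide, but worth noting: every CUDA runtime call returns a cudaError_t that can be checked, for example:

cudaError_t err = cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
if (err != cudaSuccess)
    fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));  // requires <cstdio>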

CUDA Code Basic Design Pattern
In a standard CUDA application, several steps typically occur:
0) Prepare parameters in host memory
1) Allocate memory on the device (cudaMalloc())
2) Copy data from host to device (cudaMemcpy())
3) Perform some calculations (invoke kernel function)
4) Copy data from device to host (cudaMemcpy())
5) Free the allocated device memory (cudaFree())
6) Rinse and repeat

CUDA Code for Basic Pattern

int main() {
    const unsigned int X = 1048576;
    const unsigned int numbytes = X * sizeof(int);
    int *hostArray = (int*)malloc(numbytes);    // allocate 2^20 ints on the host
    int *deviceArray;
    cudaMalloc((void**)&deviceArray, numbytes);
    memset(hostArray, 0, numbytes);
    cudaMemcpy(deviceArray, hostArray, numbytes, cudaMemcpyHostToDevice);
    // Typically the kernel call appears here
    int blockSize = 128;
    int gridSize = X / blockSize;
    kernel<<<gridSize, blockSize>>>(deviceArray);
    cudaMemcpy(hostArray, deviceArray, numbytes, cudaMemcpyDeviceToHost);
    cudaFree(deviceArray);
}

Example CUDA Kernel Code

__global__ void square(int* nums) {
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    nums[x] = nums[x] * nums[x];
}

#include <iostream>
using namespace std;

__global__ void square(int* nums) {
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    nums[x] = nums[x] * nums[x];
}

int main() {
    const unsigned int X = 1048576;
    const unsigned int numbytes = X * sizeof(int);
    int *hostArray = (int*)malloc(numbytes);    // allocate 2^20 ints on the host
    for (unsigned i = 0; i < X; i++) hostArray[i] = i;
    int *deviceArray;
    cudaMalloc((void**)&deviceArray, numbytes);
    cudaMemcpy(deviceArray, hostArray, numbytes, cudaMemcpyHostToDevice);
    int blockSize = 128;
    int gridSize = X / blockSize;
    square<<<gridSize, blockSize>>>(deviceArray);
    cudaMemcpy(hostArray, deviceArray, numbytes, cudaMemcpyDeviceToHost);
    for (unsigned i = 0; i < 10; i++) cout << hostArray[i] << endl;
    cudaFree(deviceArray);
}

Compile with: $ nvcc main.cu

Part 2: Parallel Programming Theory
• Data Parallel Programming Model
• Common Communication Patterns: Map, Scatter, Gather, Stencil
• Work and Step Complexity Analysis
• Basic Parallel Operations and Algorithms: Reduce, Scan, Histogram
• Algorithmic Language Descriptions: PRAM and Recursion

Data Parallel Programming Model
GPU/CUDA model: arrays of lightweight parallel threads deployed by invoking a kernel function.
Kernel calls specify the number and size of thread blocks (gridDim and blockDim) and the indexing of threads (blockIdx and threadIdx).
Memory management issues complicate the model: global memory, shared memory, device and host arrays.

The Map Operation
A basic function on arrays that takes another function as an argument: map applies the function argument as an operation to each array element.
Built natively into Python and other functional PLs:
>>> map(add1, [1, 2, 3])
[2, 3, 4]
In CUDA, a map is a kernel in which each thread in the array performs the same task on its own piece of data.
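The square kernel shown earlier is exactly a map. As a minimal sketch (the kernel name is illustrative, not from the slides), a CUDA map applying the add1 operation to a float array:

__global__ void map_add1(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // each thread owns one element
    if (i < n)
        out[i] = in[i] + 1.0f;                       // the function argument: add1
}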

Scatter and Gather Communication Operations
Scatter is a data communication operation that moves or permutes data based on an array of destination addresses.
Gather is an operation in which each thread does a computation using data read from multiple locations.
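A minimal sketch of both patterns (array names and the averaging example are assumptions, not from the slides):

// Scatter: each thread writes its element to a destination taken from an index array.
__global__ void scatter(const float* in, float* out, const int* dest, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[dest[i]] = in[i];                              // write goes to a computed location
}

// Gather: each thread reads from several locations and combines them.
__global__ void gather_avg(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;   // reads from multiple locations
}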

Stencil Communication Operation
A fixed-size stencil window is centered on every array element, and each output element is computed from the elements the window covers.
Applications often apply a small 2-D stencil to a large 2-D image array; this is a common class of applications related to convolution filtering.
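As a sketch of the pattern (a 3x3 averaging stencil; the kernel name and the choice to skip boundary pixels are assumptions, not from the slides):

__global__ void stencil3x3(const float* in, float* out, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col >= 1 && col < width - 1 && row >= 1 && row < height - 1) {
        float sum = 0.0f;
        for (int dy = -1; dy <= 1; dy++)              // the 3x3 window centered on (row, col)
            for (int dx = -1; dx <= 1; dx++)
                sum += in[(row + dy) * width + (col + dx)];
        out[row * width + col] = sum / 9.0f;
    }
}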

Work and Step Complexity
Work complexity is defined as the total number of operations executed by a computation. We can assume a single processor, since adding processors cannot lower the work.
Step (or depth) complexity is defined as the longest chain of sequential dependencies in the computation. We can assume an unlimited number of parallel processors, since using fewer can only increase the number of steps.

Work and Step Complexity Example
Consider, for example, summing 16 numbers using a balanced binary tree. The work required by this computation is 15 operations (the 15 additions). After 1 step the problem size is reduced to 8, after 2 steps to 4, after 3 steps to 2, and after 4 steps to 1. So the step complexity is 4, since the longest chain of dependencies is the depth of the summation tree. In general, summing n numbers using a balanced tree pattern requires n - 1 work and log n steps.

Example: Quicksort
Quicksort is an algorithm that is not hard to parallelize. Recall the method used by quicksort:
the two recursive calls can be executed in parallel, and all the pivot comparisons can be done in parallel. Why?

A Recursive Language Description of Quicksort

def qsort(list):
    if list == []:
        return []
    else:
        pivot = list[0]
        lesser = [x for x in list[1:] if x < pivot]
        greater = [x for x in list[1:] if x >= pivot]
        return qsort(lesser) + [pivot] + qsort(greater)

Complexity Analysis of Quicksort
Informally, let us consider the work and step complexity of quicksort.
Work complexity = ?
Step complexity = ?
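The slide leaves these blank for discussion; for reference, under the informal assumption above (both recursive calls run in parallel and all pivot comparisons at a level are done in one parallel step), the standard expected-case answers are work O(n log n), the same as sequential quicksort, and step complexity O(log n), the expected depth of the recursion. A full implementation that also builds the lesser and greater lists in parallel (via compaction) typically pays an extra logarithmic factor in depth.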

The Reduce Operation
sum_reduce(x) is a function that takes a list and returns the sum of its elements. It can be described as a simple recursive function related to the binary tree reduction seen above.
Write a recursive (Python) function for summing:

def sum_reduce(x):
    if len(x) == 1:
        return x[0]
    else:
        mid = len(x) // 2
        sumleft = sum_reduce(x[0:mid])
        sumright = sum_reduce(x[mid:])
        return sumleft + sumright

Work and Step Complexity of Reduce using Recurrence Relations
W(n) = 2 W(n/2) + 1,  W(1) = 0
S(n) = S(n/2) + 1,  S(1) = 0
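The slide states the recurrences without solving them; unrolling gives W(n) = n - 1 and S(n) = log2 n for n a power of 2, which matches the balanced-tree summation example: n - 1 additions performed in log n parallel steps.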

Parallel Reduction Kernel in CUDA
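The kernel itself appears only as a figure in the original slides; the following is a typical shared-memory tree-reduction sketch, not necessarily the exact kernel shown (the kernel and variable names are assumptions):

// Each block reduces blockDim.x elements and writes one partial sum;
// the partial sums can then be reduced again (or summed on the host).
__global__ void sum_reduce_kernel(const int* in, int* partial, int n)
{
    extern __shared__ int sdata[];                // shared memory sized at launch time
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? in[i] : 0;             // load one element per thread
    __syncthreads();

    // Tree reduction in shared memory: log2(blockDim.x) steps
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        partial[blockIdx.x] = sdata[0];           // thread 0 writes this block's sum
}

Launch sketch: sum_reduce_kernel<<<gridSize, blockSize, blockSize * sizeof(int)>>>(deviceArray, devicePartial, X);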

Udacity cs344 https://www.udacity.com/course/cs344 https://www.udacity.com/wiki/cs344 https://github.com/udacity/cs344

Homework #1