Parallel Computing 18: CUDA - I

Slides:

Advertisements

Similar presentations

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 1 ECE408 / CS483 Applied Parallel Programming.

Advertisements

GPU programming: CUDA Acknowledgement: the lecture materials are based on the materials in NVIDIA teaching center CUDA course materials, including materials.

Intermediate GPGPU Programming in CUDA

INF5063 – GPU & CUDA Håkon Kvale Stensland iAD-lab, Department for Informatics.

GPU History CUDA. Graphics Pipeline Elements 1. A scene description: vertices, triangles, colors, lighting 2.Transformations that map the scene to a camera.

1 ITCS 5/4145 Parallel computing, B. Wilkinson, April 11, CUDAMultiDimBlocks.ppt CUDA Grids, Blocks, and Threads These notes will introduce: One.

GPU programming: CUDA Acknowledgement: the lecture materials are based on the materials in NVIDIA teaching center CUDA course materials, including materials.

CS 179: GPU Computing Lecture 2: The Basics. Recap Can use GPU to solve highly parallelizable problems – Performance benefits vs. CPU Straightforward.

Basic CUDA Programming Shin-Kai Chen VLSI Signal Processing Laboratory Department of Electronics Engineering National Chiao.

CUDA Grids, Blocks, and Threads

© David Kirk/NVIDIA and Wen-mei W. Hwu, , SSL 2014, ECE408/CS483, University of Illinois, Urbana-Champaign 1 ECE408 / CS483 Applied Parallel Programming.

An Introduction to Programming with CUDA Paul Richmond

Programming Massively Parallel Processors Using CUDA & C++AMP Lecture 1 - Introduction Wen-mei Hwu, Izzat El Hajj CEA-EDF-Inria Summer School 2013.

More CUDA Examples. Different Levels of parallelism Thread parallelism – each thread is an independent thread of execution Data parallelism – across threads.

Introduction to CUDA Programming CUDA Programming Introduction Andreas Moshovos Winter 2009 Some slides/material from: UIUC course by Wen-Mei Hwu and David.

© David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30-July 2, 2008 Taiwan 2008 CUDA Course Programming Massively Parallel Processors: the CUDA experience.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 Programming Massively Parallel Processors CUDA.

Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS Spring 2012.

Introduction to CUDA 1 of 2 Patrick Cozzi University of Pennsylvania CIS Fall 2012.

First CUDA Program. #include "stdio.h" int main() { printf("Hello, world\n"); return 0; } #include __global__ void kernel (void) { } int main (void) {

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana-Champaign 1 Programming Massively Parallel Processors CUDA Threads.

© David Kirk/NVIDIA and Wen-mei W. Hwu Urbana, Illinois, August 10-14, VSCSE Summer School 2009 Many-core processors for Science and Engineering.

Basic CUDA Programming Computer Architecture 2015 (Prof. Chih-Wei Liu) Final Project – CUDA Tutorial TA Cheng-Yen Yang

CUDA All material not from online sources/textbook copyright © Travis Desell, 2012.

ME964 High Performance Computing for Engineering Applications CUDA Memory Model & CUDA API Sept. 16, 2008.

CIS 565 Fall 2011 Qing Sun

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana-Champaign 1 ECE498AL Lecture 3: A Simple Example, Tools, and.

Training Program on GPU Programming with CUDA 31 st July, 7 th Aug, 14 th Aug 2011 CUDA Teaching UoM.

Introduction to CUDA (1 of n*) Patrick Cozzi University of Pennsylvania CIS Spring 2011 * Where n is 2 or 3.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 1 Introduction to CUDA C (Part 2)

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 1 ECE408 / CS483 Applied Parallel Programming.

CUDA All material not from online sources/textbook copyright © Travis Desell, 2012.

Parallel Programming Basics  Things we need to consider:  Control  Synchronization  Communication  Parallel programming languages offer different.

Introduction to CUDA CAP 4730 Spring 2012 Tushar Athawale.

Lecture 8 : Manycore GPU Programming with CUDA Courtesy : SUNY-Stony Brook Prof. Chowdhury’s course note slides are used in this lecture note.

© David Kirk/NVIDIA and Wen-mei W. Hwu, CS/EE 217 GPU Architecture and Programming Lecture 2: Introduction to CUDA C.

Introduction to CUDA 1 of 2 Patrick Cozzi University of Pennsylvania CIS Fall 2014.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 1 ECE 8823A GPU Architectures Module 2: Introduction.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana-Champaign 1 CUDA Threads.

1 ITCS 5/4010 Parallel computing, B. Wilkinson, Jan 14, CUDAMultiDimBlocks.ppt CUDA Grids, Blocks, and Threads These notes will introduce: One dimensional.

1 The CUDA Programming Model © David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 1 ECE408 / CS483 Applied Parallel Programming.

GRAPHIC PROCESSING UNITS Razvan Bogdan Microprocessor Systems.

Summer School s-Science with Many-core CPU/GPU Processors Lecture 2 Introduction to CUDA © David Kirk/NVIDIA and Wen-mei W. Hwu Braga, Portugal, June 14-18,

Introduction to CUDA Programming CUDA Programming Introduction Andreas Moshovos Winter 2009 Some slides/material from: UIUC course by Wen-Mei Hwu and David.

CUDA C/C++ Basics Part 3 – Shared memory and synchronization

Computer Engg, IIT(BHU)

CUDA C/C++ Basics Part 2 - Blocks and Threads

GPU Programming with CUDA

Prof. Fred CS 6068 Parallel Computing Fall 2015 Lecture 3 – Sept 14 Data Parallelism Cuda Programming Parallel Communication Patterns.

Basic CUDA Programming

Some things are naturally parallel

Slides from “PMPP” book

Introduction to CUDA C Slide credit: Slides adapted from

CUDA Parallelism Model

© David Kirk/NVIDIA and Wen-mei W. Hwu,

CUDA Grids, Blocks, and Threads

Antonio R. Miele Marco D. Santambrogio Politecnico di Milano

© David Kirk/NVIDIA and Wen-mei W. Hwu,

ECE 8823A GPU Architectures Module 3: CUDA Execution Model -I

Antonio R. Miele Marco D. Santambrogio Politecnico di Milano

ECE 8823A GPU Architectures Module 2: Introduction to CUDA C

CUDA Grids, Blocks, and Threads

© David Kirk/NVIDIA and Wen-mei W. Hwu,

GPU Lab1 Discussion A MATRIX-MATRIX MULTIPLICATION EXAMPLE.

Chapter 4:Parallel Programming in CUDA C

6- General Purpose GPU Programming

Presentation transcript:

Parallel Computing 18: CUDA - I Mohamed Zahran NYU

Parallel Computing on a GPU GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications – Available in laptops, desktops, and clusters GPU parallelism is doubling every year Programming model scales transparently Data parallelism Programmable in C (and other languages) GeForce Titan X with CUDA tools (and Opencl too). Multithreaded SPMD model uses application data parallelism and thread parallelism.

[SPMD = Single Program Multiple Data] Source: NVIDIA CUDA C Programming

Source: Multicore and GPU Programming: An Integrated Approach by G Source: Multicore and GPU Programming: An Integrated Approach by G. Barlas

CUDA Compute Unified Device Architecture Integrated host+device app C program Serial or modestly parallel parts in host C code Highly parallel parts in device SPMD kernel C code

Serial Code (host) Parallel Kernel (device) KernelA<<< nBlk, nTid >>>(args); . . . Serial Code (host) Parallel Kernel (device) KernelB<<< nBlk, nTid >>>(args);

Parallel Threads A CUDA kernel is executed by an array of threads All threads run the same code (the SP in SPMD) Each thread has an ID that it uses to compute memory addresses and make control decisions

… i = blockIdx .x * blockDim + threadIdx ; C_d [ i = ] A_d B_d 1 2 254 1 2 254 255

Thread Blocks Divide monolithic thread array into multiple blocks Threads within a block cooperate via shared memory, atomic operations and barrier synchronization, … Threads in different blocks cannot cooperate Divide monolithic thread array into multiple blocks i = blockIdx.x * blockDim.x + threadIdx.x ; C_d [ i = ] A_d B_d … 1 2 254 255 … 1 2 254 255 i = blockIdx.x * blockDim.x + threadIdx.x ; C_d [ i ] = A_d B_d 8

of the grid in number of blocks Kernel Launched by the host Very similar to a C function To be executed on device All threads will execute that same code in the kernel. Grid 1D or 2D organization of a Grid 3D available in recent GPUs gridDim.x, gridDim.y, gridDim.z are the size Block Block is assigned to an SM blockDim.x, blockDim.y, blockDim.z 1D, 2D, or 3D organization of a block Thread are block dimensions counted as number of threads of the grid in number of blocks threadIdx.x, threadIdx.y, threadIdx.z

IDs Each thread uses IDs to decide what data to work on Host Kernel 1 2 Device Grid 1 Block (0 , 0 ) (1 1 ) Grid 2 Courtesy: NDVIA Block (1, 1) Thread ,1, 0) (2 (3 ,0, 1) Each thread uses IDs to decide what data to work on Block ID: 1D or 2Dor 3D –Thread ID: 1D, 2D, or 3D Simplifies memory addressing when processing multidimensional data –Image processing –Solving PDEs on volumes

A Simple Example: Vector Addition ] vector A vector B vector C 1 2 3 4 A[N - B[ … B[N C[ C[N + Compute vector sum C = A+B void vecAdd(float* A, float* B, float* C, int n)

A Simple Example: Vector Addition GPU friendly! { for ( i = , < n, ++) C[ ] = A [ + B ; } int main() { // Memory allocation for A_h, B_h, and C_h // I/O to read A_h and B_h, N elements … vecAdd(A_h, B_h, C_h, N); }

A Simple Example: Vector Addition #include <cuda.h> Part 1 void vecAdd(float* A, float* B, float* C, int n) { int size = n* sizeof(float); float* A_d, B_d, C_d; … // Allocate device memory for A, B, and C // copy A and B to device memory // Kernel launch code – to have the device Part 3 // to perform the actual vector addition // copy C from the device memory // Free device vectors } CPU Host Memory GPU Part 2 Device Memory

CUDA Memory Model Global memory Contents visible to all threads – Long latency access Shared memory: Per SM Shared by all threads in a block Global memory Main means of communicating R/W Data between host and device Grid Global Memory Block ( , ) Shared Memory Thread ( Registers 1 Host

CPU & GPU Memory In CUDA, host and devices have separate memory spaces. But in recent GPUs we have Unified Memory Access If GPU and CPU are on the same chip, then they share memory space à fusion

CUDA Device Memory Allocation cudaMalloc() Allocates object in the device Global Memory Requires two parameters Address of a pointer to the allocated object Size of of allocated object Grid Block (0,0) Block(1,,0) Block(0,0) Shared Memory Thread(0,0) Registers Thread(1,0) Block(1,0) host cudaFree() Host Frees object from device Global Memory Pointer to freed object

CUDA Device Memory Allocation Grid Global Memory Block ( , ) Shared Memory Thread ( Registers 1 Host Example: WIDTH = 64; float* Md; int size = WIDTH * WIDTH * sizeof(float); cudaMalloc((void**)&Md, size); cudaFree(Md);

CUDA Device Memory Allocation Grid Global Memory Block ( , ) Shared Memory Thread ( Registers 1 Host cudaMemcpy() memory data transfer Requires four parameters Pointer to destination Pointer to source Number of bytes copied Type of transfer Host to Host Host to Device – Device to Host Device to Device Asynchronous transfer Important!cudaMemcpy() canno be used tocopy between different GPUs in multi-GPUs system

CUDA Device Memory Allocation Example: Destination Source Size Pointer pointer in bytes Grid Global Memory Block ( , ) Shared Memory Thread ( Registers 1 Host Direction cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice); cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);

Note About Error Handling Almost all API calls return success or failure. The type of the outcome is: cudError_t Success à cudaSuccess Translate the error code to an error message: char * cudaGetErrorString (cudaError_t error)

A Simple Example: Vector Addition void vecAdd(float* A, float* B, float* C, int n) { int size = n * sizeof(float); float* A_d, * B_d, * C_d; // Transfer A and B to device memory cudaMalloc((void **) &A_d, size); cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice); cudaMalloc((void **) &B_d, size); cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice); // Allocate device memory for cudaMalloc((void **) &C_d, size); // Kernel invocation code – to be shown later How to launch a kernel? … // Transfer C from device to hostcudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost); // Free device memory for A, B, C cudaFree(A_d); cudaFree(B_d); cudaFree (C_d); }

int vecAdd(float* A, float* B, float* C, int n) { // A_d, B_d, C_d allocations and copies omitted // Run ceil(n/256) blocks of 256 threads each vecAddKernnel<<<ceil(n/256),256>>>(A_d, B_d, C_d, n); } #blocks #threads/blks

// Each thread performs one pair-wise addition __global__ void vecAddkernel(float* A_d, float* B_d, float* C_d, int n) { int i = threadIdx.x + blockDim.x * blockIdx.x; if(i<n) C_d[i] = A_d[i] + B_d[i]; }

Unique ID 1D grid of 1D blocks blockIdx.x *blockDim.x + threadIdx.x;

Unique ID 1D grid of 2D blocks blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x; blockDim.x A Block blockDim.y

Unique ID 1D grid of 3D blocks blockIdx.x * blockDim.x * blockDim.y * blockDim.z + threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim + threadIDx.x

Unique ID 2D grid of 1D blocks int blockId = blockIdx.y * gridDim.x + blockIdx.x; int threadId = blockId * blockDim.x + threadIdx.x;

Unique ID 2D grid of 2D blocks int blockId = blockIdx.x + blockIdx.y * gridDim.x; int threadId = blockId * (blockDim.x * blockDim.y) + (threadIdx.y * blockDim.x) + threadIdx.x;

Unique ID 2D grid of 3D blocks int blockId = blockIdx.x + blockIdx.y * gridDim.x; int threadId = blockId * (blockDim.x * blockDim.y * blockDim.z) + (threadIdx.z * (blockDim.x * blockDim.y)) + (threadIdx.y * blockDim.x) + threadIdx.x

Unique ID 3D grid of 1D blocks int blockId = blockIdx.x + blockIdx.y * gridDim.x+ gridDim.x * gridDim.y * blockIdx.z; int threadId = blockId * blockDim.x + threadIdx.x;

Unique ID 3D grid of 2D blocks int blockId = blockIdx.x+ blockIdx.y * gridDim.x + gridDim.x * gridDim.y * blockIdx.z; int threadId = blockId * (blockDim.x * blockDim.y) + (threadIdx.y * blockDim.x) + threadIdx.x;

Unique ID 3D grid of 3D blocks int blockId = blockIdx.x+ blockIdx.y * gridDim.x + gridDim.x * gridDim.y * blockIdx.z; int threadId = blockId * (blockDim.x * blockDim.y * blockDim.z) + (threadIdx.z * (blockDim.x * blockDim.y)) + (threadIdx.y * blockDim.x) + threadIdx.x;

Only callable from the: Kernels Executed on the: Only callable from the: __device__ float DeviceFunc() device __global__ void KernelFunc() host __host__ float HostFunc() __global__ defines a kernel function. Must return void __device__ and __host__ can be used together For functions executed on the device: No static variable declarations inside the function No indirect function calls through pointers

The Hello World of Parallel Programming: Matrix Multiplication 33 M N P WIDTH Data Parallelism: We can safely perform many arithmetic operations on the data structures in a simultaneous manner.

The Hello World of Parallel Programming: Matrix Multiplication C adopts raw-major placement approach when storing 2D matrix in linear memory address.

The Hello World of Parallel Programming: Matrix Multiplication A Simple main function: executed at the host

The Hello World of Parallel Programming: Matrix Multiplication 36 M N P WIDTH i k j // Matrix multiplication on the (CPU) host void MatrixMulOnHost(float* M, float* N, float* P, int Width) { for (int i = 0; i < Width; ++i) for (int j = 0; j < Width; ++j) { double sum = 0; for (int k = 0; k < Width; ++k) { double a = M[i * Width + k]; double b = sum += a * b; } P[j * Width + j] = sum;

The Hello World of Parallel Programming: Matrix Multiplication

The Hello World of Parallel Programming: Matrix Multiplication The Kernel Function 38 Nd Md Pd WIDTH ty tx k

More On Specifying Dimensions // Setup the execution configuration dim3 dimGrid(x, y, z); dim3 dimBlock(x, y, z); // Launch the device computation threads! MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd,Width); Important:dimGrid and dimBlock are user defined gridDim and blockDim are built-in predefined variable accessible in kernel functions

Be Sure To Know: Maximum dimensions of a block Maximum number of threads per block Maximum dimensions of a grid Maximum number of blocks per grid

Tools NVCC Compiler Host Code Device Code (PTX) Host C Compiler/ Linker Device Just-in-Time Compiler Heterogeneous Computing Platform with GPUs, CPUS

Conclusions Data parallelism is the main source of scalability for parallel programs Each CUDA source file can have a mixture of both host and device code. What we learned today about CUDA: KernelA<<< nBlk, nTid >>>(args) cudaMalloc() cudaFree() cudaMemcpy() gridDim and blockDim threadIdx and blockIdx dim3