ECE 8823A GPU Architectures Module 2: Introduction to CUDA C

ECE 8823A GPU Architectures Module 2: Introduction to CUDA C © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign

Objective: to understand the major elements of a CUDA program; to introduce the basic constructs of the programming model; and to illustrate the preceding with a simple but complete CUDA program.

Reading Assignment: Kirk and Hwu, “Programming Massively Parallel Processors: A Hands-on Approach,” Chapter 3; CUDA Programming Guide, http://docs.nvidia.com/cuda/cuda-c-programming-guide/#abstract

CUDA/OpenCL Execution Model: reflects a multicore processor + GPGPU execution model; adds device functions and declarations to stock C programs; compilation is separated into host and GPU paths.

Compiling a CUDA Program: integrated C programs with CUDA extensions are fed to the NVCC compiler, which splits them into host code, built by the host C compiler/linker, and device code emitted as PTX, a virtual ISA, which the device just-in-time compiler translates for the target GPU; the result runs on a heterogeneous computing platform with CPUs and GPUs. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign

CUDA/OpenCL Execution Model: an integrated host+device application is a single C program; serial or modestly parallel parts run as host C code, and highly parallel parts run as device SPMD kernel C code. Execution alternates between the two: Serial Code (host) ... Parallel Kernel (device): KernelA<<< nBlk, nTid >>>(args); ... Serial Code (host) ... Parallel Kernel (device): KernelB<<< nBlk, nTid >>>(args). This introduces the notion of grids of threads, of asynchronous vs. synchronous execution, and of lightweight threads in a kernel, distinguished from CPU threads. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign
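
A minimal sketch of this alternation in host code, using the placeholder kernels from the slide (the kernel bodies, helper function, and launch parameters here are illustrative, not from the deck):

    __global__ void KernelA(float *d) { /* highly parallel part A */ }
    __global__ void KernelB(float *d) { /* highly parallel part B */ }

    void run(float *d, int nBlk, int nTid)
    {
        // ... serial code (host) ...
        KernelA<<<nBlk, nTid>>>(d);   // asynchronous kernel launch
        cudaDeviceSynchronize();      // block only if the host needs A's results now
        // ... more serial code (host) ...
        KernelB<<<nBlk, nTid>>>(d);   // next parallel phase
    }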

Thread Hierarchy: all threads execute the same kernel code; the memory hierarchy is tied to the thread hierarchy (covered later). From http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-hierarchy

CUDA/OpenCL Programming Model: The Grid. Each kernel launched by a user application becomes a GPU task issued by the host through the software stack (operating system, CPU, GPU) and executes over an N-dimensional range: a grid of 3D thread blocks whose extent is given by gridDim.x, gridDim.y, and gridDim.z. Note: each thread executes the same kernel code!
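
A brief sketch of how such a multidimensional launch looks in host code (the kernel name and dimensions are illustrative):

    __global__ void gridShape() { /* each thread here sees gridDim.x=4, .y=2, .z=2 */ }

    // 4 x 2 x 2 = 16 blocks, each of 8 x 8 x 4 = 256 threads
    dim3 dimGrid(4, 2, 2);
    dim3 dimBlock(8, 8, 4);
    gridShape<<<dimGrid, dimBlock>>>();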

Vector Addition – Conceptual View: element i of vector A is added to element i of vector B to produce element i of vector C, i.e., C[0] = A[0] + B[0], C[1] = A[1] + B[1], ..., C[N-1] = A[N-1] + B[N-1]. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign

A Simple Thread Block Model: structure an array of threads into thread blocks. Within each block, threadIdx runs from 0 to blockDim-1; the blocks themselves are numbered blockIdx = 0 to gridDim-1. A unique id can be computed for each thread: id = blockIdx * blockDim + threadIdx; this id is used for workload partitioning, as in the kernel sketch below.
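
A minimal kernel sketch of this id computation (the kernel name, arguments, and operation are illustrative):

    __global__ void partitionedWork(float *data, int n)
    {
        // Unique global id: each thread handles one element of data
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < n)
            data[id] = 2.0f * data[id];   // this thread's share of the workload
    }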

Using Arrays of Parallel Threads: a CUDA kernel is executed by a grid (array) of threads. All threads in a grid run the same kernel code (SPMD), and each thread has an index (0, 1, 2, ..., 254, 255, ...) that it uses to compute memory addresses and make control decisions: i = blockIdx.x * blockDim.x + threadIdx.x; C_d[i] = A_d[i] + B_d[i]; Thread blocks can be multidimensional arrays of threads. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign

Thread Blocks: Scalable Cooperation. Divide the thread array into multiple blocks (Thread Block 0, Thread Block 1, ..., Thread Block N-1, each containing threads 0, 1, 2, ..., 254, 255). Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization; threads in different blocks cannot cooperate (only through global memory). Every block executes the same code: i = blockIdx.x * blockDim.x + threadIdx.x; C_d[i] = A_d[i] + B_d[i]; A sketch of within-block cooperation follows. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign
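
A small illustrative kernel (not from the deck) showing shared memory plus barrier synchronization inside one block; it reverses an array of n <= 256 elements when launched as reverseInBlock<<<1, n>>>(d_arr, n):

    __global__ void reverseInBlock(float *d, int n)
    {
        __shared__ float s[256];   // visible to all threads of this block
        int t = threadIdx.x;
        s[t] = d[t];               // each thread stages one element
        __syncthreads();           // barrier: all writes to s complete before any read
        d[t] = s[n - 1 - t];       // safely read an element another thread wrote
    }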

Vector Addition – Traditional C Code

    // Compute vector sum C = A+B
    void vecAdd(float* A, float* B, float* C, int n)
    {
        for (int i = 0; i < n; i++)
            C[i] = A[i] + B[i];
    }

    int main()
    {
        // Memory allocation for A_h, B_h, and C_h
        // I/O to read A_h and B_h, N elements
        ...
        vecAdd(A_h, B_h, C_h, N);
    }

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign

Heterogeneous Computing vecAdd Host Code: the host orchestrates three parts, moving data between host memory (CPU) and device memory (GPU).

    #include <cuda.h>
    void vecAdd(float* A, float* B, float* C, int n)
    {
        int size = n * sizeof(float);
        float *A_d, *B_d, *C_d;
        ...
        // Part 1: Allocate device memory for A, B, and C;
        //         copy A and B to device memory
        // Part 2: Kernel launch code – have the device
        //         perform the actual vector addition
        // Part 3: Copy C from the device memory;
        //         free device vectors
    }

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign

Partial Overview of CUDA Memories. Device code can: R/W per-thread registers; R/W per-grid global memory. Host code can: transfer data to/from per-grid global memory. In the device (grid), each thread of each block has its own registers, while all blocks share global memory, which the host can reach. Global, constant, and texture memory spaces are persistent across kernels called by the same application. We will cover more later. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign

CUDA Device Memory Management API functions. cudaMalloc(): allocates an object in device global memory; takes two parameters: the address of a pointer to the allocated object, and the size of the allocated object in bytes. cudaFree(): frees an object from device global memory; takes a pointer to the freed object. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign
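
A minimal usage sketch of these two calls with an error check (the array size and variable name are illustrative):

    float *A_d;
    size_t size = 256 * sizeof(float);                  // illustrative size
    if (cudaMalloc((void **) &A_d, size) != cudaSuccess) {
        // handle the allocation failure here
    }
    // ... use A_d from kernels ...
    cudaFree(A_d);                                      // release device global memory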

Host-Device Data Transfer API functions. cudaMemcpy(): memory data transfer; requires four parameters: a pointer to the destination, a pointer to the source, the number of bytes copied, and the type/direction of transfer. Transfer to the device is asynchronous. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign

    void vecAdd(float* A, float* B, float* C, int n)
    {
        int size = n * sizeof(float);
        float *A_d, *B_d, *C_d;

        // 1. Transfer A and B to device memory
        cudaMalloc((void **) &A_d, size);
        cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice);
        cudaMalloc((void **) &B_d, size);
        cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice);
        // Allocate device memory for C
        cudaMalloc((void **) &C_d, size);

        // 2. Kernel invocation code – to be shown later
        ...

        // 3. Transfer C from device to host
        cudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost);
        // Free device memory for A, B, C
        cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);
    }

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign

Example: Vector Addition Kernel (Device Code)

    // Compute vector sum C = A+B
    // Each thread performs one pair-wise addition
    __global__ void vecAddKernel(float* A_d, float* B_d, float* C_d, int n)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < n) C_d[i] = A_d[i] + B_d[i];
    }

    int vecAdd(float* A, float* B, float* C, int n)
    {
        // A_d, B_d, C_d allocations and copies omitted
        // Run ceil(n/256.0) blocks of 256 threads each
        vecAddKernel<<<ceil(n/256.0), 256>>>(A_d, B_d, C_d, n);
    }

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign

Example: Vector Addition Kernel (Host Code)

    // Compute vector sum C = A+B
    // Each thread performs one pair-wise addition
    __global__ void vecAddKernel(float* A_d, float* B_d, float* C_d, int n)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < n) C_d[i] = A_d[i] + B_d[i];
    }

    int vecAdd(float* A, float* B, float* C, int n)
    {
        // A_d, B_d, C_d allocations and copies omitted
        // Run ceil(n/256.0) blocks of 256 threads each
        vecAddKernel<<<ceil(n/256.0), 256>>>(A_d, B_d, C_d, n);
    }

The <<<...>>> clause in the host code is the execution configuration. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign

More on Kernel Launch (Host Code)

    int vecAdd(float* A, float* B, float* C, int n)
    {
        // A_d, B_d, C_d allocations and copies omitted
        // Run ceil(n/256) blocks of 256 threads each
        dim3 DimGrid(n/256, 1, 1);
        if (n%256) DimGrid.x++;    // edge effects: one extra block covers the remainder
        dim3 DimBlock(256, 1, 1);
        vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
    }

Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking: cudaDeviceSynchronize(); © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign
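
Because the launch returns immediately, errors surface later; a common checking pattern (a sketch using the standard CUDA runtime error APIs) is:

    vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);   // returns at once
    cudaError_t launchErr = cudaGetLastError();       // configuration/launch errors
    cudaError_t execErr = cudaDeviceSynchronize();    // blocks; reports execution errors
    // compare launchErr and execErr against cudaSuccess before using C_d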

Kernel Execution in a Nutshell

    __host__ void vecAdd()
    {
        dim3 DimGrid(ceil(n/256.0), 1, 1);
        dim3 DimBlock(256, 1, 1);
        vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
    }

    __global__ void vecAddKernel(float *A_d, float *B_d, float *C_d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) C_d[i] = A_d[i] + B_d[i];
    }

The kernel's blocks (Blk 0 ... Blk N-1) are scheduled onto the GPU's multiprocessors (M0 ... Mk, backed by RAM). Note that the description/launch of parallelism is independent of the number of SMs, giving scalable, portable execution; actual performance depends on the actual number of SMs, in a scalable manner. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign

More on CUDA Function Declarations

    Declaration                        Executed on the:   Only callable from the:
    __device__ float DeviceFunc()      device             device
    __global__ void KernelFunc()       device             host
    __host__ float HostFunc()          host               host

__global__ defines a kernel function; each “__” consists of two underscore characters. A kernel function must return void. __device__ and __host__ can be used together, as sketched below. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign
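
A small sketch (not from the deck) of __device__ and __host__ on one function, so a single definition is compiled for both sides:

    __host__ __device__ float square(float x)
    {
        return x * x;                     // usable on CPU and GPU alike
    }

    __global__ void squareAll(float *d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] = square(d[i]);   // device-side call
    }
    // square(3.0f) can equally be called from ordinary host code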

Compiling a CUDA Program (recap): integrated C programs with CUDA extensions go through the NVCC compiler, which produces host code for the host C compiler/linker and device code (PTX) for the device just-in-time compiler, together targeting a heterogeneous computing platform with CPUs and GPUs. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign

CPU and GPU Address Spaces: the CPU and GPU each hold their own copy of the data (X) in separate address spaces connected over PCIe. This requires explicit management in each address space, with programmer-initiated transfers between the address spaces.

Unified Memory: a single unified address space across CPU and GPU (still connected over PCIe). All transfers are managed under the hood, yielding smaller code segments; explicit management now becomes an optimization.

Using Unified Memory

    #include <iostream>
    #include <math.h>

    // Kernel function to add the elements of two arrays
    __global__ void add(int n, float *x, float *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = x[i] + y[i];
    }

    int main(void)
    {
        int N = 1<<20;
        float *x, *y;

        // Allocate Unified Memory – accessible from CPU or GPU
        cudaMallocManaged(&x, N*sizeof(float));
        cudaMallocManaged(&y, N*sizeof(float));

        // Initialize x and y arrays on the host
        for (int i = 0; i < N; i++) {
            x[i] = 1.0f;
            y[i] = 2.0f;
        }

        // Run kernel on 1M elements on the GPU
        add<<<1, 1>>>(N, x, y);

        // Wait for GPU to finish
        cudaDeviceSynchronize();

        // Free memory
        cudaFree(x);
        cudaFree(y);
        return 0;
    }

From https://devblogs.nvidia.com/parallelforall/even-easier-introduction-cuda/
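
The <<<1, 1>>> configuration above runs the loop in a single GPU thread. A common follow-up, sketched here under the same variable names, is a grid-stride loop so the same kernel puts the whole grid to work:

    __global__ void add(int n, float *x, float *y)
    {
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = blockDim.x * gridDim.x;       // total threads in the grid
        for (int i = index; i < n; i += stride)    // each thread strides across the array
            y[i] = x[i] + y[i];
    }

    // Launch with enough 256-thread blocks to cover N elements
    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;
    add<<<numBlocks, blockSize>>>(N, x, y);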

The ISA. An Instruction Set Architecture (ISA) is a contract between the hardware and the software: as the name suggests, it is the set of instructions that the architecture (hardware) can execute. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign

PTX ISA – Major Features: a set of predefined, read-only variables, e.g., %tid (threadIdx), %ntid (blockDim), %ctaid (blockIdx), %nctaid (gridDim), etc.; some of these are multidimensional, e.g., %tid.x, %tid.y, %tid.z. Also includes architecture variables, e.g., %laneid, %warpid, which can be used for auto-tuning of code, as well as undefined performance counters. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign

PTX ISA – Major Features (2): multiple address spaces, e.g., register, parameter, constant, global, etc. (more later); predicated instruction execution. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign

Versions. Compute capability: architecture-specific, e.g., Volta is 7.0. CUDA version: determines programming model features; currently 9.0. PTX version: virtual ISA version; currently 6.1. The compute capability of the installed device can be queried at run time, as sketched below.
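
A minimal run-time query sketch using the CUDA runtime API (device 0 assumed; error checking omitted):

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // fill prop with device 0's properties
    printf("Compute capability %d.%d\n", prop.major, prop.minor);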

Questions? © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign