© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign 1 Introduction to CUDA C (Part 2)

Slides:

Advertisements

Similar presentations

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 1 ECE408 / CS483 Applied Parallel Programming.

Advertisements

GPU programming: CUDA Acknowledgement: the lecture materials are based on the materials in NVIDIA teaching center CUDA course materials, including materials.

INF5063 – GPU & CUDA Håkon Kvale Stensland iAD-lab, Department for Informatics.

GPU programming: CUDA Acknowledgement: the lecture materials are based on the materials in NVIDIA teaching center CUDA course materials, including materials.

ECE 598HK Computational Thinking for Many-core Computing Lecture 2: Many-core GPU Performance Considerations © Wen-mei W. Hwu and David Kirk/NVIDIA,

© NVIDIA Corporation 2009 Mark Harris NVIDIA Corporation Tesla GPU Computing A Revolution in High Performance Computing.

Parallel Programming using CUDA. Traditional Computing Von Neumann architecture: instructions are sent from memory to the CPU Serial execution: Instructions.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign ECE 498AL Lecture 2: The CUDA Programming Model.

CUDA Grids, Blocks, and Threads

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, ECE 498AL, University of Illinois, Urbana-Champaign ECE408 / CS483 Applied Parallel Programming.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana-Champaign 1 ECE498AL Lecture 3: A Simple Example, Tools, and.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, ECE 498AL, University of Illinois, Urbana-Champaign ECE408 / CS483 Applied Parallel Programming.

CUDA C/C++ BASICS NVIDIA Corporation © NVIDIA 2013.

© David Kirk/NVIDIA and Wen-mei W. Hwu, , SSL 2014, ECE408/CS483, University of Illinois, Urbana-Champaign 1 ECE408 / CS483 Applied Parallel Programming.

GPU Programming EPCC The University of Edinburgh.

An Introduction to Programming with CUDA Paul Richmond

Programming Massively Parallel Processors Using CUDA & C++AMP Lecture 1 - Introduction Wen-mei Hwu, Izzat El Hajj CEA-EDF-Inria Summer School 2013.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 Programming Massively Parallel Processors Lecture.

© David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30-July 2, 2008 Taiwan 2008 CUDA Course Programming Massively Parallel Processors: the CUDA experience.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 Programming Massively Parallel Processors CUDA.

Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS Spring 2012.

Introduction to CUDA 1 of 2 Patrick Cozzi University of Pennsylvania CIS Fall 2012.

First CUDA Program. #include "stdio.h" int main() { printf("Hello, world\n"); return 0; } #include __global__ void kernel (void) { } int main (void) {

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 Programming Massively Parallel Processors Lecture.

CUDA All material not from online sources/textbook copyright © Travis Desell, 2012.

CIS 565 Fall 2011 Qing Sun

© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, ECE408 Applied Parallel Programming Lecture 12 Parallel.

1 ECE 498AL Lecture 2: The CUDA Programming Model © David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 Control Flow.

GPU Architecture and Programming

Lecture 6: Shared-memory Computing with GPU. Free download NVIDIA CUDA a-downloads CUDA programming on visual studio.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana-Champaign 1 ECE498AL Lecture 3: A Simple Example, Tools, and.

Training Program on GPU Programming with CUDA 31 st July, 7 th Aug, 14 th Aug 2011 CUDA Teaching UoM.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 Programming Massively Parallel Processors Lecture.

Introduction to CUDA (1 of n*) Patrick Cozzi University of Pennsylvania CIS Spring 2011 * Where n is 2 or 3.

L16: Introduction to CUDA November 1, 2012 CS4230.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 1 ECE408 / CS483 Applied Parallel Programming.

OpenCL Joseph Kider University of Pennsylvania CIS Fall 2011.

(1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

Martin Kruliš by Martin Kruliš (v1.0)1.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, ECE 498AL, University of Illinois, Urbana-Champaign ECE408 / CS483 Applied Parallel Programming.

© David Kirk/NVIDIA and Wen-mei W. Hwu, CS/EE 217 GPU Architecture and Programming Lecture 2: Introduction to CUDA C.

Introduction to CUDA 1 of 2 Patrick Cozzi University of Pennsylvania CIS Fall 2014.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 1 ECE 8823A GPU Architectures Module 2: Introduction.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana-Champaign 1 CUDA Threads.

1 ITCS 5/4010 Parallel computing, B. Wilkinson, Jan 14, CUDAMultiDimBlocks.ppt CUDA Grids, Blocks, and Threads These notes will introduce: One dimensional.

1 The CUDA Programming Model © David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign.

ECE408/CS483 Applied Parallel Programming Lecture 4: Kernel-Based Data Parallel Execution Model © David Kirk/NVIDIA and Wen-mei Hwu, , SSL 2014.

CS6963 L2: Introduction to CUDA January 14, L2:Introduction to CUDA.

Heterogeneous Computing With GPGPUs Matthew Piehl Overview Introduction to CUDA Project Overview Issues faced nvcc Implementation Performance Metrics Conclusions.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana-Champaign 1 Programming Massively Parallel Processors Lecture.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 1 ECE408 / CS483 Applied Parallel Programming.

GRAPHIC PROCESSING UNITS Razvan Bogdan Microprocessor Systems.

Summer School s-Science with Many-core CPU/GPU Processors Lecture 2 Introduction to CUDA © David Kirk/NVIDIA and Wen-mei W. Hwu Braga, Portugal, June 14-18,

Introduction to CUDA Programming CUDA Programming Introduction Andreas Moshovos Winter 2009 Some slides/material from: UIUC course by Wen-Mei Hwu and David.

© David Kirk/NVIDIA and Wen-mei W

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana-Champaign 1 ECE498AL Lecture 3: A Simple Example, Tools, and.

CUDA C/C++ Basics Part 2 - Blocks and Threads

Slides from “PMPP” book

A lighthearted introduction to GPGPUs Dr. Keith Schubert

Introduction to CUDA C Slide credit: Slides adapted from

CUDA Parallelism Model

Antonio R. Miele Marco D. Santambrogio Politecnico di Milano

© David Kirk/NVIDIA and Wen-mei W. Hwu,

ECE 8823A GPU Architectures Module 3: CUDA Execution Model -I

Antonio R. Miele Marco D. Santambrogio Politecnico di Milano

ECE 8823A GPU Architectures Module 2: Introduction to CUDA C

GPU Lab1 Discussion A MATRIX-MATRIX MULTIPLICATION EXAMPLE.

Quiz Questions CUDA ITCS 4/5145 Parallel Programming, UNC-Charlotte, B. Wilkinson, 2013, QuizCUDA.ppt Nov 12, 2014.

Parallel Computing 18: CUDA - I

Presentation transcript:

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 1 Introduction to CUDA C (Part 2)

A[0] vector A vector B vector C A[1]A[2]A[3]A[4]A[N-1] B[0]B[1]B[2]B[3] … B[4] … B[N-1] C[0]C[1]C[2]C[3]C[4]C[N-1] … Vector Addition – Conceptual View 2 © David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign

3 void vecAdd(float* A, float* B, float* C, int n) { int size = n * sizeof(float); float* A_d, B_d, C_d; 1. // Transfer A and B to device memory cudaMalloc((void **) &A_d, size); cudaMemcpy(A_d, A, size; cudaMemcpyHostToDevice); cudaMalloc((void **) &B_d, size); cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice); // Allocate device memory for cudaMalloc((void **) &C_d, size); 2. // Kernel invocation code – to be shown later … 3. // Transfer C from device to host cudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost); // Free device memory for A, B, C cudaFree(A_d); cudaFree(B_d); cudaFree (C_d); }

4 © David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign // Compute vector sum C = A+B // Each thread performs one pair-wise addition __global__ void vecAddKernel(float* A_d, float* B_d, float* C_d, int n) { int i = threadIdx.x + blockDim.x * blockIdx.x; if(i<n) C_d[i] = A_d[i] + B_d[i]; } int vectAdd(float* A, float* B, float* C, int n) { // A_d, B_d, C_d allocations and copies omitted // Run ceil(n/256) blocks of 256 threads each vecAddKernel >>(A_d, B_d, C_d, n); } Example: Vector Addition Kernel Device Code

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign Integer Number of Thread Blocks For a vector of 100 elements –256*4 = 1024 –The last thread block has 24 threads that have no elements to process i = blockIdx.x * blockDim.x + threadIdx.x; C_d[i] = A_d[i] + B_d[i]; … Thread Block 0 … Thread Block 1 0 i = blockIdx.x * blockDim.x + threadIdx.x; C_d[i] = A_d[i] + B_d[i]; … Thread Block N-1 0 i = blockIdx.x * blockDim.x + threadIdx.x; C_d[i] = A_d[i] + B_d[i]; … ……… 5

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign Example: Vector Addition Kernel // Compute vector sum C = A+B // Each thread performs one pair-wise addition __global__ void vecAddkernel(float* A_d, float* B_d, float* C_d, int n) { int i = threadIdx.x + blockDim.x * blockIdx.x; if(i<n) C_d[i] = A_d[i] + B_d[i]; } int vecAdd(float* A, float* B, float* C, int n) { // A_d, B_d, C_d allocations and copies omitted // Run ceil(n/256) blocks of 256 threads each vecAddKernnel >>(A_d, B_d, C_d, n); } Host Code 6

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign More on Kernel Launch int vecAdd(float* A, float* B, float* C, int n) { // A_d, B_d, C_d allocations and copies omitted // Run ceil(n/256) blocks of 256 threads each dim3 DimGrid(n/256, 1, 1); if (n%256) DimGrid.x++; dim3 DimBlock(256, 1, 1); vecAddKernnel >>(A_d, B_d, C_d, n); } Any call to a kernel function is asynchronous from CUDA 1.0 on, explicit synch needed for blocking Host Code 7

__global__ void vecAddKernel(float *A_d, float *B_d, float *C_d, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if( i<n ) C_d[i] = A_d[i]+B_d[i]; } __host__ Void vecAdd() { dim3 DimGrid = (n/256,1,1; if (n%256) DimGrid.x++; dim3 DimBlock = (256,1,1); vecAddKernel >> (A_d,B_d,C_d,n); } Kernel execution in a nutshell Kernel Blk 0 Blk N-1 GPU M0 RAM Mk Schedule onto multiprocessors © David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 8

More on CUDA Function Declarations host __host__ float HostFunc()‏ hostdevice __global__ void KernelFunc()‏ device __device__ float DeviceFunc()‏ Only callable from the: Executed on the: __global__ defines a kernel function Each “__” consists of two underscore characters A kernel function must return void __device__ and __host__ can be used together 9

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign Integrated C programs with CUDA extensions NVCC Compiler Host C Compiler/ Linker Host CodeDevice Code (PTX) Device Just-in-Time Compiler Heterogeneous Computing Platform with CPUs, GPUs Compiling A CUDA Program 10