CUDA
Antonio R. Miele, Marco D. Santambrogio
Politecnico di Milano
12 December 2017

Introduction
CUDA (Compute Unified Device Architecture) was introduced in 2007 together with the NVIDIA Tesla architecture. It provides a C-like language for writing programs that use the GPU as a processing resource to accelerate functions. CUDA's abstraction level closely matches the characteristics and organization of NVIDIA GPUs.

Structure of a program
A CUDA program is organized into a serial C code executed on the CPU and a set of independent kernel functions parallelized on the GPU.
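As a minimal illustrative sketch (not taken from the slides; the kernel name is made up), a complete CUDA source file showing these two parts, to be compiled with nvcc:

#include <cstdio>

// Device code: the kernel runs on the GPU, one instance per thread
__global__ void emptyKernel(void) {
}

// Host code: serial C/C++ executed on the CPU
int main(void) {
    emptyKernel<<<1, 1>>>();     // launch the kernel with 1 block of 1 thread
    cudaDeviceSynchronize();     // wait for the GPU to finish
    printf("kernel completed\n");
    return 0;
}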

Parallelization of a kernel
The set of input data is decomposed into a stream of elements. A single computational function (the kernel) operates on each element; a "thread" is defined as the execution of the kernel on one data element. Multiple cores can process multiple threads in parallel, which makes this model suitable for data-parallel problems.

CUDA hierarchy of threads
A kernel executes in parallel across a set of parallel threads. Each thread executes an instance of the kernel on a single data chunk. Threads are grouped into thread blocks, and thread blocks are grouped into a grid. Blocks and grids are organized as N-dimensional arrays (up to 3 dimensions), and IDs are used to identify elements within these arrays.
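A small sketch (the kernel name is illustrative, not from the slides) of how the built-in index variables expose this hierarchy inside device code:

__global__ void whereAmI(int *out) {
    // blockIdx  = ID of this block within the grid (x, y, z components)
    // blockDim  = number of threads per block in each dimension
    // threadIdx = ID of this thread within its block
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;  // unique 1D index
    out[globalId] = globalId;
}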

CUDA hierarchy of threads
The CUDA hierarchy of threads maps directly onto the two-level NVIDIA architecture: a GPU executes one or more kernel grids, a streaming multiprocessor (SM) concurrently executes one or more interleaved thread blocks, and the CUDA cores within the SM execute the threads. The SM executes threads of the same block in groups of 32 threads called warps.
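As an aside (not on the slide), these hardware parameters can be queried at run time through the CUDA runtime API; a minimal sketch, assuming device 0 is present:

#include <cstdio>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                    // query properties of device 0
    printf("warp size: %d threads\n", prop.warpSize);
    printf("streaming multiprocessors: %d\n", prop.multiProcessorCount);
    return 0;
}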

CUDA device memory model
There are mainly three distinct memory spaces visible in device-side code.
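The slide's figure is not reproduced here; assuming the three spaces are the usual global, shared, and per-thread local/register memory, a small illustrative sketch of how each appears in device code (all names are made up, and the block is assumed to have at most 256 threads):

__device__ int scale;                      // global memory: visible to all threads, persists across kernel launches

__global__ void memorySpaces(int *data) {  // data points to global memory allocated with cudaMalloc
    __shared__ int tile[256];              // shared memory: one copy per thread block
    int idx = threadIdx.x;                 // local variable: lives in per-thread registers/local memory
    tile[idx] = data[idx] * scale;
    __syncthreads();                       // make the shared tile visible to the whole block
    data[idx] = tile[idx];
}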

CUDA device memory model
Matching between the memory model and the NVIDIA GPU architecture.

Execution flow 1. Copy input data from CPU memory to GPU memory

Execution flow 2. Load GPU program and execute

Execution flow 3. Transfer results from GPU memory to CPU memory
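Put together, the three steps correspond to a host-side sketch like the following (the kernel name someKernel and the names h_in, h_out, d_in, d_out, N, blocksPerGrid and threadsPerBlock are illustrative assumptions):

int *d_in, *d_out;
cudaMalloc(&d_in,  N * sizeof(int));
cudaMalloc(&d_out, N * sizeof(int));

// 1. Copy input data from CPU memory to GPU memory
cudaMemcpy(d_in, h_in, N * sizeof(int), cudaMemcpyHostToDevice);

// 2. Load the GPU program (kernel) and execute it
someKernel<<<blocksPerGrid, threadsPerBlock>>>(d_in, d_out);

// 3. Transfer results from GPU memory back to CPU memory
cudaMemcpy(h_out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost);

cudaFree(d_in);
cudaFree(d_out);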

Example of CUDA kernel
Compute the sum of two vectors. Plain C program:

#define N 1024

void vsum(int* a, int* b, int* c) {
    int i;
    for (i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    int va[N], vb[N], vc[N];
    ...
    vsum(va, vb, vc);
}

Example of CUDA kernel
CUDA C program:

#define N 1024

__global__ void vsumKernel(int* a, int* b, int* c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}

int main() {
    ...
    dim3 blocksPerGrid(N/256, 1, 1);   // assuming 256 divides N exactly
    dim3 threadsPerBlock(256, 1, 1);
    vsumKernel<<<blocksPerGrid, threadsPerBlock>>>(va, vb, vc);
}

Example of CUDA kernel
The same CUDA C program, annotated:

#define N 1024

__global__ void vsumKernel(int* a, int* b, int* c) {
    // blockIdx.x:  x (y, z) value of the block id
    // blockDim.x:  length of the block in the x (y, z) direction
    // threadIdx.x: x (y, z) value of the thread id
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}

int main() {
    ...
    dim3 blocksPerGrid(N/256, 1, 1);   // number of blocks in the grid (assuming 256 divides N exactly)
    dim3 threadsPerBlock(256, 1, 1);   // number of threads in each block
    vsumKernel<<<blocksPerGrid, threadsPerBlock>>>(va, vb, vc);   // execute the kernel
}

Data transfer
GPU memory is separate from CPU memory: GPU memory has to be allocated, and data have to be transferred explicitly into GPU memory.

int main() {
    int *va;
    int mya[N];   // to be populated
    ...
    cudaMalloc(&va, N * sizeof(int));
    cudaMemcpy(va, mya, N * sizeof(int), cudaMemcpyHostToDevice);
    ...           // execute kernel
    cudaMemcpy(mya, va, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(va);
}

Host/device synchronization
Kernel calls are non-blocking: the host program continues immediately after it launches the kernel. This allows overlapping computation on the CPU and the GPU. Use cudaThreadSynchronize() (cudaDeviceSynchronize() in recent CUDA versions) to wait for the kernel to finish:

vsumKernel<<<blocksPerGrid, threadsPerBlock>>>(a, b, c);
// do work on the host (that doesn't depend on c)
cudaThreadSynchronize();   // wait for the kernel to finish

Standard cudaMemcpy calls are blocking; non-blocking variants exist (see the sketch below). It is also possible to synchronize the threads within the same block with the __syncthreads() call.
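A hedged sketch (not on the slide) of the non-blocking copy variant, cudaMemcpyAsync, issued on a CUDA stream together with the kernel launch; d_a, d_b, d_c, blocksPerGrid and threadsPerBlock are assumed to be set up as in the earlier examples, and the pinned host buffer h_a is illustrative:

cudaStream_t stream;
cudaStreamCreate(&stream);

int *h_a;
cudaMallocHost(&h_a, N * sizeof(int));    // pinned host memory, needed for truly asynchronous copies

cudaMemcpyAsync(d_a, h_a, N * sizeof(int), cudaMemcpyHostToDevice, stream);   // returns immediately
vsumKernel<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(d_a, d_b, d_c);     // queued after the copy
cudaStreamSynchronize(stream);            // wait for the copy and the kernel to complete

cudaFreeHost(h_a);
cudaStreamDestroy(stream);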

Another example of CUDA kernel
Sum of matrices: 2D decomposition of the problem.

__global__ void matrixAdd(float a[N][N], float b[N][N], float c[N][N]) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    c[i][j] = a[i][j] + b[i][j];
}

int main() {
    ...
    dim3 blocksPerGrid(N/16, N/16, 1);   // (N/16)x(N/16) blocks/grid (2D)
    dim3 threadsPerBlock(16, 16, 1);     // 16x16 = 256 threads/block (2D)
    matrixAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c);
}
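The slide does not show how a, b and c are set up; one possible host-side sketch (an assumption, not from the lecture) allocates each matrix as a contiguous N*N device buffer and passes it as a pointer to rows of N floats, which matches the parameter type float a[N][N]:

float (*d_a)[N], (*d_b)[N], (*d_c)[N];
cudaMalloc((void**)&d_a, N * N * sizeof(float));
cudaMalloc((void**)&d_b, N * N * sizeof(float));
cudaMalloc((void**)&d_c, N * N * sizeof(float));
// ... copy the input matrices into d_a and d_b with cudaMemcpy ...
matrixAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c);
// ... copy d_c back to the host and release the buffers with cudaFree ...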

References
NVIDIA website: http://www.nvidia.com/object/what-is-gpu-computing.html