+ CUDA Antonyus Pyetro do Amaral Ferreira

+ The problem The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems. The challenge is to develop application software that transparently scales its parallelism to leverage the increasing number of processor cores.

+ A solution CUDA is a parallel programming model and software environment designed to meet this challenge. A compiled CUDA program can execute on any number of processor cores, and only the runtime system needs to know the physical processor count.

+ CPU vs. GPU

+ The GPU is especially well-suited to address problems that can be expressed as data-parallel computations. Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control.

+ Applications? From general signal processing and physics simulation to computational finance and computational biology. The latest generation of NVIDIA GPUs, based on the Tesla architecture, supports the CUDA programming model.

+ CUDA - Hello world
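
The code for this slide does not survive in the transcript; what follows is a minimal sketch of the usual CUDA hello-world, not necessarily the author's exact listing:

    #include <stdio.h>

    // Empty kernel: it runs on the device but does nothing yet.
    __global__ void kernel(void) { }

    int main(void) {
        kernel<<<1, 1>>>();       // launch 1 block of 1 thread on the device
        cudaDeviceSynchronize();  // wait for the device to finish
        printf("Hello, world\n");
        return 0;
    }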

+ What is CUDA? CUDA extends C by allowing the programmer to define C functions, called kernels, that are executed N times in parallel by N different CUDA threads. Each thread that executes a kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable.

+ CUDA Sum of vectors
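
The slide's listing is likewise an image in the original. A sketch of the kernel it describes, assuming device arrays A, B, C of length N and a single block of N threads:

    // Each of the N threads adds one pair of elements.
    __global__ void vecAdd(const float* A, const float* B, float* C) {
        int i = threadIdx.x;   // unique thread ID within the block
        C[i] = A[i] + B[i];
    }

    // Host-side launch with N threads in one block:
    // vecAdd<<<1, N>>>(A, B, C);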

+ Concurrency Threads within a block can cooperate among themselves by sharing data through some shared memory. __syncthreads() acts as a barrier at which all threads in the block must wait before any are allowed to proceed.
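
A small sketch of such cooperation (a hypothetical kernel that reverses an array within a single block, with the barrier separating the shared-memory writes from the reads):

    __global__ void reverse(const int* in, int* out) {
        extern __shared__ int s[];       // dynamically allocated shared memory
        int t = threadIdx.x;
        s[t] = in[t];
        __syncthreads();                 // all writes complete before any thread reads
        out[t] = s[blockDim.x - 1 - t];
    }

    // Launched as: reverse<<<1, N, N * sizeof(int)>>>(d_in, d_out);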

+ Process Hierarchy

+ Memory Hierarchy Per-thread local memory Per-block shared memory

+ Memory Hierarchy Global Memory

+ Host and Device CUDA assumes that the CUDA threads may execute on a physically separate device that operates as a coprocessor to the host. CUDA also assumes that both the host and the device maintain their own DRAM, referred to as host memory and device memory, respectively.
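
A sketch of the resulting round trip between the two memories (hypothetical helper; h_A is a host array of N floats):

    #include <cuda_runtime.h>

    void roundTrip(float* h_A, int N) {
        size_t bytes = N * sizeof(float);
        float* d_A = NULL;
        cudaMalloc((void**)&d_A, bytes);                      // allocate device DRAM
        cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);  // host memory -> device memory
        // ... kernel launches operating on d_A go here ...
        cudaMemcpy(h_A, d_A, bytes, cudaMemcpyDeviceToHost);  // device memory -> host memory
        cudaFree(d_A);                                        // release device memory
    }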

+ Software Stack From top to bottom: Application, CUDA Libraries, CUDA Runtime, CUDA Driver, spanning the host and the device.

+ Language Extensions Function type qualifiers to specify whether a function executes on the host or on the device and whether it is callable from the host or from the device. Variable type qualifiers to specify the memory location on the device of a variable.

+ Language Extensions A new directive to specify how a kernel is executed on the device from the host: vecAdd<<<1, N>>>(A, B, C); Four built-in variables that specify the grid and block dimensions and the block and thread indices.

+ Function Type Qualifiers
__device__ : executed on the device, callable from the device only.
__global__ : executed on the device, callable from the host only.
__host__ : executed on the host, callable from the host only; this is the default when no qualifier is given.
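
A sketch putting the three qualifiers side by side (hypothetical names):

    __device__ float square(float x) { return x * x; }  // device-only helper

    __global__ void squareAll(float* data) {             // kernel: device code, host-launched
        data[threadIdx.x] = square(data[threadIdx.x]);
    }

    __host__ void launch(float* d_data, int n) {         // ordinary host function
        squareAll<<<1, n>>>(d_data);
    }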

+ Variable Type Qualifiers
__device__ : resides in the global memory space; accessible from all the threads within the grid.
__constant__ : resides in the constant memory space; accessible from all the threads within the grid.
__shared__ : resides in the shared memory space of a thread block; accessible only from the threads within the block.
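
And the same for the variable qualifiers (a sketch, assuming blocks of at most 256 threads):

    __device__ float d_scale = 2.0f;  // global memory: visible to every thread in the grid
    __constant__ float c_bias[16];    // constant memory: read-only on the device, grid-wide

    __global__ void transform(float* data) {
        __shared__ float tile[256];   // one copy per thread block
        int t = threadIdx.x;
        tile[t] = data[t] * d_scale + c_bias[t % 16];
        __syncthreads();
        data[t] = tile[t];
    }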

+ Execution Configuration Any call to a __global__ function must specify the execution configuration for that call. The execution configuration defines the dimension of the grid and blocks that will be used to execute the function on the device.

+ Execution Configuration <<< Dg, Db, Ns, S >>>
Dg is of type dim3 and specifies the dimension and size of the grid, such that Dg.x * Dg.y equals the number of blocks being launched; Dg.z is unused.
Db is of type dim3 and specifies the dimension and size of each block, such that Db.x * Db.y * Db.z equals the number of threads per block.
Ns is of type size_t and specifies the number of bytes of shared memory dynamically allocated per block for this call, in addition to the statically allocated memory; this dynamically allocated memory is used by any variable declared as an external array. Ns is an optional argument which defaults to 0.
S is of type cudaStream_t and specifies the associated stream; S is an optional argument which defaults to 0.
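
A sketch of a launch that exercises Dg, Db, and Ns (hypothetical kernel and sizes):

    __global__ void myKernel(float* data) {
        extern __shared__ float buf[];  // backed by the Ns bytes supplied at launch
        // ...
    }

    int main(void) {
        float* d_data;
        cudaMalloc((void**)&d_data, 256 * 64 * sizeof(float));
        dim3 Dg(16, 16);                   // Dg.x * Dg.y = 256 blocks
        dim3 Db(8, 8);                     // Db.x * Db.y * Db.z = 64 threads per block
        size_t Ns = 64 * sizeof(float);    // dynamic shared memory per block
        myKernel<<<Dg, Db, Ns>>>(d_data);  // S omitted, so the default stream 0 is used
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }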

+ Built-in Variables
gridDim : of type dim3; contains the dimensions of the grid.
blockIdx : of type uint3; contains the block index within the grid.
blockDim : of type dim3; contains the dimensions of the block.
threadIdx : of type uint3; contains the thread index within the block.
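
These combine into the standard global-index idiom (a sketch for a one-dimensional grid of one-dimensional blocks):

    __global__ void fill(int* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global thread index
        if (i < n)                                      // the grid may overshoot n
            out[i] = i;
    }

    // Cover n elements with 256-thread blocks:
    // fill<<<(n + 255) / 256, 256>>>(d_out, n);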

+ Example – Matrix multiplication Task: C = A(hA, wA) × B(wA, wB). Each thread block is responsible for computing one square sub-matrix Csub of C; each thread within the block is responsible for computing one element of Csub.

+ Example – Matrix multiplication Csub is equal to the product of two rectangular matrices: the sub-matrix of A of dimension (wA, block_size) and the sub-matrix of B of dimension (block_size, wA). These two rectangular matrices are divided into as many square matrices of dimension block_size as necessary, and Csub is computed as the sum of the products of these square matrices.

+ Example – Matrix multiplication
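
The kernel itself is an image in the transcript; a sketch of the tiled scheme the two previous slides describe, assuming wA and wB are multiples of BLOCK_SIZE and row-major storage:

    #define BLOCK_SIZE 16

    __global__ void matMul(const float* A, const float* B, float* C,
                           int wA, int wB) {
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];  // tile of A for this block
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];  // tile of B for this block

        int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
        int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
        float sum = 0.0f;

        // Walk the square tiles of A (left to right) and B (top to bottom).
        for (int t = 0; t < wA / BLOCK_SIZE; ++t) {
            As[threadIdx.y][threadIdx.x] = A[row * wA + t * BLOCK_SIZE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * wB + col];
            __syncthreads();               // both tiles loaded before any reads
            for (int k = 0; k < BLOCK_SIZE; ++k)
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();               // done reading before the next load
        }
        C[row * wB + col] = sum;           // one element of Csub per thread
    }

    // Launched as: matMul<<<dim3(wB / BLOCK_SIZE, hA / BLOCK_SIZE),
    //                       dim3(BLOCK_SIZE, BLOCK_SIZE)>>>(d_A, d_B, d_C, wA, wB);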

+ Compilation with NVCC In MS Visual Studio 2005: Tools > Options > Projects and Solutions > VC++ Directories > Include files / Library files. Point them to: C:\Program Files\NVIDIA Corporation\NVIDIA CUDA SDK\common\inc
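
On the command line, the equivalent compilation is a single nvcc invocation (assuming a source file vecAdd.cu):

    nvcc vecAdd.cu -o vecAdd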

+ CUDA Interoperability OpenGL Direct3D