
Shekoofeh Azizi Spring 2012 1

 CUDA is a parallel computing platform and programming model invented by NVIDIA.  With CUDA, you can send C, C++, and Fortran code straight to the GPU; no assembly language is required. 2

 Development Environment  Introduction to CUDA C  CUDA programming model  Kernel call  Passing parameters  Parallel Programming in CUDA C  Example: summing vectors  Limitations  Hierarchy of blocks and threads  Shared memory and synchronization  CUDA memory model  Example: dot product 3

 The prerequisites for developing code in CUDA C:  A CUDA-enabled graphics processor  An NVIDIA device driver  The CUDA development toolkit  A standard C compiler 4

 Every NVIDIA GPU released since 2006 has been CUDA-enabled.  Frequently Asked Questions  How can I find out which GPU is in my computer?  Do I have a CUDA-enabled GPU in my computer? 5

 Control Panel → "NVIDIA Control Panel" or "NVIDIA Display" 6

 A complete list of CUDA-enabled GPUs is available on NVIDIA's website. 7

 System software that allows your programs to communicate with the CUDA-enabled hardware.  The driver appropriate for your graphics card and operating system can be downloaded from NVIDIA's website.  CUDA-enabled GPU + NVIDIA's device driver = the ability to run compiled CUDA C code. 8

 Two different processors  CPU  GPU  Need two compilers  One compiler compiles code for your CPU.  One compiler compiles code for your GPU.  NVIDIA provides the compiler for your GPU code as part of the CUDA Toolkit, available from NVIDIA's website.  Standard C compiler: e.g., the Microsoft Visual Studio C compiler 9

 Development Environment  Introduction to CUDA C  CUDA programming model  Kernel call  Passing parameters  Parallel Programming in CUDA C  Example: summing vectors  Limitations  Hierarchy of blocks and threads  Shared memory and synchronization  CUDA memory model  Example: dot product 10

 Host: the CPU and the system's memory  Device: the GPU and its memory  Kernel: a function that executes on the device  Kernels execute as parallel threads in a SIMT architecture 11

12

 An empty function named kernel(), qualified with __global__  A call to the empty function, embellished with <<<1,1>>> 13

 __global__  CUDA C needed a linguistic method for marking a function as device code  It is shorthand for sending host code to one compiler and device code to another compiler.  <<< >>>  The angle brackets denote arguments we plan to pass to the runtime system  These are not arguments to the device code  They influence how the runtime will launch our device code 14
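
A minimal sketch of what these two pieces might look like together; the file layout and the printf are illustrative, not taken from the slides:

#include <stdio.h>

// Device code: __global__ marks kernel() as a function that runs on the GPU.
__global__ void kernel(void) {
}

// Host code: compiled by the standard C compiler.
int main(void) {
    // <<<1,1>>> is the launch configuration: 1 block containing 1 thread.
    kernel<<<1,1>>>();
    printf("Hello, World!\n");
    return 0;
}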

15

 Allocate the memory on the device → cudaMalloc()  A pointer to the pointer you want to hold the address of the newly allocated memory  Size of the allocation you want to make  Access memory on a device → cudaMemcpy()  cudaMemcpyHostToDevice  cudaMemcpyDeviceToHost  cudaMemcpyDeviceToDevice  Release memory we’ve allocated with cudaMalloc()→ cudaFree() 16
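
A minimal host/device sketch of this API in use, assuming a small add() kernel; the names and values are illustrative:

// Kernel: runs on the device and writes its result through a device pointer.
__global__ void add(int a, int b, int *c) {
    *c = a + b;
}

int main(void) {
    int c;
    int *dev_c;

    // Allocate memory on the device; cudaMalloc fills in the device pointer.
    cudaMalloc((void**)&dev_c, sizeof(int));

    // Launch the kernel, passing parameters just like a normal C call.
    add<<<1,1>>>(2, 7, dev_c);

    // Copy the result from device memory back to host memory.
    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);

    // Release the device allocation.
    cudaFree(dev_c);
    return 0;
}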

 Restrictions on the usage of device pointers:  You can pass pointers allocated with cudaMalloc() to functions that execute on the device.  You can use pointers allocated with cudaMalloc() to read or write memory from code that executes on the device.  You can pass pointers allocated with cudaMalloc() to functions that execute on the host.  You cannot use pointers allocated with cudaMalloc() to read or write memory from code that executes on the host. 17

 Development Environment  Introduction to CUDA C  CUDA programming model  Kernel call  Passing parameters  Parallel Programming in CUDA C  Example: summing vectors  Limitations  Hierarchy of blocks and threads  Shared memory and synchronization  CUDA memory model  Example: dot product 18

 Example : Summing vectors 19

20

21 GPU Code : add<<<N,1>>>

 Allocate 3 arrays on the device → cudaMalloc()  Copy the input data to the device → cudaMemcpy()  Execute device code → add<<<N,1>>>( dev_a, dev_b, dev_c )  first parameter: the number of parallel blocks  second parameter: the number of threads per block  N blocks x 1 thread/block = N parallel threads  Parallel copies of the kernel → blocks 22
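
A sketch of the block-based vector sum those steps describe; N and the fill pattern are illustrative:

#define N 512

// Each of the N parallel blocks handles one element; blockIdx.x selects it.
__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x;
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

int main(void) {
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    // Allocate the three arrays on the device.
    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_c, N * sizeof(int));

    // Fill the inputs on the host, then copy them to the device.
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = i * i; }
    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    // N blocks x 1 thread/block = N parallel threads.
    add<<<N,1>>>(dev_a, dev_b, dev_c);

    // Copy the result back and release the device memory.
    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return 0;
}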

23

24

 Change the index computation within the kernel  Change the kernel launch 25
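
One plausible version of that change, indexing by thread instead of block; it assumes the same add() kernel and device buffers as in the previous sketch:

// Kernel change: use the thread index within the block.
__global__ void add(int *a, int *b, int *c) {
    int tid = threadIdx.x;
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

// Launch change: 1 block of N threads instead of N blocks of 1 thread.
// add<<<1,N>>>(dev_a, dev_b, dev_c);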

26

27 Grid : the collection of parallel blocks; each block contains threads
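
To use both levels of the hierarchy at once, a common pattern (a sketch, not necessarily the exact code on the slide) combines the block and thread indices into one global index:

__global__ void add(int *a, int *b, int *c) {
    // Global index of this thread across the whole grid.
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

// Launch enough 128-thread blocks to cover N elements (rounding up).
// add<<<(N + 127) / 128, 128>>>(dev_a, dev_b, dev_c);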

 Development Environment  Introduction to CUDA C  CUDA programming model  Kernel call  Passing parameters  Parallel Programming in CUDA C  Example: summing vectors  Limitations  Hierarchy of blocks and threads  Shared memory and synchronization  CUDA memory model  Example: dot product 28

29

 Per thread  registers  local memory  Per block  shared memory  Per grid  global memory  constant memory  texture memory 30
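
A short sketch of how these memory spaces appear in CUDA C source; the variable names are illustrative, and texture memory (accessed through the texture APIs rather than a plain qualifier) is omitted:

__constant__ float coeff[16];      // constant memory: per grid, read-only in kernels
__device__   float table[256];     // global memory: per grid, read/write

// Assumes a launch with at most 128 threads per block.
__global__ void memory_spaces(float *out) {
    __shared__ float cache[128];   // shared memory: one copy per block
    float tmp = 1.0f;              // ordinary local variable: kept in a per-thread register
    // Large per-thread arrays or register spills end up in per-thread local memory.
    cache[threadIdx.x] = tmp;      // each thread writes its own slot
    out[threadIdx.x] = cache[threadIdx.x] + coeff[0] + table[0];
}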

 __shared__  The CUDA C compiler treats variables in shared memory differently from typical variables.  It creates a copy of the variable for each block that you launch on the GPU.  Every thread in that block shares that memory.  Threads cannot see or modify the copy of this variable seen within other blocks.  Threads within a block can communicate and collaborate on computations. 31

32

 The computation consists of two steps:  First, we multiply corresponding elements of the two input vectors.  Second, we sum them all to produce a single scalar output.  Dot product of two four-element vectors: (a1, a2, a3, a4) · (b1, b2, b3, b4) = a1b1 + a2b2 + a3b3 + a4b4 33

34

 Buffer of shared memory, cache[] → stores each thread's running sum  Each thread in the block has a place to store its temporary result.  Need to sum all the temporary values we've placed in the cache.  Need some of the threads to read the values from this cache.  Need a method to guarantee that all of these writes to the shared array cache[] complete before anyone tries to read from this buffer.  When the first thread executes the first instruction after __syncthreads(), every other thread in the block has also finished executing up to the __syncthreads(). 35

 Reduction: the general process of taking an input array and performing some computations that produce a smaller array of results.  Having one thread iterate over the shared memory and calculate a running sum takes time proportional to the length of the array.  Doing this reduction in parallel takes time proportional to the logarithm of the length of the array. 36

 Parallel reduction:  Each thread adds two of the values in cache[] and stores the result back to cache[].  With 256 threads per block, it takes 8 iterations of this process to reduce the 256 entries in cache[] to a single sum.  Before reading the values just stored in cache[], we need to ensure that every thread that needs to write to cache[] has already done so. 37
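
Putting these pieces together, a sketch of a dot-product kernel along these lines; the vector length is passed in as n, and the names and sizes are illustrative:

#define THREADS_PER_BLOCK 256

__global__ void dot(int n, float *a, float *b, float *partial) {
    __shared__ float cache[THREADS_PER_BLOCK];  // one running sum per thread in this block
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int cacheIndex = threadIdx.x;

    // Step 1: each thread multiplies corresponding elements and keeps a running sum.
    float temp = 0.0f;
    while (tid < n) {
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }
    cache[cacheIndex] = temp;

    // All writes to cache[] must finish before any thread reads from it.
    __syncthreads();

    // Step 2: parallel reduction; each pass halves the number of active threads
    // (8 passes for 256 threads). Assumes THREADS_PER_BLOCK is a power of two.
    for (int i = blockDim.x / 2; i != 0; i /= 2) {
        if (cacheIndex < i)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum; the host adds up the partial sums.
    if (cacheIndex == 0)
        partial[blockIdx.x] = cache[0];
}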

38

1. Allocate host and device memory for the input and output arrays 2. Fill the input arrays a[] and b[] 3. Copy the input arrays to the device using cudaMemcpy() 4. Call the dot product kernel using some predetermined number of threads per block and blocks per grid 39
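
A host-side sketch of those four steps, assuming the dot() kernel outlined above; the array size and launch configuration are illustrative:

#include <stdlib.h>

int main(void) {
    const int n = 32 * 1024;
    const int threadsPerBlock = 256;
    const int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;

    // 1. Allocate host and device memory for the inputs and the per-block partial sums.
    float *a = (float*)malloc(n * sizeof(float));
    float *b = (float*)malloc(n * sizeof(float));
    float *partial = (float*)malloc(blocksPerGrid * sizeof(float));
    float *dev_a, *dev_b, *dev_partial;
    cudaMalloc((void**)&dev_a, n * sizeof(float));
    cudaMalloc((void**)&dev_b, n * sizeof(float));
    cudaMalloc((void**)&dev_partial, blocksPerGrid * sizeof(float));

    // 2. Fill the input arrays a[] and b[].
    for (int i = 0; i < n; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    // 3. Copy the input arrays to the device.
    cudaMemcpy(dev_a, a, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, n * sizeof(float), cudaMemcpyHostToDevice);

    // 4. Launch the kernel, then copy back and combine the per-block partial sums.
    dot<<<blocksPerGrid, threadsPerBlock>>>(n, dev_a, dev_b, dev_partial);
    cudaMemcpy(partial, dev_partial, blocksPerGrid * sizeof(float), cudaMemcpyDeviceToHost);
    float result = 0.0f;
    for (int i = 0; i < blocksPerGrid; i++) result += partial[i];

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_partial);
    free(a); free(b); free(partial);
    return 0;
}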

Thanks! Any questions? 40

41