Introduction to CUDA Li Sung-Chi Taiwan Evolutionary Intelligence Laboratory 2016/12/14 Group Meeting Presentation

Outline Brief introduction to what CUDA is What CUDA code looks like How CUDA runs CUDA details Optimization

What is CUDA An architecture for programming GPUs, first released by NVIDIA in 2007, as a C/C++ extension. There were other GPU programming architectures before CUDA, e.g. Cg and Brook.

Applications Bioinformatics, computational finance, deep learning, etc.

GPGPU GPUs have become more powerful in both computing power and (on-chip) memory bandwidth, enabling General-Purpose GPU (GPGPU) computing. A GPU has thousands of cores that run hundreds of thousands of threads concurrently. A single core is not as powerful as a CPU core, but two hands are better than one.

Hardware Requirements To run CUDA, your GPU needs to meet some requirements. Architecture compatibility: AMD GPUs are not compatible; instead, AMD released its own Boltzmann Initiative. NVIDIA cards need the Fermi architecture or later; later-generation architectures support more CUDA features.

Terminology Host: the CPU and its memory. Device: the GPU and its memory.

Heterogeneous Computing A typical program alternates between serial code on the host, parallel code on the device, and then serial code on the host again.

Processing Flow Copy data to GPU memory; execute hundreds of thousands of threads on the GPU; copy the result back to CPU memory. Kernel: the code the GPU will execute.

Hello World! Our first simple example: simple_add Parallel code: kernel
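
The slide's code listing is not included in the transcript; here is a minimal sketch of what the simple_add kernel could look like (the parameter types are an assumption):

```cpp
// Kernel (parallel code): runs on the GPU.
// __global__ marks a function that is launched from the host.
__global__ void simple_add(const int *a, const int *b, int *c) {
    *c = *a + *b;   // a single thread performs one addition
}
```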

Serial code: setup

Serial code: launch the kernel, collect the result
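
The code image is also missing from the transcript; a sketch of the host-side setup, kernel launch, and result collection under the same assumptions (error checking omitted for brevity):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int a = 2, b = 3, c = 0;
    int *d_a, *d_b, *d_c;                                   // device pointers

    // Setup: allocate device memory and copy the inputs to the GPU.
    cudaMalloc(&d_a, sizeof(int));
    cudaMalloc(&d_b, sizeof(int));
    cudaMalloc(&d_c, sizeof(int));
    cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, sizeof(int), cudaMemcpyHostToDevice);

    // Launch the kernel with 1 block of 1 thread, then collect the result.
    simple_add<<<1, 1>>>(d_a, d_b, d_c);
    cudaMemcpy(&c, d_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("2 + 3 = %d\n", c);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```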

d_c is just a pointer to device memory that stores the result of the add. If we need the value of d_c, we copy it back with cudaMemcpy and then continue our serial (CPU) code. Dereferencing *d_c directly in host code is invalid, because d_c holds a device memory address.

To launch a kernel, we call a __global__ function with the <<< >>> syntax: simple_add<<<1, 1>>>(d_a, d_b, d_c); The <<< >>> arguments decide how many threads to launch, simple_add is the kernel function, and (d_a, d_b, d_c) are the function parameters.

Thread, Block, Grid We saw that <<<1, 1>>> launches 1 thread, but what do the two numbers mean? CUDA uses a hierarchical structure to manage threads: grid, block, thread.

[Diagram: a grid contains Block 0, Block 1, Block 2, ...; each block contains thread 0, thread 1, thread 2, ...]

A grid can consist of at most a 3D arrangement of blocks; a block can consist of at most a 3D arrangement of threads.

<<<blocks per grid, threads per block>>> <<<1, 1>>>: a grid with 1 block, and that block consists of 1 thread; total threads: 1. <<<2, 3>>>: a grid with 2 blocks, and each block consists of 3 threads; total threads: 6. Why this kind of management? We'll talk about it later.
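
For reference, a small sketch of the same launch syntax written with dim3, which is how 2D or 3D grids and blocks are expressed (the kernel name my_kernel and its arguments are placeholders):

```cpp
// Scalar form: 2 blocks per grid, 3 threads per block = 6 threads total.
my_kernel<<<2, 3>>>(d_in, d_out);

// dim3 form: blocks per grid and threads per block can each be up to 3D.
dim3 grid(2, 1, 1);    // 2 x 1 x 1 blocks
dim3 block(3, 1, 1);   // 3 x 1 x 1 threads per block
my_kernel<<<grid, block>>>(d_in, d_out);   // same 6 threads as above
```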

vector_add The simple_add example is boring, so let's do a more interesting example. threadIdx.x is the x index of the thread within its block.

blockIdx.x is the x index of the block; blockDim.x is the total number of threads in the x dimension of a block (the block width). If we launch vector_add<<<2, 3>>>: for the first thread (block 0, thread 0), idx = 0 + 0 * 3 = 0; for the fourth thread (block 1, thread 0), idx = 0 + 1 * 3 = 3.
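
The kernel listing is not in the transcript; a sketch of vector_add consistent with the index arithmetic above (the n bounds check is an extra safety guard added here, not necessarily in the original slide):

```cpp
// Each thread adds one element; idx combines the thread index within the
// block with the block index scaled by the block width.
__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
        c[idx] = a[idx] + b[idx];
}
```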

We have added the vectors concurrently! Remember to cudaMalloc the right size, and also cudaMemcpy with the right size.
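
A sketch of the host-side allocation and copies with the right sizes, assuming host arrays h_a, h_b, h_c of N floats already exist:

```cpp
const int N = 6;                         // e.g. matching a <<<2, 3>>> launch
size_t bytes = N * sizeof(float);        // size in bytes, not in elements

float *d_a, *d_b, *d_c;
cudaMalloc(&d_a, bytes);                 // allocate the right size...
cudaMalloc(&d_b, bytes);
cudaMalloc(&d_c, bytes);
cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);   // ...and copy the right size
cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

vector_add<<<2, 3>>>(d_a, d_b, d_c, N);
cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
```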

Memory Types CUDA has 5 types of memory, each with different properties. Key properties: size, access speed, and read/write vs. read-only.

Memory Types Global memory: memory allocated with cudaMalloc; large but slow (cached). Texture memory: read-only, with a cache optimized for 2D access patterns. Constant memory: slow but cached (8 KB cache).

Memory Types Global memory is accessible to all threads once the kernel call passes a pointer to it. Constant memory is accessible to all threads even without passing a pointer to the kernel. Texture memory behaves like constant memory in this respect.
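
A hedged sketch of how constant memory is declared and read without passing a pointer into the kernel (the coeff array and the scale kernel are illustrative names, not from the slides):

```cpp
// Constant memory is declared at file scope and is visible to every kernel
// in this translation unit; threads read it directly, no pointer is passed.
__constant__ float coeff[16];

__global__ void scale(float *data, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
        data[idx] *= coeff[idx % 16];
}

// Host side: constant memory is filled with cudaMemcpyToSymbol, not cudaMemcpy.
void upload_coeffs(const float *h_coeff) {
    cudaMemcpyToSymbol(coeff, h_coeff, 16 * sizeof(float));
}
```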

Memory Types Local memory: private to each thread, but as slow as global memory. Shared memory: roughly 100x faster than global memory, but only accessible to the threads within one block.

Memory Types Shared memory is very fast, but usually only 48 KB per block (the shared-memory/L1 split can be configured). Shared memory is essentially the GPU's counterpart of a CPU's L1 cache, but controllable by the user. Each block has its own shared memory; that is one reason we manage threads with the grid/block hierarchy!

Shared Memory Example 1D stencil:

If we program it this way, it is straightforward, but very slow!
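
The code image is missing; a sketch of the straightforward version, where every thread reads all 2*RADIUS + 1 inputs from global memory (boundary handling is omitted, and the input is assumed to be padded with a halo):

```cpp
#define RADIUS 3

// Naive 1D stencil: each output element is the sum of its 2*RADIUS + 1
// neighbours, and every single read goes to slow global memory.
__global__ void stencil_1d_naive(const int *in, int *out) {
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += in[gindex + offset];
    out[gindex] = result;
}
```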

With shared memory: with this small change, we avoid roughly 2*RADIUS redundant global-memory reads per element.

But actually, the result can be wrong... What if thread 5 has finished its copy and starts computing tmp[3] + tmp[4] + tmp[5], but thread 6 has not yet finished copying tmp[5]? __syncthreads() makes all threads in a block synchronize: it is a barrier that blocks every thread until all threads in the block have reached that line.

Then we have the correct 1D stencil
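
A sketch of what the final kernel could look like, along the lines of the 1D stencil in the NVIDIA CUDA C basics slides listed in the references (BLOCK_SIZE is assumed to equal the threads-per-block of the launch):

```cpp
#define RADIUS 3
#define BLOCK_SIZE 256

__global__ void stencil_1d(const int *in, int *out) {
    __shared__ int tmp[BLOCK_SIZE + 2 * RADIUS];   // block-wide cache plus halo
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Each thread copies its own element; the first RADIUS threads also copy
    // the halo elements on both sides of the block.
    tmp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        tmp[lindex - RADIUS]     = in[gindex - RADIUS];
        tmp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    __syncthreads();   // barrier: nobody reads tmp until everyone has written it

    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += tmp[lindex + offset];
    out[gindex] = result;
}
```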

Reduction CUDA runs many threads; much like Hadoop's map and reduce, you can perform a reduction to do summation, search, etc. Of course, with __syncthreads().
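
A hedged sketch of a block-level sum reduction using shared memory and __syncthreads; each block writes one partial sum, and a second pass (or an atomic add) would combine the partial sums:

```cpp
// Assumes blockDim.x == 256 and that it is a power of two.
__global__ void block_sum(const float *in, float *block_results, int n) {
    __shared__ float sdata[256];
    int tid = threadIdx.x;
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    sdata[tid] = (idx < n) ? in[idx] : 0.0f;   // each thread loads one element
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            sdata[tid] += sdata[tid + stride];
        __syncthreads();                       // wait before the next step
    }

    if (tid == 0)                              // thread 0 holds the block's sum
        block_results[blockIdx.x] = sdata[0];
}
```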

CUDA-GDB CUDA kernels run on the GPU, so host-side facilities (e.g. cout) do not apply there. Devices with compute capability 2.x and above support printf inside kernels, but debugging with printf alone is not practical. cuda-gdb is a debugger based on gdb for stepping through kernel code and inspecting individual GPU threads.

Optimization To make your CUDA program fast, you need to: avoid memory copies between CPU and GPU memory; use cache (shared memory) in your kernels; choose the block count well; use array alignment; use contiguous memory access; use CUDA APIs.

Optimization Avoid memory copies between CPU and GPU memory: such copies are expensive (though not extremely so; you can still use them, just try to minimize them).

Optimization Use cache (shared memory) in your kernel: this is often the key to optimizing a CUDA program, since avoiding CPU-GPU memory copies is not that hard. In many cases, the caching is hard to implement due to the size limit, or it is unclear how to plan the caching scheme.

Optimization Choose the block count: whether to use more blocks or more threads per block is hard to decide; it is problem dependent. We need to understand how the CUDA grid, block, and thread hierarchy maps to real GPU cores.

In a GPU, the basic processing unit is the SP (streaming processor); several SPs plus some other components make up an SM (streaming multiprocessor); several SMs make up a TPC (texture processing cluster). In CUDA, we can roughly say that a grid is processed by the whole GPU, a block is processed by an SM, and a thread is processed by an SP.

Every 32 threads are grouped into a warp. If you choose a block size that is not divisible by 32, the leftover threads still form a (partially filled) warp. An SM issues one warp at a time, so if a warp has fewer than 32 active threads, some SPs sit idle and are wasted.

When a thread is waiting for data, the SM will choose other threads to execute, hiding the memory access latency. Thus, more threads in one block can hide more of this latency; but more threads in one block also means less shared memory available per thread. NVIDIA's suggestion is that one block needs at least 192 threads to hide the memory access latency.

Optimization Array alignment: memory access performs better if data items are aligned on a 64-byte boundary. Hence, aligning a 2D array so that each row starts at a 64-byte address will improve performance, but doing this by hand is difficult for the programmer!

We pad some dummy bytes at the end of each row; the padded row width is called the pitch.

You can use cudaMallocPitch to let the CUDA runtime decide the pitch and allocate memory with better access performance, then use cudaMemcpy2D with that pitch to copy the data. Disadvantages: harder for the programmer, and it wastes some memory.
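
A sketch of pitched allocation and a 2D copy; width, height, and the host array h_arr are illustrative, and error checking is omitted:

```cpp
int width = 100, height = 64;       // logical 2D array of floats (100 columns, 64 rows)
float *d_arr;
size_t pitch;                       // bytes per padded row, chosen by the runtime

// The runtime picks a pitch so that every row starts at a well-aligned address.
cudaMallocPitch((void **)&d_arr, &pitch, width * sizeof(float), height);

// Copy a tightly packed host array into the pitched device array.
cudaMemcpy2D(d_arr, pitch,                      // destination and its row pitch
             h_arr, width * sizeof(float),      // source and its row pitch
             width * sizeof(float), height,     // width in bytes, number of rows
             cudaMemcpyHostToDevice);
```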

Optimization Contiguous memory access: if threads access memory in a contiguous (coalesced) pattern, performance improves, e.g. when copying a block of contiguous memory.
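
A sketch contrasting a coalesced access pattern with a strided one; the kernels are illustrative, not from the slides:

```cpp
// Coalesced: consecutive threads touch consecutive addresses, so a warp's
// 32 loads combine into a few wide memory transactions.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) out[idx] = in[idx];
}

// Strided: consecutive threads touch addresses far apart, so each load may
// need its own transaction -- same work, much lower effective bandwidth.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int idx = (threadIdx.x + blockIdx.x * blockDim.x) * stride;
    if (idx < n) out[idx] = in[idx];
}
```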

Optimization Use CUDA APIs: the CUDA toolkit provides many basic math functions (sin, log, ...) and libraries such as cuRAND (random numbers). Using these built-in APIs generally improves performance.
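
A small sketch using the built-in device math functions; the transform kernel is illustrative, and the __sinf/__logf intrinsics trade some precision for extra speed:

```cpp
__global__ void transform(const float *in, float *out, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) {
        out[idx] = sinf(in[idx]) + logf(in[idx] + 1.0f);   // built-in device math
        // Faster, lower-precision intrinsics also exist: __sinf(x), __logf(x).
    }
}
```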

Conclusion The GPU is a powerful computing tool that runs hundreds of thousands of threads, but programming a GPU is not a simple thing; an incorrect programming pattern can even decrease performance. CUDA is a GPGPU computing model, acting as a computing assistant to the CPU. GPGPU is very powerful, but only in some areas (e.g. scientific applications).

References
http://neuralnetworksanddeeplearning.com/chap6.html
http://www.stcorp.no/technology/bioinformatics/
http://www.qcfinance.in
http://www.nvidia.com/docs/io/116711/sc11-cuda-c-basics.pdf
http://cs.brown.edu/courses/cs195v/lecture/week10.pdf
http://www.nvidia.com/object/gpu-applications.html
http://supercomputingblog.com/cuda-tutorials/
https://kheresy.wordpress.com/2008/07/09/cuda-%E7%9A%84-threading%EF%BC%9Ablock-%E5%92%8C-grid-%E7%9A%84%E8%A8%AD%E5%AE%9A%E8%88%87-warp/#more-730
http://users.wfu.edu/choss/CUDA/docs/Lecture%205.pdf
http://stackoverflow.com/questions/10256402/why-is-the-constant-memory-size-limited-in-cuda
http://stackoverflow.com/questions/19309800/cuda-how-to-launch-a-new-kernel-call-in-one-kernel-function
http://www-inst.eecs.berkeley.edu/~cs61c/sp14/labs/10/