Programming with CUDA WS 08/09 Lecture 3 Thu, 30 Oct, 2008.

Previously
CUDA programming model
–GPU as co-processor
–Kernel definition and invocation
–Thread blocks: 1D, 2D, 3D
–Thread ID and threadIdx
–Global/shared memory for threads
–Compute capability

Today
Theory/practical course?
CUDA programming model
–Limitations on number of threads
–Grids of thread blocks

Today
Theory/practical course?
–The course is meant to be practical: programming with CUDA
–Is that a problem for some of you?
–Should we change something?

The CUDA Programming Model (cont'd)

Number of threads
A kernel is executed on the device simultaneously by many threads:
  dim3 blockSize(Dx, Dy, Dz); // for a 1D block, Dy = Dz = 1; for a 2D block, Dz = 1
  kernel<<<1, blockSize>>>(...);
–# threads per block = Dx * Dy * Dz
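As a concrete illustration (not part of the original slides), here is a minimal sketch of launching a kernel with a 2D block; the kernel name scaleElements and all sizes are invented for the example.

__global__ void scaleElements(float *data, int width, float factor) {
    int x = threadIdx.x;                     // position of this thread inside the block
    int y = threadIdx.y;
    data[y * width + x] *= factor;           // each thread handles one element
}

int main() {
    const int Dx = 8, Dy = 8;                // 8 x 8 = 64 threads in one block
    float *d_data;
    cudaMalloc(&d_data, Dx * Dy * sizeof(float));
    cudaMemset(d_data, 0, Dx * Dy * sizeof(float));

    dim3 blockSize(Dx, Dy);                  // Dz defaults to 1, i.e. a 2D block
    scaleElements<<<1, blockSize>>>(d_data, Dx, 2.0f);   // one block, Dx*Dy threads
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}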

A bit about the hardware
The GPU consists of several multiprocessors
Each multiprocessor consists of several processors
Each processor in a multiprocessor has its local memory in the form of registers
All processors in a multiprocessor have access to a shared memory

Threads and processors
All threads in a block run on the same multiprocessor.
–They might not all run at the same time
–Therefore, threads should be independent of each other
–__syncthreads() causes all threads in a block to reach the same execution point before carrying on
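To make the role of __syncthreads() concrete, here is a small sketch (not from the slides): one block reverses a tile held in shared memory, and the barrier guarantees every element has been loaded before any thread reads a neighbour's slot. The kernel name and tile size are assumptions for the example.

__global__ void reverseTile(float *data, int n) {
    __shared__ float tile[256];      // visible to all threads of the block
    int i = threadIdx.x;

    tile[i] = data[i];               // each thread loads one element
    __syncthreads();                 // barrier: wait until the whole tile is loaded

    data[i] = tile[n - 1 - i];       // safe: the element written by another thread is ready
}

// Launched with a single block of n <= 256 threads, e.g. reverseTile<<<1, 256>>>(d_data, 256);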

Threads and processors
How many threads can run on a multiprocessor? It depends on
–how much memory the multiprocessor has
–how much memory each thread requires

Threads and processors
How many threads can a block have? This also depends on
–how much memory the multiprocessor has
–how much memory each thread requires
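These limits can be queried at runtime through the CUDA runtime API; the following sketch simply prints the relevant properties of device 0 (the concrete numbers depend on the GPU and CUDA version).

#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                     // query device 0

    printf("multiprocessors:         %d\n", prop.multiProcessorCount);
    printf("max threads per block:   %d\n", prop.maxThreadsPerBlock);
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("registers per block:     %d\n", prop.regsPerBlock);
    return 0;
}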

Grids of Blocks
What if I want to run more threads?
–Launch multiple blocks of threads
–These form a grid of blocks
A grid can be 1D or 2D

Grids of Blocks
Example of a 1D grid. Invoke (in main):
  int N; // assign some value to N
  dim3 blockDimension(N, N);
  kernel<<<N, blockDimension>>>(...); // 1D grid of N blocks

Example of a 2D grid. Invoke (in main):
  int N; // assign some value to N
  dim3 blockDimension(N, N);
  dim3 gridDimension(N, N);
  kernel<<<gridDimension, blockDimension>>>(...); // 2D grid of N x N blocks

Grids of Blocks
Invoking a grid:
  kernel<<<gridDimension, blockDimension>>>(...);
–# threads = gridDimension * blockDimension (blocks per grid times threads per block)

Accessing block information
Grids can be 1D or 2D
The index of a block in a grid is available through the blockIdx variable
The dimensions of a block are available through the blockDim variable
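Combining blockIdx, blockDim and threadIdx, a common pattern (shown here as a sketch invented for illustration) is to give each thread a unique global index into an array:

__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // block offset + position in block
    if (i < n)                                       // guard: the last block may be partly idle
        data[i] += 1.0f;
}

// Typical launch for n elements with 256 threads per block:
//   int blocksPerGrid = (n + 255) / 256;
//   addOne<<<blocksPerGrid, 256>>>(d_data, n);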

Arranging blocks
Threads in a block should be independent of other threads in the block
Blocks in a grid should be independent of other blocks in the grid

Memory available to threads
Each thread has a local memory
Threads in a block share a shared memory
All threads can access the global memory
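The three memory spaces can be seen side by side in this short sketch (invented for illustration; it assumes the kernel is launched with 128 threads per block):

__global__ void memorySpaces(float *globalData) {
    float localValue;                        // local memory: private to this thread
    __shared__ float sharedBuf[128];         // shared memory: one copy per block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sharedBuf[threadIdx.x] = globalData[i];  // global memory: visible to all threads
    __syncthreads();

    localValue = 2.0f * sharedBuf[threadIdx.x];
    globalData[i] = localValue;
}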

Memory available to threads
All threads have read-only access to the constant and texture memories
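For example (a sketch, not part of the original slides), constant memory is declared with __constant__ and written from the host with cudaMemcpyToSymbol; the kernel and variable names are made up.

__constant__ float coeffs[4];                // read-only from the device side

__global__ void applyCoeffs(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = coeffs[0] * data[i] + coeffs[1];   // threads may only read coeffs
}

// Host side, before the launch:
//   float h_coeffs[4] = {2.0f, 1.0f, 0.0f, 0.0f};
//   cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));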

Memory available to threads
An application is expected to manage
–global, constant and texture memory spaces
–data transfer between host and device memories
–(de)allocating host and device memory
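A typical host-side pattern (again a sketch with placeholder names and sizes) allocates device memory, copies the input in, launches a kernel, copies the result back, and frees both sides:

#include <cstdlib>

__global__ void doubleElements(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1024;
    const size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes);              // host memory
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    float *d_data;
    cudaMalloc(&d_data, bytes);                          // device (global) memory
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    doubleElements<<<(n + 255) / 256, 256>>>(d_data, n);

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_data);                                    // the application frees what it allocated
    free(h_data);
    return 0;
}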

Have a nice weekend
See you next time