CS 179: GPU Computing LECTURE 2: MORE BASICS

Recap
Can use GPU to solve highly parallelizable problems
Straightforward extension to C++
◦ Separate CUDA code into .cu and .cuh files and compile with nvcc to create object files (.o files)
Looked at the a[] + b[] -> c[] example

Recap
If you forgot everything else, just make sure you understand that CUDA is simply an extension of the other bits of code you write!
◦ Evident in the .cu/.cuh vs .cpp/.hpp distinction
◦ .cu/.cuh is compiled by nvcc to produce a .o file
◦ .cpp/.hpp is compiled by g++, which sees the CUDA declarations through an #include "xxx.cuh" directive; the .o file from the CUDA code is then linked in at link time
◦ No different from how you link in .o files from normal C++ code

.cu/.cuh vs .cpp/.hpp
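The original slide's code is not preserved in the transcript; below is a minimal sketch of the split, assuming hypothetical file names add.cuh, add.cu, and main.cpp (the build commands are typical invocations, not the course's exact ones):

// add.cuh -- declarations visible to both nvcc and g++
#ifndef ADD_CUH
#define ADD_CUH
// Host-side wrapper that launches the kernel; defined in add.cu.
// a, b, c are pointers to device (global) memory.
void gpuAdd(const float* a, const float* b, float* c, int n);
#endif

// add.cu -- compiled by nvcc into add.o
#include "add.cuh"
__global__ void addKernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}
void gpuAdd(const float* a, const float* b, float* c, int n) {
    addKernel<<<(n + 511) / 512, 512>>>(a, b, c, n);
}

// main.cpp -- compiled by g++; only needs the plain C++ header
#include "add.cuh"
int main() { /* allocate device arrays, then call gpuAdd(...) */ }

// Typical build (library paths depend on your CUDA install):
//   nvcc -c add.cu -o add.o
//   g++ -c main.cpp -o main.o
//   g++ main.o add.o -lcudart -o program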

Thread Organization
We will now look at how threads are organized and used in GPUs
◦ Keywords you MUST know to code in CUDA:
◦ Thread
◦ Block
◦ Grid
◦ Keywords you MUST know to code WELL in CUDA:
◦ (Streaming) Multiprocessor
◦ Warp
◦ Warp Divergence

Inside a GPU
(Figure: block diagram of a GPU. The black Xs are just crossing out things you don't have to think about just yet; you'll learn about them later.)

Inside a GPU
Think of Device Memory (we will also refer to it as Global Memory) as the RAM for your GPU
◦ Faster than fetching from the host's actual RAM, but still slow compared to on-chip memory
◦ Will come back to this in future lectures
GPUs have many Streaming Multiprocessors (SMs)
◦ Each SM has multiple processors but only one instruction unit
◦ Groups of processors must run the exact same set of instructions at any given time within a single SM
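A minimal sketch of what using global memory looks like from the host side (h_a, d_a, and n are illustrative names, not from the slides):

// Allocate, fill, and free global (device) memory from the host.
float* d_a;                                        // d_ = device pointer
cudaMalloc(&d_a, n * sizeof(float));               // allocate in global memory
cudaMemcpy(d_a, h_a, n * sizeof(float),
           cudaMemcpyHostToDevice);                // host RAM -> GPU RAM
// ... launch kernels that read and write d_a ...
cudaMemcpy(h_a, d_a, n * sizeof(float),
           cudaMemcpyDeviceToHost);                // copy results back
cudaFree(d_a);                                     // release device memory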

Inside a GPU
When a kernel (the thing you define in .cu files) is called, the task is divided up into threads
◦ Each thread handles a small portion of the given task
The threads are divided into a Grid of Blocks
◦ Both Grids and Blocks are 3-dimensional
◦ e.g. dim3 dimBlock(8, 8, 8); dim3 dimGrid(100, 100, 1); Kernel<<<dimGrid, dimBlock>>>(…);
◦ However, we'll often only work with 1-dimensional grids and blocks
◦ e.g. Kernel<<<block_count, threads_per_block>>>(…);
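Inside the kernel, each thread uses the built-in index variables to find its portion of the task; a sketch for the common 1-dimensional case (kernel name and arguments are illustrative):

__global__ void kernel(float* data, int n) {
    // Unique global index for this thread in a 1-D grid of 1-D blocks
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)        // guard: the grid may contain more threads than elements
        data[i] *= 2.0f;
}

// Host side: launch 100 blocks of 512 threads each
kernel<<<100, 512>>>(d_data, n);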

Inside a GPU
Maximum number of threads per block is usually 512 or 1024 depending on the machine
Maximum number of blocks per grid dimension is usually 65535
◦ If you go over either of these limits your GPU will just give up or output garbage data
◦ Much of GPU programming is dealing with these kinds of hardware limitations! Get used to it
◦ These limits also mean that your kernel must compensate for the fact that you may not have enough threads to assign one to each of your data points
◦ Will show how to do this later (this lecture)

Inside a GPU
Each block is assigned to an SM
Inside the SM, the block is divided into Warps of threads
◦ Warps consist of 32 threads
◦ All 32 threads MUST run the exact same set of instructions at the same time
◦ Due to the fact that there is only one instruction unit
◦ Warps are run concurrently in an SM
◦ If your kernel tries to have threads in a single warp do different things (using if statements, for example), the two paths will be run sequentially
◦ Called Warp Divergence (NOT GOOD)
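A sketch of what divergence looks like in code (both kernels are hypothetical, for illustration only):

// BAD: even and odd threads sit in the same warp but take different
// branches, so the warp runs the two paths one after the other.
__global__ void divergent(float* x) {
    if (threadIdx.x % 2 == 0)
        x[threadIdx.x] *= 2.0f;
    else
        x[threadIdx.x] += 1.0f;
}

// BETTER: the condition is uniform across each 32-thread warp,
// so no single warp has to execute both paths.
__global__ void uniform(float* x) {
    if ((threadIdx.x / 32) % 2 == 0)
        x[threadIdx.x] *= 2.0f;
    else
        x[threadIdx.x] += 1.0f;
}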

Inside a GPU (fun hardware info)
In the Fermi architecture (i.e. GPUs with Compute Capability 2.x), each SM has 32 cores
◦ e.g. GTX 400, 500 series
◦ The 32 cores are not what makes each warp have 32 threads; previous architectures also had 32 threads per warp but fewer than 32 cores per SM
Halo.cms.caltech.edu has 3 GTX 570s
◦ This course will cover CC 2.x
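You can query these hardware details at runtime through the CUDA runtime API; a small self-contained sketch:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);            // properties of device 0
    std::printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    std::printf("SM count:           %d\n", prop.multiProcessorCount);
    std::printf("Warp size:          %d\n", prop.warpSize);
    std::printf("Max threads/block:  %d\n", prop.maxThreadsPerBlock);
    return 0;
}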

Streaming Multiprocessor

A[] + B[] -> C[] (again)
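The code from this slide is not preserved in the transcript; below is a sketch of the vector-add kernel written with a grid-stride loop, one standard way to compensate for having fewer threads than data points, as promised earlier (names are illustrative):

__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    // Start at this thread's global index, then jump by the total number
    // of threads in the grid until all n elements are covered.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x) {
        c[i] = a[i] + b[i];
    }
}

// Host side: cap the block count at the grid-dimension limit;
// the stride loop picks up any leftover elements.
int blocks = (n + 511) / 512;
if (blocks > 65535) blocks = 65535;
vecAdd<<<blocks, 512>>>(d_a, d_b, d_c, n);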

Questions so far?

Stuff that will be useful later

Next Time...
Global Memory access is not that fast
◦ Tends to be the bottleneck in many GPU programs
◦ Especially true if done stupidly
◦ We'll look at what "stupidly" means
Optimize memory access by utilizing hardware-specific memory access patterns
Optimize memory access by utilizing the different caches that come with the GPU