Would'a, CUDA, Should'a
CUDA: Compute Unified Device Architecture
OU Supercomputing Symposium: Highly-Threaded HPC

Jargon Alert
This field is filled with jargon.
- CUDA != GPGPU programming. CUDA is a C-like language that makes it easier to get your code onto the GPU; GPGPU programming is assembly/bit manipulation directly on the hardware.
- Big difference? You make the call.

History
Mid-1990s:
- Stream processors at Stanford: Imagine and Merrimac
- SIMD style of processing
- Programming language for Merrimac: Brook, later adapted to GPUs
- Brook was adopted by ATI as a programming language for their GPUs
- NVIDIA: CUDA, which shares some commonality with Brook

Speedup
If you can figure out the right mapping, speedups can be extraordinary.
Source:

CPU vs. GPU Layout
Source: NVIDIA CUDA Programming Guide

Buzzwords
- Kernel: code that will be run in lock-step on the stream processors of the GPU.
- Thread: one execution of a kernel with a given index. Each thread uses its index to access elements in an array, so that the collection of all threads cooperatively processes the entire data set.
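As a concrete (and entirely hypothetical) sketch of that idea, here is a one-line kernel in which each thread of a single block uses its own index to pick out its element:

    // Each thread adds 1 to "its" element of the array; threadIdx.x is the
    // thread's index within the block. (Illustrative kernel, not from the slides.)
    __global__ void addOne(float *a)
    {
        a[threadIdx.x] += 1.0f;
    }

    // Launched with one block of N threads: addOne<<<1, N>>>(d_a);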

Buzzwords
- Block: a group of threads. Just like those dastardly pthread'ed threads, these could execute concurrently or independently, in no particular order. Threads can be coordinated somewhat, using the __syncthreads() function as a barrier that makes all threads stop at a certain point in the kernel before moving on en masse.
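A minimal sketch of __syncthreads() acting as a barrier (the kernel name and the 256-slot tile are assumptions for illustration): every thread stages one value into shared memory, and the barrier guarantees all writes have landed before any thread reads a neighbor's slot.

    __global__ void reverseInBlock(float *d_out, const float *d_in)
    {
        __shared__ float tile[256];           // one slot per thread; assumes blockDim.x <= 256
        int t = threadIdx.x;
        tile[t] = d_in[t];                    // every thread writes its slot...
        __syncthreads();                      // ...barrier: all writes complete...
        d_out[t] = tile[blockDim.x - 1 - t];  // ...before anyone reads another thread's slot
    }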

Buzzwords
- Grid: a group of (thread) blocks.
- There is no synchronization at all between the blocks.

Hierarchy
- Grids map to GPUs.
- Blocks map to the multiprocessors (MPs). Blocks are never split across MPs, but one MP can run multiple blocks.
- Threads map to stream processors (SPs).
- Warps are groups of 32 threads that execute simultaneously.
Image source: NVIDIA CUDA Programming Guide
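To make the mapping concrete, here is a sketch of how a launch spells out that hierarchy (myKernel, d_data, and the shapes are illustrative assumptions):

    dim3 threadsPerBlock(256);   // threads, scheduled in warps of 32, map to SPs
    dim3 numBlocks(16);          // blocks map to (and never straddle) MPs
    myKernel<<<numBlocks, threadsPerBlock>>>(d_data);   // the whole grid maps to the GPU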

Getting your feet wet
In your home directory on LittleFe, there's a subdirectory called "CUDA-Examples". Inside this directory, pull up the file movearrays.cu (".cu" is the conventional extension for CUDA source files).
Source: Dr. Dobb's Journal, hpc-high-performance-computing,

Getting your feet wet
Looking at the code:
- Looks a lot like C. Yay! Boo!
- Notice the cuda* calls: cudaMalloc, cudaMemcpy, cudaFree.
- This code simply allocates memory locally, allocates memory "remotely," copies memory from the host to the device, copies memory from the device to the device, then copies memory back from the device to the host. A condensed sketch of that sequence follows.
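Assuming an array of N floats (the real movearrays.cu may differ in names and sizes), the pattern looks like this:

    float *h_a = (float *)malloc(N * sizeof(float));   // "local" memory, on the host
    float *d_a, *d_b;
    cudaMalloc((void **)&d_a, N * sizeof(float));      // "remote" memory, on the device
    cudaMalloc((void **)&d_b, N * sizeof(float));

    cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice);    // host -> device
    cudaMemcpy(d_b, d_a, N * sizeof(float), cudaMemcpyDeviceToDevice);  // device -> device
    cudaMemcpy(h_a, d_b, N * sizeof(float), cudaMemcpyDeviceToHost);    // device -> host

    cudaFree(d_a); cudaFree(d_b); free(h_a);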

Compiling
nvcc movearrays.cu -o movearrays
Run it with ./movearrays
Wow! Did you see that output!
Now look at movearrays2.cu:
- An example showing data movement and our first look at arithmetic operations on the device.
- Compare to the same arithmetic operations on the host.

Coding it up
If you like C, you're golden at this point.
- Looking at the code, you see that the function that "does stuff" on the device, incrementArrayOnDevice(), is declared as __global__ void.
- Notice also the decomposition declarations: blocksize and the number of blocks. (A sketch of both appears after the next slide.)

Coding it up
- blockIdx.x is a built-in CUDA variable giving the index, along the grid's x axis, of the block that is executing this code.
- threadIdx.x is another built-in variable giving the index, along the block's x axis, of the thread this stream processor is executing within this particular block.
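Putting the two built-ins together, here is a sketch of what incrementArrayOnDevice() and its decomposition look like, consistent with the Dr. Dobb's example this code is based on (the blocksize of 4 is illustrative):

    __global__ void incrementArrayOnDevice(float *a, int N)
    {
        // Global index: which block we are in, times threads per block,
        // plus this thread's position within its block.
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)                 // guard: the last block may be partly empty
            a[idx] = a[idx] + 1.0f;
    }

    // Decomposition: enough blocks of `blocksize` threads to cover all N elements.
    int blocksize = 4;
    int nBlocks = N / blocksize + (N % blocksize == 0 ? 0 : 1);
    incrementArrayOnDevice<<<nBlocks, blocksize>>>(d_a, N);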

Coding it up
How do you know where to break the work up?
- Each multiprocessor has the ability to actively process multiple blocks at one time.
- How many depends on the number of registers per thread and how much shared memory per block a given kernel requires. In other words: "It depends!"
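You can at least query the limits that "it depends" on. A small sketch using the CUDA runtime's cudaGetDeviceProperties():

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0
        printf("registers per block:   %d\n", prop.regsPerBlock);
        printf("shared mem per block:  %lu\n", (unsigned long)prop.sharedMemPerBlock);
        printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("multiprocessors:       %d\n", prop.multiProcessorCount);
        return 0;
    }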

Coding it up
Getting CUDA working:
- Visit the download area on NVIDIA's site.
  - Download the CUDA driver.
  - Also download the CUDA toolkit.
  - You optionally want the CUDA SDK, meaning you really want the CUDA SDK.
- Installation of the driver is relatively straightforward on Windows XP/Vista and Linux x86/x86_64.

Coding it up
Getting CUDA working:
- Even if you don't have a supported NVIDIA card, you can download the toolkit.
  - The toolkit includes an NVIDIA emulator, which means you can use it in a classroom setting, where students have universal tools to develop off-host.
  - Build configurations: debug, debug-emu, release, release-emu.
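For what it's worth, in toolkits of this era an emulation build was requested with an nvcc flag; treat the exact spelling as an assumption to check against your toolkit's documentation:

    nvcc -deviceemu movearrays.cu -o movearrays_emu    # kernels run on the host CPU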

Coding it up
Getting CUDA working:
- More CUDA than you can shake a "fill out the surveys" stick at:
  /opt/cuda-sdk/projects
  /opt/cuda-sdk/bin/linux/release