CS179: GPU Programming Lecture 8: More CUDA Runtime

Today
- CUDA arrays for textures
- CUDA runtime
- Helpful CUDA functions

CUDA Arrays  Recall texture memory  Used to store large data  Stored on GPU  Accessible to all blocks, threads

CUDA Arrays  Used Texture memory for buffers (lab 3)  Allows vertex data to remain on GPU  How else can we access texture memory?  CUDA arrays

CUDA Arrays  Why CUDA arrays over normal arrays?  Better caching, 2D caching  Spatial locality  Supports wrapping/clamping  Supports filtering

CUDA Linear Textures
- "Textures," but stored in global memory
- Usage (see the sketch below):
  - Step 1: Create a texture reference
    - texture<TYPE> tex;
    - TYPE = float, float4, int, etc. (1-, 2-, or 4-component types)
  - Step 2: Bind memory to the texture reference
    - cudaBindTexture(offset, tex, devPtr, size);
  - Step 3: Read data on the device via tex1Dfetch
    - tex1Dfetch(tex, x);
    - x is the element index we want to read, not a byte offset!
  - Step 4: Clean up when finished
    - cudaUnbindTexture(tex);
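Here is a minimal end-to-end sketch of the four steps (the kernel, buffer names, and sizes are illustrative, and this legacy texture-reference API is deprecated in recent CUDA releases):

// Step 1: texture reference at file scope (1D float elements)
texture<float> tex;

// Step 3: threads read through the texture cache
__global__ void scale(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * tex1Dfetch(tex, i);  // i is an element index
}

int main() {
    const int n = 1024;
    float *devIn, *devOut;
    cudaMalloc(&devIn, n * sizeof(float));
    cudaMalloc(&devOut, n * sizeof(float));

    // Step 2: bind the global-memory buffer to the texture reference
    cudaBindTexture(NULL, tex, devIn, n * sizeof(float));

    scale<<<(n + 255) / 256, 256>>>(devOut, n);

    // Step 4: unbind once we are done reading through the texture
    cudaUnbindTexture(tex);
    cudaFree(devIn);
    cudaFree(devOut);
    return 0;
}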

CUDA Linear Textures
- Texture reference properties:
  - texture<type, dim, mode> texRef;
  - type = float, int, float4, etc.
  - dim = # of dimensions (1, 2, or 3)
  - mode =
    - cudaReadModeElementType: standard read
    - cudaReadModeNormalizedFloat: maps integers to floats, e.g. 0 -> 0.0, 255 -> 1.0
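For example, a 2D texture of floats read back unchanged (a sketch; the second and third template arguments are optional and default to 1 and cudaReadModeElementType):

texture<float, 2, cudaReadModeElementType> texRef;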

CUDA Linear Textures
- Important warning:
  - Textures live in a global space of memory
  - Threads can read the texture while others write the underlying buffer
  - This can cause synchronization problems!
  - Do not rely on thread running order, ever

CUDA Linear Textures
- Other limitations:
  - Only 1D, which can make indexing and caching less convenient
  - Pitch may not be ideal for a 2D array
  - Not read-write: the texture itself is read-only
  - Solution: CUDA arrays

CUDA Arrays  Live in texture memory space  Access via texture fetches

CUDA Arrays  Step 1: Create channel description  Tells us texture attributes  cudaCreateChannelDesc(int x, int y, int z, int w, enum mode)  x, y, z, w are number of bytes per component  mode is cudaChannelFormatKindFloat, etc.

CUDA Arrays  Step 2: Allocate memory  Must be done dynamically  Use cudaMallocArray(cudaArray **array, struct desc, int size)  Most global memory functions work with CUDA arrays too  cudaMemcpyToArray, etc.

CUDA Arrays  Step 3: Create texture reference  texture texRef -- just as before  Parameters must match channel description where applicable  Step 4: Edit texture settings  Settings are encoded as texRef struct members

CUDA Arrays  Step 5: Bind the texture reference to array  cudaBindTextureToArray(texRef, array)  Step 6: Access texture  Similar to before, now we have more options:  tex1DFetch(texRef, x)  tex2DFetch(texRef, x, y)

CUDA Arrays  Final Notes:  Coordinates can be normalized to [0, 1] if in float mode  Filter modes: nearest point or linear  Tells CUDA how to blend texture  Wrap vs. clamp:  Wrap: out of bounds accesses wrap around to other side  Ex.: (1.5, 0.5) -> (0.5, 0.5)  Clamp: out of bounds accesses set to border value  Ex.: (1.5, 0.5) -> (1.0, 0.5)

CUDA Arrays
[Figure: point sampling vs. linear sampling]

CUDA Arrays
[Figure: wrap vs. clamp addressing]

CUDA Runtime  Nothing new, every function cuda____ is part of the runtime  Lots of other helpful functions  Many runtime functions based on making your program robust  Check properties of card, set up multiple GPUs, etc.  Necessary for multi-platform development!

CUDA Runtime  Starting the runtime:  Simply call a cuda_____ function!  CUDA can waste a lot of resources  Stop CUDA with cudaThreadExit()  Called automatically on CPU exit, but you may want to call earlier

CUDA Runtime  Getting devices and properties:  cudaGetDeviceCount(int * n);  Returns # of CUDA-capable devices  Can use to check if machine is CUDA-capable!  cudaSetDevice(int n)  Sets device n to the currently used device  cudaGetDeviceProperties(struct *devProp prop, int n);  Loads data from device n into prop

Device Properties  char name[256]: ASCII identifier of GPU  size_t totalGlobalMem: Total global memory available  size_t sharedMemPerBlock: Shared memory available per multiprocessor  int regsPerBlock: How many registers we have per block  int warpSize: size of our warps  size_t memPitch: maximum pitch allowed for array allocation  int maxThreadsPerBlock: maximum number of threads/block  int maxThreadsDim[3]: maximum sizes of a block

Device Properties  int maxGridSize[3]: maximum grid sizes  size_t totalConstantMemory: maximum available constant memory  int major, int minor: major and minor versions of CUDA support  int clockRate: clock rate of device in kHz  size_t textureAlignment: memory alignment required for textures  int deviceOverlap: Does this device allow for memory copying while kernel is running? (0 = no, 1 = yes)  int multiprocessorCount: # of multiprocessors on device

Device Properties  Uses?  Actually get values for memory, instead of guessing  Program to be accessible for multiple systems  Can get the best device

Device Properties  Getting the best device:  Pick a metric (Ex.: most multiprocessors could be good) int num_devices, device; cudaGetDeviceCount(&num_devices); if (num_devices > 1) { int max_mp = 0, best_device = 0; for (device = 0; device < num_devices; device++) { cudaDeviceProp prop; cudaGetDeviceProperties(&prop, device); int mp_count = prop.multiProcessorCount; if (mp_count > max_mp) { max_mp = mp_count; best_device = device; } cudaSetDevice(best_device); }

Device Properties  We can also use this to launch multiple GPUs  Each GPU must have its own host thread  Multithread on CPU, each thread calls different device  Set device on thread using cudaSetDevice(n);

CUDA Runtime  Synchronization Note:  Most calls to GPU/CUDA are asynchronous  Some are synchonous (usually things dealing with memory)  Can force synchronization:  cudaThreadSynchronize()  Blocks until all devices are done  Good for error checking, timing, etc.

CUDA Events
- Great for timing!
- Can place event markers in CUDA streams to measure time
- Example code:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
// DO SOME GPU CODE HERE
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float elapsed_time;  // in milliseconds
cudaEventElapsedTime(&elapsed_time, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);

CUDA Streams
- Streams manage concurrency and ordering
  - Ex.: call malloc, then kernel 1, then kernel 2, etc.
- Calls in different streams are asynchronous with respect to each other!
  - Don't know when each stream is where in the code

Using Streams  Create stream  cudaStreamCreate(cudaStream_t *stream)  Copy memory using async calls:  cudaMemcpyAsync(…, cudaStream_t stream)  Call in kernel as another parameter:  kernel >>  Query if stream is done:  cudaStreamQuery(cudaStream_t stream)  returns cudaSuccess if stream is done, cudaErrorNotReady otherwise  Block process until a stream is done:  cudaStreamSynchronize(cudaStream_t stream)  Destroy stream & cleanup:  cudaStreamDestroy(cudaStream_t stream)

Using Streams  Example: cudaStream_t stream[2]; for (int i = 0; i < 2; ++i) cudaStreamCreate(&stream[i]); for (int i = 0; i < 2; ++i) cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size, size, cudaMemcpyHostToDevice, stream[i]); for (int i = 0; i < 2; ++i) myKernel >>(outputDevPtr + i * size, inputDevPtr + i * size, size); for (int i = 0; i < 2; ++i) cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size, size, cudaMemcpyDeviceToHost, stream[i]); cudaThreadSynchronize();

Next Time
- Lab 4 recitation:
  - 3D textures
  - Pixel Buffer Objects (PBOs)
  - Fractals!