Programming with CUDA WS 08/09 Lecture 8 Thu, 18 Nov, 2008

Previously
CUDA Runtime Component
–Common Component: data types, math functions, timing, textures
–Device Component: math functions, warp voting, atomic functions, synchronization functions, texturing
–Host Component: high-level runtime API, low-level driver API

Previously
CUDA Runtime Component
–Host Component APIs
  Mutually exclusive
  Runtime API is easier to program and hides some details from the programmer
  Driver API gives low-level control but is harder to program
  Both provide: device initialization, device management, streams and events

Today
CUDA Runtime Component
–Host Component APIs
  Both provide: management of memory and textures, OpenGL/Direct3D interoperability (NOT covered)
  Runtime API provides: emulation mode for debugging
  Driver API provides: management of contexts and modules, execution control
Final Projects

Host Runtime Component
Memory Management: Linear Memory
–CUDA Runtime API
  Declare: TYPE*
  Allocate: cudaMalloc, cudaMallocPitch
  Copy: cudaMemcpy, cudaMemcpy2D
  Free: cudaFree
–CUDA Driver API
  Declare: CUdeviceptr
  Allocate: cuMemAlloc, cuMemAllocPitch
  Copy: cuMemcpyHtoD/cuMemcpyDtoH, cuMemcpy2D
  Free: cuMemFree
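For concreteness, a minimal runtime-API sketch of this allocate/copy/free cycle; the array size and the omitted kernel launch are placeholders:

  #include <cuda_runtime.h>

  int main (void)
  {
      const int N = 256;
      float h_data[N];                                   // host array
      for (int i = 0; i < N; ++i) h_data[i] = (float)i;

      float *d_data;                                     // declare: TYPE*
      cudaMalloc ((void**)&d_data, N * sizeof (float));  // allocate
      cudaMemcpy (d_data, h_data, N * sizeof (float),    // copy host -> device
                  cudaMemcpyHostToDevice);

      // ... launch kernels that use d_data here ...

      cudaMemcpy (h_data, d_data, N * sizeof (float),    // copy device -> host
                  cudaMemcpyDeviceToHost);
      cudaFree (d_data);                                 // free
      return 0;
  }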

Host Runtime Component
Memory Management: Linear Memory
–Pitch (stride) – what one might expect, WRONG:

  // host code
  float *array2D;
  cudaMallocPitch ((void**)&array2D, width * sizeof (float), height); // WRONG: the pitch output argument is missing

  // device code
  int size = width * sizeof (float);
  for (int r = 0; r < height; ++r) {
      float *row = (float*) ((char*) array2D + r * size);  // WRONG: strides by the packed row width, not the pitch
      for (int c = 0; c < width; ++c)
          float element = row[c];
  }

Host Runtime Component
Memory Management: Linear Memory
–Pitch (stride) – CORRECT:

  // host code
  float *array2D;
  size_t pitch;
  cudaMallocPitch ((void**)&array2D, &pitch, width * sizeof (float), height);

  // device code
  for (int r = 0; r < height; ++r) {
      float *row = (float*) ((char*) array2D + r * pitch); // stride by the returned pitch
      for (int c = 0; c < width; ++c)
          float element = row[c];
  }

Host Runtime Component
Memory Management: Linear Memory
–Pitch (stride) – why?
  Allocation through the pitch functions pads each row appropriately for efficient transfers and copies
  The width of an allocated row may therefore exceed width*sizeof(float)
  The true row width in bytes is given by the pitch
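As a sketch of how such a pitched allocation is filled from the host, reusing array2D and pitch from the slides above; WIDTH and HEIGHT are assumed compile-time constants, and the host rows are tightly packed, so the source pitch is simply the row width:

  // host code
  float h_array[HEIGHT][WIDTH];                    // tightly packed host data (assumed)
  cudaMemcpy2D (array2D, pitch,                    // dst and dst pitch in bytes
                h_array, WIDTH * sizeof (float),   // src and src pitch in bytes
                WIDTH * sizeof (float), HEIGHT,    // row width in bytes, number of rows
                cudaMemcpyHostToDevice);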

Host Runtime Component
Memory Management: CUDA Arrays
–CUDA Runtime API
  Declare: cudaArray*
  Channel: cudaChannelFormatDesc, cudaCreateChannelDesc
  Allocate: cudaMallocArray
  Copy (from linear): cudaMemcpy2DToArray
  Free: cudaFreeArray
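A minimal sketch of this sequence for a width x height array of single floats, reusing array2D and pitch from the previous slides:

  cudaChannelFormatDesc desc = cudaCreateChannelDesc<float> ();  // one 32-bit float channel
  cudaArray *cuArray;
  cudaMallocArray (&cuArray, &desc, width, height);              // allocate the CUDA array

  cudaMemcpy2DToArray (cuArray, 0, 0,                            // dst array, x/y offset
                       array2D, pitch,                           // src linear memory and its pitch
                       width * sizeof (float), height,           // row width in bytes, #rows
                       cudaMemcpyDeviceToDevice);

  cudaFreeArray (cuArray);                                       // free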

Host Runtime Component
Memory Management: CUDA Arrays
–CUDA Driver API
  Declare: CUarray
  Channel: CUDA_ARRAY_DESCRIPTOR object
  Allocate: cuArrayCreate
  Copy (from linear): CUDA_MEMCPY2D object
  Free: cuArrayDestroy
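The driver-API counterpart as a sketch; devPtr and pitch are assumed to come from an earlier cuMemAllocPitch:

  CUDA_ARRAY_DESCRIPTOR desc;
  desc.Format = CU_AD_FORMAT_FLOAT;          // 32-bit float texels
  desc.NumChannels = 1;
  desc.Width = width;
  desc.Height = height;

  CUarray cuArray;
  cuArrayCreate (&cuArray, &desc);           // allocate

  CUDA_MEMCPY2D copy = { 0 };                // zero all fields first
  copy.srcMemoryType = CU_MEMORYTYPE_DEVICE;
  copy.srcDevice = devPtr;                   // CUdeviceptr source
  copy.srcPitch = pitch;
  copy.dstMemoryType = CU_MEMORYTYPE_ARRAY;
  copy.dstArray = cuArray;
  copy.WidthInBytes = width * sizeof (float);
  copy.Height = height;
  cuMemcpy2D (&copy);                        // copy from linear memory

  cuArrayDestroy (cuArray);                  // free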

Host Runtime Component
Memory Management
–Various other copy functions exist, e.g. from
  Linear memory to CUDA arrays
  Host to constant memory
–See the Reference Manual
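For example, a host-to-constant-memory copy in the runtime API looks like this sketch; coeffs and the host data are placeholders:

  __constant__ float coeffs[16];                 // device constant memory

  // host code
  void uploadCoeffs (const float *h_coeffs)      // h_coeffs: 16 assumed host values
  {
      cudaMemcpyToSymbol (coeffs, h_coeffs, 16 * sizeof (float));
  }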

Host Runtime Component
Texture Management
–Runtime API: the texture type derives from

  struct textureReference {
      int normalized;
      enum cudaTextureFilterMode filterMode;
      enum cudaTextureAddressMode addressMode[3];
      struct cudaChannelFormatDesc channelDesc;
  };

–normalized: 0 means false, any other value means true

Host Runtime Component
Texture Management
–filterMode:
  cudaFilterModePoint: no filtering, the returned value is that of the nearest texel
  cudaFilterModeLinear: filters 2/4/8 neighbors for 1D/2D/3D textures, floats only
–addressMode: one entry per coordinate (x,y,z)
  cudaAddressModeClamp, cudaAddressModeWrap (Wrap: normalized coordinates only)

Host Runtime Component
Texture Management
–channelDesc: describes the texel type

  struct cudaChannelFormatDesc {
      int x, y, z, w;
      enum cudaChannelFormatKind f;
  };

  x, y, z, w: number of bits per component
  f: cudaChannelFormatKindSigned, cudaChannelFormatKindUnsigned, cudaChannelFormatKindFloat

Host Runtime Component
Texture Management
–The textureReference attributes above (normalized, filterMode, addressMode) apply only to texture references bound to CUDA arrays

Host Runtime Component
Texture Management
–Binding a texture reference to a texture
  Runtime API:
  –Linear memory: cudaBindTexture
  –CUDA array: cudaBindTextureToArray
  Driver API:
  –Linear memory: cuTexRefSetAddress
  –CUDA array: cuTexRefSetArray
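Tying the texture slides together, a runtime-API sketch: a file-scope texture reference, a kernel that fetches through it, and the host-side bind to the CUDA array allocated earlier (cuArray, width are assumed from the previous slides):

  texture<float, 2, cudaReadModeElementType> texRef;   // file-scope texture reference

  __global__ void copyThroughTexture (float *out, int width)
  {
      int x = blockIdx.x * blockDim.x + threadIdx.x;
      int y = blockIdx.y * blockDim.y + threadIdx.y;
      out[y * width + x] = tex2D (texRef, x, y);       // read through the texture unit
  }

  // host code
  void bindTexture (cudaArray *cuArray)
  {
      texRef.filterMode = cudaFilterModePoint;         // attributes from the slides above
      texRef.addressMode[0] = cudaAddressModeClamp;
      texRef.addressMode[1] = cudaAddressModeClamp;
      cudaBindTextureToArray (texRef, cuArray);
  }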

Host Runtime Component
Runtime API: debugging using the emulation mode
–No native debug support for device code
–Code must be compiled either for device emulation or for execution; mixing is not allowed
–Device code is compiled for the host
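Concretely, device emulation was selected at compile time with an nvcc switch (true for the CUDA 2.x toolkits used in this course; the flag was dropped from later releases):

  nvcc -deviceemu myKernel.cu -o myKernel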

Host Runtime Component
Runtime API: debugging using the emulation mode
–Features
  Each CUDA thread is mapped to a host thread, plus one master thread
  Each thread gets 256 KB of stack

Host Runtime Component
Runtime API: debugging using the emulation mode
–Advantages
  Can use host debuggers
  Can use otherwise disallowed functions in device code, e.g. printf
  Device and host memory are both readable from either device or host code

Host Runtime Component
Runtime API: debugging using the emulation mode
–Advantages (continued)
  Any device- or host-specific function can be called from either device or host code
  The runtime detects incorrect use of synchronization functions

Host Runtime Component
Runtime API: debugging using the emulation mode
–Some errors may still remain hidden
  Memory access errors
  Out-of-context pointer operations
  Incorrect outcomes of warp vote functions, since the warp size is 1 in emulation mode
  Results of floating-point operations often differ between host and device

Host Runtime Component
Driver API: Context management
–A context encapsulates all resources and actions performed within the driver API
–Almost all CUDA functions operate in a context, except those dealing with
  Device enumeration
  Context management

Host Runtime Component
Driver API: Context management
–Each host thread can have only one current device context at a time
–Each host thread maintains a stack of current contexts
–cuCtxCreate()
  Creates a context
  Pushes it to the top of the stack
  Makes it the current context

Host Runtime Component
Driver API: Context management
–cuCtxPopCurrent()
  Detaches the current context from the host thread, making it "uncurrent"
  The context is now floating
  It can be pushed onto any host thread's stack

Host Runtime Component
Driver API: Context management
–Each context has a usage count
  cuCtxCreate creates a context with a usage count of 1
  cuCtxAttach increments the usage count
  cuCtxDetach decrements the usage count

Host Runtime Component
Driver API: Context management
–A context is destroyed when its usage count reaches 0
  cuCtxDetach, cuCtxDestroy
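A sketch of the whole context lifecycle described on these slides:

  #include <cuda.h>

  int main (void)
  {
      cuInit (0);                  // must precede any other driver API call

      CUdevice dev;
      cuDeviceGet (&dev, 0);       // first CUDA device

      CUcontext ctx;
      cuCtxCreate (&ctx, 0, dev);  // usage count 1, pushed onto this thread's stack

      // ... driver API work happens inside this context ...

      cuCtxPopCurrent (&ctx);      // context is now floating
      cuCtxPushCurrent (ctx);      // e.g. another thread could make it current

      cuCtxDetach (ctx);           // usage count drops to 0: context is destroyed
      return 0;
  }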

Host Runtime Component
Driver API: Module management
–Modules are dynamically loadable packages of device code and data output by nvcc
  Similar to DLLs

Host Runtime Component
Driver API: Module management
–Dynamically loading a module and accessing its contents:

  CUmodule cuModule;
  cuModuleLoad (&cuModule, "myModule.cubin");
  CUfunction cuFunction;
  cuModuleGetFunction (&cuFunction, cuModule, "myKernel");

Host Runtime Component
Driver API: Execution control
–Set kernel parameters
  cuFuncSetBlockShape()
  –#threads per block for the function
  –How thread IDs are assigned
  cuFuncSetSharedSize()
  –Size of shared memory
  cuParam*()
  –Specify other parameters for the next kernel launch

Host Runtime Component
Driver API: Execution control
–Launch the kernel
  cuLaunch(), cuLaunchGrid()
  –Example in the Programming Guide
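Putting the execution-control calls together, a hedged sketch of a full driver-API launch; cuFunction is the handle loaded on the module-management slide, devPtr and n are assumed kernel arguments, and argument alignment handling is simplified:

  // assumed kernel signature: __global__ void myKernel (float *data, int n)
  cuFuncSetBlockShape (cuFunction, 256, 1, 1);           // 256 threads per block
  cuFuncSetSharedSize (cuFunction, 0);                   // no dynamic shared memory

  int offset = 0;
  void *ptr = (void*)(size_t) devPtr;                    // devPtr: CUdeviceptr from cuMemAlloc
  cuParamSetv (cuFunction, offset, &ptr, sizeof (ptr));  // pointer argument
  offset += sizeof (ptr);
  cuParamSeti (cuFunction, offset, n);                   // int argument
  offset += sizeof (int);
  cuParamSetSize (cuFunction, offset);                   // total size of all arguments

  cuLaunchGrid (cuFunction, 64, 1);                      // launch a 64 x 1 grid of blocks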

Final Projects
Ideas?
–DES cracker
–Image editor
  Resize and smooth an image
  Gamut mapping?
–3D shape matching

All for today
Next time
–Memory and instruction optimizations

On to exercises!