Device Routines and device variables


Device Routines and device variables

These notes introduce:
- Declaring routines to be executed on the device and on the host
- Declaring local variables on the device

ITCS 4/5145 Parallel Programming, B. Wilkinson, July 11, 2012. CUDADeviceRoutines.ppt

CUDA extensions to declare kernel routines

Host = CPU, Device = GPU

__global__ indicates a routine that can only be called from the host and is executed only on the device
__device__ indicates a routine that can only be called from the device and is executed only on the device
__host__ indicates a routine that can only be called from the host and is executed only on the host (generally only used in combination with __device__, see later)

Note the two underscores on each side of each qualifier.

Note: a kernel cannot call a routine to be executed on the host.
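As a minimal sketch (the function names here are illustrative, not from the original notes), the three qualifiers look like this in source code:

__global__ void myKernel(int *data) { }           // launched from host, runs on device
__device__ int deviceHelper(int x) { return x; }  // callable only from device code
__host__ int hostOnly(int x) { return x; }        // ordinary host function (the default)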

So far we have seen __global__:

#define N 10

__global__ void add(int *a, int *b, int *c) {   // executed on device, called from host
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) c[tid] = a[tid] + b[tid];
}

int main(int argc, char *argv[]) {
    int T = 10, B = 1;   // threads per block and blocks per grid
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    cudaMalloc((void**)&dev_a, N*sizeof(int));
    cudaMalloc((void**)&dev_b, N*sizeof(int));
    cudaMalloc((void**)&dev_c, N*sizeof(int));

    cudaMemcpy(dev_a, a, N*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N*sizeof(int), cudaMemcpyHostToDevice);

    add<<<B,T>>>(dev_a, dev_b, dev_c);   // asynchronous: returns before the kernel completes

    cudaMemcpy(c, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;
}

__global__ routines must have a void return type. Why? The kernel launch is asynchronous: it returns before the kernel completes, so there is no way to collect a return value.

Routines to be executed on device

Generally, C library routines cannot be called from device code.

However, CUDA provides math routines for the device that are equivalent to the standard C math routines with the same names, so in practice math routines such as sin(x) can be called. Check the CUDA documentation* before use.

CUDA also provides GPU-only intrinsic versions that are faster but less accurate (their names begin with __).*

* See the NVIDIA CUDA C Programming Guide for details.
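For example (a small sketch; the kernel name and parameters are illustrative), a kernel can use either the standard-named device routine or the faster intrinsic:

__global__ void computeSines(float *out, float *in, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        out[tid] = sinf(in[tid]);        // standard device math routine
        // out[tid] = __sinf(in[tid]);   // intrinsic: faster, less accurate
    }
}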

__device__ routines

__device__ void swap(int *x, int *y) {
    int temp = *x;
    *x = *y;
    *y = temp;
}

__global__ void gpu_sort(int *a, int *b, int N) {
    …
    swap(&list[m], &list[j]);   // called from device code
    …
}

int main(int argc, char *argv[]) {
    …
    gpu_sort<<<B,T>>>(dev_a, dev_b, N);
    …
    return 0;
}

Recursion is possible in __device__ routines on devices of compute capability 2.x and later.
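For instance, a recursive __device__ routine compiles and runs on compute capability 2.0 and above (a hypothetical factorial, as a sketch):

__device__ int factorial(int n) {
    if (n <= 1) return 1;
    return n * factorial(n - 1);   // device-side recursion, compute capability 2.x+
}

__global__ void computeFactorial(int *result, int n) {
    *result = factorial(n);
}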

Routines executable on both host and device

The __device__ and __host__ qualifiers can be used together. The routine is then callable and executable on both host and device, and will be compiled for both.

This feature might be used to create code that optionally uses a GPU, or for test purposes. Generally such a routine will need statements that differentiate between host and device (see the __CUDA_ARCH__ macro below).

Note: the __global__ and __host__ qualifiers cannot be used together.

__CUDA_ARCH__ macro

Defined during device compilation to indicate the compute capability being compiled for (it is undefined when compiling host code). It can be used to create different paths through device code for different capabilities.

__CUDA_ARCH__ = 100 for compute capability 1.0
__CUDA_ARCH__ = 110 for compute capability 1.1
…

Example

__host__ __device__ void func() {
#ifdef __CUDA_ARCH__
    …   // device code
#else
    …   // host code
#endif
}

Specific compute capabilities could also be selected by testing the value of __CUDA_ARCH__, as in the sketch below.
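For example (a sketch; the function name and the chosen threshold are illustrative):

#include <math.h>   // for host-side sinf()

__host__ __device__ float fastSin(float x) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 200
    return __sinf(x);   // device path, compute capability 2.0 and above
#elif defined(__CUDA_ARCH__)
    return sinf(x);     // device path, earlier compute capabilities
#else
    return sinf(x);     // host path
#endif
}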

Declaring local variables for host and for device

Local variables on host

In C, the scope of a variable is the block it is declared in, and does not extend to routines called from that block. If the scope is to include main and everything within it, including called routines, place the declaration outside main:

#include <stdio.h>
#include <stdlib.h>

int cpuA[10];   // visible to all routines in this file
...

void clearArray() {
    for (int i = 0; i < 10; i++)
        cpuA[i] = 0;
}

void setArray(int n) {
    for (int i = 0; i < 10; i++)
        cpuA[i] = n;
}

int main(int argc, char *argv[]) {
    …
    clearArray();
    setArray(N);
    …
    return 0;
}

Declaring local kernel variables

Declare the variable outside main, but with __device__ (here used as a variable type qualifier rather than a function type qualifier). Without further qualification, the variable resides in global (GPU) memory and is accessible by all threads:

#include <stdio.h>
#include <stdlib.h>

__device__ int gpuA[10];   // in GPU global memory, accessible by all threads
...

__global__ void clearArray() {
    for (int i = 0; i < 10; i++)
        gpuA[i] = 0;
}

__global__ void setArray(int n) {
    for (int i = 0; i < 10; i++)
        gpuA[i] = n;
}

int main(int argc, char *argv[]) {
    …
    clearArray<<<…>>>();
    setArray<<<…>>>(N);
    …
    return 0;
}

Accessing kernel variables from host

A __device__ variable is accessible from the host using the routines:

cudaMemcpyToSymbol() to copy to the device variable
cudaMemcpyFromSymbol() to copy from the device variable

The device variable itself is passed as the symbol argument. (Early CUDA versions also accepted the variable's name as a string; that form has been removed from current CUDA.)

int main(int argc, char *argv[]) {
    int cpuA[10];
    …
    cudaMemcpyFromSymbol(cpuA, gpuA, sizeof(cpuA), 0, cudaMemcpyDeviceToHost);
    // contents of device gpuA[] copied into host cpuA[]
    …
    return 0;
}
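Copying in the other direction is symmetric; a minimal sketch (the initializer values are illustrative):

int hostInit[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
cudaMemcpyToSymbol(gpuA, hostInit, sizeof(hostInit), 0, cudaMemcpyHostToDevice);
// contents of host hostInit[] copied into device gpuA[]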

Example of both local host and device variables

#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>

int cpu_hist[10];              // accessible on cpu only
__device__ int gpu_hist[10];   // accessible on gpu only

void cpu_histogram(int *a, int N) {
    …   // result in cpu_hist[]
}

__global__ void gpu_histogram(int *a, int N) {
    …   // result in gpu_hist[]
}

int main(int argc, char *argv[]) {
    …
    gpu_histogram<<<B,T>>>(dev_a, N);   // computed on gpu
    cpu_histogram(a, N);                // computed on cpu
    …
    return 0;
}
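The elided kernel body could look like the following sketch (assuming each value in a[] falls in the range 0..9; atomicAdd() prevents races between threads updating the same bin):

__global__ void gpu_histogram(int *a, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N)
        atomicAdd(&gpu_hist[a[tid]], 1);   // safe concurrent increment of the bin
}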

Questions