ECE408 / CS483 Applied Parallel Programming
Lecture 3: Introduction to CUDA C (Part 2)
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2012, ECE408/CS483, University of Illinois, Urbana-Champaign
Update – TA Office Hours
TA office hours will be held at the following times in CSL 369:
–Wednesday 2:00 p.m. - 3:00 p.m.
–Friday 11:00 a.m. - 12:00 p.m.
More TA office hours will be added on an as-needed basis.
Tentative Plan for Labs
Lab 0: Account Setup – Released: Thu, Aug 30 (Week 1); Due: N/A
Lab 1: Vector Add – Released: Tue, Sep 4 (Week 2); Due: Tue, Sep 11 (Week 3)
Lab 2: Basic Matrix Multiplication – Released: Tue, Sep 11 (Week 3); Due: Tue, Sep 18 (Week 4)
Lab 3: Tiled Matrix Multiplication – Released: Tue, Sep 18 (Week 4); Due: Tue, Sep 25 (Week 5)
Lab 4: Tiled Convolution – Released: Tue, Sep 25 (Week 5); Due: Tue, Oct 9 (Week 7)
Lab 5: Reduction & Prefix Scan – Released: Tue, Oct 9 (Week 7); Due: Tue, Oct 23 (Week 9)
Lab 6: Histogramming – Released: Tue, Oct 23 (Week 9); Due: Tue, Nov 6 (Week 11)
Project:
–Release & kickoff lecture: Tue, Oct 23 (Week 9)
–Proposals for non-competition projects due: TBD
–Progress report: TBD (if any)
–Presentations: TBD
Objective
To learn about data parallelism and the basic features of CUDA C that enable exploitation of data parallelism:
–Hierarchical thread organization
–Main interfaces for launching parallel execution
–Mapping from thread indexes to data indexes
CUDA C / OpenCL – Execution Model
An integrated host+device application is a single C program:
–Serial or modestly parallel parts run in host C code
–Highly parallel parts run in device SPMD kernel C code

Serial Code (host)
...
Parallel Kernel (device): KernelA<<<nBlk, nTid>>>(args);
Serial Code (host)
...
Parallel Kernel (device): KernelB<<<nBlk, nTid>>>(args);
Heterogeneous Computing vecAdd – CUDA Host Code

void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;
    ...
    // Part 1: Allocate device memory for A, B, and C;
    //         copy A and B to device memory
    // Part 2: Kernel launch code – to have the device
    //         perform the actual vector addition
    // Part 3: Copy C from the device memory;
    //         free device vectors
}

Parts 1 and 3 move data between CPU host memory and GPU device memory; Part 2 runs on the GPU.
Partial Overview of CUDA Memories
Device code can:
–R/W per-thread registers
–R/W all-shared global memory
Host code can:
–Transfer data to/from per-grid global memory
We will cover more later.

[Figure: a device grid with Block (0,0) and Block (1,0), each holding Thread (0,0) and Thread (1,0) with per-thread registers; all blocks share global memory, which the host accesses.]
CUDA Device Memory Management API Functions
cudaMalloc()
–Allocates an object in the device global memory
–Two parameters:
  Address of a pointer to the allocated object
  Size of the allocated object in bytes
cudaFree()
–Frees an object from device global memory
–One parameter: pointer to the freed object
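The slide's "address of a pointer" parameter is the part that most often trips people up. As an illustrative stand-in (this is plain C built on malloc, not the CUDA API), the sketch below shows why the callee needs a pointer-to-pointer: it must write the newly allocated address back into the caller's variable, just as cudaMalloc does with (void **) &d_A.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for cudaMalloc's calling convention (NOT the CUDA API):
 * the callee writes the new address back through the first argument,
 * which is why the caller passes the ADDRESS of its pointer. */
static int my_alloc(void **devPtr, size_t size)
{
    *devPtr = malloc(size);      /* cudaMalloc would allocate on the device */
    return (*devPtr == NULL);    /* 0 on success, mirroring cudaSuccess */
}

/* Stand-in for cudaFree: takes the pointer itself, not its address. */
static void my_free(void *devPtr)
{
    free(devPtr);
}
```

Usage mirrors the slide: `my_alloc((void **)&d_A, size);` followed later by `my_free(d_A);`. If my_alloc took a plain `void *`, it could only modify a copy of the caller's pointer and d_A would stay uninitialized.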
Host-Device Data Transfer API Functions
cudaMemcpy()
–Memory data transfer
–Requires four parameters:
  Pointer to destination
  Pointer to source
  Number of bytes copied
  Type/direction of transfer
–cudaMemcpy() is synchronous with respect to the host; use cudaMemcpyAsync() for asynchronous transfers
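To make the four-parameter shape concrete, here is a host-only sketch (again a stand-in, not the CUDA API) in which plain memcpy plays the role of the transfer and an enum plays the role of the direction tag; on real hardware the direction tells the runtime which way to move data across the PCIe bus, but both "memories" here are host memory.

```c
#include <assert.h>
#include <string.h>

/* Illustrative direction tags, shaped like cudaMemcpyHostToDevice /
 * cudaMemcpyDeviceToHost (hypothetical names, not the real enum). */
typedef enum { MyMemcpyHostToDevice, MyMemcpyDeviceToHost } myMemcpyKind;

/* Stand-in for cudaMemcpy's four-parameter signature:
 * destination, source, byte count, direction of transfer. */
static void my_memcpy(void *dst, const void *src, size_t count, myMemcpyKind kind)
{
    (void)kind;              /* the direction only matters on a real device */
    memcpy(dst, src, count);
}
```

A call such as `my_memcpy(d_A, h_A, size, MyMemcpyHostToDevice);` has exactly the argument order the slide lists: destination first, then source, then the size in bytes, then the direction.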
void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    // 1. Allocate device memory for A and B;
    //    transfer A and B to device memory
    cudaMalloc((void **) &d_A, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_B, size);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Allocate device memory for C
    cudaMalloc((void **) &d_C, size);

    // 2. Kernel invocation code – to be shown later
    ...

    // 3. Transfer C from device to host
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device memory for A, B, C
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
Example: Vector Addition Kernel – Device Code

// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__
void vecAddKernel(float* A, float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

int vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
    // d_A, d_B, d_C allocations and copies omitted
    // Run ceil(n/256.0) blocks of 256 threads each
    vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
}
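The kernel above can be understood by "unrolling" the launch on the host: every (block, thread) pair computes one global index i, and the `i < n` guard lets the last, partially-used block run harmlessly past the end of the data. The following plain-C emulation (not CUDA, just a sketch of the same index arithmetic with nested loops standing in for blockIdx.x and threadIdx.x) makes that mapping explicit.

```c
#include <assert.h>

#define BLOCK 256   /* threads per block, matching the slide's launch */

/* Host-side emulation of vecAddKernel<<<ceil(n/256.0), 256>>>(A, B, C, n):
 * the outer loop plays blockIdx.x, the inner loop plays threadIdx.x. */
static void vecAddEmulated(const float *A, const float *B, float *C, int n)
{
    int gridDim = (n + BLOCK - 1) / BLOCK;          /* integer form of ceil(n/256.0) */
    for (int blockIdx = 0; blockIdx < gridDim; blockIdx++)
        for (int threadIdx = 0; threadIdx < BLOCK; threadIdx++) {
            int i = threadIdx + BLOCK * blockIdx;   /* thread-to-data index mapping */
            if (i < n)                              /* guard: extra threads in the
                                                       last block do nothing */
                C[i] = A[i] + B[i];
        }
}
```

With n = 300, for example, the launch creates 2 blocks (512 threads); threads 300-511 fail the `i < n` test and write nothing.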
Example: Vector Addition Kernel – Host Code

// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__
void vecAddKernel(float* A_d, float* B_d, float* C_d, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C_d[i] = A_d[i] + B_d[i];
}

int vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
    // d_A, d_B, d_C allocations and copies omitted
    // Run ceil(n/256.0) blocks of 256 threads each
    vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
}
More on Kernel Launch – Host Code

int vecAdd(float* A, float* B, float* C, int n)
{
    // A_d, B_d, C_d allocations and copies omitted
    // Run ceil(n/256) blocks of 256 threads each
    dim3 DimGrid(n/256, 1, 1);
    if (n % 256) DimGrid.x++;
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
}

This makes sure that there are enough threads to cover all elements.
Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking.
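The slide's two-step block count (integer-divide, then increment if there is a remainder) is equivalent to the common one-line ceiling trick in integer arithmetic; the sketch below checks both forms side by side, assuming the slide's block size of 256.

```c
#include <assert.h>

/* The slide's form: n/256 blocks, plus one more if n is not a multiple. */
static int blocksBySlide(int n)
{
    int blocks = n / 256;
    if (n % 256) blocks++;
    return blocks;
}

/* Equivalent integer "ceiling division": (n + 255) / 256. */
static int blocksByCeilTrick(int n)
{
    return (n + 255) / 256;
}
```

For n = 1000 both give 4 blocks (1024 threads, of which the last 24 are masked off by the kernel's `i < n` test); for n = 256 both give exactly 1 block. The one-liner avoids the branch, but either is fine for a host-side launch computation.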
Kernel Execution in a Nutshell

__host__
void vecAdd(float* A_d, float* B_d, float* C_d, int n)
{
    dim3 DimGrid(ceil(n/256.0), 1, 1);
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
}

__global__
void vecAddKernel(float *A_d, float *B_d, float *C_d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C_d[i] = A_d[i] + B_d[i];
}

[Figure: the kernel's blocks Blk 0 … Blk N-1 are scheduled onto the GPU's multiprocessors M0 … Mk, which share device RAM.]
More on CUDA Function Declarations

                                     Executed on:   Only callable from the:
__device__ float DeviceFunc()        device         device
__global__ void  KernelFunc()        device         host
__host__   float HostFunc()          host           host

–__global__ defines a kernel function
–Each "__" consists of two underscore characters
–A kernel function must return void
–__device__ and __host__ can be used together
LET'S TAKE A LOOK AT A REAL PIECE OF CUDA CODE
QUESTIONS?