ME964 High Performance Computing for Engineering Applications


ME964 High Performance Computing for Engineering Applications CUDA API Sept. 18, 2008

Before we get started…

Last time:
- The CUDA API
- Started discussing the CUDA programming model

Today:
- HW2 is due on Friday, 11:59 PM
- HW3 is available on the class website (due Sept. 25)
- A look at a matrix multiplication example
- The CUDA execution model

Going back to the G80 HW…

split personality, n. Two distinct personalities in the same entity, each of which prevails at a particular time.

HK-UIUC

Calling a Kernel Function, and the Concept of Execution Configuration (last slide of the previous lecture)

A kernel function must be called with an execution configuration:

__global__ void KernelFunc(...);                          // declaration
dim3 DimGrid(100, 50);                                    // 5000 thread blocks
dim3 DimBlock(4, 8, 8);                                   // 256 threads per block
size_t SharedMemBytes = 64;                               // 64 bytes of shared memory
KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

Any call to a kernel function is asynchronous (from CUDA 1.0 on); explicit synchronization is needed if blocking behavior is desired.

HK-UIUC
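As a side note (not part of the original slide), here is a minimal, self-contained sketch of how an execution configuration maps onto the built-in blockIdx, blockDim, and threadIdx variables. The kernel name and the output buffer are made up for illustration; the synchronization call is the modern runtime name (cudaThreadSynchronize in CUDA versions contemporary with these slides).

#include <cuda_runtime.h>

// Hypothetical kernel: each thread stores its own global 1D index.
__global__ void WriteGlobalThreadId(int* out, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // global column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // global row
    out[y * width + x] = y * width + x;              // one element per thread
}

int main(void)
{
    dim3 DimBlock(4, 8);                  // 32 threads per block
    dim3 DimGrid(2, 2);                   // 2 x 2 = 4 blocks
    int width  = DimGrid.x * DimBlock.x;  // 8 threads along x
    int height = DimGrid.y * DimBlock.y;  // 16 threads along y

    int* d_out;
    cudaMalloc((void**)&d_out, width * height * sizeof(int));
    WriteGlobalThreadId<<<DimGrid, DimBlock>>>(d_out, width);
    cudaDeviceSynchronize();              // launches are asynchronous, so block here
    cudaFree(d_out);
    return 0;
}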

G80 Thread Computing Pipeline
- The future of GPUs is programmable processing
- So: build the architecture around the processor
- Processors execute computing threads
- Alternative operating mode specifically for computing

[Block diagram of the G80 in compute mode: Host and Input Assembler feed a Thread Execution Manager, which dispatches work to arrays of thread processors (SP) with texture fetch (TF), L1, and Parallel Data Cache; load/store paths go through L2 and the frame buffer (FB) to Global Memory. The graphics-mode pipeline (Vtx/Geom/Pixel Thread Issue, Setup/Rstr/ZCull) is shown for contrast.]

HK-UIUC

Simple Example: Matrix Multiplication

A straightforward matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs:
- Leave shared memory usage until later
- Local variable and register usage
- Thread ID usage
- Memory data transfer API between host and device

HK-UIUC

Square Matrix Multiplication Example
- P = M * N, all of size WIDTH x WIDTH
- Software design decisions:
  - One thread handles one element of P
  - Each thread will access entries in M and N WIDTH times from global memory

[Figure: the WIDTH x WIDTH matrices M, N, and P.]

HK-UIUC

Step 1: Matrix Data Transfers (Host Side)

// Allocate the device memory where we will copy M to
// (although not shown, you do the same for the N and P matrices)
Matrix Md;
#define WIDTH 16
Md.width  = WIDTH;
Md.height = WIDTH;
Md.pitch  = WIDTH;
int size = WIDTH * WIDTH * sizeof(float);
cudaMalloc((void**)&Md.elements, size);

// Copy M from the host to the device
cudaMemcpy(Md.elements, M.elements, size, cudaMemcpyHostToDevice);

... // do your work here… (see next slides)

// Read the result from the device back into P on the host
cudaMemcpy(P.elements, Pd.elements, size, cudaMemcpyDeviceToHost);

// Free device memory
cudaFree(Md.elements);

HK-UIUC
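The slides use a Matrix struct without ever showing its definition. Based on the fields accessed above (width, height, pitch, elements), a plausible definition, given here only as an assumption, is:

// Assumed layout of the Matrix struct used throughout these slides.
typedef struct {
    int width;        // number of columns
    int height;       // number of rows
    int pitch;        // row pitch in elements (equal to width in these examples)
    float* elements;  // pointer to the data, on either the host or the device
} Matrix;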

Step 2: Matrix Multiplication, a Simple Host Version in C

// Matrix multiplication on the (CPU) host, in double precision
void MatrixMulOnHost(const Matrix M, const Matrix N, Matrix P)
{
    for (int i = 0; i < M.height; ++i) {
        for (int j = 0; j < N.width; ++j) {
            double sum = 0;
            for (int k = 0; k < M.width; ++k) {
                double a = M.elements[i * M.width + k];   // you'll see a lot of this…
                double b = N.elements[k * N.width + j];   // and of this as well…
                sum += a * b;
            }
            P.elements[i * N.width + j] = sum;
        }
    }
}

HK-UIUC

Step 3: Matrix Multiplication, Host-Side Main Program Code

int main(void)
{
    // Allocate and initialize the matrices.
    // The last argument of AllocateMatrix: should the matrix be initialized
    // with random numbers? Yes: 1. No: 0 (everything is set to zero).
    Matrix M = AllocateMatrix(WIDTH, WIDTH, 1);
    Matrix N = AllocateMatrix(WIDTH, WIDTH, 1);
    Matrix P = AllocateMatrix(WIDTH, WIDTH, 0);

    // M * N on the device
    MatrixMulOnDevice(M, N, P);

    // Free matrices
    FreeMatrix(M);
    FreeMatrix(N);
    FreeMatrix(P);

    return 0;
}

HK-UIUC
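AllocateMatrix comes from the course's helper code and is not reproduced on the slides; FreeMatrix appears later, on the Step 5 slide. A hedged host-side sketch of what AllocateMatrix might look like, using the Matrix layout assumed earlier, is:

#include <stdlib.h>

// Hypothetical helper: allocate a host matrix and fill it with random
// values (init == 1) or zeros (init == 0).
Matrix AllocateMatrix(int height, int width, int init)
{
    Matrix M;
    M.width  = width;
    M.height = height;
    M.pitch  = width;
    M.elements = (float*) malloc(width * height * sizeof(float));
    for (int i = 0; i < width * height; ++i)
        M.elements[i] = init ? (rand() / (float) RAND_MAX) : 0.0f;
    return M;
}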

Step 3 (continued): Matrix Multiplication, Host-Side Code

// Matrix multiplication on the device
void MatrixMulOnDevice(const Matrix M, const Matrix N, Matrix P)
{
    // Load M and N to the device
    Matrix Md = AllocateDeviceMatrix(M);
    CopyToDeviceMatrix(Md, M);
    Matrix Nd = AllocateDeviceMatrix(N);
    CopyToDeviceMatrix(Nd, N);

    // Allocate P on the device
    Matrix Pd = AllocateDeviceMatrix(P);
    CopyToDeviceMatrix(Pd, P);   // clear memory

    // Set up the execution configuration
    dim3 dimGrid(1, 1);
    dim3 dimBlock(WIDTH, WIDTH);

    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd);

    // Read P from the device
    CopyFromDeviceMatrix(P, Pd);

    // Free device matrices
    FreeDeviceMatrix(Md);
    FreeDeviceMatrix(Nd);
    FreeDeviceMatrix(Pd);
}

HK-UIUC

Multiply Using One Thread Block
- One block of threads computes matrix P; each thread computes one element of P
- Each thread:
  - Loads a row of matrix M
  - Loads a column of matrix N
  - Performs one multiply and one addition for each pair of M and N elements
- The compute to off-chip memory access ratio is close to 1:1 (each inner-loop iteration does one multiply and one add for two global loads), which is not that good…
- The size of the matrix is limited by the number of threads allowed in a thread block

[Figure: Grid 1 contains a single block of BLOCK_SIZE x BLOCK_SIZE threads; Thread (2, 2) reads a row of M and a column of N to produce one element of P.]

HK-UIUC

Step 4: Matrix Multiplication, Device-Side Kernel Function

// Matrix multiplication kernel – thread specification
__global__ void MatrixMulKernel(Matrix M, Matrix N, Matrix P)
{
    // 2D thread ID
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Pvalue is used to store the element of the result matrix
    // that is computed by this thread
    float Pvalue = 0;

    for (int k = 0; k < M.width; ++k) {
        float Melement = M.elements[ty * M.pitch + k];
        float Nelement = N.elements[k * N.pitch + tx];
        Pvalue += Melement * Nelement;
    }

    // Write the result to device memory; each thread writes one element
    P.elements[ty * P.pitch + tx] = Pvalue;
}

[Figure: thread (tx, ty) reads row ty of M and column tx of N to compute one element of the WIDTH x WIDTH matrix P.]

HK-UIUC

Step 5: Some Loose Ends

// Allocate a device matrix of the same size as M.
Matrix AllocateDeviceMatrix(const Matrix M)
{
    Matrix Mdevice = M;
    int size = M.width * M.height * sizeof(float);
    cudaMalloc((void**)&Mdevice.elements, size);
    return Mdevice;
}

// Copy a host matrix to a device matrix.
void CopyToDeviceMatrix(Matrix Mdevice, const Matrix Mhost)
{
    int size = Mhost.width * Mhost.height * sizeof(float);
    cudaMemcpy(Mdevice.elements, Mhost.elements, size, cudaMemcpyHostToDevice);
}

// Copy a device matrix to a host matrix.
void CopyFromDeviceMatrix(Matrix Mhost, const Matrix Mdevice)
{
    int size = Mdevice.width * Mdevice.height * sizeof(float);
    cudaMemcpy(Mhost.elements, Mdevice.elements, size, cudaMemcpyDeviceToHost);
}

// Free a device matrix.
void FreeDeviceMatrix(Matrix M)
{
    cudaFree(M.elements);
}

// Free a host matrix.
void FreeMatrix(Matrix M)
{
    free(M.elements);
}

HK-UIUC

Step 6: Handling Arbitrarily Sized Square Matrices
- Have each 2D thread block compute a (BLOCK_WIDTH)^2 sub-matrix (tile) of the result matrix
  - Each block has (BLOCK_WIDTH)^2 threads
- Generate a 2D grid of (WIDTH/BLOCK_WIDTH)^2 blocks (see the sketch below)

NOTE: You still need to put a loop around the kernel call for cases where WIDTH is really large.

[Figure: block (bx, by), and thread (tx, ty) within it, cover one tile of P formed from M and N.]

HK-UIUC
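A minimal sketch of the multi-block version described above. The kernel body is an assumption based on the earlier single-block kernel, not code taken from the slides; BLOCK_WIDTH is a compile-time constant and WIDTH is assumed to be a multiple of it.

#define BLOCK_WIDTH 16

// Each thread now combines blockIdx and threadIdx to find its element of P.
__global__ void MatrixMulKernelTiledGrid(Matrix M, Matrix N, Matrix P)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // global row in P
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // global column in P

    float Pvalue = 0;
    for (int k = 0; k < M.width; ++k)
        Pvalue += M.elements[row * M.pitch + k] * N.elements[k * N.pitch + col];

    P.elements[row * P.pitch + col] = Pvalue;
}

// Host side: a 2D grid of (WIDTH/BLOCK_WIDTH)^2 blocks, each BLOCK_WIDTH x BLOCK_WIDTH:
// dim3 dimBlock(BLOCK_WIDTH, BLOCK_WIDTH);
// dim3 dimGrid(WIDTH / BLOCK_WIDTH, WIDTH / BLOCK_WIDTH);
// MatrixMulKernelTiledGrid<<<dimGrid, dimBlock>>>(Md, Nd, Pd);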

The Common Pattern to CUDA Programming
- Phase 1: Allocate memory on the device and copy to the device the data required to carry out the computation on the GPU
- Phase 2: Let the GPU crunch the numbers based on the kernel that you defined
- Phase 3: Bring the results back from the GPU. Free the memory on the device (clean up…). You're done.
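As an illustration only (the kernel, the function names, and the block size are made up), the three phases typically condense to a few runtime API calls:

// Hypothetical kernel: double every element of a vector.
__global__ void ScaleKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

// Hypothetical driver showing the three phases for n floats stored in h_data.
void RunOnGpu(float* h_data, int n)
{
    int nbytes = n * sizeof(float);
    float* d_data;

    // Phase 1: allocate device memory and copy the input over
    cudaMalloc((void**)&d_data, nbytes);
    cudaMemcpy(d_data, h_data, nbytes, cudaMemcpyHostToDevice);

    // Phase 2: let the GPU crunch the numbers
    ScaleKernel<<<(n + 255) / 256, 256>>>(d_data, n);

    // Phase 3: bring the results back and clean up
    cudaMemcpy(h_data, d_data, nbytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
}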

A Common Programming Pattern (Cntd.): Bringing Shared Memory into the Picture
- Local and global memory reside in device memory (DRAM), with much slower access than shared memory
- An advantageous way of performing computation on the device is to block the data to take advantage of fast shared memory (see the sketch after this slide):
  - Partition the data into subsets that fit into shared memory
  - Handle each data subset with one thread block by:
    - Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
    - Performing the computation on the subset from shared memory; each thread can efficiently make multiple passes over any data element
    - Copying the results from shared memory back to global memory

HK-UIUC
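A hedged sketch of what this blocking looks like inside a kernel, written here for the square matrix multiply. This is the standard tiled pattern, not code from these slides, and it assumes width is a multiple of TILE_WIDTH.

#define TILE_WIDTH 16

__global__ void TiledMatrixMulKernel(const float* M, const float* N, float* P, int width)
{
    // Data subsets (tiles) staged in fast on-chip shared memory
    __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float Pvalue = 0.0f;

    for (int t = 0; t < width / TILE_WIDTH; ++t) {
        // 1. All threads of the block cooperatively load one tile of M and one of N
        Ms[threadIdx.y][threadIdx.x] = M[row * width + t * TILE_WIDTH + threadIdx.x];
        Ns[threadIdx.y][threadIdx.x] = N[(t * TILE_WIDTH + threadIdx.y) * width + col];
        __syncthreads();   // the tiles must be complete before anyone uses them

        // 2. Compute on the subset out of shared memory (each loaded element is
        //    reused TILE_WIDTH times instead of being re-read from global memory)
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
        __syncthreads();   // finish with these tiles before overwriting them
    }

    // 3. Copy the result back to global memory
    P[row * width + col] = Pvalue;
}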

How About the Memory Not in Registers or Shared Memory?
- The local, global, constant, and texture spaces are regions of device memory
- Beyond registers and shared memory, each multiprocessor has:
  - Read/write global memory (there is a lot of this…)
  - A read-only constant cache, to speed up access to the constant memory space
  - A read-only texture cache, to speed up access to the texture memory space

[Figure: a device with N stream multiprocessors; each has per-processor registers, shared memory, an instruction unit, a constant cache, and a texture cache, all backed by device memory holding the global, constant, and texture memories.]

A Common Programming Pattern (cont.)
- Texture and constant memory also reside in device memory (DRAM), with much slower access than shared memory
- But… they are cached! Highly efficient access for read-only data
- Carefully divide the data according to access patterns:
  - R/O, no structure -> constant memory
  - R/O, array structured -> texture memory
  - R/W, shared within a block -> shared memory
  - R/W, registers spill to local memory
  - R/W inputs/results -> global memory

HK-UIUC
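For example, a small read-only table with no array-access structure can go into constant memory. The sketch below uses an assumed symbol name and size and is not taken from the slides.

// A small read-only table placed in the constant memory space (cached on-chip).
__constant__ float coeffs[16];

// Hypothetical kernel: every thread reads the table through the constant cache.
__global__ void ApplyCoeffs(float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] *= coeffs[i % 16];
}

// Host side: fill the constant memory symbol before launching the kernel.
void SetCoeffs(const float* h_coeffs)
{
    cudaMemcpyToSymbol(coeffs, h_coeffs, 16 * sizeof(float));
}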

G80 Hardware Implementation: A Set of SIMD Multiprocessors
- The device is a set of stream multiprocessors (14 on the 8800 GT, but that's totally irrelevant)
- Each multiprocessor has a collection of 32-bit scalar processors with a Single Instruction Multiple Data architecture (shared instruction unit)
- At each clock cycle, a stream multiprocessor executes the same instruction on a group of threads called a warp
- The number of threads in a warp is the warp size

[Figure: a device with N multiprocessors, each containing M scalar processors and one instruction unit.]

HK-UIUC

Hardware Implementation: Execution Model (Review)
- Each block of a grid is split into warps, each of which gets executed by one stream multiprocessor (SM):
  - The device processes only one grid at a time
  - Each block is executed by one stream multiprocessor
  - The shared memory space resides in the on-chip shared memory
- A stream multiprocessor can execute multiple blocks concurrently:
  - Shared memory and registers are partitioned among the threads of all concurrent blocks
  - So, decreasing shared memory usage (per block) and register usage (per thread) increases the number of blocks that can run concurrently, which is good
  - Currently, up to eight blocks can be processed concurrently (this is bound to change as the HW changes…)
- The way a block is split into warps is always the same: each warp contains threads of consecutive, increasing thread indices, with the first warp containing thread 0 (see the sketch below).
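Since the block-to-warp mapping is deterministic, a kernel can recover its warp and lane numbers from the built-in variables. The sketch below assumes a 1D block; the kernel name and output buffer are made up for illustration.

// warpSize is a built-in device constant (32 on G80 and on current hardware).
__global__ void RecordWarpIds(int* warpIdOut)
{
    int lane   = threadIdx.x % warpSize;   // position of this thread inside its warp
    int warpId = threadIdx.x / warpSize;   // consecutive thread indices share a warp

    if (lane == 0)                         // one representative write per warp
        warpIdOut[warpId] = warpId;
}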