Matrix Multiplication in CUDA
Kyeo-Reh Park
Nuclear & Quantum Engineering
Contents
- Introduction
- Usage of Matrix Multiplication
- Adjacency matrix and graph path
- More Applications
- Implementation in CUDA
- CUDA Programming Model
- Simple implementation
- Introduction to the shared memory structure
- Shared memory implementation
- Summary
Introduction
Counting Paths in Graphs
- "How many paths are there from C to J of length exactly four?"
- This graph problem is used in:
  - Analyzing transportation networks
  - DNA sequence comparison
  - Drug design
Counting Paths in Graphs
- A systematic way to find the answer is to construct a matrix in which each row corresponds to a start vertex and each column to an end vertex.
- To start, construct the matrix P of paths of length 1 (the adjacency matrix): entry (i, j) counts the edges from vertex i to vertex j.
Adjacency matrix and graph path
- For longer paths, note that a path of length 2 is built from two paths of length 1 that share a middle vertex k.
- Summing over k gives (P^2)[i][j] = sum over k of P[i][k] * P[k][j], which is exactly a matrix multiplication.
- Each further multiplication by P gives the path matrix for a length one higher.
Adjacency matrix and graph path
- The first question therefore reduces to raising the adjacency matrix to the fourth power, P^4, and reading off the entry for (C, J).
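The path-counting rule can be tried on the host before any GPU code is written. The sketch below is a minimal illustration, not the deck's code: the graph from the slides is not reproduced here, so `exampleSquareGraph()` is a hypothetical four-vertex graph, and `CountMatrix`, `multiply` and `countPaths` are names introduced only for this example. Note that matrix powers count routes in which vertices may repeat.

```cuda
#include <cassert>
#include <cstddef>
#include <vector>

// Square matrix of counts; the alias name is a hypothetical choice for this sketch.
using CountMatrix = std::vector<std::vector<long long>>;

// Host-side product C = A * B of two square matrices of equal size.
CountMatrix multiply(const CountMatrix& A, const CountMatrix& B) {
    assert(A.size() == B.size());
    const std::size_t n = A.size();
    CountMatrix C(n, std::vector<long long>(n, 0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                C[i][j] += A[i][k] * B[k][j];
    return C;
}

// Entry (from, to) of P^length: the number of routes that use exactly `length` edges.
long long countPaths(const CountMatrix& P, int length,
                     std::size_t from, std::size_t to) {
    const std::size_t n = P.size();
    CountMatrix result(n, std::vector<long long>(n, 0));
    for (std::size_t i = 0; i < n; ++i) result[i][i] = 1;  // identity = P^0
    for (int step = 0; step < length; ++step) result = multiply(result, P);
    return result[from][to];
}

// Hypothetical example graph: the square 0-1-2-3-0 with undirected edges.
CountMatrix exampleSquareGraph() {
    return {{0, 1, 0, 1},
            {1, 0, 1, 0},
            {0, 1, 0, 1},
            {1, 0, 1, 0}};
}
```

For this square graph, `countPaths(exampleSquareGraph(), 4, 0, 2)` evaluates to 8, while an odd length between the same two vertices gives 0 because opposite corners are always an even number of steps apart.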
More applications!
Implementation in CUDA
Basics..
CUDA Programming model: Thread indexing
- Each thread has built-in struct variables (threadIdx, blockIdx, blockDim, gridDim) that locate its position.
CUDA Programming model: Thread indexing
- Thread position (col, row):
  col = threadIdx.x + blockIdx.x * blockDim.x
  row = threadIdx.y + blockIdx.y * blockDim.y
- Global linear memory index: idx = row * width + col, where width is the number of elements per row.
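As an illustration of the indexing rule, the sketch below has every thread compute its (col, row) position and store its own linear index at that position. The kernel name `writeLinearIndex`, the host helper `linearIndices` and the 16 x 16 block shape are assumptions made for this example only, and error handling is kept to a bare success flag.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread writes its global linear index into out[row * width + col].
__global__ void writeLinearIndex(int* out, int width, int height) {
    int col = threadIdx.x + blockIdx.x * blockDim.x;
    int row = threadIdx.y + blockIdx.y * blockDim.y;
    if (row >= height || col >= width) return;  // threads outside the array do nothing
    int idx = row * width + col;                // row-major linear index
    out[idx] = idx;
}

// Hypothetical host helper: fills `host` (resized to width * height) and
// returns false if a CUDA runtime call fails.
bool linearIndices(int width, int height, std::vector<int>& host) {
    host.assign(static_cast<size_t>(width) * height, -1);
    const size_t bytes = host.size() * sizeof(int);
    int* dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;
    dim3 block(16, 16);  // assumed block shape
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    writeLinearIndex<<<grid, block>>>(dev, width, height);
    bool ok = cudaMemcpy(host.data(), dev, bytes,
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dev);
    return ok;
}
```

With width 5 and height 3, the thread at row 1, col 2 writes the value 7 into element 7, so on success the returned array simply holds 0, 1, 2, ... in order.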
CUDA Programming model: Kernel call
- In CUDA, launching a kernel requires specifying three things:
  - Dimensions of the grid
  - Dimensions of the blocks
  - Kernel function to run on the device
- Dimensions of the grid and blocks are passed as dim3 variables
  - An unsigned int is treated as a one-dimensional dim3 variable
  - If the constructor is called with fewer than three values, the remaining dimensions default to 1
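A launch configuration along these lines might look like the sketch below. The helper `gridFor`, the placeholder kernel `noopKernel` and the 16 x 16 block shape are hypothetical names and values chosen for illustration; only `dim3`, the `<<<grid, block>>>` syntax and `cudaDeviceSynchronize` come from CUDA itself.

```cuda
#include <cuda_runtime.h>

// Smallest grid of `block`-sized blocks that covers a width x height matrix.
// Only x and y are given, so the z dimension defaults to 1.
dim3 gridFor(int width, int height, dim3 block) {
    return dim3((width + block.x - 1) / block.x,
                (height + block.y - 1) / block.y);
}

// Placeholder kernel (hypothetical) so the launch syntax can be shown.
__global__ void noopKernel() {}

// Launches noopKernel over a grid covering width x height elements.
// Returns true if the launch and its execution completed without error.
bool launchCovering(int width, int height) {
    dim3 block(16, 16);                         // 16 x 16 x 1 threads per block
    dim3 grid = gridFor(width, height, block);
    noopKernel<<<grid, block>>>();              // grid first, then block
    return cudaDeviceSynchronize() == cudaSuccess;
}
```

Rounding up means a 100 x 50 matrix gets a 7 x 4 grid of 16 x 16 blocks, so some threads fall outside the matrix and the kernel has to guard against them.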
Simple Implementation…
    __global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
    {
        float Cvalue = 0.0f;
        // Index rule: each thread computes one element C(row, col)
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        // Indices are zero-based, so row == A.height is already out of range
        if (row >= A.height || col >= B.width) return;
        // Multiplication sum: row of A times column of B
        for (int e = 0; e < A.width; ++e)
            Cvalue += A.elements[row * A.width + e] * B.elements[e * B.width + col];
        C.elements[row * C.width + col] = Cvalue;
    }
Simple Implementation…
- The full code is available as multNoShare (materials/UPModules/matrixMultiplication/module Document.pdf)
- Quick outline of the code:
  - main(): generates two random matrices with dimensions read from the command line, then calls MatMul() to multiply them
  - MatMul(Matrix A, Matrix B, Matrix C): takes two matrices A and B as input and fills C with the product. Copies A and B into device global memory, then calls the kernel. Acts as the interface between host and device.
  - MatMulKernel(Matrix A, Matrix B, Matrix C): does the actual computation
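The host and device division of labour in this outline can be sketched as follows. This is not the multNoShare source: the `Matrix` layout (width, height, row-major `elements`) is inferred from how the kernel indexes it, and `BLOCK_SIZE`, the helper `toDevice` and the boolean return value are assumptions added for illustration. The kernel uses `>=` in its bounds check because indices are zero-based.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 16  // assumed block edge length

// Row-major matrix: element (row, col) is elements[row * width + col].
struct Matrix {
    int width;
    int height;
    float* elements;
};

// One thread per element of C.
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= A.height || col >= B.width) return;
    float Cvalue = 0.0f;
    for (int e = 0; e < A.width; ++e)
        Cvalue += A.elements[row * A.width + e] * B.elements[e * B.width + col];
    C.elements[row * C.width + col] = Cvalue;
}

// Hypothetical helper: allocates device storage shaped like `host` and,
// when `copy` is true, uploads the host data into it.
static bool toDevice(const Matrix& host, Matrix& dev, bool copy) {
    dev.width = host.width;
    dev.height = host.height;
    size_t bytes = sizeof(float) * host.width * host.height;
    if (cudaMalloc(&dev.elements, bytes) != cudaSuccess) return false;
    return !copy || cudaMemcpy(dev.elements, host.elements, bytes,
                               cudaMemcpyHostToDevice) == cudaSuccess;
}

// Host interface: C = A * B. Returns false if any CUDA runtime call fails.
bool MatMul(Matrix A, Matrix B, Matrix C) {
    Matrix dA{0, 0, nullptr}, dB{0, 0, nullptr}, dC{0, 0, nullptr};
    bool ok = toDevice(A, dA, true) && toDevice(B, dB, true) &&
              toDevice(C, dC, false);
    if (ok) {
        dim3 block(BLOCK_SIZE, BLOCK_SIZE);
        dim3 grid((B.width + block.x - 1) / block.x,
                  (A.height + block.y - 1) / block.y);
        MatMulKernel<<<grid, block>>>(dA, dB, dC);
        size_t bytes = sizeof(float) * C.width * C.height;
        // This blocking copy also waits for the kernel to finish.
        ok = cudaMemcpy(C.elements, dC.elements, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA.elements);  // freeing a null pointer is a no-op
    cudaFree(dB.elements);
    cudaFree(dC.elements);
    return ok;
}
```

A caller fills `A.elements` and `B.elements` on the host, provides storage for `C.elements`, and reads the product from C once `MatMul` returns true.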
Improvement is available…
- Global memory access is slow
  - Hundreds of clock cycles are spent idle on each global memory access
- Shared memory responds faster than global memory
- Shared memory is limited…
  - 16 kB per multiprocessor on compute capability 1.x devices
- A memory-conserving strategy has to be implemented!
Memory conserving strategy
- Build sub-matrices in shared memory that each hold part of A, B and C
- Here, for simplicity, the matrix dimensions are assumed to be multiples of BLOCK_SIZE
- When the products for every element of the current sub-matrices have been accumulated, load the next pair of sub-matrices and add their contribution to the running sum
Shared memory implementation
- Matrix multiplication kernel using shared memory
Shared memory implementation
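The kernel listing from this slide did not survive extraction, so the following is a reconstruction sketch of the tiling strategy described above rather than the original code. It assumes the same `Matrix` layout as the simple kernel, a `BLOCK_SIZE` of 16, and matrix dimensions that are multiples of `BLOCK_SIZE`; the names `MatMulSharedKernel` and `MatMulShared` are hypothetical.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 16  // assumed tile edge length

struct Matrix {
    int width;
    int height;
    float* elements;  // row-major
};

// Launch with BLOCK_SIZE x BLOCK_SIZE threads per block.
// All dimensions must be multiples of BLOCK_SIZE.
__global__ void MatMulSharedKernel(Matrix A, Matrix B, Matrix C) {
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];  // tile of A
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];  // tile of B
    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float Cvalue = 0.0f;
    for (int t = 0; t < A.width / BLOCK_SIZE; ++t) {
        // Each thread loads one element of each tile from global memory.
        As[threadIdx.y][threadIdx.x] =
            A.elements[row * A.width + t * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] =
            B.elements[(t * BLOCK_SIZE + threadIdx.y) * B.width + col];
        __syncthreads();  // both tiles fully loaded before anyone reads them
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[threadIdx.y][e] * Bs[e][threadIdx.x];
        __syncthreads();  // everyone done reading before the tiles are overwritten
    }
    C.elements[row * C.width + col] = Cvalue;
}

// Hypothetical host wrapper: C = A * B. Returns false on a shape mismatch
// or if any CUDA runtime call fails.
bool MatMulShared(Matrix A, Matrix B, Matrix C) {
    if (A.width % BLOCK_SIZE || A.height % BLOCK_SIZE ||
        B.width % BLOCK_SIZE || A.width != B.height ||
        C.width != B.width || C.height != A.height)
        return false;
    const Matrix h[3] = {A, B, C};
    Matrix d[3] = {A, B, C};
    bool ok = true;
    for (int i = 0; i < 3; ++i) {
        d[i].elements = nullptr;
        size_t bytes = sizeof(float) * h[i].width * h[i].height;
        ok = ok && cudaMalloc(&d[i].elements, bytes) == cudaSuccess;
        if (ok && i < 2)  // upload A and B only
            ok = cudaMemcpy(d[i].elements, h[i].elements, bytes,
                            cudaMemcpyHostToDevice) == cudaSuccess;
    }
    if (ok) {
        dim3 block(BLOCK_SIZE, BLOCK_SIZE);
        dim3 grid(B.width / BLOCK_SIZE, A.height / BLOCK_SIZE);
        MatMulSharedKernel<<<grid, block>>>(d[0], d[1], d[2]);
        ok = cudaMemcpy(C.elements, d[2].elements,
                        sizeof(float) * C.width * C.height,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    for (int i = 0; i < 3; ++i) cudaFree(d[i].elements);
    return ok;
}
```

With this tile size the two shared arrays take 2 x 16 x 16 x 4 bytes = 2 kB per block, well within the 16 kB limit quoted earlier, and each element of A and B is read from global memory once per tile instead of once per multiplication.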
Performance benefit
[Figure: performance comparison of the NoShared and Shared versions]
- Shared memory showed a performance benefit on the order of 2x
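Be careful!
[Figure: Thread 0, Thread 1, …, Thread n-2, Thread n-1 of a block arriving at a __syncthreads() barrier]
- All threads of a block must reach __syncthreads() before any of them continues; without it, a thread may read shared memory that another thread has not yet written.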
Summary
- Matrix multiplication is fundamental to many applications
- CUDA C can speed up the calculation by exploiting the SIMT architecture
- Keep in mind that the thread indexing rules have to be handled carefully
- Using shared memory improves performance by reducing memory access time
- Caution: shared memory is limited
- Its use may require synchronizing the threads