Matrix Multiplication in CUDA 20165132 Kyeo-Reh Park, Nuclear & Quantum Engineering

Presentation transcript:

Matrix Multiplication in CUDA Kyeo-Reh Park, Nuclear & Quantum Engineering

Contents
- Introduction
  - Usage of Matrix Multiplication
  - Adjacency matrix and graph path
  - More Applications
- Implementation in CUDA
  - CUDA Programming Model
  - Simple implementation
  - Introduction to the shared memory structure
  - Shared memory implementation
- Summary

Introduction

Counting Paths in Graphs
"How many paths of length exactly four are there from C to J?"
This graph problem is used in:
- Analyzing transportation networks
- DNA sequence comparison
- Drug design

Counting Paths in Graphs
A systematic way to solve this is to construct a matrix in which each row/column corresponds to a start/end vertex.
As a start, construct the path matrix P of length 1 (the adjacency matrix).

Adjacency matrix and graph path
For matrices of higher path length, note that a path of length 2 can be built from two paths of length 1.
It follows that multiplying two path matrices gives the path matrix whose length is the sum of the two.
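Concretely, counting the length-2 paths from vertex $i$ to vertex $j$ means summing over every intermediate vertex $k$, which is exactly matrix multiplication:

$$P^{(2)}_{ij} = \sum_{k} P_{ik}\,P_{kj} = (P^2)_{ij}, \qquad \text{and in general} \qquad P^{(n)} = P^n.$$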

Adjacency matrix and graph path
The first question thus reduces to raising the adjacency matrix to the fourth power: the (C, J) entry of P^4 is the answer.

More applications!

Implementation in CUDA

Basics…

CUDA Programming model: Thread indexing
Each thread has built-in struct variables (threadIdx, blockIdx, blockDim) that locate its position.

CUDA Programming model: Thread indexing
(col, row):
col = threadIdx.x + blockIdx.x * blockDim.x
row = threadIdx.y + blockIdx.y * blockDim.y
Global linear memory index (row-major): idx = row * width + col, where width is the number of columns in the matrix.
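A minimal indexing sketch (the kernel name and parameters are illustrative assumptions, not from the slides):

    __global__ void indexDemo(float *out, int width, int height)
    {
        // Recover this thread's 2-D coordinate from the built-in variables
        int col = threadIdx.x + blockIdx.x * blockDim.x;
        int row = threadIdx.y + blockIdx.y * blockDim.y;
        if (col < width && row < height)     // the grid may overhang the matrix
            out[row * width + col] = 1.0f;   // row-major global linear index
    }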

CUDA Programming model: Kernel call
In CUDA, launching a kernel requires specifying three things:
- Dimensions of the grid
- Dimensions of the blocks
- Kernel function to run on the device
The grid and block dimensions are passed as dim3 variables; an unsigned int is treated as a one-dimensional dim3. If the constructor is given fewer than three values, the unspecified dimensions default to 1. A launch sketch follows.
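For example, a launch configuration for an N x N problem might look like this (BLOCK_SIZE, N, and the device pointers d_A, d_B, d_C are assumptions; MatMulKernel is the kernel on the next slide):

    #define BLOCK_SIZE 16   // a common tile edge; assumed, not from the slides

    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);              // 2-D block; z defaults to 1
    dim3 dimGrid((N + BLOCK_SIZE - 1) / BLOCK_SIZE,     // round up so every
                 (N + BLOCK_SIZE - 1) / BLOCK_SIZE);    // element is covered
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C); // grid, block, then kernel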

Simple Implementation…

    __global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
    {
        float Cvalue = 0.0;
        // Index rule: one thread per output element
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= A.height || col >= B.width) return;
        // Multiplication sum: dot product of a row of A with a column of B
        for (int e = 0; e < A.width; ++e)
            Cvalue += A.elements[row * A.width + e] * B.elements[e * B.width + col];
        C.elements[row * C.width + col] = Cvalue;
    }

Simple Implementation…
Full code is available as multNoShare at materials/UPModules/matrixMultiplication/moduleDocument.pdf
Quick outline of the code:
- main(): generates two random matrices with dimensions read from the command line, then calls MatMul() to multiply them.
- MatMul(Matrix A, Matrix B, Matrix C): takes matrices A and B as input and fills C with their product; copies A and B into device global memory, then calls the kernel. Acts as the host-device interface, as sketched below.
- MatMulKernel(Matrix A, Matrix B, Matrix C): does the actual computation.
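A host-side sketch of MatMul(), assuming Matrix carries width, height, and an elements pointer as in the kernel above (error checking omitted; BLOCK_SIZE as assumed earlier):

    void MatMul(const Matrix A, const Matrix B, Matrix C)
    {
        // Device-side copies: same shapes, device element buffers
        Matrix d_A = A, d_B = B, d_C = C;
        size_t sizeA = A.width * A.height * sizeof(float);
        size_t sizeB = B.width * B.height * sizeof(float);
        size_t sizeC = C.width * C.height * sizeof(float);
        cudaMalloc(&d_A.elements, sizeA);
        cudaMalloc(&d_B.elements, sizeB);
        cudaMalloc(&d_C.elements, sizeC);
        cudaMemcpy(d_A.elements, A.elements, sizeA, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B.elements, B.elements, sizeB, cudaMemcpyHostToDevice);

        dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
        dim3 dimGrid((B.width  + dimBlock.x - 1) / dimBlock.x,
                     (A.height + dimBlock.y - 1) / dimBlock.y);
        MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

        // Bring the product back and release device memory
        cudaMemcpy(C.elements, d_C.elements, sizeC, cudaMemcpyDeviceToHost);
        cudaFree(d_A.elements); cudaFree(d_B.elements); cudaFree(d_C.elements);
    }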

Improvement is available…
Global memory access is slow: hundreds of clock cycles can be spent idle on each global memory access.
Shared memory responds much faster than global memory, but it is limited: 16 kB per multiprocessor on compute-capability-1.x devices.
A memory-conserving strategy therefore has to be implemented.

Memory conserving strategy
Build submatrices (tiles) to hold parts of A, B, and C.
Here, for simplicity, the matrix dimensions are assumed to be multiples of BLOCK_SIZE.
When the partial sum over the current submatrix is finished for every element, load the next submatrix and keep accumulating.

Shared memory implementation
Matrix multiplication kernel using shared memory; a sketch of the kernel follows.
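A sketch of the shared-memory (multShare-style) kernel, under the slides' assumption that matrix dimensions are multiples of BLOCK_SIZE; the Matrix fields match the earlier kernel:

    __global__ void MatMulSharedKernel(Matrix A, Matrix B, Matrix C)
    {
        // Per-block tiles of A and B staged in fast shared memory
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
        int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
        float Cvalue = 0.0f;

        // Slide the tile pair along A's rows and down B's columns
        for (int m = 0; m < A.width / BLOCK_SIZE; ++m) {
            As[threadIdx.y][threadIdx.x] =
                A.elements[row * A.width + m * BLOCK_SIZE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] =
                B.elements[(m * BLOCK_SIZE + threadIdx.y) * B.width + col];
            __syncthreads();                      // whole tile loaded before any read

            for (int e = 0; e < BLOCK_SIZE; ++e)  // partial dot product from this tile
                Cvalue += As[threadIdx.y][e] * Bs[e][threadIdx.x];
            __syncthreads();                      // done reading before the next load
        }
        C.elements[row * C.width + col] = Cvalue;
    }

Each thread block now reads each needed element of A and B from global memory once per tile rather than once per multiply-add, cutting global memory traffic by a factor of BLOCK_SIZE.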


Performance benefit
(Screenshots: timing results of the NoShared and Shared versions.)
Shared memory showed a performance benefit on the order of 2x.

Be careful!
(Diagram: threads 0 through n-1 reach a __syncthreads() barrier at different times.)
__syncthreads() is a barrier: no thread of the block proceeds past it until every thread has arrived, so all threads must reach the same call.
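A minimal sketch of the hazard (illustrative, not from the slides): a barrier placed inside a divergent branch is reached by only some threads and can hang the block, so the barrier belongs after the branch:

    __global__ void barrierDemo(const float *input, float *output)
    {
        __shared__ float buf[16];

        // WRONG (shown commented out): threads with threadIdx.x >= 16 never
        // reach this barrier, so the block can hang.
        // if (threadIdx.x < 16) { buf[threadIdx.x] = input[threadIdx.x]; __syncthreads(); }

        // RIGHT: divergent work first, then one barrier every thread reaches
        if (threadIdx.x < 16)
            buf[threadIdx.x] = input[threadIdx.x];
        __syncthreads();

        output[threadIdx.x] = buf[threadIdx.x % 16];
    }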

Summary
Matrix multiplication is fundamental to many applications.
The CUDA C structure can speed up its calculation by using the SIMT architecture.
Keep in mind that the thread indexing rules have to be treated carefully.
Shared memory usage enhances performance by reducing memory access time.
Caution: shared memory is limited, and its usage may require synchronization of the threads.