Weekly Report – Matrix Multiplications
Ph.D. Student: Leo Lee
Date: Oct. 16, 2009

Outline
- Matrix multiplication
- Implementation
- Experiments
- Work plan

Matrix Multiplication
(© David Kirk/NVIDIA and Wen-mei W. Hwu, Taiwan, June 30 – July 2, 2008)
A: M×N, B: N×P, C = A×B: M×P
[Figure: matrices A (M×N), B (N×P), and their product C (M×P).]

Matrix Multiplication

    // Matrix multiplication on the (CPU) host
    void MatrixMulOnHost(float* A, float* B, float* C, int hA, int wA, int wB)
    {
        for (int i = 0; i < hA; ++i) {
            for (int j = 0; j < wB; ++j) {
                double sum = 0;
                for (int k = 0; k < wA; ++k) {
                    double a = A[i * wA + k];
                    double b = B[k * wB + j];
                    sum += a * b;
                }
                C[i * wB + j] = sum;
            }
        }
    }

Implementation_1
One thread calculates one element of C:

    dim3 grid(1, 1);
    dim3 thread(WC, HC);

    __global__ void matrixMul_low(float* C, float* A, float* B, int wA, int wB)
    {
        int tx = threadIdx.x;
        int ty = threadIdx.y;
        float Csub = 0;
        for (int k = 0; k < wA; ++k) {
            Csub += A[ty * wA + k] * B[k * wB + tx];
        }
        C[ty * wB + tx] = Csub;
    }
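For context, a minimal host-side driver for this kernel might look like the sketch below. None of it appears in the original report: the buffer names (A_h, dA, and so on) are illustrative, and WA/HA/WB/WC/HC follow the slide's naming for the matrix dimensions (with WC = WB and HC = HA).

    // Hypothetical host-side launch for matrixMul_low (illustrative only).
    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, HA * WA * sizeof(float));
    cudaMalloc((void**)&dB, WA * WB * sizeof(float));
    cudaMalloc((void**)&dC, HC * WC * sizeof(float));
    cudaMemcpy(dA, A_h, HA * WA * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B_h, WA * WB * sizeof(float), cudaMemcpyHostToDevice);

    dim3 grid(1, 1);        // a single block...
    dim3 thread(WC, HC);    // ...with one thread per element of C
    matrixMul_low<<<grid, thread>>>(dC, dA, dB, WA, WB);

    cudaMemcpy(C_h, dC, HC * WC * sizeof(float), cudaMemcpyDeviceToHost);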

Experiments_1
[Figures: measured CPU vs. GPU running times for Implementation_1; the values were lost in transcription.]

Brief analysis
- Less efficient than the CPU; data transfer occupies most of the time.
- Each thread:
  - loads a row of matrix A;
  - loads a column of matrix B;
  - performs one multiply and one addition for each pair of A and B elements;
  - so the compute to off-chip memory access ratio is close to 1:1 (not very high).
- The matrix size is limited by the number of threads allowed in a thread block (a G80-class block holds at most 512 threads, so a single-block kernel covers at most roughly 22×22 elements). 1*2*2 is not ok?
- Try to increase the compute to off-chip memory access ratio!

Implementation_2
Tiled multiply:
- Each block computes one square sub-matrix Csub of C, of size TILE_BLOCK_SIZE.
- Each thread computes one element of Csub.
- Assume that the dimensions of A and B are multiples of TILE_BLOCK_SIZE.
[Figure: A, B, and C partitioned into TILE_WIDTH × TILE_WIDTH tiles, indexed by block (bx, by) and thread (tx, ty).]

Implementation_2

    dim3 thread(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(WC / thread.x, HC / thread.y);

Kernel (the slide shows only the body of the tile loop; the loop itself, the index setup, and the AS/BS macros are restored here so the code reads as a whole):

    #define AS(i, j) As[i][j]
    #define BS(i, j) Bs[i][j]

    __global__ void matrixMul(float* C, float* A, float* B, int wA, int wB)
    {
        int bx = blockIdx.x, by = blockIdx.y;
        int tx = threadIdx.x, ty = threadIdx.y;

        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        float Csub = 0;
        // a walks the tiles of A along its row; b walks the tiles of B down its column
        int aEnd = wA * BLOCK_SIZE * by + wA - 1;
        for (int a = wA * BLOCK_SIZE * by, b = BLOCK_SIZE * bx;
             a <= aEnd;
             a += BLOCK_SIZE, b += BLOCK_SIZE * wB)
        {
            // Load the matrices from device memory to shared memory
            AS(ty, tx) = A[a + ty * wA + tx];
            BS(ty, tx) = B[b + ty * wB + tx];
            // Synchronize to make sure the matrices are loaded
            __syncthreads();

            for (int k = 0; k < BLOCK_SIZE; ++k)
                Csub += AS(ty, k) * BS(k, tx);

            // Synchronize before the next iteration overwrites the tiles
            __syncthreads();
        }

        int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
        C[c + wB * ty + tx] = Csub;
    }
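With the same hypothetical device buffers as in the earlier sketch, the tiled kernel would be launched with one BLOCK_SIZE × BLOCK_SIZE block per tile of C:

    // Illustrative launch (not from the report): one block per output tile.
    dim3 thread(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(WC / thread.x, HC / thread.y);
    matrixMul<<<grid, thread>>>(dC, dA, dB, WA, WB);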

Experiments_2
[Figures: running-time improvement from tiling, and the improvement of the GPU over the CPU; the measured values were lost in transcription.]

[Table: GPU vs. CPU compute time (ms) and total time (ms) for matrix sizes (WA, HA, WB) = (16, 16, 16), (…, 32, 32), (…, 80, 128), (…, 256, 512), (…, 512, 512); the measured values were lost in transcription.]

Brief analysis
- Using shared memory increases the compute to off-chip memory access ratio: per 16×16 tile, 256 accesses support (16+16)*16*16 computations.
- Data transfer still occupies much of the time; next, try coalesced accesses.

Implementation_3
Transpose matrix B:
- reading B then follows the same row-major pattern as reading A;
- C[i][j] = Σ_k A[i][k] * B[j][k].
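A minimal single-block sketch of this variant, in the style of Implementation_1, is below; matrixMul_transposed and Bt (a pre-transposed copy of B, stored with row length wA) are illustrative names, not taken from the report.

    // Sketch of Implementation_3 (illustrative; not the report's exact code).
    // Bt is B transposed, so row j of Bt holds column j of B (row length wA).
    __global__ void matrixMul_transposed(float* C, float* A, float* Bt,
                                         int wA, int wB)
    {
        int tx = threadIdx.x;   // column j of C
        int ty = threadIdx.y;   // row i of C
        float Csub = 0;
        for (int k = 0; k < wA; ++k) {
            // Both operands are now walked along rows:
            // C[i][j] = sum over k of A[i][k] * Bt[j][k]
            Csub += A[ty * wA + k] * Bt[tx * wA + k];
        }
        C[ty * wB + tx] = Csub;
    }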

Experiments_3
[Figure: running time with coalesced accesses vs. Implementation_2; the values were lost in transcription.]

Brief analysis
- No big change.
- Review the code; try a new method.

Work plan
- Further experiments on matrix multiplication
- Learn reduction

Thank you for listening.