L10: Dense Linear Algebra on GPUs

L10: Dense Linear Algebra on GPUs CS6963

Administrative Issues
Next assignment, linear algebra
– Handed out by Friday
– Due before spring break
– handin cs6963 lab 3 <probfile>

Outline
Triangular solve assignment from last year (briefly)
Reading:
– Paper: Volkov, V., and Demmel, J. W. 2008. Benchmarking GPUs to tune dense linear algebra. SC'08, November 2008.
– Paper link: http://portal.acm.org/citation.cfm?id=1413402
– Talk link: http://www.eecs.berkeley.edu/~volkov/volkov08-sc08talk.pdf
– Volkov code: http://forums.nvidia.com/index.php?showtopic=47689&st=40&p=314014&#entry314014

Triangular Solve (STRSM)

  for (j = 0; j < n; j++)
    for (k = 0; k < n; k++)
      if (B[j*n+k] != 0.0f) {
        for (i = k+1; i < n; i++)
          B[j*n+i] -= A[k*n+i] * B[j*n+k];
      }

Equivalent to:
  cublasStrsm('l' /* left operator */, 'l' /* lower triangular */,
              'N' /* not transposed */, 'u' /* unit triangular */,
              N, N, alpha, d_A, N, d_B, N);

See: http://www.netlib.org/blas/strsm.f
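For intuition about how this loop nest might map onto the GPU (a naive sketch, not the assignment's expected solution), one simple assignment gives each thread block a column j of B and each thread a row index i, leaving the k loop sequential because iteration k+1 reads values that iteration k may have written. The kernel name and the assumption that n fits within a single thread block are illustrative choices only.

  __global__ void strsm_naive(const float *A, float *B, int n)
  {
      int j = blockIdx.x;      // one column of B per block
      int i = threadIdx.x;     // one row index per thread; assumes n <= blockDim.x

      for (int k = 0; k < n; k++) {
          float bjk = B[j * n + k];        // produced in iteration k-1 by the thread with i == k
          if (i > k && i < n && bjk != 0.0f)
              B[j * n + i] -= A[k * n + i] * bjk;
          __syncthreads();                 // all updates for this k must finish before k+1 reads B[j*n+k+1]
      }
  }

This version leaves an increasing fraction of threads idle as k grows, which is exactly why the tuned approaches discussed later in the lecture matter.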

Last Year's Assignment
Details:
– Integrated with simpleCUBLAS test in SDK
– Reference sequential version provided
1. Rewrite in CUDA
2. Compare performance with CUBLAS 2.0 library

Symmetric Matrix Multiply (SSYMM)

  float a[N][N], b[N][N], c[N][N];
  float t1, t2;
  for (j = 0; j < N; j++) {
    for (i = 0; i < N; i++) {
      t1 = b[j][i];
      t2 = 0;
      for (k = 0; k < i; k++) {
        c[j][k] += t1 * a[i][k];
        t2 = t2 + b[k][j] * a[i][k];
      }
      c[j][i] += t1 * a[i][i] + t2;
    }
  }

Equivalent to:
  cublasSsymm('l' /* left operator */, 'u' /* upper triangular */,
              N, N, 1.0, d_A, N, d_B, N, 1.0, d_C, N);

See: http://www.netlib.org/blas/ssymm.f
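A host driver for checking the sequential version against CUBLAS might look like the following. This is only a sketch: it assumes the legacy cublas.h interface that shipped with CUBLAS 2.x (the same column-major call shown above), and run_ssymm is a made-up helper name, not part of the assignment code.

  #include <cublas.h>
  #include <cuda_runtime.h>

  void run_ssymm(const float *A, const float *B, float *C, int N)
  {
      float *d_A, *d_B, *d_C;
      size_t bytes = (size_t)N * N * sizeof(float);

      cublasInit();
      cudaMalloc((void **)&d_A, bytes);
      cudaMalloc((void **)&d_B, bytes);
      cudaMalloc((void **)&d_C, bytes);
      cudaMemcpy(d_A, A, bytes, cudaMemcpyHostToDevice);
      cudaMemcpy(d_B, B, bytes, cudaMemcpyHostToDevice);
      cudaMemcpy(d_C, C, bytes, cudaMemcpyHostToDevice);

      // C = 1.0*A*B + 1.0*C, with A symmetric, applied from the left, upper triangle stored
      cublasSsymm('l', 'u', N, N, 1.0f, d_A, N, d_B, N, 1.0f, d_C, N);

      cudaMemcpy(C, d_C, bytes, cudaMemcpyDeviceToHost);
      cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
      cublasShutdown();
  }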

Performance Issues?
+ Abundant data reuse
– Difficult edge cases
– Different amounts of work for different <j,k> values
– Complex mapping or load imbalance

1. Project Proposal (due 3/7)
Proposal Logistics:
– Significant implementation, worth 55% of grade
– Each person turns in the proposal (should be the same as other team members)
Proposal: 3-4 page document (11pt, single-spaced)
Submit with handin program: "handin cs6963 prop <pdf-file>"

Decision Algorithm: Computational Decomposition

Block Parallelism:
  tile_by_index({"i"}, {TI}, {l1_control="ii"}, {"ii","i","j"})
  cudaize(block{"ii"}, thread{})

Thread Parallelism:
  cudaize(block{"ii"}, thread{"i"})

How do we decide how to optimize the code? We start with the decision algorithm, identifying the blocks and threads. The decision algorithm identifies the loops for computational decomposition on the basis of dependence information. Some loops are pruned; others can lead to multiple options for laying out the grid. We view computational decomposition as a tiling problem: when needed, we tile the loops (as shown) to extract block-level and thread-level parallelism. Once the grid is set up, we proceed with the remaining loops to identify the threads.

Decision Algorithm: Data Staging

Data Staging:
– Shared Memory (data reuse across threads)
– Registers (data reuse inside a thread)
– Final Loop Order
Cudaize
Unrolling

  tile_by_index({"j"}, {TJ}, {l1_control="jj"}, {"ii","jj","i","j"})
  cudaize(block{"ii"}, thread{"i"})
  unroll_to_depth(1)

In the memory-management stage, the decision algorithm decides which data structures go into shared memory and which into registers. After selecting a data structure for copying to shared memory (based on data reuse across threads), we may once again have to tile so that the data footprint fits in shared memory. The copy to registers is decided by the data-reuse pattern within a thread. The decision algorithm then applies the cudaize transform to formally inform the system of the grid and block layouts, followed by the all-important unrolling transformation. (In this example the system is told to unroll up to 1 loop level; it can be up to 2 or 3, as in matrix multiply.)

CUDA-CHiLL Recipe

  N = 1024
  TI = TJ = 32
  tile_by_index({"i","j"}, {TI,TJ}, {l1_control="ii", l2_control="k"}, {"ii","jj","i","j"})
  normalize_index("i")
  cudaize("mv_GPU", {a=N, b=N, c=N*N}, {block={"ii"}, thread={"i"}})
  copy_to_shared("tx", "b", 1)
  copy_to_registers("jj", "a")
  unroll_to_depth(1)

This is a CUDA-CHiLL-level recipe, available to compilers and programmers for custom use. It encapsulates multiple low-level CHiLL operations in simple, easy-to-understand commands. For the matrix-vector example, this recipe specifies:
– Tiling: tile the loops to identify the block and thread loops; the subsequent data-staging operation also requires tiling for efficient and valid shared-memory use.
– Cudaize: a specific transformation to mark the CUDA-specific structures, i.e., the block loops and thread loops.
– Copy to shared memory: identifies the data structures to be copied to shared memory and the loop level where the data copy should take place, with appropriate synchronization.
– Copy to registers: identifies data structures for reuse inside a thread.
– Unrolling: unrolls the loops in the transformed code for maximum efficiency.

Matrix-Vector Multiply: GPU Code

Generated code, with computational decomposition only:

  __global__ void GPU_MV(float* a, float* b, float** c) {
    int bx = blockIdx.x; int tx = threadIdx.x;
    int i = 32*bx + tx;
    for (int j = 0; j < N; j++)
      a[i] = a[i] + c[j][i] * b[j];
  }

Final MV generated code, with data staged in shared memory and registers:

  __global__ void GPU_MV(float* a, float* b, float** c) {
    int bx = blockIdx.x; int tx = threadIdx.x;
    __shared__ float bcpy[32];
    float acpy = a[tx + 32 * bx];
    for (int jj = 0; jj < 32; jj++) {
      bcpy[tx] = b[32 * jj + tx];
      __syncthreads();
      // this loop is actually fully unrolled
      for (int j = 32 * jj; j < 32 * jj + 32; j++)
        acpy = acpy + c[j][32 * bx + tx] * bcpy[j - 32 * jj];
      __syncthreads();   // finish using bcpy before the next tile overwrites it
    }
    a[tx + 32 * bx] = acpy;
  }

Performance outperforms CUBLAS 3.2 (see later slide).
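A possible host-side launch for the final kernel above (a sketch under the recipe's assumptions: N = 1024, so N is a multiple of the 32-element tile; d_a, d_b, d_c are assumed device pointers, and since the kernel takes float** for c, d_c must be an array of device row pointers, or the kernel adjusted to a flattened layout):

  int threads = 32;               // one thread per output element within a block
  int blocks  = N / threads;      // one block per 32 rows of the output vector
  GPU_MV<<<blocks, threads>>>(d_a, d_b, d_c);
  cudaDeviceSynchronize();        // wait for completion before timing or copying results back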

An Added Complication
What if the input matrix C is transposed?
  a[i] += c[i][j] * b[j];
What happens to global memory coalescing? What can be done?
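One possible remedy (a sketch, not necessarily the solution developed in the lecture): stage 32x32 tiles of c through shared memory so that the global loads stay coalesced, then let each thread walk its own row of the tile from shared memory. This assumes c is stored row-major in a single allocation and that N is a multiple of 32.

  __global__ void mv_transposed(float *a, const float *b, const float *c, int N)
  {
      __shared__ float ctile[32][33];     // extra column avoids shared-memory bank conflicts
      int tx   = threadIdx.x;
      int base = 32 * blockIdx.x;         // first row handled by this block
      int i    = base + tx;
      float sum = a[i];

      for (int jj = 0; jj < N; jj += 32) {
          // Coalesced: for each r, consecutive threads read consecutive columns of row (base + r)
          for (int r = 0; r < 32; r++)
              ctile[r][tx] = c[(base + r) * N + jj + tx];
          __syncthreads();
          for (int j = 0; j < 32; j++)    // each thread reads its own row of the tile
              sum += ctile[tx][j] * b[jj + j];
          __syncthreads();                // finish with this tile before it is overwritten
      }
      a[i] = sum;
  }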

Reminder from L5: Matrix Multiplication in C

  // Matrix multiplication on the (CPU) host in double precision
  void MatrixMulOnHost(float* M, float* N, float* P, int Width)
  {
    for (int i = 0; i < Width; ++i)
      for (int j = 0; j < Width; ++j) {
        double sum = 0;
        for (int k = 0; k < Width; ++k) {
          double a = M[i * Width + k];
          double b = N[k * Width + j];
          sum += a * b;
        }
        P[i * Width + j] = sum;
      }
  }

[Figure: M, N, and P are WIDTH x WIDTH matrices; row i of M and column j of N produce element (i, j) of P.]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE498AL, University of Illinois, Urbana-Champaign

Tiled Matrix Multiply Using Thread Blocks
– One block computes one square sub-matrix Psub of size BLOCK_SIZE
– One thread computes one element of Psub
– Assume that the dimensions of M and N are multiples of BLOCK_SIZE and square

[Figure: tiling diagram showing block indices (bx, by), thread indices (tx, ty), and the Psub tile of size BLOCK_SIZE within a WIDTH x WIDTH matrix.]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Strip-Mined Code

  for (int ii = 0; ii < Width; ii += TI)            // block dimensions
    for (int i = ii; i < ii + TI; i++)              // thread dimensions
      for (int jj = 0; jj < Width; jj += TJ)        // block dimensions
        for (int j = jj; j < jj + TJ; j++) {        // thread dimensions
          double sum = 0;
          for (int kk = 0; kk < Width; kk += TK) {  // to be used to stage data in shared memory
            for (int k = kk; k < kk + TK; k++)
              sum += M[i][k] * N[k][j];
          }
          P[i][j] = sum;
        }

Final Code (from text, p. 87)

  __global__ void MatrixMulKernel(float *Md, float *Nd, float *Pd, int Width)
  {
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;
    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
      // Collaborative (parallel) loading of Md and Nd tiles into shared memory
      Mds[ty][tx] = Md[Row * Width + (m * TILE_WIDTH + tx)];
      Nds[ty][tx] = Nd[(m * TILE_WIDTH + ty) * Width + Col];
      __syncthreads();   // make sure all threads have completed the copy before calculation

      // Update Pvalue for the TILE_WIDTH x TILE_WIDTH tiles in Mds and Nds
      for (int k = 0; k < TILE_WIDTH; ++k)
        Pvalue += Mds[ty][k] * Nds[k][tx];
      __syncthreads();   // make sure calculation is complete before copying the next tile
    } // m loop
    Pd[Row * Width + Col] = Pvalue;
  }
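A typical launch configuration for this kernel (a sketch; like the kernel, it assumes Width is a multiple of TILE_WIDTH and that Md, Nd, Pd already reside in device memory):

  dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);                   // one thread per element of the output tile
  dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);    // one block per TILE_WIDTH x TILE_WIDTH tile of Pd
  MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);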

Preview: SGEMM (CUBLAS 2.x/3.x) on GPUs
The SGEMM result is not from the algorithm of L5. Why? Significant reuse can be managed within registers.

Threading
– GPU rule of thumb: generate lots of threads (up to 512/block) to hide memory latency.
– This lecture (GTX 280): only 64 threads/block provides 2 warps, sufficient to hide latency, and conserves registers.
– This lecture (Fermi C2050): more cores/block, fewer registers/thread, so use 96 threads/block.
Shared memory
– GPU rule of thumb: use to exploit reuse across threads.
– This lecture: communicate shared data across threads and coalesce global data.
Registers
– GPU rule of thumb: use for temporary per-thread data.
– This lecture: exploit significant reuse within a thread.
Texture memory
– GPU rule of thumb: not used.
– This lecture: increase bandwidth for global memory through parallel accesses.

Volkov Slides 5-17, 24

Comparison with MKL (Intel) Slide source: http://www.scribd.com/doc/47501296/CUDA-3-2-Math-Libraries-Performance

CUDA-CHiLL for Matrix Multiply (CUBLAS 2.x version)

  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      for (k = 0; k < n; k++)
        c[j][i] += a[k][i] * b[j][k];

  init("mm.sp2", "MarkedLoop")
  tile_control({"i","j"}, {TI,TJ}, {l1_control="ii", l2_control="jj"}, {"ii", "jj", "i", "j"})
  tile_control({"k"}, {TK}, {l1_control="kk"}, {"ii", "jj", "kk", "i", "j", "k"}, strided)
  tile_control({"i"}, {TJ}, {l1_control="ty", l1_tile="tx"}, {"ii", "jj", "kk", "tx", "ty", "j", "k"})
  -- Assign loop levels to thread space and name the kernel
  cudaize("mm_GPU", {a=N*N, b=N*N, c=N*N},   -- array sizes for data copying
          {block={"ii","jj"}, thread={"tx","ty"}})
  -- Copy the "c" array usage to registers
  copy_to_registers("kk", "c", {"tx","ty"})
  copy_to_shared("ty", "b")
  -- Unroll two innermost loop levels fully
  unroll_to_depth(2)

Gabe Rudy Master's thesis

Final Result: Copy C to registers, B to shared memory, and unroll

  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      for (k = 0; k < n; k++)
        c[j][i] += a[k][i] * b[j][k];

Steps 1 through 4 tile (for computation, data), then:
  -- Copy the "c" array usage to registers
  5. copy_to_registers("kk", "c", {"tx","ty"})
  6. copy_to_shared("ty", "b")
  -- Unroll two innermost loop levels fully
  7. unroll_to_depth(2)

Generated code (B goes into shared memory; C goes into registers and is copied back at the end):

  float P1[16];
  __shared__ float P2[16][17];
  bx = blockIdx.x, by = blockIdx.y;
  tx = threadIdx.x, ty = threadIdx.y;
  P1[0:15] = c[16*by:16*by+15][tx+64*bx+16*ty];
  for (t6 = 0; t6 <= 1008; t6 += 16) {
    P2[tx][4*ty:4*ty+3] = b[16*by+4*ty:16*by+4*ty+3][tx+t6];
    __syncthreads();
    P1[0:15] += a[t6][64*bx+16*ty+tx]    * P2[0][0:15];
    P1[0:15] += a[t6+1][64*bx+16*ty+tx]  * P2[1][0:15];
    ...
    P1[0:15] += a[t6+15][64*bx+16*ty+tx] * P2[15][0:15];
  }
  c[16*by:16*by+15][tx+64*bx+16*ty] = P1[0:15];

CUBLAS 3.x Example

2D Convolution: CUDA-CHiLL recipe and optimized code

Sequential code:
  for (i = 0; i < N; ++i)
    for (j = 0; j < N; ++j)
      for (k = 0; k < M; ++k)
        for (l = 0; l < M; ++l)
          c[i][j] = c[i][j] + a[k+i][l+j] * b[k][l];

CUDA-CHiLL recipe (N = 4096, M = 32, TI = 32, TJ = 16, Tl = 4):
  permute(0, {"j","i","k","l"})
  tile_by_index({"j","i"}, {TI,TJ}, {l1_control="jj", l2_control="ii"}, {"jj", "ii", "j", "i", "k", "l"})
  normalize_index("j")
  normalize_index("i")
  cudaize("Kernel_GPU", {a=(N+M)*(N+M), b=M*M, c=(N+M)*(N+M)}, {block={"jj","ii"}, thread={"j","i"}})
  copy_to_shared("tx", "a", -16)
  copy_to_shared("tx", "b", -16)
  copy_to_registers("tx", "c")
  unroll_to_depth(1)

Optimized code (note the complex bounds for the shared-memory copy loops and the data structures for shared memory):
  __shared__ float (_P1[47])[31];
  __shared__ float (_P2[16])[17];
  float tmp3;
  ...
  for (tmp = 16 * by + 3 * ty; tmp <= min(16 * by + 30, 16 * by + 3 * ty + 2); tmp++)
    for (tx1 = 2 * tx; tx1 <= min(2 * tx + 1, 46); tx1++)
      _P1[tx1][tmp - 16 * by] = a[tmp][32 * bx + tx1];
  __syncthreads();
  for (tmp = 0; tmp <= 15; tmp++)
    for (tx1 = 2 * tx; tx1 <= 2 * tx + 1; tx1++)
      _P2[tx1][tmp] = b[tmp][tx1];
  tmp3 = c[k + 16 * by][tx + 32 * bx];
  for (k = 0; k <= 15; k++)
    for (l = 0; l <= 15; l++)
      tmp3 = tmp3 + _P1[l + tx][k + ty] * _P2[l][k];
  c[k + 16 * by][tx + 32 * bx] = tmp3;

BLAS Library Kernel Optimization
[Charts: MV results, MM results, TMV results, VV results]

Stencil Optimization: Effects of Transformations
[Charts: 2D convolution; effect of texture memory with unrolling on MM; effect of unrolling on MM and MV]

Next Class
– Sparse linear algebra
– Appropriate data representations