L11: Dense Linear Algebra on GPUs, cont. CS6963
Administrative Issues
Next assignment: triangular solve
  Due 5PM, Friday, March 5
  handin cs6963 lab 3 <probfile>
Project proposals (discussed today)
  Due 5PM, Wednesday, March 17 (hard deadline)
  handin cs6963 prop <pdffile>
A few minutes at end of class to help form groups
Outline
Project: suggested topics
Triangular solve assignment
More on dense linear algebra: LU
Project Ideas
Suggestions from Nvidia (Nathan Bell via Pete Shirley, Steve Parker):
Sparse solvers
  Tailor computation to matrix structure
  Optimized SpMV for multiple vectors
  Block Krylov methods
  Block eigensolvers
  Parallel preconditioners
Parallel graph algorithms
  (Maximal) independent sets
  Graph colorings
  Partitioning of graphs that minimizes a given metric (e.g., edge cut)
  Optimal network flows
  Graph clustering
More Project Ideas
Suggestions from Steve Parker:
  Non-uniform, unstructured
  Double precision, cluster-based
Triangular Solve (STRSM)

for (j = 0; j < n; j++)
  for (k = 0; k < n; k++)
    if (B[j*n+k] != 0.0f) {
      for (i = k+1; i < n; i++)
        B[j*n+i] -= A[k*n+i] * B[j*n+k];
    }

Equivalent to:

cublasStrsm('l' /* left operator */, 'l' /* lower triangular */,
            'N' /* not transposed */, 'u' /* unit triangular */,
            N, N, alpha, d_A, N, d_B, N);

See: http://www.netlib.org/blas/strsm.f
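For the performance-comparison step of the assignment, a minimal host-side sketch of driving this CUBLAS call through the legacy interface might look like the following (the wrapper name, h_A, and h_B are placeholder names; matrices are assumed N x N and column-major, as discussed on the next slide; error checking omitted):

#include <cublas.h>   /* legacy CUBLAS interface */

/* Sketch: run the CUBLAS reference STRSM on host matrices h_A (unit
   lower triangular factor) and h_B (right-hand sides / result). */
void strsm_cublas_reference(int N, float alpha, const float *h_A, float *h_B)
{
    float *d_A, *d_B;
    cublasInit();
    cublasAlloc(N * N, sizeof(float), (void **)&d_A);
    cublasAlloc(N * N, sizeof(float), (void **)&d_B);
    cublasSetMatrix(N, N, sizeof(float), h_A, N, d_A, N);
    cublasSetMatrix(N, N, sizeof(float), h_B, N, d_B, N);

    cublasStrsm('l', 'l', 'N', 'u', N, N, alpha, d_A, N, d_B, N);

    cublasGetMatrix(N, N, sizeof(float), d_B, N, h_B, N);
    cublasFree(d_A);
    cublasFree(d_B);
    cublasShutdown();
}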
A Few Details
C stores multi-dimensional arrays in row-major order.
Fortran (and MATLAB) stores multi-dimensional arrays in column-major order.
Confusion alert: BLAS libraries were designed for Fortran codes, so column-major order is implicit in CUBLAS!
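A small sketch of the indexing difference (A is an N x N matrix stored in a flat array; the helper names are illustrative only):

/* Logical element A(i,j):
     row-major (C convention):       A[i*N + j]
     column-major (Fortran/CUBLAS):  A[j*N + i]
   A C array handed to CUBLAS unchanged is therefore seen as the transpose. */
float get_row_major(const float *A, int N, int i, int j) { return A[i*N + j]; }
float get_col_major(const float *A, int N, int i, int j) { return A[j*N + i]; }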
SGEMV (single-precision matrix-vector multiplication) and transposed SGEMV
Recall the transpose vs. column-major assumption.

// This is transposed due to
// the Fortran assumption
// Follows coalesced accesses
// Place b in shared memory
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    c[j] += a[j*n+i]*b[i];

// Non-transposed
// Store a in shared memory
// and access in a different order
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    c[i] += a[i*n+j]*b[j];

Results from Malik Khan: the non-transposed version is more than 2X slower for an 8K matrix!
Reference: An Optimizing Compiler for GPGPU Programs with Input-Data Sharing, Yang et al., PPoPP 2010 Poster, Jan. 2010.
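To illustrate why the transposed (column-major) pattern lends itself to coalescing, here is one possible CUDA mapping, not the measured compiler-generated kernel above: one thread block per output element c[j], with the threads of the block sweeping the contiguous a[j*n+i] elements cooperatively (the kernel name and the fixed block size of 128 are illustrative assumptions; b is read coalesced here rather than staged in shared memory):

// Sketch: transposed SGEMV, c[j] = sum over i of a[j*n+i] * b[i].
// Consecutive threads read consecutive elements of a, so the loads coalesce.
// Block size must be a power of two for the tree reduction below.
__global__ void sgemv_t_sketch(const float *a, const float *b, float *c, int n)
{
    __shared__ float partial[128];
    int j = blockIdx.x;
    float sum = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        sum += a[j * n + i] * b[i];
    partial[threadIdx.x] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        c[j] = partial[0];
}
// Launch (one block per output element): sgemv_t_sketch<<<n, 128>>>(d_a, d_b, d_c, n);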
Dependences in STRSM

for (j = 0; j < n; j++)
  for (k = 0; k < n; k++)
    if (B[j*n+k] != 0.0f) {
      for (i = k+1; i < n; i++)
        B[j*n+i] -= A[k*n+i] * B[j*n+k];
    }

Which loop(s) "carry" dependences?
Which loop(s) is (are) safe to execute in parallel?
Assignment Details
Integrated with the simpleCUBLAS test in the SDK
Reference sequential version provided
1. Rewrite in CUDA
2. Compare performance with the CUBLAS 2.0 library
Performance Issues?
+ Abundant data reuse
- Difficult edge cases
- Different amounts of work for different <j,k> values
- Complex mapping or load imbalance
Latest Research: CUDA-CHiLL
Automatically generate CUDA for an NVIDIA GPU from sequential code plus a script
Abstraction permits parallel threads & staging of data
Heterogeneous code generation: alternative scripts generate CUDA, OpenMP, or sequential code tuned for the memory hierarchy
Results provided by Malik Khan, Gabe Rudy, Chun Chen
Recent Results
What if your matrix is not evenly divided by the tile sizes (64 and 16)?
512x512 matrix sgemm:
  CUBLAS: 258 Gflops
  Our compiler: 284 Gflops
500x500 matrix sgemm:
  CUBLAS: 167 Gflops
  Our compiler: 223 Gflops
CUDA-CHiLL: Higher-Level Interface (in progress)

for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    for (k = 0; k < n; k++)
      c[j][i] += a[k][i]*b[j][k];

init("mm.sp2", "MarkedLoop")
tile_control({"i","j"}, {TI,TJ}, {l1_control="ii", l2_control="jj"}, {"ii", "jj", "i", "j"})
tile_control({"k"}, {TK}, {l1_control="kk"}, {"ii", "jj", "kk", "i", "j", "k"}, strided)
tile_control({"i"}, {TJ}, {l1_control="ty", l1_tile="tx"}, {"ii", "jj", "kk", "tx", "ty", "j", "k"})
--Assign loop levels to thread space and name the kernel
cudaize("mm_GPU", {a=N*N, b=N*N, c=N*N},  --array sizes for data copying
        {block={"ii","jj"}, thread={"tx","ty"}})
--Copy the "c" array usage to registers
copy_to_registers("kk", "c", {"tx","ty"})
copy_to_shared("ty", "b")
--Unroll two innermost loop levels fully
unroll_to_depth(2)

(Gabe Rudy, Master's thesis)
Breaking it Down: Tiling

Original code:
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    for (k = 0; k < n; k++)
      c[j][i] += a[k][i]*b[j][k];

1. Tile the i and j loops:
for (ii = 0; ii < n/TI; ii++)
  for (jj = 0; jj < n/TJ; jj++)
    for (i = ii*TI; i < min(n,(ii+1)*TI); i++)
      for (j = jj*TJ; j < min(n,(jj+1)*TJ); j++)
        for (k = 0; k < n; k++)
          c[j][i] += a[k][i]*b[j][k];

Script (step 1):
1. tile_control({"i","j"}, {TI,TJ}, {l1_control="ii", l2_control="jj"}, {"ii", "jj", "i", "j"})
   tile_control({"k"}, {TK}, {l1_control="kk"}, {"ii", "jj", "kk", "i", "j", "k"}, strided)
   tile_control({"i"}, {TJ}, {l1_control="ty", l1_tile="tx"}, {"ii", "jj", "kk", "tx", "ty", "j", "k"})
   --Assign loop levels to thread space and name the kernel
   cudaize("mm_GPU", {a=N*N, b=N*N, c=N*N},  --array sizes for data copying
           {block={"ii","jj"}, thread={"tx","ty"}})
Breaking it Down: Tiling

Original code:
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    for (k = 0; k < n; k++)
      c[j][i] += a[k][i]*b[j][k];

2. Tile the k loop (strided; move kk above i and j):
for (ii = 0; ii < n/TI; ii++)
  for (jj = 0; jj < n/TJ; jj++)
    for (kk = 0; kk < n; kk += TK)
      for (i = ii*TI; i < min(n,(ii+1)*TI); i++)
        for (j = jj*TJ; j < min(n,(jj+1)*TJ); j++)
          for (k = kk; k < min(n,kk+TK); k++)
            c[j][i] += a[k][i]*b[j][k];

Script (steps 1-2):
1. tile_control({"i","j"}, {TI,TJ}, {l1_control="ii", l2_control="jj"}, {"ii", "jj", "i", "j"})
2. tile_control({"k"}, {TK}, {l1_control="kk"}, {"ii", "jj", "kk", "i", "j", "k"}, strided)
   tile_control({"i"}, {TJ}, {l1_control="ty", l1_tile="tx"}, {"ii", "jj", "kk", "tx", "ty", "j", "k"})
   --Assign loop levels to thread space and name the kernel
   cudaize("mm_GPU", {a=N*N, b=N*N, c=N*N},  --array sizes for data copying
           {block={"ii","jj"}, thread={"tx","ty"}})
Breaking it Down: Tiling

Original code:
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    for (k = 0; k < n; k++)
      c[j][i] += a[k][i]*b[j][k];

3. Create tx and ty by further tiling of i:
for (ii = 0; ii < n/TI; ii++)
  for (jj = 0; jj < n/TJ; jj++)
    for (kk = 0; kk < n; kk += TK)
      for (tx = fn(i,TJ))
        for (ty = fn(i,TJ))
          for (j = jj*TJ; j < min(n,(jj+1)*TJ); j++)
            for (k = kk; k < min(n,kk+TK); k++)
              c[j][…] += a[k][…]*b[j][k];

Script (steps 1-3):
1. tile_control({"i","j"}, {TI,TJ}, {l1_control="ii", l2_control="jj"}, {"ii", "jj", "i", "j"})
2. tile_control({"k"}, {TK}, {l1_control="kk"}, {"ii", "jj", "kk", "i", "j", "k"}, strided)
3. tile_control({"i"}, {TJ}, {l1_control="ty", l1_tile="tx"}, {"ii", "jj", "kk", "tx", "ty", "j", "k"})
   --Assign loop levels to thread space and name the kernel
   cudaize("mm_GPU", {a=N*N, b=N*N, c=N*N},  --array sizes for data copying
           {block={"ii","jj"}, thread={"tx","ty"}})
Breaking it Down: "cudaize"

Original code:
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    for (k = 0; k < n; k++)
      c[j][i] += a[k][i]*b[j][k];

4. Remove the loops representing bx, by, tx, ty:
for (kk = 0; kk < n; kk += TK)
  for (j = jj*TJ; j < min(n,(jj+1)*TJ); j++)
    for (k = kk; k < min(n,kk+TK); k++)
      c[j][?] += a[k][?]*b[j][k];

Script (steps 1-4):
1. tile_control({"i","j"}, {TI,TJ}, {l1_control="ii", l2_control="jj"}, {"ii", "jj", "i", "j"})
2. tile_control({"k"}, {TK}, {l1_control="kk"}, {"ii", "jj", "kk", "i", "j", "k"}, strided)
3. tile_control({"i"}, {TJ}, {l1_control="ty", l1_tile="tx"}, {"ii", "jj", "kk", "tx", "ty", "j", "k"})
   --Assign loop levels to thread space and name the kernel
4. cudaize("mm_GPU", {a=N*N, b=N*N, c=N*N},  --array sizes for data copying
           {block={"ii","jj"}, thread={"tx","ty"}})
Final Result: Copy C to registers, B to shared memory, and unroll

Original code:
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    for (k = 0; k < n; k++)
      c[j][i] += a[k][i]*b[j][k];

Script (steps 5-7):
--Copy the "c" array usage to registers
5. copy_to_registers("kk", "c", {"tx","ty"})
6. copy_to_shared("ty", "b")
--Unroll two innermost loop levels fully
7. unroll_to_depth(2)

Generated kernel (array-section notation):
float P1[16];
__shared__ float P2[16][17];
bx = blockIdx.x, by = blockIdx.y;
tx = threadIdx.x, ty = threadIdx.y;
P1[0:15] = c[16*by:16*by+15][tx+64*bx+16*ty];
for (t6 = 0; t6 <= 1008; t6 += 16) {
  P2[tx][4*ty:4*ty+3] = b[16*by+4*ty:16*by+4*ty+3][tx+t6];
  __syncthreads();
  P1[0:15] += a[t6][64*bx+16*ty+tx]*P2[0][0:15];
  P1[0:15] += a[t6+1][64*bx+16*ty+tx]*P2[1][0:15];
  ...
  P1[0:15] += a[t6+15][64*bx+16*ty+tx]*P2[15][0:15];
}
c[16*by:16*by+15][tx+64*bx+16*ty] = P1[0:15];

B goes into shared memory.
C goes into registers and is copied back at the end.
A is used from registers, but not reused.
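The [lo:hi] ranges above are array-section shorthand: each statement stands for a short loop over the section, which the unroll step then expands fully. For example, the register copy and the shared-memory load correspond roughly to the following (a sketch using the same index names):

// P1[0:15] = c[16*by:16*by+15][tx+64*bx+16*ty];
for (m = 0; m < 16; m++)
    P1[m] = c[16*by + m][tx + 64*bx + 16*ty];

// P2[tx][4*ty:4*ty+3] = b[16*by+4*ty:16*by+4*ty+3][tx+t6];
for (m = 0; m < 4; m++)
    P2[tx][4*ty + m] = b[16*by + 4*ty + m][tx + t6];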
LU Decomposition and Other Matrix Factorizations
Volkov talks about matrix factorization algorithms.
What are these?
  LU decomposition example
  Also, Cholesky and QR in the paper
LU Decomposition (no pivoting)

DO K=1,N-1
  DO I=K+1,N
    A(I,K)=A(I,K)/A(K,K)
    DO J=K+1,N
      A(I,J)=A(I,J)-A(I,K)*A(K,J)
    ENDDO
  ENDDO
ENDDO

The LAPACK version decomposes this into three parts: a mini-LU, (S/D)TRSM, and (S/D)GEMM.
Row pivoting (used for numerical stability) reorders the matrix and the computation.
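To make that decomposition concrete, here is a minimal sketch of a blocked, right-looking LU without pivoting in plain C (row-major indexing, explicit loops instead of BLAS calls, and an illustrative block size NB; this is not the LAPACK or Volkov code itself, only the structure of the three parts named above):

#define NB 32   /* illustrative block size */

/* Factor the n x n matrix A in place (no pivoting). */
void lu_blocked(float *A, int n)
{
    for (int k = 0; k < n; k += NB) {
        int kb = (n - k < NB) ? (n - k) : NB;

        /* 1. mini-LU: unblocked factorization of the panel A[k:n, k:k+kb) */
        for (int p = k; p < k + kb; p++)
            for (int i = p + 1; i < n; i++) {
                A[i*n + p] /= A[p*n + p];                /* multiplier (L) */
                for (int j = p + 1; j < k + kb; j++)
                    A[i*n + j] -= A[i*n + p] * A[p*n + j];
            }

        /* 2. TRSM: solve with the unit lower triangle of the panel to form
              the block row U12 = A[k:k+kb, k+kb:n) */
        for (int j = k + kb; j < n; j++)
            for (int p = k; p < k + kb; p++)
                for (int i = p + 1; i < k + kb; i++)
                    A[i*n + j] -= A[i*n + p] * A[p*n + j];

        /* 3. GEMM: trailing submatrix update A22 -= L21 * U12 */
        for (int i = k + kb; i < n; i++)
            for (int j = k + kb; j < n; j++)
                for (int p = k; p < k + kb; p++)
                    A[i*n + j] -= A[i*n + p] * A[p*n + j];
    }
}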
What's Coming
See the Khan/Rudy poster on Friday!
Before Spring Break:
  Two application case studies from the newly published Kirk and Hwu text
  Sparse linear algebra
  Review for the midterm