L10: Dense Linear Algebra on GPUs CS6963

Administrative Issues
Next assignment, triangular solve
– Due 5PM, Friday, March 5
– handin cs6963 lab 3 ”
Project proposals (discussed today)
– Due 5PM, Wednesday, March 17 (hard deadline)
– A few minutes at end of class to help form groups

Outline
Triangular solve assignment
Project
– Ideas on how to approach
– Construct list of questions
Reading:
– Paper: Volkov, V., and Demmel, J. W. Benchmarking GPUs to tune dense linear algebra. SC08, November 2008.
– Paper, talk slides (sc08talk.pdf), and Volkov code available online.

Triangular Solve (STRSM)

  for (j = 0; j < n; j++)
    for (k = 0; k < n; k++)
      if (B[j*n+k] != 0.0f) {
        for (i = k+1; i < n; i++)
          B[j*n+i] -= A[k*n+i] * B[j*n+k];
      }

Equivalent to:
  cublasStrsm('l' /* left operator */,
              'l' /* lower triangular */,
              'N' /* not transposed */,
              'u' /* unit triangular */,
              N, N, alpha, d_A, N, d_B, N);

Assignment Details:
– Integrated with simpleCUBLAS test in SDK
– Reference sequential version provided
1. Rewrite in CUDA
2. Compare performance with CUBLAS 2.0 library
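As a starting point, here is a minimal, unoptimized CUDA sketch of the reference loop nest above; it is illustrative only, not the intended solution, and the kernel and variable names are hypothetical. It simply exploits the fact that each column j of B is updated independently, so one thread can own one column.

  // Hypothetical naive STRSM kernel: one thread per column j of B.
  // Each thread runs the sequential forward-substitution loop for its column.
  __global__ void strsm_naive(const float* A, float* B, int n)
  {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= n) return;
    for (int k = 0; k < n; k++) {
      float bjk = B[j*n+k];
      if (bjk != 0.0f) {
        for (int i = k+1; i < n; i++)
          B[j*n+i] -= A[k*n+i] * bjk;
      }
    }
  }

This version makes no use of shared memory or data reuse; the point of the assignment is to do much better, as the performance issues on the next slide suggest.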

Performance Issues?
+ Abundant data reuse
- Difficult edge cases
- Different amounts of work for different values
- Complex mapping or load imbalance

Reminder: Outcomes from Last Year’s Course
Paper and poster at Symposium on Application Accelerators for High-Performance Computing (May 4, 2010 submission deadline)
– Poster: Assembling Large Mosaics of Electron Microscope Images using GPU - Kannan Venkataraju, Mark Kim, Dan Gerszewski, James R. Anderson, and Mary Hall
– Paper: GPU Acceleration of the Generalized Interpolation Material Point Method - Wei-Fan Chiang, Michael DeLisi, Todd Hummel, Tyler Prete, Kevin Tew, Mary Hall, Phil Wallstedt, and James Guilkey
Poster at NVIDIA Research Summit
– Poster #47 - Fu, Zhisong, University of Utah (United States): Solving Eikonal Equations on Triangulated Surface Mesh with CUDA
Posters at Industrial Advisory Board meeting
Integrated into Masters theses and PhD dissertations
Jobs and internships

Projects
2-3 person teams
Select project, or I will guide you
– From your research
– From previous classes
– Suggested ideas from faculty, Nvidia (ask me)
Example (published):
– (see previous slide)
Steps
1. Proposal (due Wednesday, March 17)
2. Design Review (in class, April 5 and 7)
3. Poster Presentation (last week of classes)
4. Final Report (due before finals)

1. Project Proposal (due 3/17)
Proposal Logistics:
– Significant implementation, worth 55% of grade
– Each person turns in the proposal (should be same as other team members)
Proposal:
– 3-4 page document (11pt, single-spaced)
– Submit with handin program: “handin cs6963 prop ”

Content of Proposal
I. Team members: Name and a sentence on expertise for each member
II. Problem description
- What is the computation and why is it important?
- Abstraction of computation: equations, graphic or pseudo-code, no more than 1 page
III. Suitability for GPU acceleration
- Amdahl’s Law: describe the inherent parallelism. Argue that it is close to 100% of computation. Use measurements from CPU execution of computation if possible.
- Synchronization and Communication: Discuss what data structures may need to be protected by synchronization, or communication through host.
- Copy Overhead: Discuss the data footprint and anticipated cost of copying to/from host memory.
IV. Intellectual Challenges
- Generally, what makes this computation worthy of a project?
- Point to any difficulties you anticipate at present in achieving high speedup

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana-Champaign
Reminder from L5: Matrix Multiplication in C

  // Matrix multiplication on the (CPU) host in double precision
  void MatrixMulOnHost(float* M, float* N, float* P, int Width)
  {
    for (int i = 0; i < Width; ++i)
      for (int j = 0; j < Width; ++j) {
        double sum = 0;
        for (int k = 0; k < Width; ++k) {
          double a = M[i * Width + k];
          double b = N[k * Width + j];
          sum += a * b;
        }
        P[i * Width + j] = sum;
      }
  }

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign
Tiled Matrix Multiply Using Thread Blocks
– One block computes one square sub-matrix Psub of size BLOCK_SIZE
– One thread computes one element of Psub
– Assume that the dimensions of M and N are multiples of BLOCK_SIZE and square shape
(Figure: matrices M, N, P with tile Psub of size BLOCK_SIZE x BLOCK_SIZE; block indices bx, by and thread indices tx, ty.)
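A minimal sketch of the tiled kernel this slide describes, assuming square matrices whose width is a multiple of BLOCK_SIZE; names and the tile size are illustrative, not the exact ECE 498AL code.

  #define BLOCK_SIZE 16

  __global__ void MatrixMulTiled(const float* M, const float* N, float* P, int Width)
  {
    // Shared-memory tiles of M and N for the current phase.
    __shared__ float Ms[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Ns[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;  // row of P this thread computes
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;  // column of P this thread computes
    float sum = 0.0f;

    for (int t = 0; t < Width / BLOCK_SIZE; ++t) {
      // Each thread loads one element of the M tile and one of the N tile.
      Ms[threadIdx.y][threadIdx.x] = M[row * Width + t * BLOCK_SIZE + threadIdx.x];
      Ns[threadIdx.y][threadIdx.x] = N[(t * BLOCK_SIZE + threadIdx.y) * Width + col];
      __syncthreads();
      for (int k = 0; k < BLOCK_SIZE; ++k)
        sum += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
      __syncthreads();
    }
    P[row * Width + col] = sum;  // one thread writes one element of Psub
  }

Each element brought into shared memory is reused BLOCK_SIZE times by the threads of the block; the following slides extend this reuse argument from shared memory to registers.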

Slide source: Stan Tomov, UTK.

Preview: Dense Linear Algebra on GPUs
SGEMM result is not from algorithm of L5. Why? Significant reuse can be managed within registers.

Comparison:
Threading
– GPU rule of thumb: Generate lots of threads (up to 512/block) to hide memory latency.
– This lecture (SGEMM): Only 64 threads/block provides 2 warps, sufficient to hide latency plus conserves registers.
Shared memory
– GPU rule of thumb: Use to exploit reuse across threads.
– This lecture (SGEMM): Use to communicate shared data across threads.
Registers
– GPU rule of thumb: Use for temporary per-thread data.
– This lecture (SGEMM): Use to exploit significant reuse within a thread.

Volkov Slides 3-17

My Research (with USC/ISI and Argonne): What is CHiLL?
High-level loop transformation and code generation framework
– based on polyhedral model
– script interface for programmers or compilers
– optimization strategy expressed as sequence of composable transformations

mxm.c (also supports Fortran as input):
  for(i=0; i<10; i++)
    for(j=0; j<10; j++)
      for(k=0; k<10; k++)
        c[i][j] += a[i][k]*b[k][j];

CHiLL script:
  source: mxm.c
  procedure: 0
  loop: 0
  permute([2,1,3])
  unroll(0,2,2)
  unroll(0,3,2)

mxm213.out:
  for(j=0; j<10; j++)
    for(i=0; i<10; i+=2)
      for(k=0; k<10; k+=2) {
        c[i][j] += a[i][k]*b[k][j];
        c[i][j] += a[i][k+1]*b[k+1][j];
        c[i+1][j] += a[i+1][k]*b[k][j];
        c[i+1][j] += a[i+1][k+1]*b[k+1][j];
      }

Latest Research: CUDA-CHiLL
– Automatically generate CUDA for NVIDIA GPU from sequential code plus script
– Abstraction permits parallel threads & staging of data
– Heterogeneous code generation: alternative scripts generate CUDA, OpenMP or sequential code tuned for memory hierarchy
Results provided by Malik Khan, Gabe Rudy, Chun Chen

CUDA-CHiLL Automatically-Generated Code

Original sequential code:
  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      for (k = 0; k < n; k++)
        c[j][i] += a[k][i]*b[j][k];

Script:
  original()
  tile(0,1,TI,1,counted)
  tile(0,3,TJ,2,counted)
  tile(0,5,TK,3,strided)
  tile(0,4,TK,4,counted)
  tile(0,5,1,5,counted)
  tile(0,5,1,4,counted)
  datacopy privatized(0,3,c,[4,5],false,-1,1,1)
  datacopy(0,4,b,false,0,1,-16)
  tile(3,4,(TJ*TK)/TI,4,counted)
  tile(3,6,1,4,counted)
  unroll(1,5,0)
  unroll(2,5,0)
  unroll(3,6,0)
  unroll(0,8,0)
  unroll(0,9,0)
  cudaize(mm GPU,0,1,n/TI,n/TJ,n,global)
  cudathread([0,1],TJ,TI/TJ)
  cudathread([1,1,0,1],TJ,TI/TJ,sync)
  cudathread([1,1,1,1],TJ,TI/TJ,sync)
  cudathread([2,1],TJ,TI/TJ)

Automatically-generated CUDA code:
  float P1[16];
  __shared__ float P2[16][17];
  bx = blockIdx.x, by = blockIdx.y;
  tx = threadIdx.x, ty = threadIdx.y;
  P1[0:15] = c[16*by:16*by+15][tx+64*bx+16*ty];
  for (t6 = 0; t6 <= 1008; t6 += 16) {
    P2[tx][4*ty:4*ty+3] = b[16*by+4*ty:16*by+4*ty+3][tx+t6];
    __syncthreads();
    P1[0:15] += a[t6][64*bx+16*ty+tx]*P2[0][0:15];
    P1[0:15] += a[t6+1][64*bx+16*ty+tx]*P2[1][0:15];
    ...
    P1[0:15] += a[t6+15][64*bx+16*ty+tx]*P2[15][0:15];
    __syncthreads();
  }
  c[16*by:16*by+15][tx+64*bx+16*ty] = P1[0:15];

Results provided by Malik Khan, Gabe Rudy, Chun Chen

init("mm.sp2", "MarkedLoop") tile_control({"i","j"}, {TI,TJ}, {l1_control="ii", l2_control="jj"}, {"ii", "jj", "i", "j"}) tile_control({"k"}, {TK}, {l1_control="kk"}, {"ii", "jj", "kk","i", "j","k"}, strided) tile_control({"i"}, {TJ}, {l1_control="ty",l1_tile="tx"}, {"ii", "jj", "kk","tx","ty","j","k"}) --Assign loop levels to thread space and name the kernel cudaize("mm_GPU", {a=N*N, b=N*N, c=N*N},--array sizes for data copying {block={"ii","jj"}, thread={"tx","ty"}}) --Copy the "c" array usage to registers copy_to_registers("kk", "c", {"tx","ty"}) copy_to_shared("ty", "b") --Unroll two innermost loop levels fully unroll_to_depth(2) for (i = 0; i < n; i++) for (j = 0; j < n; j++) for (k = 0; k < n; k++) c[j][i] += a[k][i]*b[j][k]; CUDA-CHiLL: Higher-Level Interface (in progress) 19 L10: Dense Linear Algebra Gabe Rudy Master’s thesis CS6963

Final Result: Copy C to Registers, B to Shared Memory, and Unroll

Steps 1 through 4 tile (for computation, data), then:
  --Copy the "c" array usage to registers
  5. copy_to_registers("kk", "c", {"tx","ty"})
  6. copy_to_shared("ty", "b")
  --Unroll two innermost loop levels fully
  7. unroll_to_depth(2)

Original sequential code:
  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      for (k = 0; k < n; k++)
        c[j][i] += a[k][i]*b[j][k];

Generated code:
  float P1[16];                      // C goes into registers and is copied back at end
  __shared__ float P2[16][17];       // B goes into shared memory
  bx = blockIdx.x, by = blockIdx.y;
  tx = threadIdx.x, ty = threadIdx.y;
  P1[0:15] = c[16*by:16*by+15][tx+64*bx+16*ty];
  for (t6 = 0; t6 <= 1008; t6 += 16) {
    P2[tx][4*ty:4*ty+3] = b[16*by+4*ty:16*by+4*ty+3][tx+t6];
    __syncthreads();
    P1[0:15] += a[t6][64*bx+16*ty+tx]*P2[0][0:15];
    P1[0:15] += a[t6+1][64*bx+16*ty+tx]*P2[1][0:15];
    ...
    P1[0:15] += a[t6+15][64*bx+16*ty+tx]*P2[15][0:15];
    __syncthreads();
  }
  c[16*by:16*by+15][tx+64*bx+16*ty] = P1[0:15];

Next Class
– See Khan/Rudy poster on Friday!
– More complex dense linear algebra
– Some specific global synchronization strategies to support these