L11: Sparse Linear Algebra on GPUs CS6963. Administrative Issues Next assignment, triangular solve – Due 5PM, Tuesday, March 15 – handin cs6963 lab 3.

Slides:

Advertisements

Similar presentations

Dense Linear Algebra (Data Distributions) Sathish Vadhiyar.

Advertisements

Instructor Notes We describe motivation for talking about underlying device architecture because device architecture is often avoided in conventional.

Scalable Multi-Cache Simulation Using GPUs Michael Moeng Sangyeun Cho Rami Melhem University of Pittsburgh.

Instructor Notes This lecture discusses three important optimizations The performance impact of mapping threads to data on the GPU is subtle but extremely.

Appendix A. Appendix A — 2 FIGURE A.2.1 Historical PC. VGA controller drives graphics display from framebuffer memory. Copyright © 2009 Elsevier, Inc.

Appendix A — 1 FIGURE A.2.2 Contemporary PCs with Intel and AMD CPUs. See Chapter 6 for an explanation of the components and interconnects in this figure.

Sparse LU Factorization for Parallel Circuit Simulation on GPU Ling Ren, Xiaoming Chen, Yu Wang, Chenxi Zhang, Huazhong Yang Department of Electronic Engineering,

Benchmarking Sparse Matrix-Vector Multiply In 5 Minutes Hormozd Gahvari, Mark Hoemmen, James Demmel, and Kathy Yelick January 21, 2007 Hormozd Gahvari,

Revisiting a slide from the syllabus: CS 525 will cover Parallel and distributed computing architectures – Shared memory processors – Distributed memory.

L15: Review for Midterm. Administrative Project proposals due today at 5PM (hard deadline) – handin cs6963 prop March 31, MIDTERM in class L15: Review.

Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.

L13: Review for Midterm. Administrative Project proposals due Friday at 5PM (hard deadline) No makeup class Friday! March 23, Guest Lecture Austin Robison,

CUDA Programming Lei Zhou, Yafeng Yin, Yanzhi Ren, Hong Man, Yingying Chen.

L8: Memory Hierarchy Optimization, Bandwidth CS6963.

L12: Sparse Linear Algebra on GPUs CS6963. Administrative Issues Next assignment, triangular solve – Due 5PM, Monday, March 8 – handin cs6963 lab 3 ”

Linear Algebra on GPUs Vasily Volkov. GPU Architecture Features SIMD architecture – Don’t be confused by scalar ISA which is only a program model We use.

Accelerating SYMV kernel on GPUs Ahmad M Ahmad, AMCS Division, KAUST

Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU Presented by: Ahmad Lashgar ECE Department, University of Tehran.

An approach for solving the Helmholtz Equation on heterogeneous platforms An approach for solving the Helmholtz Equation on heterogeneous platforms G.

L11: Sparse Linear Algebra on GPUs CS Sparse Linear Algebra 1 L11: Sparse Linear Algebra CS6235

L10/L11: Sparse Linear Algebra on GPUs CS6235. Administrative Issues Next assignment, triangular solve – Due 5PM, Tuesday, March 5 – handin cs6235 lab.

L21: “Irregular” Graph Algorithms November 11, 2010.

CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA

1 Chapter 04 Authors: John Hennessy & David Patterson.

CS 395 Last Lecture Summary, Anti-summary, and Final Thoughts.

CS6235 Review for Midterm. 2 CS6235 Administrative Pascal will meet the class on Wednesday -I will join at the beginning for questions on test Midterm.

Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors

Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations Xin Huo, Vignesh T. Ravi, Gagan Agrawal Department of Computer Science and Engineering.

CUDA Performance Considerations (1 of 2) Patrick Cozzi University of Pennsylvania CIS Spring 2012.

© 2007 SET Associates Corporation SAR Processing Performance on Cell Processor and Xeon Mark Backues, SET Corporation Uttam Majumder, AFRL/RYAS.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 CS 395 Winter 2014 Lecture 17 Introduction to Accelerator.

L17: Introduction to “Irregular” Algorithms and MPI, cont. November 8, 2011.

Fast Support Vector Machine Training and Classification on Graphics Processors Bryan Catanzaro Narayanan Sundaram Kurt Keutzer Parallel Computing Laboratory,

L20: Introduction to “Irregular” Algorithms November 9, 2010.

09/24/2010CS4961 CS4961 Parallel Programming Lecture 10: Thread Building Blocks Mary Hall September 24,

JAVA AND MATRIX COMPUTATION

CUDA Performance Considerations (1 of 2) Patrick Cozzi University of Pennsylvania CIS Spring 2011.

Joseph L. GreathousE, Mayank Daga AMD Research 11/20/2014

Debunking the 100X GPU vs CPU Myth: An Evaluation of Throughput Computing on CPU and GPU Victor W. Lee, et al. Intel Corporation ISCA ’10 June 19-23, 2010,

Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Performance Analysis of Parallel Sparse LU Factorization.

Linear Algebra Libraries: BLAS, LAPACK, ScaLAPACK, PLASMA, MAGMA

ACCELERATING QUERY-BY-HUMMING ON GPU Pascal Ferraro, Pierre Hanna, Laurent Imbert, Thomas Izard ISMIR 2009 Presenter: Chung-Che Wang (Focus on the performance.

Debunking the 100X GPU vs. CPU Myth An Evaluation of Throughput Computing on CPU and GPU Present by Chunyi Victor W Lee, Changkyu Kim, Jatin Chhugani,

IMP: Indirect Memory Prefetcher

Weekly Report- Reduction Ph.D. Student: Leo Lee date: Oct. 30, 2009.

University of Michigan Electrical Engineering and Computer Science Adaptive Input-aware Compilation for Graphics Engines Mehrzad Samadi 1, Amir Hormati.

Sparse Matrix-Vector Multiply on the Keystone II Digital Signal Processor Yang Gao, Fan Zhang and Dr. Jason D. Bakos 2014 IEEE High Performance Extreme.

Irregular Applications –Sparse Matrix Vector Multiplication

L20: Sparse Matrix Algorithms, SIMD review November 15, 2012.

CS6963 L13: Application Case Study III: Molecular Visualization and Material Point Method.

My Coordinates Office EM G.27 contact time:

Institute of Software,Chinese Academy of Sciences An Insightful and Quantitative Performance Optimization Chain for GPUs Jia Haipeng.

Large-scale geophysical electromagnetic imaging and modeling on graphical processing units Michael Commer (LBNL) Filipe R. N. C. Maia (LBNL-NERSC) Gregory.

P&H Ap. A GPUs for Graphics and Computing. Appendix A — 2 FIGURE A.2.1 Historical PC. VGA controller drives graphics display from framebuffer memory.

Appendix C Graphics and Computing GPUs

Scalpel: Customizing DNN Pruning to the

Lecture 13 Sparse Matrix-Vector Multiplication and CUDA Libraries

CS427 Multicore Architecture and Parallel Computing

Lecture 13 Sparse Matrix-Vector Multiplication and CUDA Libraries

L11: Dense Linear Algebra on GPUs, cont.

University of Michigan, Ann Arbor

Linchuan Chen, Peng Jiang and Gagan Agrawal

DRAM Bandwidth Slide credit: Slides adapted from

L4: Memory Hierarchy Optimization II, Locality and Data Placement

EE 4xx: Computer Architecture and Performance Programming

Review for Midterm.

Dense Linear Algebra (Data Distributions)

© 2012 Elsevier, Inc. All rights reserved.

6- General Purpose GPU Programming

Presentation transcript:

L11: Sparse Linear Algebra on GPUs CS6963

Administrative Issues Next assignment, triangular solve – Due 5PM, Tuesday, March 15 – handin cs6963 lab 3 ” Project proposals – Due 5PM, Wednesday, March 7 (hard deadline) – handin cs6963 prop CS L12: Sparse Linear Algebra

Triangular Solve (STRSM) for (j = 0; j < n; j++) for (k = 0; k < n; k++) if (B[j*n+k] != 0.0f) { for (i = k+1; i < n; i++) B[j*n+i] -= A[k * n + i] * B[j * n + k]; } Equivalent to: cublasStrsm('l' /* left operator */, 'l' /* lower triangular */, 'N' /* not transposed */, ‘u' /* unit triangular */, N, N, alpha, d_A, N, d_B, N); See: 3 L11: Dense Linear Algebra CS6963

A Few Details C stores multi-dimensional arrays in row major order Fortran (and MATLAB) stores multi- dimensional arrays in column major order – Confusion alert: BLAS libraries were designed for FORTRAN codes, so column major order is implicit in CUBLAS! 4 L11: Dense Linear Algebra CS6963

Dependences in STRSM for (j = 0; j < n; j++) for (k = 0; k < n; k++) if (B[j*n+k] != 0.0f) { for (i = k+1; i < n; i++) B[j*n+i] -= A[k * n + i] * B[j * n + k]; } Which loop(s) “carry” dependences? Which loop(s) is(are) safe to execute in parallel? 5 L11: Dense Linear Algebra CS6963

Assignment Details: – Integrated with simpleCUBLAS test in SDK – Reference sequential version provided 1. Rewrite in CUDA 2. Compare performance with CUBLAS library 6 L11: Dense Linear Algebra CS6963

Performance Issues? + Abundant data reuse - Difficult edge cases - Different amounts of work for different values - Complex mapping or load imbalance 7 L11: Dense Linear Algebra CS6963

Outline Next assignment For your projects: – “Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU”, Lee et al., ISCA Sparse Linear Algebra Readings: – “Implementing Sparse Matrix-Vector Multiplication on Throughput Oriented Processors,” Bell and Garland (Nvidia), SC09, Nov – “Model-driven Autotuning of Sparse Matrix-Vector Multiply on GPUs”, Choi, Singh, Vuduc, PPoPP 10, Jan – “Optimizing sparse matrix-vector multiply on emerging multicore platforms,” Journal of Parallel Computing, 35(3): , March (Expanded from SC07 paper.) CS L11: Sparse Linear Algebra

Overview: CPU and GPU Comparisons Many projects will compare speedup over a sequential CPU implementation – Ok for this class, but not for a research contribution Is your CPU implementation as “smart” as your GPU implementation? – Parallel? – Manages memory hierarchy? – Minimizes synchronization or accesses to global memory?

The Comparison Architectures – Intel i7, quad-core, 3.2GHz, 2-way hyper- threading, SSE, 32KB L1, 256KB L2, 8MB L3 – Same i7 with Nvidia GTX 280 Workload – 14 benchmarks, some from the GPU literature

Architectural Comparison Core i7-960GTX280 Number PEs430 Frequency (GHz) Number Transistors0.7B1.4B BW (GB/sec)32141 SP SIMD width48 DP SIMD width21 Peak SP Scalar FLOPS (GFLOPS) Peak SP SIMD Flops (GFLOPS) /933.1 Peak DP SIMD Flops (GFLOPS)

Workload Summary

Performance Results

CPU optimization Tile for cache utilization SIMD execution on multimedia extensions Multi-threaded, beyond number of cores Data reorganization to improve SIMD performance

Sparse Linear Algebra Suppose you are applying matrix-vector multiply and the matrix has lots of zero elements – Computation cost? Space requirements? General sparse matrix representation concepts – Primarily only represent the nonzero data values – Auxiliary data structures describe placement of nonzeros in “dense matrix” 15 L11: Sparse Linear Algebra CS6963

GPU Challenges Computation partitioning? Memory access patterns? Parallel reduction BUT, good news is that sparse linear algebra performs TERRIBLY on conventional architectures, so poor baseline leads to improvements! 16 L12: Sparse Linear Algebra CS6963

Some common representations [] A = data = * 1 7 * * [] 1 7 * 2 8 * * [] 0 1 * 1 2 * * [] offsets = [-2 0 1] data =indices = ptr = [ ] indices = [ ] data = [ ] row = [ ] indices = [ ] data = [ ] DIA: Store elements along a set of diagonals. Compressed Sparse Row (CSR): Store only nonzero elements, with “ptr” to beginning of each row and “indices” representing column. ELL: Store a set of K elements per row and pad as needed. Best suited when number non-zeros roughly consistent across rows. COO: Store nonzero elements and their corresponding “coordinates”.

CSR Example for (j=0; j<nr; j++) { for (k = ptr[j]; k<ptr[j+1]-1; k++) t[j] = t[j] + data[k] * x[indices[k]]; 18 L11: Sparse Linear Algebra CS6963

Summary of Representation and Implementation Bytes/Flop Kernel Granularity Coalescing 32-bit 64-bit DIA thread : row full 4 8 ELL thread : row full 6 10 CSR(s) thread : row rare 6 10 CSR(v) warp : row partial 6 10 COO thread : nonz full 8 12 HYB thread : row full 6 10 Table 1 from Bell/Garland: Summary of SpMV kernel properties. 19 L12: Sparse Linear Algebra CS6963

Other Representation Examples Blocked CSR – Represent non-zeros as a set of blocks, usually of fixed size – Within each block, treat as dense and pad block with zeros – Block looks like standard matvec – So performs well for blocks of decent size Hybrid ELL and COO – Find a “K” value that works for most of matrix – Use COO for rows with more nonzeros (or even significantly fewer) 20 L11: Sparse Linear Algebra CS6963

Stencil Example What is a 3-point stencil? 5-point stencil? 7-point? 9-point? 27-point? Examples: a[i] = (b[i-1] + b[i+1])/2; a[i][j] = (b[i-1][j] + b[i+1][j] + b[i][j-1] + b[i][j+1])/4; How is this represented by a sparse matrix? 21 L11: Sparse Linear Algebra CS6963

Stencil Result (structured matrices) See Figures 11 and 12, Bell and Garland 22 L11: Sparse Linear Algebra CS6963

Unstructured Matrices See Figures 13 and 14 Note that graphs can also be represented as sparse matrices. What is an adjacency matrix? 23 L11: Sparse Linear Algebra CS6963

PPoPP paper What if you customize the representation to the problem? Additional global data structure modifications (like blocked representation)? Strategy – Apply models and autotuning to identify best solution for each application 24 L11: Sparse Linear Algebra CS6963

Summary of Results BELLPACK (blocked ELLPACK) achieves up to 29 Gflop/s in SP and 15.7 Gflop/s in DP Up to 1.8x and 1.5x improvement over Bell and Garland. 25 L11: Sparse Linear Algebra CS6963

This Lecture Exposure to the issues in a sparse matrix vector computation on GPUs A set of implementations and their expected performance A little on how to improve performance through application-specific knowledge and customization of sparse matrix representation 26 L11: Sparse Linear Algebra CS6963

What’s coming Next time: Application case study from Kirk and Hwu (Ch. 8, real-time MRI) Wednesday, March 2: two guest speakers from last year’s class – BOTH use sparse matrix representation! – Shreyas Ramalingam: program analysis on GPUs – Pascal Grosset: graph coloring on GPUs CS L11: Sparse Linear Algebra