Linear Algebra on GPUs
Vasily Volkov

GPU Architecture Features
SIMD architecture
– Don't be confused by the scalar ISA, which is only a programming model; we use the vector-thread programming model instead
– Similar to SSE/Cell SPE/vector units, not to superscalar units
– Native vector length is 32; larger lengths can be simulated (thread blocks)
Multithreading using register windows
– A context switch never flushes registers to memory
– If more threads are created than fit, some don't start until others finish
Huge register files, small high-latency caches
– Fundamental difference from CPUs; similar to vector processors
– Caches exist to save memory bandwidth, not to reduce latency
– Use register blocking, not cache blocking
Large startup costs (≈5 μs)
– Not good for fine-grain computations
– Use global barriers instead? (≈1.5 μs)

Relation to Other Multicores

                  Core2 SSE, 2.66GHz   Cell SPEs   G80/G84
Gflop/s/core      21.3                 25.6        19-23
Vector length     4                    4           32
# of cores        (values not preserved in the transcript)

All offer both multithreading and SIMD capabilities.
Use CUDA to program all of them?

Pointer Chase Benchmark, 8800GTX
Method: run k = A[k] in a loop in a single scalar thread
– Latency bound; larger latency at a cache miss ⇒ reveals cache sizes
Findings:
– 8-word cache line
– L1: 8 x 5kB, each 20-way associative
– L2: 6 x 32kB, each 4-way associative
– 512kB memory pages
– TLB: 16 entries, fully associative
8800GTX/8800GTS/8600GTS/FX5600: different numbers of similar caches
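Below is a minimal sketch of this kind of pointer-chase microbenchmark, written against the current CUDA runtime API rather than the original code; the array size, stride, and iteration count are illustrative knobs to sweep, and timing uses CUDA events.

```cuda
// Pointer-chase sketch: one scalar thread follows k = A[k] through a cyclic
// chain, so total time / #loads approximates the load-to-use latency at the
// chosen working-set size and stride. Error checking omitted for brevity.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void pointer_chase(const int *A, int iters, int *out)
{
    int k = 0;
    for (int i = 0; i < iters; ++i)
        k = A[k];              // each load depends on the previous one
    *out = k;                  // keep the result so the loop is not optimized away
}

int main()
{
    const int n = 1 << 22;     // working-set size in elements (sweep this)
    const int stride = 32;     // stride in elements (sweep this too)
    int *h_A = new int[n];
    for (int i = 0; i < n; ++i)
        h_A[i] = (i + stride) % n;   // cyclic chain with the given stride

    int *d_A, *d_out;
    cudaMalloc(&d_A, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_A, h_A, n * sizeof(int), cudaMemcpyHostToDevice);

    const int iters = 1 << 20;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    pointer_chase<<<1, 1>>>(d_A, iters, d_out);   // one scalar thread: latency bound
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average latency: %.1f ns per load\n", ms * 1e6f / iters);

    delete[] h_A;
    cudaFree(d_A);
    cudaFree(d_out);
    return 0;
}
```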

GPU Memory System (8800GTX)

Matrix-Matrix Multiply, C = C + AB
64x16 blocks in C, rank-4 updates
– Ensures that our code is compute-bound
Each thread computes one block of C
– ijk/jik form; other choices produce a race condition in C
Keep A's and C's blocks in vector registers
– Similarly done on the IBM 3090 VF and Cray X1
Keep B's block in local memory
– Others keep it in scalar registers (n/a on the GPU) or in caches (slow)
Use compiler options to enforce a tight register budget
– To hide memory latencies better by multithreading
Use prefetching to hide latencies even better
– Now performance is not bound by the latency and bandwidth of reading blocks of A and B!
– Bound by instruction issue and local memory ops (230 Gflop/s)
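A simplified sketch of this scheme follows, assuming column-major storage and matrix dimensions that are multiples of the block sizes. It is not the original kernel: the prefetching, the register-budget compiler options, and edge-case handling mentioned above are omitted.

```cuda
// Register-blocked SGEMM sketch in the spirit of the scheme above: each thread
// block computes a 64x16 tile of C, each thread keeps one 1x16 strip of C in
// registers, the 4x16 slice of B is staged in shared ("local") memory, and the
// inner loop performs rank-4 updates. Column-major storage; m, n, k assumed to
// be multiples of 64, 16 and 4 respectively.
__global__ void sgemm_64x16(int m, int n, int k,
                            const float *A, int lda,
                            const float *B, int ldb,
                            float       *C, int ldc)
{
    __shared__ float bs[4][16];           // current 4x16 slice of B

    const int t    = threadIdx.x;         // 64 threads per block
    const int row  = blockIdx.x * 64 + t; // this thread's row of C
    const int col0 = blockIdx.y * 16;     // first column of the C tile

    float c[16];                          // 1x16 strip of C kept in registers
    for (int j = 0; j < 16; ++j) c[j] = 0.0f;

    for (int p = 0; p < k; p += 4) {      // rank-4 updates over the k dimension
        // 64 threads cooperatively load the 4x16 slice of B
        bs[t & 3][t >> 2] = B[(p + (t & 3)) + (col0 + (t >> 2)) * ldb];
        __syncthreads();

        // 1x4 strip of A stays in registers
        float a0 = A[row + (p + 0) * lda];
        float a1 = A[row + (p + 1) * lda];
        float a2 = A[row + (p + 2) * lda];
        float a3 = A[row + (p + 3) * lda];

        for (int j = 0; j < 16; ++j)
            c[j] += a0 * bs[0][j] + a1 * bs[1][j] + a2 * bs[2][j] + a3 * bs[3][j];
        __syncthreads();
    }

    for (int j = 0; j < 16; ++j)          // C = C + A*B
        C[row + (col0 + j) * ldc] += c[j];
}

// Launch: sgemm_64x16<<<dim3(m / 64, n / 16), 64>>>(m, n, k, dA, lda, dB, ldb, dC, ldc);
```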

Performance of SGEMM
CUBLAS 1.1:
– keeps square blocks of A and B in local memory
– uses long vectors (poor instruction mix)
– exposing too much data parallelism may cost you
Our SGEMM is now in the CUBLAS 2.0 beta

SGEMM, 8800GTX, k = 1024
– Constant work per vector thread (a function of k)
– The optimized version does better load balancing by computing partial sums

Panel Factorization
– CPU: runtime on Core2 Duo 2.66GHz, Intel MKL 10.0 (includes CPU–GPU transfer!)
– GPU: estimated for 8800GTX as [formula not preserved in the transcript]

Design of Matrix Factorizations
Right-looking scheme = most parallelism = best on 16-core GPUs
Crout scheme = least bandwidth = best on 4-core GPUs and if using CUBLAS 1.1
Left-looking scheme = half of the work in triangular solve = limited parallelism = inefficient
2-level blocking
– Both levels are right-looking + premature exit from the finer level to keep
– Up to 6% speedup only, at large matrices (n ≈ 10,000)
Block size on the GPU is 64 (same as in matrix multiply)
– Autotuning in QR (up to 7% speedup)
Row-major layout on the GPU in LU decomposition
– Since gathers with large stride are inefficient
– Requires transposition at every CPU–GPU transfer
– >2x speedup!
Panel factorization on the CPU overlapped with BLAS3 on the GPU (use lookahead)
Multiply by inverse (GEMM) instead of triangular solve (TRSM); see the sketch below
– TRSM vs. GEMM is 13 Gflop/s vs. 160 Gflop/s for a 64x64 matrix
– Parallelism in TRSM is not embarrassing enough
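As an illustration of the last point, here is a host-side sketch of replacing the 64x64 triangular solve with a GEMM against an explicitly inverted diagonal block. It uses the current cuBLAS API for clarity (the original work predates it); the buffer names and the hypothetical assumption that inv(L11) has already been computed and copied to the GPU are illustrative only.

```cuda
// Sketch: update the row panel U12 = inv(L11) * A12 with a GEMM instead of a
// TRSM, because a 64x64 triangular solve exposes far less parallelism than the
// equivalent matrix multiply on this hardware. Error checking omitted.
#include <cublas_v2.h>

void update_row_panel(cublasHandle_t handle,
                      const float *dL11inv,  // 64x64 inv(L11), already on the GPU
                      const float *dA12,     // 64 x ncols block of the trailing matrix
                      float       *dU12,     // 64 x ncols output buffer
                      int ncols, int ld)
{
    const float one = 1.0f, zero = 0.0f;

    // U12 = 1.0 * inv(L11) * A12 + 0.0 * U12
    // (instead of cublasStrsm with the unit-lower-triangular L11)
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                64, ncols, 64,
                &one,  dL11inv, 64,
                       dA12,    ld,
                &zero, dU12,    ld);
}
```

Unlike TRSM, the GEMM cannot overwrite its input in place, hence the separate output buffer in this sketch.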

Test Platform
GPU:
– Core2 Duo 2.67GHz + GeForce 8800 GTX
– Core2 Duo 2.67GHz + two GeForce 8800 GTX
CPU:
– Core2 Duo 2.67GHz
– Core2 Quad 2.4GHz

Summary of Performance

Speedups vs. CPU

Summary of Speedups
[Table: Gflop/s and speedup of the 8800GTX vs. Core2 Duo and Core2 Quad for Cholesky, LU, QR, and SGEMM; the numeric values were not preserved in the transcript.]

Speedup Using 2 GPUs
– Using a column-cyclic layout
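A hypothetical sketch of what a column-cyclic distribution over two GPUs can look like: block column j of width nb goes to GPU j % 2 and is stored at local block index j / 2. The names, the use of cudaMemcpy2D, and the assumption of column-major host storage with ld_local >= n are illustrative, not the original code.

```cuda
// Column-cyclic distribution sketch: GPU (j % 2) owns global block column j,
// so the trailing-matrix update after each panel is split roughly evenly
// between the two devices. Error checking omitted.
#include <cuda_runtime.h>

void distribute_column_cyclic(const float *A, int n, int lda, int nb,
                              float *d_local[2], int ld_local)
{
    int nblocks = (n + nb - 1) / nb;
    for (int j = 0; j < nblocks; ++j) {
        int gpu   = j % 2;            // owner of global block column j
        int jloc  = j / 2;            // its position in the owner's local storage
        int width = (j == nblocks - 1) ? n - j * nb : nb;
        cudaSetDevice(gpu);
        cudaMemcpy2D(d_local[gpu] + (size_t)jloc * nb * ld_local,
                     (size_t)ld_local * sizeof(float),
                     A + (size_t)j * nb * lda,
                     (size_t)lda * sizeof(float),
                     (size_t)n * sizeof(float),   // one column = n floats
                     width,                       // number of columns in this block
                     cudaMemcpyHostToDevice);
    }
}
```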

Breakdown of runtime (LU)

What if we omit one optimization in LU?

Other Work Done
Tridiagonal eigenvalue solver (bisection); see the sign-count sketch below
– Most work: factorize A − σ_i I = LDL^T, count signs in D (compute bound)
– Done for many shifts σ_i in parallel; traditionally vectorized
– If more parallelism is needed, do multisection instead of bisection, but it increases the total flop count
– The rest is difficult to parallelize and is not worth it
– Our solution:
  Run the vectorized loops on the GPU, the rest (least work) on the CPU
  Autotune to decide the optimal redundancy and when to involve the CPU
  Use features of IEEE arithmetic to save another 15-30% of runtime
  Up to 156x faster than LAPACK on a Pentium 4
Tridiagonal eigenvector solver (inverse iteration)
– Most work: solve (A − λ_i I) x_{k+1} = x_k for fixed λ_i (bandwidth bound)
– Factorize A − λ_i I = LDL^T once, keep D only; reconstruct L as needed
  Reconstruction is overlapped with memory access; still bandwidth bound
– Don't pivot; recompute using safe code if it fails (do it on the CPU)
– Up to 120x faster than LAPACK on Core2 Duo so far
– More complicated when eigenvalues are clustered
Stencil computation (7-point on a 3D grid)
– Blocks in registers and local memory
– Bandwidth-bound; runs at up to 66% of pin bandwidth
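As an illustration of the bisection kernel's inner step, here is a sketch of a Sturm-count kernel: one thread per shift runs the LDL^T pivot recurrence for A − σ_i I on a symmetric tridiagonal matrix and counts negative pivots. It is not the original code; the zero-pivot safeguards and the IEEE-arithmetic tricks mentioned above are omitted.

```cuda
// Sturm-count sketch for a symmetric tridiagonal matrix with diagonal d[0..n-1]
// and off-diagonal e[0..n-2]: thread i takes shift sigma[i], runs the LDL^T
// pivot recurrence for A - sigma[i]*I, and counts negative pivots, which equals
// the number of eigenvalues of A below sigma[i].
__global__ void sturm_count(const float *d, const float *e, int n,
                            const float *sigma, int *count, int nshifts)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nshifts) return;

    float s = sigma[i];
    float q = d[0] - s;                 // first pivot of LDL^T
    int   c = (q < 0.0f);

    for (int j = 1; j < n; ++j) {
        q = d[j] - s - e[j - 1] * e[j - 1] / q;   // next pivot
        c += (q < 0.0f);
    }
    count[i] = c;                       // #eigenvalues of A below sigma[i]
}
```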

Future Work
Analysis of architecture
– Find the best parallels in past architectures in order to reuse their methods
– Catching up with newer GPUs
– More micro-benchmarking to get better performance models
More scientific kernels
– CUFFT is ≈50 Gflop/s; can do better (e.g. by not computing sin/cos in the inner loop)
More LAPACK
– Two-sided factorizations used in eigensolvers and the SVD
  In LAPACK, 50% of the work is in BLAS1/BLAS2
  A mostly-BLAS3 algorithm is known, but it requires more flops if eigenvectors are needed
  May use divide-and-conquer instead
– MRRR (improved inverse iteration algorithm, also rich in parallelism)
– Non-symmetric eigensolvers such as QR iteration: currently fine-grained, can we do better?
– Iterative refinement for the eigenvalue problem?
ScaLAPACK (distributed-memory LAPACK)
– One-sided factorizations on a GPU cluster