Accelerating the SYMV Kernel on GPUs
Ahmad M Ahmad, AMCS Division, KAUST

Agenda
◦ Motivation
◦ GPU Technology
◦ GPU Optimization Issues
◦ The MAGMA SYMV Kernel
◦ The New SYMV Kernel
◦ Performance Results
◦ What Helped Us?
◦ Future Work

Motivation
GPUs are invading the HPC community.
◦ Many cores (~512) on a single GPU card.
◦ Best suited for massively (embarrassingly) parallel problems.
◦ Unlike CPUs, they dedicate more silicon to floating-point operations.
◦ Unlike CPUs, they consume much less power.
Three of the top 5 supercomputers are heterogeneous (CPUs + GPUs).
◦ The world's biggest supercomputer to be built will have 18,000 GPUs.
Getting high performance out of a GPU, however, is quite a challenge.

GPU Technology (Fermi)
[Diagram: SMs connected to a shared L2 cache and DRAM]

GPU Technology (Fermi)
For each SM:
◦ 32 cores.
◦ 64 KB L1 cache / shared memory.
◦ 16 load/store units.
◦ 4 special function units (SFUs).
◦ 32 K registers (32-bit each).

GPU Technology (Fermi)
Fermi GPUs are the first GPUs in the world with a complete memory hierarchy:
◦ registers, L1 cache/shared memory, L2 cache, DRAM.
Fermi is the first GPU with ECC support.
Fermi theoretical peak performance:
◦ ~1 Tflop/s (single precision).
◦ ~500 Gflop/s (double precision).

GPU Technology
Why is it tough? Let's take a look at the programming model...
◦ A user program is designed as a grid of computation blocks.
◦ Each block occupies one SM and has dedicated local memory.
◦ Blocks share the L2 cache and global memory.

GPU Technology
Why is it tough? Let's take a look at the programming model...
◦ A single computation block is divided into threads in 1D, 2D, or 3D arrays, commonly known as a thread block.
◦ Threads are executed in warps (groups of 32).
A minimal sketch of this model is shown below.
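For concreteness, here is a minimal CUDA sketch of the model just described (the kernel and its names are illustrative, not part of the SYMV work): a 1D grid of 1D thread blocks, where each thread handles one element and every 32 consecutive threads of a block form a warp.

```cuda
#include <cuda_runtime.h>

// Illustrative only: each thread computes one array element; threads are
// grouped into blocks, and blocks into a grid.
__global__ void scale(float *a, float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the tail
        a[i] *= alpha;
}

// Host-side launch: enough 256-thread blocks (8 warps each) to cover n.
// scale<<<(n + 255) / 256, 256>>>(d_a, 2.0f, n);
```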

GPU Optimization Issues
General:
◦ Load balancing between computation blocks.
◦ Data caching for reused data.
◦ Data prefetching (to mask memory latency).
◦ Avoid going to the SLOW global memory as much as possible.
◦ Memory coalesced access (per warp); see the sketch after this list.
GPU specific:
◦ Avoid shared memory bank conflicts.
◦ Avoid divergent branches (within the same warp).
◦ Avoid using too many registers per thread (the limit is 63 in Fermi).
◦ Wisely use SM resources to increase occupancy (since one SM can host more than one computation block simultaneously).
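To illustrate the coalescing point, here are two hypothetical copy kernels (not from the SYMV code). In the first, consecutive threads of a warp touch consecutive addresses, so each warp's loads combine into a few wide memory transactions; in the second, the accesses scatter across many segments.

```cuda
#include <cuda_runtime.h>

// Coalesced: thread k reads and writes element k.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: consecutive threads touch addresses 'stride' elements apart, so
// a warp's accesses spread over many memory segments and bandwidth drops.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}
```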

The SYMV Kernel
A level-2 BLAS kernel.
◦ Computes: Y = α × A × X + β × Y
 A is a symmetric matrix (S/D/C/Z precisions).
 X and Y are vectors.
 α and β are scalars.
◦ Only the lower or upper triangle of A should be referenced.
◦ The matrix-vector multiplication involves data reuse only in the vector X.
◦ No data reuse can be exploited in the elements of the matrix A (except for symmetry).
A serial reference sketch of the operation follows below.
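As a point of reference, this is a minimal serial version of the operation (a sketch for clarity, not the GPU kernel), using the lower triangle of a column-major matrix:

```cuda
// Reference serial SYMV, y = alpha*A*x + beta*y, touching only the lower
// triangle of the column-major matrix A.
void symv_ref(int n, double alpha, const double *A, int lda,
              const double *x, double beta, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++) {
            // For the strictly upper part, use symmetry: A(i,j) = A(j,i).
            double aij = (i >= j) ? A[i + j * lda] : A[j + i * lda];
            sum += aij * x[j];
        }
        y[i] = alpha * sum + beta * y[i];
    }
}
```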

MAGMA SYMV Kernel (SC'11 paper)
Main ideas:
◦ The matrix is divided into 64×64 sub-matrices.
◦ Each computation block is responsible for one horizontal row of sub-matrices.
◦ A computation block starts with the diagonal sub-matrix of its assigned row.
◦ Non-diagonal sub-matrices are processed twice:
 once as the non-transposed sub-matrix;
 once as the transposed sub-matrix, to exploit symmetry.
◦ Recursive blocking:
 used to save shared memory;
 each sub-matrix is processed in 32×32 chunks.
◦ Pointer redirecting:
 used to handle matrix dimensions that are not multiples of 64.

MAGMA SYMV Kernel
[Diagram: within each computation block, partial sums are reduced through shared memory and registers; partial results needed by other blocks are spilled to global memory, where a reduction across blocks completes the result.]

Main Ideas of our Design
Same 64×64 block size as MAGMA.
Diagonal blocks are isolated from non-diagonal ones.
Each computation block is responsible for one vertical column of sub-matrices, offering better locality for the column-major format.
No recursive blocking:
◦ Fermi has enough shared memory (up to 48 KB).
◦ This allows more efficient data prefetching (in diagonal sub-matrices).
Shared memory usage is restricted to the reduction operation only:
◦ In Fermi, shared memory latency is high (compared to previous GPUs).
◦ In MAGMA, shared memory is used for the reduction as well as for storing partial results.
◦ In the new design, partial results are accumulated in registers first, then spilled once to shared memory for the reduction (sketched below).
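The following is a hedged sketch of the "registers first, shared memory once" pattern, not the actual kernel (the name and the one-block-per-row mapping are illustrative): each thread accumulates its partial dot product in a register, spills it to shared memory a single time, and a tree reduction in shared memory produces the final value.

```cuda
#include <cuda_runtime.h>

// Launch as rowdot_reg_then_shmem<<<n, 64>>>(dA, dx, dy, n, lda):
// one 64-thread block per matrix row (sketch only).
__global__ void rowdot_reg_then_shmem(const double *A, const double *x,
                                      double *y, int n, int lda)
{
    __shared__ double red[64];          // one slot per thread in the block
    int row = blockIdx.x;
    double acc = 0.0;                   // register accumulator

    for (int j = threadIdx.x; j < n; j += blockDim.x)
        acc += A[row + j * lda] * x[j]; // column-major A(row, j)

    red[threadIdx.x] = acc;             // the only shared-memory store
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction
        if (threadIdx.x < s)
            red[threadIdx.x] += red[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        y[row] = red[0];
}
```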

The New SYMV Kernel
[Diagram: same reduction structure as above — partial sums reduced through shared memory and registers within a block; results needed by other blocks spilled to global memory and reduced there.]

Experiments
The new kernel:
◦ was written in CUDA C (version 4.0);
◦ was integrated into MAGMA/BLAS for testing;
◦ is, so far, designed for matrix dimensions that are multiples of 64. We plan to use either pointer redirecting (as in MAGMA) or padding (easier to implement, allowing a fast release); a padding sketch follows below;
◦ was tested on a Fermi (Tesla C2070) GPU with 6 GB of memory.
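A hypothetical host-side padding helper is sketched here (illustrative names, not the released code): round the dimension up to the next multiple of 64 and zero-fill, so the kernel can assume full 64×64 sub-matrices everywhere.

```cuda
#include <cuda_runtime.h>

// Upload an n x n column-major matrix hA (leading dimension lda) into a
// zero-padded n_pad x n_pad device buffer, n_pad a multiple of 64.
double *upload_padded(const double *hA, int n, int lda, int *n_pad_out)
{
    int n_pad = ((n + 63) / 64) * 64;   // next multiple of 64
    double *dA;
    cudaMalloc(&dA, (size_t)n_pad * n_pad * sizeof(double));
    cudaMemset(dA, 0, (size_t)n_pad * n_pad * sizeof(double));
    // Copy the real matrix into the top-left corner, column by column:
    // width = one column of n doubles, height = n columns.
    cudaMemcpy2D(dA, n_pad * sizeof(double), hA, lda * sizeof(double),
                 n * sizeof(double), n, cudaMemcpyHostToDevice);
    *n_pad_out = n_pad;
    return dA;
}
```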

Performance Results (cont.)
[Performance chart]

Performance Results
[Performance chart]

What Helped Us?
PAPI CUDA Component:
◦ Extracts performance counters during kernel execution.
◦ Really easy to use (even for a first-time user)!
◦ Mainly used to identify where improvements are possible:
 shared memory bank conflicts;
 global memory misses (load/store);
 divergent branches;
 local memory usage.
A usage sketch follows below.
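A hedged sketch of reading one GPU counter through the PAPI CUDA component is shown below. The event name is a placeholder, not a verified name: actual names depend on the GPU and PAPI version, and can be listed with the papi_native_avail utility.

```cuda
#include <papi.h>
#include <cuda_runtime.h>

void profile_kernel(void)
{
    int es = PAPI_NULL, code;
    long long value;

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&es);
    // Hypothetical event name; query papi_native_avail for real ones.
    PAPI_event_name_to_code("cuda:::divergent_branch", &code);
    PAPI_add_event(es, code);

    PAPI_start(es);
    // my_kernel<<<grid, block>>>(...);  // kernel under study (placeholder)
    cudaDeviceSynchronize();             // ensure the kernel has finished
    PAPI_stop(es, &value);               // 'value' now holds the counter
}
```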

What Helped Us? (cont.)
NVIDIA Compute Profiler:
◦ Extracts information that is unavailable or hard to get through the PAPI CUDA component:
 registers per thread;
 GPU time;
 occupancy analysis;
 kernel memory bandwidth.

Future Work
The distribution of work among computation blocks is not balanced.
Balancing the load may lead to further improvement, but locality will no longer be exploited.
A 1D block-cyclic assignment is intended (sketched below).
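The intended 1D block-cyclic mapping could look like the following skeleton (illustrative name, not the final kernel): instead of block b owning exactly one column of 64×64 sub-matrices, it cycles over columns b, b + gridDim.x, b + 2·gridDim.x, ..., which evens out the triangular work distribution.

```cuda
__global__ void symv_block_cyclic(int num_block_cols)
{
    // Block-cyclic loop over sub-matrix columns assigned to this block.
    for (int col = blockIdx.x; col < num_block_cols; col += gridDim.x) {
        // process the 64x64 sub-matrices of column 'col' here
    }
}
```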

Credits
Rajib Nath (University of California, San Diego)
◦ Fruitful discussions about the design of the MAGMA SYMV kernel.
◦ Guidelines for possible improvements.
Heike Jagode (UTK)
◦ Guidance on the installation and usage of PAPI.

Thank You
Questions?