1
Accelerating SYMV kernel on GPUs Ahmad M Ahmad, AMCS Division, KAUST ahmad.ahmad@kaust.edu.sa
2
Agenda Motivation GPU Technology GPU Optimization issues MAGMA SYMV kernel The new SYMV Kernel Performance Results What Helped us? Future Work
3
Motivation GPUs are invading the HPC community. ◦ Many cores (~512) on a single GPU card. ◦ Best suited for massively (embarrassingly) parallel problems. ◦ Unlike CPUs, they dedicate more silicon to floating-point operations. ◦ Unlike CPUs, they consume much less power. Three of the top 5 supercomputers are heterogeneous (CPUs + GPUs). ◦ The world’s biggest supercomputer to be built will have 18,000 GPUs. Getting high performance out of a GPU, however, is quite a challenge.
4
GPU Technology (Fermi) [Diagram: Fermi architecture: SMs, shared L2 cache, DRAM]
5
GPU Technology (Fermi) For each SM: ◦ 32 cores ◦ 64 KB L1 cache/shared memory ◦ 16 LD/ST units ◦ 4 SFUs ◦ 32,768 registers (32-bit)
6
GPU Technology (Fermi) Fermi is the first GPU in the world with a complete memory hierarchy ◦ (registers, L1 cache/SHMEM, L2 cache, DRAM). Fermi is the first GPU with ECC support. Fermi theoretical peak performance: ◦ 1 Tflop/s (single precision) ◦ ~500 Gflop/s (double precision)
7
GPU Technology Why is it tough? Let’s take a look at the programming model… A user program is designed as a grid of computation blocks. Each block occupies one SM and has dedicated local memory. Blocks share the L2 cache and global memory.
8
GPU Technology Why is it tough? Let’s take a look at the programming model… A single computation block is divided into threads in 1D, 2D, or 3D arrays, commonly known as a thread block. Threads are executed in warps (groups of 32). A minimal launch is sketched below.
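To make the model concrete, here is a minimal hedged CUDA sketch (the kernel name, data, and sizes are my own illustration, not from the presentation): a grid of computation blocks is launched, each block containing 256 threads, i.e. 8 warps of 32.

    __global__ void scale_kernel(float *data)
    {
        // Global index from block and thread coordinates; threads
        // within a block execute in warps of 32.
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        data[gid] *= 2.0f;
    }

    int main()
    {
        const int n = 1 << 20;
        float *d_data;
        cudaMalloc((void **)&d_data, n * sizeof(float));
        cudaMemset(d_data, 0, n * sizeof(float));

        dim3 block(256);         // one thread block: 256 threads = 8 warps
        dim3 grid(n / block.x);  // grid of computation blocks scheduled onto SMs
        scale_kernel<<<grid, block>>>(d_data);
        cudaDeviceSynchronize();

        cudaFree(d_data);
        return 0;
    }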
9
GPU Optimization Issues General ◦ Load balancing between computation blocks. ◦ Data caching for reused data. ◦ Data prefetching (to mask memory latency). ◦ Avoid going to SLOW global memory as much as possible. ◦ Coalesced memory access (per warp); see the sketch below. GPU specific ◦ Avoid shared memory bank conflicts. ◦ Avoid divergent branches (within the same warp). ◦ Avoid using many registers per thread (at most 63 on Fermi). ◦ Use SM resources wisely to increase occupancy (one SM can host more than one computation block simultaneously).
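To illustrate the coalescing point in the list above, a hedged sketch (both kernels are my own illustration): consecutive threads of a warp should touch consecutive addresses, while strided access spreads a warp over many memory segments.

    __global__ void coalesced_copy(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Coalesced: consecutive threads of a warp read consecutive
        // addresses, so the hardware merges them into few transactions.
        if (i < n) out[i] = in[i];
    }

    __global__ void strided_copy(const float *in, float *out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Non-coalesced: the stride spreads a warp's accesses across
        // many memory segments, multiplying the number of transactions.
        if (i * stride < n) out[i * stride] = in[i * stride];
    }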
10
The SYMV Kernel A level-2 BLAS kernel ◦ Computes: Y = α × A × X + β × Y A is a symmetric matrix (S/D/C/Z) X and Y are vectors α and β are scalars ◦ Only the lower or upper triangle of A should be referenced. ◦ The matrix-vector multiplication involves data reuse in the vector X only. ◦ No data reuse can be exploited for the elements of A (except through symmetry); a reference sketch follows below.
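For reference, a minimal host-side sketch of the operation, assuming column-major storage, single precision, and the lower triangle (uplo = 'L'); this is only a specification of the math, not the GPU kernel:

    // y = alpha*A*x + beta*y, A symmetric, only the lower triangle stored.
    void symv_lower_ref(int n, float alpha, const float *A, int lda,
                        const float *x, float beta, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] *= beta;
        for (int j = 0; j < n; ++j) {
            y[j] += alpha * A[j + j * lda] * x[j];   // diagonal element
            for (int i = j + 1; i < n; ++i) {
                float a = A[i + j * lda];            // lower-triangle entry
                y[i] += alpha * a * x[j];            // non-transposed use
                y[j] += alpha * a * x[i];            // transposed use (symmetry)
            }
        }
    }

Note that each lower-triangle element is loaded once but used twice: this is exactly the symmetry reuse referred to above, and the only reuse available on the elements of A.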
11
MAGMA SYMV Kernel (SC’11 paper) Main ideas ◦ The matrix is divided into 64×64 sub-matrices. ◦ Each computation block is responsible for one horizontal row of sub-matrices. ◦ A computation block starts with the diagonal sub-matrix of its assigned row. ◦ Each non-diagonal sub-matrix is processed twice: once non-transposed, and once transposed to exploit symmetry. ◦ Recursive blocking Used to save shared memory; each sub-matrix is processed in 32×32 chunks. ◦ Pointer redirecting Used to handle matrix dimensions that are not multiples of 64.
12
MAGMA SYMV Kernel [Diagram: data flow in the MAGMA kernel. Legend: reduction through SHMEM/registers; reduction through GLMEM for parts computed by other blocks; partial results spilled to GLMEM for other blocks.]
13
Main Ideas of our Design Same 64×64 block size as MAGMA. Diagonal blocks are isolated from non-diagonal ones. Each computation block is responsible for one vertical column of sub-matrices, offering better locality for the column-major format. No recursive blocking ◦ Fermi has enough shared memory (up to 48K per SM). ◦ Allows more efficient data prefetching (in diagonal sub-matrices). Shared memory usage is restricted to the reduction operation only ◦ On Fermi, SHMEM latency is high (compared to previous GPUs). ◦ In MAGMA, SHMEM is used for the reduction as well as for storing partial results. ◦ In the new design, partial results are accumulated in registers first and spilled to shared memory once, for the reduction (see the sketch below).
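A hedged sketch of that accumulation pattern (names, panel shape, and thread geometry are my own simplification, not the actual kernel): each thread keeps its partial dot product in a register and touches shared memory only once, for the reduction.

    #define NTX 64   // rows of the panel handled by one computation block
    #define NTY 4    // threads cooperating on each row

    // 'A' is assumed to point at the block's 64-row panel (column-major,
    // leading dimension lda); 'work' receives one partial result per row.
    __global__ void row_partials_sketch(const float *A, int lda, int ncols,
                                        const float *x, float *work)
    {
        __shared__ float red[NTY][NTX];
        int tx = threadIdx.x, ty = threadIdx.y;

        float acc = 0.0f;                  // partial result stays in a register
        for (int j = ty; j < ncols; j += NTY)
            acc += A[tx + j * lda] * x[j];

        red[ty][tx] = acc;                 // single spill to shared memory
        __syncthreads();

        if (ty == 0) {                     // reduce the NTY partials per row
            float sum = red[0][tx];
            for (int k = 1; k < NTY; ++k)
                sum += red[k][tx];
            work[blockIdx.x * NTX + tx] = sum;
        }
    }

Launched with dim3 block(NTX, NTY); the single shared-memory write per thread is what keeps Fermi's high SHMEM latency out of the inner loop.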
14
The new SYMV kernel [Diagram: data flow in the new kernel. Legend: reduction through SHMEM/registers; reduction through GLMEM for parts computed by other blocks; partial results spilled to GLMEM for other blocks.]
15
Experiments The new kernel ◦ was written in CUDA C ver 4.0. ◦ was integrated into MAGMA BLAS for testing. ◦ is, so far, designed for matrix dimensions that are multiples of 64; we plan to use either pointer redirecting (as in MAGMA) or padding (easier to implement, allowing a fast release), sketched below. ◦ was tested on a Fermi (Tesla C2070) GPU with 6 GB of memory.
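A hedged sketch of the padding option (hA, pad_to_64, and the float type are assumptions for illustration; the kernel itself would be unchanged): round n up to the next multiple of 64 and zero-fill, so the 64×64 blocking always applies and the extra zeros leave the result untouched.

    // hA: assumed host pointer to the n x n matrix (column-major).
    float *pad_to_64(const float *hA, int n, int *n_pad_out)
    {
        int n_pad = ((n + 63) / 64) * 64;        // next multiple of 64
        size_t bytes = (size_t)n_pad * n_pad * sizeof(float);
        float *dA;
        cudaMalloc((void **)&dA, bytes);
        cudaMemset(dA, 0, bytes);                // zero fill: padding adds nothing
        // Copy the n x n matrix into the leading corner of the padded buffer.
        cudaMemcpy2D(dA, (size_t)n_pad * sizeof(float),    // dst, dst pitch (bytes)
                     hA, (size_t)n * sizeof(float),        // src, src pitch (bytes)
                     (size_t)n * sizeof(float), (size_t)n, // width (bytes), height
                     cudaMemcpyHostToDevice);
        *n_pad_out = n_pad;
        return dA;
    }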
16
Performance Results
17
Performance Results (cont.)
18
What helped us? PAPI CUDA Component ◦ Extracts performance counters during kernel execution. ◦ Really easy to use (even for a first-time user)! ◦ Mainly used to identify where improvements were possible: shared memory bank conflicts, global memory misses (loads/stores), divergent branches, local memory usage. A usage sketch follows below.
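A hedged sketch of reading one GPU counter through the PAPI CUDA component (the event name and the stand-in kernel are assumptions; real native event names are device-specific and can be listed with papi_native_avail):

    #include <stdio.h>
    #include <papi.h>

    __global__ void my_kernel(float *d) { d[threadIdx.x] += 1.0f; } // stand-in

    int main()
    {
        float *d_buf;
        cudaMalloc((void **)&d_buf, 256 * sizeof(float));

        int es = PAPI_NULL;
        long long value;
        PAPI_library_init(PAPI_VER_CURRENT);
        PAPI_create_eventset(&es);
        /* Assumed event name; list the real ones with papi_native_avail. */
        PAPI_add_named_event(es, "cuda:::device:0:divergent_branch");

        PAPI_start(es);
        my_kernel<<<1, 256>>>(d_buf);    /* kernel under measurement */
        cudaDeviceSynchronize();
        PAPI_stop(es, &value);

        printf("divergent branches: %lld\n", value);
        cudaFree(d_buf);
        return 0;
    }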
19
What helped us? (cont.) NVIDIA compute profiler ◦ Extracts information that is unavailable, or hard to get, through the PAPI CUDA component: registers per thread, GPU time, occupancy analysis, kernel memory bandwidth.
20
Future Work The distribution of work among computation blocks is not balanced: a block that owns a longer column of the triangle does more work. Balancing the load may lead to further improvement, but locality would then not be exploited. A 1D block-cyclic assignment is intended, as sketched below. [Diagram: 1D block-cyclic assignment of sub-matrix columns to computation blocks.]
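A hedged sketch of the intended mapping (illustrative only, not implemented): instead of owning one contiguous column of sub-matrices, each computation block walks the columns cyclically, so long and short columns of the triangle are interleaved across blocks.

    // Block b processes sub-matrix columns b, b + g, b + 2g, ...
    // where g = gridDim.x, balancing long and short columns.
    __global__ void symv_cyclic_sketch(int ncols /* , matrix args ... */)
    {
        for (int j = blockIdx.x; j < ncols; j += gridDim.x) {
            // process the 64x64 sub-matrices of column j ...
        }
    }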
21
Credits Rajib Nath (University of California, San Diego) ◦ Fruitful discussions about the design of the MAGMA SYMV kernel. ◦ Guidelines for possible improvements. Heike Jagode (UTK) ◦ Guidelines on the installation and usage of PAPI.
22
Thank You Questions?