Nathan Grabaskas: Batched LA and Parallel Communication Optimization


1 Nathan Grabaskas: Batched LA and Parallel Communication Optimization

2 Overview
Batched LA: LU Factorization, Cholesky Factorization, Matrix Blocking, Recursive Blocking, Parallel Swap
Wave2D: Doubled Message Size, Prioritized Calculations

3 BATCHED LINEAR ALGEBRA

4 Motivation
What is batched? Many small matrix operations (size 512 or less).
BLAS: Basic Linear Algebra Subprograms – Fortran – highly efficient.
cuBLAS: CUDA Basic Linear Algebra Subprograms.
Batched linear algebra in applications (a batched dgemm sketch follows this list):
- Computer vision and anomaly detection in images
- Magnetic resonance imaging (MRI): billions of small 8x8 and 32x32 eigenvalue problems need to be solved
- Radar signal processing: requires a batched 200x200 QR decomposition to be computed
- Hydrodynamic simulations: thousands of matrix-matrix (dgemm) or matrix-vector (dgemv) products of matrices well over 100x100
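As a concrete illustration of the batched interface, the host-side sketch below launches one cublasDgemmBatched call over a batch of small matrices. The matrix size, batch count, and the contiguous-slab allocation are assumptions for illustration; filling the matrices with data and error checking are omitted.

/* Host-side sketch: many small dgemm products in one batched cuBLAS call. */
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <stdlib.h>

int main(void)
{
    const int n = 32;            /* each matrix is 32 x 32 (illustrative) */
    const int batch = 2000;      /* number of independent small problems  */

    /* one contiguous slab per operand, plus device arrays of per-matrix pointers */
    double *dA, *dB, *dC;
    cudaMalloc((void **)&dA, (size_t)batch * n * n * sizeof(double));
    cudaMalloc((void **)&dB, (size_t)batch * n * n * sizeof(double));
    cudaMalloc((void **)&dC, (size_t)batch * n * n * sizeof(double));

    double **hA = malloc(batch * sizeof(double *));
    double **hB = malloc(batch * sizeof(double *));
    double **hC = malloc(batch * sizeof(double *));
    for (int i = 0; i < batch; ++i) {
        hA[i] = dA + (size_t)i * n * n;
        hB[i] = dB + (size_t)i * n * n;
        hC[i] = dC + (size_t)i * n * n;
    }
    double **dAarr, **dBarr, **dCarr;
    cudaMalloc((void **)&dAarr, batch * sizeof(double *));
    cudaMalloc((void **)&dBarr, batch * sizeof(double *));
    cudaMalloc((void **)&dCarr, batch * sizeof(double *));
    cudaMemcpy(dAarr, hA, batch * sizeof(double *), cudaMemcpyHostToDevice);
    cudaMemcpy(dBarr, hB, batch * sizeof(double *), cudaMemcpyHostToDevice);
    cudaMemcpy(dCarr, hC, batch * sizeof(double *), cudaMemcpyHostToDevice);

    /* C_i = 1.0 * A_i * B_i + 0.0 * C_i for every i in the batch */
    cublasHandle_t handle;
    cublasCreate(&handle);
    const double alpha = 1.0, beta = 0.0;
    cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                       &alpha, (const double * const *)dAarr, n,
                               (const double * const *)dBarr, n,
                       &beta,  (double * const *)dCarr, n, batch);

    cublasDestroy(handle);
    /* ... copy results back, free device and host buffers ... */
    return 0;
}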

5 Related Work
CPU core – MKL or ACML [1]
Large problems – CPU/GPU data transfers can be overlapped with GPU work [2]
CUDA threads – if the problem fits into the GPU's memory [3]
Small problems can be solved efficiently on a single CPU core using vendor-supplied libraries such as MKL or ACML [1]. For GPU architectures, prior work has concentrated on achieving high performance for large problems through hybrid algorithms [2]. For large enough problems, the panel factorizations and the associated CPU-GPU data transfers can be overlapped with GPU work. Batched algorithms have also been developed entirely for GPU execution, where a single CUDA thread, or a single thread block, solves one system at a time; however, these algorithms were only used for problems that could fit into the GPU's memory [3].

6 LU Factorization

7 LU Factorization

8 Cholesky Factorization: A = LLᵀ
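For reference, the textbook unblocked Cholesky factorization of one small matrix is sketched below in plain C; in the batched GPU setting each such small factorization is assigned to a single CUDA thread or thread block [1, 3]. This is an illustrative sketch, not the paper's kernel.

/* Factor the n x n symmetric positive definite matrix A (row-major) in place,
 * leaving L in the lower triangle. Returns 0 on success, -1 if A is not SPD. */
#include <math.h>
#include <stdio.h>

int cholesky(double *A, int n)
{
    for (int j = 0; j < n; ++j) {
        double d = A[j * n + j];
        for (int k = 0; k < j; ++k)
            d -= A[j * n + k] * A[j * n + k];
        if (d <= 0.0) return -1;
        A[j * n + j] = sqrt(d);
        for (int i = j + 1; i < n; ++i) {
            double s = A[i * n + j];
            for (int k = 0; k < j; ++k)
                s -= A[i * n + k] * A[j * n + k];
            A[i * n + j] = s / A[j * n + j];
        }
    }
    return 0;
}

int main(void)
{
    double A[4] = {4.0, 2.0,
                   2.0, 3.0};          /* small SPD example */
    if (cholesky(A, 2) == 0)
        printf("L = [%g 0; %g %g]\n", A[0], A[2], A[3]);
    return 0;
}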

9 Matrix Blocking

10 GPU Architecture (GTX 960M)
128 cores per SMM (Maxwell streaming multiprocessor)
Register = 64 KB; L1 = 24 KB
2x2 doubles = 32 bytes -> 768 blocks fit into L1
4x4 doubles = 128 bytes -> 192 blocks fit into L1
8x8 doubles = 512 bytes -> 48 blocks fit into L1 (arithmetic worked out below)
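The block counts are simply the 24 KB L1 capacity divided by the footprint of one b x b tile of doubles; the few lines below reproduce them.

/* Tiles of b x b doubles that fit in the 24 KB L1 (matches the counts above). */
#include <stdio.h>

int main(void)
{
    const int l1_bytes = 24 * 1024;
    for (int b = 2; b <= 8; b *= 2)
        printf("b = %d: %d bytes per tile, %d tiles fit in L1\n",
               b, b * b * 8, l1_bytes / (b * b * 8));
    return 0;
}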

11 Recursive Blocking
Since the entire panel cannot be loaded into the GPU's shared memory, the columns to the right (in the case of LU) or to the left (in the case of Cholesky) are loaded back and forth from main memory at every step. The goal is to recursively reduce the block size so that multiple blocks can be resident on the same streaming multiprocessor at the same time, and a block waiting for data from memory can be pushed back in favor of a thread that is ready to execute [1, 4].
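As a rough illustration of the block-size decision (not the MAGMA code from [1, 4]), the sketch below recursively halves a panel's width until one block of doubles fits in about half of the 24 KB shared memory, so that a second block can be resident on the same multiprocessor; the panel dimensions and the memory budget are assumptions.

/* Recursively shrink the block until it fits the shared-memory budget. */
#include <stdio.h>

#define SHARED_MEM_BYTES (24 * 1024)   /* L1/shared memory per SMM (GTX 960M) */

static void split_panel(int rows, int width)
{
    size_t bytes = (size_t)rows * width * sizeof(double);
    if (bytes <= SHARED_MEM_BYTES / 2) {   /* leave room for a second resident block */
        printf("factor %dx%d block (%zu bytes) in shared memory\n", rows, width, bytes);
        return;
    }
    /* Too large: factor the left half first, then update and factor the right half
     * (on the GPU the trailing half is updated with the factored left half). */
    split_panel(rows, width / 2);
    printf("update right %dx%d half with the factored left half\n",
           rows, width - width / 2);
    split_panel(rows, width - width / 2);
}

int main(void)
{
    split_panel(512, 32);   /* e.g., a 512-row panel of 32 columns of doubles */
    return 0;
}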

12 Parallel Swapping
To overcome the swapping bottleneck, the row swaps need to be applied in parallel. The first section of rows are those used by the dtrsm kernel (which solves a triangular system of equations with multiple right-hand sides) applied right after dlaswp (which performs a series of row interchanges on the matrix A over a given range of rows). The optimization is to load a chunk of this section of rows into shared memory and apply the dlaswp followed by the dtrsm at the same time. The algorithm is changed to generate two pivot vectors, where the first vector gives the destination rows and the second gives the rows to be swapped [1].
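A minimal sketch of the two-pivot-vector idea follows; the names dst/src, the row-major layout, and the gather into a workspace are illustrative assumptions standing in for what would be a shared-memory staging step in the kernel of [1].

/* Two-pivot-vector row swap: dst[k] is the destination row, src[k] the row to
 * move there. Each k writes a distinct destination row in the workspace W, so
 * the loop has no dependences and maps to one GPU thread (or thread block)
 * per k; it is shown here as plain C. */
#include <stdio.h>
#include <string.h>

/* A is nrows x ncols, row-major; W is a workspace of the same shape.
 * The dst/src vectors cover every row touched by the interchanges. */
void parallel_swap(const double *A, double *W, int ncols,
                   const int *dst, const int *src, int nswap)
{
    for (int k = 0; k < nswap; ++k)          /* one GPU thread per k */
        memcpy(&W[(size_t)dst[k] * ncols],
               &A[(size_t)src[k] * ncols],
               (size_t)ncols * sizeof(double));
}

int main(void)
{
    double A[3][3] = {{1, 1, 1}, {2, 2, 2}, {3, 3, 3}};
    double W[3][3] = {{0}};
    int dst[3] = {0, 1, 2};
    int src[3] = {2, 0, 1};   /* row 2 -> row 0, row 0 -> row 1, row 1 -> row 2 */
    parallel_swap(&A[0][0], &W[0][0], 3, dst, src, 3);
    printf("first column of W: %g %g %g\n", W[0][0], W[1][0], W[2][0]);
    return 0;
}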

13 Parallel Swap

14 Parallel Swap The execution trace of the batched LU for 2000 matrices of size 512. [1]

15 Wave2D

16 Introduction
Given Schrödinger's wave dissemination algorithm (Wave2D), the task was to parallelize it using MPI to execute on multiple nodes.

17 Double Message Size
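Below is a minimal, self-contained sketch of one reading of "doubled message size": keep a two-row ghost region per side and exchange it every second time step, so each message carries Size * 8 * 2 bytes and only about half as many messages are sent, which matches the message size and count used on the Formula slide. The strip decomposition, variable names, and sizes are assumptions for illustration, not the author's code.

/* Wave2D strip decomposition with a doubled halo exchanged every other step. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int N = 576;               /* one of the tested matrix sizes */
    const int rows = N / nprocs;     /* rows owned by this rank (assume it divides) */
    /* local strip plus TWO ghost rows above and below; owned rows are 2..rows+1 */
    double *u = calloc((size_t)(rows + 4) * N, sizeof(double));

    int up   = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int step = 0; step < 500; ++step) {
        if (step % 2 == 0) {
            /* doubled message: two boundary rows per neighbor, every other step */
            MPI_Sendrecv(&u[2 * N],          2 * N, MPI_DOUBLE, up,   0,
                         &u[(rows + 2) * N], 2 * N, MPI_DOUBLE, down, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&u[rows * N],       2 * N, MPI_DOUBLE, down, 1,
                         &u[0],              2 * N, MPI_DOUBLE, up,   1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        /* After an exchange, two stencil steps can run before the next one:
         * the first step also updates the inner ghost row, so the second
         * step's interior update still sees valid neighbor values.
         * ... apply the Wave2D stencil here ... */
    }

    free(u);
    MPI_Finalize();
    return 0;
}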

18 Test

19 Test Methods
Parameters: 1, 4, 8, and 16 nodes; 100-iteration average; 500 time steps; speedup from 4, 8, and 16 nodes averaged
4 methods: Standard, Doubled, Prioritized, Doubled/Prioritized

20 Average Speedup – Large matrices

21 Average Speedup – Small matrices

22 Efficiency

23 Comparisons

24 Formula?
Size = matrix size
Msg Size = Size * 8 * 2 bytes
MTU (Maximum Transmission Unit) = 1500 bytes
# Msg = (500 – 2) / 2 = 248
# Packets = ROUNDUP(Msg Size / MTU)
Msg Time = # Messages * # Packets
Inputs: sequential runtime, estimated calculations per msg
Msg time to Calc time ratio = (Runtime / Calcs per msg) / Msg Time
(The Msg Time column is recomputed from this formula below.)

Size   Msg Time   Calc Time   Time/Calc Ratio
176      496         347          0.70
256      744         716          0.96
336      992        1277          1.29
416     1240        1996          1.61
496     1488        2864          1.92
576     1736        3860          2.22
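For reference, a few lines of C recompute the Msg Time column directly from the formula above; Calc Time comes from the measured sequential runtime, so it is not recomputed here.

/* Reproduces the Msg Time column: message size = Size * 8 * 2 bytes,
 * packets = ROUNDUP(msg size / MTU), 248 messages per 500-step run. */
#include <stdio.h>

int main(void)
{
    const int mtu = 1500;                       /* bytes per packet */
    const int num_msgs = 248;                   /* messages per run */
    const int sizes[] = {176, 256, 336, 416, 496, 576};

    for (int i = 0; i < 6; ++i) {
        int msg_bytes = sizes[i] * 8 * 2;               /* doubles, two rows */
        int packets = (msg_bytes + mtu - 1) / mtu;      /* ROUNDUP(Msg Size / MTU) */
        printf("Size %3d: %5d bytes, %d packets, Msg Time = %d\n",
               sizes[i], msg_bytes, packets, num_msgs * packets);
    }
    return 0;
}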

25 References
[1] Haidar, A., Tomov, S., Dong, T., Dongarra, J., Luszczek, P.: Optimization for Performance and Energy for Batched Matrix Computations on GPUs. ACM (2015)
[2] Tomov, S., Nath, R., Dongarra, J.: Dense Linear Algebra Solvers for Multicore with GPU Accelerators. In: Proc. of the IEEE IPDPS'10, Atlanta, GA, April 2010
[3] Wainwright, I.: Optimized LU-decomposition with Full Pivot for Small Batched Matrices. GTC'13, ID S3069, April 2013
[4] Dong, T., Haidar, A., Luszczek, P., Tomov, S., Abdelfattah, A., Dongarra, J.: MAGMA Batched: A Batched BLAS Approach for Small Matrix Factorizations and Applications on GPUs. ICL Tech Report (2016)

