Nathan Grabaskas: Batched LA and Parallel Communication Optimization


Overview
- Batched LA
  - LU factorization
  - Cholesky Factorization
  - Matrix Blocking
  - Recursive Blocking
  - Parallel Swap
- Wave2D
  - Doubled Message Size
  - Prioritized Calculations

BATCHED LINEAR ALGEBRA

Motivation

What is "batched"?
- Many small matrix operations (size 512 or less) grouped and executed together
- BLAS: Basic Linear Algebra Subprograms (Fortran, highly efficient)
- cuBLAS: CUDA Basic Linear Algebra Subprograms

Batched linear algebra in applications:
- Computer vision and anomaly detection in images
- Magnetic resonance imaging (MRI): billions of small 8x8 and 32x32 eigenvalue problems need to be solved
- Radar signal processing: requires a batched 200x200 QR decomposition
- Hydrodynamic simulations: need thousands of matrix-matrix (dgemm) or matrix-vector (dgemv) products for matrices well over 100x100
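
To make "batched" concrete, here is a minimal CPU sketch (my own illustration, not code from the slides): the same small kernel, a plain n x n matrix product, is applied to a whole batch of independent matrices. On a GPU each matrix in the batch would map to its own thread block, or the loop would be replaced by a batched routine such as cuBLAS's cublasDgemmBatched; the batch count and sizes below are arbitrary.

```c
/* Illustrative sketch (not from the slides): a "batched" operation is the
 * same small kernel applied to many independent small matrices at once.
 * Here: C[b] = A[b] * B[b] for a batch of n x n matrices, n <= 512. */
#include <stdlib.h>
#include <stdio.h>

static void small_dgemm(const double *A, const double *B, double *C, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += A[i * n + k] * B[k * n + j];
            C[i * n + j] = s;
        }
}

int main(void) {
    const int batch = 1000, n = 32;          /* many small problems */
    double **A = malloc(batch * sizeof *A);
    double **B = malloc(batch * sizeof *B);
    double **C = malloc(batch * sizeof *C);
    for (int b = 0; b < batch; b++) {
        A[b] = malloc(n * n * sizeof **A);
        B[b] = malloc(n * n * sizeof **B);
        C[b] = malloc(n * n * sizeof **C);
        for (int i = 0; i < n * n; i++) { A[b][i] = 1.0; B[b][i] = 2.0; }
    }

    /* The batch loop: each iteration is independent, so on a GPU every
     * matrix can be handed to its own thread block (or the whole loop
     * replaced by a batched cuBLAS/MAGMA routine) instead of being
     * processed one matrix at a time. */
    for (int b = 0; b < batch; b++)
        small_dgemm(A[b], B[b], C[b], n);

    printf("C[0][0] = %f\n", C[0][0]);       /* expect 1*2*32 = 64 */
    return 0;
}
```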

Related Work
- CPU core: MKL or ACML [1]
- Large problems: CPU-GPU data transfers can be overlapped with GPU work [2]
- CUDA threads: only if the problem fits into the GPU's memory [3]

Small problems can be solved efficiently on a single CPU core using vendor-supplied libraries such as MKL or ACML [1]. For GPU architectures, prior work has concentrated on achieving high performance for large problems through hybrid algorithms [2]; for large enough problems, the panel factorizations and the associated CPU-GPU data transfers can be overlapped with GPU work. Batched algorithms have also been developed entirely for GPU execution, where a single CUDA thread, or a single thread block, solves one system at a time, although these algorithms were only used for problems that could fit into the GPU's memory [3].

LU Factorization

LU Factorization http://www.personal.soton.ac.uk/jav/soton/HELM/workbooks/workbook_30/30_3_lu_decomposition.pdf
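
For reference, a minimal in-place LU factorization with partial pivoting for a single small matrix; this is the textbook algorithm each problem in a batch would run, not the GPU kernel from [1].

```c
/* Minimal in-place LU factorization with partial pivoting, PA = LU.
 * After the call, the strict lower triangle of a holds L (unit diagonal
 * implied) and the upper triangle holds U; piv records the row swaps.
 * Textbook sketch for one small matrix, not the batched GPU kernel. */
#include <math.h>
#include <stdio.h>

static int lu_factor(double *a, int n, int *piv) {
    for (int k = 0; k < n; k++) {
        /* partial pivoting: find the largest entry in column k */
        int p = k;
        for (int i = k + 1; i < n; i++)
            if (fabs(a[i * n + k]) > fabs(a[p * n + k])) p = i;
        piv[k] = p;
        if (a[p * n + k] == 0.0) return -1;          /* singular matrix */
        if (p != k)                                   /* swap rows k and p */
            for (int j = 0; j < n; j++) {
                double t = a[k * n + j]; a[k * n + j] = a[p * n + j]; a[p * n + j] = t;
            }
        /* eliminate below the diagonal */
        for (int i = k + 1; i < n; i++) {
            a[i * n + k] /= a[k * n + k];
            for (int j = k + 1; j < n; j++)
                a[i * n + j] -= a[i * n + k] * a[k * n + j];
        }
    }
    return 0;
}

int main(void) {
    double a[4] = { 4, 3, 6, 3 };   /* 2x2 example, row-major */
    int piv[2];
    lu_factor(a, 2, piv);
    printf("U = [%g %g; 0 %g], L21 = %g\n", a[0], a[1], a[3], a[2]);
    return 0;
}
```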

Cholesky Factorization A = LL^T https://www.youtube.com/watch?v=NppyUqgQqd0
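
Likewise, a minimal unblocked Cholesky factorization (A = LL^T, lower triangular) for one small symmetric positive definite matrix; a plain reference sketch rather than the batched GPU code.

```c
/* Minimal unblocked Cholesky factorization A = L*L^T (lower triangular),
 * computed in place in the lower triangle of a row-major n x n matrix.
 * Reference sketch for one small SPD matrix, not the batched GPU code. */
#include <math.h>
#include <stdio.h>

static int cholesky(double *a, int n) {
    for (int j = 0; j < n; j++) {
        for (int k = 0; k < j; k++)                 /* update the diagonal */
            a[j * n + j] -= a[j * n + k] * a[j * n + k];
        if (a[j * n + j] <= 0.0) return -1;         /* not positive definite */
        a[j * n + j] = sqrt(a[j * n + j]);
        for (int i = j + 1; i < n; i++) {           /* column j below the diagonal */
            for (int k = 0; k < j; k++)
                a[i * n + j] -= a[i * n + k] * a[j * n + k];
            a[i * n + j] /= a[j * n + j];
        }
    }
    return 0;
}

int main(void) {
    double a[4] = { 4, 2, 2, 3 };    /* SPD 2x2: [[4,2],[2,3]] */
    cholesky(a, 2);
    printf("L = [%g 0; %g %g]\n", a[0], a[2], a[3]);   /* 2, 1, sqrt(2) */
    return 0;
}
```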

Matrix Blocking http://mathworld.wolfram.com/BlockMatrix.html
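
A quick illustration of matrix blocking (my own sketch, not from the slides): the multiply below is processed tile by tile so each small block stays in fast memory (cache here, shared memory on a GPU) while it is reused; the block size is arbitrary.

```c
/* Blocked (tiled) matrix multiply: C += A * B processed in BS x BS blocks
 * so that each block is reused while it sits in fast memory (cache here,
 * shared memory on a GPU). Illustrative sketch; block size chosen arbitrarily. */
#include <stdio.h>
#include <string.h>

#define N  512
#define BS 64                     /* block (tile) size */
static double A[N][N], B[N][N], C[N][N];

static void matmul_blocked(void) {
    for (int ii = 0; ii < N; ii += BS)
        for (int jj = 0; jj < N; jj += BS)
            for (int kk = 0; kk < N; kk += BS)
                /* multiply one BS x BS block of A by one block of B */
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++) {
                        double aik = A[i][k];
                        for (int j = jj; j < jj + BS; j++)
                            C[i][j] += aik * B[k][j];
                    }
}

int main(void) {
    memset(C, 0, sizeof C);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 1.0; }
    matmul_blocked();
    printf("C[0][0] = %g (expect %d)\n", C[0][0], N);
    return 0;
}
```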

GPU Architecture (GTX 960M)
- 128 cores per SMM (Maxwell streaming multiprocessor)
- Registers = 64 KB, L1 = 24 KB
- 2x2 doubles = 32 bytes: 768 blocks fit in L1
- 4x4 doubles = 128 bytes: 192 blocks fit in L1
- 8x8 doubles = 512 bytes: 48 blocks fit in L1
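
A tiny check of the arithmetic on this slide (the L1 size and tile sizes are taken from the slide; the rest is just the division):

```c
/* Reproduces the slide's arithmetic: a b x b tile of doubles occupies
 * b*b*8 bytes, and L1 (24 KB) holds floor(24*1024 / that) such tiles. */
#include <stdio.h>

int main(void) {
    const int l1_bytes = 24 * 1024;
    int sizes[] = { 2, 4, 8 };
    for (int i = 0; i < 3; i++) {
        int b = sizes[i];
        int tile_bytes = b * b * 8;                  /* doubles are 8 bytes */
        printf("%dx%d doubles = %d bytes, %d tiles fit in L1\n",
               b, b, tile_bytes, l1_bytes / tile_bytes);
    }
    return 0;   /* prints 768, 192, and 48, matching the slide */
}
```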

Recursive Blocking
Since we cannot load the entire panel into the shared memory of the GPU, the columns to the right (in the case of LU) or to the left (in the case of Cholesky) are loaded back and forth from main memory at every step. The goal is to recursively reduce the block size so that multiple blocks can be resident on the same streaming multiprocessor at the same time; a block waiting for data from memory can then be pushed back to make way for a thread block that is ready to execute [1, 4].
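
The sketch below illustrates the recursive-blocking idea on the CPU for Cholesky: the factorization keeps splitting the matrix in half until the diagonal block is at most nb, i.e., small enough to be factored by unblocked code living entirely in fast (shared) memory. This is a hedged illustration of the technique, not the MAGMA batched kernel of [1, 4]; the split point and names are my own.

```c
/* Hedged sketch of recursive blocking for Cholesky (A = L*L^T, lower),
 * column-major storage with leading dimension lda. The panel is split in
 * half recursively until a block is small enough (n <= nb) to be factored
 * by the unblocked code (i.e., small enough to live in fast/shared memory).
 * Illustration only, not the MAGMA batched implementation. */
#include <math.h>
#include <stdio.h>

static void chol_unblocked(double *a, int n, int lda) {
    for (int j = 0; j < n; j++) {
        for (int k = 0; k < j; k++)
            a[j * lda + j] -= a[k * lda + j] * a[k * lda + j];
        a[j * lda + j] = sqrt(a[j * lda + j]);
        for (int i = j + 1; i < n; i++) {
            for (int k = 0; k < j; k++)
                a[j * lda + i] -= a[k * lda + i] * a[k * lda + j];
            a[j * lda + i] /= a[j * lda + j];
        }
    }
}

static void chol_recursive(double *a, int n, int lda, int nb) {
    if (n <= nb) { chol_unblocked(a, n, lda); return; }
    int n1 = n / 2, n2 = n - n1;
    double *a11 = a;                     /* top-left     n1 x n1 */
    double *a21 = a + n1;                /* bottom-left  n2 x n1 */
    double *a22 = a + n1 * lda + n1;     /* bottom-right n2 x n2 */

    chol_recursive(a11, n1, lda, nb);    /* L11 = chol(A11) */

    /* A21 <- A21 * L11^{-T}: triangular solve, one column of L11 at a time */
    for (int j = 0; j < n1; j++)
        for (int i = 0; i < n2; i++) {
            for (int k = 0; k < j; k++)
                a21[j * lda + i] -= a21[k * lda + i] * a11[k * lda + j];
            a21[j * lda + i] /= a11[j * lda + j];
        }

    /* A22 <- A22 - L21 * L21^T: symmetric rank-n1 update (lower part only) */
    for (int j = 0; j < n2; j++)
        for (int i = j; i < n2; i++)
            for (int k = 0; k < n1; k++)
                a22[j * lda + i] -= a21[k * lda + i] * a21[k * lda + j];

    chol_recursive(a22, n2, lda, nb);    /* L22 = chol(A22) */
}

int main(void) {
    /* 3x3 SPD example in column-major order */
    double a[9] = { 4, 2, 2,   2, 5, 3,   2, 3, 6 };
    chol_recursive(a, 3, 3, 1);
    printf("diag(L) = %g %g %g\n", a[0], a[4], a[8]);   /* 2 2 2 */
    return 0;
}
```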

Parallel Swapping
In order to overcome the bottleneck of swapping, we need to apply row swaps in parallel. The first section of rows are those used by the dtrsm kernel (which solves a triangular system of equations with multiple right-hand sides) that is applied right after dlaswp (which performs a series of row interchanges on the matrix A over a given range of rows). To optimize, use shared memory to load a chunk of this section of rows and apply the dlaswp followed by the dtrsm at the same time. Change the algorithm to generate two pivot vectors, where the first vector gives the destination rows and the second gives the rows to be swapped [1].
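
A hedged sketch of the two-pivot-vector idea: the sequential LAPACK-style swap list is first converted into explicit (destination, source) pairs, after which every row move is independent and can be handled by a separate thread. The conversion and names below are my own illustration, not the kernel from [1].

```c
/* Hedged sketch of parallel row swapping with two pivot vectors.
 * Step 1: turn the sequential LAPACK-style swap list (swap row k with row
 *         ipiv[k], k = 0, 1, 2, ...) into explicit (dst, src) pairs.
 * Step 2: every pair is independent, so all row moves can be done in
 *         parallel (one GPU thread / loop iteration per row).
 * Names and structure are illustrative, not the code from [1]. */
#include <stdio.h>
#include <string.h>

#define N 4              /* rows    */
#define M 3              /* columns */

int main(void) {
    double a[N][M] = { {0,0,0}, {1,1,1}, {2,2,2}, {3,3,3} };
    int ipiv[N] = { 2, 2, 3, 3 };       /* sequential swaps: row k <-> row ipiv[k] */

    /* Step 1: simulate the swaps on an index vector to find, for every
     * destination row, which original row must end up there. */
    int src[N], dst[N], nmoves = 0;
    int where[N];                        /* where[i] = original row now at slot i */
    for (int i = 0; i < N; i++) where[i] = i;
    for (int k = 0; k < N; k++) {
        int t = where[k]; where[k] = where[ipiv[k]]; where[ipiv[k]] = t;
    }
    for (int i = 0; i < N; i++)
        if (where[i] != i) { dst[nmoves] = i; src[nmoves] = where[i]; nmoves++; }

    /* Step 2: apply all moves in parallel as a gather from a row copy;
     * no move depends on another, unlike the sequential swap loop. */
    double tmp[N][M];
    memcpy(tmp, a, sizeof a);
    #pragma omp parallel for             /* each row move is independent */
    for (int k = 0; k < nmoves; k++)
        memcpy(a[dst[k]], tmp[src[k]], M * sizeof(double));

    for (int i = 0; i < N; i++)
        printf("row %d = %g\n", i, a[i][0]);
    return 0;
}
```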

Parallel Swap https://en.wikipedia.org/wiki/Swap_(computer_programming)#Parallel_execution

Parallel Swap The execution trace of the batched LU for 2000 matrices of size 512. [1]

Wave2D

Introduction We were given a Schrödinger wave-propagation algorithm (Wave2D) and asked to parallelize it using MPI so that it executes on multiple nodes.
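
As a baseline for the MPI parallelization, here is a minimal halo-exchange skeleton (the "Standard" scheme): the grid is split into horizontal strips and each rank exchanges one ghost row per neighbour every time step. Grid size, stencil, and names are placeholders, not the assignment's code.

```c
/* Minimal MPI halo-exchange skeleton for a 2D wave-style stencil.
 * The N x N grid is split into horizontal strips; each rank owns `rows`
 * interior rows plus one ghost row above and below, and exchanges one
 * ghost row per neighbour every time step ("Standard" scheme).
 * Grid size, stencil, and names are placeholders, not the course code. */
#include <mpi.h>
#include <stdlib.h>

#define N     576           /* global grid width */
#define STEPS 500           /* time steps        */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = N / size;                          /* strip height (assume it divides) */
    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* rows + 2 ghost rows; cur = current time step, nxt = next */
    double *cur = calloc((rows + 2) * N, sizeof *cur);
    double *nxt = calloc((rows + 2) * N, sizeof *nxt);

    for (int t = 0; t < STEPS; t++) {
        /* exchange one ghost row with each neighbour (Size * 8 bytes each way) */
        MPI_Sendrecv(&cur[1 * N],        N, MPI_DOUBLE, up,   0,
                     &cur[(rows + 1) * N], N, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&cur[rows * N],     N, MPI_DOUBLE, down, 1,
                     &cur[0],              N, MPI_DOUBLE, up,   1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* simple 5-point stencil update on the interior rows (placeholder) */
        for (int i = 1; i <= rows; i++)
            for (int j = 1; j < N - 1; j++)
                nxt[i*N + j] = 0.25 * (cur[(i-1)*N + j] + cur[(i+1)*N + j] +
                                       cur[i*N + j - 1] + cur[i*N + j + 1]);

        double *tmp = cur; cur = nxt; nxt = tmp;  /* advance in time */
    }

    MPI_Finalize();
    return 0;
}
```

A "Doubled" variant would presumably widen the ghost region to two rows and exchange 2 * N doubles every other time step, which is consistent with Msg Size = Size * 8 * 2 bytes and roughly 500 / 2 messages on the Formula slide, trading a little redundant computation for half as many messages.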

Double Message Size

Test

Test Methods
Parameters:
- 1, 4, 8, and 16 nodes
- 100-iteration average
- 500 time steps
- Speedup from 4, 8, and 16 nodes averaged

4 methods:
- Standard
- Doubled
- Prioritized
- Doubled/Prioritized

Average Speedup – Large matrices

Average Speedup – Small matrices

Efficiency

Comparisons

Formula
- Size = matrix size
- Msg Size = Size * 8 * 2 bytes
- MTU (Maximum Transmission Unit) = 1500 bytes
- # Msg = (500 - 2) / 2 = 248
- # Packets = ROUNDUP(Msg Size / MTU)
- Msg Time = # Msg * # Packets
- Sequential runtime and estimated calculations per message measured separately
- Msg-time-to-calc-time ratio = (Runtime / Calcs per Msg) / Msg Time

Size   Msg Time   Calc Time   Ratio (Calc/Msg)
176    496        347         0.70
256    744        716         0.96
336    992        1277        1.29
416    1240       1996        1.61
496    1488       2864        1.92
576    1736       3860        2.22
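
The cost model above as a small program; it only re-evaluates the message-side formulas for the tested sizes (the calc-time column in the table is measured, so it is not reproduced here).

```c
/* Re-evaluates the formula slide: Msg Size = Size * 8 * 2 bytes,
 * #Packets = ROUNDUP(Msg Size / MTU), Msg Time = #Msg * #Packets.
 * The calc-time column of the table is measured, not computed here. */
#include <stdio.h>

int main(void) {
    const int mtu = 1500;                 /* bytes per packet (MTU)        */
    const int n_msg = 248;                /* messages over 500 time steps  */
    int sizes[] = { 176, 256, 336, 416, 496, 576 };

    for (int i = 0; i < 6; i++) {
        int msg_bytes = sizes[i] * 8 * 2;                 /* two rows of doubles */
        int packets   = (msg_bytes + mtu - 1) / mtu;      /* ROUNDUP(bytes/MTU)  */
        int msg_time  = n_msg * packets;                  /* table's "Msg Time"  */
        printf("Size %3d: %5d bytes, %d packets, Msg Time = %d\n",
               sizes[i], msg_bytes, packets, msg_time);
    }
    return 0;   /* reproduces the Msg Time column: 496, 744, 992, 1240, 1488, 1736 */
}
```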

References
[1] Haidar, A., Tomov, S., Dong, T., Dongarra, J., Luszczek, P.: Optimization for Performance and Energy for Batched Matrix Computations on GPUs. ACM (2015)
[2] Tomov, S., Nath, R., Dongarra, J.: Dense linear algebra solvers for multicore with GPU accelerators. In: Proc. of IEEE IPDPS'10, Atlanta, GA, April 19-23, 2010
[3] Wainwright, I.: Optimized LU-decomposition with full pivot for small batched matrices. GTC'13, ID S3069, April 2013
[4] Dong, T., Haidar, A., Luszczek, P., Tomov, S., Abdelfattah, A., Dongarra, J.: MAGMA Batched: A Batched BLAS Approach for Small Matrix Factorizations and Applications on GPUs. ICL Tech Report (2016)