Adaptive Strassen and ATLAS’s DGEMM
Paolo D’Alberto (CMU) and Alexandru Nicolau (UCI)
HPC Asia, 12/2/2005
The Problem: Matrix Computations
- The evolution of complex systems is modeled by matrix computations.
- Predicting and evaluating such models is fundamental in scientific computing.
- For example: solving systems of linear equations or least-squares problems.
The Problem: BLAS
- The Basic Linear Algebra Subprograms (BLAS) are an interface describing a set of basic matrix and vector computations.
- Historically, the BLAS was a set of algorithms.
- Libraries implementing the BLAS are the backbone of today’s high-performance computations.
- For example: ScaLAPACK, ESSL, PHiPAC, and ATLAS.
The Problem: ATLAS
- Implementations of the BLAS 3 routines are based on matrix multiplication (MM).
- In practice, ATLAS automatically generates a custom-tailored MM:
  - It probes the system.
  - It tailors an MM kernel to the specific system.
  - It uses the MM as the basic routine for the other BLAS 3 routines.
Matrix Multiplication (basics)
  [C0 C1]   [A0 A1]   [B0 B1]
  [C2 C3] = [A2 A3] * [B2 B3]

  C0 = A0B0 + A1B2
  C1 = A0B1 + A1B3
  C2 = A2B0 + A3B2
  C3 = A2B1 + A3B3
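The four quadrant equations can be checked with a short sketch. It is only an illustration in plain Python lists; the helper names (`matmul`, `matadd`, `split`, `block_matmul`) are made up for this example and are not part of ATLAS.

```python
def matmul(a, b):
    """Classic triple-loop product of two list-of-lists matrices."""
    inner = len(b)
    return [[sum(a[i][t] * b[t][j] for t in range(inner))
             for j in range(len(b[0]))] for i in range(len(a))]

def matadd(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def split(x):
    """Split an even-size square matrix into quadrants X0, X1, X2, X3."""
    h = len(x) // 2
    return ([r[:h] for r in x[:h]], [r[h:] for r in x[:h]],
            [r[:h] for r in x[h:]], [r[h:] for r in x[h:]])

def block_matmul(a, b):
    """Compute C = A * B from the eight quadrant products."""
    a0, a1, a2, a3 = split(a)
    b0, b1, b2, b3 = split(b)
    c0 = matadd(matmul(a0, b0), matmul(a1, b2))
    c1 = matadd(matmul(a0, b1), matmul(a1, b3))
    c2 = matadd(matmul(a2, b0), matmul(a3, b2))
    c3 = matadd(matmul(a2, b1), matmul(a3, b3))
    # Stitch the quadrants back together: top rows, then bottom rows.
    return ([r0 + r1 for r0, r1 in zip(c0, c1)] +
            [r2 + r3 for r2, r3 in zip(c2, c3)])
```

The quadrant form performs eight half-size products and four half-size additions, which is the baseline that Strassen's algorithm improves on.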
The Problem: MM
- ATLAS uses this classic matrix multiply.
  - For square matrices of size n x n, the algorithm takes O(n^3) operations.
  - It achieves 80-90% of peak performance.
- Strassen’s algorithm is attractive for large problems because it reduces the number of computations (thus shortening the execution time).
- We investigate the effects on single-processor systems.
The Problem: Strassen’s
- Strassen’s algorithm, for matrices whose size is a power of two: O(n^(log2 7)), roughly O(n^2.81)
- For even-size matrices, one recursive step is always applicable.
- Otherwise:
  - Dynamic and static padding
  - Peeling, for odd-size matrices [Hauss 97 & Luo 2004]
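For reference, the textbook form of Strassen's recursion for power-of-two sizes can be sketched as follows. The quadrant numbering (0 to 3, row by row) follows the slides; the function names and the `leaf` cutoff parameter are assumptions of this example, and `classic` merely stands in for a tuned kernel such as ATLAS's DGEMM.

```python
def add(x, y):
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]

def sub(x, y):
    return [[p - q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]

def classic(a, b):
    """Classic O(n^3) product, standing in for the tuned leaf kernel."""
    inner = len(b)
    return [[sum(a[i][t] * b[t][j] for t in range(inner))
             for j in range(len(b[0]))] for i in range(len(a))]

def quadrants(x):
    h = len(x) // 2
    return ([r[:h] for r in x[:h]], [r[h:] for r in x[:h]],
            [r[:h] for r in x[h:]], [r[h:] for r in x[h:]])

def strassen(a, b, leaf=2):
    """Multiply two square matrices whose size is a power of two."""
    n = len(a)
    if n <= leaf:  # below the cutoff, fall back to the classic product
        return classic(a, b)
    a0, a1, a2, a3 = quadrants(a)
    b0, b1, b2, b3 = quadrants(b)
    # Seven half-size products instead of eight.
    m1 = strassen(add(a0, a3), add(b0, b3), leaf)
    m2 = strassen(add(a2, a3), b0, leaf)
    m3 = strassen(a0, sub(b1, b3), leaf)
    m4 = strassen(a3, sub(b2, b0), leaf)
    m5 = strassen(add(a0, a1), b3, leaf)
    m6 = strassen(sub(a2, a0), add(b0, b1), leaf)
    m7 = strassen(sub(a1, a3), add(b2, b3), leaf)
    c0 = add(sub(add(m1, m4), m5), m7)
    c1 = add(m3, m5)
    c2 = add(m2, m4)
    c3 = add(add(sub(m1, m2), m3), m6)
    return ([r0 + r1 for r0, r1 in zip(c0, c1)] +
            [r2 + r3 for r2, r3 in zip(c2, c3)])
```

Each step trades one of the eight half-size products for extra matrix additions (18 additions and subtractions in this formulation), which is why the cutoff size matters in practice.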
Odd-Size Square Matrices
[Figure: A and B of size (2n+1) x (2n+1); their leading 2n x 2n blocks are A0 and B0]
- A0 * B0 is an even-size problem.
- Strassen is applied once more.
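One common way to formulate peeling is sketched below: the even-size core product A0 * B0 goes to a fast routine, and the last row and column are fixed up with classic operations. This is an illustration under that assumption, not necessarily the exact scheme of the cited works; `peel_multiply` and its `even_mm` parameter are hypothetical names.

```python
def classic(a, b):
    """Classic O(n^3) product of two list-of-lists matrices."""
    inner = len(b)
    return [[sum(a[i][t] * b[t][j] for t in range(inner))
             for j in range(len(b[0]))] for i in range(len(a))]

def peel_multiply(a, b, even_mm=classic):
    """Multiply two odd-size square matrices by peeling the last row/column.

    even_mm is the routine used for the even-size core product A0 * B0;
    a Strassen step could be plugged in here (classic is only a stand-in).
    """
    m = len(a) - 1                       # size of the even core, 2n
    a0 = [row[:m] for row in a[:m]]
    b0 = [row[:m] for row in b[:m]]
    core = even_mm(a0, b0)
    c = [[0] * (m + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(m):
            # Core product plus the rank-one update from the peeled
            # column of A and the peeled row of B.
            c[i][j] = core[i][j] + a[i][m] * b[m][j]
    for i in range(m):
        # Last column of C: full dot products with the last column of B.
        c[i][m] = sum(a[i][t] * b[t][m] for t in range(m + 1))
    for j in range(m + 1):
        # Last row of C: full dot products with the last row of A.
        c[m][j] = sum(a[m][t] * b[t][j] for t in range(m + 1))
    return c
```

The fix-up work is O(n^2), so it is cheap relative to the core product, but it is extra control and extra passes over the data.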
Our Approach: balanced division
- For any matrix size, we apply a balanced Strassen’s division process.
  - This reduces the number of computations further than the odd/even-size (peeling) or padded approaches.
- Balanced division = balanced workload
  - Thus, predictable performance
- Balanced-size operands
  - Better data cache utilization
Balanced Division
- Near-square matrices: m = n + p with min |n - p|
[Figure: A and B divided into quadrants A0, A1, A2, A3 and B0, B1, B2, B3; each dimension of size m is split into n and p]
- The quadrants are near-square matrices.
- At any step of the recursion, all sub-matrices are near-square matrices.
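The split m = n + p with minimal |n - p| can be sketched as below. The choice that n takes the larger half when m is odd is an assumption of this example (the slide does not say which side is larger), and the function names are hypothetical.

```python
def balanced_split(m):
    """Return (n, p) with m = n + p and |n - p| <= 1.

    Assumption: when m is odd, n gets the extra element.
    """
    n = (m + 1) // 2
    return n, m - n

def balanced_quadrants(x):
    """Divide a matrix into four near-square quadrants X0, X1, X2, X3."""
    rn, _ = balanced_split(len(x))       # row split
    cn, _ = balanced_split(len(x[0]))    # column split
    return ([r[:cn] for r in x[:rn]], [r[cn:] for r in x[:rn]],
            [r[:cn] for r in x[rn:]], [r[cn:] for r in x[rn:]])

def shape(x):
    return (len(x), len(x[0]))
```

For a 5 x 5 matrix this gives quadrants of shape 3 x 3, 3 x 2, 2 x 3, and 2 x 2, so the operands of the recursive products differ by at most one row or column.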
Balanced Matrices (new matrix add and multiplication)
- The balanced division with Strassen’s recursion needs a new definition of matrix addition (MA) because it adds matrices of different sizes.
- We generalize the operations such that:
  - The algorithm is correct.
  - The extra control for the irregular sizes is negligible and confined to matrix additions.
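One plausible reading of the generalized MA, assumed here rather than taken from the slides, is to treat the smaller operand as if it were zero-padded to the larger shape, without ever materializing the padding. The name `general_add` and its `sign` parameter are hypothetical.

```python
def general_add(x, y, sign=1):
    """Add (sign=1) or subtract (sign=-1) matrices of possibly different shapes.

    Assumption: entries that an operand does not have are read as zero,
    so the result takes the larger of the two shapes in each dimension.
    """
    rows = max(len(x), len(y))
    cols = max(len(x[0]), len(y[0]))

    def at(mat, i, j):
        # The only extra control: a bounds check on irregular edges.
        if i < len(mat) and j < len(mat[0]):
            return mat[i][j]
        return 0

    return [[at(x, i, j) + sign * at(y, i, j) for j in range(cols)]
            for i in range(rows)]
```

For example, adding a 2 x 2 quadrant to a 3 x 3 quadrant changes only the leading 2 x 2 entries of the larger one. A tuned version would split the loops into a regular interior and a thin edge instead of checking bounds per element.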
Experimental Results
- We considered 14 systems.
- We hand-coded the MA for each specific system.
- We measured the performance of ATLAS’s MM and MA.
- We specified an adaptive recursion point size for each system.
- We encoded the recursion point in the algorithm.
- We measured the relative performance of Strassen vs. ATLAS.
- We report the details for three systems shortly.
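To illustrate how measured MM and MA rates could determine a recursion point, here is a sketch under a deliberately simple cost model, which is an assumption of this example and not the authors' procedure: a classic n x n MM costs 2n^3 operations at rate `mm_rate`, an n x n MA costs n^2 operations at rate `ma_rate`, and one Strassen step uses 7 half-size MMs and 18 half-size MAs. Under that model a step pays off when n > 18 * mm_rate / ma_rate. All names and the rates used below are hypothetical.

```python
def classic_time(n, mm_rate):
    """Modeled time of one classic n x n MM (2n^3 operations)."""
    return 2 * n**3 / mm_rate

def one_step_time(n, mm_rate, ma_rate):
    """Modeled time of one Strassen step: 7 half-size MMs + 18 half-size MAs."""
    half = n / 2
    return 7 * (2 * half**3 / mm_rate) + 18 * (half**2 / ma_rate)

def recursion_point(mm_rate, ma_rate):
    """Size above which one Strassen step beats the classic MM in this model.

    Derived from 2n^3/mm_rate > (7/4)n^3/mm_rate + (9/2)n^2/ma_rate.
    """
    return 18 * mm_rate / ma_rate

# Hypothetical rates, in operations per second (not measured values).
rp = recursion_point(mm_rate=4e9, ma_rate=2e8)
```

The faster the MM kernel is relative to MA, the larger the recursion point, which is consistent with some systems offering little practical room for Strassen's algorithm.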
Opteron
[Figure: performance plot; labels: Strassen + ATLAS, ATLAS’s Performance (the higher the better)]
8600 PA-RISC
[Figure: performance plot; labels: Strassen + ATLAS, ATLAS’s Performance]
ALPHA
[Figure: performance plot; labels: Strassen + ATLAS, ATLAS’s Performance]
Conclusions
- Our approach uses a balanced division, as Strassen’s algorithm does.
- We tested performance exhaustively.
  - Some architectures offer no practical opportunity for Strassen’s algorithm.
- We use benchmarks of ATLAS’s MM and MA for system-specific code tuning, in the spirit of adaptive software packages.
- We speed up ATLAS’s MM without introducing any overhead due to data layout or extra control.
Future work
- The algorithm extends to rectangular matrices.
  - We will characterize its performance.
- Parallel formulation and performance
- Power management
  - MM and MA compose the application; however, they utilize the architecture differently.
  - Adaptation of hardware configurations (e.g., XScale)