Automatically Tuned Linear Algebra Software (ATLAS)
R. Clint Whaley
University of Tennessee

What is ATLAS
A package that adapts to differing architectures via AEOS techniques
  - Initially, supply BLAS
Automated Empirical Optimization of Software (AEOS)
  - Machine searches the optimization space
  - Finds application-apparent architecture
AEOS requires:
  - Method of code variation
    » Code generation
    » Multiple implementations
    » Parameterization
  - Sophisticated timers
  - Robust search heuristic
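
The three AEOS requirements (a way to vary the code, a timer, and a search heuristic) can be illustrated with a toy search loop. This is only a minimal sketch, not ATLAS's actual search: the names `time_call`, `blocked_sum_of_squares` and `empirical_search` are hypothetical, the "kernel" is a trivial parameterized loop, and the heuristic is an exhaustive scan over a handful of candidate block sizes.

```python
import time


def time_call(fn, repeats=3):
    """Timer: return the best wall-clock time of fn() over several repeats."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def blocked_sum_of_squares(data, block):
    """Toy parameterized kernel: `block` changes only the loop structure."""
    total = 0.0
    for start in range(0, len(data), block):
        for x in data[start:start + block]:
            total += x * x
    return total


def empirical_search(candidates, make_kernel, timer=time_call):
    """Search heuristic (exhaustive): time each candidate, keep the fastest."""
    timings = {}
    for value in candidates:
        timings[value] = timer(make_kernel(value))
    best = min(timings, key=timings.get)
    return best, timings


if __name__ == "__main__":
    data = [float(i) for i in range(20000)]
    best, timings = empirical_search(
        [16, 64, 256, 1024],
        lambda nb: (lambda: blocked_sum_of_squares(data, nb)),
    )
    print("selected block size:", best)
```

The selected value depends on the machine and on timing noise, which is the point of an empirical search: the choice is measured on the target platform rather than predicted from a model of it.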

Why ATLAS is needed
BLAS require many man-hours per platform
  - Only done if financial incentive is there
    » Many platforms will never have an optimal version
  - Lags behind hardware
  - May not be affordable by everyone
  - Improves vendor code
Allows for portably optimal codes
  - Obsolescence insurance
Operations may be important, but not general enough for a standard

ATLAS Software
Coming soon
  - pthread support
  - Open source kernels
    » SSE & 3DNOW!
    » GOTO ev5/6 BLAS
  - Performance for banded and packed
  - More LAPACK
Coming not-so-soon
  - Sparse support
  - User customization
Currently provided
  - Full BLAS (C & F77)
    » Level 3 BLAS
      • Generated GEMM
        - 1-2 hours install time per precision
      • Recursive GEMM-based L3 BLAS
        - Antoine Petitet
    » Level 2 BLAS
      • GEMV & GER kernels
    » Level 1 BLAS
  - Some LAPACK
    » LU, LL^T

Algorithmic Approach for Matrix Multiply
Only generated code is on-chip multiply
All BLAS operations written in terms of generated on-chip multiply
All transpose cases coerced through data copy to 1 case of on-chip multiply
  - Only 1 case generated per platform
[Figure: matrix multiply C = A * B with dimensions M, N, K and blocking factor NB]
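
The copy-then-multiply idea on this slide can be sketched as follows. This is an illustrative sketch under stated assumptions, not ATLAS's generated code: `copy_to_blocks`, `on_chip_multiply` and `gemm` are hypothetical names, the blocks are plain Python lists, and the default `nb=2` merely stands in for a tuned blocking factor NB. The point it shows is that the transpose handling lives entirely in the copy step, so only one kernel case is needed.

```python
def copy_to_blocks(mat, rows, cols, nb, transpose=False):
    """Copy op(mat) (rows x cols) into dense blocks keyed by (block_row, block_col).

    With transpose=True, mat is stored as cols x rows and op(mat) is its transpose.
    """
    blocks = {}
    for bi in range(0, rows, nb):
        for bj in range(0, cols, nb):
            blocks[(bi // nb, bj // nb)] = [
                [mat[j][i] if transpose else mat[i][j]
                 for j in range(bj, min(bj + nb, cols))]
                for i in range(bi, min(bi + nb, rows))
            ]
    return blocks


def on_chip_multiply(c_blk, a_blk, b_blk):
    """The single kernel case: C_blk += A_blk * B_blk on small dense blocks."""
    for i, a_row in enumerate(a_blk):
        for k, a_ik in enumerate(a_row):
            for j, b_kj in enumerate(b_blk[k]):
                c_blk[i][j] += a_ik * b_kj


def gemm(a, b, m, n, k, nb=2, trans_a=False, trans_b=False):
    """Return C = op(A) * op(B) as an m x n list of lists."""
    a_blocks = copy_to_blocks(a, m, k, nb, trans_a)
    b_blocks = copy_to_blocks(b, k, n, nb, trans_b)
    c = [[0.0] * n for _ in range(m)]
    for bi in range(0, m, nb):
        for bj in range(0, n, nb):
            height = min(bi + nb, m) - bi
            width = min(bj + nb, n) - bj
            c_blk = [[0.0] * width for _ in range(height)]
            for bk in range(0, k, nb):
                on_chip_multiply(c_blk,
                                 a_blocks[(bi // nb, bk // nb)],
                                 b_blocks[(bk // nb, bj // nb)])
            # Write the finished block back into C.
            for i, row in enumerate(c_blk):
                c[bi + i][bj:bj + width] = row
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]            # 2 x 3
    b_t = [[7, 9, 11], [8, 10, 12]]       # B stored transposed (2 x 3)
    print(gemm(a, b_t, 2, 2, 3, trans_b=True))
```

All four combinations of `trans_a` and `trans_b` reach the same `on_chip_multiply`; the cost of the copy is paid once per operand, while the kernel runs once per block triple.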

Algorithmic approach for Level 3 BLAS
Recur down to L1 cache block size
Need kernel at bottom of recursion
  - Use gemm-based kernel for portability
[Figure: Recursive TRMM]
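
As a rough illustration of the recursion, the sketch below computes B := L * B in place for a lower-triangular L by splitting L into quadrants. It is a sketch under assumptions, not ATLAS's implementation: the function names are hypothetical, `cutoff` stands in for the L1 cache block size, and the base case is a plain loop rather than the gemm-based kernel the slide calls for.

```python
def gemm_update(l, b, r0, r1, k0, k1):
    """GEMM stand-in: B[r0:r1, :] += L[r0:r1, k0:k1] * B[k0:k1, :]."""
    for i in range(r0, r1):
        for k in range(k0, k1):
            l_ik = l[i][k]
            for j in range(len(b[i])):
                b[i][j] += l_ik * b[k][j]


def trmm_base(l, b, lo, hi):
    """Unblocked B[lo:hi, :] := L[lo:hi, lo:hi] * B[lo:hi, :], L lower triangular."""
    # Work from the bottom row up so the rows read above row i are still unmodified.
    for i in range(hi - 1, lo - 1, -1):
        for j in range(len(b[i])):
            b[i][j] = sum(l[i][k] * b[k][j] for k in range(lo, i + 1))


def recursive_trmm(l, b, lo=0, hi=None, cutoff=2):
    """In-place B := L * B by recursion on the row range [lo, hi)."""
    hi = len(l) if hi is None else hi
    if hi - lo <= cutoff:
        trmm_base(l, b, lo, hi)
        return
    mid = (lo + hi) // 2
    recursive_trmm(l, b, mid, hi, cutoff)   # B2 := L22 * B2
    gemm_update(l, b, mid, hi, lo, mid)     # B2 += L21 * B1 (B1 not yet overwritten)
    recursive_trmm(l, b, lo, mid, cutoff)   # B1 := L11 * B1


if __name__ == "__main__":
    l = [[1, 0, 0, 0], [2, 3, 0, 0], [4, 5, 6, 0], [7, 8, 9, 10]]
    b = [[1, 2], [3, 4], [5, 6], [7, 8]]
    recursive_trmm(l, b, cutoff=1)
    print(b)
```

The order of the three steps matters: the lower half of B is finished first because it needs the original upper half, and the off-diagonal update is a plain matrix multiply, which is where most of the work ends up as the recursion unfolds.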

500x500 DGEMM Across Various Architectures

x 500 Double Precision RB LU factorization

500x500 Recursive BLAS on UltraSparc 2200