Automatic Performance Tuning


Automatic Performance Tuning Jeremy Johnson Dept. of Computer Science Drexel University

Outline
- Scientific Computation Kernels
  - Matrix Multiplication
  - Fast Fourier Transform (FFT)
  - Integer Multiplication
- Automated Performance Tuning (IEEE Proc. Vol. 93, No. 2, Feb. 2005)
  - ATLAS
  - FFTW
  - SPIRAL
  - GMP

Matrix Multiplication and the FFT

Basic Linear Algebra Subprograms (BLAS)
- Level 1 – vector-vector: O(n) data, O(n) operations
- Level 2 – matrix-vector: O(n^2) data, O(n^2) operations
- Level 3 – matrix-matrix: O(n^2) data, O(n^3) operations = data reuse = locality!
LAPACK is built on top of BLAS (Level 3).
Blocking (for the memory hierarchy) is the single most important optimization for linear algebra algorithms.
GEMM – General Matrix Multiplication:
  SUBROUTINE DGEMM ( TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC )
computes C := alpha*op( A )*op( B ) + beta*C, where op( X ) = X or X'

DGEMM …
*        Form  C := alpha*A*B + beta*C.
*
         DO 90, J = 1, N
            IF( BETA.EQ.ZERO )THEN
               DO 50, I = 1, M
                  C( I, J ) = ZERO
   50          CONTINUE
            ELSE IF( BETA.NE.ONE )THEN
               DO 60, I = 1, M
                  C( I, J ) = BETA*C( I, J )
   60          CONTINUE
            END IF
            DO 80, L = 1, K
               IF( B( L, J ).NE.ZERO )THEN
                  TEMP = ALPHA*B( L, J )
                  DO 70, I = 1, M
                     C( I, J ) = C( I, J ) + TEMP*A( I, L )
   70             CONTINUE
               END IF
   80       CONTINUE
   90    CONTINUE

Matrix Multiplication Performance

Matrix Multiplication Performance

Numerical Recipes
Numerical Recipes in C – The Art of Scientific Computing, 2nd Ed., William H. Press, Saul A. Teukolsky, William T. Vetterling, Brian P. Flannery, Cambridge University Press, 1992.
“This book is unique, we think, in offering, for each topic considered, a certain amount of general discussion, a certain amount of analytical mathematics, a certain amount of discussion of algorithmics, and (most important) actual implementations of these ideas in the form of working computer routines.”
1. Preliminaries
2. Solution of Linear Algebraic Equations
…
12. Fast Fourier Transform
19. Partial Differential Equations
20. Less-Numerical Algorithms

four1

four1 (cont)

FFT Performance

ATLAS Architecture and Search Parameters
- NB – L1 data cache tile size
- NCNB – L1 data cache tile size for the non-copying version
- MU, NU – register tile size
- KU – unroll factor for the k' loop
- LS – latency for computation scheduling
- FMA – 1 if a fused multiply-add is available, 0 otherwise
- FF, IF, NF – scheduling of loads
Yotov et al., Is Search Really Necessary to Generate High-Performance BLAS?, Proc. IEEE, Vol. 93, No. 2, Feb. 2005

ATLAS Code Generation
Optimization for locality: cache tiling and register tiling

ATLAS Code Generation: Register Tiling
- Loop unrolling: the register tile must fit in the register file, MU + NU + MU×NU ≤ NR
- Scalar replacement
- Add/mul interleaving
- Loop skewing
Innermost computation: C(i'', j'') = C(i'', j'') + A(i'', k'') * B(k'', j'')
[Figure: an MU×NU tile of C is updated from an MU-wide strip of A and an NU-wide strip of B inside an NB×NB cache tile; the schedule interleaves mul_1 … mul_{MU×NU} with add_1 … add_{MU×NU}, offset by the latency LS.]
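A minimal C sketch of the register-tiling idea, with MU = NU = 2 (so MU + NU + MU×NU = 8 values for the compiler to keep in registers). This illustrates the transformation, not ATLAS's generated code; m and n are assumed even here (ATLAS emits cleanup code for the remainders):

```c
/* mini-MMM with a 2x2 register tile (MU = NU = 2); column-major,
 * C is m x n, A is m x k, B is k x n; m, n assumed even. */
static void mmm_tiled(int m, int n, int k,
                      const double *A, const double *B, double *C)
{
    for (int j = 0; j < n; j += 2)
        for (int i = 0; i < m; i += 2) {
            /* the 2x2 block of C lives in scalars ("registers") for
             * the entire k loop: unrolling + scalar replacement */
            double c00 = 0, c10 = 0, c01 = 0, c11 = 0;
            for (int l = 0; l < k; l++) {
                double a0 = A[i + l*m],  a1 = A[i+1 + l*m];
                double b0 = B[l + j*k],  b1 = B[l + (j+1)*k];
                c00 += a0*b0;  c10 += a1*b0;   /* 4 muls + 4 adds per */
                c01 += a0*b1;  c11 += a1*b1;   /* iteration, reusing  */
            }                                  /* a0, a1, b0, b1      */
            C[i + j*m]     = c00;  C[i+1 + j*m]     = c10;
            C[i + (j+1)*m] = c01;  C[i+1 + (j+1)*m] = c11;
        }
}
```

Each loaded element of A and B now feeds two multiply-adds, halving memory traffic relative to the unblocked loop; larger MU and NU increase reuse until the register file (NR) is exhausted.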

ATLAS Search
- Estimate machine parameters (C1, NR, FMA, LS), used to bound the search
- Orthogonal line search: fix all parameters except one and search for the optimal value of that parameter
- Search order: NB; MU, NU; KU; LS; FF, IF, NF; NCNB; cleanup codes

Using FFTW

FFTW Infrastructure
- Codelets: optimized code sequences for small FFTs
- Plan: encodes the divide-and-conquer strategy and stores the “twiddle factors”
- Executor: computes the FFT of the given data using the algorithm described by the plan
- Code sequences are combined using the divide-and-conquer structure of the FFT; dynamic programming is used to find an efficient way to combine them
[Figure: a right-recursive plan tree with nodes 15, 3, 12, 4, 8, 3, 5]

SPIRAL
The user (mathematician or expert programmer) specifies a DSP transform and goes for a coffee (or an espresso for small transforms); a platform-adapted implementation comes back.
Inside the SPIRAL system:
- Formula Generator – produces a fast algorithm as an SPL formula (algorithm generation)
- SPL Compiler – translates the formula into C/Fortran/SIMD code (implementation options)
- Search Engine – measures the runtime on the given platform and controls both algorithm generation and the implementation options

DSP Algorithms: Example 4-point DFT
Cooley/Tukey FFT (size 4):

  DFT_4 = (DFT_2 ⊗ I_2) · T^4_2 · (I_2 ⊗ DFT_2) · L^4_2

where DFT_2 is the 2-point Fourier transform, T^4_2 a diagonal matrix (twiddles), ⊗ the Kronecker product, I_2 the identity, and L^4_2 a permutation.

Cooley-Tukey Factorization

  DFT_n = (DFT_r ⊗ I_m) · T^n_m · (I_r ⊗ DFT_m) · L^n_r,   n = r·m

(Fourier transforms, an identity, a diagonal matrix of twiddles, Kronecker products, and a permutation, as in the size-4 example.)
- Such algorithms reduce the arithmetic cost from O(N^2) to O(N log N)
- The transform becomes a product of structured sparse matrices
- The mathematical notation exhibits the structure
- It introduces degrees of freedom (different breakdown strategies) which can be optimized

Algorithms = Ruletrees = Formulas

Generated DFT Vector Code: Pentium 4, SSE
[Figure: (pseudo) Gflop/s vs. n for DFT of size 2^n, single precision, on a 2.53 GHz Pentium 4, compiled with the Intel C compiler 6.0; hand-tuned vendor assembly code shown for comparison.]
Speedups over C code of up to a factor of 3.1.

Best DFT Trees, size 2^10 = 1024
[Figure: the best breakdown trees found for scalar, C vect, and SIMD code on four platform/datatype combinations: Pentium 4 float, Pentium 4 double, Pentium III float, AthlonXP float.]
The best trees are platform- and datatype-dependent.

Crosstiming of Best Trees on Pentium 4
[Figure: slowdown factor w.r.t. the best tree, vs. n, for DFT of size 2^n in single precision, when running the best trees found for the other platforms.]
Software adaptation is necessary.

GMP Integer Multiplication
Polyalgorithm:
- Classical O(n^2) algorithm
- Karatsuba
- Toom-Cook (Toom-3)
- Schönhage-Strassen (FFT)
Algorithm thresholds determine when one algorithm switches to the next; a tune-up program empirically sets the threshold parameters for a given platform.
Example thresholds*: Karatsuba 34, Toom-3 117, FFT 5888
*Default tuning @ AMD Athlon X2 2.8 GHz / 32-bit Linux