Carnegie Mellon
SPIRAL: An Overview
José M.F. Moura and Markus Püschel
Faculty: José Moura (CMU), Jeremy Johnson (Drexel), Robert Johnson (MathStar), David Padua (UIUC), Viktor Prasanna (USC), Markus Püschel (CMU), Manuela Veloso (CMU)
Students: Gavin Haentjens (CMU), Pinit Kumhom (Drexel), Neungsoo Park (USC), David Sepiashvili (CMU), Bryan Singer (CMU), Yevgen Voronenko (Drexel), Edward Wertz (CMU), Jianxin Xiong (UIUC)
Collaborators: Christoph Überhuber (TU Vienna), Franz Franchetti (TU Vienna)
Sponsor
Work supported by DARPA (DSO), Applied & Computational Mathematics Program, OPAL, through research grant DABT, administered by the Army Directorate of Contracting.
Moore’s Law and High(est) Performance Scientific Computing (single processor, off-the-shelf)
Moore’s Law:
- processor-memory bottleneck
- short life cycles of computers
- very complex architectures
  - vendor specific
  - special instructions (MMX, SSE, FMA, …)
  - undocumented features
Consequences for software/algorithms:
- arithmetic cost model (counting adds and mults) is not accurate for predicting runtime
- best code is machine dependent (registers/caches size, structure)
- hand-tuned code becomes obsolete as fast as it is written
- compiler limitations
Portable performance requires automation
SPIRAL
A library generator for highly optimized signal processing algorithms
Automates:
- Implementation
- Platform-Adaptation
- Optimization of DSP algorithms, which are performance critical
Benefits:
- cuts development costs
- code less error-prone
- takes advantage of architecture specific features
- porting without loss of performance
- systematic exploration of alternatives both at algorithmic and code level
SPIRAL system
- user specifies a DSP transform
- Formula Generator: produces a fast algorithm as SPL formula
- SPL Compiler: produces C/Fortran/SIMD code
- Search Engine: takes the runtime on the given platform, controls algorithm generation and controls implementation options
- user goes for a coffee (or an espresso for small transforms)
- user comes back to a platform-adapted implementation
Related Work on Code Generation/Adaptation
- PhiPAC, ATLAS (linear algebra)
  - Enumeration and evaluation of different blocking, looping, etc. strategies for BLAS routines
- SPARSITY (sparse matrix-vector multiply)
  - Search for optimal blocking strategy to improve register performance
- FFTW (discrete Fourier transform package)
  - Generated code modules (machine independent) for small sizes
  - Flexible recursion to adapt to memory hierarchy
- SPIRAL
  - Code generation and adaptation for an entire domain (linear transforms) of structurally complex algorithms
  - Adaptation to all architecture features (memory, cache, register, etc.) by automatic exploration of algorithm space
DSP Transform -> Algorithm
DSP Algorithms: Example 4-point DFT
Cooley/Tukey FFT (size 4):
    DFT_4 = (DFT_2 ⊗ I_2) · T^4_2 · (I_2 ⊗ DFT_2) · L^4_2
    DFT_2: Fourier transform; I_2: identity; L^4_2: permutation; T^4_2: diagonal matrix (twiddles); ⊗: Kronecker product
- algorithms reduce arithmetic cost from O(n^2) to O(n log(n))
- an algorithm is a product of structured sparse matrices
- mathematical notation exhibits structure
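The factorization on this slide can be checked numerically. The sketch below assumes the usual general form of the Cooley/Tukey rule, DFT_rs = (DFT_r ⊗ I_s) T^rs_s (I_r ⊗ DFT_s) L^rs_r, and builds each sparse factor as a dense list-of-rows matrix in plain Python. The helper names (`kron`, `twiddle`, `stride_perm`, `cooley_tukey`) are illustrative and are not part of SPIRAL.

```python
import cmath

def matmul(a, b):
    """Dense matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def kron(a, b):
    """Kronecker (tensor) product a (x) b."""
    rb, cb = len(b), len(b[0])
    return [[a[i // rb][j // cb] * b[i % rb][j % cb]
             for j in range(len(a[0]) * cb)]
            for i in range(len(a) * rb)]

def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def dft(n):
    """DFT matrix by definition: entry (i, j) is w_n^(i*j), w_n = exp(-2*pi*i/n)."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (i * j) for j in range(n)] for i in range(n)]

def twiddle(r, s):
    """Diagonal twiddle matrix T^(rs)_s: entry w_(rs)^(i*j) at position i*s + j."""
    n = r * s
    w = cmath.exp(-2j * cmath.pi / n)
    d = [w ** (i * j) for i in range(r) for j in range(s)]
    return [[d[p] if p == q else 0 for q in range(n)] for p in range(n)]

def stride_perm(r, s):
    """Stride permutation L^(rs)_r: output position i*s + j reads input j*r + i."""
    n = r * s
    p = [[0] * n for _ in range(n)]
    for i in range(r):
        for j in range(s):
            p[i * s + j][j * r + i] = 1
    return p

def cooley_tukey(r, s):
    """Product of the four structured factors for DFT of size r*s."""
    m = matmul(kron(dft(r), identity(s)), twiddle(r, s))
    m = matmul(m, kron(identity(r), dft(s)))
    return matmul(m, stride_perm(r, s))

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

if __name__ == "__main__":
    # The size-4 case from the slide: r = s = 2.
    print(max_abs_diff(cooley_tukey(2, 2), dft(4)) < 1e-12)
```

Dense matrices are used only to keep the check short; the point of the factorization is that each factor is sparse and structured, so real code never forms these matrices.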
DSP Algorithms: Terminology (SPIRAL)
- Transform: parameterized matrix
- Rule: a breakdown strategy; product of sparse matrices
- Ruletree: recursive application of rules; uniquely defines an algorithm; efficient representation; easy manipulation
- Formula: few constructs and primitives; uniquely defines an algorithm; can be translated into code
DSP Transforms
- discrete Fourier transform
- Walsh-Hadamard transform
- discrete cosine and sine transforms (16 types)
- modified discrete cosine transform
- two-dimensional transform
- Others: filters, discrete wavelet transforms, Haar, Hartley, …
Rules = Breakdown Strategies
[Rule formulas omitted; labels on the rules shown: base case, recursive, translation, iterative, recursive, iterative/recursive]
- rules are built from few constructs and primitives
Algorithms = Ruletrees = Formulas
[Figure: ruletrees whose nodes are labeled with the applied rules R1, R3, R4, R6]
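To make the ruletree/formula correspondence concrete, here is a minimal sketch that expands a ruletree into an SPL formula string. It assumes a single rule (the Cooley/Tukey rule for DFTs of size 2^k) rather than the rules R1 to R6 in the figure, and the tree encoding (an integer k for an unexpanded leaf, a pair for a split) is a hypothetical choice made for this illustration.

```python
def log_size(tree):
    """log2 of the transform size described by a ruletree."""
    if isinstance(tree, int):
        return tree
    return log_size(tree[0]) + log_size(tree[1])

def tree_to_spl(tree):
    """Expand a DFT ruletree into an SPL formula string.

    A tree is an int k (leaf: the transform F of size 2**k) or a pair
    (left, right), meaning one application of the Cooley/Tukey rule
    DFT_rs = (DFT_r (x) I_s) T^rs_s (I_r (x) DFT_s) L^rs_r.
    """
    if isinstance(tree, int):
        return "(F %d)" % (2 ** tree)
    left, right = tree
    r, s = 2 ** log_size(left), 2 ** log_size(right)
    n = r * s
    return ("(compose (tensor %s (I %d)) (T %d %d) (tensor (I %d) %s) (L %d %d))"
            % (tree_to_spl(left), s, n, s, r, tree_to_spl(right), n, r))

if __name__ == "__main__":
    # One split of size 4 into 2 x 2.
    print(tree_to_spl((1, 1)))
    # Two different ruletrees for size 8 give two different formulas.
    print(tree_to_spl((1, (1, 1))))
    print(tree_to_spl(((1, 1), 1)))
```

Each tree yields exactly one formula, which is the sense in which a ruletree uniquely defines an algorithm.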
Formula for a DCT, size 16
Number of Formulas/Algorithms
Using the rules included in SPIRAL:
[Table by k; entries that survive:]
- # DFTs, size 2^k: ~ ^27, ~ ^61, ~ ^133
- # DCT IV, size 2^k: ~ ^38, ~ ^76, ~ ^153
exponential search space
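The growth of the search space can be illustrated with a much smaller rule set than SPIRAL's. The sketch below counts ruletrees for a DFT of size 2^k under the assumption that the only rule is a binary Cooley/Tukey split and that size 2 is the only base case. Both are simplifications made for this example, so the counts are far below those on the slide.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_ruletrees(k):
    """Number of distinct ruletrees for size 2**k using only binary splits."""
    if k == 1:
        return 1  # base case: size 2, no further breakdown
    # Choose the split 2**i * 2**(k - i), then any tree for each factor.
    return sum(count_ruletrees(i) * count_ruletrees(k - i) for i in range(1, k))

if __name__ == "__main__":
    for k in range(1, 11):
        print(k, count_ruletrees(k))
```

Under these assumptions the counts are the Catalan numbers, which already grow exponentially in k; adding more rules and more base cases only enlarges the space.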
DSP Transform -> Algorithm (Formula) -> Implementation
Formulas in SPL
    ( compose
      ( diagonal ( 2*cos(1/16*pi) 2*cos(3/16*pi) 2*cos(5/16*pi) 2*cos(7/16*pi) ) )
      ( permutation ( ) )
      ( tensor ( I 2 ) ( F 2 ) )
      ( permutation ( ) )
      ( direct_sum
        ( compose ( F 2 ) ( diagonal ( 1 sqrt(1/2) ) ) )
        ( compose
          ( matrix ( ) ( 0 (-1) 1 ) )
          ( diagonal ( cos(13/8*pi)-sin(13/8*pi) sin(13/8*pi) cos(13/8*pi)+sin(13/8*pi) ) )
          ( matrix ( 1 0 ) ( 1 1 ) ( 0 1 ) )
          ( permutation ( 2 1 ) )
SPL Syntax (Subset)
- matrix operations:
    (compose formula formula ...)
    (tensor formula formula ...)
    (direct_sum formula formula ...)
- direct matrix description:
    (matrix (a11 a12 ...) (a21 a22 ...) ...)
    (diagonal (d1 d2 ...))
    (permutation (p1 p2 ...))
- parameterized matrices:
    (I n)
    (F n)
- scalars: 1.5, 2/7, cos(..), w(3), pi, 1.2e-04
- definition of new symbols (allows extension of SPL):
    (define name formula)
    (template formula (i-code-list))
- directives for code generation:
    #codetype real/complex
    #unroll on/off    (controls loop unrolling)
SPL Compiler, 4-point FFT
Fast algorithm as formula, written as an SPL program:
    (compose (tensor (F 2) (I 2)) (T 4 2) (tensor (I 2) (F 2)) (L 4 2))
Generated code with #codetype complex:
    f0 = x(1) + x(3)
    f1 = x(1) - x(3)
    f2 = x(2) + x(4)
    f3 = x(2) - x(4)
    f4 = (0.00d0,-1.00d0)*f3
    y(1) = f0 + f2
    y(2) = f1 + f4
    y(3) = f0 - f2
    y(4) = f1 - f4
Generated code with #codetype real (real and imaginary parts interleaved):
    r0 = x(1) + x(5)
    r1 = x(1) - x(5)
    r2 = x(2) + x(6)
    r3 = x(2) - x(6)
    r4 = x(3) + x(7)
    r5 = x(3) - x(7)
    r6 = x(4) + x(8)
    r7 = x(4) - x(8)
    y(1) = r0 + r4
    y(2) = r2 + r6
    y(3) = r1 + r7
    y(4) = r3 - r5
    y(5) = r0 - r4
    y(6) = r2 - r6
    y(7) = r1 - r7
    y(8) = r3 + r5
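As a sanity check on what straight-line code for this formula must compute, the following sketch transliterates the two code shapes into Python and compares them with a DFT evaluated from the definition. The output ordering is the one dictated by the formula (F 2 ⊗ I 2)(T 4 2)(I 2 ⊗ F 2)(L 4 2). The interleaved real/imaginary layout of the real-valued version is an assumption inferred from the index pattern x(1)+x(5), and the function names are illustrative.

```python
import cmath

def fft4_complex(x):
    """4-point FFT on complex inputs, zero-based version of the complex code."""
    f0 = x[0] + x[2]
    f1 = x[0] - x[2]
    f2 = x[1] + x[3]
    f3 = x[1] - x[3]
    f4 = -1j * f3              # twiddle factor (0, -1)
    return [f0 + f2, f1 + f4, f0 - f2, f1 - f4]

def fft4_real(x):
    """Same transform on 8 reals: x[2k] = Re, x[2k+1] = Im (assumed layout)."""
    r0 = x[0] + x[4]
    r1 = x[0] - x[4]
    r2 = x[1] + x[5]
    r3 = x[1] - x[5]
    r4 = x[2] + x[6]
    r5 = x[2] - x[6]
    r6 = x[3] + x[7]
    r7 = x[3] - x[7]
    # Multiplying by -i maps (re, im) to (im, -re), hence the r7/r5 swap.
    return [r0 + r4, r2 + r6, r1 + r7, r3 - r5,
            r0 - r4, r2 - r6, r1 - r7, r3 + r5]

def dft_direct(x):
    """Reference O(n^2) DFT straight from the definition."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

if __name__ == "__main__":
    x = [1 + 2j, -3 + 0.5j, 0.25 - 1j, 2 + 2j]
    ref = dft_direct(x)
    print(all(abs(a - b) < 1e-12 for a, b in zip(fft4_complex(x), ref)))
    flat = [v for z in x for v in (z.real, z.imag)]
    out = fft4_real(flat)
    print(all(abs(complex(out[2 * k], out[2 * k + 1]) - ref[k]) < 1e-12
              for k in range(4)))
```

The real-valued version needs no multiplications at all, because the only non-trivial twiddle is -i.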
SPL Compiler: Summary
- Input: SPL program (SPL formula, template definition, symbol definition)
- Stages: Parsing; Intermediate Code Generation; Intermediate Code Restructuring; Optimization; Target Code Generation
- Intermediate data: Abstract Syntax Tree, Symbol Table, Template Table, I-Code
- Output: C, FORTRAN function
Built-in optimizations:
- static single assignment code
- no reuse of temporary vars
- only scalar temporary vars
- constants precomputed
- limited CSE
Extensible through templates
SIMD Short Vector Extensions
[Figure: element-wise + and x on short vectors; vector length = 4 (4-way)]
- Extension to instruction set architecture
- Available on most current architectures (SSE on Pentium, AltiVec on Motorola G4)
- Requires fine grain parallelism
- Large potential speed-up
Problems:
- SIMD instructions are architecture specific
- No common API (usually assembly hand coding)
- Performance very sensitive to memory access
- Automatic (compiler) vectorization very limited
Very difficult to use
Vector code generation from SPL formulas
- Naturally vectorizable construct: [formula omitted; labels: A, x, y, vector length]
- (Current) generic construct, completely vectorizable: [formula omitted]
    P_i, Q_i: permutations
    D_i, E_i: diagonals
    A_i: arbitrary formulas
    ν: SIMD vector length
Vectorization in two steps:
1. Formula manipulation using manipulation rules
2. Code generation (vector code + C code)
DSP Transform -> Algorithm (Formula) -> Implementation -> Search
Why Search?
[Plot: DCT, type IV, size 16; ~31000 formulas]
- many different formulas
- large spread in runtimes, even for modest size
- precisely equal arithmetic cost
- best formula is platform-dependent
Toy problem: scheduled
Search Methods available in SPIRAL
- Exhaustive Search
- Dynamic Programming (DP)
- Random Search
- Hill Climbing
- STEER (similar to a genetic algorithm)

    Method          Possible Sizes   Formulas Timed   Results
    Exhaust         Very small       All              Best
    DP              All              10s-100s         (very) good
    Random          All              User decided     fair/good
    Hill Climbing   All              100s-1000s       Good
    STEER           All              100s-1000s       (very) good

Search over algorithm space and implementation options (degree of unrolling)
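The dynamic programming entry in the table can be sketched as follows: the best ruletree for size 2^k is assembled from the best trees already found for smaller sizes, so only a few candidates are evaluated per size. Everything specific here is an assumption for illustration: the binary-split tree encoding, the `LEAF_MAX` limit on unrolled base cases, and above all `toy_cost`, which is a made-up stand-in for timing generated code on the target machine.

```python
LEAF_MAX = 3  # assumption: sizes up to 2**3 are available as unrolled base cases

def log_size(tree):
    """log2 of the transform size described by a ruletree."""
    if isinstance(tree, int):
        return tree
    return log_size(tree[0]) + log_size(tree[1])

def toy_cost(tree):
    """Hypothetical cost model standing in for a measured runtime."""
    if isinstance(tree, int):
        return float(tree * 2 ** tree)            # roughly n*log2(n) operations
    left, right = tree
    a, b = log_size(left), log_size(right)
    strided = 1.25 * 2 ** b * toy_cost(left)      # left factor runs at stride 2**b
    contiguous = 2 ** a * toy_cost(right)         # right factor runs on contiguous blocks
    twiddles = float(2 ** (a + b))
    return strided + contiguous + twiddles

def dp_search(k_max, measure=toy_cost):
    """Return {k: (best_tree, cost)} for sizes 2**1 .. 2**k_max."""
    best = {}
    for k in range(1, k_max + 1):
        candidates = [k] if k <= LEAF_MAX else []
        # Only combine the best known subtrees: this is the DP assumption.
        candidates += [(best[i][0], best[k - i][0]) for i in range(1, k)]
        best[k] = min(((t, measure(t)) for t in candidates), key=lambda p: p[1])
    return best

def all_trees(k):
    """Exhaustive enumeration, usable only for very small sizes."""
    trees = [k] if k <= LEAF_MAX else []
    for i in range(1, k):
        trees += [(l, r) for l in all_trees(i) for r in all_trees(k - i)]
    return trees

if __name__ == "__main__":
    best = dp_search(6)
    for k, (tree, cost) in sorted(best.items()):
        print(k, tree, cost)
```

With this additive toy model DP finds the same minimum as exhaustive search. With real timings the best subtree can depend on its context (stride, cache state), which is why the table rates DP as "(very) good" rather than "best".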
STEER
- Population n -> Population n+1 -> …
- Mutation: expand differently
- Cross-Breeding: swap expansions
- Survival of Fittest
Learning to Generate Fast Algorithms
- Learns from a given dataset (formulas + runtimes) how to design a fast algorithm (breakdown strategy)
- Learns from a transform of one size, generates the best algorithm for many sizes
- Tested for DFT and WHT
Experimental Results
Generated DFT Code: Pentium 4, SSE
[Plot: (pseudo) gflop/s versus n; DFT 2^n, single precision, Pentium 4, 2.53 GHz, using Intel C compiler 6.0; one curve labeled "hand-tuned vendor assembly code"]
- speedups (vector to C code) up to a factor of 3.1
* P. Rodriguez. A Radix-2 FFT Algorithm for Modern Single Instruction Multiple Data (SIMD) Architectures. Proc. ICASSP 2002
Generated DFT Code: Pentium 4, SSE2
[Plot: gflops versus n; DFT 2^n, double precision, Pentium 4, 2.53 GHz, using Intel C compiler 6.0]
- speedups (vector to C code) up to a factor of 1.8
Other transforms
[Plots: gflops versus transform size]
- 2-dim DCT 2^n x 2^n, Pentium 4, 2.53 GHz, SSE
- WHT 2^n, Pentium 4, 2.53 GHz, SSE
- speedups (vector to C code) up to a factor of 3
- WHT has only additions; very simple transform
Best DFT Trees, size 2^10 = 1024
[Figure: best trees for scalar C code and vector SIMD code on Pentium 4 float, Pentium 4 double, Pentium III float, AthlonXP float]
- trees are platform/datatype dependent
Crosstiming of best trees on Pentium 4
[Plot: relative performance w.r.t. best versus n; DFT 2^n, single precision, runtime of the best trees found on other platforms]
- software adaptation is necessary
- e.g., ~50% performance loss by using PIII code on P4
Conclusions
- Mathematical computer representation of algorithms
- Automatic translation of algorithms into code
SPIRAL closes the gap between math domain (algorithms) and implementation domain (programs)
- High level: Mathematical manipulation of algorithms
- Low level: Coding degrees of freedom
SPIRAL does automatic optimization by intelligent search/learning in the space of alternatives
References
Related Work
- R.C. Whaley and J. Dongarra. Automatically Tuned Linear Algebra Software (ATLAS). In Proc. Supercomputing. Math-atlas.sourceforge.net
- M. Frigo and S. G. Johnson. FFTW: An adaptive software architecture for the FFT. In Proc. ICASSP 1998, pp
- E.-J. Im and K. Yelick. Optimizing Sparse Matrix Computations for Register Reuse in SPARSITY. In Proc. ICCS 2001, pp
Further Reading on SPIRAL
- M. Püschel, B. Singer, J. Xiong, J. Moura, J. Johnson, D. Padua, M. Veloso, R. Johnson. SPIRAL: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms. To appear in Journal of High Performance Computing and Applications.
- J. Xiong, J. Johnson, R. Johnson, and D. Padua. SPL: A Language and Compiler for DSP Algorithms. In Proc. PLDI 2001, pp
- B. Singer and M. Veloso. Automating the Modeling and Optimization of the Performance of Signal Transforms. IEEE Trans. Signal Processing, 50(8), 2002, pp
- F. Franchetti and M. Püschel. A SIMD Vectorizing Compiler for Digital Signal Processing Algorithms. In Proc. IPDPS
- F. Franchetti and M. Püschel. Short Vector Code Generation for the Discrete Fourier Transform. To appear in Proc. IPDPS