Presentation is loading. Please wait.

Presentation is loading. Please wait.

Carnegie Mellon SPIRAL: An Overview José Moura (CMU) Jeremy Johnson (Drexel) Robert Johnson (MathStar) David Padua (UIUC) Viktor Prasanna (USC) Markus.

Similar presentations


Presentation on theme: "Carnegie Mellon SPIRAL: An Overview José Moura (CMU) Jeremy Johnson (Drexel) Robert Johnson (MathStar) David Padua (UIUC) Viktor Prasanna (USC) Markus."— Presentation transcript:

1 Carnegie Mellon SPIRAL: An Overview José Moura (CMU) Jeremy Johnson (Drexel) Robert Johnson (MathStar) David Padua (UIUC) Viktor Prasanna (USC) Markus Püschel (CMU) Manuela Veloso (CMU) Gavin Haentjens (CMU) Pinit Kumhom (Drexel) Neungsoo Park (USC) David Sepiashvili (CMU) Bryan Singer (CMU) Yevgen Voronenko (Drexel) Edward Wertz (CMU) Jianxin Xiong (UIUC) Faculty Students http://www.ece.cmu.edu/~spiral José M.F. Moura and Markus Püschel Collaborators Christoph Überhuber (TU Vienna) Franz Franchetti (TU Vienna)

2 Carnegie Mellon Sponsor Work supported by DARPA (DSO), Applied & Computational Mathematics Program, OPAL, through grant managed by research grant DABT63-98-1-0004 administered by the Army Directorate of Contracting.

3 Carnegie Mellon Moore’s Law and High(est) Performance Scientific Computing  arithmetic cost model (counting adds and mults) is not accurate for predicting runtime  best code is machine dependent (registers/caches size, structure)  hand-tuned code becomes obsolete as fast as it is written  compiler limitations Moore’s Law:  processor-memory bottleneck  short life cycles of computers  very complex architectures vendor specific special instructions (MMX, SSE, FMA, …) undocumented features (single processor, off-the-shelf) Consequences for software/algorithms: Portable performance requires automation

4 Carnegie Mellon SPIRAL Automates  cuts development costs  code less error-prone  takes advantage of architecture specific features  porting without loss of performance  systematic exploration of alternatives both at algorithmic and code level  are performance critical Implementation Platform-Adaptation Optimization of DSP algorithms A library generator for highly optimized signal processing algorithms

5 Carnegie Mellon SPIRAL system DSP transform specifies user goes for a coffee Formula Generator SPL Compiler Search Engine runtime on given platform controls implementation options controls algorithm generation fast algorithm as SPL formula C/Fortran/SIMD code S P I R A L (or an espresso for small transforms) platform-adapted implementation comes back

6 Carnegie Mellon Related Work on Code Generation/Adaptation  PhiPAC, ATLAS (Linear algebra)  Enumeration and evaluation of different blocking, looping, etc. strategies for BLAS routines  SPARSITY (sparse matrix-vector multiply)  Search for optimal blocking strategy to improve register performance  FFTW (discrete Fourier transform package)  Generated code modules (machine independent) for small sizes  Flexible recursion to adapt to memory hierarchy SPIRAL  Code generation and adaptation for an entire domain (linear transforms) of structurally complex algorithms  Adaptation to all architecture features (memory, cache, register, etc.) by automatic exploration of algorithm space

7 Carnegie Mellon DSP Transform Algorithm

8 Carnegie Mellon DSP Algorithms: Example 4-point DFT Cooley/Tukey FFT (size 4):  algorithms reduce arithmetic cost O(n^2)  O(nlog(n))  product of structured sparse matrices  mathematical notation exhibits structure Fourier transform Identity Permutation Diagonal matrix (twiddles) Kronecker product

9 Carnegie Mellon DSP Algorithms: Terminology (SPIRAL) Transform Rule Formula parameterized matrix a breakdown strategy product of sparse matrices recursive application of rules uniquely defines an algorithm efficient representation easy manipulation Ruletree few constructs and primitives uniquely defines an algorithm can be translated into code

10 Carnegie Mellon DSP Transforms Others: filters, discrete wavelet transforms, Haar, Hartley, … discrete Fourier transform Walsh-Hadamard transform discrete cosine and sine Transforms (16 types) modified discrete cosine transform two-dimensional transform

11 Carnegie Mellon Rules = Breakdown Strategies base case recursive translation iterative recursive iterative/ recursive built from few constructs and primitives

12 Carnegie Mellon Algorithms = Ruletrees = Formulas R1 R3 R6 R4 R3 R1 R6 R4 R1 R6

13 Carnegie Mellon Formula for a DCT, size 16

14 Carnegie Mellon Number of Formulas/Algorithms k123456789k123456789 # DFTs, size 2^k 1 6 40 296 27744 162570361280 ~1.01 10^27 ~2.31 10^61 ~2.86 10^133 # DCT IV, size 2^k 1 10 126 31242 1924443362 7343815121631354242 ~1.07 10^38 ~2.30 10^76 ~1.06 10^153 exponential search space Using the rules included in SPIRAL:

15 Carnegie Mellon Algorithm (Formula) Implementation DSP Transform

16 Carnegie Mellon Formulas in SPL ( compose ( diagonal ( 2*cos(1/16*pi) 2*cos(3/16*pi) 2*cos(5/16*pi) 2*cos(7/16*pi) ) ) ( permutation ( 1 3 4 2 ) ) ( tensor ( I 2 ) ( F 2 ) ) ( permutation ( 1 4 2 3 ) ) ( direct_sum ( compose ( F 2 ) ( diagonal ( 1 sqrt(1/2) ) ) ) ( compose ( matrix ( 1 1 0 ) ( 0 (-1) 1 ) ) ( diagonal ( cos(13/8*pi)-sin(13/8*pi) sin(13/8*pi) cos(13/8*pi)+sin(13/8*pi) ) ) ( matrix ( 1 0 ) ( 1 1 ) ( 0 1 ) ) ( permutation ( 2 1 ) )

17 Carnegie Mellon SPL Syntax (Subset)  matrix operations: (compose formula formula...) (tensor formula formula...) (direct_sum formula formula...)  direct matrix description: (matrix (a11 a12...) (a21 a22...)...) (diagonal (d1 d2...)) (permutation (p1 p2...))  parameterized matrices: (I n) (F n)  scalars: 1.5, 2/7, cos(..), w(3), pi, 1.2e-04  definition of new symbols: (define name formula) (template formula (i-code-list)  directives for code generation #codetype real/complex #unroll on/off allows extension of SPL controls loop unrolling

18 Carnegie Mellon SPL Compiler, 4-point FFT (compose (tensor (F 2) (I 2)) (T 4 2) (tensor (I 2) (F 2)) (L 4 2)) f0 = x(1) + x(3) f1 = x(1) - x(3) f2 = x(2) + x(4) f3 = x(2) - x(4) f4 = (0.00d0,-1.00d0)*f(3) y(1) = f0 + f2 y(2) = f0 - f2 y(3) = f1 + f4 y(4) = f1 - f4 r0 = x(1) + x(5) r1 = x(1) - x(5) r2 = x(2) + x(6) r3 = x(2) - x(6) r4 = x(3) + x(7) r5 = x(3) - x(7) r6 = x(4) + x(8) r7 = x(4) - x(8) y(1) = r0 + r4 y(2) = r1 + r5 y(3) = r0 - r4 y(4) = r1 - r5 y(5) = r2 + r7 y(6) = r3 - r6 y(7) = r2 - r7 y(8) = r3 + r6 fast algorithm as formula as SPL program #codetype complexreal

19 Carnegie Mellon SPL Compiler: Summary Parsing Intermediate Code Generation Intermediate Code Restructuring Target Code Generation Symbol Table Abstract Syntax Tree I-Code C, FORTRAN function Template Table SPL FormulaTemplate DefinitionSymbol Definition Optimization I-Code SPL Program Built-in optimizations:  single static assignment code  no reuse of temporary vars  only scalar temporary vars  constants precomputed  limited CSE Extensible through templates

20 Carnegie Mellon SIMD Short Vector Extensions + x vector length = 4 (4-way)  Extension to instruction set architecture  Available on most current architectures (SSE on Pentium, AltiVec on Motorola G4)  Requires fine grain parallelism  Large potential speed-up  SIMD instructions are architecture specific  No common API (usually assembly hand coding)  Performance very sensitive to memory access  Automatic (compiler) vectorization very limited Problems: very difficult to use

21 Carnegie Mellon Vector code generation from SPL formulas Naturally vectorizable construct A xy vector length P i, Q i permutations D i, E i diagonals A i arbitrary formulas νSIMD vector length (Current) generic construct completely vectorizable: Vectorization in two steps: 1.Formula manipulation using manipulation rules 2.Code generation (vector code + C code)

22 Carnegie Mellon Algorithm (Formula) Implementation DSP Transform Search

23 Carnegie Mellon Why Search? DCT, type IV, size 16  maaaany different formulas  large spread in runtimes, even for modest size  precisely equal arithmetic cost  best formula is platform-dependent ~31000 formulas Toy problem: scheduled

24 Carnegie Mellon Search Methods available in SPIRAL  Exhaustive Search  Dynamic Programming (DP)  Random Search  Hill Climbing  STEER (similar to a genetic algorithm) PossibleFormulas SizesTimedResults ExhaustVery smallAllBest DPAll10s-100s(very) good RandomAllUser decidedfair/good Hill ClimbingAll100s-1000sGood STEERAll100s-1000s(very) good Search over algorithm space and implementation options (degree of unrolling)

25 Carnegie Mellon STEER Population n: Population n+1: …… Mutation Cross-Breeding expand differently swap expansions Survival of Fittest

26 Carnegie Mellon Learning to Generate Fast Algorithms Learns from given dataset (formulas + runtimes) how to design a fast algorithm (breakdown strategy) Learns from a transform of one size, generates the best algorithm for many sizes Tested for DFT and WHT

27 Carnegie Mellon Experimental Results

28 Carnegie Mellon Generated DFT Code: Pentium 4, SSE (Pseudo) gflop/s DFT 2 n single precision, Pentium 4, 2.53 GHz, using Intel C compiler 6.0 n speedups (vector to C code) up to factor of 3.1 hand-tuned vendor assembly code * P. Rodriguez. A Radix-2 FFT Algorithm for Modern Single Instruction Multiple Data (SIMD) Architectures. Proc. ICASSP 2002 *

29 Carnegie Mellon Generated DFT Code: Pentium 4, SSE2 gflops DFT 2 n double precision, Pentium 4, 2.53 GHz, using Intel C compiler 6.0 n speedups (vector to C code) up to factor of 1.8

30 Carnegie Mellon Other transforms gflops transform size 2-dim DCT 2 n x 2 n Pentium 4, 2.53 GHz, SSE WHT 2 n Pentium 4, 2.53 GHz, SSE speedups (vector to C code) up to factor of 3  WHT has only additions  very simple transform

31 Carnegie Mellon Best DFT Trees, size = 1024 Best DFT Trees, size 2 10 = 1024 scalar C vect SIMD Pentium 4 float Pentium 4 double Pentium III float AthlonXP float 10 6 4 4 8 7 5 8 6 4 2 2 2 22 22 22 2 2 1 2 23 8 5 2 2 2 3 9 7 5 1 2 2 23 6 2 4 22 22 4 4 2 6 24 22 2 5 2 5 233 6 3 4 223 8 5 2 2 2 3 6 4 4 22 22 2 5 2 5 233 trees platform/datatype dependent

32 Carnegie Mellon Crosstiming of best trees on Pentium 4 Relative performance w.r.t. best DFT 2 n single precision, runtime of best found of other platforms n software adaptation is necessary e.g., ~50% performance loss by using PIII code on P4

33 Carnegie Mellon Conclusions  Mathematical computer representation of algorithms  Automatic translation of algorithms into code SPIRAL closes the gap between math domain (algorithms) and implementation domain (programs)  High level: Mathematical manipulation of algorithms  Low level: Coding degrees of freedom SPIRAL does automatic optimization by intelligent search/learning in the space of alternatives http://www.ece.cmu.edu/~spiral

34 Carnegie Mellon References Related Work R.C. Whaley and J. Dongarra. Automatically Tuned Linear Algebra Software (ATLAS). In Proc. Supercomputing 1998. Math-atlas.sourceforge.net M. Frigo and S.-G. Johnson. FFTW: An adaptive software architecture for the FFT. In Proc. ICASSP 1998, pp. 1381-1384. www.fftw.org E.-J. Im and K. Yelick. Optimizing Sparse Matrix Computations for Register Reuse in SPARSITY. In Proc. ICCS 2001, pp. 127-136. Further Reading on SPIRAL M. Püschel, B. Singer, J. Xiong, J. Moura, J. Johnson, D. Padua, M. Veloso, R. Johnson. SPIRAL: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms. To appear in Journal of High Performance Computing and Applications. J. Xiong, J. Johnson, R. Johnson, and D. Padua. SPL: A Language and Compiler for DSP Algorithms. In Proc. PLDI 2001, pp. 298-308. Bryan Singer and Manuela Veloso. Automating the Modeling and Optimization of the Performance of Signal Transforms. IEEE Trans. Signal Processing, 50(8), 2002, pp. 2003-2014. F. Franchetti and M. Püschel. A SIMD Vectorizing Compiler for Digital Signal Processing Algorithms. In Proc. IPDPS 2002. F. Franchetti and M. Püschel. Short Vector Code Generation for the Discrete Fourier Transform. To appear in Proc. IPDPS 2003. http://www.ece.cmu.edu/~spiral


Download ppt "Carnegie Mellon SPIRAL: An Overview José Moura (CMU) Jeremy Johnson (Drexel) Robert Johnson (MathStar) David Padua (UIUC) Viktor Prasanna (USC) Markus."

Similar presentations


Ads by Google