Carnegie Mellon SPIRAL: An Overview

Presentation transcript:

Carnegie Mellon
SPIRAL: An Overview
José M.F. Moura and Markus Püschel
Faculty: José Moura (CMU), Jeremy Johnson (Drexel), Robert Johnson (MathStar), David Padua (UIUC), Viktor Prasanna (USC), Markus Püschel (CMU), Manuela Veloso (CMU)
Students: Gavin Haentjens (CMU), Pinit Kumhom (Drexel), Neungsoo Park (USC), David Sepiashvili (CMU), Bryan Singer (CMU), Yevgen Voronenko (Drexel), Edward Wertz (CMU), Jianxin Xiong (UIUC)
Collaborators: Christoph Überhuber (TU Vienna), Franz Franchetti (TU Vienna)

Sponsor
Work supported by DARPA (DSO), Applied & Computational Mathematics Program, OPAL, through research grant DABT administered by the Army Directorate of Contracting.

Moore's Law and High(est) Performance Scientific Computing (single processor, off-the-shelf)
Moore's Law:
- processor-memory bottleneck
- short life cycles of computers
- very complex architectures: vendor specific, special instructions (MMX, SSE, FMA, …), undocumented features
Consequences for software/algorithms:
- arithmetic cost model (counting adds and mults) is not accurate for predicting runtime
- best code is machine dependent (registers/caches size, structure)
- hand-tuned code becomes obsolete as fast as it is written
- compiler limitations
Portable performance requires automation

SPIRAL Automates
A library generator for highly optimized signal processing algorithms
SPIRAL automates the implementation, platform-adaptation, and optimization of DSP algorithms, which are performance critical:
- cuts development costs
- code less error-prone
- takes advantage of architecture specific features
- porting without loss of performance
- systematic exploration of alternatives both at algorithmic and code level

SPIRAL system
- user specifies a DSP transform, then goes for a coffee (or an espresso for small transforms)
- Formula Generator: produces a fast algorithm as SPL formula
- SPL Compiler: translates the formula into C/Fortran/SIMD code
- Search Engine: uses the runtime on the given platform as feedback; controls algorithm generation and implementation options
- a platform-adapted implementation comes back

Related Work on Code Generation/Adaptation
- PhiPAC, ATLAS (linear algebra): enumeration and evaluation of different blocking, looping, etc. strategies for BLAS routines
- SPARSITY (sparse matrix-vector multiply): search for optimal blocking strategy to improve register performance
- FFTW (discrete Fourier transform package): generated code modules (machine independent) for small sizes; flexible recursion to adapt to memory hierarchy
SPIRAL:
- Code generation and adaptation for an entire domain (linear transforms) of structurally complex algorithms
- Adaptation to all architecture features (memory, cache, register, etc.) by automatic exploration of algorithm space

DSP Transform Algorithm

DSP Algorithms: Example 4-point DFT
Cooley/Tukey FFT (size 4): $\mathrm{DFT}_4 = (\mathrm{DFT}_2 \otimes I_2)\, T^4_2\, (I_2 \otimes \mathrm{DFT}_2)\, L^4_2$
Legend: $\mathrm{DFT}_2$ Fourier transform; $I_2$ identity; $L^4_2$ permutation; $T^4_2$ diagonal matrix (twiddles); $\otimes$ Kronecker product
- algorithms reduce arithmetic cost from O(n^2) to O(n log(n))
- product of structured sparse matrices
- mathematical notation exhibits structure
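The size-4 factorization can be checked numerically. The sketch below is only an illustration, not SPIRAL code. It assumes the formula DFT_4 = (F_2 ⊗ I_2) T^4_2 (I_2 ⊗ F_2) L^4_2, which is the form of the SPL program shown on the compiler slide, and the convention w = e^{-2πi/4} = -i. The helper names (`kron`, `matmul`, `dft`, `cooley_tukey_4`, `max_error`) are invented for this example.

```python
import cmath

def kron(a, b):
    """Kronecker (tensor) product of two dense matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]

def matmul(a, b):
    """Plain dense matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def dft(n):
    """DFT matrix by definition: entry (j, k) is w^(j*k), w = exp(-2*pi*i/n)."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (j * k) for k in range(n)] for j in range(n)]

I2 = [[1, 0], [0, 1]]                       # identity
F2 = [[1, 1], [1, -1]]                      # 2-point Fourier transform (butterfly)
T42 = [[1, 0, 0, 0], [0, 1, 0, 0],
       [0, 0, 1, 0], [0, 0, 0, -1j]]        # diagonal twiddle matrix, w = -i
L42 = [[1, 0, 0, 0], [0, 0, 1, 0],
       [0, 1, 0, 0], [0, 0, 0, 1]]          # stride permutation (x0, x2, x1, x3)

def cooley_tukey_4():
    """Multiply out the four sparse factors, left to right."""
    m = kron(F2, I2)
    for factor in (T42, kron(I2, F2), L42):
        m = matmul(m, factor)
    return m

def max_error():
    """Largest entry-wise difference between the definition and the factorization."""
    ref, fast = dft(4), cooley_tukey_4()
    return max(abs(ref[i][j] - fast[i][j]) for i in range(4) for j in range(4))

if __name__ == "__main__":
    print("max |DFT_4 - factorization| =", max_error())
```

The dense products are only for checking. The point of the factorization is that each factor is sparse and structured, so a real implementation never forms these matrices.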

DSP Algorithms: Terminology (SPIRAL)
- Transform: parameterized matrix
- Rule: a breakdown strategy; product of sparse matrices
- Ruletree: recursive application of rules; uniquely defines an algorithm; efficient representation; easy manipulation
- Formula: few constructs and primitives; uniquely defines an algorithm; can be translated into code

DSP Transforms
- discrete Fourier transform
- Walsh-Hadamard transform
- discrete cosine and sine transforms (16 types)
- modified discrete cosine transform
- two-dimensional transform
Others: filters, discrete wavelet transforms, Haar, Hartley, …

Rules = Breakdown Strategies
Labels on the example rules: base case, recursive, translation, iterative, recursive, iterative/recursive
Rules are built from few constructs and primitives

Algorithms = Ruletrees = Formulas
[Figure: example ruletrees whose nodes are labeled with the rules R1, R3, R4, R6]

Formula for a DCT, size 16

Number of Formulas/Algorithms
Using the rules included in SPIRAL (counts by k):
# DFTs, size 2^k: ~ ^27, ~ ^61, ~ ^133
# DCT IV, size 2^k: ~ ^38, ~ ^76, ~ ^153
exponential search space
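To see why the space grows so quickly, it helps to count ruletrees for a deliberately simplified setting. The sketch below assumes a single hypothetical rule that splits size 2^k into 2^i and 2^(k-i) for any 1 <= i < k, with size 2 as the only base case. SPIRAL's actual rule set is larger, so its counts are much higher than these. The function name `count_ruletrees` is invented here.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_ruletrees(k):
    """Number of ruletrees for a size-2^k transform under ONE assumed rule:
    split 2^k into 2^i and 2^(k-i) for any 1 <= i < k; size 2 is the base case."""
    if k == 1:
        return 1
    # Every split point i combines any left subtree with any right subtree.
    return sum(count_ruletrees(i) * count_ruletrees(k - i) for i in range(1, k))

if __name__ == "__main__":
    for k in range(1, 11):
        print(k, count_ruletrees(k))
```

Even with one rule the counts follow the Catalan numbers, which grow exponentially in k. Adding more rules and more base cases only enlarges the space.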

Algorithm (Formula) Implementation DSP Transform

Formulas in SPL
( compose
  ( diagonal ( 2*cos(1/16*pi) 2*cos(3/16*pi) 2*cos(5/16*pi) 2*cos(7/16*pi) ) )
  ( permutation ( ) )
  ( tensor ( I 2 ) ( F 2 ) )
  ( permutation ( ) )
  ( direct_sum
    ( compose ( F 2 ) ( diagonal ( 1 sqrt(1/2) ) ) )
    ( compose
      ( matrix ( ) ( 0 (-1) 1 ) )
      ( diagonal ( cos(13/8*pi)-sin(13/8*pi) sin(13/8*pi) cos(13/8*pi)+sin(13/8*pi) ) )
      ( matrix ( 1 0 ) ( 1 1 ) ( 0 1 ) )
      ( permutation ( 2 1 ) ) ) ) )

SPL Syntax (Subset)
- matrix operations:
    (compose formula formula ...)
    (tensor formula formula ...)
    (direct_sum formula formula ...)
- direct matrix description:
    (matrix (a11 a12 ...) (a21 a22 ...) ...)
    (diagonal (d1 d2 ...))
    (permutation (p1 p2 ...))
- parameterized matrices:
    (I n)
    (F n)
- scalars: 1.5, 2/7, cos(..), w(3), pi, 1.2e-04
- definition of new symbols (allows extension of SPL):
    (define name formula)
    (template formula (i-code-list))
- directives for code generation:
    #codetype real/complex
    #unroll on/off (controls loop unrolling)
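The matrix meaning of these constructs can be made concrete with a toy interpreter. The sketch below is not the SPL compiler, which generates code rather than dense matrices. It assumes only a small subset of the syntax: `compose`, `tensor`, `direct_sum`, `diagonal` with plain decimal literals, `(I n)`, and `(F 2)`. Scalar functions such as `cos(..)`, fractions such as `2/7`, `matrix`, `permutation`, `define`, and `template` are left out. All function names are invented for this example.

```python
import re

def parse(text):
    """Parse an S-expression into nested lists of string tokens."""
    tokens = re.findall(r"[()]|[^\s()]+", text)

    def read(pos):
        if tokens[pos] == "(":
            items, pos = [], pos + 1
            while tokens[pos] != ")":
                item, pos = read(pos)
                items.append(item)
            return items, pos + 1
        return tokens[pos], pos + 1

    tree, _ = read(0)
    return tree

def matmul(a, b):
    """compose: ordinary matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def kron(a, b):
    """tensor: Kronecker product."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]

def direct_sum(a, b):
    """direct_sum: block-diagonal matrix diag(a, b)."""
    return ([row + [0] * len(b[0]) for row in a] +
            [[0] * len(a[0]) + row for row in b])

BINARY = {"compose": matmul, "tensor": kron, "direct_sum": direct_sum}

def evaluate(node):
    """Turn a parsed formula into a dense matrix (list of rows)."""
    head, args = node[0], node[1:]
    if head == "I":
        n = int(args[0])
        return [[int(i == j) for j in range(n)] for i in range(n)]
    if head == "F" and args[0] == "2":
        return [[1, 1], [1, -1]]
    if head == "diagonal":
        vals = [float(v) for v in args[0]]
        return [[vals[i] if i == j else 0 for j in range(len(vals))]
                for i in range(len(vals))]
    if head in BINARY:
        result = evaluate(args[0])
        for arg in args[1:]:          # n-ary operators fold left to right
            result = BINARY[head](result, evaluate(arg))
        return result
    raise ValueError(f"unsupported SPL construct: {head}")

if __name__ == "__main__":
    wht4 = "(compose (tensor (F 2) (I 2)) (tensor (I 2) (F 2)))"
    for row in evaluate(parse(wht4)):
        print(row)
```

The example formula multiplies out to F_2 ⊗ F_2, the 4-point Walsh-Hadamard transform. It shows how a short formula stands for a product of structured sparse matrices.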

SPL Compiler, 4-point FFT
fast algorithm as formula, as SPL program:
    (compose (tensor (F 2) (I 2)) (T 4 2) (tensor (I 2) (F 2)) (L 4 2))
Generated code with #codetype complex:
    f0 = x(1) + x(3)
    f1 = x(1) - x(3)
    f2 = x(2) + x(4)
    f3 = x(2) - x(4)
    f4 = (0.00d0,-1.00d0)*f3
    y(1) = f0 + f2
    y(2) = f1 + f4
    y(3) = f0 - f2
    y(4) = f1 - f4
Generated code with #codetype real (x and y hold interleaved real and imaginary parts):
    r0 = x(1) + x(5)
    r1 = x(1) - x(5)
    r2 = x(2) + x(6)
    r3 = x(2) - x(6)
    r4 = x(3) + x(7)
    r5 = x(3) - x(7)
    r6 = x(4) + x(8)
    r7 = x(4) - x(8)
    y(1) = r0 + r4
    y(2) = r2 + r6
    y(3) = r1 + r7
    y(4) = r3 - r5
    y(5) = r0 - r4
    y(6) = r2 - r6
    y(7) = r1 - r7
    y(8) = r3 + r5
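A Python transliteration makes it easy to confirm that straight-line code of this shape computes the 4-point DFT. The sketch below follows the formula (F_2 ⊗ I_2) T^4_2 (I_2 ⊗ F_2) L^4_2 with 0-based indices and outputs in natural order. It assumes the convention w = -i. The names `fft4` and `dft_by_definition` are invented here and are not compiler output.

```python
import cmath

def fft4(x):
    """Straight-line 4-point DFT: 8 complex additions and one multiplication by -i."""
    f0 = x[0] + x[2]          # (I_2 tensor F_2) applied after the stride permutation
    f1 = x[0] - x[2]
    f2 = x[1] + x[3]
    f3 = x[1] - x[3]
    f4 = -1j * f3             # the only non-trivial twiddle factor, w = -i
    return [f0 + f2,          # (F_2 tensor I_2): final butterflies
            f1 + f4,
            f0 - f2,
            f1 - f4]

def dft_by_definition(x):
    """Reference O(n^2) DFT: y_j = sum_k x_k * exp(-2*pi*i*j*k/n)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

if __name__ == "__main__":
    x = [1, 2j, -3, 4 + 1j]
    err = max(abs(a - b) for a, b in zip(fft4(x), dft_by_definition(x)))
    print("max error:", err)
```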

SPL Compiler: Summary
Input: SPL Program (SPL Formula, Template Definition, Symbol Definition)
Stages:
1. Parsing: builds the Abstract Syntax Tree, Symbol Table, and Template Table
2. Intermediate Code Generation: produces I-Code
3. Intermediate Code Restructuring: I-Code to I-Code
4. Optimization: I-Code to I-Code
5. Target Code Generation: emits a C or FORTRAN function
Built-in optimizations:
- static single assignment code
- no reuse of temporary vars
- only scalar temporary vars
- constants precomputed
- limited CSE
Extensible through templates

SIMD Short Vector Extensions
[Figure: element-wise + and x on short vectors; vector length = 4 (4-way)]
- Extension to instruction set architecture
- Available on most current architectures (SSE on Pentium, AltiVec on Motorola G4)
- Requires fine grain parallelism
- Large potential speed-up
Problems:
- SIMD instructions are architecture specific
- No common API (usually assembly hand coding)
- Performance very sensitive to memory access
- Automatic (compiler) vectorization very limited
very difficult to use

Vector code generation from SPL formulas
- Naturally vectorizable construct (figure labels: A, x, y, vector length)
- (Current) generic construct, completely vectorizable, with:
    P_i, Q_i: permutations
    D_i, E_i: diagonals
    A_i: arbitrary formulas
    ν: SIMD vector length
Vectorization in two steps:
1. Formula manipulation using manipulation rules
2. Code generation (vector code + C code)
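The formulas on this slide did not survive extraction, so the following is an assumption rather than a reconstruction. A construct of the form A ⊗ I_ν, with A an arbitrary matrix and ν the SIMD vector length, is a standard example of a naturally vectorizable formula, because every scalar operation in A becomes one operation on a whole vector of ν elements. The sketch below emulates vector registers with Python lists. `NU`, `vadd`, `vscale`, and `apply_tensor_identity` are hypothetical names.

```python
NU = 4  # assumed SIMD vector length (4-way, as for single-precision SSE)

def vadd(u, v):
    """Emulated vector add: one SIMD instruction on real hardware."""
    return [a + b for a, b in zip(u, v)]

def vscale(c, u):
    """Emulated vector-times-scalar multiply."""
    return [c * a for a in u]

def apply_tensor_identity(a, x, nu=NU):
    """Compute (A tensor I_nu) x using only whole-vector operations."""
    # Cut x into contiguous blocks of nu elements: aligned vector loads.
    blocks = [x[i:i + nu] for i in range(0, len(x), nu)]
    out = []
    for row in a:
        acc = [0] * nu
        for coeff, block in zip(row, blocks):
            acc = vadd(acc, vscale(coeff, block))
        out.extend(acc)   # aligned vector store
    return out

def kron(a, b):
    """Kronecker product, used only to check the result."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]

def apply_dense(m, x):
    """Reference: dense matrix-vector product."""
    return [sum(c * v for c, v in zip(row, x)) for row in m]

if __name__ == "__main__":
    F2 = [[1, 1], [1, -1]]
    I4 = [[int(i == j) for j in range(4)] for i in range(4)]
    x = list(range(8))
    print(apply_tensor_identity(F2, x))
    print(apply_dense(kron(F2, I4), x))
```

No element ever moves within a vector here. That is why such constructs map directly to SIMD instructions, while permutations and general diagonals need the formula manipulation named in step 1.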

Algorithm (Formula) Implementation DSP Transform Search

Why Search?
DCT, type IV, size 16 (toy problem): ~31000 formulas
- maaaany different formulas
- large spread in runtimes, even for modest size
- precisely equal arithmetic cost
- best formula is platform-dependent

Search Methods available in SPIRAL
- Exhaustive Search
- Dynamic Programming (DP)
- Random Search
- Hill Climbing
- STEER (similar to a genetic algorithm)
Method          Possible sizes   Formulas timed   Results
Exhaust         Very small       All              Best
DP              All              10s-100s         (very) good
Random          All              User decided     fair/good
Hill Climbing   All              100s-1000s       Good
STEER           All              100s-1000s       (very) good
Search over algorithm space and implementation options (degree of unrolling)
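The dynamic programming entry can be sketched as follows. This is a minimal illustration under stated assumptions, not SPIRAL's search engine. A ruletree is either an int k (an unrolled base case of size 2^k) or a pair (left, right). The `measure` argument stands in for timing the generated code on the target platform. `toy_cost` is a made-up cost function used only so the example runs.

```python
def log_size(tree):
    """log2 of the transform size a ruletree computes."""
    return tree if isinstance(tree, int) else log_size(tree[0]) + log_size(tree[1])

def toy_cost(tree):
    """Made-up cost, NOT a real machine model: a stand-in for a runtime measurement."""
    if isinstance(tree, int):
        return tree * 2 ** tree                  # assumed cost of an unrolled base case
    left, right = tree
    m, n = 2 ** log_size(left), 2 ** log_size(right)
    overhead = m * n                             # assumed cost of twiddles and data shuffling
    return n * toy_cost(left) + m * toy_cost(right) + overhead

def dp_search(k_max, measure, max_leaf=3):
    """Dynamic programming over ruletrees for sizes 2^1 .. 2^k_max.

    For each size, only the best tree already found for each smaller size is
    reused, so few candidates are timed per size. The result is not guaranteed
    optimal, because the best subtree in isolation need not be best in context.
    """
    best = {}      # k -> (tree, cost)
    timed = 0      # number of candidates measured
    for k in range(1, k_max + 1):
        candidates = [k] if (k == 1 or k <= max_leaf) else []
        candidates += [(best[i][0], best[k - i][0]) for i in range(1, k)]
        scored = [(measure(tree), tree) for tree in candidates]
        timed += len(scored)
        cost, tree = min(scored, key=lambda pair: pair[0])
        best[k] = (tree, cost)
    return best, timed

if __name__ == "__main__":
    best, timed = dp_search(10, toy_cost)
    print("best tree for 2^10:", best[10][0])
    print("candidates timed:", timed)
```

With these settings the search measures a few dozen candidates in total, which fits the "10s-100s" order of magnitude listed for DP in the table.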

STEER
Population n evolves into Population n+1 through:
- Mutation: expand differently
- Cross-Breeding: swap expansions
- Survival of Fittest
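The three operators named on the slide can be sketched as a small evolutionary loop over ruletrees. The slide gives only the labels, so everything below is an assumption: the tree encoding (an int k for a base case of size 2^k, or a pair of subtrees), the operator details, the population sizes, and the made-up `toy_cost` that stands in for measured runtime. The name `steer_like` signals that this is not the STEER implementation.

```python
import random

def log_size(tree):
    return tree if isinstance(tree, int) else log_size(tree[0]) + log_size(tree[1])

def toy_cost(tree):
    """Made-up cost function standing in for a runtime measurement."""
    if isinstance(tree, int):
        return tree * 2 ** tree
    m, n = 2 ** log_size(tree[0]), 2 ** log_size(tree[1])
    return n * toy_cost(tree[0]) + m * toy_cost(tree[1]) + m * n

def random_tree(k, rng, max_leaf=3):
    """Random ruletree for size 2^k; sizes up to 2^max_leaf may stay unexpanded."""
    if k == 1 or (k <= max_leaf and rng.random() < 0.5):
        return k
    i = rng.randint(1, k - 1)
    return (random_tree(i, rng, max_leaf), random_tree(k - i, rng, max_leaf))

def subtrees(tree, path=()):
    """Yield (path, subtree) for every node; a path is a tuple of 0/1 choices."""
    yield path, tree
    if not isinstance(tree, int):
        yield from subtrees(tree[0], path + (0,))
        yield from subtrees(tree[1], path + (1,))

def replace(tree, path, new):
    """Return a copy of tree with the node at path replaced by new."""
    if not path:
        return new
    left, right = tree
    if path[0] == 0:
        return (replace(left, path[1:], new), right)
    return (left, replace(right, path[1:], new))

def mutate(tree, rng):
    """Mutation: pick a node and expand it differently."""
    path, node = rng.choice(list(subtrees(tree)))
    return replace(tree, path, random_tree(log_size(node), rng))

def crossbreed(a, b, rng):
    """Cross-breeding: swap in a same-size expansion from another individual."""
    path, node = rng.choice(list(subtrees(a)))
    donors = [t for _, t in subtrees(b) if log_size(t) == log_size(node)]
    return replace(a, path, rng.choice(donors)) if donors else a

def steer_like(k, measure, generations=20, pop_size=16, seed=0):
    rng = random.Random(seed)
    pop = [random_tree(k, rng) for _ in range(pop_size)]
    for _ in range(generations):
        children = [mutate(t, rng) for t in pop]
        children += [crossbreed(a, rng.choice(pop), rng) for a in pop]
        # Survival of the fittest: keep the cheapest distinct trees.
        pop = sorted(set(pop + children), key=measure)[:pop_size]
    return pop[0]

if __name__ == "__main__":
    tree = steer_like(8, toy_cost)
    print(tree, toy_cost(tree))
```

Parents compete with their children for survival, so the best cost in the population never gets worse from one generation to the next.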

Learning to Generate Fast Algorithms
- Learns from given dataset (formulas + runtimes) how to design a fast algorithm (breakdown strategy)
- Learns from a transform of one size, generates the best algorithm for many sizes
- Tested for DFT and WHT

Experimental Results

Generated DFT Code: Pentium 4, SSE
[Plot: (pseudo) gflop/s vs. n for DFT 2^n, single precision, Pentium 4, 2.53 GHz, using Intel C compiler 6.0; plot labels: hand-tuned vendor assembly code, *]
speedups (vector to C code) up to factor of 3.1
* P. Rodriguez. A Radix-2 FFT Algorithm for Modern Single Instruction Multiple Data (SIMD) Architectures. Proc. ICASSP 2002

Generated DFT Code: Pentium 4, SSE2
[Plot: gflops vs. n for DFT 2^n, double precision, Pentium 4, 2.53 GHz, using Intel C compiler 6.0]
speedups (vector to C code) up to factor of 1.8

Other transforms
[Plots: gflops vs. transform size for 2-dim DCT 2^n x 2^n and for WHT 2^n; Pentium 4, 2.53 GHz, SSE]
speedups (vector to C code) up to factor of 3
- WHT has only additions
- very simple transform

Best DFT Trees, size 2^10 = 1024
[Figure: best trees for scalar C and vector SIMD code on Pentium 4 float, Pentium 4 double, Pentium III float, AthlonXP float]
trees are platform/datatype dependent

Crosstiming of best trees on Pentium 4
[Plot: relative performance w.r.t. best vs. n; DFT 2^n, single precision, runtime of best found of other platforms]
software adaptation is necessary, e.g., ~50% performance loss by using PIII code on P4

Conclusions
SPIRAL closes the gap between math domain (algorithms) and implementation domain (programs):
- Mathematical computer representation of algorithms
- Automatic translation of algorithms into code
SPIRAL does automatic optimization by intelligent search/learning in the space of alternatives:
- High level: Mathematical manipulation of algorithms
- Low level: Coding degrees of freedom

References
Related Work
- R.C. Whaley and J. Dongarra. Automatically Tuned Linear Algebra Software (ATLAS). In Proc. Supercomputing. Math-atlas.sourceforge.net
- M. Frigo and S.-G. Johnson. FFTW: An adaptive software architecture for the FFT. In Proc. ICASSP 1998, pp.
- E.-J. Im and K. Yelick. Optimizing Sparse Matrix Computations for Register Reuse in SPARSITY. In Proc. ICCS 2001, pp.
Further Reading on SPIRAL
- M. Püschel, B. Singer, J. Xiong, J. Moura, J. Johnson, D. Padua, M. Veloso, R. Johnson. SPIRAL: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms. To appear in Journal of High Performance Computing and Applications.
- J. Xiong, J. Johnson, R. Johnson, and D. Padua. SPL: A Language and Compiler for DSP Algorithms. In Proc. PLDI 2001, pp.
- Bryan Singer and Manuela Veloso. Automating the Modeling and Optimization of the Performance of Signal Transforms. IEEE Trans. Signal Processing, 50(8), 2002, pp.
- F. Franchetti and M. Püschel. A SIMD Vectorizing Compiler for Digital Signal Processing Algorithms. In Proc. IPDPS.
- F. Franchetti and M. Püschel. Short Vector Code Generation for the Discrete Fourier Transform. To appear in Proc. IPDPS.