SPIRAL: Current Status José Moura (CMU) Jeremy Johnson (Drexel) Robert Johnson (MathStar) David Padua (UIUC) Viktor Prasanna (USC) Markus Püschel (CMU)

Slides:



Advertisements
Similar presentations
Unified Parallel C at LBNL/UCB Implementing a Global Address Space Language on the Cray X1 Christian Bell and Wei Chen.
Advertisements

Carnegie Mellon Automatic Generation of Vectorized Fast Fourier Transform Libraries for the Larrabee and AVX Instruction Set Extension Automatic Generation.
MP3 Optimization Exploiting Processor Architecture and Using Better Algorithms Mancia Anguita Universidad de Granada J. Manuel Martinez – Lechado Vitelcom.
Software & Services Group, Developer Products Division Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property.
Offline Adaptation Using Automatically Generated Heuristics Frédéric de Mesmay, Yevgen Voronenko, and Markus Püschel Department of Electrical and Computer.
Vector Processing. Vector Processors Combine vector operands (inputs) element by element to produce an output vector. Typical array-oriented operations.
Carnegie Mellon Lessons From Building Spiral The C Of My Dreams Franz Franchetti Carnegie Mellon University Lessons From Building Spiral The C Of My Dreams.
Compilation Techniques for Multimedia Processors Andreas Krall and Sylvain Lelait Technische Universitat Wien.
Java for High Performance Computing Jordi Garcia Almiñana 14 de Octubre de 1998 de la era post-internet.
Chapter 15 Digital Signal Processing
CUDA Programming Lei Zhou, Yafeng Yin, Yanzhi Ren, Hong Man, Yingying Chen.
Chapter 2: Impact of Machine Architectures What is the Relationship Between Programs, Programming Languages, and Computers.
Carnegie Mellon Adaptive Mapping of Linear DSP Algorithms to Fixed-Point Arithmetic Lawrence J. Chang Inpyo Hong Yevgen Voronenko Markus Püschel Department.
Automatic Generation of Customized Discrete Fourier Transform IPs Grace Nordin, Peter A. Milder, James C. Hoe, Markus Püschel Carnegie Mellon University.
Topic ? Course Overview. Guidelines Questions are rated by stars –One Star Question  Easy. Small definition, examples or generic formulas –Two Stars.
Kathy Grimes. Signals Electrical Mechanical Acoustic Most real-world signals are Analog – they vary continuously over time Many Limitations with Analog.
1b.1 Types of Parallel Computers Two principal approaches: Shared memory multiprocessor Distributed memory multicomputer ITCS 4/5145 Parallel Programming,
October 26, 2006 Parallel Image Processing Programming and Architecture IST PhD Lunch Seminar Wouter Caarls Quantitative Imaging Group.
Lecture 29 Fall 2006 Lecture 29: Parallel Programming Overview.
Development in hardware – Why? Option: array of custom processing nodes Step 1: analyze the application and extract the component tasks Step 2: design.
A Prototypical Self-Optimizing Package for Parallel Implementation of Fast Signal Transforms Kang Chen and Jeremy Johnson Department of Mathematics and.
Carnegie Mellon SPIRAL: An Overview José Moura (CMU) Jeremy Johnson (Drexel) Robert Johnson (MathStar) David Padua (UIUC) Viktor Prasanna (USC) Markus.
SPL: A Language and Compiler for DSP Algorithms Jianxin Xiong 1, Jeremy Johnson 2 Robert Johnson 3, David Padua 1 1 Computer Science, University of Illinois.
An Introduction Chapter Chapter 1 Introduction2 Computer Systems  Programmable machines  Hardware + Software (program) HardwareProgram.
Short Vector SIMD Code Generation for DSP Algorithms
College of Nanoscale Science and Engineering A uniform algebraically-based approach to computational physics and efficient programming James E. Raynolds.
High Performance Linear Transform Program Generation for the Cell BE
Multimedia Macros for Portable Optimized Programs Juan Carlos Rojas Miriam Leeser Northeastern University Boston, MA.
Advanced Computer Architecture, CSE 520 Generating FPGA-Accelerated DFT Libraries Chi-Li Yu Nov. 13, 2007.
Automatic Performance Tuning Jeremy Johnson Dept. of Computer Science Drexel University.
FFT: Accelerator Project Rohit Prakash Anand Silodia.
Carnegie Mellon Generating High-Performance General Size Linear Transform Libraries Using Spiral Yevgen Voronenko Franz Franchetti Frédéric de Mesmay Markus.
Spiral: an empirical search system for program generation and optimization David Padua Department of Computer Science University of Illinois at Urbana-
Generative Programming. Automated Assembly Lines.
Performance Analysis Of Generics In Scientific Computing Laurentiu Dragan Stephen M. Watt Ontario Research Centre for Computer Algebra University of Western.
HPC User Forum Back End Compiler Panel SiCortex Perspective Kevin Harris Compiler Manager April 2009.
Distributed WHT Algorithms Kang Chen Jeremy Johnson Computer Science Drexel University Franz Franchetti Electrical and Computer Engineering.
Detector Simulation on Modern Processors Vectorization of Physics Models Philippe Canal, Soon Yung Jun (FNAL) John Apostolakis, Mihaly Novak, Sandro Wenzel.
NIH Resource for Biomolecular Modeling and Bioinformatics Beckman Institute, UIUC NAMD Development Goals L.V. (Sanjay) Kale Professor.
R. Arce-Nazario, M. Jimenez, and D. Rodriguez Electrical and Computer Engineering University of Puerto Rico – Mayagüez.
Mar. 1, 2001Parallel Processing1 Parallel Processing (CS 730) Lecture 9: Distributed Memory FFTs * Jeremy R. Johnson Wed. Mar. 1, 2001 *Parts of this lecture.
Compressed Abstract Syntax Trees as Mobile Code Christian H. Stork Vivek Haldar University of California, Irvine.
2007/11/2 First French-Japanese PAAP Workshop 1 The FFTE Library and the HPC Challenge (HPCC) Benchmark Suite Daisuke Takahashi Center for Computational.
Carnegie Mellon High-Performance Code Generation for FIR Filters and the Discrete Wavelet Transform Using SPIRAL Aca Gačić Markus Püschel José M. F. Moura.
1. 2 Define the purpose of MKL Upon completion of this module, you will be able to: Identify and discuss MKL contents Describe the MKL EnvironmentDiscuss.
October 11, 2007 © 2007 IBM Corporation Multidimensional Blocking in UPC Christopher Barton, Călin Caşcaval, George Almási, Rahul Garg, José Nelson Amaral,
Manno, , © by Supercomputing Systems 1 1 COSMO - Dynamical Core Rewrite Approach, Rewrite and Status Tobias Gysi POMPA Workshop, Manno,
Using Cache Models and Empirical Search in Automatic Tuning of Applications Apan Qasem Ken Kennedy John Mellor-Crummey Rice University Houston, TX Apan.
Linear Algebra Libraries: BLAS, LAPACK, ScaLAPACK, PLASMA, MAGMA
Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign.
Introduction Why are virtual machines interesting?
Compilers as Collaborators and Competitors of High-Level Specification Systems David Padua University of Illinois at Urbana-Champaign.
Carnegie Mellon Program Generation with Spiral: Beyond Transforms This work was supported by DARPA DESA program, NSF-NGS/ITR, NSF-ACR, Mercury Inc., and.
Hy-C A Compiler Retargetable for Single-Chip Heterogeneous Multiprocessors Philip Sweany 8/27/2010.
1 MOTIVATION AND OBJECTIVE  Discrete Signal Transforms (DSTs) –DFT, DCT: major performance component in many applications –Hardware accelerated but at.
1 VSIPL++: Parallel Performance HPEC 2004 CodeSourcery, LLC September 30, 2004.
Performance Analysis of Divide and Conquer Algorithms for the WHT Jeremy Johnson Mihai Furis, Pawel Hitczenko, Hung-Jen Huang Dept. of Computer Science.
Onlinedeeneislam.blogspot.com1 Design and Analysis of Algorithms Slide # 1 Download From
Linear Algebra Libraries: BLAS, LAPACK, ScaLAPACK, PLASMA, MAGMA Shirley Moore CPS5401 Fall 2013 svmoore.pbworks.com November 12, 2012.
Custom Reduction of Arithmetic in Linear DSP Transforms S. Misra, A. Zelinski, J. C. Hoe, and M. Püschel Dept. of Electrical and Computer Engineering Carnegie.
In Search of the Optimal WHT Algorithm J. R. Johnson Drexel University Markus Püschel CMU
Fang Fang James C. Hoe Markus Püschel Smarahara Misra
Definition CASE tools are software systems that are intended to provide automated support for routine activities in the software process such as editing.
Vector Processing => Multimedia
INTRODUCTION TO BASIC MATLAB
Automatic Performance Tuning
High Performance Computing (CS 540)
OSU Quantum Information Seminar
<PETE> Shape Programmability
VSIPL++: Parallel Performance HPEC 2004
Presentation transcript:

SPIRAL: Current Status José Moura (CMU) Jeremy Johnson (Drexel) Robert Johnson (MathStar) David Padua (UIUC) Viktor Prasanna (USC) Markus Püschel (CMU) Manuela Veloso (CMU) Gavin Haentjens (CMU) Pinit Kumhom (Drexel) Neungsoo Park (USC) David Sepiashvili (CMU) Bryan Singer (CMU) Yevgen Voronenko (Drexel) Edward Wertz (CMU) Jianxin Xiong (UIUC) Faculty Students Markus Püschel Collaborators Christoph Überhuber (TU Vienna) Franz Franchetti (TU Vienna)

Sponsor Work supported by DARPA (DSO), Applied & Computational Mathematics Program, OPAL, through grant managed by research grant DABT administered by the Army Directorate of Contracting.

Moore’s Law and High(est) Performance Scientific Computing  arithmetic cost model not accurate for predicting runtime  better performance models hard to get  best code is machine dependent (registers/caches size, structure)  hand-tuned code becomes obsolete as fast as it is written  compiler limitations  full performance requires (in part) assembly coding Moore’s Law:  processor-memory bottleneck  short life cycles of computers  very complex architectures vendor specific special instructions (MMX, SSE, FMA, …) undocumented features (single processor, off-the-shelf) Consequences for software/algorithms: Portable performance requires automation

SPIRAL Automates  cuts development costs  code less error-prone  takes advantage of architecture specific features  porting without loss of performance  systematic exploration of alternatives both at algorithmic and code level  are performance critical Implementation Platform-Adaptation Optimization of DSP algorithms A library generator for highly optimized signal processing algorithms

SPIRAL system DSP transform specifies user goes for a coffee Formula Generator SPL Compiler Search Engine runtime on given platform controls implementation options controls algorithm generation fast algorithm as SPL formula C/Fortran/SIMD code S P I R A L (or an espresso for small transforms) platform-adapted implementation comes back Mathematician Expert Programmer

DSP Transform Algorithm

DSP Algorithms: Example 4-point DFT Cooley/Tukey FFT (size 4):  product of structured sparse matrices  mathematical notation Fourier transform Identity Permutation Diagonal matrix (twiddles) Kronecker product

DSP Algorithms: Terminology Transform Rule Formula parameterized matrix a breakdown strategy product of sparse matrices recursive application of rules uniquely defines an algorithm efficient representation easy manipulation Ruletree few constructs and primitives uniquely defines an algorithm can be translated into code

DSP Transforms Others: filters, discrete wavelet transforms, Haar, Hartley, … discrete Fourier transform Walsh-Hadamard transform discrete cosine and sine Transforms (16 types) modified discrete cosine transform two-dimensional transform

Rules = Breakdown Strategies base case recursive translation iterative recursive translation iterative/ recursive

Formula for a DCT, size 16

Algorithm (Formula) Implementation DSP Transform

Formulas in SPL ( compose ( diagonal ( 2*cos(1/16*pi) 2*cos(3/16*pi) 2*cos(5/16*pi) 2*cos(7/16*pi) ) ) ( permutation ( ) ) ( tensor ( I 2 ) ( F 2 ) ) ( permutation ( ) ) ( direct_sum ( compose ( F 2 ) ( diagonal ( 1 sqrt(1/2) ) ) ) ( compose ( matrix ( ) ( 0 (-1) 1 ) ) ( diagonal ( cos(13/8*pi)-sin(13/8*pi) sin(13/8*pi) cos(13/8*pi)+sin(13/8*pi) ) ) ( matrix ( 1 0 ) ( 1 1 ) ( 0 1 ) ) ( permutation ( 2 1 ) )

SPL Syntax (Subset)  matrix operations: (compose formula formula...) (tensor formula formula...) (direct_sum formula formula...)  direct matrix description: (matrix (a11 a12...) (a21 a22...)...) (diagonal (d1 d2...)) (permutation (p1 p2...))  parameterized matrices: (I n) (F n)  scalars: 1.5, 2/7, cos(..), w(3), pi, 1.2e-04  definition of new symbols: (define name formula) (template formula (i-code-list)  directives for code generation #codetype real/complex #unroll on/off allows extension of SPL controls loop unrolling

SPL Compiler, 4-point FFT (compose (tensor (F 2) (I 2)) (T 4 2) (tensor (I 2) (F 2)) (L 4 2)) f0 = x(1) + x(3) f1 = x(1) - x(3) f2 = x(2) + x(4) f3 = x(2) - x(4) f4 = (0.00d0,-1.00d0)*f(3) y(1) = f0 + f2 y(2) = f0 - f2 y(3) = f1 + f4 y(4) = f1 - f4 r0 = x(1) + x(5) r1 = x(1) - x(5) r2 = x(2) + x(6) r3 = x(2) - x(6) r4 = x(3) + x(7) r5 = x(3) - x(7) r6 = x(4) + x(8) r7 = x(4) - x(8) y(1) = r0 + r4 y(2) = r1 + r5 y(3) = r0 - r4 y(4) = r1 - r5 y(5) = r2 + r7 y(6) = r3 - r6 y(7) = r2 - r7 y(8) = r3 + r6 fast algorithm as formula as SPL program #codetype complexreal

SPL Compiler: Summary Parsing Intermediate Code Generation Intermediate Code Restructuring Target Code Generation Symbol Table Abstract Syntax Tree I-Code C, FORTRAN function Template Table SPL FormulaTemplate DefinitionSymbol Definition Optimization I-Code SPL Program Built-in optimizations:  single static assignment code  no reuse of temporary vars  only scalar temporary vars  constants precomputed  limited CSE Extensible through templates

Algorithm (Formula) Implementation DSP Transform Search

Why Search? DCT, type IV, size 16 maaaany different formulas large spread in runtimes, even for modest size not due to arithmetic cost best formula is platform-dependent ~31000 formulas

Search Methods available in SPIRAL  Exhaustive Search  Dynamic Programming (DP)  Random Search  Hill Climbing  STEER (similar to a genetic algorithm) PossibleFormulas SizesTimedResults ExhaustVery smallAllBest DPAll10s-100s(very) good RandomAllUser decidedfair/good Hill ClimbingAll100s-1000sGood STEERAll100s-1000s(very) good Search over algorithm space and implementation options (degree of unrolling)

STEER Population n: Population n+1: …… Mutation Cross-Breeding expand differently swap expansions Survival of Fittest

Experimental Results (C code) high performance code (compared with FFTW) different transforms search methods (applicable to all transforms) generated high quality code

SPIRAL System  Available for download (v3.1):  Easy installation (Unix: configure/make; Windows: install shield)  Unix/Linux and Windows 98/ME/NT/2000/XP  Current transforms: DFT, DHT, WHT, RHT, DCT/DST type I – IV, MDCT, Filters, Wavelets, Toeplitz, Circulants  Extensible  New version (4.0) in preparation

Recent Work

Learning to Generate Fast Algorithms Learns from given dataset (formulas+runtimes) how to design a fast algorithm (breakdown strategy) Learns from a transform of one size, generates the best algorithm for many sizes Tested for DFT and WHT

SIMD Short Vector Extensions + x vector length = 4 (4-way)  Extension to instruction set architecture  Available on most current architectures (SSE on Pentium, AltiVec on Motorola G4)  Originally for multimedia (like MMX for integers)  Requires fine grain parallelism  Large potential speed-up  SIMD instructions are architecture specific  No common API (usually assembly hand coding)  Performance very sensitive to memory access  Automatic vectorization very limited Problems: very difficult to use

Vector code generation from SPL formulas Naturally vectorizable construct A xy vector length P i, Q i permutations D i, E i diagonals A i arbitrary formulas νSIMD vector length (Current) generic construct completely vectorizable: Vectorization in two steps: 1.Formula manipulation using manipulation rules 2.Code generation (vector code + C code)

Generated Vector Code DFTs: Pentium 4 gflops DFT 2^n, Pentium 4, 2.53 GHz, using Intel C compiler 6.0 n  speedups (to C code) up to factor of 3.3  beats hand-tuned vendor library

Generated Vector Code, Other Transforms normalized runtime transform size 2-dim DCT WHT speedups up to factor of 2.5

Flexible Loop Interleaving (Runtime Gain WHT) Alpha 21264: up to 70% UltraSparc III: up to 60% Athlon XP: up to 55% Pentium 4: up to 45%

Parallel Code Generation: Example WHT WHT size log(N) 1 thread 8 threads 10 threads PowerPC RS64 III Parallelized constructs: I n  A, A  I n

Code Scheduling Runtime histograms DFT, size 16 ~6500 formulas DCT4, size 16 ~16000 formulas unscheduled scheduled

Filters and Wavelets New constructs: row/column overlapped tensor product A A A A A A A A Examples for rules:

Conclusions  Automatic code generation for the entire domain of (linear) DSP algorithms  Portable high performance across platforms and across time  Integration of math (high) level and implementation (low) level  Intelligence through search and learning in the space of alternatives

Future Plans  Transforms: Radon, Gabor, Hankel, structured matrices, …  Target platforms: parallel platforms, DSP processors, SW/HW architectures, FPGAs, ASICs  Instructions: Vector, FMAs, prefetching, OpenMP, MPI  Beyond transforms: entire DSP applications  Other domains amenable to SPIRAL approach?

Questions?

Extra Slides

Generating Parallel Programs  Interpret constructs such as I n  A as parallel operations and transform formulas to obtain maximal parallelism.  Explore alternative data access patterns mathematically (e.g. different permutations in matrix factorizations)  Prototype implementation using WHT Build on existing sequential package SMP implementation using OpenMP ( IPDPS’02) 90% efficiency obtained on 12 processor PowerPC RS64 III Distributed memory implementation using MPI ( POHLL’02)

Comparison of Parallel DDL Schemes WHT size log(N) Speedup Best sequential with DDL Parallel without DDL Coarse-grained DDL Fine-grained DDL with ID shift PowerPC RS64 III 10 threads

Overall Parallel Speedup WHT size log(N) Speedup 1 thread 8 threads 10 threads PowerPC RS64 III

Performance of Digit Permutations on CRAY T3E

Parameters: P i, 1  i  n, no. of processor Parameters: Dimension, and T i, 1  i  n, no. of processor 2 m AG CU Interconnection Network CUAG M CUAG M CUAG M CUAG M Host I/O Interface... M = Memory AG = Address Generator CU = Computation Unit Parameters: Dimension, P d I/O Interface Architecture Framework

Pease Algorithm Dataflow

Optimal Dataflow Stage 3Stage 1Stage 0Stage 2

Pease v.s. Optimal

Performance: Speedup Speedup = S/C C = no. of. clock cycles using 4-processor FFT engine S = minimum no. of clock cycles using single processor (4*2 n *n, 2 n is the number of points)

SPIRAL Approach DSP Transform (DFT, DCT, Wavelets etc.) Computing Platform given Possible Implementations Performance Evaluation Intelligent Search Possible Algorithms SPIRAL Search Space (Pentium III, Pentium 4, Athlon, SUN, PowerPC, Alpha, … ) adapted implementation

Classical Code Generation System DSP Transform (DFT, DCT, Wavelets etc.) Computing Platform given Expert Programmer Performance Evaluation Math/Algorithm Expert (Pentium III, Pentium 4, Athlon, SUN, PowerPC, Alpha, … ) adapted implementation

Algorithms = Ruletrees = Formulas R1 R3 R6 R4 R3 R1 R6 R4 R1 R6

Mathematical Framework: Summary  fast algorithms represented as ruletrees (easy generation/manipulation) and as formulas (can be translated into code)  formulas built from few constructs and primitives  many different algorithms/formulas generated from few rules (combinatorial explosion)  these algorithms are (essentially) equal in arithmetic cost, but differ in data flow

Formula Generation recursive application Formula Generator data base (extensible!) data type rules control transforms formulasruletrees translation search engine formula translation (spl compiler) export runtime  written in GAP/AREP (computer algebra system)  all computation/manipulation is symbolic  exact arithmetic  easy extensible rule and transform data base  verification of rules and formulas cut here for other optimization problems

Number of Formulas/Algorithms k k # DFT, size 2^k ~ ^27 ~ ^61 ~ ^133 # DCT-IV, size 2^k ~ ^38 ~ ^76 ~ ^153  differ in data flow not in arithmetic cost  exponential search space

Extensibility of SPIRAL New constructs and primitives (potentially required by radically different transforms) are readily included in SPL New transforms are readily included on the high level New instructions sets available (e.g., SSE) are included by extending the SPL compiler (easy, due to SPIRAL’s framework) (moderate effort, due to template mechanism) (doable one time effort)

Generated Vector Code DFTs: Pentium III gflops DFT 2^n, Pentium III, 1 GHz, using Intel C compiler 6.0 n speedups (to C code) up to factor of 2.5