Carnegie Mellon High-Performance Code Generation for FIR Filters and the Discrete Wavelet Transform Using SPIRAL Aca Gačić Markus Püschel José M. F. Moura.

Presentation transcript:

High-Performance Code Generation for FIR Filters and the Discrete Wavelet Transform Using SPIRAL
Aca Gačić, Markus Püschel, José M. F. Moura
SPIRAL project, Electrical and Computer Engineering Department, Carnegie Mellon University

We want to close this gap
- Automatic performance tuning: ATLAS, PHiPAC, SPARSITY, FFTW, SPIRAL
  - No filtering and wavelet kernels!
- Existing FIR filtering and DWT software: Matlab, Mathematica, S+Wavelets, WaveLab, Wave++, IPP
  - Automatic platform-adapted solutions not available!
SPIRAL project: Electrical and Computer Engineering Department, Carnegie Mellon University
Support: DARPA-DSO & NSF

SPIRAL system: automatic generation of fast implementations for DSP
- The user specifies a DSP transform, then goes for a coffee (or an espresso for small transforms)
- Formula Generator: applies breakdown rules and produces a fast algorithm as an SPL formula
- SPL Compiler: translates the formula into C/Fortran/SIMD code
- Search Engine: uses the runtime on the given platform as feedback; controls algorithm generation and implementation options
- A platform-adapted implementation comes back

FIR Filters and Wavelets in SPIRAL
- Transforms: FIR Filter Transform, Discrete Wavelet Transform
- Capture implementations in redesigned SPIRAL framework
- Extremely large number of algorithms even for modest size transforms
- Automatically generated code includes implementation options
- Optimized through intelligent search
- Arithmetic cost not a reliable predictor of performance
- Best implementation not obvious for a human programmer
- Best implementation is platform dependent

Motivation
- FIR filters: image enhancement, equalization, speech synthesis, biomedical signal processing, …
- Wavelets: image compression, detection, denoising, signal recovery
- Numerically intensive: significant impact on application performance
High-Performance Implementations (How-To)
- Choose a fast algorithm
  - Arithmetic cost is only a rough estimate of performance
  - Many algorithms with similar cost: which one?
- Design efficient code
  - Architecture conscious, machine dependent
  - Obsolete when the platform is changed/upgraded
- Best implementation: Algorithm + Machine + Compiler
  - Requires experts in algorithms and computer architecture
  - Frequent re-implementation
- Better way: automatic performance tuning

We want to close this gap
- Automatic performance tuning packages: ATLAS, PHiPAC, SPARSITY, FFTW, SPIRAL
  - Include several basic linear algebra and DSP functions
  - No filtering and wavelet kernels!
- Existing FIR filtering and DWT software: Matlab, Mathematica, S+Wavelets, WaveLab, Wave++, IPP
  - Application oriented, not machine specific, or
  - Hand-tuned code for a specific platform
  - Automatic platform-adapted solutions not available!

SPIRAL system: generator of optimized DSP transform implementations
- The user specifies a DSP transform, then goes for a coffee (or an espresso for small transforms)
- Formula Generator: produces a fast algorithm as an SPL formula
- SPL Compiler: translates the formula into C/Fortran/SIMD code
- Search Engine: uses the runtime on the given platform as feedback; controls algorithm generation and implementation options
- A platform-adapted implementation comes back

SPIRAL's Four Key Concepts
- Transform: parameterized matrix
- Rule: a breakdown strategy, written as a product of sparse matrices; captures important structural information
- Formula: few constructs and primitives; uniquely defines an algorithm; can be translated into code
- Rule Tree: recursive application of rules; uniquely defines an algorithm; efficient representation; easy manipulation
[Example formula with labeled factors: diagonal matrix (twiddles), Kronecker product, identity, permutation]
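To make the idea of a rule as a product of sparse matrices concrete, the sketch below numerically checks one well-known breakdown rule, the Cooley-Tukey factorization DFT_mn = (DFT_m ⊗ I_n) · T · (I_m ⊗ DFT_n) · L. It uses the constructs labeled on the slide (Kronecker product, identity, diagonal twiddle matrix, permutation). The slide's own formula did not survive extraction, so the choice of rule, the helper names, and the use of NumPy are assumptions made for illustration; this is not SPIRAL code.

```python
import numpy as np

def dft(k):
    """Dense k x k DFT matrix."""
    idx = np.arange(k)
    return np.exp(-2j * np.pi * np.outer(idx, idx) / k)

def stride_perm(m, n):
    """Permutation matrix L: gathers the input at stride m into m blocks of length n."""
    perm = np.arange(m * n).reshape(n, m).T.ravel()
    return np.eye(m * n)[perm]

def twiddles(m, n):
    """Diagonal matrix T with entries w**(i*k), w = exp(-2*pi*j/(m*n)), at index i*n + k."""
    exps = np.outer(np.arange(m), np.arange(n)).ravel()
    return np.diag(np.exp(-2j * np.pi * exps / (m * n)))

def cooley_tukey(m, n):
    """Right-hand side of the rule, built from four sparse factors."""
    return (np.kron(dft(m), np.eye(n)) @ twiddles(m, n)
            @ np.kron(np.eye(m), dft(n)) @ stride_perm(m, n))

m, n = 2, 4
print(np.allclose(cooley_tukey(m, n), dft(m * n)))
```

Applying the same rule again to the smaller DFT factors is what the slide calls recursive application of rules; each distinct sequence of choices corresponds to a different rule tree.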

Rule Trees = Formulas = Algorithms
[Diagram labels: Transform, Rule, Rule Trees, Formulas, Algorithms; axis: increasing level of abstraction]

Examples: Rules and Rule Trees
- Rules shown: Trigonometric Rule, DCT-Recursive Rule, Type-Split Rule, Type-Conversion Rule, Recursive RHT Rule
- Number of rule trees grows exponentially with
  - Size of the transform
  - Number of applicable rules
- SPIRAL also includes search over implementation options, such as the degree of loop unrolling
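The growth in the number of rule trees can be illustrated with a deliberately simplified model. The sketch below assumes a single hypothetical rule that splits a size-2^k transform into sizes 2^i and 2^(k-i), with size 2 as the only base case. Real SPIRAL transforms have several applicable rules, so the true counts are larger; the function name and the model are assumptions, not taken from the slides.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_rule_trees(k):
    """Number of binary rule trees for a size-2**k transform in the toy model."""
    if k == 1:
        return 1  # size 2 is treated as an unsplittable base case
    # Split 2**k = 2**i * 2**(k - i) and expand both children independently.
    return sum(count_rule_trees(i) * count_rule_trees(k - i) for i in range(1, k))

for k in (2, 4, 8, 16):
    print(k, count_rule_trees(k))
```

Even with one rule, the count follows the Catalan numbers and grows exponentially in k, which is why exhaustive enumeration quickly becomes impractical and a guided search is needed.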

FIR Filters and the DWT in SPIRAL: Challenges
- Existing filtering and wavelet algorithm representations do not capture all important structural information
- SPIRAL uses a concise math framework to represent many algorithms for several transforms (DFT, DCTs, DHT, RDFT, …)
- Filtering and wavelet algorithms have considerably different structure from the current SPIRAL transforms
  - Current framework not enough to capture the algorithms' structure
- Existing algorithm representations need to be "spiralized", i.e., captured concisely using the rule formalism
- Need changes on several levels:
  - New transforms, constructs, rules, translation templates
  - Modified search
- Need to rephrase several concepts of the framework
  - Existing framework has to be redesigned to accommodate new constructions and decomposition strategies
- New framework has to be integrated in SPIRAL

FIR Filters: Rules and Algorithms
- FIR Filter Transform: standard "sum" notation vs. SPIRAL transform notation
- New constructs: column and row overlapped direct sum, overlapped tensor product
- A SPIRAL rule captures the structure in concise rule notation
Time Domain Methods
- Baseline Rule
- Overlap-Add Rule
  - Localized access of the input data
  - Input is blocked into length b segments
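The overlap-add idea (block the input into length b segments, filter each segment, and add the overlapping output tails) can be sketched in a few lines. This is a minimal NumPy illustration under stated assumptions: it computes the full linear convolution, and the function names are hypothetical. The slides' FIR Filter Transform and its matrix notation did not survive extraction, so this is not the SPIRAL formulation.

```python
import numpy as np

def fir_direct(x, h):
    """Direct form: y[i] = sum_k h[k] * x[i - k] (full linear convolution)."""
    y = np.zeros(len(x) + len(h) - 1)
    for k, hk in enumerate(h):
        y[k:k + len(x)] += hk * x
    return y

def fir_overlap_add(x, h, b):
    """Filter x in input blocks of length b and add the overlapping output tails."""
    y = np.zeros(len(x) + len(h) - 1)
    for start in range(0, len(x), b):
        block = x[start:start + b]  # the last block may be shorter than b
        y[start:start + len(block) + len(h) - 1] += fir_direct(block, h)
    return y

rng = np.random.default_rng(0)
x, h = rng.standard_normal(50), rng.standard_normal(6)
print(np.allclose(fir_overlap_add(x, h, b=8), np.convolve(x, h)))
```

Every block size b yields the same result with the same number of multiplications and additions; only the order of memory accesses changes, which is the degree of freedom the search can exploit.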

Time Domain Methods (continued)
- Overlap-Save Rule
  - Localized storage of the output
  - Output segments are computed independently
- Blocking Rule
  - Toeplitz matrix
  - Multilevel blocking
  - Control over input & output locality
Frequency Domain Methods
- Expanded Circulant Rule
- Cyclic-RDFT Rule: circulant matrix, real DFT
- Radix-2 Cooley-Tukey RDFT Rule: sparse matrix
Trade-off
- Time domain: better data access structure
- Frequency domain: potentially reduced arithmetic cost
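The frequency-domain route rests on the fact that a circulant matrix is diagonalized by the DFT, so a circulant matrix-vector product becomes a pointwise product between real DFTs. The sketch below checks this with NumPy; the function names are hypothetical and the code only illustrates the underlying identity, not the slide's Cyclic-RDFT Rule as implemented in SPIRAL.

```python
import numpy as np

def circulant(c):
    """Dense circulant matrix whose first column is c."""
    n = len(c)
    return np.array([[c[(i - j) % n] for j in range(n)] for i in range(n)])

def circulant_apply_rdft(c, x):
    """Multiply by circulant(c) using real DFTs: pointwise product in frequency."""
    return np.fft.irfft(np.fft.rfft(c) * np.fft.rfft(x), n=len(x))

rng = np.random.default_rng(1)
c, x = rng.standard_normal(8), rng.standard_normal(8)
print(np.allclose(circulant(c) @ x, circulant_apply_rdft(c, x)))
```

The dense product costs on the order of n^2 operations while the transform route costs on the order of n log n, which is the potential arithmetic saving; whether it pays off in runtime depends on the data access pattern.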

DWT: Rules and Algorithms
- Filter bank (Mallat's algorithm): DWT of {x_n}
- Mallat Rule
  - Single-level DWT
  - Extension matrix
- Fused Mallat Rule
  - Redundant computations removed
  - Filter structure lost
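The difference between filtering followed by downsampling and a fused computation can be sketched for a single-level DWT. The example below assumes the Daubechies D4 filters and a periodic extension; the function names and the indexing convention are illustrative choices, not the slide's Mallat and Fused Mallat rules themselves.

```python
import numpy as np

SQ3 = np.sqrt(3.0)
H = np.array([1 + SQ3, 3 + SQ3, 3 - SQ3, 1 - SQ3]) / (4 * np.sqrt(2.0))  # D4 low-pass
G = np.array([H[3], -H[2], H[1], -H[0]])                                  # matching high-pass

def periodic_filter(x, f):
    """y[i] = sum_k f[k] * x[(i + k) mod n] for every i (periodic extension)."""
    return sum(fk * np.roll(x, -k) for k, fk in enumerate(f))

def dwt_filter_then_downsample(x):
    """Filter at full rate, then discard every other output."""
    return periodic_filter(x, H)[::2], periodic_filter(x, G)[::2]

def dwt_fused(x):
    """Compute only the even-indexed outputs, skipping the discarded half."""
    n = len(x)
    idx = (2 * np.arange(n // 2)[:, None] + np.arange(len(H))[None, :]) % n
    windows = x[idx]  # row i holds x[2i], ..., x[2i + 3], wrapped periodically
    return windows @ H, windows @ G

x = np.random.default_rng(2).standard_normal(16)
a1, d1 = dwt_filter_then_downsample(x)
a2, d2 = dwt_fused(x)
print(np.allclose(a1, a2), np.allclose(d1, d2))
```

The fused version does half the arithmetic, but it no longer contains a plain filtering step that could be handed to FIR filter rules, which matches the slide's remark that the filter structure is lost.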

DWT: Polyphase Representation and Lifting Scheme
- Polyphase Representation
  - Polyphase Rule
  - Gateway to frequency domain computation
- Lifting Scheme
  - Lifting steps: primal (predict) and dual (update)
  - Lifting Rule
  - Asymptotically reduces cost by 50%
  - In-place computations
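A lifting step overwrites one polyphase component using the other, which is why the transform can run in place and be inverted by replaying the steps backwards. The sketch below uses the unnormalized Haar wavelet because it needs only one predict and one update step; the slides use D4 and D30, so this choice and the function names are assumptions made for brevity.

```python
import numpy as np

def haar_lift_forward(x):
    """In-place lifting: odd samples become details, even samples become averages."""
    x[1::2] -= x[0::2]      # predict: detail = odd - even
    x[0::2] += x[1::2] / 2  # update: average = even + detail / 2
    return x

def haar_lift_inverse(x):
    """Undo the lifting steps in reverse order with opposite signs."""
    x[0::2] -= x[1::2] / 2
    x[1::2] += x[0::2]
    return x

signal = np.array([4.0, 6.0, 10.0, 2.0])
coeffs = haar_lift_forward(signal.copy())
print(coeffs)                     # averages at even positions, details at odd positions
print(haar_lift_inverse(coeffs))  # recovers the original samples
```

No temporary buffer is allocated in either direction, and perfect reconstruction holds by construction regardless of the predict and update coefficients.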

Examples: FIR Filter and DWT Rule Trees
- All these rule trees form an extensive search space of alternative algorithms in SPIRAL
- Exceedingly large number of trees even for modest size transforms
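Searching such a space by measured runtime can be illustrated with a toy example that has a single tunable parameter. The sketch below times a blocked FIR filter for a few candidate block sizes and keeps the fastest one on the current machine. The function names, candidate list, and timing method are assumptions; SPIRAL's actual search strategies over rule trees are more elaborate.

```python
import timeit
import numpy as np

def fir_blocked(x, h, b):
    """Overlap-add FIR filter with input block length b."""
    y = np.zeros(len(x) + len(h) - 1)
    for start in range(0, len(x), b):
        block = x[start:start + b]
        y[start:start + len(block) + len(h) - 1] += np.convolve(block, h)
    return y

def search_block_size(x, h, candidates, repeats=3):
    """Time every candidate on this machine and return (best_b, timings)."""
    timings = {}
    for b in candidates:
        runs = timeit.repeat(lambda: fir_blocked(x, h, b), number=1, repeat=repeats)
        timings[b] = min(runs)  # the minimum is the least noisy estimate
    return min(timings, key=timings.get), timings

rng = np.random.default_rng(0)
x, h = rng.standard_normal(4096), rng.standard_normal(16)
candidates = [16, 64, 256, 1024]
best_b, timings = search_block_size(x, h, candidates)
print("fastest block size on this machine:", best_b)
```

All candidates compute the same output, so the choice is purely a performance decision, and the winner can differ from one machine to another.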

Time-Domain Methods (FIR Filter)
- Comparison of different time-domain methods: overlap-add, overlap-save, and blocking
- All methods have the same arithmetic cost
- Baseline method = overlap-add with b = 1
- 60-70% improvement over baseline at the same arithmetic cost
- Sanity check: percentage of the peak performance (flops / peak performance)
[Plots for Pentium 4 and SUN; annotations: unrolled loops, 1-level blocking, block size 2]

Frequency Domain vs. Time Domain
- Frequency domain: potentially lower cost
- Time domain: better data access patterns
- "Half-dense" circulants emerge in frequency domain methods
- FD methods are better on large "fat" circulants (filter lengths > 64); for shorter filters the circulant is too sparse
- Considerable discrepancy between arithmetic costs and actual runtimes
[Plots: FIR Filter and Circulant; annotations: best time-domain, arithmetic cost cross-over, runtime cross-over]

Daubechies D4 DWT (Short Wavelet)
- Single-level DWT using D4
- 3 different classes of rule trees based on the initial rule:
  - Fused Mallat Rule (direct method)
  - Polyphase Rule (frequency domain)
  - Lifting Rule (lifting scheme)
- Frequency domain trees not found
- Lifting scheme: 30% lower cost, yet slower
[Plots for Xeon and Athlon: comparison of runtimes; direct method flops / peak performance for D4 and D30; annotation: cache size reached]

Daubechies D30 DWT Experiment (Long Wavelet)
- Different methods are found for different size ranges
- Best implementations vary from machine to machine
[Plots for Xeon and Athlon; annotations: cache size reached, extension overhead, lifting better, temp cache]

Summary
- Fast automatically generated code for FIR filters and DWT
  - Significant contribution: no existing solutions
- Preliminary results
  - Arithmetic cost not a reliable predictor of performance
  - Best implementation not obvious for a human programmer
  - Best implementations vary over different computer platforms
What lies ahead?
- Further algorithmic and implementation optimizations
  - Fusion of the extension, exploiting in-place structure, scheduling, …
- Entire class of wavelets and extension methods
  - Orthogonal, biorthogonal, frames, …
  - Periodic, symmetric, polynomial extension, zero-padding
- Extend to other wavelet transforms
  - 2D-DWT, M-band, wavelet packet transform
- Utilize special instruction set extensions
  - Single Instruction Multiple Data (SIMD)
  - Fused Multiply-Add (FMA)