Download presentation
Presentation is loading. Please wait.
Published byWinfred Shields Modified over 9 years ago
1
Carnegie Mellon High-Performance Code Generation for FIR Filters and the Discrete Wavelet Transform Using SPIRAL Aca Gačić Markus Püschel José M. F. Moura SPIRAL project http://www.spiral.net Electrical and Computer Engineering Department Carnegie Mellon University
2
Carnegie Mellon We want to close this gap Automatic performance tuning ATLAS, PHiPAC, SPARSITY, FFTW, SPIRAL No filtering and wavelet kernels ! Existing FIR filtering and DWT software Matlab ©, Mathematica, S+Wavelets, WaveLab, Wave++, IPP Automatic platform-adapted solutions not available ! High-Performance Code Generation for FIR Filters and the Discrete Wavelet Transform Using SPIRAL Aca Gačić, Markus Püschel, José M. F. Moura SPIRAL project: http://www.spiral.net Electrical and Computer Engineering Department Carnegie Mellon University Support: DARPA-DSO & NSF
3
Carnegie Mellon DSP transform specifies user S P I R A L Search Engine controls implementation options controls algorithm generation C/Fortran/SIMD code Formula Generator SPL Compiler fast algorithm as SPL formula goes for a coffee (or an espresso for small transforms) SPIRAL system Automatic generation of fast implementations for DSP runtime on given platform platform-adapted implementation comes back breakdown rules
4
Carnegie Mellon Arithmetic cost not a reliable predictor of performance Best implementation not obvious for a human programmer Best implementation is platform dependent Extremely large number of algorithms even for modest size transforms Automatically generated code includes implementation options Optimized through intelligent search Capture implementations in redesigned SPIRAL framework FIR Filters and Wavelets in SPIRAL Discrete Wavelet Transform FIR Filter Transform
5
Carnegie Mellon Motivation Fast Algorithm Choose Fast Algorithm Arithmetic cost – only a rough estimate of performance Many algorithms with similar cost – which one? Efficient Code Design Efficient Code Architecture conscious – machine dependent Obsolete when platform is changed/upgraded Best implementation Best implementation: Algorithm + Machine + Compiler Requires experts in algorithms and computer architecture Frequent re-implementation Better way: automatic performance tuning FIR filters - image enhancement, equalization, speech synthesis, biomedical signal processing, … Wavelets - image compression, detection, denoising, signal recovery Numerically intensive - significant impact on application performance High-Performance Implementations (How-To)
6
Carnegie Mellon We want to close this gap Automatic performance tuning packages ATLAS, PHiPAC, SPARSITY, FFTW, SPIRAL Include several basic linear algebra and DSP functions No filtering and wavelet kernels ! Existing FIR filtering and DWT software Matlab ©, Mathematica, S+Wavelets, WaveLab, Wave++, IPP Application oriented – not machine specific, or Hand-tuned code for specific platform Automatic platform-adapted solutions not available !
7
Carnegie Mellon SPIRAL system Generator of optimized DSP transform implementations DSP transform specifies user goes for a coffee Formula Generator SPL Compiler Search Engine runtime on given platform controls implementation options controls algorithm generation fast algorithm as SPL formula C/Fortran/SIMD code S P I R A L (or an espresso for small transforms) platform-adapted implementation comes back
8
Carnegie Mellon SPIRAL’s Four Key Concepts Rule Formula a breakdown strategy - product of sparse matrices captures important structural information recursive application of rules uniquely defines an algorithm efficient representation easy manipulation Rule Tree few constructs and primitives uniquely defines an algorithm can be translated into code Transform parameterized matrix Diagonal matrix (twiddles) Kronecker product Identity Permutation
9
Carnegie Mellon Rule Trees = Formulas = Algorithms Rule Trees Formulas Algorithms Transform Rule Increasing level of abstraction
10
Carnegie Mellon Number of rule trees grows exponentially with Size of the transform Number of applicable rules Trigonometric Rule DCT-Recursive Rule Type-Split Rule Examples: Rules and Rule Trees Type-Conversion Rule SPIRAL also includes search over implementation options, such as the degree of loop unrolling Recursive RHT Rule
11
Carnegie Mellon FIR Filters and the DWT in SPIRAL: Challenges Existing filtering and wavelet algorithm representations do not capture all important structural information SPIRAL uses a concise math framework to represent many algorithms for several transforms (DFT, DCTs, DHT, RDFT, …) Filtering and wavelet algorithms have considerably different structure from the current SPIRAL transforms Current framework not enough to capture algorithms’ structure Existing algorithm representations need to be “spiralized” – i.e., captured concisely using the rule formalism Need changes on several levels: New transforms, constructs, rules, translation templates Modified search Need to rephrase several concepts of the framework Existing framework has to be redesigned to accommodate new constructions and decomposition strategies New framework has to be integrated in SPIRAL
12
Carnegie Mellon FIR Filters: Rules and Algorithms New constructs column and row overlapped direct sumoverlapped tensor product FIR Filter Transform Overlap-Add Rule Localized access of the input data Input is blocked into length b segments Time Domain Methods standard “sum” notation SPIRAL transform notation SPIRAL rule captures the structure concise rule notation Baseline Rule
13
Carnegie Mellon Blocking Rule Overlap-Save Rule Localized storage of the output Output segments are computed independently Toeplitz matrix Multilevel blocking Control over input & output locality Frequency Domain Methods Expanded Circulant Rule Cyclic-RDFT Rule Circulant matrixreal DFT Radix-2 Cooley- Tukey RDFT Rule Time domain - better data access structure Frequency domain - potentially reduce arithmetic cost sparse matrix
14
Carnegie Mellon Filter Bank – Mallat’s Algorithm DWT of {x n } DWT: Rules and Algorithms Mallat Rule Extension matrix Fused Mallat Rule single-level DWT Redundant computations removed Filter structure lost
15
Carnegie Mellon Primal (Predict) Dual (Update) Lifting steps Polyphase Representation Polyphase Rule Lifting Scheme Gateway to frequency domain computation Lifting Rule Asymptotically reduce cost by 50% In-place computations
16
Carnegie Mellon Examples: FIR Filter and DWT Rule Trees All these rule trees form an extensive search space of alternative algorithms in SPIRAL Exceedingly large number of trees even for modest size transforms
17
Carnegie Mellon Time-Domain Methods 60-70% improvement over baseline Same arithmetic cost unrolled loops 1- level blocking blocks size 2 Pentium 4 SUN overlap-add, overlap-save, and blocking all methods have the same cost baseline method = overlap-add with b=1 sanity check – percentage of the peak performance flops / peak performance comparison of different time-domain methods FIR Filter
18
Carnegie Mellon Frequency Domain vs. Time Domain circulant too sparse FD methods better on large “fat” circulants Filter lengths > 64 Considerable discrepancy between arithmetic costs and actual runtimes best time-domain FIR Filter arithmetic cost cross-over runtime cross-over Circulant “half–dense” circulants emerge in frequency domain methods Frequency domain – potentially lower cost Time domain – beter data access patterns
19
Carnegie Mellon Daubechies D 4 DWT XeonAthlon cache size reached direct method flops / peak performance for D 4 and D 30 single-level DWT using D 4 3 different classes of rule trees based on the initial rule Fused Mallat Rule (direct method) Polyphase Rule (frequency domain) Lifting Rule (lifting scheme) comparison of runtimes Frequency domain trees not found Lifting scheme – 30% lower cost – slower Short Wavelet
20
Carnegie Mellon Daubechies D 30 DWT Experiment Long Wavelet XeonAthlon cache size reached extension overhead lifting better temp cache Different methods are found for different size ranges Best implementations vary from machine to machine
21
Carnegie Mellon Summary Fast automatically generated code for FIR filters and DWT Significant contribution – no existing solutions Preliminary results Arithmetic cost not a reliable predictor of performance Best implementation not obvious for a human programmer Best implementations vary over different computer platforms What lies ahead? Further algorithmic and implementation optimizations Fusion of the extension, exploiting in-place structure, scheduling,.. Entire class of wavelets and extension methods Orthogonal, biorthogonal, frames,… Periodic, symmetric, polynomial extension, zero-padding Extend to other wavelet transforms 2D-DWT, M-band, wavelet packet transform Utilize special instruction set extensions Single Instruction Multiple Data (SIMD) Fused Multiply-Add (FMA)
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.