FFTs, Portability, & Performance

Steven G. Johnson, MIT
Matteo Frigo, ITA Software (formerly MIT LCS)

Fast Fourier Transforms (FFTs) [Gauss (1805), Cooley & Tukey (1965), & others…]
• Fast, O(n log n) algorithm(s) to compute the Discrete Fourier Transform (DFT) of n points
• Widely used in many fields, including to solve partial differential equations: e.g. electromagnetic eigenmodes of photonic crystals
• I couldn’t find any FFT code that was simultaneously fast (near-optimal), flexible, and portable, so with M. Frigo I wrote a new one (starting 1997): FFTW, the “Fastest Fourier Transform in the West” (fftw.org)
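For reference, the DFT of n points that these algorithms compute is the standard one (this sign convention also matches FFTW's FFTW_FORWARD):

$$X_k \;=\; \sum_{j=0}^{n-1} x_j \, e^{-2\pi i \, jk/n}, \qquad k = 0, \ldots, n-1.$$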

What’s the fastest algorithm for _____? (computer science = math + time = math + $)
1. Find the best asymptotic complexity: naïve DFT → FFT, O(n²) → O(n log n)
2. Find the best exact operation count? flops ≈ 5 n log₂ n [Cooley-Tukey, 1965, “radix 2”] → ≈ 4 n log₂ n [Yavne, 1968, “split radix”] → …
But for the last 15+ years, flop count (which varies by ~20%) no longer determines speed (which varies by a factor of ~10).
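For reference, the standard exact counts of real operations for complex inputs of size n = 2^m are:

$$\text{radix-2: } \sim 5\, n \log_2 n, \qquad \text{split radix: } 4\, n \log_2 n - 6n + 8.$$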

Better to change the question… What’s the fastest algorithm for _____? (computer science = math + time = math + $)
1. Find the best asymptotic complexity: naïve DFT → FFT, O(n²) → O(n log n)
2. Find the variant/implementation that runs fastest: hardware-dependent, an unstable answer!

A question with a more stable answer? What’s the simplest/smallest set of algorithmic steps whose compositions ~always span the ~fastest algorithm?

FFTW: the “Fastest Fourier Transform in the West”
• C library for real & complex FFTs (arbitrary size/dimensionality) (+ parallel versions for threads & MPI)
• Computational kernels (80% of the code) automatically generated
• Self-optimizes for your hardware (picks the best composition of steps)
= portability + performance
free software: http://www.fftw.org/

FFTW performance: power-of-two sizes, double precision. [Benchmark plots for an 833 MHz Alpha EV6, 2 GHz PowerPC G5, 2 GHz AMD Opteron, and 500 MHz UltraSPARC IIe.]

FFTW performance: non-power-of-two sizes, double precision. [Benchmark plots for an 833 MHz Alpha EV6 and 2 GHz AMD Opteron.] Unusual: non-power-of-two sizes receive as much optimization as powers of two… because we let the code do the optimizing.

FFTW performance: double precision, 2.8 GHz Pentium IV, 2-way SIMD (SSE2). [Benchmark plots for powers of two and non-powers of two.] Exploiting CPU-specific SIMD instructions (rewriting the code) is easy… because we let the code write itself.

Why is FFTW fast? Three unusual features:
1. The plan is executed with explicit recursion: enhances locality.
2. The base cases of the recursion are codelets: highly-optimized dense code automatically generated by a special-purpose “compiler”.
3. FFTW implements many FFT algorithms, and a planner picks the best composition by measuring the speed of different combinations.

FFTW is easy to use. (The slide abbreviates the fftw_ prefixes; in real code:)

    #include <fftw3.h>
    ...
    fftw_complex *x = fftw_malloc(n * sizeof(fftw_complex));
    fftw_plan p = fftw_plan_dft_1d(n, x, x, FFTW_FORWARD, FFTW_MEASURE);
    ...
    fftw_execute(p);   /* repeat as needed */
    ...
    fftw_destroy_plan(p);
    fftw_free(x);

Key fact: usually, many transforms of the same size are required, so the one-time (measured) planning cost is amortized over many executions.

Why is FFTW fast? Feature 1 of 3: the plan is executed with explicit recursion, which enhances locality.

Cooley-Tukey FFTs [1965 … or Gauss, 1805]
For n = pq, a 1d DFT of size n is re-expressed as a ~2d DFT of size p × q (view the data as a p × q array):
• first, DFT the columns, size q (non-contiguous)
• multiply by n “twiddle factors”
• transpose the p × q array to q × p
• finally, DFT the columns, size p (non-contiguous)
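In symbols, with an index convention chosen to match the step ordering above (a standard way of writing the identity): split the input index as j = j₁ + p·j₂ and the output index as k = q·k₁ + k₂, with 0 ≤ j₁, k₁ < p and 0 ≤ j₂, k₂ < q. Then, writing ω_m = e^{−2πi/m},

$$X_{q k_1 + k_2} \;=\; \sum_{j_1=0}^{p-1} \omega_p^{\,j_1 k_1} \left[\, \omega_n^{\,j_1 k_2} \left( \sum_{j_2=0}^{q-1} x_{j_1 + p\, j_2}\, \omega_q^{\,j_2 k_2} \right) \right].$$

The inner sums are the q-point column DFTs (stride p, hence non-contiguous), the n factors ω_n^{j₁k₂} are the twiddle factors, and the outer sums are the p-point DFTs.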

Cooley-Tukey FFTs [1965 … or Gauss, 1805]
1d DFT of size n = pq → ~2d DFT of size p × q (+ phase rotation by twiddle factors) → recursive DFTs of sizes p and q. A divide-and-conquer algorithm: O(n²) becomes O(n log n).

Cooley-Tukey FFTs [1965 … or Gauss, 1805]. [Diagram of the factorization: size-q DFTs, twiddles, size-p DFTs.]

Cooley-Tukey is Naturally Recursive. But the traditional implementation is non-recursive: a breadth-first traversal, with log₂ n passes over the whole array. [Diagram: a size-8 DFT with p = 2 (radix 2) splits into two size-4 DFTs, which split into four size-2 DFTs.]

Traditional cache solution: blocking. [Same size-8 radix-2 diagram, grouped into cache-sized blocks.] Still breadth-first, but with blocks of size = cache. …requires a program specialized for the cache size.

Recursive Divide & Conquer is Good (depth-first traversal) [Singleton, 1967]. [Same size-8 radix-2 diagram, traversed depth-first.] Sub-problems eventually become small enough to fit in cache… no matter what size the cache is.
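To make the depth-first picture concrete, here is a minimal sketch of a recursive radix-2 Cooley-Tukey DFT in C (my illustration of the idea, not FFTW's actual code; FFTW's base cases are large generated codelets, not n = 1):

    #include <complex.h>

    /* Depth-first, recursive radix-2 Cooley-Tukey DFT for n a power of
       two.  "is" is the input stride: the even- and odd-indexed
       sub-transforms recurse with stride 2*is, so no copying is needed. */
    static void fft_recursive(int n, const double complex *in, int is,
                              double complex *out)
    {
        if (n == 1) {               /* base case (a codelet, in FFTW) */
            out[0] = in[0];
            return;
        }
        /* two complete half-size DFTs, each finished before the next */
        fft_recursive(n / 2, in,      2 * is, out);          /* evens */
        fft_recursive(n / 2, in + is, 2 * is, out + n / 2);  /* odds  */
        for (int k = 0; k < n / 2; ++k) {  /* combine with twiddles */
            double complex t =
                cexp(-2 * 3.141592653589793 * I * k / n) * out[k + n / 2];
            out[k + n / 2] = out[k] - t;
            out[k]         = out[k] + t;
        }
    }

Each recursive call completes an entire sub-transform before the next one starts, so once a sub-problem fits in some level of cache it stays there, whatever the cache size.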

Cache Obliviousness
• A cache-oblivious algorithm does not know the cache size: it can be optimal for any machine & for all levels of cache simultaneously
• Cache-oblivious algorithms exist for many other problems, too [Frigo et al. 1999], all via the recursive divide & conquer approach

Why is FFTW fast? Feature 2 of 3: the base cases of the recursion are codelets, highly-optimized dense code automatically generated by a special-purpose “compiler”.

The Codelet Generator: a domain-specific FFT “compiler”
• Generates fast hard-coded C for FFTs of arbitrary size
• Necessary to give the planner a large space of codelets to experiment with (any factorization)
• Exploits modern CPUs’ deep pipelines & large register sets
• Allows easy experimentation with different optimizations & algorithms …CPU-specific hacks (SIMD) become feasible
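For flavor, here is a hand-written analogue of what a tiny generated codelet might look like (purely illustrative; real generated codelets handle much larger sizes, are machine-scheduled, and have machine-generated variable names):

    /* DFT of size 2 as straight-line code: no loops, no twiddle-table
       lookups; real & imaginary parts in separate strided arrays. */
    static void dft_2(const double *ri, const double *ii,
                      double *ro, double *io, int is, int os)
    {
        double ar = ri[0],  ai = ii[0];
        double br = ri[is], bi = ii[is];
        ro[0]  = ar + br;  io[0]  = ai + bi;
        ro[os] = ar - br;  io[os] = ai - bi;
    }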

The Codelet Generator is a pipeline, written in Objective Caml [Leroy, 1998], an ML dialect:
1. Abstract FFT algorithm for size n (Cooley-Tukey: n = pq; Prime-Factor: gcd(p,q) = 1; Rader: n prime; …)
2. Symbolic graph (dag)
3. Simplifications: powerful enough to e.g. derive a real-input FFT from a complex FFT algorithm, and even to find new algorithms
4. Cache-oblivious scheduling (cache = registers)
5. Optimized C code (or another language)

The Generator Finds Good/New FFTs

Symbolic Algorithms are Easy: Cooley-Tukey in OCaml. [Code listing in the original slide.]

Simple Simplifications. Well-known optimizations:
• Algebraic simplification, e.g. a + 0 = a
• Constant folding
• Common-subexpression elimination

Symbolic Pattern Matching in OCaml. The following actual code fragment is solely responsible for simplifying multiplications:

    stimesM = function
      | (Uminus a, b) -> stimesM (a, b) >>= suminusM
      | (a, Uminus b) -> stimesM (a, b) >>= suminusM
      | (Num a, Num b) -> snumM (Number.mul a b)
      | (Num a, Times (Num b, c)) ->
          snumM (Number.mul a b) >>= fun x -> stimesM (x, c)
      | (Num a, b) when Number.is_zero a -> snumM Number.zero
      | (Num a, b) when Number.is_one a -> makeNode b
      | (Num a, b) when Number.is_mone a -> suminusM b
      | (a, b) when is_known_constant b && not (is_known_constant a) ->
          stimesM (b, a)
      | (a, b) -> makeNode (Times (a, b))

(Common-subexpression elimination is implicit via “memoization” and a monadic programming style.)

Simple Simplifications (continued). Beyond the well-known optimizations above, there are FFT-specific ones:
• _________________ negative constants…
• Network transposition (transpose + simplify + transpose)

A Quiz: Is One Faster? Both fragments compute the same thing and have the same number of arithmetic operations:

    /* version 1 */            /* version 2 */
    a = 0.5 * b;               a = 0.5 * b;
    c = 0.5 * d;               c = -0.5 * d;
    e = 1.0 + a;               e = 1.0 + a;
    f = 1.0 - c;               f = 1.0 + c;

Answer: version 1 is faster (a 10–15% speedup), because there is no separate constant load for -0.5.

Non-obvious transformations require experimentation

Quiz 2: Which is Faster? Accessing a strided array inside a codelet (amid dense numeric code):

    array[stride * i]

or the same access through a precomputed stride table (strides[i] = stride * i):

    array[strides[i]]

The first is faster, of course! …Except on brain-dead architectures, namely Intel Pentia, where integer multiplication conflicts with floating point: there, the precomputed table gives up to a ~20% speedup.

Machine-specific hacks are feasible if you just generate special code:
• stride precomputation
• SIMD instructions (SSE, Altivec, 3DNow!)
• fused multiply-add instructions…

Why is FFTW fast? Feature 3 of 3: FFTW implements many FFT algorithms, and a planner picks the best composition by measuring the speed of different combinations.

What does the planner compose? (new in FFTW 3, spring 2003)
• The Cooley-Tukey algorithm presents many choices: which factorization? what order? memory reshuffling?
• Goal: find simple steps that combine without restriction to form many different algorithms.
• Steps solve a problem, specified as DFT(v, n): multi-dimensional “vector loops” v of multi-dimensional transforms n, where each dimension is a set of (size, input/output strides).
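As a C sketch of what such a problem descriptor looks like: FFTW 3's public “guru” interface really does describe each dimension as a (size, input stride, output stride) triple via its fftw_iodim type, but the structs below are illustrative stand-ins for the internal planner types, not FFTW's actual definitions:

    /* One dimension: (size, input stride, output stride),
       in the spirit of FFTW 3's fftw_iodim. */
    typedef struct { int n, is, os; } iodim;

    /* A DFT problem: "rank" dimensions to be Fourier-transformed, plus
       "vrank" vector loops (strided repetitions that are not transformed).
       Illustrative names, not FFTW internals. */
    typedef struct {
        int    rank;    /* number of transform dimensions */
        iodim *sz;      /* the transform sizes/strides n */
        int    vrank;   /* number of vector-loop dimensions */
        iodim *vecsz;   /* the vector loops v */
    } dft_problem;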

Some Composable Steps (out of ~16):
• SOLVE: directly solve a small DFT by a codelet
• CT-FACTOR[r]: radix-r Cooley-Tukey step → r (loop) sub-problems of size n/r
• VECLOOP: perform one vector loop (can choose any loop, i.e. loop reordering)
• INDIRECT: DFT = copy + in-place DFT (separates copy/reordering from the DFT)
• TRANSPOSE: solve an in-place m × n transpose

Dynamic Programming: the assumption of “optimal substructure”. Try all applicable steps:

    DFT(16) = fastest of:  CT-FACTOR[2]: 2 DFT(8)
                           CT-FACTOR[4]: 4 DFT(4)

    DFT(8)  = fastest of:  CT-FACTOR[2]: 2 DFT(4)
                           CT-FACTOR[4]: 4 DFT(2)
                           SOLVE[1,8]

If exactly the same problem appears twice, assume that we can re-use the plan.
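A toy sketch of this dynamic programming in C, under loudly-stated assumptions: FFTW plans over full problem descriptors (strides, vector loops, …) and times real executions, while this illustration keys only on the size n, restricts itself to SOLVE and CT-FACTOR, and uses an invented placeholder cost function:

    #define MAXN 4096

    static double best_cost[MAXN + 1];   /* memo: 0.0 = not planned yet */
    static int    best_radix[MAXN + 1];  /* chosen CT-FACTOR radix; 0 = SOLVE */

    /* Placeholder for FFTW's real measurements (it times candidate
       plans on your machine); an arbitrary invented cost model. */
    static double measure_cost(int n, int radix) { return (double)n; }

    static double plan(int n)
    {
        if (best_cost[n] > 0.0)
            return best_cost[n];            /* same problem: re-use the plan */
        double best = measure_cost(n, 0);   /* SOLVE: direct codelet */
        best_radix[n] = 0;
        for (int r = 2; r < n; ++r)         /* try each CT-FACTOR[r] */
            if (n % r == 0) {
                /* radix-r step: r sub-problems of size n/r, each planned
                   optimally (the optimal-substructure assumption) */
                double c = measure_cost(n, r) + r * plan(n / r);
                if (c < best) { best = c; best_radix[n] = r; }
            }
        return best_cost[n] = best;
    }

After plan(16), the chosen decomposition can be read off by following best_radix from 16 down to the SOLVE base cases.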

Many Resulting “Algorithms”
• INDIRECT + TRANSPOSE gives in-place DFTs: bit-reversal = a product of transpositions… no separate bit-reversal “pass” [Johnson (unrelated) & Burrus (1984)]
• VECLOOP can push the topmost loop to the “leaves”: the “vector” FFT algorithm [Swarztrauber, 1987]
• CT-FACTOR then VECLOOP(s) gives a “breadth-first” FFT: erases the iterative/recursive distinction

Actual plan for size 2¹⁹ = 524288 (2 GHz Pentium IV, double precision, out-of-place):

    CT-FACTOR[4] (buffered variant)
    CT-FACTOR[32] (buffered variant)
    VECLOOP(b) x32
    CT-FACTOR[64]
    INDIRECT
    VECLOOP(b) x64
    VECLOOP(a) x4
    COPY[64]
    VECLOOP(a) x4
    SOLVE[64, 64]   (a single codelet of ~2000 lines of hard-coded C!)

INDIRECT + VECLOOP(b) (+ …) = huge improvements for large 1d sizes. Unpredictable: (automated) experimentation is the only solution.

Planner Unpredictability (double precision, power-of-two sizes, 2 GHz PowerPC G5). [Benchmark plot comparing FFTW 3’s measured plans against heuristics.] The classic strategy of minimizing the operation count fails badly; so does the heuristic of picking the plan with the fewest adds + multiplies + loads/stores. Another test: use the plan from another machine, e.g. a Pentium IV? …you lose 20–40%.

We’ve Come a Long Way
• 1965: Cooley & Tukey, IBM 7094, 36-bit single precision: size-2048 DFT in 1.2 seconds
• 2003: FFTW 3 + SIMD, 2 GHz Pentium IV, 64-bit double precision: size-2048 DFT in 50 microseconds
A 24,000x speedup (= 30% improvement per year)
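The arithmetic behind those parenthetical claims:

$$\frac{1.2\ \mathrm{s}}{50\ \mu\mathrm{s}} = 24{,}000, \qquad 24{,}000^{1/(2003-1965)} = 24{,}000^{1/38} \approx 1.30,$$

i.e. roughly a 30% compounded improvement per year, counting hardware and algorithm/implementation gains together.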

We’ve Come a Long Way? In the name of performance, computers have become complex & unpredictable.
• Optimization is hard: simple heuristics (e.g. fewest flops) no longer work.
• The solution is to avoid the details, not embrace them: (recursive) composition of simple modules + feedback (self-optimization).
• High-level languages (not C) & code generation are a powerful tool for high performance.

FFTW Homework Problems?
• Try an FFTPACK-style back-and-forth solver
• Implement vector-radix for multi-dimensional n
• Pruned FFTs: a VECLOOP that skips zeros
• Better heuristic planner: some sort of optimization of per-solver “costs”?
• Implement convolution solvers