CS 6068 Parallel Computing Fall 2013 Lecture 10 – Nov 18 The Parallel FFT Prof. Fred Office Hours: MWF or by appointment Tel: Meeting: Mondays 6:00-8:50PM Baldwin 645
Outline
– Fourier analysis
– Discrete Fourier transform
– Fast Fourier transform
– Parallel implementation
Discrete Fourier Transform
Many applications in science and engineering
Examples
– Voice recognition
– Image processing
Straightforward implementation: $\Theta(n^2)$
Fast Fourier transform: $\Theta(n \log n)$
Parallel FFT: $\Theta(\log n)$ parallel steps
Fourier Analysis
Fourier analysis: Represent periodic continuous functions by a (potentially infinite) series of sine and cosine functions
Discrete Fourier transform: Map a sequence over time to another sequence over frequency
– Signal strength as a function of time
– Fourier coefficients as a function of frequency
DFT Example (1/4) 16 data points representing signal strength over time
DFT Example (2/4) DFT yields amplitudes and frequencies of sine/cosine functions
DFT Example (3/4) Plot of four constituent sine/cosine functions and their sum
DFT Example (4/4) Continuous function and original 16 samples.
DFT of Speech Sample
“Angora cats are furrier...”
Panels: Signal; Frequency and amplitude
Figure courtesy Ron Cole and Yeshwant Muthusamy of the Oregon Graduate Institute
Computing DFT
Matrix-vector product $F_n x$
– $x$ is the input vector ($n$ signal samples)
– $F_n$ is the Fourier matrix of order $n$
– $f_{i,j} = \omega_n^{ij}$ for $0 \le i, j < n$, where $\omega_n$ is a primitive $n$th root of unity
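The matrix-vector formulation can be sketched directly in Python. This is a minimal illustration, not an efficient implementation: it assumes the convention $\omega_n = e^{2\pi i/n}$ used on the Roots of Unity slide (many libraries use the conjugate, $e^{-2\pi i/n}$), and the names `fourier_matrix` and `dft` are hypothetical.

```python
import cmath

def fourier_matrix(n):
    """Build the n x n Fourier matrix with entries w**(i*j).

    Assumption: w = e^(2*pi*i/n), matching the slides' convention.
    """
    w = cmath.exp(2j * cmath.pi / n)
    return [[w ** (i * j) for j in range(n)] for i in range(n)]

def dft(x):
    """Compute the DFT of x as the matrix-vector product F_n x (Theta(n^2) work)."""
    n = len(x)
    F = fourier_matrix(n)
    return [sum(F[i][j] * x[j] for j in range(n)) for i in range(n)]

if __name__ == "__main__":
    print(dft([2, 3]))
```

Each of the $n$ outputs needs $n$ multiply-adds, which is where the $\Theta(n^2)$ cost of the straightforward method comes from.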
Discrete Fourier Transform
Given a polynomial $a_0 + a_1 x + \cdots + a_{n-1} x^{n-1}$, evaluate it at $n$ distinct points $x_0, \ldots, x_{n-1}$.
Key idea: choose $x_k = \omega^k$, where $\omega$ is a principal $n$th root of unity.
Example 1
Compute the DFT of the vector (2, 3)
$\omega_2$, the primitive square root of unity, is $-1$
Example 2
Compute the DFT of the vector (1, 2, 4, 3)
$\omega_4$, the primitive 4th root of unity, is $i$
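Both examples can be worked through the matrix-vector product $F_n x$. The computation below is a sketch that assumes the entries $f_{i,j} = \omega_n^{ij}$ with $\omega_2 = -1$ and $\omega_4 = i$, as stated in the examples.

```latex
F_2 \begin{pmatrix} 2 \\ 3 \end{pmatrix}
  = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}
    \begin{pmatrix} 2 \\ 3 \end{pmatrix}
  = \begin{pmatrix} 5 \\ -1 \end{pmatrix}

F_4 \begin{pmatrix} 1 \\ 2 \\ 4 \\ 3 \end{pmatrix}
  = \begin{pmatrix}
      1 & 1 & 1 & 1 \\
      1 & i & -1 & -i \\
      1 & -1 & 1 & -1 \\
      1 & -i & -1 & i
    \end{pmatrix}
    \begin{pmatrix} 1 \\ 2 \\ 4 \\ 3 \end{pmatrix}
  = \begin{pmatrix} 10 \\ -3 - i \\ 0 \\ -3 + i \end{pmatrix}
```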
Roots of Unity
Def. An $n$th root of unity is a complex number $x$ such that $x^n = 1$.
Fact. The $n$th roots of unity are $\omega^0, \omega^1, \ldots, \omega^{n-1}$, where $\omega = e^{2\pi i/n}$.
Pf. $(\omega^k)^n = (e^{2\pi i k/n})^n = (e^{\pi i})^{2k} = (-1)^{2k} = 1$.
Fact. The $(n/2)$th roots of unity are $\nu^0, \nu^1, \ldots, \nu^{n/2-1}$, where $\nu = e^{4\pi i/n}$.
Fact. $\omega^2 = \nu$ and $(\omega^2)^k = \nu^k$.
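Figure: the 8th roots of unity on the unit circle ($n = 8$): $\omega^0 = \nu^0 = 1$, $\omega^1$, $\omega^2 = \nu^1 = i$, $\omega^3$, $\omega^4 = \nu^2 = -1$, $\omega^5$, $\omega^6 = \nu^3 = -i$, $\omega^7$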
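The relationships between the $n$th and $(n/2)$th roots of unity can be checked numerically for $n = 8$. This is a small sketch; `roots_of_unity` and `check_relations` are hypothetical helper names, and comparisons use a tolerance because the values are floating-point.

```python
import cmath

def roots_of_unity(n):
    """Return [w^0, w^1, ..., w^(n-1)] with w = e^(2*pi*i/n)."""
    return [cmath.exp(2j * cmath.pi * k / n) for k in range(n)]

def check_relations(n, tol=1e-12):
    """Check (w^k)^2 = nu^k and w^(k+n/2) = -w^k for 0 <= k < n/2."""
    omega = roots_of_unity(n)       # nth roots
    nu = roots_of_unity(n // 2)     # (n/2)th roots
    for k in range(n // 2):
        if abs(omega[k] ** 2 - nu[k]) > tol:
            return False
        if abs(omega[k + n // 2] + omega[k]) > tol:
            return False
    return True

if __name__ == "__main__":
    print(check_relations(8))
```

These two identities are what let the FFT reuse each half-size evaluation for two of the full-size outputs.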
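Fast Fourier Transform via Divide and Conquer
Goal. Evaluate a degree $n-1$ polynomial $A(x) = a_0 + a_1 x + \cdots + a_{n-1} x^{n-1}$ at the $n$th roots of unity: $\omega^0, \omega^1, \ldots, \omega^{n-1}$.
Divide. Break the polynomial up into even and odd powers.
– $A_{\text{even}}(x) = a_0 + a_2 x + a_4 x^2 + \cdots + a_{n-2} x^{n/2-1}$.
– $A_{\text{odd}}(x) = a_1 + a_3 x + a_5 x^2 + \cdots + a_{n-1} x^{n/2-1}$.
– $A(x) = A_{\text{even}}(x^2) + x A_{\text{odd}}(x^2)$.
Conquer. Evaluate $A_{\text{even}}(x)$ and $A_{\text{odd}}(x)$ at the $(n/2)$th roots of unity: $\nu^0, \nu^1, \ldots, \nu^{n/2-1}$, where $\nu = \omega^2$.
Combine.
– $A(\omega^k) = A_{\text{even}}(\nu^k) + \omega^k A_{\text{odd}}(\nu^k)$, $0 \le k < n/2$
– $A(\omega^{k+n/2}) = A_{\text{even}}(\nu^k) - \omega^k A_{\text{odd}}(\nu^k)$, $0 \le k < n/2$
The combine step uses $\omega^{k+n/2} = -\omega^k$ and $\nu^k = (\omega^k)^2 = (\omega^{k+n/2})^2$.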
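The even/odd split identity $A(x) = A_{\text{even}}(x^2) + x A_{\text{odd}}(x^2)$ holds for any $x$, not only roots of unity, so it can be checked with plain integers. The sketch below uses hypothetical helper names `evaluate` and `evaluate_split`.

```python
def evaluate(coeffs, x):
    """Evaluate a_0 + a_1 x + ... + a_(n-1) x^(n-1) by Horner's rule."""
    result = 0
    for c in reversed(coeffs):
        result = result * x + c
    return result

def evaluate_split(coeffs, x):
    """Evaluate the same polynomial as A_even(x^2) + x * A_odd(x^2)."""
    even = coeffs[0::2]   # a_0, a_2, a_4, ...
    odd = coeffs[1::2]    # a_1, a_3, a_5, ...
    return evaluate(even, x * x) + x * evaluate(odd, x * x)

if __name__ == "__main__":
    a = [1, 2, 4, 3]      # 1 + 2x + 4x^2 + 3x^3
    print(evaluate(a, 2), evaluate_split(a, 2))  # prints "45 45"
```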
FFT Algorithm
fft(n, a_0, a_1, …, a_{n-1}) {
    if (n == 1) return a_0
    (e_0, e_1, …, e_{n/2-1}) ← fft(n/2, a_0, a_2, a_4, …, a_{n-2})
    (d_0, d_1, …, d_{n/2-1}) ← fft(n/2, a_1, a_3, a_5, …, a_{n-1})
    for k = 0 to n/2 - 1 {
        ω^k ← e^{2πik/n}
        y_k ← e_k + ω^k d_k
        y_{k+n/2} ← e_k - ω^k d_k
    }
    return (y_0, y_1, …, y_{n-1})
}
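The recursive algorithm translates almost line for line into Python. This is a minimal sketch assuming the input length is a power of two and the convention $\omega = e^{2\pi i/n}$ from the slides; it is not tuned for performance.

```python
import cmath

def fft(a):
    """Recursive radix-2 FFT of a list whose length is a power of two.

    Assumption: w = e^(2*pi*i/n), as in the slides (libraries often use
    the conjugate convention).
    """
    n = len(a)
    if n == 1:
        return [a[0]]
    e = fft(a[0::2])          # transform of even-indexed coefficients
    d = fft(a[1::2])          # transform of odd-indexed coefficients
    y = [0] * n
    for k in range(n // 2):
        w = cmath.exp(2j * cmath.pi * k / n)
        y[k] = e[k] + w * d[k]
        y[k + n // 2] = e[k] - w * d[k]
    return y

if __name__ == "__main__":
    print(fft([1, 2, 4, 3]))
```

Each level of recursion does $\Theta(n)$ work in the combine loop and there are $\log n$ levels, giving the $\Theta(n \log n)$ total.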
Odd-Even Recursion Tree
Level 0: a0, a1, a2, a3, a4, a5, a6, a7
Level 1: a0, a2, a4, a6 | a1, a3, a5, a7
Level 2: a0, a4 | a2, a6 | a1, a5 | a3, a7
Level 3: a0 | a4 | a2 | a6 | a1 | a5 | a3 | a7 ("bit-reversed" order)
Figure label: perfect shuffle
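The leaf order of the recursion tree can be reproduced in two ways: by repeatedly splitting into even and odd positions, or by reversing the bits of each index. The sketch below compares the two, assuming the length is a power of two; the function names are hypothetical.

```python
def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of the integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def bit_reversed_order(n):
    """Indices 0..n-1 listed in bit-reversed order (n a power of two)."""
    bits = n.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n)]

def odd_even_leaf_order(indices):
    """Leaf order produced by recursively splitting into even/odd positions."""
    if len(indices) == 1:
        return list(indices)
    return odd_even_leaf_order(indices[0::2]) + odd_even_leaf_order(indices[1::2])

if __name__ == "__main__":
    print(bit_reversed_order(8))                 # [0, 4, 2, 6, 1, 5, 3, 7]
    print(odd_even_leaf_order(list(range(8))))   # [0, 4, 2, 6, 1, 5, 3, 7]
```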
Phases of Parallel FFT Algorithm
Phase 1: Processes permute the a's (global bit-reversal data communication pattern)
Phase 2:
– First $\log n - \log p$ iterations of FFT
– Handled in shared memory
– No global communication is required
Phase 3:
– Final $\log p$ iteration steps must be handled globally
– Organized as a logical hypercube
– In each iteration every process swaps values with its partner across a hypercube dimension
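The three phases can be illustrated with a sequential simulation of the iterative FFT. This is a sketch under stated assumptions, not the lecture's implementation: $n$ and $p$ are powers of two, the $n$ values are block-distributed so that process $r$ owns indices $r \cdot n/p$ through $(r+1) \cdot n/p - 1$, and the names `iterative_fft`, `stage_is_local`, and `partner_rank` are hypothetical.

```python
import cmath

def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def iterative_fft(a):
    """Bit-reversal permutation followed by log n butterfly stages."""
    n = len(a)
    bits = n.bit_length() - 1
    y = [a[bit_reverse(i, bits)] for i in range(n)]   # Phase 1: permute
    for s in range(bits):                             # Phases 2 and 3
        half = 1 << s                                 # butterfly distance
        w_step = cmath.exp(2j * cmath.pi / (2 * half))
        for start in range(0, n, 2 * half):
            w = 1
            for k in range(half):
                u = y[start + k]
                t = w * y[start + k + half]
                y[start + k] = u + t
                y[start + k + half] = u - t
                w *= w_step
    return y

def stage_is_local(s, n, p):
    """True if stage s pairs only elements held by the same process."""
    return (1 << s) < n // p

def partner_rank(rank, s, n, p):
    """Partner process in a global stage: flip one bit of the rank."""
    local_stages = (n // p).bit_length() - 1          # log n - log p
    return rank ^ (1 << (s - local_stages))

if __name__ == "__main__":
    n, p = 16, 4
    print([stage_is_local(s, n, p) for s in range(4)])  # [True, True, False, False]
```

With $n = 16$ and $p = 4$, the first $\log n - \log p = 2$ stages touch only local data, and the last $\log p = 2$ stages pair each process with a neighbor that differs in one bit of its rank, which is the hypercube exchange pattern.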
FFT in Practice: Sequential and Parallel
FFTW, the "Fastest Fourier Transform in the West" [Frigo and Johnson]
– Optimized C library.
– Features: DFT, DCT, real, complex, any size, any dimension.
– Won 1999 Wilkinson Prize for Numerical Software.
– Portable, competitive with vendor-tuned code.
cuFFT, the NVIDIA CUDA Fast Fourier Transform library
– Provides a simple interface for computing FFTs, advertised by NVIDIA as up to 10x faster.
– Uses the hundreds of processor cores inside NVIDIA GPUs, so it delivers the floating-point performance of a GPU without requiring you to develop a custom GPU FFT.
Reference:
Summary
Discrete Fourier transform is used in many scientific and engineering applications
Fast Fourier transform is important because it computes the DFT in time $\Theta(n \log n)$
Developed parallel implementation of FFT
Why isn't scalability better?
– $\Theta(n \log n)$ sequential algorithm
– Parallel version requires bit-reversal data exchange
– $\log n$ parallel phase steps