Parallel Processing (CS 730)
Lecture 7: Shared Memory FFTs*
Jeremy R. Johnson
Wed. Feb. 14, 2001
*Parts of this lecture were derived from Chapter IX in Lipson.
Introduction
Objective: To derive and implement a shared-memory parallel program for computing the fast Fourier transform (FFT).
Topics
Derivation of the FFT
Recursive version
Iterative version
A parallel divide & conquer algorithm using threads
A parallel loop version using OpenMP
Obtaining additional parallelism
FFT as a Matrix Factorization
Compute y = F_n x, where F_n is the n-point Fourier matrix.
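As a baseline for the factorizations that follow, here is a minimal sketch of computing y = F_n x directly in O(n^2) operations. It assumes omega_n = exp(-2*pi*i/n), the sign convention used by the iterative implementations later in this lecture; the variable names (omega, F, err) are illustrative only.

```matlab
% Direct evaluation of y = F_n x, for comparison with an FFT.
% Assumption: F_n(j,k) = omega_n^(j*k), j,k = 0..n-1, omega_n = exp(-2*pi*1i/n).
n = 8;
x = (1:n).';                              % sample input (column vector)
omega = exp(-2*pi*1i/n);                  % primitive n-th root of unity
F = omega .^ ((0:n-1).' * (0:n-1));       % n-point Fourier matrix
y = F * x;                                % O(n^2) matrix-vector product
err = norm(y - fft(x));                   % built-in fft uses the same sign
```

Under this assumption err should be at the level of rounding error; with the opposite sign for omega_n the product would instead match n*ifft(x).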
Matrix Factorizations and Algorithms
function y = fft(x)
% Recursive FFT. x is a row vector whose length n is a power of two.
n = length(x);
if n == 1
  y = x;
else
  % [x0 x1] = L^n_2 x
  x0 = x(1:2:n-1); x1 = x(2:2:n);
  % [t0 t1] = (I_2 tensor F_m)[x0 x1]
  t0 = fft(x0); t1 = fft(x1);
  % w = W_m(omega_n), with omega_n = exp(-2*pi*i/n) as in the iterative versions
  w = exp((-2*pi*i/n)*(0:n/2-1));
  % y = [y0 y1] = (F_2 tensor I_m) T^n_m [t0 t1]
  y0 = t0 + w.*t1; y1 = t0 - w.*t1;
  y = [y0 y1];
end
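The comments in the recursive code correspond to the factorization F_n = (F_2 tensor I_m) T^n_m (I_2 tensor F_m) L^n_2 with n = 2m. The following sketch checks that identity numerically for n = 8 using kron. It assumes omega_n = exp(-2*pi*i/n), that L^n_2 gathers the even-indexed entries before the odd-indexed ones, and that T^n_m is the direct sum of I_m and W_m(omega_n); dftmat is a hypothetical helper, not part of the lecture.

```matlab
% Numerical check of one Cooley-Tukey step for n = 2m.
n = 8; m = n/2;
dftmat = @(k) exp(-2*pi*1i/k) .^ ((0:k-1).' * (0:k-1));  % F_k (hypothetical helper)
I_n = eye(n);
L = I_n([1:2:n, 2:2:n], :);                % L^n_2: L*x = [x(1:2:n); x(2:2:n)]
W = diag(exp(-2*pi*1i/n) .^ (0:m-1));      % W_m(omega_n)
T = blkdiag(eye(m), W);                    % T^n_m = I_m (+) W_m
F = kron(dftmat(2), eye(m)) * T * kron(eye(2), dftmat(m)) * L;
err = norm(F - dftmat(n));                 % should be rounding error only
```

Reading the product from right to left reproduces the four steps of the recursive function: split, two half-size transforms, twiddle scaling, and the final butterflies.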
Rewrite Rules
FFT Variants
Cooley-Tukey
Recursive FFT
Iterative FFT
Vector FFT (Stockham)
Vector FFT (Korn-Lambiotte)
Parallel FFT (Pease)
Tensor Permutations
A natural class of permutations compatible with the FFT.
Let σ be a permutation of {1,…,t}.
Mixed-radix counting permutation of vector indices.
Well-known examples are stride permutations and bit-reversal.
Example (Stride Permutation)
000  000
001  100
010  001
011  101
100  010
101  110
110  011
111  111
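A stride permutation on 2^t points can be viewed as a cyclic rotation of the index bits. The sketch below, for t = 3, rotates each 3-bit index right by one position and checks that scattering the data to those positions gives the even-then-odd ordering of L^8_2. The variable names (idx, dest, tbl) are illustrative, and reading the mapping as "source index to destination position" is an assumption.

```matlab
% Stride permutation as a rotation of index bits (t = 3, n = 8).
t = 3; n = 2^t;
idx  = (0:n-1).';                          % source indices b2 b1 b0
dest = floor(idx/2) + (n/2)*mod(idx, 2);   % rotated indices b0 b2 b1
x = (0:n-1).';
y = zeros(n, 1);
y(dest+1) = x(idx+1);                      % scatter x(idx) to position dest
tbl = [dec2bin(idx, t), repmat(' ', n, 1), dec2bin(dest, t)];
disp(tbl)                                  % one "source destination" row per index
```

Here y equals x([1:2:n, 2:2:n]), so the single-bit rotation and the stride-2 gather describe the same permutation.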
Example (Bit Reversal)
000  000
001  100
010  010
011  110
100  001
101  101
110  011
111  111
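The iterative implementations in this lecture call a helper named bitreversal that is not defined on the slides. The following is one possible version, offered as a sketch rather than the author's code; it assumes the length of x is a power of two and returns the entries of x in bit-reversed index order.

```matlab
function y = bitreversal(x)
% Sketch of a bit-reversal permutation (assumed helper, not from the slides).
% Assumes length(x) = 2^t for an integer t >= 0.
n = length(x);
t = round(log2(n));
idx = 0:n-1;                     % indices still to be consumed, bit by bit
rev = zeros(1, n);               % bit-reversed indices being built
for b = 1:t
    rev = 2*rev + mod(idx, 2);   % append the lowest remaining bit of idx
    idx = floor(idx/2);          % drop that bit
end
y = x(rev+1);                    % gather; y keeps the orientation of x
end
```

For example, bitreversal(0:7) returns the index column of the table above read in decimal: 0 4 2 6 1 5 3 7.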
Iterative Cooley-Tukey Algorithm
[Figure: dataflow diagram; labels R, Stage 0, Stage 1, Stage 2, Stage 3]
Modified Pease Algorithm
[Figure: dataflow diagram; labels Stage 0, Stage 1, Stage 2, Stage 3; index labels 1 to 15 in natural order and in even-then-odd order (2, 4, ..., 14, 1, 3, ..., 15)]
Iterative Implementation
function y = ifft2(x)
% Input: x a vector of length n. n = 2^t, t an integer, t >= 0.
% Output: y = F_{2^t} x
% Algorithm: Iterative.
% F_{2^t} = { Prod_{c=1}^t (I_{2^{c-1}} @ F_2 @ I_{2^{t-c}})
%             (I_{2^{c-1}} @ T^{2^{t-c+1}}_{2^{t-c}}) } R^{2^t}
% Note: the name ifft2 shadows MATLAB's built-in 2-D inverse FFT.
n = length(x); t = ceil(log2(n));
xt = bitreversal(x(:));   % work on a column vector; bitreversal is not shown on this slide
yt = zeros(n,1);
for c=t:-1:1
  m = 2^(c-1); p = 2^(t-c);
  % W = W_p(omega_{2p})
  W = exp(-(2*pi*i)/(2*p)*(0:p-1)');
  % yt = (I_m @ F_2 @ I_p)(I_m @ T^{2p}_p) xt
  for j=0:m-1
    % y^{2p}_{j*2p+1} = (F_2 @ I_p)T^{2p}_p x^{2p}_{j*2p+1}
    %                 = (F_2 @ I_p)(I_p $ W) x^{2p}_{j*2p+1}
    a = j*2*p+1:(j*2+1)*p;         % first half of block j
    b = (j*2+1)*p+1:(j+1)*2*p;     % second half of block j
    xt(b) = W .* xt(b);
    yt(a) = xt(a) + xt(b);
    yt(b) = xt(a) - xt(b);
  end
  xt = yt;
end
y = xt;   % also correct when n = 1 and the loop does not run
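The product formula in the header comment of ifft2 can be checked numerically by building each factor explicitly. The sketch below does this for t = 3, assuming @ denotes the tensor (Kronecker) product, T^{2p}_p is the direct sum of I_p and W_p(omega_{2p}), and R^{2^t} is the bit-reversal permutation matrix; the variable names are illustrative.

```matlab
% Check F_{2^t} = { Prod_{c=1}^t (I_m @ F_2 @ I_p)(I_m @ T^{2p}_p) } R^{2^t}, t = 3.
t = 3; n = 2^t;
F2 = [1 1; 1 -1];
% Bit-reversal permutation matrix R: R*x lists x in bit-reversed index order.
tmp = 0:n-1; rev = zeros(1, n);
for b = 1:t
    rev = 2*rev + mod(tmp, 2);
    tmp = floor(tmp/2);
end
I_n = eye(n);
R = I_n(rev+1, :);
A = R;                                          % rightmost factor is applied first
for c = t:-1:1
    m = 2^(c-1); p = 2^(t-c);
    W = diag(exp(-2*pi*1i/(2*p)) .^ (0:p-1));   % W_p(omega_{2p})
    T = blkdiag(eye(p), W);                     % T^{2p}_p
    A = kron(eye(m), kron(F2, eye(p))) * kron(eye(m), T) * A;
end
Fn = exp(-2*pi*1i/n) .^ ((0:n-1).' * (0:n-1));  % reference Fourier matrix
err = norm(A - Fn);                             % should be rounding error only
```

Each pass of the loop corresponds to one stage of the iterative algorithm, with c = t (adjacent pairs, trivial twiddles) applied first and c = 1 (half-length butterflies) applied last.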
Iterative Implementation
function y = ipfft2(x)
% In-place Pease FFT algorithm.
% Input: x a vector of length n. n = 2^t, t an integer, t >= 0.
% Output: y = F_{2^t} x
% Algorithm: Conjugated Pease.
% F_{2^t} = { Prod_{c=1}^t L^n_{2^{t-c}}(I_{2^{t-1}} @ F_2)T_c L^n_{2^c} } R^{2^t}
%
n = length(x); t = ceil(log2(n));
y = bitreversal(x);
w = exp(-2*pi*i/n);
for c=t-1:-1:0
  for r=0:2^(t-1)-1
    r0 = mod(r,2^c); r1 = floor(r/2^c);
    a0 = r0*2^(t-c) + r1;        % 0-based address of the upper butterfly input
    a1 = a0 + 2^(t-c-1);         % 0-based address of the lower butterfly input
    y0 = y(a0+1); y1 = w^(r1*2^c) * y(a1+1);
    y(a0+1) = y0 + y1; y(a1+1) = y0 - y1;
  end
end
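To see why the inner loop of ipfft2 is a candidate for a parallel loop, it helps to tabulate the addresses it touches. The sketch below reproduces only the index arithmetic for t = 3 and records, for each stage c, the address pairs (a0, a1) and the twiddle exponent; A0, A1, K and disjoint are illustrative names, not part of the lecture.

```matlab
% Address pairs and twiddle exponents used by the in-place loop, t = 3.
t = 3; n = 2^t; h = 2^(t-1);
A0 = zeros(t, h); A1 = zeros(t, h); K = zeros(t, h);
disjoint = false(t, 1);
for c = t-1:-1:0
    r  = 0:h-1;                         % all butterflies of this stage
    r0 = mod(r, 2^c); r1 = floor(r / 2^c);
    a0 = r0 * 2^(t-c) + r1;             % upper input address (0-based)
    a1 = a0 + 2^(t-c-1);                % lower input address (0-based)
    A0(c+1,:) = a0; A1(c+1,:) = a1;
    K(c+1,:)  = r1 * 2^c;               % exponent of w = exp(-2*pi*1i/n)
    % Every address 0..n-1 appears exactly once per stage.
    disjoint(c+1) = isequal(sort([a0 a1]), 0:n-1);
    fprintf('c=%d  a0=%s  a1=%s  k=%s\n', c, mat2str(a0), mat2str(a1), mat2str(K(c+1,:)));
end
```

Because the pairs within a stage are disjoint, the iterations over r read and write separate locations and carry no dependence on one another; only the stages (the loop over c) must run in order.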