Parallel Processing (CS 730)
Lecture 9: Distributed Memory FFTs *
Jeremy R. Johnson
Wed. Mar. 1, 2001
* Parts of this lecture were derived from material by Johnson, Johnson, and Pryor.

Introduction
Objective: To derive and implement a distributed-memory parallel program for computing the fast Fourier transform (FFT).
Topics
–Derivation of the FFT
  Iterative version
  Pease Algorithm & Generalizations
  Tensor permutations
–Distributed implementation of tensor permutations
  Stride permutation
  Bit reversal
–Distributed FFT

FFT as a Matrix Factorization
Compute y = F_n x, where F_n is the n-point Fourier matrix.

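In the notation used in the code on the next slide, this is the radix-2 Cooley-Tukey factorization: for n = 2m,

  F_n = (F_2 ⊗ I_m) T^n_m (I_2 ⊗ F_m) L^n_2,

where L^n_2 is the stride-2 permutation and T^n_m is the diagonal matrix of twiddle factors.
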
Matrix Factorizations and Algorithms

function y = fft(x)
  n = length(x)
  if n == 1
    y = x
  else
    % [x0 x1] = L^n_2 x
    x0 = x(1:2:n-1); x1 = x(2:2:n);
    % [t0 t1] = (I_2 tensor F_m)[x0 x1]
    t0 = fft(x0); t1 = fft(x1);
    % w = W_m(omega_n)
    w = exp((2*pi*i/n)*(0:n/2-1));
    % y = [y0 y1] = (F_2 tensor I_m) T^n_m [t0 t1]
    y0 = t0 + w.*t1;
    y1 = t0 - w.*t1;
    y = [y0 y1]
  end

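A minimal numerical cross-check of this recursion (a Python/NumPy sketch, not part of the original slides; note that the slide's code uses the omega_n = e^{+2*pi*i/n} sign convention, whereas NumPy's np.fft.fft uses e^{-2*pi*i/n}):

import numpy as np

def fft_rec(x):
    # Radix-2 recursion from the slide: F_n = (F_2 tensor I_m) T^n_m (I_2 tensor F_m) L^n_2
    n = len(x)
    if n == 1:
        return x.copy()
    x0, x1 = x[0::2], x[1::2]            # L^n_2: even/odd split (stride-2 permutation)
    t0, t1 = fft_rec(x0), fft_rec(x1)    # I_2 tensor F_m: two half-size transforms
    w = np.exp(2j * np.pi * np.arange(n // 2) / n)   # twiddles, omega_n = e^{+2*pi*i/n}
    return np.concatenate([t0 + w * t1, t0 - w * t1])  # (F_2 tensor I_m) T^n_m

n = 8
x = np.random.randn(n) + 1j * np.random.randn(n)
F = np.exp(2j * np.pi * np.outer(np.arange(n), np.arange(n)) / n)  # dense F_n, same sign convention
assert np.allclose(fft_rec(x), F @ x)
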
Rewrite Rules

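A representative set of rewrite rules used to derive the FFT variants on the next slide (a standard set, given here as a sketch; A is m x m, B is n x n, and ⊗ is the tensor/Kronecker product):

  F_{2m} -> (F_2 ⊗ I_m) T^{2m}_m (I_2 ⊗ F_m) L^{2m}_2
  A ⊗ B -> (A ⊗ I_n)(I_m ⊗ B)
  I_m ⊗ (A B) -> (I_m ⊗ A)(I_m ⊗ B)
  A ⊗ I_n -> L^{mn}_m (I_n ⊗ A) L^{mn}_n   (commutation theorem)
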
FFT Variants
Cooley-Tukey
Recursive FFT
Iterative FFT
Vector FFT (Stockham)
Vector FFT (Korn-Lambiotte)
Parallel FFT (Pease)

Example TPL Programs

; Recursive 8-point FFT
(compose (tensor (F 2) (I 4))
         (T 8 4)
         (tensor (I 2)
                 (compose (tensor (F 2) (I 2))
                          (T 4 2)
                          (tensor (I 2) (F 2))
                          (L 4 2)))
         (L 8 2))

; Iterative 8-point FFT
(compose (tensor (F 2) (I 4))
         (T 8 4)
         (tensor (I 2) (F 2) (I 2))
         (tensor (I 2) (T 4 2))
         (tensor (I 4) (F 2))
         (tensor (I 2) (L 4 2))
         (L 8 2))

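The TPL operators map directly onto dense matrices, so the iterative formula can be checked numerically. A Python/NumPy sketch (the helper names F, I, L, T, tensor, compose are mine, chosen to mirror the TPL keywords, and are not part of TPL itself):

import numpy as np
from functools import reduce

w = lambda n: np.exp(2j * np.pi / n)          # omega_n = e^{+2*pi*i/n}, as in the fft code above
F = lambda n: np.array([[w(n) ** (j * k) for k in range(n)] for j in range(n)])
I = np.eye
def L(n, s):                                  # stride permutation: y = L^n_s x reads x at stride s
    P = np.zeros((n, n))
    for i in range(n // s):
        for j in range(s):
            P[j * (n // s) + i, i * s + j] = 1
    return P
def T(n, m):                                  # radix-2 twiddle matrix T^n_m, n = 2m
    return np.diag(np.concatenate([np.ones(m), w(n) ** np.arange(m)]))
tensor = lambda *As: reduce(np.kron, As)
compose = lambda *As: reduce(np.matmul, As)

# Iterative 8-point FFT, factor by factor as in the TPL program
F8 = compose(tensor(F(2), I(4)), T(8, 4),
             tensor(I(2), F(2), I(2)), tensor(I(2), T(4, 2)),
             tensor(I(4), F(2)),
             tensor(I(2), L(4, 2)), L(8, 2))
assert np.allclose(F8, F(8))
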
FFT Dataflow
Different formulas for the FFT have different dataflow (memory access patterns).
The dataflow in a class of FFT algorithms can be described by a sequence of permutations.
An “FFT dataflow” is a sequence of permutations that can be modified, by inserting butterfly computations (with appropriate twiddle factors), to form a factorization of the Fourier matrix.
FFT dataflows can be classified with respect to cost, and used to find “good” FFT implementations.

Distributed FFT Algorithm
Experiment with different dataflow and locality properties by changing radix and permutations.

Cooley-Tukey Dataflow

Pease Dataflow

Tensor Permutations
A natural class of permutations compatible with the FFT.
Let σ be a permutation of {1,…,t}. The corresponding tensor permutation is the mixed-radix counting permutation of vector indices: an index with mixed-radix digits (i_1,…,i_t) is sent to the index whose digits are permuted according to σ.
Well-known examples are stride permutations and bit reversal.

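A sketch of this definition in Python (my own helper, not from the slides; one common convention is shown, and the opposite convention differs only by replacing sigma with its inverse):

import numpy as np

def tensor_permutation(sigma, radices, x):
    # Permute a vector of length n = prod(radices) by permuting the mixed-radix
    # digits of each index: the element with digits (i_1, ..., i_t) moves to the
    # position with digits (i_sigma(1), ..., i_sigma(t)) in the permuted radices.
    # sigma is a permutation of 0..t-1 (0-based).
    t = len(radices)
    n = int(np.prod(radices))
    y = np.empty_like(x)
    for idx in range(n):
        # extract digits, most significant first: idx = ((i_1*n_2 + i_2)*n_3 + ...)
        digits, rem = [], idx
        for r in reversed(radices):
            digits.append(rem % r); rem //= r
        digits.reverse()
        # permuted digits and permuted radices give the destination index
        new_digits = [digits[sigma[k]] for k in range(t)]
        new_radices = [radices[sigma[k]] for k in range(t)]
        new_idx = 0
        for d, r in zip(new_digits, new_radices):
            new_idx = new_idx * r + d
        y[new_idx] = x[idx]
    return y

# Example: radices (4, 2) with sigma swapping the two digits gives the stride
# permutation L^8_2 (even-indexed elements first, as in the fft code above).
x = np.arange(8)
print(tensor_permutation([1, 0], [4, 2], x))   # -> [0 2 4 6 1 3 5 7]
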
Example (Stride Permutation)

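As a small concrete case: L^8_2 maps (x0, x1, …, x7) to (x0, x2, x4, x6, x1, x3, x5, x7); in terms of binary indices, the element at index b2 b1 b0 moves to index b0 b2 b1, a cyclic shift of the index bits.
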
Example (Bit Reversal)

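As a small concrete case: for n = 8, bit reversal sends the element at binary index b2 b1 b0 to index b0 b1 b2, so (x0, …, x7) becomes (x0, x4, x2, x6, x1, x5, x3, x7). It factors into stride permutations as R_8 = (I_2 ⊗ L^4_2) L^8_2, which is exactly the trailing permutation of the iterative TPL program above.
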
Twiddle Factor Matrix
Diagonal matrix containing roots of unity
Generalized twiddle (compatible with tensor permutations)

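In the radix-2 factorization above (and following the sign convention of the fft code earlier), the twiddle matrix for n = 2m is

  T^n_m = diag(I_m, W_m),   W_m = diag(1, ω_n, ω_n^2, …, ω_n^{m-1}),   ω_n = e^{2πi/n}.
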
Distributed Computation
Allocate equal-sized segments of the vector to each processor, and index the distributed vector with a processor id (pid) and a local offset.
Interpret tensor product operations with this addressing scheme:
  pid = b_{k+l-1} … b_l,   offset = b_{l-1} … b_1 b_0
(the high k bits of the binary index give the pid; the low l bits give the local offset)

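Equivalently, in code form (a restatement of the addressing scheme; i is the global index, N = 2^(k+l)):

pid    = i >> l               # high k bits: which processor owns element i
offset = i & ((1 << l) - 1)   # low l bits: position within that processor's segment
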
Distributed Tensor Product and Twiddle Factors
Assume P processors.
I_n ⊗ A becomes a parallel do over all processors when n ≥ P (each processor applies A to the blocks in its own segment, with no communication).
Twiddle factors are determined independently by each processor from the pid and offset.
The necessary bits are determined from I, J, and (n_1,…,n_t) in the generalized twiddle notation.

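A serial Python/NumPy sketch of why I_n ⊗ A needs no communication when n ≥ P (the layout follows the pid/offset scheme above; the helper names are mine):

import numpy as np

def apply_In_tensor_A_local(local_x, A):
    # local_x: this processor's contiguous segment of the global vector, of length
    # M = N/P. When n >= P (all powers of two), the segment consists of whole
    # length-(N/n) blocks, so (I_n tensor A) acts block-by-block on local data only.
    m = A.shape[0]                       # block size N/n
    assert len(local_x) % m == 0
    blocks = local_x.reshape(-1, m)      # the blocks owned by this processor
    return (blocks @ A.T).reshape(-1)    # apply A to each block

# Example: N = 16, P = 4 processors, n = 8 blocks of size 2 (so n >= P).
N, P, n = 16, 4, 8
A = np.array([[1.0, 1.0], [1.0, -1.0]])          # F_2 butterfly
x = np.arange(N, dtype=float)
segments = x.reshape(P, N // P)                  # row p = data on processor p
y_local = np.vstack([apply_In_tensor_A_local(seg, A) for seg in segments]).reshape(-1)
y_global = np.kron(np.eye(n), A) @ x             # reference: (I_n tensor A) x
assert np.allclose(y_local, y_global)
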
Distributed Tensor Permutations
A tensor permutation σ acts on the distributed index by permuting its bits: the element at index b_{k+l-1} … b_l | b_{l-1} … b_1 b_0 (pid | offset) is sent to the index with bits b_σ(k+l-1) … b_σ(l) | b_σ(l-1) … b_σ(1) b_σ(0).

Classes of Distributed Tensor Permutations
1. Local (pid bits are fixed by σ): only permute elements locally within each processor.
2. Global (offset bits are fixed by σ): permute the entire local arrays amongst the processors.
3. Global*Local (σ moves bits within the pid and within the offset, but no bit crosses the pid/offset boundary): permute elements locally, followed by a global permutation.
4. Mixed (at least one offset bit and one pid bit are exchanged): elements from a processor are sent/received to/from more than one processor.

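One way to make the classification concrete (a Python sketch; sigma is given as a 0-based permutation of the bit positions 0 … k+l-1, with the high k bits forming the pid as above):

def classify_tensor_permutation(sigma, k, l):
    # sigma[i] = source bit that lands in destination bit position i
    # (either convention, sigma or its inverse, gives the same classification).
    pid_bits = set(range(l, k + l))          # high k bits
    offset_bits = set(range(l))              # low l bits
    pid_fixed = all(sigma[i] == i for i in pid_bits)
    offset_fixed = all(sigma[i] == i for i in offset_bits)
    crosses = any((i in pid_bits) != (sigma[i] in pid_bits) for i in range(k + l))
    if pid_fixed:
        return "Local"          # only offset bits move
    if offset_fixed:
        return "Global"         # only pid bits move
    if not crosses:
        return "Global*Local"   # both fields permuted, no bit crosses the boundary
    return "Mixed"              # at least one pid/offset bit exchanged

# Example: full bit reversal on k = 3 pid bits and l = 5 offset bits is Mixed,
# matching the distributed bit reversal slide below.
k, l = 3, 5
bit_reversal = [k + l - 1 - i for i in range(k + l)]
print(classify_tensor_permutation(bit_reversal, k, l))   # -> Mixed
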
Distributed Stride Permutation
Example: Y = L^16_2 X on 8 PEs with 2 elements per PE; the element of X at the left index (pid|offset) lands at the right index of Y.
  000|0 -> 000|0    100|0 -> 010|0
  000|1 -> 100|0    100|1 -> 110|0
  001|0 -> 000|1    101|0 -> 010|1
  001|1 -> 100|1    101|1 -> 110|1
  010|0 -> 001|0    110|0 -> 011|0
  010|1 -> 101|0    110|1 -> 111|0
  011|0 -> 001|1    111|0 -> 011|1
  011|1 -> 101|1    111|1 -> 111|1

Communication Pattern
X(0:2:6)  Y(4:1:3)    X(1:2:7)  Y(0:1:7)

Communication Pattern
Each PE sends 1/2 of its data to 2 different PEs.

Communication Pattern
Each PE sends 1/4 of its data to 4 different PEs.

Communication Pattern
Each PE sends 1/8 of its data to 8 different PEs.

Implementation of Distributed Stride Permutation

D_Stride(Y, N, t, P, k, M, l, S, j, X)
// Compute Y = L^N_S X
// Inputs
//   Y, X: distributed vectors of size N = 2^t,
//         with M = 2^l elements per processor
//   P = 2^k = number of processors
//   S = 2^j, 0 <= j <= k, is the stride
// Output
//   Y = L^N_S X
p = pid
for i = 0, ..., 2^j - 1 do
  put x(i : S : i + S*(N/S - 1)) in y((N/S)*(p mod S) : (N/S)*(p mod S) + N/S - 1)
    on PE p/2^j + i*2^(k-j)

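A serial Python sketch (not the shmem/MPI code used in the experiments) that recovers the communication pattern of the preceding slides directly from the definition of L^N_S and the pid|offset layout:

from collections import defaultdict

def stride_destinations(t, k, j):
    # For Y = L^N_S X with N = 2^t, P = 2^k processors, S = 2^j and the
    # pid|offset layout (pid = high k bits), return for each source PE the
    # set of destination PEs it must send data to.
    N, S = 2 ** t, 2 ** j
    l = t - k                              # offset bits, M = 2^l elements per PE
    dests = defaultdict(set)
    for src in range(N):                   # global index of an element of x
        # L^N_S gathers x with stride S: x[src] lands at position
        # (src mod S) * (N // S) + src // S of y.
        dst = (src % S) * (N // S) + src // S
        dests[src >> l].add(dst >> l)      # source PE -> destination PE
    return dests

# Example: N = 2^12 on P = 8 PEs; strides 2, 4, 8 reproduce the pattern on the
# earlier slides (each PE sends 1/2, 1/4, 1/8 of its data to 2, 4, 8 PEs).
for j in (1, 2, 3):
    d = stride_destinations(t=12, k=3, j=j)
    print(2 ** j, sorted(len(s) for s in d.values()))
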
Cyclic Scheduling
Each PE sends 1/4 of its data to 4 different PEs.

Distributed Bit Reversal Permutation
A mixed tensor permutation.
Implement using the factorization:
  b7 b6 b5 | b4 b3 b2 b1 b0  ->  b5 b6 b7 | b0 b1 b2 b3 b4   (reverse the pid bits and the offset bits separately)
  b7 b6 b5 | b4 b3 b2 b1 b0  ->  b0 b1 b2 | b3 b4 b5 b6 b7   (full bit reversal)

Experiments on the CRAY T3E
All experiments were performed on a 240-node (8x4x8 with partial plane) T3E, using 128 processors (300 MHz) with 128 MB memory.
–Task 1 (pairwise communication): implemented with shmem_get, shmem_put, and mpi_sendrecv
–Task 2 (all 7! = 5040 global tensor permutations): implemented with shmem_get, shmem_put, and mpi_sendrecv
–Task 3 (local tensor permutations of the form I ⊗ L ⊗ I on vectors of size 2^22 words; only run on a single node): implemented using streams on/off and cache bypass
–Task 4 (distributed stride permutations): implemented using shmem_iput, shmem_iget, and mpi_sendrecv

Task 1 Performance Data
Task 2 Performance Data
Task 3 Performance Data
Task 4 Performance Data

Network Simulator
An idealized simulator for the T3E was developed (with C. Grassl from Cray Research) in order to study contention.
–Specify the processor layout, the route table, and the number of virtual processors with a given start node
–Each processor can simultaneously issue a single send
–Contention is measured as the maximum number of messages across any edge/node
The simulator was used to study global and mixed tensor permutations.

Task 2 Grid Simulation Analysis

Task 2 Torus Simulation Analysis