
Slide 1: Parallel Processing (CS 730), Lecture 9: Distributed Memory FFTs*
Jeremy R. Johnson
Wed. Mar. 1, 2001
*Parts of this lecture were derived from material by Johnson, Johnson, and Pryor.

Slide 2: Introduction
Objective: To derive and implement a distributed-memory parallel program for computing the fast Fourier transform (FFT).
Topics
- Derivation of the FFT: iterative version, Pease algorithm and generalizations, tensor permutations
- Distributed implementation of tensor permutations: stride permutation, bit reversal
- Distributed FFT

Slide 3: FFT as a Matrix Factorization
Compute y = F_n x, where F_n is the n-point Fourier matrix.
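
For reference, the radix-2 factorization used throughout, as spelled out in the comments of the code on the next slide, with n = 2m:

    F_n = (F_2 ⊗ I_m) T^n_m (I_2 ⊗ F_m) L^n_2

where L^n_2 is the even/odd (stride) permutation, T^n_m is the diagonal twiddle matrix, and ⊗ is the tensor (Kronecker) product.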

Slide 4: Matrix Factorizations and Algorithms

function y = fft(x)
  n = length(x)
  if n == 1
    y = x
  else
    % [x0 x1] = L^n_2 x
    x0 = x(1:2:n-1); x1 = x(2:2:n);
    % [t0 t1] = (I_2 tensor F_m)[x0 x1]
    t0 = fft(x0); t1 = fft(x1);
    % w = W_m(omega_n)
    w = exp((2*pi*i/n)*(0:n/2-1));
    % y = [y0 y1] = (F_2 tensor I_m) T^n_m [t0 t1]
    y0 = t0 + w.*t1; y1 = t0 - w.*t1;
    y = [y0 y1]
  end
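
A minimal Python/NumPy transcription of the same recursion, as a sketch for experimentation; the helper name fft_rec and the check against n*ifft (which matches the slide's +2*pi*i/n sign convention) are my additions, not part of the original slides.

    import numpy as np

    def fft_rec(x):
        # Recursive radix-2 FFT following F_n = (F_2 ⊗ I_m) T^n_m (I_2 ⊗ F_m) L^n_2.
        n = len(x)
        if n == 1:
            return x.astype(complex)
        x0, x1 = x[0::2], x[1::2]                          # L^n_2: even/odd split
        t0, t1 = fft_rec(x0), fft_rec(x1)                  # I_2 ⊗ F_m
        w = np.exp(2j * np.pi * np.arange(n // 2) / n)     # W_m(omega_n), as on the slide
        return np.concatenate([t0 + w * t1, t0 - w * t1])  # (F_2 ⊗ I_m) T^n_m

    x = np.random.rand(8) + 1j * np.random.rand(8)
    # With the +2*pi*i/n convention, the result equals n * numpy.fft.ifft(x).
    assert np.allclose(fft_rec(x), len(x) * np.fft.ifft(x))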

Slide 5: Rewrite Rules

Slide 6: FFT Variants
- Cooley-Tukey
- Recursive FFT
- Iterative FFT
- Vector FFT (Stockham)
- Vector FFT (Korn-Lambiotte)
- Parallel FFT (Pease)

Slide 7: Example TPL Programs

; Recursive 8-point FFT
(compose (tensor (F 2) (I 4))
         (T 8 4)
         (tensor (I 2) (compose (tensor (F 2) (I 2))
                                (T 4 2)
                                (tensor (I 2) (F 2))
                                (L 4 2)))
         (L 8 2))

; Iterative 8-point FFT
(compose (tensor (F 2) (I 4))
         (T 8 4)
         (tensor (I 2) (F 2) (I 2))
         (tensor (I 2) (T 4 2))
         (tensor (F 2) (I 4))
         (tensor (I 2) (L 4 2))
         (L 8 2))
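
These operators can be checked numerically. Below is a small sketch (my own helper names, NumPy, radix-2 twiddle only, and the same sign and stride conventions as the slide-4 code) that interprets F, I, T, L, tensor, and compose as matrices and verifies the recursive program against the 8-point Fourier matrix.

    import numpy as np
    from functools import reduce

    def F(n):                       # n-point Fourier matrix, omega_n = exp(2*pi*i/n)
        j = np.arange(n)
        return np.exp(2j * np.pi * np.outer(j, j) / n)

    def I(n):
        return np.eye(n)

    def L(n, s):                    # stride permutation: gather every s-th element
        idx = np.arange(n).reshape(n // s, s).T.ravel()
        return np.eye(n)[idx]

    def T(n, m):                    # twiddle T^n_m for n = 2m: diag(I_m, W_m(omega_n))
        assert n == 2 * m, "this sketch handles the radix-2 case only"
        w = np.exp(2j * np.pi * np.arange(m) / n)
        return np.diag(np.concatenate([np.ones(m), w]))

    def tensor(*ops):
        return reduce(np.kron, ops)

    def compose(*ops):
        return reduce(np.matmul, ops)

    # Recursive 8-point FFT, transcribed from the TPL program above:
    F8 = compose(tensor(F(2), I(4)),
                 T(8, 4),
                 tensor(I(2), compose(tensor(F(2), I(2)), T(4, 2),
                                      tensor(I(2), F(2)), L(4, 2))),
                 L(8, 2))
    assert np.allclose(F8, F(8))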

Slide 8: FFT Dataflow
Different formulas for the FFT have different dataflow (memory access patterns).
The dataflow in a class of FFT algorithms can be described by a sequence of permutations.
An "FFT dataflow" is a sequence of permutations that can be modified, by inserting butterfly computations with the appropriate twiddle factors, to form a factorization of the Fourier matrix.
FFT dataflows can be classified with respect to cost and used to find "good" FFT implementations.

Slide 9: Distributed FFT Algorithm
Experiment with different dataflow and locality properties by changing the radix and the permutations.

Slide 10: Cooley-Tukey Dataflow

Slide 11: Pease Dataflow

Slide 12: Tensor Permutations
A natural class of permutations compatible with the FFT.
Let σ be a permutation of {1,…,t}; the associated tensor permutation is the mixed-radix counting permutation of vector indices (it permutes the digits of the mixed-radix index).
Well-known examples are stride permutations and bit reversal.

Slide 13: Example (Stride Permutation)
L^8_2 acts on the binary index as a cyclic shift of the bits:
000 -> 000, 001 -> 100, 010 -> 001, 011 -> 101, 100 -> 010, 101 -> 110, 110 -> 011, 111 -> 111

Slide 14: Example (Bit Reversal)
000 -> 000, 001 -> 100, 010 -> 010, 011 -> 110, 100 -> 001, 101 -> 101, 110 -> 011, 111 -> 111
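
Both tables are instances of a permutation of the index bits. A small Python sketch (hypothetical helper, bit 0 = least significant) that regenerates them:

    def permute_bits(i, perm, nbits):
        # Move bit j of i to bit position perm[j].
        out = 0
        for j in range(nbits):
            out |= ((i >> j) & 1) << perm[j]
        return out

    stride = [2, 0, 1]    # cyclic shift of the 3 index bits: reproduces slide 13 (L^8_2)
    rev = [2, 1, 0]       # bit reversal: reproduces slide 14
    for i in range(8):
        print(f"{i:03b} -> {permute_bits(i, stride, 3):03b}    "
              f"{i:03b} -> {permute_bits(i, rev, 3):03b}")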

Slide 15: Twiddle Factor Matrix
A diagonal matrix containing roots of unity.
Generalized twiddle (compatible with tensor permutations).

Slide 16: Distributed Computation
Allocate equal-sized segments of the vector to each processor, and index the distributed vector with a pid and a local offset.
Interpret tensor product operations with this addressing scheme.
The index bits split as b_{k+l-1} … b_l | b_{l-1} … b_1 b_0, with the high k bits forming the pid and the low l bits the offset.

Slide 17: Distributed Tensor Product and Twiddle Factors
Assume P processors.
I_n ⊗ A becomes a parallel do over all processors when n ≥ P.
Twiddle factors are determined independently on each processor from the pid and offset.
The necessary bits are determined from I, J, and (n_1,…,n_t) in the generalized twiddle notation.
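
The generalized twiddle notation is not reproduced in this transcript; as a minimal illustration for the radix-2 twiddle T^N_{N/2} only (helper name and signature are my assumptions), each processor can compute its diagonal entries from nothing but its pid and local offsets:

    import numpy as np

    def local_twiddle_radix2(pid, offset, P, M):
        # Diagonal entry of T^N_{N/2}, N = P*M, for the element stored at (pid, offset).
        # It depends only on the global index i = pid*M + offset, so no communication
        # is needed to build the local part of the twiddle matrix.
        N = P * M
        i = pid * M + offset
        return 1.0 if i < N // 2 else np.exp(2j * np.pi * (i - N // 2) / N)

    # e.g. the local twiddle diagonal on processor pid:
    #   [local_twiddle_radix2(pid, o, P, M) for o in range(M)]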

Slide 18: Distributed Tensor Permutations
The index bits b_{k+l-1} … b_l | b_{l-1} … b_1 b_0 (pid | offset) are rearranged by σ into b_{σ(k+l-1)} … b_{σ(l)} | b_{σ(l-1)} … b_{σ(1)} b_{σ(0)}.

Slide 19: Classes of Distributed Tensor Permutations
1. Local (pid is fixed by σ): only permute elements locally within each processor.
2. Global (offset is fixed by σ): permute the entire local arrays amongst the processors.
3. Global*Local (bits in pid and bits in offset are moved by σ, but no bits cross the pid/offset boundary): permute elements locally, followed by a global permutation.
4. Mixed (at least one offset bit and one pid bit are exchanged): elements from a processor are sent/received to/from more than one processor.
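
A small sketch of how the four classes can be detected from the bit permutation σ (Python; the function name and the perm[j] = "destination position of source bit j" encoding are my assumptions), using the pid/offset split of slide 16:

    def classify(perm, k, l):
        # perm[j] is the destination position of index bit j; bits l..k+l-1 form the
        # pid and bits 0..l-1 the offset (slide 16).
        offset = set(range(l))
        crossing = any((j in offset) != (perm[j] in offset) for j in range(k + l))
        if crossing:
            return "Mixed"                      # at least one bit crosses the boundary
        if all(perm[j] == j for j in range(l, k + l)):
            return "Local"                      # pid bits fixed, only offset bits move
        if all(perm[j] == j for j in offset):
            return "Global"                     # offset bits fixed, only pid bits move
        return "Global*Local"                   # both fields move, but no bit crosses

    k, l = 3, 5
    bit_reversal = [k + l - 1 - j for j in range(k + l)]
    print(classify(bit_reversal, k, l))         # -> Mixed (cf. slide 27)
    within_fields = [l - 1 - j for j in range(l)] + [2*l + k - 1 - j for j in range(l, k + l)]
    print(classify(within_fields, k, l))        # -> Global*Local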

Slide 20: Distributed Stride Permutation
Source (pid | offset pattern) -> destination (pid | offset pattern), where "…" stands for the untouched middle offset bits:
000|…0 -> 000|0…    000|…1 -> 100|0…
001|…0 -> 000|1…    001|…1 -> 100|1…
010|…0 -> 001|0…    010|…1 -> 101|0…
011|…0 -> 001|1…    011|…1 -> 101|1…
100|…0 -> 010|0…    100|…1 -> 110|0…
101|…0 -> 010|1…    101|…1 -> 110|1…
110|…0 -> 011|0…    110|…1 -> 111|0…
111|…0 -> 011|1…    111|…1 -> 111|1…

Slide 21: Communication Pattern
Figure: data movement for the distributed stride permutation across 8 elements; the strided source slices X(0:2:6) and X(1:2:7) map to contiguous destination slices Y(0:1:3) and Y(4:1:7).

Slide 22: Communication Pattern
Figure: 8 PEs. Each PE sends 1/2 of its data to 2 different PEs.

Slide 23: Communication Pattern
Figure: 8 PEs. Each PE sends 1/4 of its data to 4 different PEs.

Slide 24: Communication Pattern
Figure: 8 PEs. Each PE sends 1/8 of its data to 8 different PEs.

Slide 25: Implementation of Distributed Stride Permutation

D_Stride(Y, N, t, P, k, M, l, S, j, X)
  // Compute Y = L^N_S X
  // Inputs
  //   Y, X: distributed vectors of size N = 2^t,
  //         with M = 2^l elements per processor
  //   P = 2^k: number of processors
  //   S = 2^j, 0 <= j <= k: the stride
  // Output
  //   Y = L^N_S X
  p = pid
  for i = 0, ..., 2^j - 1 do
    put x(i:S:i+S*(n/S-1))
      in y((n/S)*(p mod S) : (n/S)*(p mod S)+N/S-1)
      on PE p/2^j + i*2^{k-j}
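
A quick NumPy check (a sketch; the read-at-stride convention for L^N_S and all names are my assumptions) that the distributed stride permutation produces the communication patterns of slides 22-24: with S = 2^j, each PE's data ends up on 2^j different PEs.

    import numpy as np
    from collections import defaultdict

    def dest_counts(t, k, j):
        # N = 2^t elements, P = 2^k PEs with M = N/P each, stride S = 2^j (j <= k).
        N, P, S = 1 << t, 1 << k, 1 << j
        M = N // P
        src = np.arange(N).reshape(N // S, S).T.ravel()   # src[d] = source index of y[d]
        sends = defaultdict(set)
        for d, g in enumerate(src):
            sends[g // M].add(d // M)                     # source PE -> set of destination PEs
        return {pe: len(dsts) for pe, dsts in sends.items()}

    for j in (1, 2, 3):                                   # slides 22, 23, 24
        assert all(c == 1 << j for c in dest_counts(t=6, k=3, j=j).values())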

Slide 26: Cyclic Scheduling
Figure: 8 PEs. Each PE sends 1/4 of its data to 4 different PEs.

Slide 27: Distributed Bit Reversal Permutation
A mixed tensor permutation; implement it using a factorization:
b7 b6 b5 | b4 b3 b2 b1 b0  ->  b5 b6 b7 | b0 b1 b2 b3 b4   (bits reversed within the pid field and within the offset field)
b7 b6 b5 b4 b3 b2 b1 b0    ->  b0 b1 b2 b3 b4 b5 b6 b7     (the full bit reversal)
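
A small Python check of this two-step implementation, as a sketch under my reading of the figure: the first step reverses the bits within the pid field and within the offset field, and the remaining step is a cyclic shift of all index bits by k positions (the bit-level action of a stride permutation); composing the two gives the full reversal.

    def permute_bits(i, perm, nbits):
        # Move bit j of i to bit position perm[j] (same helper as in the earlier sketch).
        out = 0
        for j in range(nbits):
            out |= ((i >> j) & 1) << perm[j]
        return out

    k, l = 3, 5                       # 3 pid bits, 5 offset bits, as in the figure
    n = k + l
    full_rev  = [n - 1 - j for j in range(n)]
    field_rev = [l - 1 - j for j in range(l)] + [2*l + k - 1 - j for j in range(l, n)]
    rotate_k  = [(j + k) % n for j in range(n)]

    for i in range(1 << n):
        step1 = permute_bits(i, field_rev, n)
        assert permute_bits(step1, rotate_k, n) == permute_bits(i, full_rev, n)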

Slide 28: Experiments on the CRAY T3E
All experiments were performed on a 240-node (8x4x8 with partial plane) T3E using 128 processors (300 MHz) with 128 MB memory.
- Task 1 (pairwise communication): implemented with shmem_get, shmem_put, and mpi_sendrecv.
- Task 2 (all 7! = 5040 global tensor permutations): implemented with shmem_get, shmem_put, and mpi_sendrecv.
- Task 3 (local tensor permutations of the form I ⊗ L ⊗ I on vectors of size 2^22 words; only run on a single node): implemented using streams on/off and cache bypass.
- Task 4 (distributed stride permutations): implemented using shmem_iput, shmem_iget, and mpi_sendrecv.

Slide 29: Task 1 Performance Data

Slide 30: Task 2 Performance Data

Slide 31: Task 3 Performance Data

Slide 32: Task 4 Performance Data

Slide 33: Network Simulator
An idealized simulator for the T3E was developed (with C. Grassl from Cray Research) in order to study contention.
- Specify the processor layout, the route table, and the number of virtual processors with a given start node.
- Each processor can simultaneously issue a single send.
- Contention is measured as the maximum number of messages across any edge/node.
The simulator was used to study global and mixed tensor permutations.
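
The simulator itself is not included in this transcript. As a toy illustration of the contention metric only (a 1-D ring with shortest-path routing rather than the T3E's 3-D torus and route tables; all names are mine):

    def ring_contention(perm, P):
        # Each node p sends one message to perm[p]; messages travel the shorter way
        # around the ring. Contention = maximum number of messages on any directed edge.
        load = {}
        for p, q in enumerate(perm):
            d = (q - p) % P
            step = 1 if d <= P // 2 else -1
            node = p
            while node != q:
                nxt = (node + step) % P
                load[(node, nxt)] = load.get((node, nxt), 0) + 1
                node = nxt
        return max(load.values(), default=0)

    # Exchanging opposite halves of an 8-node ring puts 4 messages on some edges:
    print(ring_contention([(p + 4) % 8 for p in range(8)], 8))   # -> 4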

Slide 34: Task 2 Grid Simulation Analysis

Slide 35: Task 2 Grid Simulation Analysis

Slide 36: Task 2 Torus Simulation Analysis

Slide 37: Task 2 Torus Simulation Analysis

