Parallelizing the Fast Fourier Transform David Monismith cs599
Outline
An example use of the Fast Fourier Transform (FFT): audio processing.
Explanation of the Discrete Fourier Transform (DFT).
Using divide and conquer to implement the recursive FFT (Cooley-Tukey algorithm).
Creating an iterative algorithm.
Parallelizing the FFT.
Examples: Audio Processing
Compression
– In audio and video processing, often only certain frequencies can be heard or seen.
– The FFT or a similar tool can be used to remove frequency data that cannot be seen or heard.
– Only the important data (the frequencies we care about) needs to be stored.
– An inverse FFT (or similar operation) can be applied to decompress the data.
Audio synthesis
– Frequencies can be quickly added or adjusted and converted to a signal (a sound) by using the FFT.
– Such operations are often applied in audio synthesizers.
Band Pass Filter Algorithm
Convert the signal to the frequency domain with the FFT.
Multiply the desired frequencies by 1.
Multiply the remaining frequencies by 0.
Apply the inverse FFT to convert the signal back to the space-time domain.
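The four steps above can be sketched in Python. This is a minimal illustration, not the deck's own code: the helper names `fft` and `band_pass`, the test signal, and the choice of bins are all assumptions made for the example. The forward transform uses the positive exponent e^(2*pi*i/n) that the deck's pseudocode uses, so the inverse uses the negative exponent and divides by n. For a real signal, bin k and its mirror bin n - k are kept together.

```python
import cmath
import math

def fft(x, sign=1):
    """Recursive radix-2 transform; len(x) must be a power of two.
    sign=+1 is the forward transform in this sketch, sign=-1 the
    unnormalized inverse."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2], sign)
    odd = fft(x[1::2], sign)
    out = [0j] * n
    for i in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * i / n)
        out[i] = even[i] + w * odd[i]
        out[i + n // 2] = even[i] - w * odd[i]
    return out

def band_pass(signal, lo, hi):
    """Keep frequency bins lo..hi (and their mirror bins); zero the rest."""
    n = len(signal)
    spectrum = fft(signal, +1)                      # to the frequency domain
    kept = [v if lo <= min(k, n - k) <= hi else 0   # multiply by 1 or by 0
            for k, v in enumerate(spectrum)]
    return [v.real / n for v in fft(kept, -1)]      # back to the signal domain

# Hypothetical test signal: 4 cycles plus 20 cycles over 64 samples.
n = 64
low = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]
high = [0.5 * math.sin(2 * math.pi * 20 * t / n) for t in range(n)]
filtered = band_pass([a + b for a, b in zip(low, high)], 3, 5)
```

With bins 3 to 5 kept, `filtered` should match the 4-cycle component up to floating-point rounding, since the 20-cycle component falls entirely in the zeroed bins.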
Discrete Fourier Transform
Converts a sequence of values from the space-time domain to the frequency domain.
Useful for signal processing.
The standard DFT is too slow for large inputs: it requires O(n^2) operations.
Each output value is a sum over all the components of the space-time sequence.
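Discrete Fourier Transform
The nth root of unity $\omega_n$ is defined as
$\omega_n = e^{2\pi i / n}$
(the positive exponent matches the pseudocode later in these slides; many libraries use the negative sign for the forward transform).
The DFT of a sequence $x[0], \dots, x[n-1]$ can be computed as
$X[k] = \sum_{j=0}^{n-1} x[j]\,\omega_n^{jk}, \quad k = 0, \dots, n-1,$
which is the matrix-vector product $X = Wx$ with $W_{kj} = \omega_n^{jk}$.
Many values in the matrix are repeated, but the repetition is not obvious.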
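As a point of comparison for the faster algorithms, the direct O(n^2) computation can be written in a few lines of Python. This is a sketch that assumes the positive-exponent convention omega_n = e^(2*pi*i/n) used by the deck's pseudocode; the function name `dft` is chosen for this example only.

```python
import cmath

def dft(x):
    """Direct DFT: every output is a sum over all n inputs, so n*n terms."""
    n = len(x)
    omega = cmath.exp(2j * cmath.pi / n)   # nth root of unity (assumed sign)
    return [sum(x[j] * omega ** (j * k) for j in range(n)) for k in range(n)]

impulse = dft([1, 0, 0, 0])    # a unit impulse transforms to all ones
constant = dft([1, 1, 1, 1])   # a constant transforms to n in bin 0 only
```

The two nested loops (one explicit in the list comprehension, one inside `sum`) are where the n^2 cost comes from.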
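Example of Repetition
Assume that N = 5.
Notice that the first root of unity (the first power of $\omega_5$) is:
$\omega_5^{1} = e^{2\pi i / 5}$
The sixth root of unity (the sixth power) is
$\omega_5^{6} = e^{12\pi i / 5} = e^{2\pi i}\, e^{2\pi i / 5} = \omega_5^{1}$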
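The repetition can be checked numerically. The sketch below assumes the same e^(2*pi*i/n) convention as the deck's pseudocode; `root_power` is a hypothetical helper name. It compares the first and sixth powers for N = 5 and counts how many distinct exponents (mod N) appear among the N*N entries of the DFT matrix.

```python
import cmath

def root_power(n, k):
    """k-th power of the nth root of unity e^(2*pi*i/n)."""
    return cmath.exp(2j * cmath.pi * k / n)

n = 5
first = root_power(n, 1)
sixth = root_power(n, 6)
same = abs(first - sixth) < 1e-12          # equal up to rounding

# The matrix entry in row k, column j is omega^(j*k); only j*k mod n matters.
distinct = len({(j * k) % n for j in range(n) for k in range(n)})
```

Here `distinct` is 5 even though the matrix has 25 entries, which is the redundancy the FFT exploits.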
Fast Fourier Transform (FFT)
Repetition in the roots of unity is easiest to exploit when the array size is a power of two.
A divide and conquer algorithm called the FFT takes advantage of this repetition.
The FFT can compute a DFT in O(n lg n) time by multiplying array values by the appropriate root of unity and adding the array value at an appropriate stride.
This algorithm allows for fast compression and manipulation of signals.
Divide and Conquer in the FFT
[Figure: data flow of an 8-point FFT. Inputs x[0] to x[7] pass through intermediate stages s1[0] to s1[7] and s2[0] to s2[7] to produce outputs X[0] to X[7].]
Cooley-Tukey FFT Algorithm
y = fft(x, n, stride)
    if(n == 1)
        y[0] = x[0];
    else
        y1 = fft(x, n/2, 2*stride);
        y2 = fft(x+stride, n/2, 2*stride);
        for(int i = 0; i < n/2; i++)
            y[i] = y1[i] + e^((2*PI*I*i)/n)*y2[i];
            y[i+n/2] = y1[i] + e^((2*PI*I*(i+n/2))/n)*y2[i];
        end for
    end if
end fft
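The pseudocode above translates fairly directly into runnable Python. The sketch below is one possible rendering, not the course's reference code: it passes a `start` offset instead of pointer arithmetic (`x+stride`), and it uses the identity e^(2*pi*i*(i+n/2)/n) = -e^(2*pi*i*i/n) so the twiddle factor is computed once per pair. The input length is assumed to be a power of two.

```python
import cmath

def fft_recursive(x, n=None, start=0, stride=1):
    """Recursive Cooley-Tukey FFT over x[start], x[start+stride], ...
    Uses the positive exponent e^(2*pi*i/n), as in the pseudocode."""
    if n is None:
        n = len(x)
    if n == 1:
        return [complex(x[start])]
    half = n // 2
    y1 = fft_recursive(x, half, start, 2 * stride)           # even-indexed
    y2 = fft_recursive(x, half, start + stride, 2 * stride)  # odd-indexed
    y = [0j] * n
    for i in range(half):
        w = cmath.exp(2j * cmath.pi * i / n)
        y[i] = y1[i] + w * y2[i]
        y[i + half] = y1[i] - w * y2[i]   # same as adding w shifted by n/2
    return y

spectrum = fft_recursive([1, 2, 3, 4, 0, -1, -2, -3])
```

Because the exponent is positive, results for real input are the complex conjugates of what a negative-exponent library routine returns.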
Example Code Let’s quickly take a look at some example code for the Recursive FFT. Given an array of values: [ 8, 7.0, 9.0, -1.3, 6.3, 8.5, 4.2, 9.1, -5.2 ] Matlab (and Octave) tell us that the fft is: [ *i, *i, *i, *i, *i, *i, *i, *i]
Iterative FFT
y = fft(x, n) {
    r = ceil(log2(n));
    //Allocate arrays R and S to be of size n
    R = x;
    for(m = 0; m < r; m++) {
        S = R; //Copy all n elements of R into S
        //Elements to add at each stage differ in exactly one bit.
        bit = 1 << (r - m - 1);
        notBit = ~bit;
        for(i = 0; i < n; i++) {
            j = i & notBit;
            k = i | bit;
            expFactor = revAndShift(i, r, m);
            R[i] = S[j] + S[k] * cexp( (2*PI*I*expFactor)/n );
        }
    }
    //y[i] holds the DFT value whose index is the bit reversal of i
    y = R;
}
Reverse and Shift Function
//Given i = b_0 b_1 b_2 … b_(r-1) (b_0 is the most significant bit),
//obtain b_m b_(m-1) … b_0 0 … 0
//Note that there are r - m - 1 zeros.
result = revAndShift(unsigned int i, int r, int m) {
    i = i >> (r-m-1); //remove unwanted bits
    result = 0;
    for(int j = 0; j < m+1; j++) {
        result |= i & 1;
        i = i >> 1;
        if(j < m)
            result = result << 1;
    }
    //pad result with zeros
    result = result << (r-m-1);
    return result;
}
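The iterative algorithm and its helper can be combined into a short runnable Python sketch. The function names mirror the pseudocode but are otherwise assumptions of this example, and the input length is assumed to be a power of two. One property is easy to miss: the result comes out in bit-reversed order, so entry i holds the DFT value whose index is i with its r bits reversed.

```python
import cmath

def rev_and_shift(i, r, m):
    """Take the top m+1 of i's r bits, reverse them, pad with r-m-1 zeros."""
    i >>= (r - m - 1)                  # drop the low bits we do not need
    result = 0
    for _ in range(m + 1):
        result = (result << 1) | (i & 1)
        i >>= 1
    return result << (r - m - 1)

def fft_iterative(x):
    n = len(x)
    r = n.bit_length() - 1             # log2(n) for a power of two
    R = [complex(v) for v in x]
    for m in range(r):
        S = list(R)                    # copy the whole previous stage
        bit = 1 << (r - m - 1)
        for i in range(n):
            j = i & ~bit               # partner index with the stage bit cleared
            k = i | bit                # partner index with the stage bit set
            e = rev_and_shift(i, r, m)
            R[i] = S[j] + S[k] * cmath.exp(2j * cmath.pi * e / n)
    return R

def bit_reverse(i, r):
    """Reverse all r bits of i (the m = r-1 case of rev_and_shift)."""
    return rev_and_shift(i, r, r - 1)

y = fft_iterative([1, 2, 3, 4, 0, -1, -2, -3])
# Reorder into natural index order: X[bit_reverse(i)] = y[i].
X = [y[bit_reverse(k, 3)] for k in range(8)]
```

Bit reversal is its own inverse, which is why the same permutation restores natural order in the last line.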
Parallelizing the FFT
Assume the number of processes (i.e. running programs) is a power of 2.
The number of processes will be referred to as npes.
Assume the array size (N) is a power of 2 and is larger than the number of processes.
Partition the array into npes chunks of N/npes elements each.
Assign one chunk to each process.
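The partitioning above amounts to simple index arithmetic. A minimal sketch, assuming contiguous chunks assigned in rank order (the helper name `chunk_bounds` is hypothetical):

```python
def chunk_bounds(n, npes, rank):
    """Half-open index range [start, end) owned by the given rank."""
    work = n // npes          # elements per process; n and npes are powers of 2
    start = rank * work
    return start, start + work

bounds = [chunk_bounds(8, 4, rank) for rank in range(4)]
# bounds == [(0, 2), (2, 4), (4, 6), (6, 8)]
```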
Parallelizing the FFT
[Figure: data flow of an 8-point FFT, from x[0] to x[7] through stages s1 and s2 to X[0] to X[7], with the array divided among Process 0, Process 1, Process 2, and Process 3.]
Parallelizing the FFT
Notice that every process must both send and receive data in each of the first lg(npes) stages.
Additionally, notice that in a stage where data transfer must occur, each process sends data to one partner process and receives data from that same process.
This operation can be accomplished using a function called MPI_Sendrecv from the Message Passing Interface API.
We will investigate the algorithm to perform this operation next.
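To make the pairing concrete, the sketch below computes which partner each rank exchanges with in stage m. It makes no MPI calls; it only reproduces the rank arithmetic, using the rule from the deck's parallel pseudocode (a pair distance of npes / 2^(m+1)). The function name `exchange_partner` is an assumption of this example.

```python
def exchange_partner(rank, npes, m):
    """Partner rank for stage m, or None once the stage is purely local."""
    split = npes // (1 << (m + 1))
    if split == 0:
        return None                    # stage m needs no communication
    if rank % (2 * split) < split:
        return rank + split            # lower half of the pair
    return rank - split                # upper half of the pair

# Partners for 4 processes over 3 stages (an 8-point transform).
table = [[exchange_partner(rank, 4, m) for rank in range(4)] for m in range(3)]
# table == [[2, 3, 0, 1], [1, 0, 3, 2], [None, None, None, None]]
```

The pairing is symmetric in every communicating stage, which is what allows a single combined send-and-receive call per stage.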
Parallel Algorithm
//Rank is the process id and npes is the
//number of processes
y = fft(x, n, rank, npes) {
    r = ceil(log2(n));
    workToDo = n/npes;
    start = rank*workToDo;
    end = start + workToDo;
    //Allocate arrays S, Sk, and R of size workToDo
    R = x[start…end-1];
    for(int m = 0; m < r; m++) {
        Sk = S = R;
        bit = 1 << (r - m - 1);
        notbit = ~bit;
        splitPoint = npes / (1 << (m+1));
Parallel FFT Algorithm, Cont’d
        if(splitPoint > 0) {
            if( ( rank % (splitPoint << 1) ) < splitPoint)
                Send S to process rank + splitPoint, and receive Sk from process rank + splitPoint.
            else
                Send Sk to process rank - splitPoint, and receive S from process rank - splitPoint.
        }
        else
            Sk = S;
Parallel FFT Algorithm, Cont’d
        for(int i = start, l = 0; l < workToDo; i++, l++) {
            j = (i & notbit) % workToDo;
            k = (i | bit) % workToDo;
            expFactor = revAndShift(i, r, m);
            R[l] = S[j] + Sk[k] * e^( (2*PI*I*expFactor)/n );
        }
    } //end of the loop over stages m
    //y is this process's chunk of the result
    y = R;
}
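The parallel algorithm can be exercised without an MPI installation by simulating all ranks inside one Python program. In the sketch below, a snapshot of every rank's chunk stands in for the send and receive step; that substitution, along with the function names, is an assumption of this example rather than part of the original algorithm. Concatenating the per-rank chunks gives the transform in bit-reversed order, the same ordering the iterative algorithm produces.

```python
import cmath

def rev_and_shift(i, r, m):
    """Take the top m+1 of i's r bits, reverse them, pad with r-m-1 zeros."""
    i >>= (r - m - 1)
    result = 0
    for _ in range(m + 1):
        result = (result << 1) | (i & 1)
        i >>= 1
    return result << (r - m - 1)

def simulated_parallel_fft(x, npes):
    n = len(x)
    r = n.bit_length() - 1
    work = n // npes
    # local[rank] plays the role of array R on that process.
    local = [[complex(v) for v in x[p * work:(p + 1) * work]] for p in range(npes)]
    for m in range(r):
        bit = 1 << (r - m - 1)
        split = npes // (1 << (m + 1))
        prev = [list(chunk) for chunk in local]   # stands in for the exchange
        for rank in range(npes):
            S = Sk = prev[rank]
            if split > 0:
                if rank % (2 * split) < split:
                    Sk = prev[rank + split]       # "receive Sk" from partner
                else:
                    S = prev[rank - split]        # "receive S" from partner
            start = rank * work
            for l in range(work):
                i = start + l                     # global index of element l
                j = (i & ~bit) % work
                k = (i | bit) % work
                e = rev_and_shift(i, r, m)
                local[rank][l] = S[j] + Sk[k] * cmath.exp(2j * cmath.pi * e / n)
    return [v for chunk in local for v in chunk]  # gathered, bit-reversed order

result = simulated_parallel_fft([1, 2, 3, 4, 0, -1, -2, -3], 4)
```

In a real MPI program each rank would run only its own iteration of the `for rank` loop, and the snapshot would be replaced by a paired send and receive with the partner rank.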