Rader’s FFT algorithm acceleration using Maxeler
Author: Tadej Matek
Fourier Transform
The Fourier transform decomposes a signal into its frequency components.
Used in telecommunications, data compression, digital signal processing, fast multiplication of polynomials, ...
Source: http://fweb.wallawalla.edu/class-wiki/index.php/DFT_example_using_MATLAB_-_HW11
Fourier Transform and computers
Transformation: Discrete Fourier Transform (DFT). Time: O(n^2)
Algorithm(s): Fast Fourier Transform (FFT) (Cooley-Tukey, Bruun’s FFT, Rader’s FFT, Bluestein’s FFT, ...). Time: O(n log n)
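To make the O(n^2) cost concrete, here is a minimal sketch of the DFT evaluated directly from its definition. The function name `naive_dft` and the sign convention (a negative exponent) are assumptions chosen for illustration, not something the slides specify.

```python
import cmath


def naive_dft(x):
    """Direct DFT: X[j] = sum over m of x[m] * exp(-2*pi*i*j*m/n).

    Two nested loops over n elements give the O(n^2) running time.
    """
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * j * m / n) for m in range(n))
        for j in range(n)
    ]
```

Every output element touches every input element, which is exactly the work an FFT avoids by reusing partial sums.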
Why is the FFT faster than the DFT?
Divide & conquer + properties of primitive roots
Primitive root of unity: (formula not preserved in this text)
Conquer step (butterfly): (formula not preserved in this text)
Source: http://mathworld.wolfram.com/images/gifs/rootsu.gif
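For reference, the standard textbook forms of a primitive n-th root of unity and of the radix-2 butterfly are sketched below. These are an assumption about what the slide displayed; the notation (E_j and O_j for the transforms of the even- and odd-indexed halves, and the negative sign in the exponent) is chosen here for illustration and may differ from the original.

```latex
\[
\begin{aligned}
\omega_n &= e^{-2\pi i / n}, \qquad \omega_n^{\,n} = 1, \qquad \omega_n^{\,k} \neq 1 \ \text{ for } 0 < k < n \\
X_j &= E_j + \omega_n^{\,j}\, O_j, \qquad
X_{j+n/2} = E_j - \omega_n^{\,j}\, O_j, \qquad 0 \le j < n/2
\end{aligned}
\]
```

The second line relies on the property that \(\omega_n^{\,n/2} = -1\), so each pair of outputs shares one multiplication.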
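Rader’s FFT algorithm overview
Primitive root defined as: (formula not preserved in this text)
Bit reversal rev_k(i): reverses the k-bit binary representation of i.
Example, rev_4(3): 3 (base 10) = 0011 (base 2) → 1100 (base 2) = 12 (base 10)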
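The bit reversal can be sketched in a few lines. The function name `rev` and its argument order are assumptions made for this illustration.

```python
def rev(k, i):
    """Reverse the k least-significant bits of the integer i."""
    result = 0
    for _ in range(k):
        result = (result << 1) | (i & 1)  # move the lowest bit of i onto result
        i >>= 1
    return result
```

With this definition, `rev(4, 3)` evaluates to 12, matching the example above, and `rev(2, 1)` evaluates to 2.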
Example of calculation
n = 4, k = log2(n) = 2, z = 5, p = 13
Input: 8, 2, 2, 4
First level (i = 0: s = rev_k(i) = 0; i = 1: s = rev_k(i) = 2):
(8 + z^0 * 2) % 13 = 10
(8 + z^2 * 2) % 13 = 6
(2 + z^0 * 4) % 13 = 6
(2 + z^2 * 4) % 13 = 11
Intermediate values: 10, 6 and 6, 11
Second level (i = 0, 1, 2, 3: s = rev_k(i) = 0, 2, 1, 3):
Output: 3, 4, 9, 3
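The worked example can be checked with a short script. This is a minimal sketch that assumes the level structure implied by the numbers above (one butterfly level per bit, exponent s obtained by bit reversal, all arithmetic modulo p); the function names `rev` and `transform_levels` are hypothetical and not taken from the original implementation.

```python
def rev(k, i):
    """Reverse the k least-significant bits of i."""
    result = 0
    for _ in range(k):
        result = (result << 1) | (i & 1)
        i >>= 1
    return result


def transform_levels(a, z, p):
    """Return the array after every level; the last one is in bit-reversed order."""
    n = len(a)
    k = n.bit_length() - 1          # n is assumed to be a power of two
    cur = list(a)
    levels = []
    for level in range(k):
        shift = k - 1 - level
        half = 1 << shift           # distance between the two butterfly inputs
        nxt = [0] * n
        for i in range(n):
            s = rev(k, i >> shift)  # exponent of z for this output
            lo = i & ~half          # input index with this level's bit cleared
            hi = i | half           # input index with this level's bit set
            nxt[i] = (cur[lo] + pow(z, s, p) * cur[hi]) % p
        cur = nxt
        levels.append(cur)
    return levels
```

Calling `transform_levels([8, 2, 2, 4], 5, 13)` gives `[[10, 6, 6, 11], [3, 4, 9, 3]]`. The final array holds the transform in bit-reversed index order.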
Example: fast multiplication
How to multiply two large polynomials?
Basic approach: multiply each term of the first polynomial with each term of the second, which takes O(n^2) time.
Using FFT: compute the DFT of both polynomials, multiply the results pointwise in O(n) time, then apply the inverse FFT, for O(n log n) time in total.
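The three steps (transform, pointwise product, inverse transform) can be sketched over the integers modulo a prime. This is an illustrative sketch, not the author's implementation: it uses a plain recursive radix-2 transform, the names `ntt` and `poly_mul` are hypothetical, and it assumes both inputs are zero-padded to the same power-of-two length n, that z is a primitive n-th root of unity modulo p, and that Python 3.8+ is available for modular inverses via `pow`.

```python
def ntt(a, z, p):
    """Recursive radix-2 transform modulo p; output is in natural order."""
    n = len(a)
    if n == 1:
        return list(a)
    even = ntt(a[0::2], z * z % p, p)
    odd = ntt(a[1::2], z * z % p, p)
    out = [0] * n
    w = 1
    for j in range(n // 2):
        t = w * odd[j] % p
        out[j] = (even[j] + t) % p           # butterfly: sum
        out[j + n // 2] = (even[j] - t) % p  # butterfly: difference
        w = w * z % p
    return out


def poly_mul(f, g, z, p):
    """Multiply two coefficient lists of equal power-of-two length, modulo p."""
    n = len(f)
    F = ntt(f, z, p)
    G = ntt(g, z, p)
    H = [x * y % p for x, y in zip(F, G)]    # pointwise product, O(n)
    h = ntt(H, pow(z, -1, p), p)             # inverse transform uses z^-1
    n_inv = pow(n, -1, p)
    return [c * n_inv % p for c in h]
```

For example, `poly_mul([1, 2, 0, 0], [3, 1, 0, 0], 5, 13)` returns `[3, 7, 2, 0]`, that is (1 + 2x)(3 + x) = 3 + 7x + 2x^2. Coefficients come back reduced modulo p, so p has to exceed the largest expected coefficient.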
Dataflow implementation (1)
Levels of the example: 8, 2, 2, 4, then 10, 6 and 6, 11, then 3, 4, 9, 3
Data dependency! The kernel needs updated data for each level.
Solution: LMem
Dataflow implementation (2)
Diagram components: CPU, Manager, Kernel, LMem, with the input sequence going in and the output sequence coming out in steps labelled (1), (2), (3), ...
CPU: call kernel k times
Manager streams data in and out of Kernel
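The control flow in the diagram can be modelled in software. The following is only a simulation sketch under stated assumptions: a Python list stands in for LMem, `kernel_pass` stands in for one run of the DFE kernel, and `run_transform` stands in for the CPU-side loop. None of these names come from the Maxeler API or from the original code, and the table of z powers is assumed to be prepared on the CPU before the first kernel call.

```python
def bit_reverse(k, i):
    """Reverse the k least-significant bits of i."""
    result = 0
    for _ in range(k):
        result = (result << 1) | (i & 1)
        i >>= 1
    return result


def precompute_twiddles(k, z, p):
    """CPU side: table[level][i] = z^rev_k(i >> (k-1-level)) mod p."""
    n = 1 << k
    return [
        [pow(z, bit_reverse(k, i >> (k - 1 - level)), p) for i in range(n)]
        for level in range(k)
    ]


def kernel_pass(data, twiddles, half, p):
    """Stand-in for one kernel run: computes one full level from the previous one."""
    return [
        (data[i & ~half] + twiddles[i] * data[i | half]) % p
        for i in range(len(data))
    ]


def run_transform(seq, z, p):
    """CPU-side loop: one kernel call per level, k calls in total."""
    k = len(seq).bit_length() - 1
    table = precompute_twiddles(k, z, p)
    lmem = list(seq)  # models the sequence held in LMem between calls
    for level in range(k):
        lmem = kernel_pass(lmem, table[level], 1 << (k - 1 - level), p)
    return lmem
```

Because each level reads the complete result of the previous level, the data has to be written back between calls, which is the role LMem plays in the design. For the example sequence, `run_transform([8, 2, 2, 4], 5, 13)` returns `[3, 4, 9, 3]`.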
Dataflow implementation (3)
LMem works in bursts (example: 384 B, but the burst size depends on the DFE)
Good for consecutive calculations
The powers of z are calculated on the CPU and written to LMem
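One practical consequence of burst-based access is that transfers are sized in whole bursts. A minimal sketch of the rounding arithmetic follows; the helper name `padded_size` and the 4-byte element size in the usage note are assumptions for illustration, and 384 B is only the example burst size quoted above.

```python
def padded_size(num_bytes, burst_bytes=384):
    """Round num_bytes up to the next multiple of burst_bytes."""
    bursts = -(-num_bytes // burst_bytes)  # ceiling division
    return bursts * burst_bytes
```

Assuming 4-byte elements, a single sequence of length 32 occupies 128 B and would be padded to one full 384 B burst, while three such sequences stored back to back fill a burst exactly. This is one way to see why many consecutive calculations suit LMem better than a single short one.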
Performance & results (1)
CPU used for testing: Intel Core2 Quad Processor Q9400, 2.86 GHz
DFE used for testing: Maxeler card of type MAX2336B
Performance & results (2)
Conditions: BIG data, 95% of run time spent in loops
Type of experiments: consecutive calculations, starting from 10K and going up to 10M
Consecutive calculations were run for input sequences of length 32, 64, 128 and 256
Performance & results (3)
Figure: execution time, N = 32, for CPU and DFE
Performance & results (4)
Figure: speedup according to the number of consecutive calculations for N = 32
Performance & results (5)
Figure: speedup according to the number of consecutive calculations for N = 64
Performance & results (6)
Figure: speedup according to the number of consecutive calculations for N = 256
Performance & results (7)
Figure: speedup according to the size of the input sequence (for 100K calculations)
Conclusion
FFTs are among the most widely used algorithms today
Massive speedups are possible, but they require consecutive calculations
Power usage: reduced due to the lower clock frequency (200 MHz vs 2.86 GHz)