A New Class of High Performance FFTs
Dr. J. Greg Nash
Centar
High Performance Embedded Computing (HPEC) Workshop
September 2006

New Base-4 DFT Matrix Equation
Traditional DFT matrix form: Z = CX
New matrix form for DFT †: Y = W_M • (C_M1 X), Z = C_M2 Y^t
“•” = element-by-element multiply
C_M1 and C_M2 contain only elements from the set {1, -1, j, -j}
  – C_M1 X and C_M2 Y^t involve only complex additions/subtractions
Twiddle factor matrix W_M is of size N/4 x N/4 rather than the N x N of C
  – 16x fewer multiplies than the traditional DFT equation (Z = CX)
† J. G. Nash, “Computationally efficient systolic architecture for computing the discrete Fourier transform,” IEEE Transactions on Signal Processing, Volume 53, Issue 12, Dec. 2005, pp. – 4651.

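The base-4 form Y = W_M • (C_M1 X), Z = C_M2 Y^t can be checked numerically. The following is a minimal sketch in Python with NumPy, not code from the presentation. The function name `base4_dft`, the index layout (X[i, k] = x[(N/4)i + k], and Z[k, j] holding output sample j + (N/4)k), and the restriction to N being a multiple of 16 are all assumptions, chosen so that the loop nest on the SPADE slide reproduces the standard DFT. The layout in the cited paper may differ.

```python
import numpy as np

# Powers of -j indexed by exponent mod 4: the only values used in C_M1 and C_M2.
UNIT = np.array([1, -1j, -1, 1j])

def base4_dft(x):
    """DFT of x via Y = W_M * (C_M1 @ X), Z = C_M2 @ Y.T (assumed index layout)."""
    x = np.asarray(x, dtype=complex)
    n = x.size
    if n == 0 or n % 16 != 0:
        raise ValueError("this sketch assumes N is a positive multiple of 16")
    m = n // 4
    rows = np.arange(m)
    quad = np.arange(4)
    X = x.reshape(4, m)                                  # X[i, k] = x[m*i + k]
    CM1 = UNIT[np.outer(rows, quad) % 4]                 # m x 4, entries in {1, -1, j, -j}
    CM2 = UNIT[np.outer(quad, rows) % 4]                 # 4 x m, entries in {1, -1, j, -j}
    WM = np.exp(-2j * np.pi * np.outer(rows, rows) / n)  # m x m twiddle factors
    Y = WM * (CM1 @ X)   # elementwise product: the m*m = N^2/16 general multiplies
    Z = CM2 @ Y.T        # Z[k, j] holds output sample j + m*k
    return Z.reshape(n)
```

In this sketch the products with `CM1` and `CM2` are written as ordinary matrix products for brevity; because their entries are only 1, -1, j and -j, hardware can realize them with additions, subtractions and real/imaginary swaps. Comparing `base4_dft(x)` with `np.fft.fft(x)` for a random input is a quick sanity check.
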
Find Systolic Architecture Using SPADE †
Mathematical algorithm (input code):
    for j to N/4 do
      # twiddle-weighted 4-term sums: Y = W_M • (C_M1 X)
      for k to N/4 do
        Y[j,k] := WM[j,k]*add(CM1[j,i]*X[i,k], i=1..4);
      od;
      # second set of sums over row j of Y: Z = C_M2 Y^t
      for k to 4 do
        Z[k,j] := add(CM2[k,i]*Y[j,i], i=1..N/4);
      od
    od;
FPGA architectural constraints
  – 2-D mesh array
  – Fine-grained PEs (registers, adder, mux)
  – Linear arrays of multipliers, memory
Objective functions
Automatic search for space-time transformations, T
Simulator, graphical outputs
† Symbolic Parallel Algorithm Development Environment

Functional Operation
Processing flow for a DFT of length N = N_1 * N_2
  – Stage 1: N_2 column DFTs (X_ci) of length N_1
  – Stage 2: Twiddle multiplication
  – Stage 3: N_1 row DFTs (X_ri) of length N_2
Systolic adder arrays for matrix multiplication
  – N_1/4 x 4 array for column multiplies C_M1 X_ci and C_M2 Y_ci^t
  – N_2/4 x 4 array for row multiplies C_M1 X_ri and C_M2 Y_ri^t
N_2/4 x 4 array is implemented virtually on one row of the N_1/4 x 4 array
  – Uses systolic 1-D array matrix multiplication

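The three-stage flow for N = N_1 * N_2 can be sketched in a few lines of Python with NumPy. This is an illustration under stated assumptions, not the presented circuit: `np.fft.fft` stands in for the short column and row DFTs that the slides compute on systolic adder arrays, and the function name `three_stage_dft` and the row-major input layout are hypothetical choices.

```python
import numpy as np

def three_stage_dft(x, n1, n2):
    """Length n1*n2 DFT as column DFTs, twiddle multiply, then row DFTs."""
    x = np.asarray(x, dtype=complex)
    n = n1 * n2
    if x.size != n:
        raise ValueError("len(x) must equal n1 * n2")
    a = x.reshape(n1, n2)                  # a[r, c] = x[n2*r + c]
    b = np.fft.fft(a, axis=0)              # stage 1: n2 column DFTs of length n1
    k1 = np.arange(n1).reshape(n1, 1)      # column-DFT output index
    c = np.arange(n2).reshape(1, n2)       # column number
    b = b * np.exp(-2j * np.pi * k1 * c / n)   # stage 2: twiddle multiplication
    d = np.fft.fft(b, axis=1)              # stage 3: n1 row DFTs of length n2
    return d.T.reshape(n)                  # d[k1, k2] is output sample k1 + n1*k2
```

For the N = 1024 example size, `three_stage_dft(x, 32, 32)` should agree with `np.fft.fft(x)` to within rounding error. The final transpose reflects the index ordering of this particular decomposition.
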
FFT Systolic Architecture
Simple PEs, locally connected
  – Higher clock speeds
  – Easier design/test/maintainability
  – Lower power
  – Efficient use of FPGA fabric
Simple control
Small memory blocks (one per PE)
  – Faster read/write times
  – Lower power
Linear structure (scales in N/S direction)
  – Matches fabric of FPGA linear distributed embedded elements (e.g., memory and multipliers)
Example architecture for N = 1024 (N_1 = N_2 = 32)

Enhanced Functionality
Transform size N not restricted to powers of two
  – N = 256n (n = 1, 2, 3, ...)
  – More reachable points
  – Uniform distribution of points
Circuit is scalable
  – Any DFT size can be computed on the same hardware with sufficient memory
  – Larger FFT circuits constructed by replication of identical 4x4 PE array processing blocks
Low computational latency
  – Pipeline depth is small compared with traditional pipelined FFTs
1-D and 2-D transforms possible on the same circuit

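To make "more reachable points" and "uniform distribution" concrete, the short Python sketch below lists the transform sizes of the form N = 256n next to the power-of-two sizes over the same range. The function names and the 4096 upper limit are illustrative assumptions only.

```python
def sizes_256n(limit):
    """Transform sizes N = 256n (n = 1, 2, 3, ...) up to and including limit."""
    return list(range(256, limit + 1, 256))

def sizes_power_of_two(limit, start=256):
    """Power-of-two transform sizes from start up to and including limit."""
    sizes = []
    n = start
    while n <= limit:
        sizes.append(n)
        n *= 2
    return sizes
```

Up to 4096 points, `sizes_256n(4096)` gives 16 evenly spaced sizes, while `sizes_power_of_two(4096)` gives 5 sizes whose spacing doubles at each step.
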
Block Floating Point/Floating Point Operation
Multiple “regions”, each with its own block floating point and floating point circuitry (32 regions in a 1024-point FFT)
  – Column DFTs use block floating point and row DFTs use floating point
  – Higher dynamic range and lower signal to noise ratio
Number of regions increases with transform size
Supports streaming FFTs
Comparison of “single tone”, random frequency and phase data sets (DR = dynamic range, “noise” = roundoff noise)

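As background for the block floating point regions, the sketch below shows the general idea of block floating point in Python with NumPy: all values in a block share one exponent, set by the largest component. It is a generic illustration, not the circuit described in the slides; the function name `bfp_quantize`, the 16-bit default mantissa width, and the rounding and clipping choices are assumptions.

```python
import numpy as np

def bfp_quantize(block, mantissa_bits=16):
    """Quantize a complex block to signed mantissas sharing one exponent."""
    block = np.asarray(block, dtype=complex)
    peak = max(np.abs(block.real).max(), np.abs(block.imag).max())
    if peak == 0.0:
        return block.copy(), 0
    _, exponent = np.frexp(peak)           # peak < 2**exponent
    exponent = int(exponent)
    step = 2.0 ** (exponent - (mantissa_bits - 1))   # value of one mantissa LSB
    lo = -(2 ** (mantissa_bits - 1))
    hi = 2 ** (mantissa_bits - 1) - 1
    re = np.clip(np.round(block.real / step), lo, hi)
    im = np.clip(np.round(block.imag / step), lo, hi)
    return (re + 1j * im) * step, exponent
```

With an 8-bit mantissa, quantizing the block `[1000, 0.001]` keeps 1000 but rounds 0.001 to zero, because the shared exponent is set by the larger value. Per-value floating point avoids this loss, which is one way to see why mixing the two formats can extend dynamic range.
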
Performance Comparison: 256-point DFT
Altera block floating point circuit
“Streaming” (continuous data in and out)
Comparable dynamic range and signal to (roundoff) noise ratio
Both circuits mapped to Altera Stratix II EP2S15F484C3 FPGA
Altera circuit from Megacore FFT v2.2.0
Results from timing analysis (Altera Quartus 5.1 software)

Preliminary Figure of Merit
Altera block floating point circuits
“Streaming” (continuous data in and out)
Comparable dynamic range and signal to noise ratio
Circuits mapped to Altera Stratix II FPGAs
Altera circuit from Megacore FFT v2.2.0
FOM = Area (ALMs) x Throughput (Cycles/DFT) / Clock (MHz)
*Estimate (no timing analysis or layout)

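The figure of merit is simple arithmetic, and a small Python helper makes the units explicit. Cycles per DFT divided by clock rate in MHz is the time per transform in microseconds, so the FOM is an area-time product in ALM-microseconds, and a smaller value would indicate less area and less time per transform. The function name and the numbers in the usage note are hypothetical; they are not results from the presentation.

```python
def figure_of_merit(area_alms, cycles_per_dft, clock_mhz):
    """FOM = Area (ALMs) x Throughput (cycles/DFT) / Clock (MHz)."""
    if clock_mhz <= 0:
        raise ValueError("clock_mhz must be positive")
    # cycles_per_dft / clock_mhz is the time per DFT in microseconds.
    return area_alms * cycles_per_dft / clock_mhz
```

For a made-up circuit using 1000 ALMs that needs 256 cycles per DFT at 200 MHz, `figure_of_merit(1000, 256, 200)` evaluates to 1280.0.
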
Comparative Features
Transform size N not restricted to powers of two
Circuit is scalable
Uses block floating point and floating point
Higher throughput
Low computational latency
Based on small, simple PE (adder), locally connected
1-D or 2-D transforms