Centar (www.centar.net) Global Signal Processing Expo


1 Centar (www.centar.net) Global Signal Processing Expo
A High Performance Block Floating Point Systolic FFT Not Limited to Powers of Two
Dr. J. Greg Nash, Centar (www.centar.net)
Global Signal Processing Expo, October 30 to November 2, 2006

2 Desired Features
Transform size N not restricted to powers of two
Scalable circuit
High dynamic range
Low computational latency
1-D or 2-D transforms
Simple circuit
High throughput

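3 Discrete Fourier Transform
Mathematical form: $Z = C X$, where $C$ is the $M \times M$ coefficient matrix with $C_{jk} = W^{jk}$ and $W$ a primitive $M$-th root of unity
C (M=16): [coefficient matrix not preserved in this transcript]
Multiplications = $M^2$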
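To make the $M^2$ multiplication count concrete, here is a minimal sketch of the direct matrix form of the DFT. The function names `dft_matrix` and `direct_dft` are illustrative, and the sign convention $W = e^{-2\pi i/M}$ is an assumption chosen to match `numpy.fft`; the slide itself does not preserve the convention.

```python
import numpy as np

def dft_matrix(M):
    """Return the M x M coefficient matrix C with C[j, k] = W**(j*k)."""
    idx = np.arange(M)
    # Assumed convention: W = exp(-2*pi*i/M), the same one numpy.fft uses.
    return np.exp(-2j * np.pi * np.outer(idx, idx) / M)

def direct_dft(x):
    """Compute Z = C X directly; every output needs M multiplications."""
    x = np.asarray(x, dtype=complex)
    return dft_matrix(len(x)) @ x

M = 16
x = np.arange(M, dtype=float)
Z = direct_dft(x)
multiplications = M * M  # M outputs times M products each
```

The quadratic cost of this direct form is what the factored base-4 formulation on the following slides is meant to reduce.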

4 Inputs X and Outputs Z in Bit-reversed Form (N=16)
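Cb = [16 x 16 coefficient matrix; layout lost in extraction. Surviving symbols: d1, d2, d3, d4, 1, -1, -I, W, and the exponents 2, 3, 4, 6, 9]
The quoted operator (symbol lost in extraction) = element by element multiply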

5 Base-4 DFT Matrix Equation
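General Form (as written out in the input code on slide 6): $Y = W_M \odot (C_{M1} X)$, $Z = C_{M2} Y^{T}$, where $\odot$ is the element-by-element multiply
Coefficient matrices are [definitions not preserved in this transcript]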

6 Find Systolic Architecture Using SPADE†
Mathematical Algorithm
Input Code:
    for j to M/4 do
      for k to M/4 do
        Y[j,k] := WM[j,k]*add(CM1[j,i]*X[i,k], i=1..4);
      od;
      for k to 4 do
        Z[k,j] := add(CM2[k,i]*Y[j,i], i=1..M/4);
      od;
    od;
FPGA Architectural Constraints
-2-D mesh array
-fine grained PEs (registers, adder, mux)
-linear arrays of multipliers, memory
Objective Functions
Automatic Search for Space-Time Transformations, T
Simulator, Graphical Outputs
†Symbolic Parallel Algorithm Development Environment
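The loop nest in the input code can be sketched in Python to show that it is equivalent to the compact matrix form `CM2 @ (WM * (CM1 @ X)).T`. This is only a shape-level illustration: the actual coefficient matrices are not preserved in this transcript, so random complex placeholders stand in for `CM1`, `WM` and `CM2`, and the function names are hypothetical.

```python
import numpy as np

def base4_loops(CM1, WM, CM2, X):
    """Literal 0-based translation of the loop nest in the input code."""
    m4 = X.shape[1]  # plays the role of M/4
    Y = np.zeros((m4, m4), dtype=complex)
    Z = np.zeros((4, m4), dtype=complex)
    for j in range(m4):
        for k in range(m4):
            Y[j, k] = WM[j, k] * sum(CM1[j, i] * X[i, k] for i in range(4))
        for k in range(4):
            Z[k, j] = sum(CM2[k, i] * Y[j, i] for i in range(m4))
    return Z

def base4_matrix(CM1, WM, CM2, X):
    """Same computation as matrix products plus one element-wise multiply."""
    return CM2 @ (WM * (CM1 @ X)).T

def random_complex(rng, shape):
    """Placeholder data; NOT the real coefficient matrices."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

rng = np.random.default_rng(0)
M = 32
m4 = M // 4
CM1 = random_complex(rng, (m4, 4))
WM = random_complex(rng, (m4, m4))
CM2 = random_complex(rng, (4, m4))
X = random_complex(rng, (4, m4))

Z_loops = base4_loops(CM1, WM, CM2, X)
Z_matrix = base4_matrix(CM1, WM, CM2, X)
```

Both functions return a 4 x M/4 array, which matches the index ranges used for `Z[k,j]` in the input code.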

7 DFT Architecture
[Figure: base-4 DFT equations shown alongside the corresponding base-4 DFT architecture]

8 Base-4 DFT Array (M=16)

9 Base-4 DFT Array (M=32)

10 Processing flow for DFT of length N = Nr Nc
Nc column DFTs (Xci) of length Nr
- Array length is Nr/4
- N/4 clock cycles
Twiddle multiplication
- Only multipliers used
- 4 Nc clock cycles
- Without this step a 2-D FFT is done
Nr row DFTs (Xri) of length Nc
- (Nc)^2/4 clock cycles
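The three stages of this processing flow can be sketched numerically as follows. This is a standard row/column (Cooley-Tukey style) factorization written under assumed index conventions (input index n = Nc*n1 + n2, output index k = k1 + Nr*k2, and W = exp(-2*pi*i/N)); the slides do not state the data ordering used by the hardware, and the function names are illustrative.

```python
import numpy as np

def dft_matrix(n):
    """n x n DFT matrix with the exp(-2*pi*i/n) convention (an assumption)."""
    idx = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(idx, idx) / n)

def row_column_stages(A, apply_twiddle=True):
    """Run the column DFTs, optional twiddle step and row DFTs on A (Nr x Nc)."""
    Nr, Nc = A.shape
    B = dft_matrix(Nr) @ A                      # Nc column DFTs of length Nr
    if apply_twiddle:
        k1 = np.arange(Nr)
        n2 = np.arange(Nc)
        B = B * np.exp(-2j * np.pi * np.outer(k1, n2) / (Nr * Nc))
    return B @ dft_matrix(Nc)                   # Nr row DFTs of length Nc

def row_column_dft(x, Nr, Nc):
    """1-D DFT of length N = Nr*Nc built from the three stages."""
    A = np.asarray(x, dtype=complex).reshape(Nr, Nc)  # A[n1, n2] = x[Nc*n1 + n2]
    C = row_column_stages(A, apply_twiddle=True)
    return C.T.reshape(-1)                      # output index k = k1 + Nr*k2

rng = np.random.default_rng(1)
Nr, Nc = 16, 48                                 # N = 768, not a power of two
x = rng.standard_normal(Nr * Nc) + 1j * rng.standard_normal(Nr * Nc)
Z = row_column_dft(x, Nr, Nc)
A = x.reshape(Nr, Nc)
Z2d = row_column_stages(A, apply_twiddle=False)
```

With `apply_twiddle=False` the same stages yield the 2-D transform of the Nr x Nc array, which is the behavior the slide notes for skipping the twiddle step.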

11 Possible Transform Sizes
Base-4: Matrix derivation requires M = 16, 32, 48, ...
N = Nr Nc = (16p)(16q) = 256n
Base-2: Matrix derivation assumes M = 4, 8, 12, ...
N = Nr Nc = (4p)(4q) = 16n
Base-2 (no row/column factorization)
N = M = 4n
(n, p, q = 1, 2, 3, ...)
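As a quick check of the base-4 size rule N = (16p)(16q), the following sketch enumerates the admissible (Nr, Nc) pairs for a given N. The helper names `base4_factorizations` and `base4_sizes` are hypothetical and only encode the arithmetic stated on the slide.

```python
def base4_factorizations(N):
    """List (Nr, Nc) pairs with Nr = 16p, Nc = 16q and Nr * Nc == N."""
    pairs = []
    for Nr in range(16, N // 16 + 1, 16):
        if N % Nr == 0 and (N // Nr) % 16 == 0:
            pairs.append((Nr, N // Nr))
    return pairs

def base4_sizes(limit):
    """All transform sizes up to `limit` that admit a base-4 factorization."""
    return [N for N in range(256, limit + 1, 256) if base4_factorizations(N)]

sizes = base4_sizes(1024)
pairs_768 = base4_factorizations(768)
pairs_1024 = base4_factorizations(1024)
```

Because p = n, q = 1 is always allowed, every multiple of 256 has at least one factorization; sizes with several factorizations leave a choice of Nr and Nc.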

12 FFT Performance Comparisons
Based on “Streaming” FFT (continuous data in and out)
Benchmark against Altera FFT (Block Floating Point)
Base-4 16-bit circuit
Choose Altera circuit with comparable signal to (roundoff) noise ratio
Circuits mapped to same Altera Stratix II FPGA
Same compiler used (Altera Quartus)

13 Block Floating Point Usage
Each row has separate BFP support circuitry
Row DFT inputs normalized to same exponent
Row DFT outputs use FP
One exponent for each output point
Comparison of “single tone” data sets: N=1024
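The idea of normalizing a block of values to one shared exponent can be illustrated with a small software model. This is a generic block floating point sketch, not the circuit's actual scheme: the functions `bfp_quantize` and `bfp_dequantize`, the rounding mode and the choice of exponent are all assumptions, with the 16-bit mantissa width borrowed from the word length quoted for the base-4 circuit.

```python
import numpy as np

def bfp_quantize(block, mantissa_bits=16):
    """Quantize a complex block to integer mantissas sharing one exponent."""
    block = np.asarray(block, dtype=complex)
    peak = max(np.abs(block.real).max(), np.abs(block.imag).max())
    if peak == 0.0:
        zeros = np.zeros(block.shape, dtype=np.int64)
        return zeros, zeros.copy(), 0
    _, e = np.frexp(peak)                     # peak = m * 2**e, 0.5 <= m < 1
    exponent = int(e) - (mantissa_bits - 1)   # shared exponent for the block
    scale = np.ldexp(1.0, exponent)
    lo = -(1 << (mantissa_bits - 1))
    hi = (1 << (mantissa_bits - 1)) - 1
    re = np.clip(np.rint(block.real / scale), lo, hi).astype(np.int64)
    im = np.clip(np.rint(block.imag / scale), lo, hi).astype(np.int64)
    return re, im, exponent

def bfp_dequantize(re, im, exponent):
    """Rebuild complex values from mantissas and the shared exponent."""
    return (re + 1j * im) * np.ldexp(1.0, exponent)

# Single-tone test vector of length 1024 with a small amplitude.
n = np.arange(1024)
tone = 1e-3 * np.exp(2j * np.pi * 37 * n / 1024)
re, im, exponent = bfp_quantize(tone)
restored = bfp_dequantize(re, im, exponent)
max_error = np.abs(restored - tone).max()
```

Because the exponent tracks the block peak, a low-amplitude tone still uses most of the mantissa range, which is the dynamic range benefit that BFP is meant to provide.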

14 Figure of Merit Estimates vs Transform Size
FOM = Area (ALMs) x Throughput (Cycles/DFT) x Mem (Kbits) / Clock (Hz)
“Streaming” circuits: Altera (20-bit) and base-4 (16-bit)
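The figure of merit is simple arithmetic and can be written as a one-line helper. The function name `figure_of_merit` is illustrative, and the numbers in the example are hypothetical placeholders, not measurements from the presentation.

```python
def figure_of_merit(area_alms, cycles_per_dft, mem_kbits, clock_hz):
    """FOM = Area (ALMs) x Throughput (cycles/DFT) x Mem (Kbits) / Clock (Hz)."""
    return area_alms * cycles_per_dft * mem_kbits / clock_hz

# Hypothetical placeholder values, chosen only to exercise the formula.
fom = figure_of_merit(area_alms=1000, cycles_per_dft=256,
                      mem_kbits=10, clock_hz=350e6)
fom_fast = figure_of_merit(area_alms=1000, cycles_per_dft=256,
                           mem_kbits=10, clock_hz=700e6)
```

As defined, the value scales linearly with area, cycle count and memory, and inversely with clock rate, so doubling the clock halves it.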

15 Scaling Option (1)
Trade-off between throughput and resources used
FOM = Area (ALMs) x Throughput (Cycles/DFT) / Clock (MHz) / 1000
Nominal clock = 350 MHz
Estimates

16 Non-Power-of-Two Comparison
FOM = Area (ALMs) x Throughput (cycles/DFT) x Memory (Kbit) / Clock (MHz)
Nominal clock = 350 MHz
Non-power-of-two

17 Scaling (2)
Use same circuit to do different transform sizes (e.g., run-time)
Base-4 matrix equation:
Process each CB multiplication separately using blocks of 4 rows
Example: 1024-point transform (Nr=Nc=32)
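One way to read "blocks of 4 rows" is that a coefficient-matrix product is computed four rows at a time, so a fixed-size array can be reused for larger transforms. The sketch below shows that reading in software; it is an assumption about the intent, `multiply_in_row_blocks` is a hypothetical name, and a plain 32-point DFT matrix stands in for the coefficient matrix.

```python
import numpy as np

def multiply_in_row_blocks(C, X, block_rows=4):
    """Compute C @ X one block of rows at a time; returns (result, passes)."""
    C = np.asarray(C)
    X = np.asarray(X)
    out = np.empty((C.shape[0],) + X.shape[1:], dtype=np.result_type(C, X))
    passes = 0
    for start in range(0, C.shape[0], block_rows):
        stop = start + block_rows
        out[start:stop] = C[start:stop] @ X   # one pass through the array
        passes += 1
    return out, passes

# Stand-in coefficient matrix: a 32-point DFT matrix (Nr = Nc = 32 example).
idx = np.arange(32)
C = np.exp(-2j * np.pi * np.outer(idx, idx) / 32)
rng = np.random.default_rng(2)
X = rng.standard_normal((32, 32)) + 1j * rng.standard_normal((32, 32))
blocked, passes = multiply_in_row_blocks(C, X)
```

The blocked result is identical to the full product; only the number of passes changes, which is the throughput cost of reusing a smaller circuit.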

18 Scaling (2)
[Diagram for N = 1024, Nr=Nc=32. Labels: Cycle input twice; Option 1; Option 2; All column DFTs; All twiddles; All row DFTs; Normal ordering; (half Z values); (other half)]

19 Desired Features
Transform size N can be any multiple of 256
Scalable circuit
-Any DFT size can be computed on the same circuit with sufficient memory
-Larger circuits constructed by replication of identical 4x4 PE array blocks
-Choose Nr and Nc for speed-area tradeoff
BFP/FP options reduce word length by ~4 bits
Low computational latency
-Pipeline depth small compared with traditional pipelined FFTs
1-D and 2-D transforms possible on the same circuit
Simple circuit (mesh array of identical adder cells)
High throughput (higher clock frequency, fewer clock cycles/DFT)

20 Precision
1024-point transform
Random real and complex inputs
18 data sets
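A precision experiment of this kind is usually summarized as a signal to (roundoff) noise ratio. The sketch below shows how such a ratio could be measured over 18 random 1024-point data sets. It models only 16-bit quantization of the inputs followed by a double-precision FFT, so it is not a model of the circuit's internal arithmetic; `sqnr_db` and `quantize` are hypothetical helpers.

```python
import numpy as np

def sqnr_db(reference, measured):
    """Signal to noise ratio in dB between a reference and a test result."""
    signal = np.sum(np.abs(reference) ** 2)
    noise = np.sum(np.abs(reference - measured) ** 2)
    return 10.0 * np.log10(signal / noise)

def quantize(x, bits=16):
    """Round real and imaginary parts to a fixed-point grid on [-1, 1)."""
    step = 2.0 ** -(bits - 1)
    x = np.asarray(x, dtype=complex)
    return (np.round(x.real / step) + 1j * np.round(x.imag / step)) * step

rng = np.random.default_rng(3)
results = []
for data_set in range(18):
    x = rng.uniform(-1.0, 1.0, 1024).astype(complex)
    if data_set % 2 == 1:                     # alternate real and complex inputs
        x = x + 1j * rng.uniform(-1.0, 1.0, 1024)
    reference = np.fft.fft(x)
    measured = np.fft.fft(quantize(x))
    results.append(sqnr_db(reference, measured))
```

Comparing circuits at a matched ratio, as the benchmark slide describes, would replace `measured` with the output of a bit-accurate model of each design.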

