MAPLD 2005 A High-Performance Radix-2 FFT in ANSI C for RTL Generation John Ardini
Motivation
  Implementations of algorithms in ANSI C for
    – Rapid prototyping
    – Incorporation into a reconfigurable platform with runtime partitioning or binding (same FFT mapped to HW or SW)
  Establish a method for software engineers to generate IP
Goals
  Show drastic reduction in IP development time
  Beat DSP performance in throughput and area while maintaining energy consumption
  Allow production of coprocessor IP with small learning curve: weeks, not months
FFT Test Algorithm
  Well understood and studied
  Frequently used
  Standard DSP benchmark
  Standard software implementation available (Numerical Recipes in C)
  Radix-2 is standard in C and on DSPs, and is used for this study
RTL Generator
  ImpulseC chosen for this study
    – ANSI C
    – Simple modifications to algorithm to compile for processor
        Data I/O path
        Word types as simple #defines
    – High level of abstraction
        Small learning curve
        Give up low-level control of registers/signals
        Some control over max gate delay using #pragma
    – Desktop simulation for fast algorithm debug
Test Environment
  Alpha-Data VirtexII Pro card on PCI bus
  Simple bus wrapper also counts clocks to execute FFT algorithm
  Use Visual C++ to write high-level application code
  (Figure labels: FPGA wrapper; IP; Local bus to PCI bridge; PC)
FFT Structure
  Classic DIT radix-2 structure requires (N/2)log2(N) butterfly computations
    – 5120 for our 1024-point test case
  Butterflies evaluated with 3 nested loops:
    – Outer walks the stages
    – Middle walks the butterflies for each branch
    – Inner walks the branches
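The loop nest described on this slide can be sketched as a counter that visits every butterfly once. This is an illustration only: the function name `count_butterflies` and the index variables are assumptions, and the loop ordering follows the Numerical Recipes style (one twiddle per middle-loop pass, reused across branches in the inner loop).

```c
#include <assert.h>

/* Count the butterflies visited by a three-level radix-2 DIT loop nest.
 * Hypothetical helper for illustration; n must be a power of two. */
unsigned long count_butterflies(unsigned n)
{
    assert(n >= 2 && (n & (n - 1)) == 0);
    unsigned long count = 0;

    for (unsigned span = 1; span < n; span <<= 1) {       /* outer: stages */
        for (unsigned m = 0; m < span; m++) {             /* middle: butterflies within a branch */
            for (unsigned i = m; i < n; i += 2 * span) {  /* inner: branches sharing one twiddle */
                unsigned j = i + span;                    /* partner point of butterfly (i, j) */
                (void)j;                                  /* a real FFT would compute here */
                count++;
            }
        }
    }
    return count;
}
```

For n = 1024 the outer loop runs 10 times and each stage visits 512 butterflies, which matches the (N/2)log2(N) count on the slide.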
Butterfly Loop Structure
  (Figure labels: Outer loop; Middle loop; Inner loop)
    // butterfly operation
    CMPLX_RD( i, cmplxI );
    CMPLX_RD( j, cmplxJ );
    tempr = (wr*cmplxJ[REAL] - wi*cmplxJ[IMAG]) >> FAC_SHIFT;
    tempi = (wr*cmplxJ[IMAG] + wi*cmplxJ[REAL]) >> FAC_SHIFT;
    cmplxJ[REAL] = cmplxI[REAL] - tempr;
    cmplxJ[IMAG] = cmplxI[IMAG] - tempi;
    cmplxI[REAL] += tempr;
    cmplxI[IMAG] += tempi;
    CMPLX_WR( i, cmplxI );
    CMPLX_WR( j, cmplxJ );
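A self-contained version of the butterfly arithmetic, as a sketch under stated assumptions: twiddles are taken to be Q15 (so `FAC_SHIFT` is 15), data words are 16 bits, and products are held in 32 bits. None of these widths are given on the slide. The function name `butterfly` is hypothetical, there is no overflow or saturation handling, and the right shift of a negative value relies on the arithmetic-shift behavior of mainstream compilers.

```c
#include <assert.h>
#include <stdint.h>

#define REAL      0
#define IMAG      1
#define FAC_SHIFT 15   /* assumption: Q15 twiddle factors */

/* One radix-2 DIT butterfly on two complex points held as 2-word arrays.
 * wr, wi: real and imaginary twiddle parts in Q15 (assumed format). */
void butterfly(int16_t cmplxI[2], int16_t cmplxJ[2], int16_t wr, int16_t wi)
{
    assert(cmplxI != cmplxJ);

    /* complex multiply W * J, scaled back down from Q15 */
    int32_t tempr = ((int32_t)wr * cmplxJ[REAL] - (int32_t)wi * cmplxJ[IMAG]) >> FAC_SHIFT;
    int32_t tempi = ((int32_t)wr * cmplxJ[IMAG] + (int32_t)wi * cmplxJ[REAL]) >> FAC_SHIFT;

    /* J = I - W*J, then I = I + W*J */
    cmplxJ[REAL] = (int16_t)(cmplxI[REAL] - tempr);
    cmplxJ[IMAG] = (int16_t)(cmplxI[IMAG] - tempi);
    cmplxI[REAL] = (int16_t)(cmplxI[REAL] + tempr);
    cmplxI[IMAG] = (int16_t)(cmplxI[IMAG] + tempi);
}
```

With I = 100 + 50j, J = 200 - 40j and a twiddle of 0.5 (wr = 16384, wi = 0), the product is 100 - 20j, so I becomes 200 + 30j and J becomes 0 + 70j.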
General IP Structure
  Written as FFT coprocessor block with input/output "stream" model
    // stream in N points
    // butterfly computation loops (prior page)
    // stream out N points
DSP Benchmark
  Metric: clock cycles to complete the FFT calculation, measured from last data in to first data available
    – Ref: "TMS320C55x DSP Library Programmer's Reference," TI SPRU422H, Oct 2004
Implementation A
  Direct mapping of classic decimation-in-time (DIT) algorithm to fixed-point code
  Calculation in place using single data buffer for complex numbers
  Use 2-word arrays for internal representation of complex numbers
Implementation A Results
  Implementation effort: about 1 week
    – About 100 SLOC
  Clocks to complete FFT: 48162, about 2x DSP
  Inner butterfly loop takes 9 clocks
  I/O loops take 4 clocks per point
  Slices: 536 (includes simple bus wrapper)
  Multipliers: 8
  Block RAMs: 2
Implementation B
  Scalarize internal complex number representation to eliminate memory contention:
    // int16 cmplxI[2]
    // int16 cmplxJ[2]
    // becomes
    int16 cmplxIReal, cmplxIImag;
    int16 cmplxJReal, cmplxJImag;
  Allows simultaneous assignments to real and imaginary parts of complex working variables
  Reads and writes of working variables done with #defines to hide implementation:
    // e.g.
    #define CMPLX_RD(ofst,dest) dest##Real = dataBuf[ofst]; dest##Imag = dataBuf[ofst+1]
    // CMPLX_RD( i, cmplxI );
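A runnable sketch of the token-pasting access macros, assuming an interleaved (re, im) `dataBuf`. `CMPLX_WR` is written here as the obvious counterpart of the `CMPLX_RD` shown on the slide; its body is an assumption, as are `N_POINTS`, the `do { } while (0)` wrapping, and the demo function `swap_points`.

```c
#include <assert.h>
#include <stdint.h>

#define N_POINTS 4                      /* assumed size, for the demo only */
static int16_t dataBuf[2 * N_POINTS];   /* interleaved: re0, im0, re1, im1, ... */

/* dest##Real / dest##Imag paste the prefix onto the scalar variable names,
 * so callers keep writing CMPLX_RD(i, cmplxI) as in the array version. */
#define CMPLX_RD(ofst, dest) \
    do { dest##Real = dataBuf[(ofst)]; dest##Imag = dataBuf[(ofst) + 1]; } while (0)
#define CMPLX_WR(ofst, src) \
    do { dataBuf[(ofst)] = src##Real; dataBuf[(ofst) + 1] = src##Imag; } while (0)

/* Swap two complex points through scalar working variables.
 * i and j are word offsets into dataBuf (even values). */
void swap_points(int i, int j)
{
    assert(i >= 0 && j >= 0 && i + 1 < 2 * N_POINTS && j + 1 < 2 * N_POINTS);

    int16_t cmplxIReal, cmplxIImag;
    int16_t cmplxJReal, cmplxJImag;

    CMPLX_RD(i, cmplxI);
    CMPLX_RD(j, cmplxJ);
    CMPLX_WR(i, cmplxJ);
    CMPLX_WR(j, cmplxI);
}
```

Because the working values are four independent scalars rather than two small arrays, a hardware generator is free to assign the real and imaginary parts in the same cycle, which is the point the slide makes.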
Implementation B Results
  Clocks to complete FFT: 32802, about 1.4x DSP
  Inner butterfly loop takes 6 clocks
    – Savings is 3 clocks * 5120 flies = 15360 clocks
  I/O loops take 4 clocks per point
  Slices: 398
  Multipliers: 8
  Block RAMs: 2
Implementation C
  Replace single input data buffer with separate real and imaginary buffers
  Allows simultaneous access to re, im parts of data buffer
  (Figure: realBuf and ImagBuf shown as separate buffers)
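The split-buffer idea can be sketched as a butterfly that reads and writes two separate arrays. The buffer names follow the slide's figure labels; the function name `butterfly_split`, the Q15 twiddle format (`FAC_SHIFT` of 15), the word widths, and `N_POINTS` are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define N_POINTS  8    /* assumed size, for the demo only */
#define FAC_SHIFT 15   /* assumption: Q15 twiddle factors */

static int16_t realBuf[N_POINTS];   /* real parts only */
static int16_t ImagBuf[N_POINTS];   /* imaginary parts only */

/* Butterfly on points i and j. Real and imaginary parts live in different
 * memories, so each re/im pair can be fetched or stored at the same time. */
void butterfly_split(int i, int j, int16_t wr, int16_t wi)
{
    assert(i != j && i >= 0 && j >= 0 && i < N_POINTS && j < N_POINTS);

    int16_t iRe = realBuf[i], iIm = ImagBuf[i];
    int16_t jRe = realBuf[j], jIm = ImagBuf[j];

    int32_t tempr = ((int32_t)wr * jRe - (int32_t)wi * jIm) >> FAC_SHIFT;
    int32_t tempi = ((int32_t)wr * jIm + (int32_t)wi * jRe) >> FAC_SHIFT;

    realBuf[j] = (int16_t)(iRe - tempr);
    ImagBuf[j] = (int16_t)(iIm - tempi);
    realBuf[i] = (int16_t)(iRe + tempr);
    ImagBuf[i] = (int16_t)(iIm + tempi);
}
```

In software the two arrays behave the same as one interleaved array; the benefit appears only when each array maps to its own block RAM with its own ports.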
Implementation C Results
  Clocks to complete FFT: 17442, about 0.7x DSP
  Inner butterfly loop takes 3 clocks
    – Savings is 3 clocks * 5120 flies = 15360 clocks
  I/O loops now take 3 clocks per point
  Slices: 425
  Multipliers: 8
  Block RAMs: 2
Implementation D
  Examine DIF structure
  After the first stage, the data splits into two independent halves that 2 parallel engines can handle
  Could also be DIT
Implementation D
  Note first stage calculations can be handled as data arrives
  Also note last stage could be handled as data leaves
  (Figure labels: Input stage; Main fly engines; Output stage, add/sub; Simple data input; Input with butterfly)
Implementation D
  Implement 2 butterflies in parallel
    – Double up code; tool worries about parallelism
  Hide first and last butterfly stages by performing butterflies as data arrives/leaves
    – Note that the last stage has only trivial multiplications, so no FPGA multipliers are required
  Also places twiddles in ROM to lower use of FPGA multiplier resources
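Why the last stage needs no multipliers: in a radix-2 DIF flow the final stage pairs adjacent points with a twiddle of 1, so each butterfly is just an add and a subtract. The sketch below applies that stage to buffers in memory; the slide performs it as data streams out, which is not modeled here. The function name `dif_last_stage` and the 16-bit word width are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Final stage of a radix-2 DIF FFT: twiddle is W^0 = 1 for every butterfly,
 * so only add/subtract hardware is required. n is the number of points. */
void dif_last_stage(int16_t *re, int16_t *im, int n)
{
    assert(n >= 2 && n % 2 == 0);

    for (int k = 0; k < n; k += 2) {
        int16_t ar = re[k],     ai = im[k];
        int16_t br = re[k + 1], bi = im[k + 1];

        re[k]     = (int16_t)(ar + br);   /* sum output */
        im[k]     = (int16_t)(ai + bi);
        re[k + 1] = (int16_t)(ar - br);   /* difference output */
        im[k + 1] = (int16_t)(ai - bi);
    }
}
```

The results of this stage are the FFT outputs in bit-reversed order, so any reordering has to be handled by whoever consumes the stream.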
Implementation D Results
  Clocks to complete FFT: 7186, about 0.3x DSP
  Inner butterfly loop still takes 3 clocks
  I/O loops still take 3 clocks per point
  Savings due to parallelism:
    – 8 stages * (512/2) flies * 3 clocks = 6144 clocks, inner loop
    – 2 clocks * (2^n, n = 1, 2 ... 8) times through loop = 1024 clocks, middle loop
  Savings due to I/O stage butterflies
    – 2 * 512 * 3 = 3072 clocks
  Slices: 859 (813 w/o bus wrapper)
  Multipliers: 12
  Block RAMs: 8
  Max clock rate: 76 MHz, VirtexII Pro
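The savings terms on this slide can be cross-checked against the reported clock counts. The helper names below are hypothetical, and the middle-loop term is taken from the slide as a given number rather than derived.

```c
#include <assert.h>

/* Inner-loop savings: with two engines, each remaining stage runs half as
 * many sequential butterfly iterations. */
long inner_loop_savings(long stages, long flies_per_stage, long clocks_per_fly)
{
    assert(stages > 0 && flies_per_stage % 2 == 0 && clocks_per_fly > 0);
    return stages * (flies_per_stage / 2) * clocks_per_fly;
}

/* I/O-stage savings: the first and last stages are folded into the input
 * and output loops, so two full stages of butterflies cost no extra clocks. */
long io_stage_savings(long flies_per_stage, long clocks_per_fly)
{
    assert(flies_per_stage > 0 && clocks_per_fly > 0);
    return 2 * flies_per_stage * clocks_per_fly;
}

/* Sum of the three itemized terms for the 1024-point case. */
long itemized_savings(void)
{
    const long middle_loop = 1024;   /* figure quoted on the slide */
    return inner_loop_savings(8, 512, 3) + middle_loop + io_stage_savings(512, 3);
}
```

The itemized terms add up to 10240 clocks, while the drop from Implementation C to Implementation D is 17442 - 7186 = 10256 clocks, so the breakdown accounts for all but 16 clocks of the measured improvement.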
On Size and Power
  Effective area when placed into VirtexII or Virtex4 FPGAs is on the order of 1/2 to 1/3 that of a DSP, based on package sizes and resource utilization
  Power on the order of mW for Virtex4 device (estimated)
  Energy for 1024-point FFT: estimated 42 µJ
    – Estimated 32 µJ for DSP
Conclusions / Future Work
  Implementation time extremely short
    – 1-2 weeks vs. estimated 3+ months with HDL
    – SW approach without need for understanding reg vs. wire, pipelining
  For clock rates to 75 MHz, this design is 3x faster than a DSP
    – Trade gate delay for clock rate with available #pragma for designs in excess of 75 MHz
    – Use two clock domains: I/O, core
  Other optimizations
    – Radix-4
    – ImpulseC parallel processes
    – I/O rate can be improved with 32-bit bus