Automatic Generation of Customized Discrete Fourier Transform IPs Grace Nordin, Peter A. Milder, James C. Hoe, Markus Püschel Carnegie Mellon University This project is supported in part by NSF awards ITR/NGS and SYS and a DARPA DESA program
Slide 2 The Paradox of Reusable IPs Boon to productivity zero effort required zero knowledge required zero chance to introduce new bugs Why repeat what has already been done? Bane to optimality finding the right functionality with the right interface design tradeoff -- performance, area, power, accuracy..... Are you getting what you really wanted? Solution: Solution: parameterized automatic IP generators zero effort, knowledge or bugs allows application specific customization facilitates design exploration
Slide 3 Our Work: Discrete Fourier Transform IPs Discrete Fourier Transform (DFT) important building block in DSP applications numerous design “cores” available Current IP libraries support: various sizes, number formats, data orderings small number only a small number of microarchitecture choices (Xilinx LogiCore DFT gives 3 choices) We generate IPs with custom design tradeoffs degree of parallelism in microarchitecture (min max) resource preference (e.g. BRAM vs. slices in FPGAs) Extensible to other common linear DSP transforms
Slide 4 Outline Introduction Formula-Driven Design Generation Microarchitecture Parameterization Generator User Interface Experimental Results Conclusions
Slide 5 Transforms as Formulas [ Transform computation is represented as matrix-vector multiplication Matrix-vector multiplication is O(n 2 ) operations “Fast” algorithms factor the transform into a sequence of structured sparse matrices O(n log n) operations DFT: FFT: Datapath easily formed from factorized formulas
Slide 6 Formula to Datapath Given where is: apply, then is a permutationpermute apply, times in parallel is a diagonalscale A A B A ×4×4 ×2×2 ×7×7 ×8×8
Slide 7 Outline Introduction Formula-Driven Design Generation Microarchitecture Parameterization Generator User Interface Experimental Results Conclusions
Slide 8 Simple regular structure embodied in formula Example: Pease DFT diagonal permutation butterfly parallel k stages stage 1 stage 2 stage 3
Slide 9 Pease DFT Example: DFT 8 x x x x x x x x x x x x stage 1 stage 2 stage 3 (formula is applied from right to left) (datapath is built left to right) Repeating column structure hardware reuse without performance penalty without performance penalty
Slide 10 x x x x Horizontal folding x x x x x x x x our baseline design degree of freedom: vertical parallelism p parameter p input bypass register p
Slide 11 Vertical (V-)folding according to p latency Fine-grained control over cost/latency tradeoff cost
Slide 12 Outline Introduction Formula-Driven Design Generation Microarchitecture Parameterization Generator User Interface Experimental Results Conclusions
Slide 13 User Interface common DFT options customization options
Slide 14 Outline Introduction Formula-Driven Design Generation Microarchitecture Parameterization Generator User Interface Experimental Results Conclusions
Slide 15 We compare Xilinx’s fixed design against our variable generated designs Evaluation We compare against Xilinx LogiCore DFT Ver. 3.1 radix-4 burst I/O interface XilinxSPIRAL datapathfixed, one radix- 4 basic block variable, p radix-2 basic blocks cost-performance tradeoff fixed user-controlled, varies with p Comparison DFT n = {64, 1024, 2048}; width = 16; bit-reversed output Xilinx ISE ver. 6.1, Xilinx Virtex2-Pro XC2VP100-6
Slide 16 DFT 1024 relative to Xilinx Xilinx Performance and resources scale with p 1.0 = 1955 slices 1.0 = 7 BRAMs1.0 = 1 / 5.6 µsec logic storage performance
Slide p relative slices p relative BRAMs Resource usage preferences Xilinx 1.0 = 1955 slices 1.0 = 7 BRAMs1.0 = 1 / 5.6 µsec logic storage performance p speedup
Slide 18 Resource usage preferences Can control tradeoff between slices and BRAMs Xilinx exchange BRAM for slices very little change in performance 1.0 = 1955 slices 1.0 = 7 BRAMs1.0 = 1 / 5.6 µsec logic storage performance
Slide 19 DFT 64 and DFT = 2140 slices 1.0 = 7 BRAMs 1.0 = 1 transform / µsec Trends hold for sizes 64, = 1743 slices 1.0 = 8 BRAMs 1.0 = 1 transform / µsec 64 Xilinx
Slide 20 Related Work Kumhom, Johnson, Nagvajara, ASIC/SOC 2000 universal FFT processor microarchitecture based on processing elements interconnected by on-chip reconfigurable network microarchitecture is scalable in the number of elements supports both Cooley Tukey and Pease Choi, Scrofano, Prasanna, Jang, FPGA’2003 mapped radix-4 Cooley-Tukey algorithm onto log 2 (n)/2 DFT 4 primitives scalable datapath between 1 element and 4 elements at a time show energy and performance improvements from scaling
Slide 21 Conclusions Parameterized DFT IP generator formula-driven matrix formula-driven synthesis performance/cost tradeoff resources vs. latency fine-grained control over resources vs. latency resource usage preference slices and BRAM can balance tradeoff between slices and BRAM Key results efficient: efficient: the Xilinx design point can be matched customizable: design tradeoffs customizable: design tradeoffs directly controllable easy to use: simple yet powerful web interface
Slide 22 Web Generator SPIRAL This work is part of the SPIRAL project, which aims to push the limits of automation in software and hardware development for DSP algorithms. For more information visit:
Slide 23 V-folding according to p (continued) 4 2 0 7 5 3 1 p max = n/2 p min = 1
Slide 24 V-Folding of Permutations [Takala, et al. ICASSP’2001] where