MAPLD 2005 A High-Performance Radix-2 FFT in ANSI C for RTL Generation John Ardini.



Slide 2: Motivation
- Implementations of algorithms in ANSI C for
  - Rapid prototyping
  - Incorporation into a reconfigurable platform with runtime partitioning or binding (same FFT mapped to HW or SW)
- Establish a method for software engineers to generate IP

Slide 3: Goals
- Show drastic reduction in IP development time
- Beat DSP performance in throughput and area while maintaining energy consumption
- Allow production of coprocessor IP with a small learning curve: weeks, not months

Slide 4: FFT Test Algorithm
- Well understood and studied
- Frequently used
- Standard DSP benchmark
- Standard software implementation available (Numerical Recipes in C)
- Radix-2 is standard in C and DSP implementations and is used for this study

Slide 5: RTL Generator
- ImpulseC chosen for this study
  - ANSI C
  - Simple modifications to algorithm to compile for processor
    - Data I/O path
    - Word types as simple #defines
  - High level of abstraction
    - Small learning curve
    - Give up low-level control of registers/signals
    - Some control over max gate delay using #pragma
  - Desktop simulation for fast algorithm debug

Slide 6: Test Environment
- Alpha-Data VirtexII Pro card on PCI bus
- Simple bus wrapper also counts clocks to execute FFT algorithm
- Use Visual C++ to write high-level application code
[Figure: block diagram; labels: FPGA wrapper, IP, local bus to PCI bridge, PC]

Slide 7: FFT Structure
- Classic DIT radix-2 structure requires (N/2) * log2(N) butterfly computations
  - 5120 for our 1024-point test case
- Butterflies evaluated with 3 nested loops:
  - Outer walks the stages
  - Middle walks the butterflies for each branch
  - Inner walks the branches
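As a rough check of the butterfly count, the three-level loop nest can be sketched in plain C. The loop order below follows the usual in-place radix-2 arrangement (stage, then twiddle, then group); the function name `count_butterflies` and its index variables are illustrative assumptions, not taken from the slides.

```c
#include <assert.h>

/* Counts the butterflies visited by a three-level radix-2 DIT loop nest.
   n must be a power of two and at least 2. Hypothetical helper name. */
unsigned long count_butterflies(unsigned long n)
{
    unsigned long count = 0;
    unsigned long half, m, i;

    assert(n >= 2 && (n & (n - 1)) == 0);

    /* Outer loop: one pass per stage; half is the butterfly span. */
    for (half = 1; half < n; half <<= 1) {
        /* Middle loop: one pass per distinct twiddle factor in this stage. */
        for (m = 0; m < half; m++) {
            /* Inner loop: every group (branch) that reuses this twiddle. */
            for (i = m; i < n; i += 2 * half) {
                /* The butterfly would combine points i and i + half here. */
                count++;
            }
        }
    }
    return count;
}
```

For the 1024-point test case, `count_butterflies(1024)` returns 5120, matching (1024/2) * log2(1024) = 512 * 10.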

Slide 8: Butterfly Loop Structure
    // butterfly operation
    CMPLX_RD( i, cmplxI );
    CMPLX_RD( j, cmplxJ );
    tempr = (wr*cmplxJ[REAL] - wi*cmplxJ[IMAG]) >> FAC_SHIFT;
    tempi = (wr*cmplxJ[IMAG] + wi*cmplxJ[REAL]) >> FAC_SHIFT;
    cmplxJ[REAL] = cmplxI[REAL] - tempr;
    cmplxJ[IMAG] = cmplxI[IMAG] - tempi;
    cmplxI[REAL] += tempr;
    cmplxI[IMAG] += tempi;
    CMPLX_WR( i, cmplxI );
    CMPLX_WR( j, cmplxJ );
[Figure labels: Outer loop, Middle loop, Inner loop]
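The butterfly arithmetic can be pulled out into a standalone function for desktop testing. This is a minimal sketch, not the author's code: the slide does not give the value of `FAC_SHIFT` or the word widths, so Q14 twiddles (1.0 stored as `1 << 14`) and 16-bit data are assumptions, and the function name `butterfly` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define REAL 0
#define IMAG 1
#define FAC_SHIFT 14 /* assumed: twiddles are Q14, so 1.0 == 1 << 14 */

/* One radix-2 DIT butterfly on two complex points stored as {re, im}.
   wr, wi are the twiddle's real and imaginary parts scaled by 2^FAC_SHIFT.
   No per-stage scaling is applied, so inputs must leave headroom.
   Like the slide code, this relies on >> being an arithmetic shift for
   negative values, which is implementation-defined in C. */
void butterfly(int16_t cmplxI[2], int16_t cmplxJ[2], int16_t wr, int16_t wi)
{
    int32_t tempr, tempi;

    assert(cmplxI != 0 && cmplxJ != 0);

    /* Widen to 32 bits before multiplying so the products do not overflow. */
    tempr = ((int32_t)wr * cmplxJ[REAL] - (int32_t)wi * cmplxJ[IMAG]) >> FAC_SHIFT;
    tempi = ((int32_t)wr * cmplxJ[IMAG] + (int32_t)wi * cmplxJ[REAL]) >> FAC_SHIFT;

    cmplxJ[REAL] = (int16_t)(cmplxI[REAL] - tempr);
    cmplxJ[IMAG] = (int16_t)(cmplxI[IMAG] - tempi);
    cmplxI[REAL] = (int16_t)(cmplxI[REAL] + tempr);
    cmplxI[IMAG] = (int16_t)(cmplxI[IMAG] + tempi);
}
```

With the twiddle set to -j (`wr = 0`, `wi = -16384`), the point 30 - 20j is rotated to -20 - 30j before being added to and subtracted from the other input.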

Slide 9: General IP Structure
- Written as FFT coprocessor block with input/output "stream" model
    // stream in N points
    // butterfly computation loops (prior page)
    // stream out N points

Slide 10: DSP Benchmark
- Clock cycles to complete FFT calculation, time from last data in to first data available is
  - Ref “TMS320C55x DSP Library Programmer’s Reference,” TI SPRU422H, Oct 2004

Slide 11: Implementation A
- Direct mapping of classic Decimation in Time (DIT) algorithm to fixed-point code
- Calculation in place using single data buffer for complex numbers
- Use 2-word arrays for internal representation of complex numbers

Slide 12: Implementation A Results
- Implementation effort: about 1 week
  - About 100 SLOC
- Clocks to complete FFT: 48162, about 2x DSP
- Inner butterfly loop takes 9 clocks
- I/O loops take 4 clocks per point
- Slices: 536 (includes simple bus wrapper)
- Multipliers: 8
- Block RAMs: 2

Slide 13: Implementation B
- Scalarize internal complex number representation to eliminate memory contention:
    // int16 cmplxI[2]
    // int16 cmplxJ[2]
    // becomes
    int16 cmplxIReal, cmplxIImag;
    int16 cmplxJReal, cmplxJImag;
- Allows simultaneous assignments to real and imaginary parts of complex working variables
- Reads and writes of working variables done with #defines to hide implementation:
    // e.g.
    #define CMPLX_RD(ofst,dest) dest##Real = dataBuf[ofst]; dest##Imag = dataBuf[ofst+1]
    // CMPLX_RD( i, cmplxI );
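The token-pasting trick can be tried on a desktop compiler with a sketch like the one below. Only `CMPLX_RD` appears on the slide; the matching `CMPLX_WR`, the `do { } while (0)` wrappers, the buffer size, and the `swap_points` helper are assumptions added for illustration. Offsets are taken to be word offsets into an interleaved re/im buffer, as `dataBuf[ofst+1]` suggests.

```c
#include <assert.h>
#include <stdint.h>

#define N_POINTS 4
static int16_t dataBuf[2 * N_POINTS]; /* interleaved: re0, im0, re1, im1, ... */

/* dest##Real pastes the prefix onto the scalar names, so CMPLX_RD(i, cmplxI)
   fills cmplxIReal and cmplxIImag. The do/while wrapper (an addition here)
   keeps the two statements together after an unbraced if. */
#define CMPLX_RD(ofst, dest) \
    do { dest##Real = dataBuf[(ofst)]; dest##Imag = dataBuf[(ofst) + 1]; } while (0)

/* Hypothetical write counterpart; the slide names it but does not show it. */
#define CMPLX_WR(ofst, src) \
    do { dataBuf[(ofst)] = src##Real; dataBuf[(ofst) + 1] = src##Imag; } while (0)

/* Illustrative use: exchange the complex points at word offsets i and j. */
void swap_points(int i, int j)
{
    int16_t cmplxIReal, cmplxIImag;
    int16_t cmplxJReal, cmplxJImag;

    assert(i >= 0 && i + 1 < 2 * N_POINTS);
    assert(j >= 0 && j + 1 < 2 * N_POINTS);

    CMPLX_RD(i, cmplxI);
    CMPLX_RD(j, cmplxJ);
    CMPLX_WR(i, cmplxJ);
    CMPLX_WR(j, cmplxI);
}
```

Because the working variables are separate scalars rather than array elements, a hardware compiler is free to assign the real and imaginary parts in the same clock, which is the point the slide makes.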

Slide 14: Implementation B Results
- Clocks to complete FFT: 32802, about 1.4x DSP
- Inner butterfly loop takes 6 clocks
  - Savings is 3 clocks * 5120 flies = 15360 clocks
- I/O loops take 4 clocks per point
- Slices: 398
- Multipliers: 8
- Block RAMs: 2

Slide 15: Implementation C
- Replace single input data buffer with separate real and imaginary buffers
- Allows simultaneous access to real and imaginary parts of data buffer
[Figure labels: realBuf, imagBuf]
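A minimal sketch of the split-buffer layout, assuming the same macro-based access style as Implementation B. The buffer names `realBuf` and `imagBuf` come from the slide; the macro bodies, the buffer size, and the helpers `load_interleaved` and `add_points` are hypothetical. Indices here count complex points rather than words.

```c
#include <assert.h>
#include <stdint.h>

#define N_POINTS 8
static int16_t realBuf[N_POINTS]; /* real parts only */
static int16_t imagBuf[N_POINTS]; /* imaginary parts only */

/* With two separate arrays, the two reads (or writes) touch different
   memories, so a hardware compiler can schedule them in the same clock. */
#define CMPLX_RD(idx, dest) \
    do { dest##Real = realBuf[(idx)]; dest##Imag = imagBuf[(idx)]; } while (0)
#define CMPLX_WR(idx, src) \
    do { realBuf[(idx)] = src##Real; imagBuf[(idx)] = src##Imag; } while (0)

/* Copy an interleaved re/im array of N_POINTS points into the split buffers. */
void load_interleaved(const int16_t *src)
{
    int k;
    assert(src != 0);
    for (k = 0; k < N_POINTS; k++) {
        realBuf[k] = src[2 * k];
        imagBuf[k] = src[2 * k + 1];
    }
}

/* Illustrative use of the macros: point i becomes point i + point j. */
void add_points(int i, int j)
{
    int16_t cmplxIReal, cmplxIImag;
    int16_t cmplxJReal, cmplxJImag;

    CMPLX_RD(i, cmplxI);
    CMPLX_RD(j, cmplxJ);
    cmplxIReal = (int16_t)(cmplxIReal + cmplxJReal);
    cmplxIImag = (int16_t)(cmplxIImag + cmplxJImag);
    CMPLX_WR(i, cmplxI);
}
```

The calling code is unchanged relative to the interleaved version; only the macro bodies differ, which is why hiding the accesses behind #defines pays off.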

Slide 16: Implementation C Results
- Clocks to complete FFT: 17442, about 0.7x DSP
- Inner butterfly loop takes 3 clocks
  - Savings is 3 clocks * 5120 flies = 15360 clocks
- I/O loops now take 3 clocks per point
- Slices: 425
- Multipliers: 8
- Block RAMs: 2
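The reported totals for Implementations A, B, and C can be cross-checked against the inner-loop clock counts with a few lines of C. This is only a bookkeeping sketch; the function names are hypothetical, and it assumes that the inner butterfly loop is the only thing that changed between versions.

```c
#include <assert.h>

/* (n/2) * log2(n) butterflies for a radix-2 FFT; n must be a power of two. */
unsigned long butterflies(unsigned long n)
{
    unsigned long stages = 0;
    unsigned long m;

    assert(n >= 2 && (n & (n - 1)) == 0);
    for (m = n; m > 1; m >>= 1) {
        stages++;
    }
    return (n / 2) * stages;
}

/* Clocks saved when the inner loop drops from `before` to `after`
   clocks per butterfly. */
unsigned long inner_loop_savings(unsigned long n,
                                 unsigned long before,
                                 unsigned long after)
{
    assert(before >= after);
    return (before - after) * butterflies(n);
}
```

For N = 1024, going from 9 to 6 clocks and from 6 to 3 clocks each saves 3 * 5120 = 15360 clocks, which equals both 48162 - 32802 and 32802 - 17442.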

Slide 17: Implementation D
- Examine DIF structure
- After the first stage, the data splits into two independent halves that 2 parallel engines can handle
- Could also be DIT

Slide 18: Implementation D
- Note first stage calculations can be handled as data arrives
- Also note last stage could be handled as data leaves
[Figure labels: Input stage, Main fly engines, Output stage (add/sub), Simple data input, Input with butterfly]

Slide 19: Implementation D
- Implement 2 butterflies in parallel
  - Double up code, tool worries about parallelism
- Hide first and last butterfly stages by performing butterflies as data arrives/leaves
  - Note that last stage is trivial multiplications, so no FPGA multipliers are required
- Also place twiddles in ROM to lower use of FPGA multiplier resources
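The idea of folding the first and last DIF stages into the I/O loops can be sketched as follows. The deck does not show this code, so everything here is an assumption: samples arrive in natural order, twiddles sit in a small constant table standing in for the ROM, `FAC_SHIFT` is 14, and N is fixed at 4 so that the input and output stages alone complete the transform (a larger N would run the parallel butterfly engines between them). The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define N 4          /* tiny transform so the whole flow is visible */
#define HALF (N / 2)
#define FAC_SHIFT 14 /* assumed Q14 twiddle scaling */

/* Stand-in for the twiddle ROM: W_N^k = exp(-2*pi*j*k/N), k = 0..N/2-1. */
static const int16_t twRe[HALF] = { 16384, 0 };
static const int16_t twIm[HALF] = { 0, -16384 };

static int16_t realBuf[N], imagBuf[N];

/* Input stage. The first N/2 samples are simply stored. Each later sample
   completes a first-stage DIF butterfly with the sample N/2 positions
   earlier, so that stage costs no separate pass over the buffer. */
void stream_in_point(int idx, int16_t re, int16_t im)
{
    assert(idx >= 0 && idx < N);
    if (idx < HALF) {
        realBuf[idx] = re;
        imagBuf[idx] = im;
    } else {
        int k = idx - HALF;
        int32_t dr = (int32_t)realBuf[k] - re; /* difference, twiddled below */
        int32_t di = (int32_t)imagBuf[k] - im;
        realBuf[k] = (int16_t)(realBuf[k] + re); /* sum needs no multiply */
        imagBuf[k] = (int16_t)(imagBuf[k] + im);
        realBuf[idx] = (int16_t)((twRe[k] * dr - twIm[k] * di) >> FAC_SHIFT);
        imagBuf[idx] = (int16_t)((twRe[k] * di + twIm[k] * dr) >> FAC_SHIFT);
    }
}

/* Output stage. The last DIF stage pairs neighbours 2k and 2k+1 with a
   twiddle of 1, so only an add and a subtract are needed per pair.
   Results come out in bit-reversed order. */
void stream_out_pair(int k, int16_t outRe[2], int16_t outIm[2])
{
    assert(k >= 0 && k < HALF);
    outRe[0] = (int16_t)(realBuf[2 * k] + realBuf[2 * k + 1]);
    outIm[0] = (int16_t)(imagBuf[2 * k] + imagBuf[2 * k + 1]);
    outRe[1] = (int16_t)(realBuf[2 * k] - realBuf[2 * k + 1]);
    outIm[1] = (int16_t)(imagBuf[2 * k] - imagBuf[2 * k + 1]);
}
```

Streaming in the real sequence 1, 2, 3, 4 and reading back both pairs yields 10, -2, -2 + 2j, -2 - 2j, that is X[0], X[2], X[1], X[3] of the 4-point DFT.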

Slide 20: Implementation D Results
- Clocks to complete FFT: 7186, about 0.3x DSP
- Inner butterfly loop still takes 3 clocks
- I/O loops still take 3 clocks per point
- Savings due to parallelism:
  - 8 stages * (512/2) flies * 3 clocks = 6144 clocks, inner loop
  - 2 clocks * (Σ 2^n, n = 1, 2, …, 8) times through loop ≈ 1024 clocks, middle loop
- Savings due to I/O stage butterflies
  - 2 * 512 * 3 = 3072 clocks
- Slices: 859 (813 w/o bus wrapper)
- Multipliers: 12
- Block RAMs: 8
- Max clock rate: 76MHz, VirtexII Pro
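The itemized savings can be tallied and compared with the drop from Implementation C (17442 clocks) to Implementation D (7186 clocks). This is a bookkeeping sketch with hypothetical function names. Evaluated exactly, the middle-loop sum gives 2 * 510 = 1020 clocks, which the slide lists as 1024.

```c
#include <assert.h>

/* Sum of 2^n for n = first..last. */
unsigned long sum_pow2(unsigned first, unsigned last)
{
    unsigned long s = 0;
    unsigned n;

    assert(first <= last && last < 32);
    for (n = first; n <= last; n++) {
        s += 1UL << n;
    }
    return s;
}

/* Total of the three savings terms listed on the results slide. */
unsigned long itemized_savings(void)
{
    unsigned long inner  = 8UL * (512 / 2) * 3;   /* 6144, parallel inner loop */
    unsigned long middle = 2UL * sum_pow2(1, 8);  /* 1020, middle loop overhead */
    unsigned long io     = 2UL * 512 * 3;         /* 3072, I/O stage butterflies */
    return inner + middle + io;
}
```

The tally comes to 10236 clocks against a measured difference of 17442 - 7186 = 10256, so the three terms account for the improvement to within about 20 clocks.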

Slide 21: On Size and Power
- Effective area when placed into VirtexII or Virtex4 FPGAs is on the order of 1/2 to 1/3 that of a DSP, based on package sizes and resource utilization
- Power on the order of mW for Virtex4 device (estimated)
- Energy for 1024-point FFT: estimated 42 µJ
  - Estimated 32 µJ for DSP

Slide 22: Conclusions / Future Work
- Implementation time extremely short
  - 1-2 weeks vs. estimated 3+ months with HDL
  - SW approach without need for understanding reg vs. wire, pipelining
- For clock rates to 75MHz, this design is 3x faster than a DSP
  - Trade gate delay for clock rate with available #pragma for designs in excess of 75MHz
  - Use two clock domains: I/O, core
- Other optimizations
  - Radix-4
  - ImpulseC parallel processes
  - I/O rate can be improved with 32-bit bus