Copyright © 2004, Dillon Engineering Inc. All Rights Reserved. An Efficient Architecture for Ultra Long FFTs in FPGAs and ASICs

Presentation transcript:

An Efficient Architecture for Ultra Long FFTs in FPGAs and ASICs
• Architecture optimized for fast ultra long FFTs
• Parallel FFT structure reduces external memory bandwidth requirements
• Lengths from 32K to 256M
• Optimized for continuous data FFTs
• Architecture reduces the algorithm to two smaller, manageable FFT engines

Ultra Long FFTs
• An ultra long FFT is one whose length exceeds the internal memory capacity of the FPGA or ASIC.
• System cost can also be reduced for moderate length FFTs in designs where the FPGA/ASIC size is driven by the memory requirements.
• This architecture puts most of the storage for the FFT off chip in relatively inexpensive SRAM, reducing the system cost.
• Ultra long FFTs have a similar structure to 2D FFTs.
• Cooley-Tukey algorithm
• Minimizes external memory IC count and bandwidth

What Ultra Long FFTs Need
• The logic requirements for a 1M FFT are only double those of a 1K FFT, while the memory requirements are about 1000 times greater.
• Logic for a 1M FFT easily fits into a large FPGA.
• Memory requirements exceed what is available even in a large FPGA.
[Figure: execution unit (logic) and memory requirements for continuous data FFTs of two lengths]
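
As a rough sanity check on the 2x and 1000x figures quoted on this slide, and assuming (this is not stated in the source) a pipelined radix-2 design in which logic scales with the number of butterfly ranks, log2(N), while storage scales with N:

```latex
\frac{\text{logic}(2^{20})}{\text{logic}(2^{10})} \approx \frac{\log_2 2^{20}}{\log_2 2^{10}} = \frac{20}{10} = 2,
\qquad
\frac{\text{memory}(2^{20})}{\text{memory}(2^{10})} = \frac{2^{20}}{2^{10}} = 1024 \approx 1000
```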

Computing N = N1 x N2
[Equations not preserved in this transcript: the slide gives the expression for the N1 x N2 FFT, the index substitutions applied to it, and the resulting form.]
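
For reference, the textbook Cooley-Tukey factorization for N = N1 x N2 can be written as follows. The index mapping and the symbols n1, n2, k1, k2 and W are a common convention chosen here for illustration; they may differ from the notation on the original slide.

```latex
\begin{align*}
X[k] &= \sum_{n=0}^{N-1} x[n]\, W_N^{nk}, \qquad W_N = e^{-j 2\pi / N}, \qquad N = N_1 N_2 \\
n &= N_1 n_2 + n_1, \qquad k = k_2 + N_2 k_1, \qquad 0 \le n_1, k_1 < N_1, \quad 0 \le n_2, k_2 < N_2 \\
W_N^{nk} &= W_N^{N_1 n_2 k_2}\, W_N^{N_1 N_2 n_2 k_1}\, W_N^{n_1 k_2}\, W_N^{N_2 n_1 k_1}
          = W_{N_2}^{n_2 k_2}\, W_N^{n_1 k_2}\, W_{N_1}^{n_1 k_1} \\
X[k_2 + N_2 k_1] &= \sum_{n_1=0}^{N_1-1} \left[ W_N^{n_1 k_2} \left( \sum_{n_2=0}^{N_2-1} x[N_1 n_2 + n_1]\, W_{N_2}^{n_2 k_2} \right) \right] W_{N_1}^{n_1 k_1}
\end{align*}
```

In this form the inner sum is a set of N1 transforms of length N2, the factor W_N^{n1 k2} is the twiddle multiply, and the outer sum is a set of N2 transforms of length N1.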

N = N1 x N2 Architecture
• Three banks of external QDR memory (single copy each)
• Two continuous data FFTs (N1, N2) inside the FPGA
• Twiddle multiply provides vector rotation between the N2 and N1 FFTs.
• Final matrix transpose for normal order output
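
The data flow listed on this slide can be modeled in software as three transposes wrapped around two short transforms and a twiddle multiply. The following is a minimal sketch, not the Dillon implementation: the function names (`dft`, `transpose`, `long_fft`) are hypothetical, a direct DFT stands in for each streaming FFT engine, plain lists stand in for the external memory banks, and the ordering (length-N2 transforms first) is an assumption based on the standard Cooley-Tukey index mapping.

```python
import cmath


def dft(x):
    """Direct O(n^2) DFT, standing in for a streaming FFT engine."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def transpose(flat, rows, cols):
    """Read a row-major rows x cols matrix out in column-major order."""
    return [flat[r * cols + c] for c in range(cols) for r in range(rows)]


def long_fft(x, n1, n2):
    """Length n1*n2 DFT built from length-n2 and length-n1 transforms."""
    n = n1 * n2
    assert len(x) == n
    # Transpose 1: input index is n1*row + col (n2 rows, n1 columns);
    # afterwards each of the n1 rows holds n2 samples spaced n1 apart.
    stage = transpose(x, n2, n1)
    # First engine: n1 transforms of length n2.
    stage = [v for r in range(n1) for v in dft(stage[r * n2:(r + 1) * n2])]
    # Twiddle multiply: rotate element (row, col) by W_n^(row * col).
    stage = [v * cmath.exp(-2j * cmath.pi * (i // n2) * (i % n2) / n)
             for i, v in enumerate(stage)]
    # Transpose 2: regroup so each of the n2 rows holds n1 values.
    stage = transpose(stage, n1, n2)
    # Second engine: n2 transforms of length n1.
    stage = [v for r in range(n2) for v in dft(stage[r * n1:(r + 1) * n1])]
    # Transpose 3: restore normal (natural) output order.
    return transpose(stage, n2, n1)
```

Calling `long_fft(x, 2, 4)` on any 8-point sequence should agree with `dft(x)` to within floating point rounding. Dropping the first transpose and the twiddle multiply leaves a row/column 2D transform.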

QDR SRAM
• Simultaneous reads and writes (separate address/data buses) allow a single bank of memory per memory transpose.
• DDR-style I/O: dual clock edge transfers with the FPGA result in a narrower data path.
• A single copy can be kept at each stage while maintaining continuous data flow.
• A special address sequence is employed so data isn't overwritten in continuous data applications, which reduces IC count.
• QDR with Virtex II Pro I/O: up to 150 MHz (read/write)
• QDR II with Virtex II Pro I/O: up to 200 MHz (read/write)

CORDIC for Twiddle Factor Generation
• CORDIC produces the sin/cos terms for an angle input.
• Almost N/2 twiddle factors are required.
• Storing them would take a very large ROM in an FPGA or ASIC.
• CORDIC is a natural fit; use the coordinate product as input.
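
To illustrate how CORDIC can replace a twiddle ROM, here is a minimal floating point sketch of rotation-mode CORDIC. It is a software model only: the function names, the 24-iteration default, and the quadrant folding are assumptions made for this example, and "coordinate product" is interpreted as the product of the two matrix indices that forms the twiddle exponent.

```python
import math


def cordic_sin_cos(angle, iterations=24):
    """Rotation-mode CORDIC for angle in [-pi/2, pi/2]; returns (cos, sin)."""
    atan_table = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Accumulated gain of the micro-rotations, removed by pre-scaling x.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = 1.0 / gain, 0.0, angle
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0          # rotate toward zero residual angle
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * atan_table[i]
    return x, y


def twiddle(product, n):
    """Approximate W_n^product = exp(-2j*pi*product/n) using CORDIC."""
    theta = 2.0 * math.pi * (product % n) / n
    if theta > math.pi:
        theta -= 2.0 * math.pi               # fold into (-pi, pi]
    if theta > math.pi / 2:
        c, s = cordic_sin_cos(theta - math.pi)
        c, s = -c, -s                        # cos(t) = -cos(t - pi), same for sin
    elif theta < -math.pi / 2:
        c, s = cordic_sin_cos(theta + math.pi)
        c, s = -c, -s
    else:
        c, s = cordic_sin_cos(theta)
    return complex(c, -s)
```

In hardware the multiplications by powers of two become shifts and the arctangent table has only one entry per iteration, which is why the approach avoids a ROM with close to N/2 entries.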

Matrix Transpose Address Sequence
• Allows a single copy for each matrix transpose.
• Operates on continuous data, one point read/written per clock cycle.
• Reduces memory IC count.
• Simple logic for sequence generation.
• Works for square or rectangular matrices.
• Sequence repeats after log2(N) sets.
• Write always follows read.
• Simple N = N1 x N2 = 8 example: [table not preserved in transcript]
• First and last matrix transposes go left to right in the table, second right to left.
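
One address sequence with the properties listed on this slide (single buffer, one read followed by one write per clock, period of at most log2(N) frames) is a bit rotation of the sample counter. The sketch below is an assumption about how such a scheme could work, not a reconstruction of the slide's table: `rotl` and `stream_transpose` are hypothetical names, both matrix dimensions are assumed to be powers of two, and a Python list stands in for the SRAM bank.

```python
def rotl(value, shift, width):
    """Rotate a width-bit value left by shift bits."""
    shift %= width
    mask = (1 << width) - 1
    return ((value << shift) | (value >> (width - shift))) & mask


def stream_transpose(frames, rows, cols):
    """Transpose a stream of rows x cols frames using one N-word buffer.

    At each clock the address is read (emitting the previous frame in
    transposed order) and then overwritten with the incoming sample.
    """
    n = rows * cols
    width = n.bit_length() - 1        # log2(N) address bits
    col_bits = cols.bit_length() - 1  # log2(cols)
    mem = [None] * n
    outputs = []
    for f, frame in enumerate(frames):
        out = []
        for t, sample in enumerate(frame):
            # The address pattern rotates by log2(cols) more bits each frame.
            addr = rotl(t, f * col_bits, width)
            out.append(mem[addr])     # read first ...
            mem[addr] = sample        # ... then write to the same location
        outputs.append(out)
    return outputs
```

With `rows=2, cols=4` (N = 8), the rotation amounts are 0, 2, 1 and then 0 again, so the address pattern repeats every 3 frames, which is log2(8). The output for each frame is the transposed previous frame; the first output frame is empty (`None`) because nothing has been stored yet.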

Fixed vs. Floating Point
• Numbers in a radix-2 FFT can grow by log2(N) bits, or 1 bit per butterfly rank.
• A 1M FFT can have 20 bits of growth. With 16 bit inputs, results would be 36 bits.
• Scaling is always required in fixed point versions.
• Fixed point scaling should be limited to every other rank, so 10 times for a 1M FFT, producing 26 bit results from 16 bit input.
• A floating point FFT maintains precision without overflowing.
• A floating point FFT uses approximately 8 times the logic of a similar precision fixed point version.
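
The word-length figures on this slide follow from simple bookkeeping, sketched below. The function name and the `scale_every` parameter are hypothetical; the model assumes 1 bit of growth per radix-2 rank and a 1-bit right shift at each scaled rank, and N is assumed to be a power of two.

```python
def fft_output_bits(n_points, input_bits, scale_every=None):
    """Result width of a radix-2 FFT with optional periodic 1-bit scaling."""
    ranks = n_points.bit_length() - 1          # log2(N) butterfly ranks
    scaled = ranks // scale_every if scale_every else 0
    return input_bits + ranks - scaled


# 1M-point FFT, 16 bit input:
#   no scaling               -> 16 + 20      = 36 bits
#   scaling every other rank -> 16 + 20 - 10 = 26 bits
unscaled = fft_output_bits(1 << 20, 16)
scaled = fft_output_bits(1 << 20, 16, scale_every=2)
```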

Virtex II Pro Performance – 512K FFT
• 80 MHz continuous data
• 1K FFT engine – 4 butterflies
• 512 FFT engine – 4 butterflies
• FFT engines at 160 MHz
• QDR memory at 80 MHz
• Real 14 bit input, complex 24 bit output
• Virtex II Pro device usage:
  – Slices: 12,500
  – BlockRAM
  – MULT18x18: 88
• Fits in XC2VP40

Other Uses of Architecture
• 2D FFT – Remove the first matrix transpose and twiddle multiply.
• Variable Length – Use variable length FFTs and dynamic matrix transpose blocks.
• Mixed Radix FFTs – Substitute a radix other than 2 for the 2nd FFT.
• Performance is easily increased with parallel input radix-2 FFTs and multiple paths to SRAM.

Other Dillon Engineering Resources
• ParaCore Architect (parameterized core builder)
• DSP Algorithms
• Mixed radix FFTs
• 2D FFTs for image processing
• Fixed or floating-point FFTs
• Floating point math library
• AES Cryptography
• System level DSP
• OFDM Transceivers
• Radar Processing on single FPGA
• Image Compression/Processing
• Hardware/Software SOC
• High speed Ethernet Appliances
• Linux Based SOC in FPGA
• MicroBlaze application