A New Class of High Performance FFTs Dr. J. Greg Nash Centar (www.centar.net) High Performance Embedded Computing (HPEC) Workshop.

Presentation transcript:

A New Class of High Performance FFTs
Dr. J. Greg Nash, Centar (www.centar.net)
High Performance Embedded Computing (HPEC) Workshop, September 2006

New Base-4 DFT Matrix Equation
- Traditional DFT matrix form: Z = CX
- New matrix form for DFT †: [...] (uses an element-by-element multiply)
- C_M1 and C_M2 contain only elements from the set [...]
  - C_M1 X and C_M2 Y^t only involve complex additions/subtractions
- Twiddle factor matrix W_M is of size N/4 x N/4 rather than the N x N of C
  - 16x fewer multiplies than the traditional DFT equation (Z = CX)
† J. G. Nash, "Computationally efficient systolic architecture for computing the discrete Fourier transform," IEEE Transactions on Signal Processing, Volume 53, Issue 12, Dec. 2005, pp. [...]-4651.
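The new matrix form did not survive in this transcript. As a hedged reconstruction, the two loops in the SPADE input code on the following slide can be read as the matrix operations below. The symbol \circ for the element-by-element multiply and the stated dimensions are assumptions inferred from the loop bounds, not taken from the slide or the cited paper.

```latex
% Reconstruction from the loop bounds (i = 1..4 and i = 1..N/4); notation assumed.
% C_{M1}: N/4 x 4,  X: 4 x N/4,  W_M and Y: N/4 x N/4,  C_{M2}: 4 x N/4,  Z: 4 x N/4
Y = W_M \circ \left( C_{M1} X \right), \qquad Z = C_{M2}\, Y^{t}

% Multiply count: Z = CX with an N x N matrix C needs N^2 complex multiplies,
% while the element-by-element product with the N/4 x N/4 matrix W_M needs
\left( \tfrac{N}{4} \right)^{2} = \frac{N^{2}}{16}
```

Under this reading the factor of 16 quoted on the slide is the ratio N^2 / (N/4)^2.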

Find Systolic Architecture Using SPADE †
- Mathematical Algorithm
- Input Code:
    for j to N/4 do
      for k to N/4 do
        Y[j,k] := WM[j,k]*add(CM1[j,i]*X[i,k], i=1..4);
      od;
      for k to 4 do
        Z[k,j] := add(CM2[k,i]*Y[j,i], i=1..N/4);
      od
    od;
- Automatic Search for Space-Time Transformations, T
- Simulator, Graphical Outputs
- FPGA Architectural Constraints
  - 2-D mesh array
  - fine grained PEs (registers, adder, mux)
  - linear arrays of multipliers, memory
- Objective Functions
† Symbolic Parallel Algorithm Development Environment
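To make the two loops concrete, here is a minimal Python sketch that evaluates them and compares the result against a direct DFT. The slide does not define CM1, WM, CM2, or the layout of X and Z, so the definitions used here are assumptions: one choice that is consistent with the loops when N is a multiple of 16, and not necessarily the one in the cited paper. The function names are hypothetical.

```python
import cmath

def dft_naive(x):
    """Reference O(N^2) DFT: Z[m] = sum_n x[n] * exp(-2*pi*j*m*n/N)."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * m * n / n_pts)
                for n in range(n_pts))
            for m in range(n_pts)]

def base4_dft(x):
    """Evaluate the slide's two loops with ASSUMED matrices (0-based indices):
         X[i][b]   = x[(N/4)*i + b]        (4 x N/4 input layout)
         CM1[a][i] = (-j)**(a*i)           (N/4 x 4, entries in {1, -1, j, -j})
         WM[a][b]  = exp(-2*pi*j*a*b/N)    (N/4 x N/4 twiddle factors)
         CM2[k][b] = (-j)**(k*b)           (4 x N/4, entries in {1, -1, j, -j})
       Z[k][a] then holds DFT output number a + (N/4)*k.
    """
    n_pts = len(x)
    if n_pts % 16 != 0:
        raise ValueError("this sketch assumes N is a multiple of 16")
    q = n_pts // 4
    unit = [1, -1j, -1, 1j]  # (-j)**0 .. (-j)**3
    X = [[x[q * i + b] for b in range(q)] for i in range(4)]
    Z = [[0j] * q for _ in range(4)]
    for a in range(q):                      # outer loop "for j to N/4"
        y_row = []
        for b in range(q):                  # Y[j,k] := WM[j,k]*add(CM1*X)
            # Multiplying by 1, -1, j, -j is only an add/subtract in hardware.
            s = sum(unit[(a * i) % 4] * X[i][b] for i in range(4))
            y_row.append(cmath.exp(-2j * cmath.pi * a * b / n_pts) * s)
        for k in range(4):                  # Z[k,j] := add(CM2[k,i]*Y[j,i])
            Z[k][a] = sum(unit[(k * b) % 4] * y_row[b] for b in range(q))
    return [Z[m // q][m % q] for m in range(n_pts)]
```

With these assumed definitions the only true complex multiplies are the (N/4) x (N/4) products with WM.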

Functional Operation
- Processing flow for DFT of length N = N1 * N2
  - Stage 1: N2 column DFTs (X_ci) of length N1
  - Stage 2: Twiddle multiplication
  - Stage 3: N1 row DFTs (X_ri) of length N2
- Systolic adder arrays for matrix multiplication
  - N1/4 x 4 array for column multiplies C_M1 X_ci and C_M2 Y^t_ci
  - N2/4 x 4 array for row multiplies C_M1 X_ri and C_M2 Y^t_ri
- N2/4 x 4 array is implemented virtually on one row of the N1/4 x 4 array
- Uses systolic 1-D array matrix multiplication
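The three-stage flow for N = N1 * N2 can be sketched in a few lines of Python. This is the standard row/column decomposition, with a plain O(n^2) DFT standing in for the base-4 systolic matrix multiplies, so it illustrates the data flow only. The index mapping (input n = N2*n1 + n2, output m = m1 + N1*m2) and all function names are assumptions, not taken from the slides.

```python
import cmath

def dft(x):
    """Direct DFT used for the short column and row transforms."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * m * t / n) for t in range(n))
            for m in range(n)]

def dft_three_stage(x, n1, n2):
    """DFT of length n1*n2 via column DFTs, twiddle multiply, row DFTs."""
    n = n1 * n2
    if len(x) != n:
        raise ValueError("len(x) must equal n1 * n2")
    # Arrange the input as an n1 x n2 matrix: a[r][c] = x[n2*r + c].
    a = [[x[n2 * r + c] for c in range(n2)] for r in range(n1)]
    # Stage 1: n2 column DFTs of length n1; cols[c][m1].
    cols = [dft([a[r][c] for r in range(n1)]) for c in range(n2)]
    # Stage 2: twiddle multiplication by exp(-2*pi*j*m1*c/n).
    tw = [[cols[c][m1] * cmath.exp(-2j * cmath.pi * m1 * c / n)
           for c in range(n2)]
          for m1 in range(n1)]
    # Stage 3: n1 row DFTs of length n2; rows[m1][m2].
    rows = [dft(tw[m1]) for m1 in range(n1)]
    # Output m1 + n1*m2 is found at rows[m1][m2].
    z = [0j] * n
    for m1 in range(n1):
        for m2 in range(n2):
            z[m1 + n1 * m2] = rows[m1][m2]
    return z
```

For example, `dft_three_stage(x, 32, 32)` follows the N = 1024 split with N1 = N2 = 32.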

FFT Systolic Architecture
- Simple PEs, locally connected
  - Higher clock speeds
  - Easier design/test/maintainability
  - Lower power
  - Efficient use of FPGA fabric
  - Simple control
- Small memory blocks (one per PE)
  - Faster read/write times
  - Lower power
- Linear structure (scales in N/S direction)
  - Matches fabric of FPGA linear distributed embedded elements (e.g., memory and multipliers)
- Example architecture for N = 1024 (N1 = N2 = 32)

Enhanced Functionality
- Transform size N not restricted to powers of two
  - N = 256n (n = 1, 2, 3, ...)
  - More reachable points
  - Uniform distribution of points
- Circuit is scalable
  - Any DFT size can be computed on the same hardware with sufficient memory
  - Larger FFT circuits constructed by replication of identical 4x4 PE array processing blocks
- Low computational latency
  - Pipeline depth small vs. traditional pipelined FFTs
- 1-D and 2-D transforms possible on the same circuit
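The "more reachable points" and "uniform distribution" claims can be illustrated by listing the supported sizes. This is a small sketch: the upper limit of 4096 in the usage note is an arbitrary choice for illustration and the function names are hypothetical.

```python
def sizes_256n(limit):
    """Transform sizes N = 256*n (n = 1, 2, 3, ...) up to and including limit."""
    return list(range(256, limit + 1, 256))

def sizes_power_of_two(limit, start=256):
    """Power-of-two transform sizes from start up to and including limit."""
    sizes = []
    size = start
    while size <= limit:
        sizes.append(size)
        size *= 2
    return sizes
```

Up to 4096, `sizes_256n` returns 16 evenly spaced sizes, while `sizes_power_of_two` returns only 5, with gaps that double each step.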

Block Floating Point/Floating Point Operation
- Multiple "regions", each with their own block floating point and floating point circuitry (32 regions in a 1024-point FFT)
  - Column DFTs use block floating point and row DFTs use floating point
  - Higher dynamic range and lower signal to noise ratio
- Number of regions increases with transform size
- Supports streaming FFTs
- Comparison of "single tone", random frequency and phase data sets (DR = dynamic range, "noise" = roundoff noise):
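For readers unfamiliar with block floating point, the general idea is that a block of values shares one exponent while each value keeps its own fixed-width mantissa. The sketch below shows that generic technique for real-valued samples. It is not the circuit described on the slide, and the function name and the `mantissa_bits` parameter are hypothetical.

```python
def to_block_floating_point(block, mantissa_bits=16):
    """Return (mantissas, exponent) with value ~= mantissa * 2**exponent.

    One exponent is shared by the whole block. It is the smallest
    non-negative shift that lets the largest magnitude fit in a signed
    mantissa of mantissa_bits bits.
    """
    limit = 2 ** (mantissa_bits - 1) - 1
    peak = max(abs(v) for v in block)
    exponent = 0
    while peak / (2 ** exponent) > limit:
        exponent += 1
    mantissas = [int(round(v / 2 ** exponent)) for v in block]
    return mantissas, exponent
```

With `mantissa_bits=8`, the block `[1000, -250, 3]` becomes mantissas `[125, -31, 0]` with shared exponent 3: the small sample is lost because the largest sample sets the scale. Giving each region its own exponent, as the slide describes, limits how far one large value can degrade the rest of the data.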

Performance Comparison: 256-point DFT
- Altera block floating point circuit
- "Streaming" (continuous data in and out)
- Comparable dynamic range and signal to (roundoff) noise ratio
- Both circuits mapped to Altera Stratix II EP2S15F484C3 FPGA
- Altera circuit from Megacore FFT v2.2.0
- Results from timing analysis (Altera Quartus 5.1 software)

Preliminary Figure of Merit
- Altera block floating point circuits
- "Streaming" (continuous data in and out)
- Comparable dynamic range and signal to noise ratio
- Circuits mapped to Altera Stratix II FPGAs
- Altera circuit from Megacore FFT v2.2.0
- FOM = Area (ALMs) x Throughput (Cycles/DFT) / Clock (MHz)
* Estimate (no timing analysis or layout)
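The figure of merit is straightforward to compute once the three quantities are known. Since cycles divided by MHz gives microseconds, the FOM is in effect area times time per DFT, so a smaller value is presumably better. In the sketch below the function name is hypothetical, and the numbers in the usage note are made-up placeholders, not results from the slides.

```python
def figure_of_merit(area_alms, cycles_per_dft, clock_mhz):
    """FOM = Area (ALMs) x Throughput (cycles/DFT) / Clock (MHz).

    cycles_per_dft / clock_mhz is the time per DFT in microseconds,
    so the result has units of ALM-microseconds.
    """
    if clock_mhz <= 0:
        raise ValueError("clock_mhz must be positive")
    return area_alms * cycles_per_dft / clock_mhz
```

For placeholder inputs of 2000 ALMs, 256 cycles per DFT and a 200 MHz clock, `figure_of_merit(2000, 256, 200.0)` evaluates to 2560.0.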

Performance Comparison: 256-point DFT

Comparative Features
- Transform size N not restricted to powers of two
- Circuit is scalable
- Uses block floating point and floating point
- Higher throughput
- Low computational latency
- Based on small, simple PE (adder), locally connected
- 1-D or 2-D transforms