High Performance Scalable Base-4 Fast Fourier Transform Mapping Greg Nash Centar 2003 High Performance Embedded Computing Workshop www.centar.net.

Slides:



Advertisements
Similar presentations
David Hansen and James Michelussi
Advertisements

The Study of Cache Oblivious Algorithms Prepared by Jia Guo.
Digital Kommunikationselektronik TNE027 Lecture 5 1 Fourier Transforms Discrete Fourier Transform (DFT) Algorithms Fast Fourier Transform (FFT) Algorithms.
The Discrete Fourier Transform. The spectrum of a sampled function is given by where –  or 0 .
ECE 734: Project Presentation Pankhuri May 8, 2013 Pankhuri May 8, point FFT Algorithm for OFDM Applications using 8-point DFT processor (radix-8)
Las Palmas de G.C., Dec IUMA Projects and activities.
Using Carry-Save Adders For Radix- 4, Can Be Used to Generate 3a – No Booth’s Slight Delay Penalty from CSA – 3 Gates.
Reconfigurable Computing S. Reda, Brown University Reconfigurable Computing (EN2911X, Fall07) Lecture 10: RC Principles: Software (3/4) Prof. Sherief Reda.
Lecture 9: Coarse Grained FPGA Architecture October 6, 2004 ECE 697F Reconfigurable Computing Lecture 9 Coarse Grained FPGA Architecture.
VLSI Communication SystemsRecap VLSI Communication Systems RECAP.
Multiobjective VLSI Cell Placement Using Distributed Simulated Evolution Algorithm Sadiq M. Sait, Mustafa I. Ali, Ali Zaidi.
Lecture #17 INTRODUCTION TO THE FAST FOURIER TRANSFORM ALGORITHM Department of Electrical and Computer Engineering Carnegie Mellon University Pittsburgh,
Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.
Lecture 26: Reconfigurable Computing May 11, 2004 ECE 669 Parallel Computer Architecture Reconfigurable Computing.
Applications of Systolic Array FTR, IIR filtering, and 1-D convolution. 2-D convolution and correlation. Discrete Furier transform Interpolation 1-D and.
Pipelining and Retiming 1 Pipelining  Adding registers along a path  split combinational logic into multiple cycles  increase clock rate  increase.
Simulated-Annealing-Based Solution By Gonzalo Zea s Shih-Fu Liu s
Introduction SYSC5603 (ELG6163) Digital Signal Processing Microprocessors, Software and Applications Miodrag Bolic.
ELEC692 VLSI Signal Processing Architecture Lecture 6
Automatic Generation of Customized Discrete Fourier Transform IPs Grace Nordin, Peter A. Milder, James C. Hoe, Markus Püschel Carnegie Mellon University.
1 Real time signal processing SYSC5603 (ELG6163) Digital Signal Processing Microprocessors, Software and Applications Miodrag Bolic.
Real time DSP Professors: Eng. Julian Bruno Eng. Mariano Llamedo Soria.
03/12/20101 Analysis of FPGA based Kalman Filter Architectures Arvind Sudarsanam Dissertation Defense 12 March 2010.
College of Nanoscale Science and Engineering A uniform algebraically-based approach to computational physics and efficient programming James E. Raynolds.
CHAPTER 8 DSP Algorithm Implementation Wang Weilian School of Information Science and Technology Yunnan University.
Efficient FPGA Implementation of QR
Constraint Directed CAD Tool For Automatic Latency-optimal Implementation of FPGA-based Systolic Arrays Greg Nash Reconfigurable Technology: FPGAs and.
J. Greg Nash ICNC 2014 High-Throughput Programmable Systolic Array FFT Architecture and FPGA Implementations J. Greg.
Important Components, Blocks and Methodologies. To remember 1.EXORS 2.Counters and Generalized Counters 3.State Machines (Moore, Mealy, Rabin-Scott) 4.Controllers.
Radix-2 2 Based Low Power Reconfigurable FFT Processor Presented by Cheng-Chien Wu, Master Student of CSIE,CCU 1 Author: Gin-Der Wu and Yi-Ming Liu Department.
COARSE GRAINED RECONFIGURABLE ARCHITECTURE FOR VARIABLE BLOCK SIZE MOTION ESTIMATION 03/26/
A Reconfigurable Low-power High-Performance Matrix Multiplier Architecture With Borrow Parallel Counters Counters : Rong Lin SUNY at Geneseo
Lecture 16: Reconfigurable Computing Applications November 3, 2004 ECE 697F Reconfigurable Computing Lecture 16 Reconfigurable Computing Applications.
ISSS 2001, Montréal1 ISSS’01 S.Derrien, S.Rajopadhye, S.Sur-Kolay* IRISA France *ISI calcutta Combined Instruction and Loop Level Parallelism for Regular.
By: Daniel BarskyNatalie Pistunovich Supervisors: Rolf HilgendorfInna Rivkin 10/06/2010.
ESPL 1 Wordlength Optimization with Complexity-and-Distortion Measure and Its Application to Broadband Wireless Demodulator Design Kyungtae Han and Brian.
Z TRANSFORM AND DFT Z Transform
1 SYNTHESIS of PIPELINED SYSTEMS for the CONTEMPORANEOUS EXECUTION of PERIODIC and APERIODIC TASKS with HARD REAL-TIME CONSTRAINTS Paolo Palazzari Luca.
ACCESS IC LAB Graduate Institute of Electronics Engineering, NTU Under-Graduate Project Case Study: Single-path Delay Feedback FFT Speaker: Yu-Min.
Automatic Evaluation of the Accuracy of Fixed-point Algorithms Daniel MENARD 1, Olivier SENTIEYS 1,2 1 LASTI, University of Rennes 1 Lannion, FRANCE 2.
MIT Lincoln Laboratory HPEC JML 28 Sep 2004 Mapping Signal Processing Kernels to Tiled Architectures Henry Hoffmann James Lebak [Presenter] Massachusetts.
A Programmable Single Chip Digital Signal Processing Engine MAPLD 2005 Paul Chiang, MathStar Inc. Pius Ng, Apache Design Solutions.
Copyright © 2004, Dillon Engineering Inc. All Rights Reserved. An Efficient Architecture for Ultra Long FFTs in FPGAs and ASICs  Architecture optimized.
Graphical Design Environment for a Reconfigurable Processor IAmE Abstract The Field Programmable Processor Array (FPPA) is a new reconfigurable architecture.
Speaker: Darcy Tsai Advisor: Prof. An-Yeu Wu Date: 2013/10/31
Full Tree Multipliers All k PPs Produced Simultaneously Input to k-input Multioperand Tree Multiples of a (Binary, High-Radix or Recoded) Formed at Top.
Mapping of Regular Nested Loop Programs to Coarse-grained Reconfigurable Arrays – Constraints and Methodology Presented by: Luis Ortiz Department of Computer.
A New Class of High Performance FFTs Dr. J. Greg Nash Centar ( High Performance Embedded Computing (HPEC) Workshop.
Fast Fourier Transforms. 2 Discrete Fourier Transform The DFT pair was given as Baseline for computational complexity: –Each DFT coefficient requires.
VLSI Design of 2-D Discrete Wavelet Transform for Area-Efficient and High- Speed Image Computing - End Presentation Presentor: Eyal Vakrat Instructor:
EEL 5722 FPGA Design Fall 2003 Digit-Serial DSP Functions Part I.
An FFT for Wireless Protocols Dr. J. Greg Nash Centar ( HAWAI'I INTERNATIONAL CONFERENCE ON SYSTEM SCIENCES Mobile.
VLSI SP Course 2001 台大電機吳安宇 1 Why Systolic Architecture ? H. T. Kung Carnegie-Mellon University.
1 Paper reading A New Approach to FFT Processor Speaker: 吳紋浩 第六組 洪聖揚 吳紋浩 Adviser: Prof. Andy Wu Mentor: 陳圓覺.
HPEC 2003 Linear Algebra Processor using FPGA Jeremy Johnson, Prawat Nagvajara, Chika Nwankpa Drexel University.
Fang Fang James C. Hoe Markus Püschel Smarahara Misra
Automatic Generation of Systolic Array Designs For Reconfigurable Computing Greg Nash Engineering of Reconfigurable Systems and Algorithms (ERSA '02) International.
Pipelining and Retiming 1
Linear Filters in StreamIt
A Quantitative Analysis of Stream Algorithms on Raw Fabrics
Centar ( Global Signal Processing Expo
Lecture #17 INTRODUCTION TO THE FAST FOURIER TRANSFORM ALGORITHM
Multiplier-less Multiplication by Constants
Architecture Synthesis
Real time signal processing
Introduction SYSC5603 (ELG6163) Digital Signal Processing Microprocessors, Software and Applications Miodrag Bolic.
Introduction SYSC5603 (ELG6163) Digital Signal Processing Microprocessors, Software and Applications Miodrag Bolic.
Speaker: Chris Chen Advisor: Prof. An-Yeu Wu Date: 2014/10/28
Lecture #17 INTRODUCTION TO THE FAST FOURIER TRANSFORM ALGORITHM
Reconfigurable Computing (EN2911X, Fall07)
Presentation transcript:

High Performance Scalable Base-4 Fast Fourier Transform Mapping Greg Nash Centar 2003 High Performance Embedded Computing Workshop

Outline Base-4 transformation for calculating DFT Mapping methodology Direct form DFT architecture FFT architecture Performance

Discreet Fourier Transform Mathematical form: Matrix form Z=CX: (N=16) Multiplications = N 2

Base-4 Matrix Equation Find reordering permutation P DFT matrix equation becomes where

Base-4 Coefficient Matrix

Base-4 DFT Matrix Equation (Compact Form) “ ”= element by element multiply Form for N=16 General Form

Base-4 DFT Equation Characteristics Coefficient matrices represent series of 4-point transforms:  Takes advantage of reduced arithmetic with radix r = 4 butterfly, but transform length not limited to N = r m  Transform length must be divisible by 16 C M 1 and C M 2 contain only elements from the set  C M 1 X and C M 2 Y t only involve complex additions Twiddle factor matrix W M is of size N/4 x N/4 rather than N x N  x16 fewer multiplies than original DFT equation (Z=CX) where

Systolic Array Example: Matrix Multiply Project along time axis Systolic Array: Each intersection point corresponds to a “processing element” (PE) that receives data from its neighbors, does a multiply-add, and passes the result to adjacent PEs, once per time cycle. d e c “Space-Time” View Algorithm: Space-time mapping: computations at {i,j,k} “mapped” to indices {time,x,y}

Find Systolic Architecture Using SPADE † Mathematical Algorithm Automatic Search for Space-Time Transformations, T Input Code Simulator, Graphical Outputs for j to N/4 do for k to N/4 do Y[j,k]:=WM[j,k]*add(CM1[j,i]*X[i,k],i=1..4); od; for k to 4 do Z[k,j] := add(CM2[k,i]*Y[j,i],i=1..N/4); od od; † Symbolic Parallel Algorithm Development Environment Variable position, area, regularity, bandwidth Architectural Constraints Objective Functions

SPADE Functionality SPADE accepts input statements of the affine form –Where A x,B y /a x,b y are integer matrices/vectors, S is the dimension of the algorithm space and the “depends on” includes commutative and associative operators: min, max, ,  SPADE finds latency optimal systolic designs subject to constraints imposed by scheduling, localization, reindexing, and allocation Secondary objective functions used to select architectures are minimum area, maximum regularity and minimum network bandwidth

Systolic Array Designs: Minimum Area Latency (cycles) = N/2 + 8 Six unique designs Throughput (cycles/block) = N/4 + 6 W M mapped to same space-time location as Y IM1 and IM2 variables (SPADE created) perform matrix multiply/adds Y X Z CM1 IM2 IM1 CM2 Space-Time Views (N=64) Example Systolic Array Views (N=64) X Y Z CM1 X Multipliers Adders N/4=16 4

Systolic Array Designs: Maximum Regularity Two unique designs found Throughput and latency optimal Latency (cycles) = N/2 + 8 Throughput (cycles/block) = N/4 +1 W M mapped to same space-time position as Y Space-time view (N=32) Adders Multipliers Systolic Arrays (N=32) X Z CM2 Y IM1 CM2 X Z Y CM1 CM2 Z CM1 X Y CM2 Transformations CM1 Z N/4 = 8 4

Systolic Architecture to Array Design Systolic Architecture (N=32)Array Design (N=32)  Y Z CM1 X CM2

Altera Stratix FPGA: DFT Mapping Systolic DFT Array

1D FFT via Factorization Factor N = N 1 * N 2 Creat a 2-D matrix with N 1 rows by N 2 columns, (assume N 1 > N 2 ), Do N 2 1-D “column” DFTs followed by N 1 “row” DFTs: If N 1  N 2 then (linear) array size can be reduced from O( N 1 N 2 ) to O(N 1 ) with minimal effect on throughput: –Cycles for N/4 array (no factorization) = N/4 + 1 –Cycles for N 1 /4 array = N 1 (N 1 /4 + 1) + N 1 (N 1 /4 + 1) + twiddle mult  N/2 Can do 2-D DFT by not performing twiddle multiplication W N Use base-4 DFT mapping to do all row/column DFTs

Base-4 Factorization Architecture N = 1024 points N = N 1 * N 2 N 1 = N 2 = 32 Uses both of the two optimal systolic designs Twiddle multiplications not shown Throughput/latency optimal except for interstage delay 2 “column” DFTs 2 “row” DFTs Z DFT Output DFT Input Z X X CM2 Z Z CM1 Y Two Space-Time Views (only two of N 1 iterations shown)

Two DFT Architectures Combined Shown for N = 1024 points N = N 1 * N 2 N 1 = N 2 = 32 M = 512 bits (16 bit word)

1 st to 2 nd Stage Data Formatting Problem (32 Point DFT) DFT data positions of 1 st stage output sequences Desired data positions for input sequences to 2 nd stage X Z Y CM1 CM2 CM1 X Y CM2 Z.....

Interstage Data Formatting via “On-the-Fly” Permutations New code with matrix rotation steps New DFT first stage output sequences.....

1-D DFT Performance Estimates Register transfer level behavioral simulation of 1024 point DFT Partially populated layout Timing analysis using Altera Stratix EP1S60 FPGA chip 16 bit fixed-point word length Based on:

Latency Base-4 FFT pipeline depth is nominally N 1 /4+ 9 << N Latency (cycles)  1/Throughput (cycles -1 ) when complete X available N 1 /4 9 longest path (red)

Partitioning to Scale Computations to Application Use an array “section” to perform partially processed result Partial results accumulated at output Memory needed scales with partition size  Fully Parallel ArrayPartitioned Array

Non-Square 2-D Inputs (N 1  N 2 ) Example: 512-point FFT (N 1 = 32, N 2 = 16) On-the-fly permutations for correct data placement Rows: Compute 2 sets of point DFTs Columns: Compute point DFTs

Example Resource Usage † : 1024 Point DFT † Altera Stratix EP1S60F1508C6 FPGA chip (16 bit fixed point)

Base-4 DFT Architecture Summary High performance 1-D and 2-D DFTs Based on latency and throughput optimal parallel circuits Transform size not restricted to N = r m Latency 1/throughput when entire input block available Architecture is scaleable and easily parameterized Design is simple, regular, local and synchronous Fast convolutions naturally supported Natural partitioning strategies exist Pseudo-linear architecture good fit to latest generation of FPGA chips

More Information at “Automatic Generation of Systolic Array Designs For Reconfigurable Computing”, Proc. Engineering of Reconfigurable Systems and Algorithms (ERSA '02), International Multiconference in Computer Science, Las Vegas, Nevada, June 24, –General description of SPADE –Faddeev algorithm (Find CX+D, given AX=B, X is unknown) Constraint Directed CAD Tool For Automatic Latency-Optimal Implementations, SPIE ITCom 2002, Boston, Massachussetts, July 29-August 2, –Use of constraints as a filter of systolic designs –2-D Discreet Fourier transforms using base-4 architecture