A Systolic FFT Architecture for Real Time FPGA Systems.


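Radar Processing Application
[Figure: block diagram of a 32K correlation of inputs x and y. Each input passes through ADC (1.2 GSPS), I/Q, FFT, and FIFO blocks, with a conjugate on one path, followed by multiply and add stages.]
–8K FFT bottleneck
–Real-time
–Complex 0.6 GSPS input (16 bits)
–1.2 GSPS output (12 bits)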

Evaluation Scorecard
The design changes will be scored on the following metrics:
–Length of FFT (Size)
–Butterflies (Fly)
–Adder/subtractors (Add)
–IO pins (Pins)
–Multipliers (Mult)
–Shift registers (Shift)
Scorecard: Pins ???; Fly ???; Mult ???; Add ???; Shift ???

Outline
Introduction
Parallel architecture
–Data flow graph
–Effects of serial input
Systolic architecture
Performance summary
Conclusions

Baseline Parallel Architecture
–Parallel FFT
–Butterfly structure
–Removes redundant calculation
Scorecard: Pins 448, 229K; Fly 32, 53K; Shift 0, 0
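The butterfly counts on the scorecard can be checked with a short calculation. This is a sketch that assumes a radix-2 FFT with N/2 butterflies in each of log2(N) stages; the function name is illustrative. Under that assumption the two scorecard values, 32 and 53K, correspond to N = 16 and N = 8192.

```python
def butterfly_count(n: int) -> int:
    """Butterflies in a fully parallel radix-2 FFT of length n (assumed structure)."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two")
    stages = n.bit_length() - 1      # log2(n) for a power of two
    return (n // 2) * stages         # n/2 butterflies per stage


print(butterfly_count(16))    # small example size
print(butterfly_count(8192))  # size of the target FFT
```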

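Complex Butterfly
[Figure: butterfly with inputs u, v and outputs x, y, built from a multiplier, an adder, and a subtractor.]
Butterfly contains
–1 complex addition
–1 complex subtraction
–1 complex, constant multiply
Scorecard: Pins 448, 229K; Fly 32, 53K; Shift 0, 0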
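A minimal sketch of one butterfly, assuming the constant multiply is applied to v before the add and subtract (the ordering is not stated on the slide). The constant is called w here; that name is an assumption, as the slide does not name it.

```python
def butterfly(u: complex, v: complex, w: complex) -> tuple[complex, complex]:
    """One radix-2 butterfly: 1 complex multiply, 1 complex add, 1 complex subtract."""
    t = w * v          # constant (twiddle) multiply, assumed to act on v
    x = u + t          # complex addition
    y = u - t          # complex subtraction
    return x, y


print(butterfly(1 + 0j, 1 + 0j, 1 + 0j))
```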

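Complex Addition
Complex addition adds the real and imaginary parts separately: 2 adds
[Figure: two adders, one producing the real part from a and c, the other producing the imaginary part from b and d.]
Scorecard: Pins 448, 229K; Fly 32, 53K; Add 128, 213K; Shift 0, 0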

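Complex Multiply
The FOIL method of multiplying complex numbers: 4 multiplies and 2 adds
[Figure: four multipliers feeding a subtractor for the real part and an adder for the imaginary part; inputs a, b, c, d.]
Scorecard: Pins 448, 229K; Fly 32, 53K; Mult 128, 213K; Add 192, 320K; Shift 0, 0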
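As a sketch of the FOIL expansion, assume the operands are a + jb and c + jd, following the labels used on the complex addition slide. The function name is illustrative.

```python
def complex_multiply_foil(a: float, b: float, c: float, d: float) -> tuple[float, float]:
    """(a + jb)(c + jd) using 4 real multiplies and 2 real adds."""
    real = a * c - b * d   # 2 multiplies, 1 subtract
    imag = a * d + b * c   # 2 multiplies, 1 add
    return real, imag


print(complex_multiply_foil(1, 2, 3, 4))
```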

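Efficient Complex Multiply
Another approach requires fewer multiplies: 3 multiplies and 5 adds
[Figure: three multipliers with pre- and post-adders producing the real and imaginary parts; inputs a, b, c, d.]
Scorecard: Pins 448, 229K; Fly 32, 53K; Mult 96, 159K, 75%; Add 288, 480K, 150%; Shift 0, 0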
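One well-known factorization that reaches 3 multiplies and 5 adds is sketched here. The slide does not say which variant the design uses, so the intermediate terms k1, k2, k3 are an assumption; the operands are again taken as a + jb and c + jd.

```python
def complex_multiply_3mult(a: float, b: float, c: float, d: float) -> tuple[float, float]:
    """(a + jb)(c + jd) using 3 real multiplies and 5 real adds."""
    k1 = c * (a + b)       # add 1, multiply 1
    k2 = a * (d - c)       # add 2, multiply 2
    k3 = b * (c + d)       # add 3, multiply 3
    real = k1 - k3         # add 4: ac - bd
    imag = k1 + k2         # add 5: ad + bc
    return real, imag


print(complex_multiply_3mult(1, 2, 3, 4))
```

The trade is one multiplier for three extra adders per butterfly, which matches the direction of the scorecard change (multipliers down to 75%, adders up to 150%).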

Parallel-Pipelined Architecture
–A pipelined version
–IO bound
–100% efficient
Scorecard: Pins 448, 229K; Fly 32, 53K; Mult 96, 159K; Add 288, 480K; Shift 0, 0

Serial Input
–A serial version
–IO rate matches A/D
–6.25% efficient
Scorecard: Pins 28, .01%; Fly 32, 53K; Mult 96, 159K; Add 288, 480K; Shift 0, 0
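The 6.25% figure is consistent with a simple utilization argument, sketched here under the assumption that the small example is a 16-point FFT whose parallel hardware could accept 16 samples per cycle while the serial input delivers only one. The same ratio appears in the pin counts (28 versus 448). The function name is illustrative.

```python
def serial_utilization(samples_in_per_cycle: int, parallel_width: int) -> float:
    """Fraction of the parallel hardware kept busy by the input stream (assumed model)."""
    return samples_in_per_cycle / parallel_width


print(serial_utilization(1, 16))   # one sample per cycle into a 16-wide design
print(28 / 448)                    # pin count ratio from the scorecard
```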

Outline
Introduction
Parallel architecture
Systolic architecture
–Serial implementation
–Application specific optimizations
Performance summary
Conclusions

Serial Architecture
The parallel architecture can be collapsed
–One butterfly per stage
–Consumes 1 sample per cycle
–Same latency and throughput
–More efficient design
50% efficiency
[Figure: Stage 1, Stage 2, Stage 3, Stage 4]
Scorecard: Pins 28; Fly 4, 13, .03%; Mult [...] %; Add [...] %; Shift 22, 12K

High Level View
[Figure: chain of stage cells: Stage 1, Stage 2, Stage 3, ...]
Replace complex structure with an abstract cell which contains:
–FIFOs
–Butterfly
–Switch network
Scorecard: Pins 28; Fly 4, 13; Mult 12, 39; Add 36, 117; Shift 22, 12K
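The serial scorecard values follow from one butterfly per stage. The sketch below assumes log2(N) stages and the 3-multiply, 5-add complex multiplier from the earlier slide, which gives 9 real adders per butterfly (5 in the multiplier plus 2 each for the complex add and subtract). The function name and return layout are illustrative.

```python
def serial_resources(n: int) -> tuple[int, int, int]:
    """Return (butterflies, multipliers, adders) for a one-butterfly-per-stage design."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two")
    stages = n.bit_length() - 1        # log2(n) stages, one butterfly each
    multipliers = 3 * stages           # 3 real multiplies per butterfly
    adders = (5 + 2 + 2) * stages      # multiplier adds + complex add + complex subtract
    return stages, multipliers, adders


print(serial_resources(16))
print(serial_resources(8192))
```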

8192-Point Architecture
–Requires 13 stages
–Fixed point arithmetic
–Varies the dynamic range to increase accuracy
–Overflow replaced with saturated value
–Multipliers limit design to 18 bits and 150 MHz
–Achieves 70 dB of accuracy
[Figure: fixed-point formats along the pipeline: int 14 frac 5; int 13 frac 6; int 12 frac 7; int 11 frac 8; int 10 frac 9; int 9 frac 10; int 8 frac [...]; [...] 4 int 4 frac]
Scorecard: Pins 28; Fly 4, 13; Mult 12, 39; Add 36, 117; Shift 22, 12K
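Saturating fixed-point quantization with an adjustable split between integer and fraction bits can be sketched as follows. This is an illustration only: the separate sign bit, the rounding mode (Python's round, which rounds halves to even), and the function name are assumptions, and the bit widths in the usage lines are not the design's actual formats.

```python
def quantize_saturating(x: float, int_bits: int, frac_bits: int) -> float:
    """Quantize x to signed fixed point, clamping on overflow instead of wrapping."""
    scale = 1 << frac_bits
    max_code = (1 << (int_bits + frac_bits)) - 1   # largest positive code
    min_code = -(1 << (int_bits + frac_bits))      # most negative code
    code = round(x * scale)                        # assumed rounding mode
    code = max(min_code, min(max_code, code))      # saturate on overflow
    return code / scale


# Same word length, different dynamic range:
print(quantize_saturating(20.3, 4, 4))   # saturates at the 4.4 maximum
print(quantize_saturating(20.3, 5, 3))   # fits, at coarser resolution
```

Moving one bit from the fraction to the integer part doubles the representable range at the cost of half the resolution, which is the trade the slide describes when it varies the dynamic range.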

Increase Parallelism
–Add more pipelines
–Design limited to 150 MHz by multipliers
–I/Q module generates 600 MSPS
–Meets real-time requirement through parallelism
Scorecard: Pins [...] %; Fly [...] %; Mult [...] %; Add [...] %; Shift 16, 12K, 100%
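The number of pipelines follows from the two rates on this slide. The sketch assumes each pipeline consumes one complex sample per clock cycle, as stated for the serial architecture; the function name is illustrative.

```python
def pipelines_needed(sample_rate_msps: int, clock_mhz: int) -> int:
    """Smallest number of one-sample-per-cycle pipelines that keeps up with the input."""
    return -(-sample_rate_msps // clock_mhz)   # integer ceiling division


print(pipelines_needed(600, 150))
```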

Simplification
–Target application allows a specific simplification
–Pads a 4096-point sequence with 4096 zeros
–Removes 1st stage multipliers and adders
–Achieves 100% efficiency in steady state
Scorecard: Pins [...] %; Fly 16, 52; Mult [...] %; Add [...] %; Shift 4, 8K, 67%
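Why zero padding removes first-stage arithmetic can be sketched with the butterfly described earlier, assuming the first stage pairs each real sample u with a sample v from the zero-padded half and that the multiply acts on v. With v = 0 both outputs equal u, so no multiplier or adder is needed. The multiplier count below additionally assumes 4 pipelines and 3 real multipliers per butterfly; all names are illustrative.

```python
def butterfly(u: complex, v: complex, w: complex) -> tuple[complex, complex]:
    """Butterfly with the constant multiply assumed to act on v."""
    t = w * v
    return u + t, u - t


def multipliers_after_simplification(n: int, pipelines: int, per_butterfly: int = 3) -> int:
    """Real multipliers left when the first of log2(n) stages needs none."""
    stages = n.bit_length() - 1
    return pipelines * (stages - 1) * per_butterfly


print(butterfly(2 + 1j, 0j, 0.6 - 0.8j))           # both outputs equal u
print(multipliers_after_simplification(8192, 4))   # compare with the Results slide
```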

Outline
Introduction
Parallel architecture
Systolic architecture
Performance summary
–Power, operations per second
–FPGA resources, frequency
–Latency, throughput
Conclusions

Results
The current implementation has been placed on a Virtex II 8000 and verified at 150 MHz
Power: [...] C; GOPS: [...]; GOPS/Watt: [...]
FPGA resources (XC2V8000)
–Multipliers: 144 (85%)
–LUTs and SRLs: 39,453 (42%)
–BlockRAM: 56 (33%)
–Flip-flops: 35,861 (38%)
Frequency: 150 MHz
Latency: 1127 cycles
Throughput: 1.2 GSPS
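The latency in cycles can be converted to time with the reported clock frequency. This is plain arithmetic on the two figures above; the function name is illustrative.

```python
def cycles_to_microseconds(cycles: int, clock_hz: float) -> float:
    """Convert a cycle count to microseconds at the given clock frequency."""
    return cycles / clock_hz * 1e6


print(round(cycles_to_microseconds(1127, 150e6), 2))   # latency at 150 MHz, in microseconds
```

The result is about 7.5 µs, close to the 7.6 µsec figure quoted in the conclusions.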

Outline
Introduction
Parallel architecture
Systolic architecture
Performance summary
Conclusions
–Applicability to other platforms
–Future work

Conclusions
Created a high performance, real-time FFT core
–Low power (3.9 GOPS/Watt)
–High throughput (1.2 GSPS), low latency (7.6 µsec/sample)
–Fixed-point (18 bits), high accuracy (70 dB)
General architecture
–Extendable to a generic FPGA core
–Retargetable to ASIC technology
Future work
–Develop a parameterizable IP core generator

Sources
Fredrik Edman
Preston Jackson, Cy Chan, Charles Rader, Jonathan Scalera, and Michael Vai
HPEC, September 2004