Accelerating DSP Algorithms Using FPGAs
Sean Gallagher, DSP Specialist, Xilinx Inc.
Gallagher, P188/MAPLD 2004

Presentation transcript:

Why DSP in FPGAs
Availability of fast analog-to-digital converters (ADCs)
– Enables digital methods for functions traditionally done in RF components
Massively parallel processing
– FPGAs may have several hundred embedded multipliers on-chip
– One FPGA can replace many DSP processors

Architectural Considerations
FPGA architectures are vendor specific
– Unlike ASICs, no two are alike
FPGA vendors develop distinct competencies
– In device architecture design
– In intellectual property (DSP functions, bus controllers, etc.)
– In design tool flows
Vendor-independent HDL can be written, but it usually achieves mediocre results in clock speed and instantiated design size

FPGAs Are Massively Parallel Computing Machines
[Figure: channels ch1 to ch4 at 20 MHz samples; multi-channel low-pass filter (LPF) at 80 MHz samples]
FPGAs are ideally suited for multi-channel DSP designs
– Many low-sample-rate channels can be multiplexed (e.g. TDM) and processed in the FPGA at a high rate
– Interpolation (using zeros) can also drive sample rates higher
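The multiplexing idea can be sketched in software. This is a minimal behavioural model, not the design on the slide: the function names, the tap values, and the sample data are all illustrative assumptions. It shows one FIR structure serving four interleaved channels when every tap delay is stretched to four samples.

```python
def fir(samples, taps):
    """Plain single-channel FIR filter with zero initial state."""
    out = []
    for n in range(len(samples)):
        out.append(sum(t * samples[n - i] for i, t in enumerate(taps) if n - i >= 0))
    return out


def tdm_fir(interleaved, taps, channels):
    """One FIR structure shared by `channels` time-multiplexed streams.

    Each tap delay is `channels` samples long, so a sample is only ever
    combined with earlier samples of its own channel.
    """
    out = []
    for n in range(len(interleaved)):
        acc = 0
        for i, t in enumerate(taps):
            idx = n - i * channels
            if idx >= 0:
                acc += t * interleaved[idx]
        out.append(acc)
    return out


# Hypothetical data: four low-rate channels multiplexed into one high-rate stream.
chans = [[1, 2, 3, 4], [10, 20, 30, 40], [5, 5, 5, 5], [0, 1, 0, 1]]
stream = [c[n] for n in range(4) for c in chans]
taps = [1, 2, 1]
shared = tdm_fir(stream, taps, channels=4)

# De-interleaving the shared output recovers each channel's own filter output.
per_channel = [shared[c::4] for c in range(4)]
```

In hardware the longer tap delays are where an addressable shift register helps, since the multipliers and adders are reused unchanged.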

FPGAs Allow Space/Speed Trade-offs
Q = (A × B) + (C × D) + (E × F) + (G × H) can be implemented in parallel
[Figure: four multipliers taking inputs A to H, with the products summed to give Q]
But is this the only way in the FPGA?

Customize Architectures to Suit Your Ideal Algorithms
[Figure: parallel, semi-parallel, and serial multiply-accumulate structures, ranging from optimized for speed to optimized for area]
FPGAs allow area (cost) / performance trade-offs
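A rough software analogy of the three structures follows. It is only a sketch: the function names and the resource bookkeeping are assumptions made for illustration, and pipeline latency and adder cost are ignored. All three compute the same sum of products while trading multiplier count against clock cycles.

```python
def parallel_sum_of_products(pairs):
    """One multiplier per pair: all products are formed in the same clock."""
    total = sum(a * b for a, b in pairs)
    return total, {"multipliers": len(pairs), "clocks": 1}


def semi_parallel_sum_of_products(pairs, multipliers=2):
    """A few multipliers reused over several clocks."""
    acc = 0
    clocks = 0
    for i in range(0, len(pairs), multipliers):
        acc += sum(a * b for a, b in pairs[i:i + multipliers])
        clocks += 1
    return acc, {"multipliers": multipliers, "clocks": clocks}


def serial_sum_of_products(pairs):
    """A single multiplier-accumulator reused once per pair."""
    acc = 0
    clocks = 0
    for a, b in pairs:
        acc += a * b
        clocks += 1
    return acc, {"multipliers": 1, "clocks": clocks}


# Hypothetical operands standing in for (A, B), (C, D), (E, F), (G, H).
pairs = [(1, 2), (3, 4), (5, 6), (7, 8)]
q_par, cost_par = parallel_sum_of_products(pairs)
q_semi, cost_semi = semi_parallel_sum_of_products(pairs)
q_ser, cost_ser = serial_sum_of_products(pairs)
```

The serial version only keeps up with the data if the structure is clocked faster than the sample rate, which is the over-clocking trade discussed later in the deck.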

Exploiting the Xilinx Architecture for DSP Functions
Memory blocks that can be configured as ROMs, dual-port RAMs, or FIFOs
Embedded 18x18 multipliers that can be ganged to form a 35x35-bit multiply
SRL16 shift registers
– A patented technique for turning the 4-input lookup table (2 per slice) into an addressable shift register
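To make "addressable shift register" concrete, here is a small behavioural model. It is an illustrative sketch only: the class and method names are assumptions, not a Xilinx API, and it models one bit of width. The 16 storage bits of a 4-input lookup table act as a shift chain, and the 4 address inputs select which delayed bit is read.

```python
class SRL16:
    """Behavioural model of a 16-deep, 1-bit-wide addressable shift register."""

    DEPTH = 16

    def __init__(self):
        self.bits = [0] * self.DEPTH

    def clock(self, data_in):
        """Shift in one bit; the oldest bit falls off the end."""
        self.bits = [data_in & 1] + self.bits[:-1]

    def read(self, addr):
        """Return the bit shifted in addr + 1 clocks ago (addr in 0..15)."""
        return self.bits[addr]


srl = SRL16()
for bit in [1, 0, 1, 1, 0]:
    srl.clock(bit)

newest = srl.read(0)       # the bit shifted in on the most recent clock
four_back = srl.read(4)    # the bit shifted in five clocks ago
```

Changing the address changes the delay without rewiring anything, so a single lookup table can stand in for a delay of up to 16 clocks.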

Using SRL16E to Increase Compute Density
[Figure: FIR filter tap chains with coefficients k3 to k0, drawn for a single 20 MHz channel and for multiple channels]
SRL16E takes the same area as one LUT. It can be used for up to 16 channels.

Xilinx System Generator for DSP
– System Generator is a block set that resides in the Simulink/MATLAB environment
– System Generator blocks are bit-true and cycle-true models of Xilinx's DSP intellectual property (IP) cores
– Hardware DSP design capture is significantly accelerated by automatic code generation from Simulink

Algorithm Instantiation Considerations
There are cases where following a textbook approach does not translate into an efficient instantiation
Manipulating the algorithm to exploit features of the architecture can lead to much more efficient instantiations
Modifying a textbook algorithm can mean changing how the math is executed, as well as over-clocking structures so that they can be time-division multiplexed

Example 1: Digital Down Conversion
In digital down conversion we need to filter before we decimate to prevent aliasing
These filters can get rather large because the transition band is narrow in relation to the sample rate
A textbook solution is to reduce the sample rate in several steps

Digital Down Conversion
The following 3 slides show three different filter designs for the down conversion of a 0.625 MHz band of interest that is centered at 20 MHz and sampled at 61.44 MHz
– The decimation rate is 25
– The final sample rate will be 61.44/25 = 2.4576 MHz
The next slide shows the filter design needed if decimating by 25 in one step
– The total coefficient count is 184
The two slides after the next show the two filters necessary to decimate in steps, decimating by 5 in each step
– The total coefficient count is 11 + 43 = 54
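The rate arithmetic can be checked in a few lines. This is only a worked calculation using the figures quoted on the slide; the variable names are illustrative.

```python
fs_in = 61.44e6                      # input sample rate in Hz

# One step: decimate by 25 with a single 184-coefficient filter.
single_stage_rate = fs_in / 25
single_stage_coeffs = 184

# Two steps: decimate by 5, then by 5 again.
stage1_rate = fs_in / 5
stage2_rate = stage1_rate / 5
two_stage_coeffs = 11 + 43

print(f"final rate, one step:  {single_stage_rate / 1e6:.4f} MHz")
print(f"final rate, two steps: {stage2_rate / 1e6:.4f} MHz")
print(f"coefficients: {single_stage_coeffs} vs {two_stage_coeffs}")
```

Both routes end at the same output rate; the stepped version needs fewer than a third of the coefficients.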

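[Slide 13, figure only: filter design for decimating by 25 in one step]

[Slide 14, figure only: first filter for decimating in two steps of 5]

[Slide 15, figure only: second filter for decimating in two steps of 5]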

Digital Down Conversion (DDC) Implementation
The following design shows how the DDC function would be implemented using the FIR filter core from the Xilinx library
The coefficients are automatically loaded into the filter cores
The design has been compiled and was found to use about 6000 logic slices
The FIR filter core is a legacy core and is built as an optimized lookup table of coefficients

Digital Down Conversion Implementation
[Figure: DDC design built with the FIR filter cores]

DDC – Another Way
While we were able to exploit the math of DSP to reduce our coefficient count, we did not necessarily exploit the Xilinx architecture
The next design implements the 184-coefficient filter but is significantly smaller in instantiation size than the previous design
This design exploits the memory, embedded multipliers, and SRL16s

[Slide 19, figure only: DDC design implementing the 184-coefficient filter]

Multiplexing I and Q multiplication so that just one filter is needed instead of two
[Figure label: time-division-multiplexed input]

Efficient Shift Registers via SRL16s
The delay line would require 16 x 50 x 7 = 5600 registers, which would be 2800 logic slices
Use of SRL16s reduces the slice count to less than 700
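The slice arithmetic can be reproduced as follows. This is a back-of-envelope sketch: the two-registers-per-slice figure is inferred from the slide's own register and slice counts, two lookup tables per slice comes from the earlier SRL16 slide, and the SRL16 figure is an idealized lower bound that assumes every lookup table is filled to its full 16 bits.

```python
import math

delay_bits = 16 * 50 * 7          # product quoted on the slide

REGS_PER_SLICE = 2                # assumption: two flip-flops per slice
LUTS_PER_SLICE = 2                # two 4-input lookup tables per slice
SRL_DEPTH = 16                    # bits stored per lookup table used as SRL16

slices_with_registers = math.ceil(delay_bits / REGS_PER_SLICE)

# Idealized lower bound: assumes perfect packing of delay bits into SRL16s.
srl_luts_min = math.ceil(delay_bits / SRL_DEPTH)
srl_slices_min = math.ceil(srl_luts_min / LUTS_PER_SLICE)

print(delay_bits, slices_with_registers, srl_luts_min, srl_slices_min)
```

Real delay lengths rarely divide evenly into 16, so the practical count sits above this bound; the slide's figure of under 700 slices covers the whole design rather than the delay line alone.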

Clock-Based Demuxing and Automatic Pipeline Balancing
[Figure annotations:]
– Down sample block grabs last sample in a frame
– Down sample block grabs next sample in a frame
– Delay block "slides" the frame
Balancing latencies is a common requirement in DSP designs
The Sync block uses SRL16s (very efficient) to automatically balance pipeline delays
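The delay-then-downsample idea can be modelled in software. This sketch is a guess at the mechanism described by the annotations, not the actual block behaviour: the function names and the zero fill are assumptions, and only the "last sample in a frame" case is modelled.

```python
def delay(stream, d, fill=0):
    """Delay a stream by d samples, keeping its length (assumes d < len(stream))."""
    return [fill] * d + stream[:len(stream) - d]


def downsample_last(stream, n):
    """Keep the last sample of every n-sample frame."""
    return stream[n - 1::n]


def demux_channel(stream, channel, n_channels):
    """Slide the frame so the wanted channel lands in the last slot, then grab it."""
    return downsample_last(delay(stream, n_channels - 1 - channel), n_channels)


# Hypothetical interleaved stream: a0 b0 c0 a1 b1 c1 a2 b2 c2
a, b, c = [1, 2, 3], [10, 20, 30], [100, 200, 300]
stream = [ch[k] for k in range(3) for ch in (a, b, c)]

recovered = [demux_channel(stream, ch, 3) for ch in range(3)]
```

No per-channel enable logic is needed: the position of a sample within the frame decides where it goes.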

Notes on Previous Design
One filter structure is used by clocking the filter at twice the rate of the incoming data
The coefficients are stored in memory, 25 per ROM. There are 200 coefficients, but this approach allows storage of many more
The delay between taps is built using SRL16s. Without SRL16s the delay line alone would have taken 2800 slices; instead the entire design is fewer than 700 slices
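One plausible reading of "25 coefficients per ROM" is a bank of multiply-accumulate units, each stepping through its own coefficient ROM while one output is formed per 25 inputs. The sketch below models that reading. It is an assumption, not the documented structure: the function names, the split of taps into consecutive blocks, and the small demo sizes are all illustrative.

```python
def filter_then_decimate(x, h, decim):
    """Reference: run the FIR at full rate, then keep every decim-th output."""
    full = [sum(h[i] * x[n - i] for i in range(len(h)) if n - i >= 0)
            for n in range(len(x))]
    return full[decim - 1::decim]


def mac_bank_decimator(x, h, decim):
    """Decimating FIR built from len(h) // decim multiply-accumulate units.

    Each unit owns a ROM of `decim` coefficients (assumed here to be a
    consecutive block of taps) and only the kept outputs are ever computed.
    Assumes len(h) is a multiple of decim.
    """
    n_units = len(h) // decim
    roms = [h[u * decim:(u + 1) * decim] for u in range(n_units)]
    out = []
    for n in range(decim - 1, len(x), decim):
        total = 0
        for u, rom in enumerate(roms):
            acc = 0
            for step, coeff in enumerate(rom):
                idx = n - (u * decim + step)
                if idx >= 0:
                    acc += coeff * x[idx]
            total += acc
        out.append(total)
    return out


# Small hypothetical case: 6 taps, decimate by 3, so 2 units with 3-entry ROMs.
h = [1, 2, 3, 3, 2, 1]
x = list(range(1, 13))
fast = mac_bank_decimator(x, h, decim=3)
slow = filter_then_decimate(x, h, decim=3)
```

With the slide's numbers, 200 coefficients at 25 per ROM would correspond to 8 such units.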

Channelizer Design
The following design is a 64-channel channelizer based on the technique known as a polyphase decimation filter with a DFT bank
The design basebands and decimates 64 channels simultaneously
The polyphase decimation is the same structure as the previous design, hence very efficient device utilization. This filter structure uses the on-chip RAM blocks of the Xilinx device to store the coefficients
This technique requires a tapped shift register that needs 6272 registers (3136 slices). However, Xilinx's patented ability to turn the logic lookup table into a 16-bit shift register reduces this requirement by more than an order of magnitude. The whole design is less than 1700 slices
The DFT is implemented with a streaming FFT core. The streaming mode allows the FFT to keep up with the data rate
Individual channels out of the FFT are demuxed using the implied clocking technique seen in the previous design

[Figure annotations:]
– Coefficients are stored in on-chip block RAMs
– 64-point FFT set to streaming mode

Filter coefficients are stored in on-chip block RAMs
A new phase of the 64-phase polyphase filter is rotated into the multipliers on every clock cycle
There are 64 phases x 8 taps = 512 coefficients
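A small numerical model can show why a polyphase filter followed by a DFT basebands and decimates every channel at once. It is a textbook-style sketch, not the design on the slides: the function names, the sign convention of the DFT, the 4-channel demo size, and the prototype taps are all assumptions. The slide's design would correspond to M = 64 with 8 taps per phase.

```python
import cmath


def polyphase_channelizer(x, h, M):
    """Critically sampled polyphase/DFT channelizer model.

    x: input samples; h: prototype low-pass taps, len(h) a multiple of M.
    Returns frames where frames[m][k] is channel k at output time m.
    """
    taps_per_phase = len(h) // M
    frames = []
    for m in range(len(x) // M):
        # Phase p filters its own sub-sequence with taps h[p], h[p + M], ...
        u = []
        for p in range(M):
            acc = 0.0
            for q in range(taps_per_phase):
                idx = (m - q) * M - p
                if 0 <= idx < len(x):
                    acc += h[p + M * q] * x[idx]
            u.append(acc)
        # DFT bank across the M phase outputs.
        frames.append([sum(u[p] * cmath.exp(2j * cmath.pi * k * p / M)
                           for p in range(M)) for k in range(M)])
    return frames


def direct_channel(x, h, M, k):
    """Reference for one channel: mix to baseband, filter, keep every M-th sample."""
    mixed = [x[n] * cmath.exp(-2j * cmath.pi * k * n / M) for n in range(len(x))]
    out = []
    for m in range(len(x) // M):
        acc = 0
        for n, tap in enumerate(h):
            idx = m * M - n
            if 0 <= idx < len(x):
                acc += tap * mixed[idx]
        out.append(acc)
    return out


# Hypothetical small case: 4 channels, 2 taps per phase (8 coefficients).
M = 4
h = [0.1, 0.2, 0.3, 0.4, 0.4, 0.3, 0.2, 0.1]
x = [((7 * n) % 5) - 2.0 for n in range(32)]
frames = polyphase_channelizer(x, h, M)
max_error = max(abs(frames[m][k] - direct_channel(x, h, M, k)[m])
                for k in range(M) for m in range(len(frames)))
```

The shared structure does the work of M separate mixers and filters, since each input sample is multiplied by just one tap per output frame.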

Conclusion
Efficient FPGA instantiation of DSP algorithms requires exploitation of the FPGA vendor's architecture
Xilinx's Virtex-II architecture is especially amenable to systolic computation structures
FPGA architectures may present non-obvious instantiation choices that are more efficient than a typical textbook approach
Algorithms can and should be modified for parallelized data flow instantiation