Xilinx Core Solutions Group

Presentation transcript:

Xilinx Core Solutions Group: DSP. Why is an FPGA vendor talking to you about DSP? What is it that we have to say? Well, I’m going to give you an overview of Xilinx DSP, and by the end of this presentation you’re going to have a new perspective on solving high performance DSP problems.

Traditional DSP: DSP Processors. Single MAC; programmable; off-the-shelf, standard part; hardware multiplier; one MAC (multiply-accumulate), time-shared; performance ceiling; sequential processing. If you look at the performance that a DSP microprocessor can deliver, there is an inverse relationship between how many operations the processor can perform on a data sample and the sample rate at which the processor can operate. While a processor can perform simple tasks very fast, this performance drops off quite quickly as the work per sample grows. Adding more processors to a system helps increase the performance capability, but as you can see the increase is not that significant when the cost of adding these processors is considered. Bear in mind this is more than just component cost: the cost of writing complex multi-processor DSP code is easily underestimated.

Xilinx DSP: High Performance Alternative - Parallel Processing. Programmable; off-the-shelf, standard part. Many multiplies in one clock cycle! Extend the performance of DSP processors. [Diagram: four multiply-add units side by side.] Multiple MACs, parallel processing.

Xilinx DSP Solution: CORE Generator; DSP LogiCOREs; system-level tools; tools integration.

Existing Xilinx DSP Design Methodology. CORE Generator: parameterize DSP LogiCOREs. Connect the cores with HDL or schematic. M1. XC4000X/Spartan/Virtex.

Addition of DSP System Level Tool. DSP system-level tools: used by all DSP systems engineers; 100,000-copy installed base; fit into existing DSP environment; connect through the CORE Generator SystemLINX interface. CORE Generator. M1.

Performance: XC4085XL > 10x Faster than 320C6x. [Chart: 16-bit FIR filter benchmark, billions of MACs per second (axis 1 to 5), for the 320C6x, 4005XL, 4013XL, 4036XL, 4062XL and 4085XL.] First, performance. At a peak rate of 400 million multiply-accumulates per second, the best DSP processors available today deliver slightly more performance in data processing applications than a small Xilinx FPGA. Using a larger device and adding more logic, and with it more parallel processing power, increases the performance that Xilinx can deliver. The Xilinx XC4085XL, which is the largest FPGA shipping today, provides more than ten times the data throughput rate of the fastest processor currently available. Choose the horsepower you need for your application and add an FPGA, not more processors.

120 Million Samples per Second 512-Tap Decimating FIR. 3.8 billion MACs; >10 DSP uPs; 5,120 flip-flops just for data buffer; XC4085XL; 150,000 gates. [Diagram: two 10-bit input registers, each feeding 32-tap FIRs numbered 1 to 8, an adder tree, and an 18-bit output register.] The implementation is based on sixteen 32-tap FIR filters all working in parallel. All of the cores are generated by the Xilinx CORE Generator and tied together in a top-level design, which makes the process of implementing the design simplicity itself. The results, however, are staggering. The design delivers almost 4 billion multiply-accumulate operations per second, performance that would require more than 10 high performance processors to match. The data buffer alone would require about 5000 flip-flops, more than are available in most FPGAs. The use of distributed RAM instead of flip-flops makes this design possible in an FPGA. The equivalent gate count for this design is approximately 150,000 gates.
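The headline figures on this slide can be cross-checked with a little arithmetic. The sketch below uses Python purely for illustration. The decimation ratio of 16 is an assumption (the slide does not state it), picked because it reproduces the quoted 3.8 billion MACs per second; the variable names are hypothetical.

```python
# Figures taken from the slide
FILTERS = 16                 # sixteen 32-tap FIR cores in parallel
TAPS_PER_FILTER = 32
DATA_BITS = 10               # 10-bit input samples
INPUT_RATE_HZ = 120e6        # 120 million samples per second

# ASSUMPTION: the decimation ratio is not given on the slide.
DECIMATION = 16

total_taps = FILTERS * TAPS_PER_FILTER          # 512-tap filter
buffer_flip_flops = total_taps * DATA_BITS      # one flip-flop per stored bit
output_rate_hz = INPUT_RATE_HZ / DECIMATION     # a decimator only computes kept outputs
macs_per_second = total_taps * output_rate_hz   # one MAC per tap per output sample

print(total_taps, buffer_flip_flops, macs_per_second)
```

Under that assumption the 512 taps, the 5,120-bit data buffer and roughly 3.8 billion MACs per second all agree with the slide.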

Price per Million MACs per Second. [Chart: price per million MACs per second, $0.05 to $0.25, lowest-cost C6x versus Xilinx XC4000XL.] The Xilinx 4000XL family is based on a 0.35 µm process technology and is very cost effective. The latest generations of DSP microprocessors have done a lot to reduce the cost of high performance devices, but even when compared to the cheapest member of the C6x family, a programmable solution from Xilinx can cost as little as one fifth as much. Further cost reductions can be achieved by migrating the design to a HardWire device, which we’ll talk about more later. This can reduce the cost to less than a penny per million multiply-accumulates per second. Add an FPGA, not more processors.

DSP LogiCOREs Exploit FPGA Architecture. [Diagram: 16-word RAM and flip-flop (F/F).] Matrix of 16 by 1 RAM primitives: look-up-table logic; FIFOs, shift-registers, …; multiple small memories; 10,000 RAM primitives on a chip. Regular, monolithic, scalable structure. Efficient: 1 - 3 million MACs per CLB.

Distributed RAM & Distributed Arithmetic (DA): Perfect Match. Basic DA structure matches XC4000 architecture. DA algorithms: 4-input look-up tables (LUTs), scaled with adders. For higher performance, use more LUTs = more parallelism. [Diagram: 4-input LUTs feeding an N-bit ADD or ACC.] Efficiency similar to custom solution: achievable with LUT logic; more ASIC gate equivalents; more cost effective.
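To make the LUT-plus-adder idea concrete, here is a minimal software sketch of distributed arithmetic for a four-term dot product. It is a behavioural illustration, not the LogiCORE implementation: the function names are hypothetical, and the samples are assumed to be unsigned (two's-complement data would need the sign-bit pass subtracted rather than added).

```python
def build_da_lut(coeffs):
    """Precompute every possible partial sum of the coefficients.

    Entry `addr` holds the sum of coeffs[k] for each bit k set in `addr`,
    so four coefficients fill one 16-entry (4-input) look-up table.
    """
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def da_dot_product(samples, lut, bits):
    """Bit-serial dot product: one LUT read and one scaled add per data bit."""
    acc = 0
    for b in range(bits):
        addr = 0
        for k, x in enumerate(samples):
            addr |= ((x >> b) & 1) << k      # gather bit b of every sample
        acc += lut[addr] << b                # weight by 2**b and accumulate
    return acc


coeffs = [3, -1, 4, 2]                       # example values, not from the slides
samples = [5, 7, 2, 9]                       # unsigned 4-bit data
lut = build_da_lut(coeffs)
result = da_dot_product(samples, lut, bits=4)
direct = sum(c * x for c, x in zip(coeffs, samples))
print(result, direct)
```

No multiplier appears anywhere in the loop: the products are replaced by table reads, which is why the structure maps onto 4-input LUTs and adders.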

Common DSP Functions. Filters: FIR, IIR. Transforms: FFT, DCT. Modulation: multipliers, SIN tables. Basics: multiply / add, storage.

FIR Filter. [Diagram: sample data, N bits wide, enters a delay line K taps long (X0, X1, X2, ...); each tap is multiplied by its coefficient (C0, C1, C2, ...) and the K products are summed to form the output data.]
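The structure in the diagram can be sketched in a few lines of Python. This is a plain direct-form model for illustration only; the function name and the example coefficients are hypothetical.

```python
def fir_filter(samples, coeffs):
    """Direct-form FIR: each output is the sum of K tap-times-coefficient products."""
    taps = [0] * len(coeffs)                 # delay line, newest sample first
    outputs = []
    for x in samples:
        taps = [x] + taps[:-1]               # shift the new sample into the delay line
        outputs.append(sum(c * t for c, t in zip(coeffs, taps)))
    return outputs


# Feeding a unit impulse returns the coefficients themselves.
impulse_response = fir_filter([1, 0, 0, 0, 0], [2, 3, 5])
print(impulse_response)
```

A K-tap filter therefore costs K multiply-accumulates per output sample, which is the quantity the MACs-per-second figures in this presentation count.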

FIR Filter LogiCOREs Two Basic Types: 1. Serial Distributed Arithmetic FIR SDA FIR - Single Channel SDA FIR - Dual Channel 2. Parallel Distributed Arithmetic FIR Combine basic PDA or SDA FIR cores to solve many problems

SDA FIR Filters: Serial Distributed Arithmetic. Parallel in, parallel out, bit-serial internally. All taps processed in parallel. Full precision through entire core. One clock cycle required for each data bit; one additional clock cycle for symmetric filters. EXAMPLE: 10-bit data, 80 taps, symmetrical FIR. For a bit-level clock of 90 MHz, max sample rate = 90 MHz / 11 clks = 8.2 million samples/sec. Process 80 taps every 122 nsec: 656 million MACs per second, 257 CLBs, 2.55 million MACs / CLB.
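The worked example follows directly from the one-clock-per-bit rule. The short Python sketch below reproduces it; the function and key names are hypothetical, and the CLB count is simply the figure quoted on the slide.

```python
def sda_fir_throughput(bit_clock_hz, data_bits, taps, symmetric, clbs):
    """Estimate SDA FIR throughput from the bit-serial clocking rule."""
    # One clock per data bit, plus one extra clock for symmetric filters.
    clocks_per_sample = data_bits + (1 if symmetric else 0)
    sample_rate_hz = bit_clock_hz / clocks_per_sample
    macs_per_second = sample_rate_hz * taps       # every tap is processed in parallel
    return {
        "clocks_per_sample": clocks_per_sample,
        "sample_rate_hz": sample_rate_hz,
        "sample_period_ns": 1e9 / sample_rate_hz,
        "macs_per_second": macs_per_second,
        "macs_per_clb": macs_per_second / clbs,
    }


example = sda_fir_throughput(bit_clock_hz=90e6, data_bits=10, taps=80,
                             symmetric=True, clbs=257)
for name, value in example.items():
    print(name, value)
```

Unrounded, the MAC rate comes out near 655 million per second; the slide's 656 million results from rounding the sample rate to 8.2 MHz before multiplying by 80 taps.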

SDA FIR Properties. For a given # of taps: coefficient bit-width determines size (# CLBs = function of D.A. LUT width); data bit-width determines max sample rate (one serial clock per bit); output data width does not affect CLB count.

What to Ask: data sample rate; number of taps; data word width; coefficient width; coefficient symmetry; same input & output sample rate? Number of CLBs.

Serial Distributed Arithmetic FIR Filters: # CLBs (data word = coefficient size)
Columns: 5 bit, 8 bit, 10 bit, 12 bit, 14 bit, 16 bit, 18 bit, 20 bit
8 tap, Symm: 33 36 39 42 45 52 55
8 tap, Non: 46 54 59 64 69 77 85
16 tap, Symm: 53 61 69 71 76 81 96 102
16 tap, Non: 80 95 104 112 123 138 142
24 tap, Symm: 80 89 101 108 116 127 146 154
24 tap, Non: 101 114 127 140 153 174 187
32 tap, Symm: 93 107 118 126 137 148 175 182
32 tap, Non:
40 tap, Symm: 116 138 154 165 179 191 226 239
40 tap, Non:
48 tap, Symm: 158 173 187 202 217 246 261
64 tap, Symm: 197 215 233 250 268 305 323
80 tap, Symm: 236 257 278 299 320 364 385
Sample rate, XC4000E-1 (MHz), for 5 bit, 8 bit, 10 bit, 12 bit, 14 bit, 16 bit, 18 bit, 20 bit data:
Symm: 13.3 8.9 7.3 6.2 5.3 4.7 4.2 3.8
Non: 16.0 10.0 8.0 6.7 5.7 5.0 4.4 4.0

For SDA FIR Filters: Distributed RAM is More Efficient. Build the time-skew buffer with distributed RAM, not flip-flops. One 16x1 RAM cell primitive: 16 x 1 shift register in 1 logic cell. Flip-flops (FF): 16 x 1 shift register in 16 logic cells.
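One way to see why a 16x1 RAM can stand in for sixteen flip-flops is to model both as a 16-clock delay. The Python sketch below is a behavioural illustration only, not the Xilinx primitive; the class names and the read-before-write addressing scheme are assumptions.

```python
class RamShiftRegister:
    """16 x 1 delay built from one small RAM and a wrapping address counter."""

    def __init__(self, depth=16):
        self.ram = [0] * depth
        self.addr = 0

    def clock(self, bit_in):
        bit_out = self.ram[self.addr]        # read the oldest stored bit
        self.ram[self.addr] = bit_in         # overwrite it with the newest bit
        self.addr = (self.addr + 1) % len(self.ram)
        return bit_out


class FlipFlopShiftRegister:
    """The same delay built from a chain of 16 individual flip-flops."""

    def __init__(self, depth=16):
        self.ffs = [0] * depth

    def clock(self, bit_in):
        bit_out = self.ffs[-1]               # last flip-flop drives the output
        self.ffs = [bit_in] + self.ffs[:-1]  # every stage shifts by one
        return bit_out


pattern = [(i * i) % 3 & 1 for i in range(48)]   # arbitrary test bits
ram_sr, ff_sr = RamShiftRegister(), FlipFlopShiftRegister()
ram_out = [ram_sr.clock(b) for b in pattern]
ff_out = [ff_sr.clock(b) for b in pattern]
print(ram_out == ff_out)
```

Both models produce identical outputs, but the RAM version only moves an address while the flip-flop version needs one storage element per stage, which is the area saving the slide describes.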

Best Device Utilization: Distributed RAM Well Suited to DSP. [Chart: device size in logic cells (400 to 1600) for SDA FIR filters, block RAM versus Xilinx distributed RAM, at 16 taps/8 bits, 16 taps/16 bits, 64 taps/9 bits and 64 taps/16 bits.] Xilinx FPGAs implement DSP functions more efficiently than other FPGA architectures. Let’s look at a benchmark for serial distributed arithmetic FIR filters to highlight the advantages of Xilinx’ distributed RAM over block RAM based architectures. As you would expect, it takes more logic to build a bigger filter, but with Xilinx FPGAs you get a two to three times area saving for the same functions. This means you can use a cheaper device to implement the function. Xilinx distributed RAM uses one third the area.

PDA FIR Filter Core: Parallel Distributed Arithmetic FIR Filters. Fully parallel implementation: all taps processed in parallel (same as SDA); all bits processed in parallel. Up to 100 million samples per second; 2 billion MACs per 20-tap core. [Symbol: PDA FIR block with clock, inputs and outputs; port labels Data_IN, DATA_OUT, CK, Cascade Data_Out, Mid_Out, Mid_In, C_M_OUT, C_M_IN, C_D_OUT.]

PDA FIR Filters: the high data sample rate solution. Parameterized: input data, 4 to 24 bits; coefficients, 4 to 24 bits; symmetric, non-symmetric, negative symmetry; output data, 2 to 31 bits; taps, 2 to 20 per core. Automatically trims unused coefficient ROMs. Supports cascading multiple filter cores.

CORE Generator Software. SystemLINX: ability to call CORE Generator from third-party tools. AllianceCORE: data sheets. LogiCORE: web mechanism to download new cores.

One line Documentation

CORE Generator Methodology 1. Select a CORE 2. Enter parameters 3. Generate Core

LogiCORE - SDA Filter Filter Design Package 160 CLB HOW ?

DSP CORE Generator Outputs: 32-Tap FIR Filter. Schematic symbol; VHDL or Verilog HDL instantiation code; simulation model; design netlist with constraints. FIR filter recipe: DSP CORE Generator parameters; 20 rows by 9 columns; 160 CLBs used. Predictable performance regardless of the number of cores.

Predictable Size & Performance Built for System Performance - Not Benchmarks. Generated with RPM (Relationally Placed Macro). RPM Macro Level Advantages RPM System Level Advantages Predictable size. Close proximity of communicating elements Alignment of Critical paths Accessible I/O signals Improves Density Rapid progress for automatic and manual design methods (1 macro, NOT 100’s of elements!) Consistent performance anywhere on the die. Packing density very high Adequate set-up times Filling a device with Xilinx Cores does not reduce performance

Performance Independent of Core Location. [Diagram: the same core installed in different locations, 80 MHz in each.] Xilinx LogiCOREs deliver the same performance for any placement. Non-segmented routing FPGAs can’t do this.

Performance Independent of Device Utilization. [Diagram: four cores in one device, 80 MHz each.] Xilinx has performance independent of the number of cores added. Non-segmented routing FPGAs can’t do this.

Best FPGA Performance: Xilinx is More Predictable. [Chart: speed in MHz (40 to 80) versus number of instances (1 to 8) of a 12x12 area-efficient multiplier, Xilinx segmented versus non-segmented.] Another benchmark based on 12 x 12 multipliers highlights Xilinx’ performance advantage over competitors’ FPGAs. As you add more instances of a core to a design based on a non-segmented architecture, the performance drops off at an unpredictable rate. If you do the same in Xilinx, the performance is essentially the same regardless of how many instances you add. This gives you higher predictability and good repeatability from one design iteration to the next. This is in part due to the segmented routing architecture, but the software also plays a part and we’ll talk about this next. Segmented = more predictable and repeatable.

Performance Independent of Device Size. [Diagram: the same core in three device sizes, 80 MHz in each.] Same performance for a 4005 or 4085. Non-segmented routing FPGAs can’t do this.

Design Flow. [Block diagram labels: Mixer; 4K x 16 RAM; I; Q; 4:1 COS; 4:1 SIN; 4 multipliers; Complex Demod; Low Pass; 48-TAP FIR; 32-TAP FIR; Decimate; 20 MHz; 5 MHz; Base-band processor.] Generate each module. Use schematic or HDL at a system level.

Implementing the Mixer. This mixer supports sample rates in excess of 85 MHz. It even supports sample rates up to 45.6 MHz using the slowest Xilinx device (E-4).

Joining the Cores. Here VHDL is used to link the cores into a system. Schematic symbols may also be used.

    skip_value: skip_val  --The integrator for skipping through the Sine table with forcing constant
      port map (cb => skip_constant);
    skip_integrater: skip_int
      port map (b => skip_constant, s => skip_integrate, l => GND, ce => VCC, c => clk);
    form_sine_address: for i in 0 to 6 generate
      --extract 7 bits required to address look-up table
      --MSB is not used as this represents overflow.
      --Lower bits are internal precision for integrator.
      skip_address (i) <= skip_integrate(i+10);
    end generate form_sine_address;
    sine_table : sine_lut -- sine wave look-up table
      port map (theta => skip_address, output => sine_wave, ctrl => VCC, --select SINE output when high

All component declaration and port map code provided by Coregen.
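The VHDL fragment describes an integrator stepping through a sine look-up table, with bits 10 to 16 of the integrator forming the 7-bit table address. A rough behavioural model in Python might look like the sketch below. Several details are assumptions rather than facts from the slide: the 18-bit accumulator width, the full-cycle 128-entry table, the amplitude of 127 and all Python names.

```python
import math

TABLE_BITS = 7        # 7 address bits, as in the generate loop (i in 0 to 6)
FRACTION_BITS = 10    # lower bits are internal precision (address starts at bit 10)
ACC_BITS = 18         # ASSUMPTION: 17 used bits plus one unused overflow MSB

# ASSUMPTION: one full sine cycle, scaled to signed 8-bit range.
sine_table = [round(127 * math.sin(2 * math.pi * k / 2 ** TABLE_BITS))
              for k in range(2 ** TABLE_BITS)]


def sine_generator(skip_constant, count):
    """Accumulate skip_constant each clock and look up the sine of the phase."""
    acc = 0
    outputs = []
    for _ in range(count):
        # Bits 10..16 of the accumulator address the table; the MSB is ignored.
        address = (acc >> FRACTION_BITS) & (2 ** TABLE_BITS - 1)
        outputs.append(sine_table[address])
        acc = (acc + skip_constant) & (2 ** ACC_BITS - 1)
    return outputs


one_entry_per_clock = sine_generator(1 << FRACTION_BITS, 130)
half_speed = sine_generator(1 << (FRACTION_BITS - 1), 4)
print(one_entry_per_clock[:4], half_speed)
```

A larger skip constant steps through the table faster and so raises the output frequency, while the ten fractional bits allow step sizes smaller than one table entry.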

Power Dissipation Advantage. Power is often the limiting factor in DSP. Xilinx advantage over competitive FPGAs: segmented routing is essential in DSP applications; Altera runs 3X HOTTER than Xilinx! Xilinx advantage over DSP processors: TI 320c6 runs 2X HOTTER (independent study by Stanford). [Graphic: STOP sign, "Too Much Heat".]

Segmented Interconnect Yields Lower Power. [Chart: power in W (5 to 10) versus clock frequency (20 to 100 MHz) for non-segmented and Xilinx segmented routing, with the ceramic and plastic package thermal limits marked.] Power dissipation is an important factor in high-speed DSP applications. Xilinx has a significant advantage here over other FPGAs due to our use of segmented interconnect to implement routing inside the device. Every device package has a thermal limit. A ceramic package can handle a lot of power; plastic packages, less so. Due to higher power dissipation, a design implemented with a non-segmented routing architecture hits the wall sooner than the same design implemented in Xilinx devices, which use segmented routing. This means that Xilinx DSP can operate at higher clock frequencies for any given package, or will dissipate less power for a given application. Segmented = lower power, faster operation.

Where to Find Opportunities. Look for high performance applications: multiple DSP processors; fixed-function DSP parts; gate array / custom DSP; data rates typically above 1 MHz; multiple channels required. FIR Filter CORE: 100 million samples / sec.

DSP Applications. Image & Video Processing: medical imaging, copiers, cameras, security systems, video editors, inspection systems, fingerprint ID. Communications: wireless comm, cellular / PCS, modems, satellite, cable, ADSL, telephone test. Industrial, Military: motor control, numerical control, test equipment, vibration analysis, power supplies, radar, secure comm.

Where FPGA Solutions Fit. Audio (kHz sample rates, single channel): processors, fixed-point or floating-point arithmetic. RF, video, multiple channels (MHz sample rates): FPGAs, fixed-point arithmetic. FPGAs are ideal for high sample rates and computational intensity.