Using Programmable Logic to Accelerate DSP Functions: “A Tutorial”, Greg Goslin

Presentation transcript:

Using Programmable Logic to Accelerate DSP Functions 1 Using Programmable Logic to Accelerate DSP Functions: “A Tutorial”
Greg Goslin, Digital Signal Processing Applications Manager, Corporate Applications Group, 15OCT95

Using Programmable Logic to Accelerate DSP Functions 2 Agenda
- When to use FPGAs for DSP, an Overview
  - What is Digital Signal Processing (DSP)?
  - Where is DSP Used?
  - Traditional DSP Approaches.
- The Promise of Programmable Logic
  - Case Study: Finite Impulse Response Filter.
  - Case Study: Viterbi Decoder.
- Design Methodologies for DSP in FPGAs
  - Design Entry and Third Party Software Tools.
- Building Fast Filters in FPGAs, a Tutorial
  - Efficient Algorithms for FPGAs.
  - Using Distributed Arithmetic for Filter Designs.
  - How to use an FPGA to Build Filter Designs.

Using Programmable Logic to Accelerate DSP Functions 3 When to Use FPGAs for DSP
- High Sample Rates
  - Up to 66 MHz (off-chip) with XC4000E-2
- Low Sample Rates
  - Integrate DSP + system logic in a low-cost DSP using serial sequential Distributed Arithmetic algorithms
- Short Word Lengths
  - DA algorithm is faster with shorter word length
- Lots of Filter Taps with DA
  - FPGA processes all taps in parallel, faster than DSP
- Fast Correlators
- Single-Chip Solution Required
- HardWire Gate Array Migration path for high-volume designs

Using Programmable Logic to Accelerate DSP Functions 4 Constraint Driven Design Methodology
- Constraints
  - System Requirements
  - Hardware Limitations
- Data Rate
  - Inputs
  - Outputs
  - Multi-Channel I/O
- Quality
  - Number of Bits/Taps
  - Number of Operations
  - Error Tolerance
- Processor Power
- Clock Rate
[Figure: Constraint Driven Design Methodologies. Labels: Clock Rate, Data Rate, Quality, Processor Power, Options, Performance, Efficiency.]

Using Programmable Logic to Accelerate DSP Functions 5 Constraints
- Data Rate
  - Functional Algorithms must operate at system speed.
  - Below System Frequency, the design has NO Value.
  - Above System Frequency, the design has NO added Value.
- Quality
  - Data and Coefficient Bandwidth, m-Bits.
  - Number of operations within Function, n-Taps.
  - Error Tolerance, +/- LSB.

Using Programmable Logic to Accelerate DSP Functions 6 Design Implementation
- Algorithm Evaluation:
  - Data Flow Structure
    - Parallel/Serial Operation
    - Variable/Constant Operators
    - Single/Multiple Data Path
- Processor Power
  - Maximum Processing Rate, Device Dependent
    - Number of Clock Cycles to Perform Algorithm
  - Bandwidth
    - Data, Coefficients, Input/Output
- Clock Rate
  - Subdivision of Data Rate Clock

Using Programmable Logic to Accelerate DSP Functions 7 Case Study - Viterbi Decoder
- Design Evaluation
  - Multi-Path Processes
  - Repeated Independent Functions
  - Symmetrical Design
  - While(), For() Loops
- Performance
  - Programmable DSP
    - 24 clock cycles

Using Programmable Logic to Accelerate DSP Functions 8 DSP Design Implementation
- Algorithm Evaluation:
  - Data Flow Structure
    - Parallel/Serial Operation
    - Single/Multiple Data Path
    - Variable/Constant Operators
    - While() and For() Loops
- Processor Power
  - Maximum Processing Rate, Device Dependent
    - Number of Clock Cycles to Perform Algorithm
  - Bandwidth
    - Data, Coefficients, Output
- Clock Rate
  - Subdivision of Data Rate Clock

Using Programmable Logic to Accelerate DSP Functions 9 FPGA-Based DSP Coprocessor Design Implementation
- Performance
  - Programmable DSP
    - 24 clock cycles
  - FPGA-Based Coprocessor
    - 9 clock cycles
- Results:
  - 37.5% of original processing time
  - 2.67X Increase in throughput
  - System Requirements:
    - Before: 4-DSPs, 12-RAMs
    - After: 2-DSPs, 6-RAMs, 1-XC4013E
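Both result figures on this slide follow from the two cycle counts alone. A short check of the arithmetic, assuming the clock rate is the same before and after:

```latex
\[
\frac{9\ \text{cycles}}{24\ \text{cycles}} = 0.375 = 37.5\%\ \text{of the original processing time},
\qquad
\frac{24\ \text{cycles}}{9\ \text{cycles}} \approx 2.67\times\ \text{throughput}
\]
```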

Using Programmable Logic to Accelerate DSP Functions 10 Building Fast and Efficient Filters in FPGAs
- Efficient Filter Algorithms for FPGAs
  - Distributed Arithmetic:
    - Serial Sequential
    - Serial
    - Parallel
- Using Distributed Arithmetic for Filter Designs
  - Serial FIR Filter Example
  - Two-Bit Parallel FIR Example
  - Full Parallel FIR Example
- How to use an FPGA to Build Filter Designs
  - 8-Tap, 8-Bit FIR Filter SLICE

Using Programmable Logic to Accelerate DSP Functions 11 FIR FILTER EXAMPLE
[Figure: multipliers form the products C0 x X0, C1 x X1, C2 x X2, which are summed (SUM over 0 to K) into the OUTPUT DATA.]
- SAMPLE DATA: N BITS WIDE
- K TAPS LONG, K COEFFICIENTS, K SUMs
- Sum of Products Equation: K Multiplies, K Sums
- CLOCK = Multiply Time
- Sample Rate = Clock Rate
- IMPLEMENTATION ???

Using Programmable Logic to Accelerate DSP Functions 12 FIR FILTER EXAMPLE: PROGRAMMABLE DSP CHIP IMPLEMENTATION
[Figure: the same sum of products, C0 x X0 + C1 x X1 + C2 x X2; SAMPLE DATA N BITS WIDE, K TAPS LONG, K SUMs, OUTPUT DATA.]
FIR FILTER SOFTWARE SOLUTION:
    FOR EACH SAMPLE DATA WORD
        FOR EACH TAP
            MULTIPLY C(i) TIMES X(i)
            ADD RESULT TO ACCUMULATOR
- 1 Parallel Multiplier, Accumulator
- Time Share through Microcoding
- Relatively Low Sample Rates
- Multiple Chip Solution
- No Migration Path
- Complex Real Time Programming
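The software loop on this slide can be sketched in a few lines of Python. This is only an illustration of the multiply-accumulate structure, not code from the presentation; the function name `fir_mac` and the zero-initialized delay line are assumptions.

```python
def fir_mac(samples, coeffs):
    """Direct-form FIR: one multiply-accumulate per tap for every output sample."""
    k = len(coeffs)
    delay = [0] * k                  # delay[i] plays the role of X(i); newest sample first
    out = []
    for x in samples:                # FOR EACH SAMPLE DATA WORD
        delay = [x] + delay[:-1]     # shift the new sample into the delay line
        acc = 0
        for i in range(k):           # FOR EACH TAP
            acc += coeffs[i] * delay[i]   # MULTIPLY C(i) TIMES X(i), ADD TO ACCUMULATOR
        out.append(acc)
    return out
```

With a single multiplier, the inner loop costs K multiply times per sample, which is why the sample rate of this approach drops as the number of taps grows.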

Using Programmable Logic to Accelerate DSP Functions 13 Distributed Arithmetic Made Easy

Using Programmable Logic to Accelerate DSP Functions 14 8-Bit X 8-Bit Signed Multiply
        B7 B6 B5 B4 B3 B2 B1 B0
    X   A7 A6 A5 A4 A3 A2 A1 A0
Partial products (SIGN EXTEND; each row is shifted one bit position left of the row above):
      A0 (B7 B6 B5 B4 B3 B2 B1 B0)
    + A1 (B7 B6 B5 B4 B3 B2 B1 B0)
    + A2 (B7 B6 B5 B4 B3 B2 B1 B0)
    + A3 (B7 B6 B5 B4 B3 B2 B1 B0)
    + A4 (B7 B6 B5 B4 B3 B2 B1 B0)
    + A5 (B7 B6 B5 B4 B3 B2 B1 B0)
    + A6 (B7 B6 B5 B4 B3 B2 B1 B0)
    + A7 (B7 B6 B5 B4 B3 B2 B1 B0)
    = S15 S14 S13 S12 S11 S10 S9 S8 S7 S6 S5 S4 S3 S2 S1 S0

Using Programmable Logic to Accelerate DSP Functions 15 D.A. ONE TAP FIR FILTER = D0 C0
- REDUCES TO MULTIPLYING A VARIABLE TIMES A CONSTANT
- SAMPLE DATA N BITS WIDE, presented one bit at a time (X0, X1, X2, X3 ... Xn) as address A0
- 2 WORD X N BIT LOOK UP TABLE addressed by A[0]; entry shown: C0
- LOOK UP TABLE DATA feeds the Scaling Accum. (inputs A and B, REGISTER), which produces FILTERED DATA OUT
Accumulation, one sample bit per row:
      X0 (B7 B6 B5 B4 B3 B2 B1 B0)
    + X1 (B7 B6 B5 B4 B3 B2 B1 B0)  =  S9 ... S0
    + X2 (B7 B6 B5 B4 B3 B2 B1 B0)  =  S10 ... S0
    + X3 (B7 B6 B5 B4 B3 B2 B1 B0)  =  S11 ... S0
    + X4 (B7 B6 B5 B4 B3 B2 B1 B0)  =  S12 ... S0
    + X5 (B7 B6 B5 B4 B3 B2 B1 B0)  =  S13 ... S0
    + X6 (B7 B6 B5 B4 B3 B2 B1 B0)  =  S14 ... S0
    + X7 (B7 B6 B5 B4 B3 B2 B1 B0)  =  S15 ... S0

Using Programmable Logic to Accelerate DSP Functions 16 D.A. TWO TAP FIR FILTER = D0 C0 + D1 C1
- SAMPLE DATA N BITS WIDE; D0 and D1 are each presented one bit at a time (X0, X1, X2 ... XN) as addresses A0 and A1
- 4 WORD X N BIT LOOK UP TABLE addressed by A[10]; entries shown: C0, C1, C0 + C1
- LOOK UP TABLE DATA feeds the Scaling Accum. (inputs A and B, REGISTER), which produces FILTERED DATA OUT
Accumulation, one bit of each sample per row:
      (X0,0, X1,0) (B7 B6 B5 B4 B3 B2 B1 B0)
    + (X0,1, X1,1) (B7 B6 B5 B4 B3 B2 B1 B0)  =  S9 ... S0
    + (X0,2, X1,2) (B7 B6 B5 B4 B3 B2 B1 B0)  =  S10 ... S0
    + (X0,3, X1,3) (B7 B6 B5 B4 B3 B2 B1 B0)  =  S11 ... S0
    + (X0,4, X1,4) (B7 B6 B5 B4 B3 B2 B1 B0)  =  S12 ... S0
    + (X0,5, X1,5) (B7 B6 B5 B4 B3 B2 B1 B0)  =  S13 ... S0
    + (X0,6, X1,6) (B7 B6 B5 B4 B3 B2 B1 B0)  =  S14 ... S0
    + (X0,7, X1,7) (B7 B6 B5 B4 B3 B2 B1 B0)  =  S15 ... S0

Using Programmable Logic to Accelerate DSP Functions 17 D.A. THREE TAP FIR FILTER
- SAMPLE DATA N BITS WIDE; D0, D1 and D2 are each presented one bit at a time (X0, X1, X2 ... XN) as addresses A0, A1 and A2
- 8 WORD X N BIT LOOK UP TABLE addressed by A[210]; entries shown: C0, C1, C1 + C0, C2, C2 + C0, C2 + C1, C2 + C1 + C0
- LOOK UP TABLE DATA feeds the Scaling Accum. (inputs A and B, REGISTER), which produces FILTERED DATA OUT
Accumulation, one bit of each sample per row:
      (X0,0, X1,0, X2,0) (B7 B6 B5 B4 B3 B2 B1 B0)
    + (X0,1, X1,1, X2,1) (B7 B6 B5 B4 B3 B2 B1 B0)  =  S9 ... S0
    + (X0,2, X1,2, X2,2) (B7 B6 B5 B4 B3 B2 B1 B0)  =  S10 ... S0
    ...
    + (X0,N, X1,N, X2,N) (B7 B6 B5 B4 B3 B2 B1 B0)  =  S(N+M) ... S0
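The look-up tables on the one-, two- and three-tap slides follow one pattern: the entry at address A is the sum of every coefficient whose address bit is set. A minimal sketch of that construction, assuming address bit k selects coefficient C_k (the function name `build_da_lut` is illustrative, not from the slides):

```python
def build_da_lut(coeffs):
    """Return the 2**K-word DA table: entry `addr` sums C_k for each set bit k of `addr`."""
    k = len(coeffs)
    lut = []
    for addr in range(1 << k):
        lut.append(sum(c for bit, c in enumerate(coeffs) if (addr >> bit) & 1))
    return lut

# Three taps with C0=1, C1=10, C2=100 give the 8-word table
# [0, C0, C1, C1+C0, C2, C2+C0, C2+C1, C2+C1+C0]
print(build_da_lut([1, 10, 100]))
```

The table doubles in size with every added tap, which is why the later slides split long filters into slices instead of building a single large table.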

Using Programmable Logic to Accelerate DSP Functions 18 The Development of a Distributed Arithmetic FIR Filter 10 Bit 10 Tap - XC4000 Family Example

Using Programmable Logic to Accelerate DSP Functions 19

Using Programmable Logic to Accelerate DSP Functions 20 SERIAL TIME SKEW BUFFER
- SAMPLE DATA WORD SIZE = N BITS
- NUMBER OF TAPS = K
- PARALLEL IN, SERIAL OUT: sample data (N bits) is loaded in parallel and shifted out serially on D_0, D_1 ... D_k-1
- One N Bit Shift Register Per Tap
- # OUTPUTS = # TAPS
- Built from flip-flop shift registers: 10 BIT 10 TAP = 50 CLBs
SHIFT REGISTER IMPLEMENTED IN RAM
- Use 4000 RAM to build Shift Register (RAM16X1R with DATA_I, A3 A2 A1 A0, WR, CLK, DATA_O)
- One 16 Bit Shift Register Per 1/2 CLB
- 10 BIT 10 TAP = 10 CLBs

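Using Programmable Logic to Accelerate DSP Functions 21 Serial Adder
- Serial Adders pair the taps: D9 with D0, D8 with D1, D7 with D2, D6 with D3, D5 with D4
- Each adder computes A + B + Carry In; the Carry is held in a flip-flop (D, Clk, CLR) for the next bit
- The carry is cleared when CNT=10
- 1 CLB Per 2 Taps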
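A bit-serial adder of this kind can be modelled in software as a one-bit full adder plus a carry flip-flop. The sketch below is an illustration only: it assumes unsigned words presented LSB first and uses one extra clock to flush the final carry. The name `serial_add` is hypothetical.

```python
def serial_add(a, b, nbits):
    """Add two unsigned nbits-wide words one bit per clock, LSB first."""
    carry = 0                      # carry flip-flop, cleared (CLR) before each word
    total = 0
    for i in range(nbits + 1):     # the extra clock shifts out the last carry
        abit = (a >> i) & 1 if i < nbits else 0
        bbit = (b >> i) & 1 if i < nbits else 0
        s = abit ^ bbit ^ carry                          # A + B + Carry In, sum bit
        carry = (abit & bbit) | (carry & (abit ^ bbit))  # carry stored for the next clock
        total |= s << i
    return total
```

In a symmetrical filter the two samples that share a coefficient are added this way before the look-up table, which halves the number of table address lines.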

Using Programmable Logic to Accelerate DSP Functions 22 DISTRIBUTED ARITHMETIC LOOK-UP TABLE
- Address inputs A0 A1 A2 A3 A4, DATA output
- 32 X 10 MEMORY, 320 BITS
- HOLDS ALL PARTIAL PRODUCTS
- LUT IS AS WIDE AS COEFF
- CAN USE MEMGEN TO BUILD LUT

Using Programmable Logic to Accelerate DSP Functions 23 1’s COMPLEMENTER
- INVERTS DATA ON LAST CYCLE
- 2 BITS PER CLB
[Figure: inputs D0 and D1 each pass through a controlled inverter (INVERT) into a D flip-flop.]

Using Programmable Logic to Accelerate DSP Functions 24 SCALING ACCUMULATOR
- ADDS DATA TO (1/2) * (SUMOUT)
- 2 BITS PER CLB
- NEED N+1 BITS
- DOUBLE PRECISION WITH SR (shift register)
- CAN USE XBLOX FOR RPM
- FORCE CARRY-IN ON LAST BIT
- LOAD ON FIRST BIT
[Figure: adder inputs A (A10 ... A0, fed back from SUM OUT S10 ... S1) and B (B(9:0) from DATA, SIGN EXT to B10), carry-in C_I, load LD, REGISTER. SUM(0) shifts into a 10-bit Shift Reg. for OPTIONAL DOUBLE PRECISION, so the register holds the Most Significant BYTE and the shift register the Least Significant BYTE.]
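Putting the look-up table, the 1's complementer and the scaling accumulator together gives one filter output per N clocks. The following is a behavioural sketch under stated assumptions, not the circuit itself: samples are signed two's complement integers that fit in `nbits`, bits are processed LSB first, and `Fraction` stands in for the double-precision shift register so that the halving stays exact. The name `da_fir_output` is hypothetical.

```python
from fractions import Fraction

def da_fir_output(window, coeffs, nbits):
    """One FIR output by bit-serial distributed arithmetic.

    window[k] is the signed sample paired with coeffs[k].
    """
    k = len(coeffs)
    # LUT holds every partial sum of coefficients (address bit k selects C_k).
    lut = [sum(c for bit, c in enumerate(coeffs) if (a >> bit) & 1)
           for a in range(1 << k)]
    acc = Fraction(0)
    for b in range(nbits):                    # one data bit per clock, LSB first
        addr = 0
        for tap, x in enumerate(window):
            addr |= ((x >> b) & 1) << tap     # bit b of every tap forms the LUT address
        data = lut[addr]
        if b == nbits - 1:
            data = ~data + 1                  # 1's complement plus forced carry-in (sign bit)
        acc = acc / 2 + data                  # ADDS DATA TO (1/2) * (SUMOUT)
    return int(acc * (1 << (nbits - 1)))      # undo the accumulated 1/2 scaling
```

For example, `da_fir_output([3, -2], [5, 7], 4)` evaluates 5*3 + 7*(-2) without a single multiplication. The last cycle subtracts instead of adds because the most significant bit of a two's complement word carries negative weight.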

Using Programmable Logic to Accelerate DSP Functions 25

Using Programmable Logic to Accelerate DSP Functions 10 BIT 10 TAP FIR FILTER
[Block diagram, SAMPLE DATA to FILTER OUT. Blocks: SERIAL TIME SKEW BUFFER (RAM BASED SHIFT REGISTER); FIVE 2 BIT ADDERS (2 TO 1 REDUCTION DUE TO SYMMETRY); RAM OR ROM LOOK UP TABLE, 32 X 10 (FIR FILTER COEFFICIENTS AND MULTIPLY LOOK UP); 1’S COMPLEMENT (XOR, COMPLEMENT ON LAST CYCLE); SCALING ACCUMULATOR (ADDER, REGISTER, 9 Most Significant Bits fed back); TIMING AND CONTROL (50 MHz CLK, A3 A2 A1 A0, CNTEQ9, CNTEQ10). CLB labels on the diagram: 5 CLBs, 10 CLBs, 5 CLBs, 7 CLBs, 7 CLBs.]
- TOTAL OF 44 CLBS: FITS IN A 4002A (WITH 20 CLBS EXTRA FOR SYSTEM DESIGN)
- ABOUT 1300 EQUIVALENT GATES
- LITTLE INTERCONNECT BETWEEN BLOCKS
[Table: NUMBER OF 10 BIT 10 TAP SYMMETRICAL FIR FILTERS PER XC4000 DEVICE. Columns: XC4000 PART (4002A, 4003A, 4004A, 4005A), NUMBER OF INSTANCES.]

Using Programmable Logic to Accelerate DSP Functions 27 FIR Filter Macro
[Figure: FIR10B10T Relatively Placed Macro with ports DATA IN (DIN_), DATA OUT (DOUT_), WORD_CLK, CLK_OUT, BIT_CLK, 10X_CLK.]
PERFORMANCE
- FIR10B10T MACRO CAN BE CLOCKED AT 50 MHZ
- 10 BIT WORD REQUIRES 11 CLOCKS
- 8 BIT WORD REQUIRES 9 CLOCKS, ETC
- 10 BIT SAMPLE WORD RATE IS 4.5 MHZ
[Table: WORD SIZE (BITS) versus SAMPLE RATE (MHZ).]
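The performance figures on this slide fit a simple rule: a word of N bits takes N + 1 clocks. A small sketch of that calculation, assuming the rule holds for other word sizes as the "ETC" suggests (the function name is illustrative):

```python
def serial_da_sample_rate_hz(bit_clock_hz, word_bits):
    """Serial DA word rate: one clock per data bit plus one extra clock per word."""
    return bit_clock_hz / (word_bits + 1)

for bits in (8, 10):
    rate = serial_da_sample_rate_hz(50e6, bits)
    print(f"{bits} bit word: {rate / 1e6:.2f} MHz")
```

The 10-bit case gives 50 MHz / 11, which is the 4.5 MHz quoted on the slide. The rate depends only on the word size, not on the number of taps.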

Using Programmable Logic to Accelerate DSP Functions 28 Double-Rate DA FIR Filters

Using Programmable Logic to Accelerate DSP Functions 29 Two Bit Parallel Distributed Arithmetic FIR Filter
- Process 2 Bits per Clock
- # of Clocks = (N/2) + 1
- Twice as fast
[Figure: SAMPLE DATA N BITS WIDE; D0 and D1 each supply two bits per clock to addresses A0 to A3 of a 16 WORD X N BIT LOOK UP TABLE addressed by A[3210]. The table holds weighted combinations of C0 and C1; entries shown include C0, 2C0, 3C0, C1, C1 + 3C0, 2C1 + C0, 2C1 + 2C0, 2C1 + 3C0. LOOK UP TABLE DATA feeds the Scaling Accum. (inputs A and B, REGISTER), which produces FILTERED DATA OUT.]
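Processing two bits per clock means each tap contributes a 2-bit digit (0 to 3) to the table address, so the table holds multiples such as 2C0 and 3C0. The sketch below illustrates this for two taps under simplifying assumptions that are not from the slide: samples are unsigned, `nbits` is even, A[1:0] comes from tap 0 and A[3:2] from tap 1, and each digit is weighted by a left shift instead of the hardware's right-shifting accumulator. Function names are hypothetical.

```python
def build_two_bit_lut(c0, c1):
    """16-word table: A[1:0] is a 2-bit digit of tap 0, A[3:2] a 2-bit digit of tap 1."""
    return [(a & 3) * c0 + (a >> 2) * c1 for a in range(16)]

def two_bit_da(x0, x1, c0, c1, nbits):
    """Compute c0*x0 + c1*x1 for unsigned samples, consuming two bits per clock.

    Returns (result, data_clocks); the slide's count adds one further clock.
    """
    lut = build_two_bit_lut(c0, c1)
    acc = 0
    clocks = 0
    for b in range(0, nbits, 2):
        addr = ((x0 >> b) & 3) | (((x1 >> b) & 3) << 2)
        acc += lut[addr] << b          # digit at bit position b has weight 2**b
        clocks += 1
    return acc, clocks
```

Halving the number of data clocks is what makes this variant "twice as fast", at the cost of a table with twice as many address bits per tap.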

Using Programmable Logic to Accelerate DSP Functions 30 Double Sample Rate D.A. FIR Filters
- Two Taps Require a 4 Input LUT without Symmetry
- Four Taps Require a 4 Level LUT with Symmetrical FIR
- Time Skew Buffer uses Twice as many CLBs
- Twice the I/O Data Sample Rate
- Both LUTs are the same

Using Programmable Logic to Accelerate DSP Functions 31 Full Parallel DA FIR Filters

Using Programmable Logic to Accelerate DSP Functions 32 Full Parallel Distributed Arithmetic FIR Filter
[Figure: SAMPLE DATA N BITS WIDE. For each tap (D0, D1), bits X0 to X3 address one LUT-A (A0 to A3) and bits X4 to X7 address a second LUT-A; the two LUT DATA outputs are summed in an adder (A, B) and registered (REG), and the per-tap results are combined by a further adder. Each LUT-A is a 16 WORD X N BIT LOOK UP TABLE addressed by A[3210] that holds multiples of C0; entries shown include C0, 2C0, 3C0, 4C0, 5C0, 6C0, 7C0, 9C0, 10C0, 11C0.]

Using Programmable Logic to Accelerate DSP Functions 33 Full Parallel D.A. FIR Filters
- One Tap Requires two 4 Input LUTs and an ADDER
- Time Skew Buffer must use REGs
- Maximum I/O Data Sample Rate

Using Programmable Logic to Accelerate DSP Functions 34 Large Number of TAPs: 8X - TAP FIR using an 8 - TAP SLICE
[Figure: each slice chains a time skew buffer (TSB, IN/OUT), REGISTER, LUT (N bits), 1’s COM and REGISTER; slice outputs are combined by ADD and REGISTER stages (N, N+1, N+2 bits) ahead of a SCAL ACC with REGISTER.]

Using Programmable Logic to Accelerate DSP Functions 35 8 Tap FIR Filter SLICE
Number of CLBs per Slice (up to 16 Bit Word): /2N + 1/2N + ((N+1)/2+1) + ((N+2)/2+1)
[Figure: TSB (IN/OUT), REGISTER, LUT (N), 1’s COM, REGISTER, ADD (N, N+1, N+2 bits), REGISTER, SCAL ACC, REGISTER.]

Using Programmable Logic to Accelerate DSP Functions Tap Filter Using Four 8 Tap FIR Filter SLICE
[Figure: Sample Data (N bits) enters a PSC clocked by Bit_Clk and New_word and passes through a chain of eight TSB stages; four SER ADD blocks feed four LUTs (8 bits), each followed by 1’s COM and REGISTER; the LUT outputs are combined by ADD and REGISTER stages (8 and 9 bits) into a SCAL ACC with Load, producing Data Out.]

Using Programmable Logic to Accelerate DSP Functions 37 8 Tap FIR Filter SLICE Building Blocks
- Parallel to Serial Converter (PSC; N-bit input, Bit_Clk, Byte_Clk): N/2 CLBs
- Time Skew Buffer (Quad) (TSB; IN, Bit3 Bit2 Bit1 Bit0): 2 CLBs (Up to 16 bit word)
- Look Up Table (LUT, REGISTER): N CLBs
- N Bit ADDer (inputs N and N, output N+1, REGISTER): (N/2)+1 CLBs
- Serial Adder: 1 CLB
- N Bit SCAL ACCUM (input N, output N+1, REGISTER): (N/2)+1 CLBs
- 1’s Complementer: 1/2 CLB

Using Programmable Logic to Accelerate DSP Functions 38 8 Tap FIR Filter SLICE
[Chart: APPROXIMATE NUMBER OF XC4000 CLBs versus SAMPLE DATA WORD SIZE (N), with one curve each for 8, 16, 24, 32, 40, 48 and 56 TAPS.]

Using Programmable Logic to Accelerate DSP Functions 39 8 Tap FIR Filter SLICE PERFORMANCE with XC
[Chart: MEGA SAMPLES PER SECOND versus SAMPLE DATA WORD SIZE, with a second curve for DOUBLE RATE PERFORMANCE.]
- Sample Rate is Independent of the Number of Taps

Using Programmable Logic to Accelerate DSP Functions Distributed Arithmetic 8 Bit Word FIR Filter Sample Rates
[Chart: Word Sample Rate (1 MHz to 5 MHz) versus Number of TAPS.]

Using Programmable Logic to Accelerate DSP Functions 41 8 Bit Word FIR Filter Structures
[Chart: # CLBs versus Number of TAPS for four structures.]
- Serial Sequential Distributed Arithmetic: 1000 to 50 kHz
- Serial Distributed Arithmetic: 8 MHz
- Two-Bit Parallel Distributed Arithmetic: 16 MHz
- Parallel Distributed Arithmetic: 55 MHz

Using Programmable Logic to Accelerate DSP Functions 42 FIR Filter Implementation Options (8 Bit Word Example)

            Serial Sequential     Distributed Arithmetic   Parallel
8 Taps      36 CLBs, 1080 kHz     44 CLBs, 8.1 MHz         250 CLBs, 60 MHz
16 Taps     36 CLBs, 462 kHz      70 CLBs, 8.1 MHz         400 CLBs, 55 MHz
32 Taps     44 CLBs, 231 kHz      122 CLBs, 8.1 MHz
48 Taps     62 CLBs, 154 kHz      178 CLBs, 8.1 MHz
64 Taps     70 CLBs, 115 kHz      228 CLBs, 8.1 MHz

Using Programmable Logic to Accelerate DSP Functions 43 Serial Sequential Architecture
- Lower Sample Rate Applications:
  - Efficient CLB Counts
  - Large Number of TAPs
  - Moderate Sample Rates
  - Non Symmetrical FIR OK

Using Programmable Logic to Accelerate DSP Functions Tap 8 Bit Example: Serial Sequential - FIR Filter
[Figure: Sample Data (8 bits) enters the SAMPLE DATA BUFFER (SDB Out) and a PSR Parallel to Serial Converter (4 CLBs). A 5-BIT CNTR (3 CLBs) drives Coefficient Select into the Coefficient Table, a 32 x 8 LUT of 8 Bit Coefficients (8 CLBs). The Serial Multiplier (ADD, 2^-1 Scale, REGISTER) feeds ACC REG and REG, producing Filtered Data Out. Clk 50 MHz. 24 CLBs Total.]

Using Programmable Logic to Accelerate DSP Functions TAP Serial Sequential FIR Filter
[Figure: two serial sequential sections, each with SAMPLE DATA BUFFER, Coefficient Select, SERIAL MULTIPLY and ACC REG, fed by Sample Data; their outputs are combined by ADD and REGISTER.]

Using Programmable Logic to Accelerate DSP Functions 46 Serial Sequential - FIR Filter: Number CLBs vs. Taps / Word Size
[Figure: Sample Data, SAMPLE DATA BUFFER, Coefficient Select, SERIAL MULTIPLY, ACC REG, REG, Filtered Data Out.]
[Chart: number of CLBs for 8, 16, 32, 48, 64, 80, 96 and 128 Tap filters, with curves for word sizes up to 10 Bit, 12 Bit, 14 Bit and 16 Bit.]
Device capacities:
- 4002 = 64 CLBs
- 4005 = 196 CLBs
- 4013 = 576 CLBs
- 4025 = 1024 CLBs

Using Programmable Logic to Accelerate DSP Functions Serial Sequential - FIR Filter: Maximum Sample Rate / Word Size

TAPS       8 Bit         10 Bit     16 Bit
8 Tap      [illegible]   625 kHz    390 kHz
16 Tap     390 kHz       312 kHz    195 kHz
32 Tap     195 kHz       156 kHz    97 kHz
48 Tap     130 kHz       104 kHz    65 kHz
64 Tap     97 kHz        78 kHz     48 kHz
80 Tap     78 kHz        62 kHz     39 kHz
96 Tap     65 kHz        52 kHz     32 kHz
128 Tap    48 kHz        39 kHz     24 kHz

- Serial Mult. Limitations
- Can Use Multiple 16 Tap Serial Sequential - FIR Filter Building Blocks
- 8X Faster at 128 Taps
[Figure: Sample Data, SAMPLE DATA BUFFER, Coefficient Select, SERIAL MULTIPLY, ACC REG, REG, Filtered Data Out.]
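The legible rates on this slide appear consistent with a simple model: one serial multiplier handles every tap in turn, spending one clock per data bit. The sketch below is an inference from those numbers rather than a formula stated in the presentation, and it assumes the 50 MHz clock shown on the 8 bit example slide. The function name is illustrative.

```python
def serial_sequential_rate_hz(clock_hz, taps_per_multiplier, word_bits):
    """Assumed model: each output needs (taps x bits) clocks on one serial multiplier."""
    return clock_hz / (taps_per_multiplier * word_bits)

# 16 taps, 10 bit words: about 312 kHz
print(int(serial_sequential_rate_hz(50e6, 16, 10)))
# 128 taps on one multiplier versus eight parallel 16-tap blocks (8 bit words)
single = serial_sequential_rate_hz(50e6, 128, 8)
sliced = serial_sequential_rate_hz(50e6, 16, 8)
print(sliced / single)
```

Under this model, splitting a 128-tap filter into 16-tap building blocks that run in parallel yields the rate of a 16-tap filter, which matches the "8X Faster at 128 Taps" note.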

Using Programmable Logic to Accelerate DSP Functions 48 Serial Sequential 16 Tap Slice FIR Filter
[Figure: two serial sequential sections, each with SAMPLE DATA BUFFER, Coefficient Select, SERIAL MULTIPLY and ACC REG, fed by Sample Data; their outputs are combined by ADD and REGISTER.]
Maximum Sample Rate / Word Size for 16, 32, 48, 64, 80, 96 and 128 Tap filters:
- 8 Bit: 390 kHz
- 10 Bit: 312 kHz
- 16 Bit: 195 kHz
Notes:
- 16-Tap Slice Used
- 32-Tap Slice Uses Fewer CLBs

Using Programmable Logic to Accelerate DSP Functions 49 DESIGN METHODOLOGY
[Flow diagram labels:]
- THIRD-PARTY FILTER DESIGN SOFTWARE
- CONVERT COEFFICIENTS
- FORMAT COEFFICIENTS INTO LOOK UP TABLE
- MEMGEN: GENERATE ROM LOOK UP TABLE, CONVERT TO XNF
- SCHEMATIC CAPTURE
- XBLOX PROCESSOR
- XNF
- PARTITION, PLACE AND ROUTE
- POST ROUTE SIMULATION
- BIT STREAM FOR DOWN LOAD CABLE, OR EPROM

Using Programmable Logic to Accelerate DSP Functions 50 DESIGN METHODOLOGY
- SCHEMATIC CAPTURE: Filter Blocks can be Embedded in Complete design
- XBLOX Can Synthesize the Data Path Logic
- Filter Design Software used to design filter Coefficients
- Complete System Level Design in a Single Chip
- Incremental Filter Design Using XACT 5.0

Using Programmable Logic to Accelerate DSP Functions 51 FPGA: The Right Solution for Most Applications
- Audio Sample Rates:
  - Don’t need Special DSP Chip
  - Serial Sequential Architecture is efficient
- RF Sample Rates:
  - Programmable DSP Chip is too slow
  - FPGA is a single chip configurable solution

Using Programmable Logic to Accelerate DSP Functions 52 XILINX VS. D.S.P. CHIP COMPARISON
When Does It Make Sense To Use FPGAs?
- High Sample Rate Systems
- Low Sample Rates
- Small Word Length
- Lots of Taps
- Single Chip Solution Required
- Low Cost Migration Path (HardWire)
- Incremental Cost of DSP Chip
- “Design Once”

Using Programmable Logic to Accelerate DSP Functions 53 DISTRIBUTED ARITHMETIC FPGA Applications, Coming Attractions:
- Signal Synthesis
- Modulation, De-modulation
- FFTs
- Neural Networks
- Half Band FIR Filters
- Video Signal Processing

Using Programmable Logic to Accelerate DSP Functions 54 POSSIBILITIES
X.D.S.P.: XILINX Hardware Digital Signal Processing
- There is an Alternative to Software DSP Chip Solutions Today
- Existing Xilinx 3100, 4000, 4000A, E, & H can Efficiently do Signal Processing
- System Level Application Specific Solution on a Single Chip
- Standard Product Configurable Solution
- Automatic Migration Path to a Lower Cost/High Volume Solution