DSP for FPGA
SYSC5603 (ELG6163) Digital Signal Processing Microprocessors, Software and Applications
Miodrag Bolic
Objectives
Comparison between PDSP and FPGA
Virtex II Pro
Altera Stratix FPGA
Stratix DSP Block and its configuration
Altera design flow
What Is an FPGA?
Field Programmable Gate Array
Device that Has a Regular Architecture (Set of Blocks) that Can Be Programmed for Various Functions
–“Glue” Logic
–Customizable Hardware Solution
–Configurable Processors
Why Use FPGAs in DSP Applications?
10x More DSP Throughput Than DSP Processors
–Parallel vs. Serial Architecture
Cost-Effective for Multi-Channel Applications
Flexible Hardware Implementation
Single-Chip Solution
–System (Hardware/Software) Integration Benefits
[Figure labels: FPGA, Software, Embedded Processor, DSP, System]
DSP Processors vs. FPGAs
High Speed DSP Processor
–1-8 Multipliers (MAC units)
–Needs looping for more than 8 multiplications
–Needs multiple clock cycles because of serial computation
–A 200-tap FIR filter would need 25+ clock cycles per sample with an 8-MAC-unit processor
High Level of Parallel Processing in FPGA
–Can implement hundreds of MAC functions in an FPGA
–Parallel implementation allows for faster throughput
–A 200-tap FIR filter would need 1 clock cycle per sample
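The 200-tap comparison above follows from dividing the number of taps by the number of multiply-accumulate units that work in parallel. A minimal sketch of that arithmetic; the function names are illustrative, and loop and memory overhead (the "+" in "25+") is ignored:

```python
import math


def cycles_per_sample(taps: int, parallel_macs: int) -> int:
    """Lower bound on clock cycles needed to produce one FIR output sample."""
    if taps <= 0 or parallel_macs <= 0:
        raise ValueError("taps and parallel_macs must be positive")
    # Each cycle, every MAC unit handles one tap; leftover taps need one more cycle.
    return math.ceil(taps / parallel_macs)


def max_sample_rate_hz(clock_hz: float, taps: int, parallel_macs: int) -> float:
    """Highest sample rate sustainable at a given clock, ignoring overhead."""
    return clock_hz / cycles_per_sample(taps, parallel_macs)


if __name__ == "__main__":
    print(cycles_per_sample(200, 8))    # 8-MAC processor
    print(cycles_per_sample(200, 200))  # fully parallel FPGA implementation
```

With 8 MAC units the bound is 25 cycles per sample; with one multiplier per tap it is 1 cycle per sample.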
Extending Range of Altera Reconfigurable DSP Solutions
[Figure: performance (MMACs/sec) for three options: Embedded Processors; Embedded Processors with Hardware Acceleration (New!); Complete Hardware Implementation]
Comparison of DSP Devices
Programmable DSP Processors
Benefits
–Easy to Use: Programmed Via C-Code or Assembly
–Fast Development Time
Weaknesses
–Fixed Architecture
–Inefficient for Highly Recursive Algorithms Unless Hardware Accelerated
–Potential Bus Bottlenecks
–Other Devices (FPGAs) Often Used on Board for Other Functions
Reconfigurable DSP
Benefits
–Easy to Use: Programmed via C-Code, Assembly, or HDL
–Efficient for Recursive Algorithms Using DSP IP Cores
–Higher Levels of Integration
Weaknesses
–Longer Development Time (But Getting Shorter!)
Stratix EP1S10 [2]
TriMatrix™ Memory [1]
M512 Blocks
–512 bits per block + parity
–Small FIFOs, Shift Register, Rake Receiver Correlator, FIR Filter Delay Line
M4K Blocks
–4 Kbits per block + parity
–Header / Cell Storage, Channelized Functions, ATM cell–packet processing, Nios Program Memory
M-RAM
–512 Kbits per block + parity
–Packet / Data Storage, Nios Program Memory, System Cache, Video Frame Buffers, Echo Canceller Data Storage
Dedicated External Memory Interface
Look-Up Schemes, Packet & Cell Buffering, Cache
More Data Ports for Greater Memory Bandwidth
More Bits For Larger Memory Buffering
Memory Bandwidth Summary, Stratix Device Family [1]
Columns: Device, Total RAM Bits, M-RAM Blocks, M4K Blocks, M512 Blocks, Maximum Bandwidth (Mbps)
EP1S10: 920, ,245,024
EP1S20: 1,669, ,096,928
EP1S25: 1,944, ,894,400
EP1S30: 3,317, ,750,192
EP1S40: 3,423, ,384,800
EP1S60: 5,215, ,762,528
EP1S80: 7,427, ,784,720
Logic Element (LE) [2]
[Functional diagram labels: 4-Input LUT with inputs data1, data2, data3, data4 and cin; addnsub; Register (D, DATA) with Sync Load & Clear Logic and Register Control Signals; LUT Chain Input and Output; Register Chain Input and Output; Register Feedback; outputs to Row, Column & DirectLink Routing and to Local Routing]
Notes:
1) Functional diagram only. Please see the datasheet for more details.
2) addnsub and data1 are connected via XOR logic.
Dynamic Arithmetic Mode
[Functional diagram labels: inputs data1, data2, data3 and addnsub; Carry-In Logic with LAB Carry-In, Carry-In0 and Carry-In1; Sum Calculator; Carry Calculator; Carry-Out Logic with Carry-Out0 and Carry-Out1; Register (D, DATA) with Sync Load & Clear Logic and Register Control Signals; Register Chain Input and Output; outputs to Row, Column & DirectLink Routing and to Local Routing]
Note: Functional diagram only. Please see the datasheet for more details.
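The diagram labels suggest that each LE computes its sum and carry for two candidate carry-ins (Carry-In0 and Carry-In1) and that the LAB carry-in picks between them. The sketch below is a simplified behavioural model of that idea, not the datasheet circuit: the function names, the choice of data1 as the operand that addnsub inverts, and the polarity of that control (modelled here as `invert`) are all assumptions.

```python
def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """One-bit full adder: returns (sum, carry_out)."""
    return a ^ b ^ cin, (a & b) | (cin & (a ^ b))


def le_dynamic_arith(data1, data2, invert, carry_in0, carry_in1, lab_carry_in):
    """Assumed model of one LE: evaluate both carry paths, select one sum."""
    a = data1 ^ invert  # XOR with the add/subtract control (assumed polarity)
    sum0, cout0 = full_adder(a, data2, carry_in0)
    sum1, cout1 = full_adder(a, data2, carry_in1)
    return (sum1 if lab_carry_in else sum0), cout0, cout1


def add_sub(x: int, y: int, width: int, subtract: int) -> int:
    """Compute (x + y) or (x - y) modulo 2**width with a chain of LEs."""
    c0, c1 = 0, 1  # the two speculative carry chains start at 0 and 1
    result = 0
    for i in range(width):
        bit, c0, c1 = le_dynamic_arith(
            (y >> i) & 1, (x >> i) & 1, subtract, c0, c1, lab_carry_in=subtract
        )
        result |= bit << i
    return result


if __name__ == "__main__":
    print(add_sub(5, 3, 8, subtract=0))
    print(add_sub(5, 3, 8, subtract=1))
```

Subtraction uses the two's-complement identity x - y = x + ~y + 1, so the same chain serves both operations by inverting one operand and selecting the carry-in-1 path.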
Logic Array Blocks (LAB) [2]
10 LEs (LE1 to LE10)
Local Interconnect
LAB-Wide Control Signals (LE Control Signals)
30 LAB Input Lines
10 LE Feedback Lines
Avalon Switch Fabric Contents
Avalon Switch Fabric provides the following to the peripherals it connects:
–Data-Path Multiplexing
–Address Decoding
–Wait-State Generation
–Dynamic Bus Sizing
–Interrupt-Priority Assignment
–Latent Transfer Capabilities
–Streaming Read and Write Capabilities
Avalon Switch Fabric tailors transactions to the characteristics of the peripherals that are attached.
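Two of the listed services, address decoding and dynamic bus sizing, can be illustrated in software. The sketch below is a behavioural analogy, not the Avalon implementation: the base addresses, the `Slave` class, and the function names are hypothetical, while the memory sizes and data-path widths are taken from the SOPC design example in these slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slave:
    name: str
    base: int        # hypothetical base address
    span: int        # size in bytes
    width_bits: int  # native data-path width

    def contains(self, addr: int) -> bool:
        return self.base <= addr < self.base + self.span


# Hypothetical address map; sizes and widths follow the design example.
SLAVES = [
    Slave("ext_flash", 0x0000_0000, 1 << 20, 16),    # 1 MB, 16-bit
    Slave("ext_sram", 0x0010_0000, 256 * 1024, 32),  # 256 KB, 32-bit
]


def decode(addr: int) -> Slave:
    """Address decoding: find the slave whose range contains addr."""
    for slave in SLAVES:
        if slave.contains(addr):
            return slave
    raise ValueError(f"no slave mapped at {addr:#x}")


def split_transfer(addr: int, master_width_bits: int = 32):
    """Dynamic bus sizing: break one master access into native slave accesses."""
    slave = decode(addr)
    if slave.width_bits >= master_width_bits:
        return [(slave.name, addr, master_width_bits)]
    step = slave.width_bits // 8
    count = master_width_bits // slave.width_bits
    return [(slave.name, addr + i * step, slave.width_bits) for i in range(count)]


if __name__ == "__main__":
    print(split_transfer(0x0000_0000))  # 32-bit read from 16-bit flash
    print(split_transfer(0x0010_0004))  # 32-bit read from 32-bit SRAM
```

A 32-bit master read from the 16-bit flash becomes two 16-bit accesses, while the same read from the 32-bit SRAM passes through as a single access.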
SOPC Design Example
[Figure: components connected through the Avalon Switch Fabric]
CPU (32-bit) with Instruction Master and Data Master
DMA Controller with Streaming
–Control Port (Slave)
–Read Port (Master – Streaming)
–Write Port (Master – Streaming)
UART
Instruction Memory (32-bit data path)
Data Memory (32-bit data path)
VGA Controller
Avalon Tri-State Bridge
External FLASH (1 MB, 16-bit data path)
External SRAM (256 KB, 32-bit data path)
The fabric allows masters and slaves to communicate without knowledge of each other's interface details.
Data Path Multiplexing & Slave Arbitration
[Figure: the SOPC design example, with an Arbiter and a MUX shown in the Avalon Switch Fabric]
1. Data-Path Multiplexing
2. Slave Arbitration
3. Address Decoding
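Slave arbitration decides which master gets a shared slave when several request it in the same cycle. The slides do not state the arbitration policy, so the sketch below assumes a plain round-robin scheme purely for illustration; the class and master names are hypothetical.

```python
class RoundRobinArbiter:
    """Grants one requesting master per cycle, rotating priority after each grant."""

    def __init__(self, masters):
        self.masters = list(masters)
        self.next_index = 0  # master with the highest priority this cycle

    def grant(self, requests):
        """requests: set of master names asking for the slave this cycle."""
        count = len(self.masters)
        for offset in range(count):
            index = (self.next_index + offset) % count
            if self.masters[index] in requests:
                # The master after the winner gets top priority next time.
                self.next_index = (index + 1) % count
                return self.masters[index]
        return None  # nobody requested the slave


if __name__ == "__main__":
    arbiter = RoundRobinArbiter(["cpu_data", "dma_read"])
    for _ in range(3):
        print(arbiter.grant({"cpu_data", "dma_read"}))
```

When both masters request continuously, grants alternate between them, so neither can starve the other.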
DSP Blocks
Each DSP block can be configured as one of:
–Eight 9 × 9 bit multipliers
–Four 18 × 18 bit multipliers
–One 36 × 36 bit multiplier
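One way to see why the same block offers either four 18 × 18 multipliers or a single 36 × 36 multiplier is that a 36-bit product can be assembled from four 18-bit partial products. The sketch below shows that arithmetic for unsigned operands; it is an illustration of the decomposition, not a description of the block's internal circuit, and the function names are made up for the example.

```python
MASK18 = (1 << 18) - 1


def mul18(a: int, b: int) -> int:
    """Model one unsigned 18 x 18 multiplier with a full-precision 36-bit result."""
    if not (0 <= a <= MASK18 and 0 <= b <= MASK18):
        raise ValueError("operands must fit in 18 unsigned bits")
    return a * b


def mul36(a: int, b: int) -> int:
    """Unsigned 36 x 36 multiply built from four 18 x 18 partial products."""
    if not (0 <= a < (1 << 36) and 0 <= b < (1 << 36)):
        raise ValueError("operands must fit in 36 unsigned bits")
    a_hi, a_lo = a >> 18, a & MASK18
    b_hi, b_lo = b >> 18, b & MASK18
    return (
        (mul18(a_hi, b_hi) << 36)
        + ((mul18(a_hi, b_lo) + mul18(a_lo, b_hi)) << 18)
        + mul18(a_lo, b_lo)
    )


if __name__ == "__main__":
    x, y = 0x9_8765_4321, 0x1_2345_6789
    print(mul36(x, y) == x * y)
```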
DSP Blocks (cont.)
The DSP block consists of:
–A multiplier block
–An adder/subtractor/accumulator block
–A summation block
–An output interface
–Output registers
–Routing and control signals
Stratix DSP Blocks
High Performance Dedicated Multiplier Circuitry
–18x18 Functions at 280 MHz
Variable Operand Widths with Full Precision Outputs
–9x9 (8 Max.)
–18x18 (4 Max.)
–36x36 (1 Max.)
Add, Accumulate or Subtract
–Signed & Unsigned Operations
–Dynamically Change between Add & Subtract
–Supports DSP Requirements Including Complex Numbers
[Figure labels: Input Register Unit, Optional Pipelining, +/-, Output Multiplexer, Output Register Unit]
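The mention of complex numbers ties the multipliers to the add and subtract stage: a complex product needs four real multiplications, one subtraction for the real part, and one addition for the imaginary part. A minimal sketch of that mapping on integer operands; the function name and the assignment of products to multipliers are illustrative only.

```python
def complex_multiply(a_re: int, a_im: int, b_re: int, b_im: int) -> tuple[int, int]:
    """(a_re + j*a_im) * (b_re + j*b_im) using four multiplies, one subtract, one add."""
    p0 = a_re * b_re  # multiplier 1
    p1 = a_im * b_im  # multiplier 2
    p2 = a_re * b_im  # multiplier 3
    p3 = a_im * b_re  # multiplier 4
    real = p0 - p1    # adder/subtractor in subtract mode
    imag = p2 + p3    # adder/subtractor in add mode
    return real, imag


if __name__ == "__main__":
    print(complex_multiply(1, 2, 3, 4))  # (1 + 2j) * (3 + 4j)
```

The example evaluates (1 + 2j)(3 + 4j), whose real part is 3 - 8 = -5 and whose imaginary part is 4 + 6 = 10.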
DSP Block for 18 x 18-bit Mode
Shift Register Chain
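This slide is a diagram only. A common use of a shift register chain in front of a set of multipliers is as the tap delay line of an FIR filter: each new sample enters the chain, the older samples move one position along, and every tap feeds one multiplier. The sketch below models that behaviour in software, assuming this is the intended use; the function name is illustrative.

```python
from collections import deque


def fir_filter(samples, coeffs):
    """Direct-form FIR: a shift-register delay line feeding one multiplier per tap."""
    # The deque models the register chain; maxlen drops the oldest sample on shift.
    delay_line = deque([0] * len(coeffs), maxlen=len(coeffs))
    outputs = []
    for x in samples:
        delay_line.appendleft(x)  # shift the new sample into the chain
        outputs.append(sum(c * d for c, d in zip(coeffs, delay_line)))
    return outputs


if __name__ == "__main__":
    # An impulse input reproduces the coefficients at the output.
    print(fir_filter([1, 0, 0, 0], [1, 2, 3]))
```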
Adder/Output Block
Time-Domain Multiplexed FIR Filters
Operation of TDM Filter
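The time-domain multiplexed filter slides are diagrams only. The general idea of time-domain multiplexing is to run a small number of multipliers at a multiple of the sample rate and reuse them across taps, trading clock cycles for hardware. A minimal software sketch of that schedule, with hypothetical names and a `multipliers` parameter that is not taken from the slides:

```python
def tdm_fir(samples, coeffs, multipliers=1):
    """FIR filter that reuses a few multipliers over several fast-clock slots.

    Returns (outputs, fast_cycles), where fast_cycles counts multiplier clock ticks.
    """
    taps = len(coeffs)
    slots = -(-taps // multipliers)  # ceiling division: slots per input sample
    delay = [0] * taps
    outputs = []
    fast_cycles = 0
    for x in samples:
        delay = [x] + delay[:-1]  # shift the new sample into the delay line
        acc = 0
        for slot in range(slots):
            # In each fast-clock slot, every multiplier handles one tap.
            for m in range(multipliers):
                k = slot * multipliers + m
                if k < taps:
                    acc += coeffs[k] * delay[k]
            fast_cycles += 1
        outputs.append(acc)
    return outputs, fast_cycles


if __name__ == "__main__":
    out, cycles = tdm_fir([1, 0, 0, 0], [1, 2, 3, 4], multipliers=2)
    print(out, cycles)
```

With four taps and two multipliers, each input sample takes two fast-clock slots, so the multiplier clock must run at twice the sample rate.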
Resource Savings with DSP Blocks
DSP Block
–Reduces LE Usage
–Reduces Routing Congestion
–Reduces Power
–Maintains Performance
18 X X SAVES 652 ROUTING NETS!
90% of your problems are hidden under the surface!
Design Flow
Design Flow Overview
1) Create Design in Simulink Using Altera Libraries
2) Simulate in Simulink
3) Add SignalCompiler to Model
4) Create HDL Code & Generate Testbench
5) Perform RTL Simulation
6) Synthesize HDL Code & Place & Route
7) Program Device
8) SignalTap II Logic Analyzer
Step 1- Create Design in Simulink Using Altera Libraries Drag & Drop Library Blocks into Simulink Design & Parameterize Each Block
Parameterization of IP Megacores
Step 2 - Simulate in Simulink
Step 3 - Add “Signal Compiler” to Model to Generate HDL Code
[Dialog labels]
Device family: APEX20K/E/C, APEX II, Stratix & Stratix GX, Cyclone & ACEX 1K, Mercury, FLEX10K & FLEX 6000, DSP Boards
Synthesis tool: Leonardo Spectrum, Synplify, Quartus II
Optimization: Speed vs. Area
Testbench Generation
Message Window
Step 4 - Create HDL Code & Generate Testbench
Enable the "Generate Stimuli for VHDL Testbench" Button
[Figure labels: AltrFir32.mdl, AltrFir32.vhd]
HDL Code Generation
DSP Builder Report File
Lists All Converted Blocks
–Port Widths
–Sampling Frequencies
–Warnings & Messages
Step 5 – Perform RTL Simulation (ModelSim)
1) Set working directory (File => Change Directory)
2) Run TCL file (Tools => Execute Macro)
Perform Verification ModelSim vs Simulink
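The slide compares the two simulations visually. The same check can be sketched as a sample-by-sample comparison, assuming both tools' outputs have been exported as lists of integers; the export step, the `latency` parameter, and the function name are hypothetical and not part of the DSP Builder flow described here.

```python
def compare_outputs(simulink, rtl, latency=0):
    """Return (index, expected, actual) for every sample where the RTL differs.

    latency: number of leading RTL samples to skip (assumed pipeline delay).
    """
    aligned = rtl[latency:]
    length = min(len(simulink), len(aligned))
    return [
        (i, simulink[i], aligned[i])
        for i in range(length)
        if simulink[i] != aligned[i]
    ]


if __name__ == "__main__":
    reference = [1, 2, 3]
    print(compare_outputs(reference, [0, 0, 1, 2, 3], latency=2))  # matching run
    print(compare_outputs(reference, [1, 5, 3]))                   # one mismatch
```

An empty list means the RTL simulation reproduces the Simulink result over the compared range.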
Step 6 - Synthesize HDL & Place & Route
–Synthesis: Leonardo Spectrum, Synplify, or Quartus II
–Place & Route: Quartus II Fitter
Step 7 – Program Device Download Design to DSP Development Kits
Stratix DSP Development Board
[Board photo labels]
40-Pin Connectors for Analog Devices
Texas Instruments Connectors on Underside of Board
Mictor-Type Connectors for HP Logic Analyzers
MAX 7000 Device
Analog SMA Connectors
D/A Converters
A/D Converters
Prototyping Area
Nios Expansion Prototype Connector
Stratix DSP Board – Key Features
Stratix EP1S25F780C5 Device (Starter Version)
Stratix EP1S80B956C7 Device (Professional Version)
Analog I/O
–Two 12-bit, 125 MHz A/D Converters
–Two 14-bit, 165 MHz D/A Converters
Digital I/O
–Two 40-pin Connectors for Analog Devices A/D Converter Evaluation Boards
–Connector for TI TMS320 Cross-Platform Daughter Card
–3.3V Expansion/Prototype Headers
–RS-232 Serial Port
Memory
–2 Mbytes of 7.5-ns Synchronous SRAM
–32 Mbytes of FLASH
Step 8 - SignalTap II Logic Analyzer
Embedded Logic Analyzer
–Downloads into Device with Design
–Captures State of Internal Nodes
–Uses JTAG for Communication
SignalTap II Logic Analyzer Imported Data Imported Plot Analysis of Imported Data