Tools - Hardware Optimization - Chapter 12 slide 1 Version 1.5 FPGA Tools Training Class Hardware Optimization.

Slide 2: In This Chapter, You Will Learn
Design techniques to optimize performance
  – Logic techniques
  – Special Xilinx hardware features
Topics apply to both synthesis and schematic users

Slide 3: Outline
CLB Combinatorial Logic
CLB Register Resources
Memory Usage
Input/Output Block Usage
Tips and Guidelines
Summary

Slide 4: Combinatorial Resource Review
How is a 9-input AND gate implemented in a CLB?
  – Three stages shown below explain the mapping process
[Figure: CLB mapping of the 9-input AND gate, with output O and flip-flops FFX and FFY]
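
The mapping question above can be explored with a small software model. This is only a sketch, assuming an XC4000-style CLB with two 4-input function generators (F and G) feeding a 3-input generator (H); the slide's figure is not available, so the exact split of inputs is an assumption, and the helper names are hypothetical.

```python
from itertools import product

def make_lut(func, n_inputs):
    """Build a truth table (like LUT SRAM contents) for an n-input function."""
    return tuple(func(*bits) for bits in product((0, 1), repeat=n_inputs))

def read_lut(table, *bits):
    """The inputs form the address into the table, first input as the MSB."""
    address = 0
    for bit in bits:
        address = (address << 1) | bit
    return table[address]

# Assumed mapping: F and G each AND four inputs, H combines F, G and the ninth input.
AND4 = make_lut(lambda a, b, c, d: a & b & c & d, 4)
AND3 = make_lut(lambda f, g, h1: f & g & h1, 3)

def and9(inputs):
    """9-input AND built from two 4-input LUTs and one 3-input LUT."""
    f = read_lut(AND4, *inputs[0:4])
    g = read_lut(AND4, *inputs[4:8])
    return read_lut(AND3, f, g, inputs[8])

if __name__ == "__main__":
    print(and9([1] * 9), and9([1] * 8 + [0]))
```

Under this assumed mapping the whole function fits in one CLB with two levels of lookup (F/G, then H).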

Slide 5: Three-State MUX
Wide MUXes implemented in LUTs have many levels of logic
  – BUFT multiplex function uses SRAMs to decode select signals and internal tri-state buffers
  – Fewer CLBs are used and routing congestion is decreased
BUFT delay varies with size of FPGA
Small 4-to-1 MUX is shown below
  – Example: BUFT implementation
[Figure: 4-to-1 MUX built from four BUFTs with data inputs D driving a shared output O; select inputs S are decoded by a D2_4E decoder]
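
A behavioral sketch of the BUFT multiplexer idea follows. It is a software model, not library code: the function names are hypothetical, the decoder only imitates what a 2-to-4 decoder with enable does, and the model uses a positive "drive" flag for readability rather than the actual control-pin polarity of the BUFT component.

```python
def decode_2to4(sel, enable=True):
    """One-hot decode of a 2-bit select value, gated by an enable."""
    return [enable and sel == i for i in range(4)]

def buft(data, drive):
    """Tri-state buffer model: pass data when driving, else high impedance."""
    return data if drive else "Z"

def resolve(drivers):
    """Resolve a shared line: at most one buffer may drive it at a time."""
    active = [v for v in drivers if v != "Z"]
    if not active:
        return "Z"
    if len(set(active)) > 1:
        raise ValueError("bus contention: buffers drive different values")
    return active[0]

def buft_mux4(d, sel, enable=True):
    """4-to-1 MUX: the decoder enables exactly one of four buffers."""
    enables = decode_2to4(sel, enable)
    return resolve([buft(d[i], enables[i]) for i in range(4)])

if __name__ == "__main__":
    print(buft_mux4([0, 1, 1, 0], sel=2))
```

Widening the MUX in this model adds buffers on the same line rather than more levels of logic, which is the property the slide highlights.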

Slide 6: BUFT Multiplexers
BUFT can be used to build large MUXes
  – Wide MUXes composed of LUTs need multiple levels of logic
  – Wide MUXes composed of BUFTs use SRAMs to decode select signals and internal tri-state buffers
MUX should be built across one row of CLBs
Standard library multiplexer macros use Look-Up Tables
  – Example: 4-to-1 MUX with enable, M4_1E, is built with CLBs
LogiCORE MUXes with Style = WAND use BUFTs
Xilinx Unified library BUFT components
  – BUFT, BUFT4, BUFT8, BUFT16
Synthesis tools: a BUFT MUX will be generated in all synthesizers whenever an IF-THEN type statement drives a high-Z value. Otherwise CLB MUXes are generated.

Slide 7: Carry Logic
Each CLB contains dedicated arithmetic logic for fast carry and borrow signals
  – Carry logic is associated with F and G function generators
Carry logic components have a vertical orientation
  – Needed for speed and utilization
  – Known as RPM or “Relationally Placed Macro”
  – Examples:
    * ADDx adders
    * ADSUx adder/subtractors
    * CCx counters
    * COMPMCx magnitude comparators
[Figure: ADD4 adder with A and B input pairs stacked vertically and output Z]
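
To illustrate what a carry chain computes, here is a minimal bit-level sketch of a 4-bit ripple-carry adder. It models the arithmetic only, not the dedicated carry hardware or its timing; the function names are hypothetical and merely echo the ADD4 macro name.

```python
def full_adder(a, b, cin):
    """One bit position: the sum bit plus the carry passed to the next position."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def add4(a, b, cin=0):
    """4-bit add: the carry ripples from bit 0 upward, one position per stage."""
    total, carry = 0, cin
    for i in range(4):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry  # 4-bit sum and carry out

if __name__ == "__main__":
    print(add4(9, 8))
```

The carry signal is the only value that crosses bit positions, which is why a fast dedicated path for it, stacked along one column, dominates adder speed.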

Slide 8: Counters
Libraries support a wide variety of fast and efficient counters
  – Counters offer trade-offs between speed, utilization, and complexity
  – Example: LogiBLOX counter styles
    * Binary: slow and large
    * Johnson: fastest practical counter, uses few Flip-Flops
    * LFSR: fast and dense, but pseudo-random outputs
    * One-Hot: useful for generating series of enables
    * Carry Chain: high speed and utilization
  – Synthesis tools select a component based on the design, or the designer can instantiate a component using LogiBLOX.
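
Two of the counter styles listed above are easy to compare in a few lines. The sketch below models a Johnson counter and a one-hot (ring) counter as shift registers; it is an illustration of the counting sequences only, with hypothetical function names, and does not reflect how LogiBLOX builds them.

```python
def johnson_step(state):
    """Shift by one position, feeding back the inverse of the last bit."""
    return (1 - state[-1],) + state[:-1]

def one_hot_step(state):
    """Shift by one position, feeding back the last bit unchanged."""
    return (state[-1],) + state[:-1]

def cycle(step, start):
    """Collect the states visited until the counter returns to its start state."""
    states, s = [start], step(start)
    while s != start:
        states.append(s)
        s = step(s)
    return states

if __name__ == "__main__":
    for s in cycle(johnson_step, (0, 0, 0, 0)):
        print(s)
```

With four flip-flops the Johnson counter visits eight states and changes one bit per step, while the one-hot counter visits four states with a single active bit, which is what makes it convenient for a series of enables.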

Slide 10: Global Clock Buffers
Global Buffers are low-skew, high-drive buffers
  – Drive low-skew, high-speed long line resources
  – Drive all Flip-Flops and Latches in FPGA
  – Can also be used for high-fanout non-clock signals
  – Check device for number of clocks
To use the global buffer, instantiate the BUFG component
For synthesis: clocks are identified by different means depending on vendor
  – Example: Synopsys FPGA Compiler connects clock buffers to all fan-in of clock pins
    * Control clock buffer insertion with separate commands
    * Consult synthesis interface guide or vendor

Slide 11: CLB Registers
Each register can be configured as a Flip-Flop or Latch
Independent clock polarity
Asynchronous Preset or Clear
Synchronous Set or Reset
Clock Enable
Direct input from CLB input (connections bypass LUTs)
[Figure: CLB register block diagram with inputs F, G, H, and DIN, S/R control blocks, clock K, clock enable EC, and flip-flop outputs QX and QY]

Slide 12: CLB Set and Reset Capabilities
CLB Flip-Flop features include Asynchronous Preset/Clear or Synchronous Set/Reset
  – Synchronous Set/Reset is implemented in LUT
  – Asynchronous Clear/Preset has two sources
    * Dedicated Global Set/Reset (GSR) net
    * Local Asynchronous Preset/Clear
[Figure: FDC flip-flop with local asynchronous preset/clear and GSR, and a flip-flop whose synchronous set/reset comes from a LUT ahead of the D input]

Slide 13: Global Reset (1)
All Flip-Flops are always initialized during power up
  – Via the Global Set/Reset network
You can access this network by instantiating the STARTUP primitive
  – GSR is automatically connected to all CLB Flip-Flops using dedicated routing resources; in general you don’t need to connect STARTUP to Flip-Flops
  – GSR, GTS, and Clock can be driven by internal signals or pins
  – Assert GSR for global set or reset; GTS controls the tri-state buffers in IOBs
  – Saves general use routing resources for the design
[Figure: STARTUP primitive with labels GSR, GTS, CLK, Q2, Q3, Q1, Q4, and DoneIn]

Slide 14: Global Reset (2)
Use Global Reset whenever possible
  – Local asynchronous reset is routed on general purpose interconnects
  – Global Set/Reset is routed on dedicated interconnects
  – Any signal or pin can drive the global set/reset pin
To use global reset network, Register Reset and Startup RST pin must be driven by the same signal. Examples:
  – Bad example: general purpose routing is used
  – Improved example: general purpose routing is not used
  – Or, good for simulation: extra connections will be trimmed by Design Manager
[Figure: three arrangements of the reset signal, the STARTUP GSR pin, and the connection to the Flip-Flops]

Slide 15: Flip-Flop Clock Enable (1)
Register output does not change while clock enable is inactive
Allows synchronous design
Use instead of gating the clock signal
Clock enable is implemented in two ways:
  – Directly inside the flip-flop via dedicated CE pin
  – In a Look-Up Table
[Figure: flip-flop with D, EC, SET, and RESET inputs and output Q (QX)]

Slide 16: Clock Enable Example
Use clock enable when using most of or all logic inputs
  – Avoid gating of clock signal directly
Use MUXed data when using only 1-2 logic inputs or for a gated clock enable
  – Or, when two different clock enables must drive Flip-Flops in one CLB
[Figure: "Use Clock Enables Instead of Gating Clock", showing flip-flops with D, Q, and CE pins, including an FDxE]

Slide 17: Minimize the Number of Clocks
Use clock enable to reduce the number of clocks
Example with two clocks:
[Figure: FF1 clocked by CLK1 and FF2 clocked by CLK2, with signal X between them and output OUT1]
Consider using clock enable instead of a clock
  – Useful when:
    * CLK2 is much slower than CLK1
    * Or, CLK1 and CLK2 have a definite phase relationship
[Figure: the same circuit with FF2 clocked by CLK1 and CLK2 driving its CE input]
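
The clock-enable idea can be sketched as a cycle-by-cycle model. This is an illustration under stated assumptions: the class and function names are hypothetical, and the slower clock is assumed to be an exact integer division of the fast clock, so it can be replaced by an enable pulse once every `divide_by` fast-clock cycles.

```python
class FlipFlopCE:
    """D flip-flop with clock enable: captures D on a clock edge only when CE is high."""

    def __init__(self):
        self.q = 0

    def clock(self, d, ce):
        if ce:
            self.q = d
        return self.q

def run_with_enable(data, divide_by):
    """Clock one flip-flop on every fast-clock cycle, enabling it every divide_by cycles."""
    ff = FlipFlopCE()
    outputs = []
    for cycle, d in enumerate(data):
        ce = (cycle % divide_by == 0)  # stands in for the slower clock
        outputs.append(ff.clock(d, ce))
    return outputs

if __name__ == "__main__":
    print(run_with_enable([1, 2, 3, 4, 5, 6, 7, 8], divide_by=4))
```

The flip-flop updates at the slow rate while every register in the design still shares one clock, so no second clock net or gated clock is needed.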

Slide 19: RAM Provides 16X the Storage of Flip-Flops
32 bits versus 2 bits of storage
  – Two 16x1 RAMs or one 32x1 Single-Port RAM fit in one CLB
  – One 16x1 Dual-Port RAM fits in one CLB
32x8 shift register with RAM = 11 CLBs
  – Using Flip-Flops, takes 128 CLBs for data alone
[Figure: a CLB used as RAM (32 bits; inputs A0 to A4, D1, WE, CLK; output O1) beside a CLB used as two flip-flops (2 bits; inputs D1, D2; outputs Q1, Q2)]
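
The CLB counts above can be checked with simple arithmetic. The per-CLB capacities come from the slide; the breakdown of the 11-CLB figure is not given there, so the sketch assumes (and this is only one plausible accounting) that the extra CLBs hold a 5-bit address counter built from flip-flops.

```python
import math

FLIP_FLOPS_PER_CLB = 2   # from the slide: 2 bits of flip-flop storage per CLB
RAM_BITS_PER_CLB = 32    # from the slide: 32 bits of RAM storage per CLB

def clbs_for_flip_flop_storage(depth, width):
    """CLBs needed to hold depth x width bits in flip-flops."""
    return math.ceil(depth * width / FLIP_FLOPS_PER_CLB)

def clbs_for_ram_storage(depth, width):
    """CLBs needed to hold depth x width bits in CLB RAM."""
    return math.ceil(depth * width / RAM_BITS_PER_CLB)

def clbs_for_address_counter(depth):
    """ASSUMPTION (not from the slide): one flip-flop per address bit."""
    address_bits = (depth - 1).bit_length()
    return math.ceil(address_bits / FLIP_FLOPS_PER_CLB)

if __name__ == "__main__":
    print(clbs_for_flip_flop_storage(32, 8))                          # data in flip-flops
    print(clbs_for_ram_storage(32, 8) + clbs_for_address_counter(32))  # RAM plus counter
```

Under that assumption the data takes 8 CLBs of RAM and the counter takes 3, which matches the slide's total of 11 against 128 CLBs for flip-flop storage.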

Slide 20: General RAM Guidelines
32 words or fewer gives fastest performance
  – 32x1 or 16x2 RAM fits in one CLB
    * Delays are short (one level of logic)
  – Data and output MUXes are required to expand depth
Less than 256 words recommended per RAM
  – Exceptions include T1 framers, which use RAMs as a shift register
Width easily expanded
  – Connect the address lines to multiple blocks
Recommendation: use less than 1/2 of max memory resources
  – Maximum memory uses all logic resources of CLBs

Slide 21: Memory Use
Most synthesis tools can synthesize ROM from behavioral HDL code
RAM memories may be synthesized
  – Synplicity can synthesize RAMs
Use library primitives and macros for standard size memory
  – RAM/ROM16X1S to 32X8S
  – Use S suffix for Synchronous RAM
  – Use D suffix for Dual-Port RAM
Use LogiBLOX to generate custom size memories
[Figure: RAM16X8S with inputs A0 to A3, D[7:0], WE, and WCLK and output O[7:0]]

Slide 23: IOB Block Diagram
Three-state output
Registered input or output
Bi-directional I/O
Output slew rate control
Programmable setup/hold delay
[Figure: IOB block diagram with PAD, IN and OUT paths, output FF, input FF or LATCH, FAST LATCH, DELAY, SLEW RATE CONTROL, PULL-UP, and PULL-DOWN]

Slide 24: IOB Flip-Flops and Latches
Synthesis tools and Design Manager can move internal registers into IOBs to meet timing constraints
Flip-Flops and Latches can be used in unbonded IOBs
Use IOB Flip-Flops:
  – When all CLB Flip-Flops are used
  – To minimize the Flip-Flop-to-PAD delay
  – To minimize skew between outputs
IO Blocks contain minimal combinatorial logic
  – IOB Flip-Flops can be used as part of an internal shift register
  – Do not use IOB Flip-Flops as part of a pipeline
Input library components begin with I
  – Examples: ILD, IFD16
Output components begin with O
  – Examples: OFD, OFDT16

Slide 25: Output Three-State Control
Instantiation: use OBUFE and OBUFT components
  – OBUFT output is in the high impedance state when OE is low
Synthesis: If-Then statements driving a Hi-Z value onto an output may be synthesized into an OBUFE or OBUFT
Three-state control also via a dedicated global net
  – Needed for configuration
  – Also controlled by GTS on STARTUP primitive
[Figure: OE driving an OBUFE, with the truth table below]
  IN   T   OUT
  X    1   Z
  IN   0   IN

Slide 26: Output Combinatorial Logic
Small functions can be built into the IOB
  – Can be used as a generic two-input function generator or MUX
  – One input can be driven by IOB output clock signal
  – Requires library components beginning with “O”
    * Examples: OAND, OMUX
  – F input pin is faster than IO pin
  – Does not apply to all FPGAs
[Figure: OAND2 in the IOB with a FAST F input and an IO input, driving an OPAD]

Slide 27: Guidelines for IOB Use
Unused IOBs:
  – Outputs of unused IOBs are automatically disabled
  – Pull-ups are automatically connected on unused IOBs
Used IOBs:
  – A PULLUP or PULLDOWN primitive can be connected to used IOBs
  – Inputs should not be left floating
    * Add a pull-up to design inputs that may be left floating to reduce power and noise
Output drive
  – 12 mA sink current per output on most families
  – Two adjacent outputs can be tied together to double the drive off chip

Slide 29: Pipeline for Speed
Use synchronous design
Pipelining improves speed
  – Consider wherever latency is not an issue
  – Use for terminal counts, carry lookahead, etc.
How to estimate the number of logic levels per stage
Example for 100 MHz clock frequency in XC4013XL-09:
  Clock period                               10 ns
  One level (t_CO + t_NET + t_SU)          - 4.1 ns
  Delay allowance                            7.9 ns
  Each added level (t_PD + t_NET)          / 3.2 ns
  Additional levels of logic allowed         2 CLBs
  – Why isn’t the SRAM in the CLB included in the delay calculation?
[Figure: timing path through CLBs: t_CO, t_NET, t_PD, t_NET, t_PD, t_NET, t_SU]
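
The last step of the estimate above is a division rounded down to whole logic levels. The sketch below shows only that step, with a hypothetical function name. As transcribed, the slide's subtraction does not close (10 ns minus 4.1 ns is 5.9 ns, not 7.9 ns), so the delay allowance is passed in directly rather than derived, and the inputs are the slide's stated figures.

```python
import math

def additional_levels(delay_allowance_ns, per_level_ns):
    """Whole extra logic levels that fit in the remaining delay allowance."""
    return math.floor(delay_allowance_ns / per_level_ns)

if __name__ == "__main__":
    # Figures as stated on the slide: 7.9 ns allowance, 3.2 ns per added level.
    print(additional_levels(7.9, 3.2))
```

Rounding down matters because a partial level cannot be placed: any path with one more level than this result would miss the clock period.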

Slide 30: Pipeline Example
Break up combinatorial logic into separate stages
  – Clock frequency increases
  – Latency also increases: extra cycle(s) are added
Example: frequency can double by adding another stage, but an extra cycle is added
[Figure: a multiplier (*) and adder (+) with inputs a, b, and c and output out, shown without and with the added pipeline stage]
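
The latency trade-off can be seen in a cycle-level model. This sketch assumes the figure shows a multiply followed by an add (a * b + c) and that the added stage is a register between the two operations; both the structure and the function names are assumptions made for illustration.

```python
def combinational(stream):
    """Multiply and add in the same cycle: one result per input, no added latency."""
    return [a * b + c for a, b, c in stream]

def pipelined(stream):
    """Register the product (and c) between the multiply and the add."""
    reg = None          # pipeline register: (a * b, c) captured on the previous cycle
    outputs = []
    for a, b, c in stream:
        # Stage 2 works on last cycle's register contents; nothing is valid at first.
        outputs.append(None if reg is None else reg[0] + reg[1])
        reg = (a * b, c)  # stage 1 result captured at the clock edge
    if reg is not None:
        outputs.append(reg[0] + reg[1])  # one extra cycle to drain the pipeline
    return outputs

if __name__ == "__main__":
    data = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
    print(combinational(data))
    print(pipelined(data))
```

The pipelined version produces the same results one cycle later, and each stage now contains only one operation, which is what allows the clock period to shrink.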

Slide 31: Keep Related Logic Together (1)
No hierarchy in combinational path
Example: optimization is limited because hierarchical boundaries prevent sharing of common terms
The path from Reg A to Reg C is divided among three different block descriptions
[Figure: blocks A, B, and C on the combinational path between Reg A and Reg C]

Slide 32: Keep Related Logic Together (2)
Good example:
Related combinational logic drives registers in the same block
No hierarchical boundaries between combinational logic and registers
  – Allows for improved sequential mapping
[Figure: block A with Reg A, and a combined block B & C with Reg C]

Slide 33: Register All Block Outputs
Align block boundaries on register outputs
  – Helps floorplanning
Poor partitioning
  – Sum is not registered, and may become a critical path.
Good partitioning
  – Why is performance improved when combinatorial logic drives a register in the same CLB?
[Figure: poor and good partitioning of an adder (+) with inputs a0 and a1, clock clk, and output sum]

Slide 34: Duplicate Registers to Reduce Fanout
Why does fanout reduction improve performance?
Before: register has 24 loads
After: each register has 12 loads
[Figure: one register (en, clk) driving out[23:0], and the same output driven by two duplicated registers]

Slide 35: Counter Tips (1)
Do not use binary sequence if unnecessary
Consider higher performance or smaller counter types
  – Examples: LFSR, Pre-scaled, Gray
Use Pre-Scaling on non-loadable counters to increase speed
  – LSBs toggle quickly
  – See Application Notes XAPP001 and XAPP014
[Figure: a fast small counter whose TC output drives the CE input of a large dense counter with slower carry]

Slide 36: Counter Tips (2)
Use Gray code counters if decoding outputs
  – Glitch free, because only one bit changes per transition
Consider Linear Feedback Shift Register for speed when terminal count is all that is needed
  – Or when any regular sequence is acceptable (e.g., FIFO)
[Figure: 10-bit shift register with labels Q0, Q6, and Q9]
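
Both tips can be checked with a short model. The Gray code conversion is the standard binary-reflected form. The LFSR is a sketch under assumptions: the taps at Q6 and Q9 are read from the figure labels, and XOR feedback into Q0 is assumed (an XNOR variant, which makes the all-zeros state legal instead, is equally possible); the function names are hypothetical.

```python
def to_gray(n):
    """Binary-reflected Gray code of n."""
    return n ^ (n >> 1)

def lfsr10_step(state):
    """10-bit LFSR: shift toward Q9 and load Q0 with Q6 XOR Q9 (assumed taps)."""
    feedback = ((state >> 6) ^ (state >> 9)) & 1
    return ((state << 1) | feedback) & 0x3FF

def lfsr_period(start=1):
    """Number of steps before the LFSR returns to its start state."""
    state, steps = lfsr10_step(start), 1
    while state != start:
        state = lfsr10_step(state)
        steps += 1
    return steps

if __name__ == "__main__":
    print([format(to_gray(i), "03b") for i in range(8)])
    print(lfsr_period())
```

With these taps the register cycles through every nonzero 10-bit state before repeating, using a single XOR gate as its only logic, which is why it suits cases where only the terminal count matters.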

Slide 37: State Machine Design Tips (1)
Use One-Hot Encoding for small state machines
  – Shift-register like structure
  – One Flip-Flop is assigned to each state
  – Works well in Xilinx “register-rich” FPGAs
  – Number of required Flip-Flops may be higher than other state machines, but logic to generate state is less complex
  – RAMs can be used to encode large state machine
Prototype OHE State Machine: Qx, Qy, and Qz are composed of state variables from previous states
[Figure: three flip-flops for states Qn, Qn + 1, and Qn + 2, each with D input logic fed by inputs I1 to In and by Qx, Qy, or Qz]
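
A minimal one-hot state machine shows why the next-state logic stays simple. The three states and the `start` and `finish` inputs below are hypothetical, chosen only for illustration; each tuple element stands for one flip-flop.

```python
STATES = ("IDLE", "RUN", "DONE")  # hypothetical states, one flip-flop each

def next_state(q, start, finish):
    """Each flip-flop's D input is an OR of (previous-state bit AND condition) terms."""
    idle, run, done = q
    return (
        int((idle and not start) or done),               # D input of the IDLE flip-flop
        int((idle and start) or (run and not finish)),   # D input of the RUN flip-flop
        int(run and finish),                             # D input of the DONE flip-flop
    )

if __name__ == "__main__":
    q = (1, 0, 0)  # reset state: IDLE is hot
    for start, finish in [(1, 0), (0, 0), (0, 1), (0, 0)]:
        q = next_state(q, start, finish)
        print(STATES[q.index(1)])
```

Each D input depends on only one or two state bits plus an input, so no state decoding is needed, at the cost of one flip-flop per state.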

Slide 38: State Machine Design Tips (2)
Split complex states
Need to minimize number of inputs, not number of Flip-Flops, in FPGAs
  – Use One-Hot encoding for medium to large state machines (greater than 12 states)
Complex states may be improved by breaking up into additional simpler states
[Figure: State A with transition cond1 to State B, and the same machine with State A split into State A1 and State A2]

Slide 39: State Machine Design Tips (3)
Consider a pipeline: break the state machine into two or more clock cycles
  – Two clock cycles for a state is better than having to slow the clock for the entire state machine
  – In practice, this means breaking up wide input equations using intermediate nodes in the state diagram
[Figure: a transition from State A to State C, and the same transition passing through an intermediate State B]

Slide 41: Summary
Use Tri-state buffers for multiplexing
Carry Logic is not the only way to create fast arithmetic functions
Use the GSR net to save routing resources and use global routing resources
Use Clock Enable port on registers to design synchronously and save logic
Best memories are <=32 words
Use LogiBLOX to customize memories
Use IOB registers for modules that do not require logic, such as shift registers
Refer to LogiBLOX or Design Manager Help for more information on LogiBLOX

Slide 42: Questions (1)
What problem may occur in this circuit?
How can the circuit be improved?
[Figure: binary counter with outputs Q0, Q1, and Q2, a TC signal, and a D flip-flop with clock input CK]

Slide 43: Questions (2)
What does GSR stand for?
  – What component sources the GSR net?
  – When should the GSR net be used?
What component is instantiated to use the Global Clock?
Can the Global Clock be synthesized?

Slide 44: Questions (3)
How many global clocks can be used in an XC4085XL-3?
  – See the data sheet for the XC4000XL family, available on WEB or the AppLINX CD.
Why is one-hot encoding a good way to encode a small state machine?
When should IOB registers be used? When should they be avoided?