Spartan-3 FPGA HDL Coding Techniques

Slides:



Advertisements
Similar presentations
What are FPGA Power Management HDL Coding Techniques Xilinx Training.
Advertisements

Basic HDL Coding Techniques
FPGA and ASIC Technology Comparison - 1 © 2009 Xilinx, Inc. All Rights Reserved FPGA and ASIC Technology Comparison, Part 2.
Verilog Fundamentals Shubham Singh Junior Undergrad. Electrical Engineering.
Combinational Logic.
Xilinx CPLDs and FPGAs Module F2-1. CPLDs and FPGAs XC9500 CPLD XC4000 FPGA Spartan FPGA Spartan II FPGA Virtex FPGA.
ECE 551 Digital System Design & Synthesis Lecture 08 The Synthesis Process Constraints and Design Rules High-Level Synthesis Options.
EELE 367 – Logic Design Module 2 – Modern Digital Design Flow Agenda 1.History of Digital Design Approach 2.HDLs 3.Design Abstraction 4.Modern Design Steps.
Spartan II Features  Plentiful logic and memory resources –15K to 200K system gates (up to 5,292 logic cells) –Up to 57 Kb block RAM storage  Flexible.
1 VLSI DESIGN USING VHDL Part II A workshop by Dr. Junaid Ahmed Zubairi.
Kazi Spring 2008CSCI 6601 CSCI-660 Introduction to VLSI Design Khurram Kazi.
The Spartan 3e FPGA. CS/EE 3710 The Spartan 3e FPGA  What’s inside the chip? How does it implement random logic? What other features can you use?  What.
Dr. Turki F. Al-Somani VHDL synthesis and simulation – Part 3 Microcomputer Systems Design (Embedded Systems)
Evolution of implementation technologies
Achieving Timing Closure. Achieving Timing Closure - 2 © Copyright 2010 Xilinx Objectives After completing this module, you will be able to:  Describe.
Introduction to FPGA’s FPGA (Field Programmable Gate Array) –ASIC chips provide the highest performance, but can only perform the function they were designed.
The Xilinx Spartan 3 FPGA EGRE 631 2/2/09. Basic types of FPGA’s One time programmable Reprogrammable (non-volatile) –Retains program when powered down.
FPGA and ASIC Technology Comparison - 1 © 2009 Xilinx, Inc. All Rights Reserved Virtex-5 FPGA Coding Techniques, Part 2.
FPGA and ASIC Technology Comparison - 1 © 2009 Xilinx, Inc. All Rights Reserved Virtex-5 FPGA Coding Techniques Part 1.
Virtex-5 FPGA HDL Coding Techniques
ECE 551 Digital System Design & Synthesis Lecture 11 Verilog Design for Synthesis.
Global Timing Constraints FPGA Design Workshop. Objectives  Apply timing constraints to a simple synchronous design  Specify global timing constraints.
Digital Design Strategies and Techniques. Analog Building Blocks for Digital Primitives We implement logical devices with analog devices There is no magic.
FPGA-Based System Design: Chapter 4 Copyright  2004 Prentice Hall PTR HDL coding n Synthesis vs. simulation semantics n Syntax-directed translation n.
1 VERILOG Fundamentals Workshop סמסטר א ' תשע " ה מרצה : משה דורון הפקולטה להנדסה Workshop Objectives: Gain basic understanding of the essential concepts.
ISE. Tatjana Petrovic 249/982/22 ISE software tools ISE is Xilinx software design tools that concentrate on delivering you the most productivity available.
© 2003 Xilinx, Inc. All Rights Reserved Reading Reports Xilinx: This module was completely redone. Please translate entire module Some pages are the same.
1 H ardware D escription L anguages Modeling Digital Systems.
FPGA and ASIC Technology Comparison - 1 © 2009 Xilinx, Inc. All Rights Reserved Spartan-3 FPGA Coding Techniques Part 1.
AMIN FARMAHININ-FARAHANI CHARLES TSEN KATHERINE COMPTON FPGA Implementation of a 64-bit BID-Based Decimal Floating Point Adder/Subtractor.
FPGA (Field Programmable Gate Array): CLBs, Slices, and LUTs Each configurable logic block (CLB) in Spartan-6 FPGAs consists of two slices, arranged side-by-side.
1/8/ L20 Project Step 8 - Data Path Copyright Joanne DeGroat, ECE, OSU1 State Machine Design with an HDL A methodology that works for documenting.
Introduction to FPGA Created & Presented By Ali Masoudi For Advanced Digital Communication Lab (ADC-Lab) At Isfahan University Of technology (IUT) Department.
© 2009 Xilinx, Inc. All Rights Reserved Part 2 Virtex-6 and Spartan-6 HDL Coding Techniques.
© 2003 Xilinx, Inc. All Rights Reserved Synchronous Design Techniques.
© 2003 Xilinx, Inc. All Rights Reserved Global Timing Constraints FPGA Design Flow Workshop.
CPE 626 Advanced VLSI Design Lecture 6: VHDL Synthesis Aleksandar Milenkovic
Introduction to FPGA Tools
Tools - LogiBLOX - Chapter 5 slide 1 FPGA Tools Course The LogiBLOX GUI and the Core Generator LogiBLOX L BX.
This material exempt per Department of Commerce license exception TSU Synchronous Design Techniques.
ESS | FPGA for Dummies | | Maurizio Donna FPGA for Dummies Basic FPGA architecture.
Introduction to ASIC flow and Verilog HDL
FPGA and ASIC Technology Comparison - 1 © 2009 Xilinx, Inc. All Rights Reserved ASIC to FPGA Coding Conversion, Part 2.
Introduction to Field Programmable Gate Arrays Lecture 1/3 CERN Accelerator School on Digital Signal Processing Sigtuna, Sweden, 31 May – 9 June 2007 Javier.
ACCESS IC LAB Graduate Institute of Electronics Engineering, NTU 99-1 Under-Graduate Project Design of Datapath Controllers Speaker: Shao-Wei Feng Adviser:
Introduction to Field Programmable Gate Arrays (FPGAs) EDL Spring 2016 Johns Hopkins University Electrical and Computer Engineering March 2, 2016.
1 Modeling Synchronous Logic Circuits Debdeep Mukhopadhyay Associate Professor Dept of Computer Science and Engineering NYU Shanghai and IIT Kharagpur.
Sequential Logic Design
Class Exercise 1B.
ECE 448 Lecture 6 Finite State Machines State Diagrams, State Tables, Algorithmic State Machine (ASM) Charts, and VHDL Code.
EMT 351/4 DIGITAL IC DESIGN Week # Synthesis of Sequential Logic 10.
Introduction Introduction to VHDL Entities Signals Data & Scalar Types
Part IV: VHDL CODING.
Lecture 15: Synthesis of Memories in FPGA
ASIC 120: Digital Systems and Standard-Cell ASIC Design
Topics HDL coding for synthesis. Verilog. VHDL..
XC4000E Series Xilinx XC4000 Series Architecture 8/98
Basic Logic Gates and Truth Tables
SYNTHESIS OF SEQUENTIAL LOGIC
FPGA Tools Course Answers
Basic Adders and Counters Implementation of Adders
State Machine Design with an HDL
Win with HDL Slide 4 System Level Design
THE ECE 554 XILINX DESIGN PROCESS
FPGA Tools Course Timing Analyzer
ECE 448 Lecture 6 Finite State Machines State Diagrams, State Tables, Algorithmic State Machine (ASM) Charts, and VHDL code ECE 448 – FPGA and ASIC Design.
ECE 448 Lecture 6 Finite State Machines State Diagrams vs. Algorithmic State Machine (ASM) Charts.
Sequntial-Circuit Building Blocks
THE ECE 554 XILINX DESIGN PROCESS
Presentation transcript:

Spartan-3 FPGA HDL Coding Techniques

Curriculum Path ASIC Design FPGA and ASIC Technology Comparison Intro to VHDL or Intro to Verilog 3 days FPGA and ASIC Technology Comparison Curriculum Path FPGA vs. ASIC Design Flow ASIC to FPGA Coding Conversion Virtex-5 FPGA Coding Techniques Spartan-3 FPGA Coding Techniques Don’t forget to listen to these FREE RELs… FPGA and ASIC Technology Comparison, Part 2 FPGA vs. ASIC Design Flow ASIC to FPGA Coding Conversion, Part 1 and 2 Virtex-5 FPGA Coding Techniques, Part 1 and 2 Spartan-3 FPGA Coding Techniques, Part 1 and 2 Fundamentals is a very essential course if you are new to FPGA design. I recommend that all customers take this course every 3-5 years, since the tools change every year. for ASIC Design Fundamentals of FPGA Design 1 day Designing for Performance 2 days Advanced FPGA Implementation 2 days

Welcome This training will help you build efficient Spartan®-3 FPGA designs that have an efficient size and run at high speed We will show you how to avoid some of the most common design mistakes This content is essential if you have never coded a design for any 4-input LUT architecture or are converting an ASIC design

Objectives After completing this module, you will be able to: Optimize ASIC code for implementation in a Spartan-3 FPGA Build a checklist of tips for optimizing your code for the Spartan-3 FPGA

Introduction The design practices described in this module will make your design use fewer resources, run faster, and save you money Can easily reduce the size of your design by 20% or MORE A 20% reduction in size means that your design can fit into a smaller device and have a faster implementation time With the reduction in size it is likely that your design will run at a higher system frequency and potentially save you from purchasing a faster speed grade device Yes! You are free to run small, fast and at low cost! The expert that developed this material described this content as “things that are essential to designers, both experienced and inexperienced”.

Spartan-3 FPGA Slice Carry Logic Multiplexers Function Generators or SET I2 CE Function Generators or Look-Up Tables (LUTs) O D Q I1 RST I0 Flip-Flops I3 SET I2 CE O D Q I1 RST I0

Look-Up Tables I3 I1 I2 I0 O INIT=A8E6 A B sel1 sel0 FNT 00 11 10 01 LUT4 My Logic Selector It does not matter how simple or complex a function is, it is only limited by the inputs I3 I2 I1 I0 O INIT=A8E6 sel1 sel0 FNT A B 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1 0 1 0 0 0 1 0 1 0 1 1 0 0 1 1 1 1 0 0 0 1 0 0 1 1 0 1 0 1 0 1 1 1 1 0 0 1 1 0 1 1 1 1 0 1 1 1 1 1 6 Tip Try to use all four inputs to a LUT Maintain a mental picture of the number of inputs to a function when writing HDL code. The more inputs, the more LUTs E 8 A

Logic Levels and Delay The combination of the interconnect and the LUT forms a logic level 7.4 ns of routing and 1.5 ns of logic LUT4 1.4 I3 LUT4 LUT2 1.5 I2 O I3 I1 1.6 I1 1.9 I2 O O I0 I0 I1 0.5 0.5 I0 0.5 LUT3 1.8 I2 2.1 I1 O Tip The coding style and the synthesis tool define the number of logic levels Lower speed grades have a lower cost Timing constraints can only influence interconnect delays Placing logic closer together is the best way to reduce net delays I0 0.5

Tips Try to maximize the number of inputs to each LUT so that you can obtain the most logic out of the FPGA Instantiate the appropriate LUT primitive, if necessary Refer to the Xilinx Unified Libraries Guide for primitive details A logic level is one LUT plus one net delay Synthesis tools and your coding style will determine how many logic levels there are in every path of your design

Dedicated Multiplexer Elements 3 LUTs = 1½ Slices 2 Logic Levels D2 4:1 Multiplexer 2 LUTs = 1 Slice 1 Logic Level D1 D0 Slice S0 MUXF5 S1 D3 D2 Each slice of the Spartan-3 FPGA provides a dedicated multiplexer called the MUXF5 Saves LUTs and removes a level of logic to increase performance D1 D0 S0 S1

Why Is It Called MUXF5? Slice The F5, when coupled with both LUTs, is able to implement any function of FIVE inputs I3 I2 I1 I0 Tip: You will have to use a case statement to infer MUXF5 Slice I H I4 G The MUXF5 also enables the implementation of many functions, up to nine inputs. Looking for cases where the MUXF5 can be used may lead to lower cost and higher performance F E D C B Tip: You will have to instantiate primitives to build unique functions A

Building Larger Multiplexers Using the F6, F7, and F8 Will need a case statement F8 F6 F7 F6 4:1 Multiplexer 1 Slice (MUXF5) D19 D23 D27 D31 D18 F5 D22 F5 D26 F5 D30 F5 8:1 Multiplexer 2 Slices (MUXF6) D17 D21 D25 D29 D16 D20 D24 D28 You will need a CASE statement to infer these muxes. S4 16:1 Multiplexer 4 Slices (MUXF7) Fx F6 F7 F6 D3 D7 D11 D15 D2 F5 D6 F5 D10 F5 D14 F5 32:1 Multiplexer 8 Slices (MUXF8) D1 D5 D9 D13 D0 D4 D8 D12 S3 S2 S1 S0

Exercise: Dedicated Multiplexers Determine the number of LUTS and logic levels for the following multiplexers Determine the number of LUTS used if the F5, F6, and F7 multiplexers were not available 12 3 8 inputs 5 inputs

Answer 8:1 Mux = 2 Slices … x 12 bits = 24 Slices F5 F5 12 Sel2 3 Sel1 Sel0 8:1 Mux = 2 Slices … x 12 bits = 24 Slices One level of F5 and F6 delay is negligible compared to a LUT and/or net delay

Answer 5:1 Mux = 2 Slices × 12 bits 24 Slices 1 level of logic Vcc 3 Sel2 Sel1 Sel0 Wire delay 5:1 Mux = 1½ Slices 18 Slices × 12 bits 2 levels of logic F5 Sel2 Sel1 Sel0

Tips The F5, F6, F7, and F8 multiplexers build large, fast multiplexers These resources are fast in logic and routing To infer these resources, you will need to use a case statement in your HDL code Verify with your schematic viewer whether they were inferred correctly Inference of unique functions will probably require instantiating the appropriate multiplexer primitives Refer to the Xilinx Unified Libraries Guide for primitive details To break a large multiplexer down into smaller sections for pipelining, be sure to break into 4:1 and 8:1 multiplexers

Enhanced Register Virtex® FPGA-based registers provide clock enables, set, and resets ports directly on the register Sets and resets can be programmed as synchronous or asynchronous and must match All three can be used on any register To directly use the pins on the register, the priority must be: Reset, Set, CE By having these pins directly on the register, the fan-in to the LUT is reduced Without a direct pin on the register, these functions would be implemented through the LUT before the register (that is, you would lose one LUT input)

Enhanced Register Example process (clk) begin if rising_edge(clk) then if reset = ‘1’ then -- synchronous reset data <= (others => ‘0’); elsif set = '1' then -- synchronous set data <= (others => '1'); elsif ce = '1' then -- clock enable data <= data_in; end if; end process; always @ (posedge clk) begin if (reset) // synchronous reset data <= 0; else if (set) // synchronous set data <= 16'hFFFF; else if (ce) // clock enable data <= data_in; end // always @ (posedge clk or posedge reset)

Synchronous Design Replace gated clock circuits with a CE circuit

Clock Enable Circuit always @ (posedge clk) begin count <= count + 1; -- infers a decoded clock enable if (count == 4'b1110) q <= d; end // always @ (posedge clk) process (clk) begin -- process if rising_edge(clk) then count <= count + 1; // Infers a decoded clock enable if count = "1110" then q <= d; end if; end process; Q gets D defines clock enable behavior, so decode output is assigned to the CE port

Synchronous Set and Reset For local reset circuits, use a synchronous set or reset Similar to the gated clock example, an asynchronous signal that provides a set or reset can glitch—propagating erroneous data Synchronous Reset Xilinx uses the following notations with the Xilinx Unified Library registers: R - Synchronous Reset C - Asynchronous Reset (Clear) S - Synchronous Set P - Asynchronous Set (Preset)

Synchronous Reset Example No resets in the sensitivity list; creates a synchronous reset process (clk) begin if rising_edge(clk) then count <= count + 1; if count = "1110" then -- synchronous reset q <= (others => '0'); else q <= d; end if; end process; always @ (posedge clk) begin count <= count + 1; if (count == 4'b1110) // synchronous reset q <= 0; else q <= d; end // always @ (posedge clk)

Control Priority FDRSE Synchronous reset has highest priority – Also means Q=0 following configuration D Q SET RST CE Synchronous set has second priority Clock Enable has lowest priority There is also an FDCPE which has an asynchronous reset and an asynchronous preset with a clock enable. Both primitives will default to an initial state of ‘0’, unless you assign the init attribute to ‘1’. There are also the FDCPE_1 and the FDRSE_1 primitives which have a negative edge clock. FDRSE Tip Write HDL that is designed to infer the intended register Do not mix synchronous and asynchronous controls (not supported) It is important to think about the natural priority of the flip-flops when writing HDL; otherwise, the control features may use LUT inputs

The Effect of a Global Reset The global reset will steal the reset pin, preventing local logic from using it directly. Additional gating logic may be required, increasing cost and lowering performance Steals Reset Ports Determines FF Mode The global reset will determine the mode of the flip-flops as synchronous or asynchronous (typically asynchronous due to coding styles) PRE CE D Q CLR Local Reset net PRE Don’t forget that both register primitives will default to an initial state of ‘0’, so using a global reset is not necessary for simulation. Refer to the TechExclusive article on this at support.xilinx.com. CE D Q CLR Tip Only include a global reset when it is critical to the operation Global resets use a great deal of routing and have considerable skew Make local resets synchronous Do not include a global reset just to fix a simulation issue Global Reset

Flip-Flop Controls This design has an enable, asynchronous clear, and a synchronous set data_in0 D Q reg_data0 CLR Precedence of set prevents clock enable data_in0 D Q reg_data0 What would happen if the asynchronous clear was synchronous? A: both the enable and the clear would be connected to the ports on the register. CLR Design logic will need to use other look-up tables, and the cost will double data_in0 D Q reg_data0 Synchronous set gets mapped to a LUT input CLR Asynchronous clear prevents synchronous set set_in enable_in reset_in

Initialize Registers Initialize all registers in VHDL / Verilog code This should be done whether using a reset or not Perform RTL simulation of the design If it functions during simulation, it should function on the FPGA VHDL: signal my_regsiter : std_logic_vector (7 downto 0) := (others <= ‘0’); Verilog: reg [7:0] my_register = 8’h00; This is an example of initialization. Synthesis tools do react on this code. Most people assume that this unsynthesizable code, but it will affect the init value of the register since that is tied to the registers primitive . This code will reset to a 0 when the GSR is released after configuration. That way you can have a known value in your chip. This will also simulate to a 0 for start up during a functional simulation. If you do not do this and you have a counter, you will have Xs counting +1 and all you have is an x-counter. But if you use this with a counter or anything else, it will start with a known value. This is mandatory if you design without a reset.

Why No Resets at All? Routing can be considered one of the most valuable resources Resets compete for the same resources as the rest of the active signals of the design Including timing-critical paths More available routing gives the tools a better chance to meet your timing objectives You can see how much fanout a typical reset can have. This is probably the biggest reason to remove resets from your design. Can you migrate a reset to the global routing resources? Yes. Instantiate the STARTUP component from the Xilinx Unified Library, add an IBUFG and connect it to the GSR input to the block, and assign the pin to a dedicated clock pin. Just beware of cryptic messages by the ISE tools. The designer must also be aware that the tools will not automatically move resets to dedicated clock pins, and that they have to control this. Note that in the past the implementation tools would infer the GSR signal for true global set/resets automatically. The implementation tools do not do this any more.

Why No Resets at All? Synthesis can infer SRL-based shift registers But only if no resets are used (otherwise flip-flops are wasted) Or, the synthesis tool can emulate the reset (not what you want) The SRL is also useful for synchronous FIFOs, non-binary counters, terminal count logic, pattern generators, and reconfigurable LUTs There is no reset functionality built into the Shift-register LUT. If you have a reset on the shift register you code, the synthesis tools are left with one or two choices. They either do not implement the SRL, meaning they will use a bunch of registers (not what you want), or they try to emulate the reset. Emulating the reset means that they add more logic and it becomes even slower than it should be. In one customer design, there was a global set/reset through everything and the known shift registers could not be connected. They in fact were inferring 100 SRLs and were doing a good job making good use of the dedicated hardware. By removing the global set/rest and obtaining 150 SRLs, the customer obtained even more uses of the SRL, than was recognized. Can you infer an asynchronous reset with the SRL? Yes, but Synplify is the only tool to do that, and it adds logic to emulate the reset which eliminates the benefit of using the SRL. Synplify can also add the glue logic to emulate a synchronous or asynchronous reset when using the DSP48. But likewise, it kills the benefit of saving registers when using the SRL. Can XST infer an asynchronous reset with the SRL? No.

Why No Resets at All? Designs without resets have fewer timing paths By an average of 18 percent fewer timing paths This is important when you consider that synchronous reset paths are automatically timed (this is not a bad thing) Asynchronous reset paths are NOT timed Results in less run time Improved performance Less memory necessary during PAR

Tips Try to manage the number of control signals in your design All three control ports can be used on any register To directly use the pins on the register, the priority must be: Reset, Set, CE Do not gate your clock It will not operate reliably Map this functionality to the CE port Do not build with an asynchronous reset It will not operate reliably if it is local Global asynchronous reset might work, but it might waste LUTs and create a long net delay Do not mix asynchronous and synchronous control signals on the same register Do not use a global reset to make simulation easier Initialize your registers in HDL

SRL16E The shift register is the most powerful mode supported by a LUT Tip: The dedicated flip-flop has a faster clock-to-output time than the SRL16E clock-to-output time via the multiplexer SRL16E LUT4 D I3 CE I2 A3 O I1 A2 The SRL does not support the use of a set or reset signal. It is also ONLY serial in/serial out. No parallel reads are allowed. Q D Q I0 A1 A0 INIT=1234 Tip: The initial contents of the shift register can still be specified or zero will be the default value INIT=1234 D CE Q A[3:0] 0000 1111

Low-Cost Delay For a fixed delay, the address inputs are hardwired to select the appropriate tapping point of the delay line. The CE is also connected to VCC This example provides 8 cycles of delay Vcc SRL16E CE DIN D CE CE CE CE CE CE CE CE CE D D Q D Q D Q D Q D Q D Q D Q D Q A3 A2 A[3:0] Q D Q 0000 0100 A1 ‘0100’ A0 Q CLK GND Tip: There is no set or reset support with the SRL. If you try to infer this, you will get a register implementation

Automatic Delay Ooops! Do not code for a reset with the SRL reg [15:0] data_in ; reg [15:0] delay_1_register ; reg [15:0] delay_2_register ; reg [15:0] delay_3_register ; reg [15:0] delay_4_register ; reg [15:0] data_out ; Ooops! Do not code for a reset with the SRL always @(posedge clk) begin if (reset) delay_1_register <=16'h 0000; delay_2_register <=16'h 0000; delay_3_register <=16'h 0000; delay_4_register <=16'h 0000; data_out <=16'h 0000; end else delay_1_register <= data_in; delay_2_register <= delay_1_register; delay_3_register <= delay_2_register; delay_4_register <= delay_3_register; data_out <= delay_4_register;

Automatic Delay The Reset prevents the SRL16 from being used, so it uses 80 flip-flops Removing the Reset reduces the size to 16 LUTs XST tries to help you... INFO:Xst:741 - A 5-bit shift register was found for signal <data_out<15>> and currently occupies five logic cells (three slices). Removing the set/reset logic would take advantage of SRL16 (and derived) primitives and reduce this to one logic cell (one slice). Evaluate if the set/reset can be removed for this simple shift register. The majority of simple pipeline structures do not need to be set/reset operationally.

Tips The SRL is an effective means of delaying a datapath 1 LUT = 16 flip-flops The SRL supports CE and initialization of its contents The SRL does NOT support set or reset functionality If you code for set or reset, you will get a register implementation This is a waste of registers The SRL is serial in/serial out If you code for a parallel read, you will get a register implementation Experiment with your synthesis tool to determine if it will give you a similar warning to what XST gives you

Summary To infer the dedicated multiplexer resources you will need to use a case statement in your HDL code Verify with your schematic viewer whether they were inferred correctly If you plan to break a large multiplexer down into smaller sections in order to pipeline, be sure to break into 4:1 and 8:1 multiplexers All three control ports can be used on any register To directly use the pins on the register, the priority must be: Reset, Set, CE Do not mix asynchronous and synchronous control signals on the same register

Summary The SRL does not support set or reset functionality If you code for set or reset, you will get a register implementation This is a waste of registers The SRL is serial in/serial out If you code for a parallel read, you will get a register implementation Avoid global resets If you cannot avoid global asynchronous resets, be aware that also using local synchronous resets will end up using more LUTs Local synchronous reset creates a high fanout net (that might create timing problems) when there is also a global asynchronous reset

Where Can I Learn More? Xilinx Online Documents support.xilinx.com To search for an Application Note or White Paper, click the Documentation tab and enter the document number (WP231 or XAPP215) in the search window White papers for reference WP275 – Get your Priorities Right – Make your Design Up to 50% Smaller WP272 – Get Smart About Reset: Think Local, Not Global Xilinx Unified Library Guide From the ISE® Design Suite, select Help  Software Manuals Additional Online Training www.xilinx.com/training Note that when looking for a particular white paper or application note, the letters WP or XAPP must be entered with no space before the numbers (silly). White papers contain concepts or ideas that demonstrate Xilinx product capabilities. Application notes illustrate how to use a Xilinx product in a specialized way.

Spartan-3 FPGA HDL Coding Techniques

Curriculum Path ASIC Design FPGA and ASIC Technology Comparison Intro to VHDL or Intro to Verilog 3 days FPGA and ASIC Technology Comparison Curriculum Path FPGA vs. ASIC Design Flow ASIC to FPGA Coding Conversion Virtex-5 FPGA Coding Techniques Spartan-3 FPGA Coding Techniques Don’t forget to listen to these FREE RELs… FPGA and ASIC Technology Comparison, Part 2 FPGA vs. ASIC Design Flow ASIC to FPGA Coding Conversion, Part 1 and 2 Virtex-5 FPGA Coding Techniques, Part 1 and 2 Spartan-3 FPGA Coding Techniques, Part 1 and 2 Fundamentals is a very essential course if you are new to FPGA design. I recommend that all customers take this course every 3-5 years, since the tools change every year. for ASIC Design Fundamentals of FPGA Design 1 day Designing for Performance 2 days Advanced FPGA Implementation 2 days

Welcome This training will help you build efficient Spartan®-3 FPGA designs that have an efficient size and run at high speed We will show you how to avoid some of the most common design mistakes This content is essential if you have never coded a design for any 4-input LUT architecture or are converting an ASIC design

Objectives After completing this module, you will be able to: Optimize ASIC code for implementation in a Spartan-3 FPGA Build a checklist of tips for optimizing your code for the Spartan-3 FPGA

Inference of Arithmetic Logic Arithmetic logic is implemented by using the dedicated carry chain For access to the dedicated carry chain, the HDL must use arithmetic operators +, –, *, /, >, <, = That is, you will NOT infer the use of the carry chain by explicitly building the arithmetic logic For example, Half_Sum <= A xor B will not infer the carry chain Half_Sum <= A + B will infer the carry chain Like flip-flops, carry logic has common controls, which means that a carry chain does not begin or end half way through a slice . For best density, try to use carry logic in pairs or even number of bits. Tip: For best density, try to use carry logic in pairs or in an even number of bits

Counters To increase performance, try different types of counters Binary: Slow, familiar count sequence; fewest amount of registers One-hot: Fast; uses the maximum number of registers Johnson: Ring counter; fast and uses fewer registers than one-hot encoding LFSR: Fast; pseudo-random sequence and uses few registers

Comparator Logic Comparator operators should be replaced with a simple +/– operator >, < operators sometimes infer slower logic Synplify, Exemplar, and XST are unaffected AND-OR logic can also provide a faster implementation for decoding logic May require significantly more work to code, however Use case statements for building decode logic, but consider implementing these functions as a subtraction (next slide)

Comparator Logic Example reg [7:0] sub; always @ (posedge clk) begin in0_q <= in0; in1_q <= in1; //** instead of this... ** //if (in1_q[6:0] > in0_q[6:0]) //** use this -> ** sub = {1'b0,in0_q[6:0]} - {1'b0,in1_q[6:0]}; if (sub[7]) lrgst <= in1_q; else lrgst <= in0_q; end // always @ (posedge clk) process (clk) variable sub : std_logic_vector (7 downto 0); begin if rising_edge(clk) then in0_r_q <= in0_r; in1_r_q <= in1_r; --** instead of this. . .** --if (in1_r_q > in0_r_q) then --** use this -> ** sub := ('0' & in0_r_q) - ('0' & in1_r_q); if sub(7) = '1' then lrgst_o <= in1_r_q; else lrgst_o <= in0_r_q; end if; end process; Tip: For best density, try to do comparisons with carry logic. Use a subtraction rather than an if-then implementation

Decode Logic Example Instead of this process (clk) begin if rising_edge(clk) then cs0 <= '0'; -- default value cs1 <= '0'; -- default value cs2 <= '0'; -- default value if addr <= "0011" then cs0 <= '1'; elsif (addr > "0011" and addr <= "0111”) then cs1 <= '1'; elsif addr > "0111" then cs2 <= '1'; end if; end process; always @ (posedge clk) begin cs0 <= 0; cs1 <= 0; cs2 <= 0; if (addr <= 4'b0011) cs0 <= 1; else if (addr > 4'b0011 & addr <= 4'b0111) cs1 <= 1; else if (addr > 4'b0111) cs2 <= 1; end // always @ (posedge clk)

Best Decode Solution Use this process (clk) begin if rising_edge(clk) then cs0 <= '0'; -- default value cs1 <= '0'; -- default value cs2 <= '0'; -- default value case (conv_integer(addr)) is when 0 to 3 => -- x“0000” to x”0011” cs0 <= ‘1’; when 4 to 7 => -- x”0100” to x”0111” cs1 <= ‘1’; when 8 to 15 => -- x”1000” to x”1111” cs2 <= ‘1’; when others => null; end case; end if; end process; Use this always @ (posedge clk) begin cs0 <= 0; cs1 <= 0; cs2 <= 0; casex (addr) 4’b00xx : cs0 <= 1; 4’b01xx : cs1 <= 1; 4’b1xxx : cs2 <= 1; endcase end // always @ (posedge clk)

Tips Use the proper arithmetic operator to infer carry logic Consider other counter implementations This will take extra time to construct, but might save some resources Consider using and/or logic (carry logic) for decode logic This will take extra time to construct, but will run faster than a LUT implementation Use case statements for decode logic This simplifies the decoding, save LUTs, and improves speed For best density, try to use carry logic in pairs or an even number of bits

I/O Registers and the DLL/DCM Tip: Guide your synthesis tool to use the IOB flip-flops. Check your results with your schematic viewer Simple Internal Timing IOB registers have a fixed predictable setup time IOB registers have a fast and predictable clock-to-output time Setup time will be longer and net delay needs to be controlled (no IOB register used) Clock to output is slow and net delay needs to be controlled (no IOB register used) DLL (DCM) compensates for the clock propagation delay, leading to fast I/O operations clk IOB registers are there. You pay for them, so use them!

Too Many Clocks? Flip-flops share the same clock signal Use only one clock and use the same active edge (rising edge) of that clock whenever possible Flip-flops share the same enable signal Clock enable is optional on the second flip-flop Flip-flops share the same reset and set signals Set and Reset are optional on the second flip-flop Slice SET CE D Q RST Optional Inverters SET CE D Q RST Tip: Reduce the number of clocks in your design

Apply Your Knowledge What would be the effect of implementing a ripple counter? How many slices would this require? D Q D Q D Q D Q Q0 Q1 Q2 Q3 clk

Answers What would be the effect of implementing a ripple counter? Too many clocks are needed, so you would use too many global routing resources or require routing these clocks on general routing resources How many slices would this require? One for each register and none of these registers could be put in the same slice D Q D Q D Q D Q Q0 Q1 Q2 Q3 clk

Apply Your Knowledge In this design you are using a simple synchronous reset with a counter How will your counter implementation be affected if the design also includes a global asynchronous reset signal? 0 1 D Q Q2 reg [8:0] q; always @(posedge clk) begin if (reset_in) q <= 8'h00; else q <= q + 1'b1; end 0 1 D Q Q1 0 1 D Q Q0

Answer 0 1 FDC always @(posedge clk or posedge asynch_reset) begin if (asynch_reset) q <= 8'h00; else if (synch_reset) else q <= reg_data + 1'b1; end D Q Q2 Flip-flops become asynchronous 0 1 FDC D Q Q1 Synchronous reset formed by masking feedback and adding zero 0 1 FDC D Q Q0 ‘High’ fanout net introduced sync_reset Slightly bigger Performance harder to achieve Increased power consumption Additional LUT in critical path Large net (timing?) global_reset

Loadable Up/Down Counter always @(posedge clk) begin if (count_load) q <= data_in; else if (count_up) q <= q + 1'b1; else q <= q - 1'b1; end load_value1 0 1 Q1 D Q load_value0 0 1 Q0 D Q . count_up count_load Tip: You do not need to consider every detail of the LUT function in design, but you should always consider the number of inputs and special connections

Tips Take advantage of the IOB resources available to you with the Spartan-3 FPGA Register all I/O Remember that you waste what you do not use Do not build ripple counters Avoid global resets If you cannot avoid global asynchronous resets, be aware that also using local synchronous resets will end up using more LUTs (manage your control signal usage) Local synchronous reset creates a high fanout net (which might create timing problems) when there is also a global asynchronous reset Consider the number of inputs and any special connections necessary for more complex functions

Memory-to-Logic Ratio Tip: To move to a smaller device, look for ways to use memory to reduce logic requirements 120 110 100 90 Spartan-II FPGA Spartan-3E FPGA Block RAM Bits-per- Logic Slice 80 You should be very interested in ways to use Block RAM because the smaller devices have a higher percentage of Block RAM than the larger densities. Spartan-3 FPGA 70 60 50 40 30 Spartan-IIE FPGA 20 10 Logic Slices 500 5000 50000 100 1000 10000 (Log Scale) Tip: If you move to a smaller device to reduce cost, you generally move to a device with a higher RAM-to-logic ratio

Block Memory The port enable must be high for any operation to take place DIA DOA DIPA DOPA WEA ADDRA ENA 18 Kb SSRA CLKA The output provides the data from the location defined by the address Data is written on a clock edge when the write enable (WEn) is active. When enabled, note that the DOn and DOPn ports remain unchanged. The block RAM does not tri-state the output. The synchronous set/reset port will reset the output register, but an initial value can be specified so that the output can forced to any value. During a write operation, it can become the value that has just been written (be careful of potentially false timing paths caused by the relatively long delay resulting from reading data that has just been written) or the value previously stored at that location. Use the Block RAM for logical operations and never infer asynchronous memory in a design, it will use too many additional resources. DIB DOB The synchronous set/reset (SSR) will, by default, act as a reset of the output register DIPB DOPB WEB ADDRB ENB SSRB CLKB Tip: Never build an asynchronous memory; it will waste resources Tip: Useful for logical operations

Block Memory Aspect Ratios Parity Bits – For each complete byte (8 bits) of data width, there is an additional bit nominally provided for the storage of parity information. The block cannot calculate or perform parity checking Tip: Do not just think of these bits for parity applications; use them for any data or an additional look-up table output DIA DOA DIPA DOPA WEA ADDRA ENA Note that the 36-bit output configuration prevents access to the dedicated multipliers located next to each block RAM and, therefore, should be used with care. SSRA CLKA DIB DOB DIPB DOPB Data Width Parity Bits Memory Locations Address Width WEB ADDRB 1 2 4 8 16 32 - 1 2 4 16384 8192 4096 2048 1024 512 14 13 12 11 10 9 ENB SSRB CLKB

Spartan-3 FPGA Block RAM The name implies a 16-Kb block RAM, which is determined by the data ports only The first number specifies the width of the ‘A’ port and the second number specifies the width of the ‘B’ port. In each case, the width is the sum of the data bits and the parity bits RAMB16_S18_S18 dp_ram_1024_x_18 ( .DOA (data_A_out), .DOPA (parity_A_out), .DOB (data_B_out), .DOPB (parity_B_out), .ADDRA (address_A), .ADDRB (address_B), .CLKA (clk_A), .CLKB (clk_B), .DIA (data_A_in), .DIPA (parity_A_in), .DIB (data_B_in), .DIPB (parity_B_in), .ENA (1'b1), .ENB (1'b1), .SSRA (1'b0), .SSRB (1'b0), .WEA (we_A), .WEB (we_B) ); .DOB (output_data_B[15:0]), .DOPB(output_data_B[17:16]),

ROM and Initial Values ROM or look-up table By initializing the contents of the block RAM and then ensuring that it is never written, it becomes a ROM Because a LUT is really a 16×1 ROM, you can also view the block RAM as a giant look-up table with eight or more inputs and as many as 16 outputs DO ADDR 10 18 WE Tip: Use block RAM to replace LUTs when designs have a high logic-to-memory ratio, especially when using smaller devices with a higher RAM-to-logic ratio

ROM and Initial Values - VHDL Initial Values – These can be defined directly in VHDL when instantiating the component. However, as these four initialization statements (out of the 72 required) show, it may not be particularly user friendly to design this way INIT_00 => X"022001BA01BAE0190000E01800FF4004810150094117F01000000100C0400080", INIT_01 => X"01BA0274107002741080027410900113072308D4096A0A2C0B0501BA019901EB", INIT_02 => X"10904704480549061000C808C908CA04CB04C002001008D209E90A2E0BFB01BA", INIT_03 => X"01BA0274601801BD02746019013C01BA020801BA01BA02741070027410800274", The CORE Generator™ software allows memory contents to be defined in formats that are more natural memory_initialization_vector= 00080, 2C040, 00100, 00000, 2F010, 14117, 35009, 18101, 34004, 000FF, 2E018, 00000, 2E019, 301BA, 301BA, 30220, 301EB, 30199, 301BA, 00B05, 00A2C, 0096A, 008D4, 00723, 30113, 01090, 30274, 01080, 30274, 01070, 30274, 301BA, 301BA, Reset – If you do not specify any initial values, the tools will default to initial contents of zero Tip: Configuration is the ultimate ‘global reset’ as it even resets the contents of block RAM

Choose the Best Aspect Ratio Single-Port RAM 4096 locations, 16-bit wide Din Dout 16 WE Addr 12

Choosing the WRONG aspect ratio! 2 (MS-Bits) 12 Choosing the WRONG aspect ratio! Single-Port RAM 4096 locations, 16-bit wide Address 10 (LS-Bits) 1024×16 16 16 DI DO This is the expensive way to do it! WE ADDR CLK 1024×16 16 16 DI DO WE ADDR 16 16 CLK WE Data 1024×16 16 16 16 slices and additional delay DI DO WE ADDR Two slices and additional delay CLK 1024×16 16 16 DI DO WE ADDR CLK

Choosing the RIGHT aspect ratio! Single-port RAM 4096 locations, 16-bit wide (Could be dual port) 4096×4 4 4 RAMB4_s4 DI DO WE Choosing the RIGHT aspect ratio! ADDR CLK 4096×4 4 4 DI DO WE ADDR 16 16 CLK 4096×4 Single-Port RAM 4096 locations, 16-bit wide 4 4 12 DI DO WE ADDR CLK Tip: Choosing the right aspect ratio saves resources and improves speed 4096×4 4 4 DI DO WE Tip: The CORE Generator software implements this nicely ADDR CLK

Exercise: Replacing Logic with Block RAM A design is just too big to fit your preferred device and the larger device is too expensive. However, five of the 12 block RAMs are unused in this logic-intensive design and you wonder if they can be used in some way. Then you notice that part of the design requires a set of 10 counters with the following specifications: Each counter is a relatively simple 9-bit up/down counter used to indicate an angle in degrees. This is slightly complicated by the fact that the counter has an enable, reset, and saturation logic such that it will not count above 359 degrees or below 0 degree. en 9 up/down Q reset 359 Q value en up/down reset Can you outline how the counters can be implemented in the spare block RAMs and make the project meet budgets? Hint – A counter is just a special type of state machine

LUT required if reset to have priority over enable Hex Table WEA Addr Data 11 up / down ADDRA 000 000 Hold zero DOPA 9 001 000 9 8 Q Count Down DOA en ENA 167 166 reset 168 SSRA LUT required if reset to have priority over enable CLKA XXX Aspect Ratio 2048×9 1FF WEB 200 001 SSR Initial value = 000 Count Up 11 up / down ADDRB 366 167 DOPB 9 9 Q 367 167 Hold 359 8 DOB 368 en ENB reset XXX SSRB 7FF CLKB

Tips To get the most out of the FPGA, convert CLB components into block RAM Often helpful for designs that are running out of CLB resources Block RAM is synchronous The parity bit can be used as an extra output Choosing an aspect ratio is important for saving resources and good speed Use the CORE Generator software for building large memories (best aspect ratio) Building creative components with the block RAM resources takes extra effort, but it always saves a lot of CLB resources FSM, counters, and decoders are just some typical applications Constructing these requires creating an initialization table Inferring block RAM has extensive limitations Most designers use the CORE Generator software

Summary Take advantage of the IOB resources available to you Register all I/O Remember that you waste what you do not use Minimize the number of clocks in your design We discussed why to avoid ripple counters Use case statements to build decode logic This simplifies the decoding, save LUTs, and improves speed Avoid global resets If you cannot avoid global asynchronous resets, be aware that also using local synchronous resets will end up using more LUTs Local synchronous reset creates a high fanout net (that might create timing problems) when there is also a global asynchronous reset

Summary Take advantage of carry logic when building decoders Requires you to build with arithmetic rather than logic, but it will run faster Be aware of the number of inputs to your LUTs when you build with a lot of control signals We used the loadable up/down counter as a good example Use the best aspect ratio when building block RAM from scratch The CORE Generator software does anticipates this automatically To get the most out of the FPGA, convert CLB components into block RAM Use block RAM for FSMs, counters, decode logic, and combinatorial functions, for example

Where Can I Learn More? Xilinx Online Documents support.xilinx.com To search for an Application Note or White Paper, click the Documentation tab and enter the document number (WP231 or XAPP215) in the search window White papers for reference WP275 – Get your Priorities Right – Make your Design Up to 50% Smaller WP272 – Get Smart About Reset: Think Local, Not Global Xilinx Unified Library Guide From the ISE Design Suite, click on the Help menu select Software Manuals Block Memory Generator Data Sheet v2.7 (DS512) Start the Core Generator and start to customize the block RAM, click Help This data sheet will explain all there is to know about how to build a block RAM memory Additional Online Training www.xilinx.com/training Note that when looking for a particular white paper or application note, the letters WP or XAPP must be entered with no space before the numbers (silly). White papers contain concepts or ideas that demonstrate Xilinx product capabilities. Application notes illustrate how to use a Xilinx product in a specialized way.

Trademark Information Xilinx is disclosing this Document and Intellectual Property (hereinafter “the Design”) to you for use in the development of designs to operate on, or interface with Xilinx FPGAs. Except as stated herein, none of the Design may be copied, reproduced, distributed, republished, downloaded, displayed, posted, or transmitted in any form or by any means including, but not limited to, electronic, mechanical, photocopying, recording, or otherwise, without the prior written consent of Xilinx. Any unauthorized use of the Design may violate copyright laws, trademark laws, the laws of privacy and publicity, and communications regulations and statutes. Xilinx does not assume any liability arising out of the application or use of the Design; nor does Xilinx convey any license under its patents, copyrights, or any rights of others. You are responsible for obtaining any rights you may require for your use or implementation of the Design. Xilinx reserves the right to make changes, at any time, to the Design as deemed desirable in the sole discretion of Xilinx. Xilinx assumes no obligation to correct any errors contained herein or to advise you of any correction if such be made. Xilinx will not assume any liability for the accuracy or correctness of any engineering or technical support or assistance provided to you in connection with the Design. THE DESIGN IS PROVIDED “AS IS" WITH ALL FAULTS, AND THE ENTIRE RISK AS TO ITS FUNCTION AND IMPLEMENTATION IS WITH YOU. YOU ACKNOWLEDGE AND AGREE THAT YOU HAVE NOT RELIED ON ANY ORAL OR WRITTEN INFORMATION OR ADVICE, WHETHER GIVEN BY XILINX, OR ITS AGENTS OR EMPLOYEES. XILINX MAKES NO OTHER WARRANTIES, WHETHER EXPRESS, IMPLIED, OR STATUTORY, REGARDING THE DESIGN, INCLUDING ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE, AND NONINFRINGEMENT OF THIRD-PARTY RIGHTS. IN NO EVENT WILL XILINX BE LIABLE FOR ANY CONSEQUENTIAL, INDIRECT, EXEMPLARY, SPECIAL, OR INCIDENTAL DAMAGES, INCLUDING ANY LOST DATA AND LOST PROFITS, ARISING FROM OR RELATING TO YOUR USE OF THE DESIGN, EVEN IF YOU HAVE BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. THE TOTAL CUMULATIVE LIABILITY OF XILINX IN CONNECTION WITH YOUR USE OF THE DESIGN, WHETHER IN CONTRACT OR TORT OR OTHERWISE, WILL IN NO EVENT EXCEED THE AMOUNT OF FEES PAID BY YOU TO XILINX HEREUNDER FOR USE OF THE DESIGN. YOU ACKNOWLEDGE THAT THE FEES, IF ANY, REFLECT THE ALLOCATION OF RISK SET FORTH IN THIS AGREEMENT AND THAT XILINX WOULD NOT MAKE AVAILABLE THE DESIGN TO YOU WITHOUT THESE LIMITATIONS OF LIABILITY. The Design is not designed or intended for use in the development of on-line control equipment in hazardous environments requiring fail-safe controls, such as in the operation of nuclear facilities, aircraft navigation or communications systems, air traffic control, life support, or weapons systems (“High-Risk Applications”). Xilinx specifically disclaims any express or implied warranties of fitness for such High-Risk Applications. You represent that use of the Design in such High-Risk Applications is fully at your risk. © 2012 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.