Virtex-5 FPGA HDL Coding Techniques

1 Virtex-5 FPGA HDL Coding Techniques
Part 1

2 Curriculum Path for ASIC Design
Courses: Intro to VHDL or Intro to Verilog (3 days); Fundamentals of FPGA Design (1 day); Designing for Performance (2 days); Advanced FPGA Implementation (2 days). Modules: FPGA and ASIC Technology Comparison; FPGA vs. ASIC Design Flow; ASIC to FPGA Coding Conversion; Virtex-5 FPGA Coding Techniques; Spartan-3 FPGA Coding Techniques. Don’t forget to listen to these FREE RELs: FPGA and ASIC Technology Comparison, Part 2; FPGA vs. ASIC Design Flow; ASIC to FPGA Coding Conversion, Part 1 and 2; Virtex-5 FPGA Coding Techniques, Part 1 and 2; Spartan-3 FPGA Coding Techniques, Part 1 and 2. Fundamentals is an essential course if you are new to FPGA design. I recommend that all customers take this course every 3-5 years, since the tools change every year.

3 Welcome This training will help you build efficient Virtex®-5 FPGA designs that have an efficient size and run at high speed We will show you how to avoid some of the most common design mistakes This content is essential if you have never coded a design for the Virtex-5 FPGA or are converting an ASIC design

4 Objectives After completing this module, you will be able to:
Optimize ASIC code for implementation in a Virtex-5 FPGA Build a checklist of tips for optimizing your code for the Virtex-5 FPGA

5 Introduction There is no single “perfect” way to create a design
Different synthesis and implementation options will lead to different results. One method will NOT work best in all cases. There are, however, guidelines that usually lead to improved results. The coding techniques described here are strongly recommended because they have the biggest impact on device utilization and speed. The expert who developed this material described this content as “things that generally work for me and work in the designs I've seen”. He specifically chose five designs from major customers. The designs targeted LX110 and LX220 Virtex-5 devices. He also made some code changes, then synthesized and implemented the designs with different options to gather the data. This presentation includes some of the statistics he found.

6 Tactics to Meet Timing As always, use as many of the dedicated resources as possible (SRLs, DSP48s, and block RAMs) Different tactics must be used when your device is full Timing does not matter if your design does not fit in the device The tactics that will be discussed generally work best in designs that are not full One of the most effective ways to reduce power in FPGAs is to reduce the number of resources One of the side benefits of these techniques is that they will allow you to improve performance and reduce power If you barely get a design to fit into an FPGA, then the PAR timing options are very limited. So if your design barely fits, you are going to probably need to reduce its size first. Many of the tips we are going to discuss are going to reduce the size of your design.

7 Limiting Virtex-5 FPGA Resources
Build a design that uses fewer “limiting” resources Fewer registers Many designs run out of registers before other components (especially if the design is heavily pipelined) Registers are most often the limiting resource in Virtex-5 designs Fewer LUTs The LUT6 is 40 percent more efficient than a LUT4 But this does not yield performance benefits for every design

8 Virtex-5 FPGA Registers
Why are registers sometimes a scarce resource? The Virtex-5 FPGA has ~30 percent fewer registers for a given logic or array size compared to the Virtex-4 FPGA The 4VLX80 device has 71,680 slice flip-flops versus 51,840 for a 5VLX85 device So you should NOT need to pipeline as much! Be aware of control signal limitations (this will be covered later) Lack of use (inference) of SRLs, block RAMs, and DSP slice resources Replication of registers (logic replication) Careful use of synthesis options that may increase your design size is important

9 Introduction to Control Sets
A control signal is: Clock Enable / Gate Enable; Write Enable; Set / Preset; Reset / Clear; Clock / Gate. Slice: WE, CE, SR, REV, CLK. A control set is a group of enable, set, reset, and clock signals. This includes Vcc / Gnd when they are not used. Unique control sets are the number of groups of unique control signals that your design has. Please note that Xilinx still discourages designers from using latches. The gate enable is a control signal that is only used with latches. Note that Vcc and Gnd are connected to each unused port (set, reset, CE, etc.). So if you are not using a set, reset, or clock enable, that port will be tied to Vcc or ground, depending on the port's active level. So Vcc and Gnd are also considered control signals. It is important to understand that a unique control set is a group that has a unique membership of elements. This means that if there are two registers and both use the same clock and reset, but different clock enables, then each register is part of a different control set. Tip: The implementation tools cannot group flip-flops into the same slice if they do not share the same control signals
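A minimal sketch of the control-set idea, assuming the synthesis tool maps both enables to the register CE port. The module and signal names are hypothetical and are not from the original slides. The two registers share a clock and a reset but use different clock enables, so they belong to two different control sets:

```verilog
// Illustration only: all names are assumptions, not from the slides.
module two_control_sets (
    input  wire clk,
    input  wire rst,     // shared synchronous reset, active-high
    input  wire ce_a,    // clock enable for qa
    input  wire ce_b,    // a different clock enable for qb
    input  wire da,
    input  wire db,
    output reg  qa,
    output reg  qb
);
    // Control set 1: {clk, rst, ce_a}
    always @(posedge clk) begin
        if (rst)       qa <= 1'b0;
        else if (ce_a) qa <= da;
    end

    // Control set 2: {clk, rst, ce_b}; differs only in the enable
    always @(posedge clk) begin
        if (rst)       qb <= 1'b0;
        else if (ce_b) qb <= db;
    end
endmodule
```

If both registers used the same enable, they would share one control set and could be packed into the same slice.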

10 What Creates Control Signals?
Control signals are the signals that are connected to the actual control ports on the register. Inference code: clocks and asynchronous set/resets always become control signals; they cannot be moved to the datapath. Clock enables and synchronous set/resets sometimes become control signals (this is decided by the synthesis tool); these control signals can be moved to the datapath. How will a global asynchronous reset and a local reset inferred on a single register be implemented? The asynchronous reset gets the port on the register; the synchronous reset gets a LUT input. Clock enables can be on the datapath: you can feed the register output back to a LUT input. This makes the CE part of the datapath and, at times, this is a good decision because the CE no longer uses the CE port on the register. But synthesis tools generally make the CE a control signal so that it does not burn any of your LUT inputs. Likewise, synchronous set/resets can be mapped to LUT inputs, rather than using a register's control signal input. As you will see later, having the option to switch between the two implementations is useful since this enables those registers to be part of additional control sets. Tip: Clock enables and synchronous sets and resets can be moved to the datapath
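The question on this slide (a global asynchronous reset plus a local reset on one register) can be sketched as follows. This is an illustration under assumed names, not code from the original material:

```verilog
// Illustration only: module and signal names are hypothetical.
module mixed_reset_reg (
    input  wire clk,
    input  wire global_rst,   // asynchronous, active-high
    input  wire local_rst,    // synchronous, active-high
    input  wire d,
    output reg  q
);
    always @(posedge clk or posedge global_rst) begin
        if (global_rst)
            q <= 1'b0;        // asynchronous: takes the register's reset port
        else if (local_rst)
            q <= 1'b0;        // synchronous: ends up as logic in front of D (a LUT input)
        else
            q <= d;
    end
endmodule
```

Because the register's reset port is already taken by the asynchronous reset, the local reset has to be built from LUT logic on the datapath.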

11 What Creates Control Signals?
Instantiation of primitives and cores: gate-level connection of UNISIM and core primitives dictates control signal usage. Synthesis optimization: synthesis may choose to build a control signal for logic optimization. Physical synthesis: can change control sets from original specifications; global or logic optimization may choose to build a control signal for logic optimization. Remember that when you instantiate a primitive, you get what you build. So if you instantiate a register (or a core that uses a register) and connect a control port, it will use that control signal. Physical synthesis is most commonly done with Synopsys Amplify software. The FPGA architecture also dictates what types of control signals can be used. For example, block RAM has a synchronous output register. If you try to infer an asynchronous output, your synthesis tool won't give you that output register. Tip: The cores you instantiate should share the same control signals you infer to minimize the number of control sets

12 Why Be Concerned? Four registers per slice; all share the same control signals. If the number of registers in a control set does not divide cleanly by four, some registers must go unused. This is of concern for designs that have several very low fanout control signals. A design with a large number of control sets can potentially show lower utilization of registers (but not always). Older FPGAs had two flip-flops and two LUTs per slice. So if you wasted registers you would lose one out of the two registers. With the Virtex-5 FPGA, there are four registers per slice. These four registers are tied to the same clock, same clock enable, same set, and same reset. If any one of the control signals differs, then those registers cannot be put in the same slice. Tip: Evaluate the control signals that go to very few flip-flops. These are the control signals that should be minimized. For example, Xilinx analyzed a customer design that had 5000 control sets in a large FPGA and worked well. Flip-flop utilization was high, over 90 percent. In addition, a lot of the clock enables were applied in byte-wide widths. So they were basically byte-enables. They were controlling datapaths as byte-wide structures, and in doing so the design fit beautifully. In fact, for a particular slice, those datapaths always matched up, because the tools try to pack similar types of data and similar types of structures into a slice. This design could use almost every register, because even though it had many unique clock enables, they were all x8. So a large number of control sets does not necessarily mean something is wrong. Tip: Try to build in byte-wide widths for the highest device utilization

13 What Designs Are Okay? Designs with plenty of flip-flops to spare
Designs with low flip-flop-to-LUT ratios These are generally slow or lightly pipelined designs or ASIC prototypes Designs with lots of room in a particular device Designs with a small number of control sets are preferable The key is to evaluate slices and CLBs that have wasted registers Try to build designs with common control signals (plan) Designs with datapaths divisible by four are not affected even if they have a high number of control sets Such as byte-wide enables or data control registers, for example A small number of control sets might be around 50, but this depends on the size of the device. A large number might be 500, but it will depend on how many registers are driven by each set and the size of the device.

14 Active-Low Control Signals
Problem: Active-low control signals can produce sub-optimal results Why? Control ports on Virtex-5 FPGA registers are active-high Hierarchical design methods This results in… Poorer utilization More LUTs Less dense slice packing More routing resources necessary Longer run times Prohibits hierarchical design flows More difficult timing Worse timing and power Active-low signals use more LUTs because they require inversion before they can directly drive the control port of a register. This inversion must be done with a LUT and thus takes up a LUT input. Likewise, because only a single reset can be brought to the register, if the remaining logic to be grouped does not use the same reset, CE, or set, it cannot be grouped into the same slice. If you use hierarchical design techniques (meaning you are using partitions, using keep hierarchy equals true, using cores, using multiple netlists, and using bottom-up synthesis), it can increase the number of control signals because the flip-flops do not have a programmable inverter. With hierarchical design, each signal would have a different name, and you would still have to run each signal through an inverter, which means that you are burning a LUT. Hierarchical design necessitates this, because designers want to keep hierarchy and partition their design. Tip: Use active-high signals for CEs, sets, and resets

15 Use Active-High Control Signals
The inverters cannot be combined into the same slice. This consumes more power and makes timing difficult. Remember that the synthesis tools and the implementation tools cannot merge this logic into the same slice because the designer is not allowing optimization across a hierarchical boundary. Hierarchical design methods can proliferate LUT usage on active-low control signals
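One way to follow the active-high tip is sketched below, assuming the board supplies an active-low reset (the `rst_n` name and the module are hypothetical). The inversion is done once at the top level so that every lower-level block receives an active-high signal:

```verilog
// Illustration only: names are assumptions, not from the slides.
module top_active_high (
    input  wire clk,
    input  wire rst_n,   // board-level reset, assumed active-low
    input  wire d,
    output reg  q
);
    // Invert once at the top level; lower hierarchy only sees active-high rst.
    wire rst = ~rst_n;

    always @(posedge clk) begin
        if (rst) q <= 1'b0;   // active-high synchronous reset
        else     q <= d;
    end
endmodule
```

Passing `rst` rather than `rst_n` down through the hierarchy avoids placing a separate inverter inside each preserved hierarchical block.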

16 Why Synchronous Resets?
Each DSP48E has ~250 registers; none have asynchronous reset. The DSP slice is more versatile than most realize. The XC5V50 device has ~12,000 DSP slice registers. The XC5V330 device has ~48,000 DSP slice registers. Can be used for multipliers, add/sub, MACC, counters (with programmable terminal count), comparators, shifters, multiplexers, pattern match, and many other logic functions. Many designs that run out of slices are not fully utilizing the DSP48E. Synthesis tools will infer the DSP48E for multipliers, but they are not smart enough to infer other functions. Can control synthesis use with attributes, but NOT if an asynchronous reset is used. It is important to remember that taking advantage of dedicated hardware resources like this gives the user extra slice resources, which gives the tools the ability to reach a higher design speed. SIMD can allow for even greater utilization of the DSP48E resources. Consult UG193: Virtex-5 FPGA XtremeDSP Design Considerations for a more detailed description of what can be accomplished with the DSP48E. Tip: Use sync reset when using the DSP slice resources
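A hedged sketch of a multiply-accumulate written so that its registers could map into a DSP slice: every register uses a synchronous reset. The module name and port widths are assumptions chosen for illustration; whether a given tool actually infers a DSP48E from this code depends on the tool and its settings.

```verilog
// Illustration only: names and widths are assumptions, not from the slides.
module macc_sync_reset (
    input  wire               clk,
    input  wire               rst,   // synchronous, active-high
    input  wire signed [24:0] a,
    input  wire signed [17:0] b,
    output reg  signed [47:0] acc
);
    reg signed [42:0] prod;          // pipeline register after the multiplier

    always @(posedge clk) begin
        if (rst) begin
            prod <= 43'sd0;
            acc  <= 48'sd0;
        end else begin
            prod <= a * b;           // 25 x 18 signed multiply
            acc  <= acc + prod;      // accumulate the previous product
        end
    end
endmodule
```

With an asynchronous reset in the sensitivity list, these registers could not use the DSP slice's internal registers and would have to be built from slice flip-flops.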

17 Why Synchronous Resets?
Block RAMs obtain minimum clock-to-output time by using the output registers Output registers only have synchronous resets Unused block RAMs can be used for many alternative purposes ROMs, large LUTs, complex logic, state machines, deep-shift registers, etc. Using unused block RAMs for other purposes can free up hundreds of flip-flops Using the block RAM in dual-port mode allows for greater utilization of this resource Many designs that run out of slices are not fully utilizing the block RAM resources Synthesis tools are not yet smart enough to infer less obvious functions Tip: Use sync reset when using the block RAM resources

18 Why Synchronous Resets?
Synthesis could choose to move low-fanout synchronous resets from a control signal to the datapath to free up more registers. Synthesis tools can do this, but it may depend on synthesis settings (may not be on by default). The Xilinx implementation tools cannot change what is synthesized. This could allow packing of this register into a slice where that was previously not possible. Can improve timing as well as register density. This could eventually be the biggest reason to use synchronous resets.

19 Why Synchronous Resets?
Synchronous resets are automatically timed. Do not need any special timing constraints. Do not need special switches or settings to analyze timing. Synchronous resets are inherently more predictable. Less susceptible to accidentally missed timing, runt pulses, or other phenomena upsetting logical functionality. Less prone to a race condition. Release of an asynchronous signal may not always have predictable results. Synchronous resets are automatically timed and they are inherently more stable. Asynchronous reset paths are not timed by the implementation tools by default. If you are using them in your design, they can lead to disastrous functionality. Synchronous resets do not need any special constraints in your design (unlike asynchronous resets). They also do not require any special switches in the timing analyzer for analysis (unlike asynchronous resets). Chances are that if you use synchronous resets, your design will work out-of-the-box, if the timing analyzer says you are meeting timing and your functional simulation is working. With asynchronous resets, there is much less certainty, because there are a lot of things that can go unnoticed. Tip: Synchronous resets enable your design to need minimal testing

20 Caveats to Synchronous Resets
Synchronous resets may make timing more difficult, the design larger, and result in longer run times Why? The implementation tools automatically time synchronous reset paths This can result in More timing paths to analyze and meet timing On average ~five percent increase in the number of timing paths More replication of design resources With some synthesis tools this will use fewer SRLs, block RAM, DSP48s, and other dedicated hardware If you build a global asynchronous reset and do not put a from/to constraint on the paths that include this net, the implementation tools will not perform any timing analysis on its paths. Likewise, if you do not assert the timing analyzer switch to report asynchronous delay paths, you may not do any analysis of these paths. The implementation tools can also do a poor job of routing these nets as well, because they are not constrained by default. If the reset becomes synchronous, the paths become timed by default. This will increase the number of timed paths (sometimes dramatically). The impact is that the implementation tools will now have much more work to do and this causes greater implementation times (run times). There have been cases where synthesis tools will simplify the synthesis result so that it does not take as much advantage of dedicated hardware when it determines that a design will be harder to meet timing. It basically infers very little to SRLs and block RAMs, for example. So the design gets built out of additional LUTs and flip-flops to try and get as lean as possible, which can make it harder to meet timing.

21 Changing to Synchronous Resets
All new code should use synchronous resets when a reset is necessary. For existing code, you have three choices: (1) Leave it alone and acknowledge the possible drawbacks of asynchronous resets. (2) Use a synthesis switch; this is not the same as changing to a synchronous reset but can help. (3) Change the asynchronous reset to synchronous manually (or with a script). Removing the top-level reset port does not get the same result. Remove the reset from your code. If you have existing code and are not having any trouble meeting timing, are fitting your device, and are happy with the operation of your device, then you may choose to do nothing. Synthesis tools hate to give you logic that is not reflective of the code. So be careful with the synthesis switch option. Using these switches is not the same as manually changing code. Also note that some believe that if the global set/reset comes in through a top-level port and you disconnect that port, the optimization algorithms will rip out the reset. It does not work this way. The synthesis algorithms that do that optimization run after the tool has chosen which resources are to be mapped. So you still end up with a flip-flop that has an asynchronous reset. Synplify: syn_clean_reset XST: -async_to_sync YES
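The manual change from an asynchronous to a synchronous reset usually comes down to the sensitivity list. The before/after sketch below uses hypothetical module names:

```verilog
// Illustration only: module names are assumptions, not from the slides.

// Before: asynchronous reset (reset appears in the sensitivity list).
module reg_async (
    input  wire clk,
    input  wire rst,
    input  wire d,
    output reg  q
);
    always @(posedge clk or posedge rst) begin
        if (rst) q <= 1'b0;
        else     q <= d;
    end
endmodule

// After: synchronous reset (only the clock is in the sensitivity list).
module reg_sync (
    input  wire clk,
    input  wire rst,
    input  wire d,
    output reg  q
);
    always @(posedge clk) begin
        if (rst) q <= 1'b0;
        else     q <= d;
    end
endmodule
```

Note that the synchronous version only resets on a clock edge, so the reset must be held long enough to be sampled and the clock must be running.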

22 Resets No Resets is Best
Most designs do not need a global reset in an FPGA. ASIC devices usually need resets. There are some ASIC technologies that at most require an initialization when they power up. FPGAs inherently have a global set/reset (GSR) that occurs during power up, and it does not need to be coded in to the design. If you are coding it in, you are actually coding in a second reset.

23 Why No Resets at All? Using synchronous logic frees up additional logic. Designs in which the resets were removed resulted in an average of 3.5 percent fewer registers. Synthesis can realize this additional logic automatically. Removing resets saves LUT inputs, which enables a designer to get more logic out of the FPGA. Tip: This makes it easier for the mapper to group this register with registers of a different control set

24 Why No Resets at All? Synthesis can infer SRL-based shift registers
But only if no resets are used (otherwise flip-flops are wasted). Or, the synthesis tool can emulate the reset (not what you want). The SRL is also useful for synchronous FIFOs, non-binary counters, terminal count logic, pattern generators, and reconfigurable LUTs. There is no reset functionality built into the shift-register LUT. If you have a reset on the shift register you code, the synthesis tools are left with one of two choices. They either do not implement the SRL, meaning they will use a bunch of registers (not what you want), or they try to emulate the reset. Emulating the reset means that they add more logic and it becomes even slower than it should be. In one customer design, a global set/reset ran through everything, so some shift registers could not be mapped to SRLs. The design was already inferring 100 SRLs and making good use of the dedicated hardware. By removing the global set/reset, the customer obtained 150 SRLs, more uses of the SRL than had been recognized. Can you infer an asynchronous reset with the SRL? Yes, but Synplify is the only tool to do that, and it adds logic to emulate the reset, which eliminates the benefit of using the SRL. Synplify can also add the glue logic to emulate a synchronous or asynchronous reset when using the DSP48. But likewise, it kills the benefit of saving registers when using the SRL. Can XST infer an asynchronous reset with the SRL? No. Tip: NO reset saves a lot of flip-flops
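A minimal sketch of a shift register coded so that it could map to an SRL: there is no reset, only a clock and an optional clock enable. The module name and the `DEPTH` parameter are assumptions for illustration.

```verilog
// Illustration only: names are assumptions, not from the slides.
module srl_delay #(
    parameter DEPTH = 16
) (
    input  wire clk,
    input  wire ce,    // optional clock enable
    input  wire d,
    output wire q
);
    // Initial value instead of a reset.
    reg [DEPTH-1:0] sr = {DEPTH{1'b0}};

    always @(posedge clk) begin
        if (ce)
            sr <= {sr[DEPTH-2:0], d};   // shift in at bit 0
    end

    assign q = sr[DEPTH-1];             // only the last tap is used
endmodule
```

Adding a reset branch to this always block would force the tool either to build the chain from flip-flops or to add logic that emulates the reset.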

25 Why No Resets at All? Routing can be considered one of the most valuable resources Resets compete for the same resources as the rest of the active signals of the design Including timing-critical paths More available routing gives the tools a better chance to meet your timing objectives You can see how much fanout a typical reset can have. This is probably the biggest reason to remove resets from your design. Can you migrate a reset to the global routing resources? Yes. Instantiate the STARTUP_VIRTEX5 component from the Xilinx Unified Library, add an IBUFG and connect it to the GSR input to the block, and assign the pin to a dedicated clock pin. Just beware of cryptic messages by the ISE tools. The designer must also be aware that the tools will not automatically move resets to dedicated clock pins, and that they have to control this. Note that in the past the implementation tools would infer the GSR signal for true global set/resets automatically. The implementation tools do not do this any more. Tip: NO reset saves routing and improves design speed

26 Why No Resets at All? Even more block RAM inference Why?
Virtex-5 FPGA RAMs RAM enable has precedence over reset Virtex-5 FPGA registers Reset has priority over the clock enable Coding for this functionality makes no sense With no reset, the enable precedence has no consequence The CE for block RAM has precedence over a reset, which is different from the priorities control signals have with flip-flops. This will be changed in the next generation of Virtex products. This is also the case with Spartan devices. Tip: NO reset gets more block RAMs
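The difference in precedence can be seen by coding both styles side by side. This is a sketch under assumed names and sizes; the first module follows the flip-flop priority (reset over enable) and the second follows the block RAM priority described on this slide (enable over reset).

```verilog
// Illustration only: names and sizes are assumptions, not from the slides.

// Flip-flop style: reset has priority over the clock enable.
module ff_rst_over_ce (
    input  wire clk,
    input  wire rst,
    input  wire ce,
    input  wire d,
    output reg  q
);
    always @(posedge clk) begin
        if (rst)     q <= 1'b0;
        else if (ce) q <= d;
    end
endmodule

// Block RAM style: enable has priority over the reset.
module ram_en_over_rst (
    input  wire        clk,
    input  wire        en,
    input  wire        rst,
    input  wire        we,
    input  wire [9:0]  addr,
    input  wire [15:0] din,
    output reg  [15:0] dout
);
    reg [15:0] mem [0:1023];

    always @(posedge clk) begin
        if (en) begin                     // enable is checked first
            if (we)  mem[addr] <= din;
            if (rst) dout <= 16'h0000;    // reset only acts while enabled
            else     dout <= mem[addr];
        end
    end
endmodule
```

With no reset in either description, the ordering question disappears, which is the point of the tip.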

27 Why No Resets at All? Designs without resets have fewer timing paths
By an average of 18 percent fewer timing paths Results in less run time Improved performance Less memory necessary during PAR Tip: NO reset builds a faster design and saves run time

28 How Do I Get By? Some designs can get away without any resets but many designs need some resets Very few designs require resets on all registers Most ASICs require a described reset on every register. Implement this with the built-in Global Set/Reset (GSR) Suggestion Be selective when you code resets (FSMs, I/O, and flushing data) Only place resets that have impact on functionality Xilinx suggests that you selectively remove resets, or even better, selectively put them in. You can tell that you have used sufficient resets when your design simulates properly. If you can functionally simulate an RTL design, it should work in the FPGA.

29 How Do I Get By? Initialize all registers in VHDL / Verilog code
This should be done whether using a reset or not. Perform RTL simulation of the design. If it functions during simulation, it should function on the FPGA. VHDL: signal my_register : std_logic_vector(7 downto 0) := (others => '0'); Verilog: reg [7:0] my_register = 8'h00; This is an example of initialization. Synthesis tools do act on this code. Most people assume that this is unsynthesizable code, but it will affect the INIT value of the register, since that value is tied to the register primitive. With this code, the register holds 0 when the GSR is released after configuration. That way you can have a known value in your chip. This will also simulate to a 0 at start-up during a functional simulation. If you do not do this and you have a counter, you will have Xs counting +1 and all you have is an X-counter. But if you use this with a counter or anything else, it will start with a known value. This is mandatory if you design without a reset.
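The counter case mentioned in the notes can be sketched as follows (hypothetical module name). The initial value in the declaration replaces the reset:

```verilog
// Illustration only: the module name is an assumption, not from the slides.
module init_counter (
    input  wire       clk,
    output reg  [7:0] count = 8'h00   // initial value, no reset needed
);
    always @(posedge clk)
        count <= count + 8'd1;
endmodule
```

Without the `= 8'h00` initializer, `count` starts as X in RTL simulation and X + 1 stays X, so the counter never shows a known value.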

30 Summary If your design barely fits, Xilinx recommends reducing the size of your design before trying to gain timing closure Most of these tips reduce design size Try to minimize the number of control sets your design uses Asynchronous resets can inhibit optimization of general logic (can force additional LUT inputs to be used) Synchronous resets allow synthesis tools to convert a control signal reset to the datapath Avoid the use of global resets Initialize all registers from your HDL If you have to, use the Startup_Virtex5 primitive to access the GSR net

31 Summary (continued) Xilinx recommends NOT using the synthesis option to convert asynchronous resets to synchronous Avoid resets on SRLs (no reset functionality) Avoid asynchronous resets on block RAMs (the block RAM’s output register only supports a synchronous reset) Avoid asynchronous resets on DSP slice resources (their flip-flops only support a synchronous reset) Be aware of the difference between coding for a block RAM’s control signal precedence and a flip-flop’s precedence Use active-high control signals If you can design out your global reset, you will save a lot of routing and build a faster design

32 Where Can I Learn More? Xilinx Online Documents
support.xilinx.com To search for an Application Note or White Paper, click the Documentation tab and enter the document number (WP231 or XAPP215) in the search window White papers for reference WP231 – HDL Coding Practices to Accelerate Design Performance WP248 – Retargeting Guidelines for Virtex-5 FPGAs WP275 – Get your Priorities Right – Make your Design Up to 50% Smaller User guides for reference UG193 - Virtex-5 FPGA XtremeDSP Design Considerations Additional Online Training Note that when looking for a particular white paper or application note, the letters WP or XAPP must be entered with no space before the numbers (silly). White papers contain concepts or ideas that demonstrate Xilinx product capabilities. Application notes illustrate how to use a Xilinx product in a specialized way.

33 Virtex-5 FPGA HDL Coding Techniques
Part 2

37 Clock Enable Control the use of clock enables from the code
Code them only when needed. If a low-fanout CE is necessary, use synthesis attributes to control the use of control signals at the signal or module level. Do not use global switches to turn off the use of CEs; this results in an average 25-percent LUT increase. Consider using alternative coding methods for low-fanout clock enables. This will map the CE as an input to the LUT. Xilinx is not recommending that you design asynchronously, just that you try to avoid low-fanout CEs (those driving fewer than eight registers). Standard coding (CE maps to the control port): VHDL: if CE = '1' then Q <= A; end if; Verilog: if (CE) Q <= A; Alternative coding (CE maps to a LUT input): VHDL: Q <= (CE and A) or ((not CE) and Q); Verilog: Q <= (CE & A) | (~CE & Q); Tip: Code low-fanout CEs for a LUT input. This will enable the flip-flop to be part of a larger control set
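A complete module for the datapath style of clock enable might look like the sketch below. The names are hypothetical. The register loads `a` when `ce` is high and holds its value when `ce` is low, with the hold expressed as LUT logic rather than as an `if (ce)` branch. A synthesis tool may still recognize this pattern and extract a clock enable, so a tool-specific attribute may also be needed.

```verilog
// Illustration only: names are assumptions, not from the slides.
module low_fanout_ce (
    input  wire clk,
    input  wire ce,
    input  wire a,
    output reg  q = 1'b0
);
    always @(posedge clk)
        q <= (ce & a) | (~ce & q);   // load a when ce = 1, otherwise hold q
endmodule
```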

38 Map Report MAP will report on the number of control sets for a particular design (Virtex-5 FPGA only). Running MAP with the -detail switch will give a detailed analysis of the number of unique control signals (can be a large report). Control sets with a low number of members are of concern (fewest flip-flops per control set). What is the typical range for the number of control sets? There is no hard number. For smaller devices: up to 1000 control sets. In larger devices: 3000– But these numbers can vary significantly and depend on the design being able to fit into a device. If the design does not fit, the number of control sets is probably one of the first things to check. Note that if you turn on the -detail switch, multi-megabyte MAP files can result and they are hard to deal with. Part of the detailed information in section 13 is a complete dump of all the control sets for your particular design. From that, you will be able to understand the fanouts for each control set in your design. Most experienced designers use a Tcl script to manage the MAP -detail report.

39 Global Clock Enable To gate entire clock domains for power reduction, use the clock-enabled global buffer resource, BUFGCE. For applications that only pause the clock on small areas of the design, use the clock enable pin of the FPGA register. Tip: This will save general routing resources
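A sketch of gating a whole clock domain with the global buffer. `BUFGCE` is a Xilinx UNISIM primitive with ports `I`, `CE`, and `O`; the surrounding module and signal names are assumptions for illustration, and simulation requires the Xilinx UNISIM library.

```verilog
// Illustration only: the wrapper module and signal names are assumptions.
module gated_domain (
    input  wire clk_in,      // assumed to come from a clock-capable source
    input  wire domain_en,   // high = clock runs, low = clock stopped
    input  wire d,
    output reg  q = 1'b0
);
    wire clk_gated;

    // Global clock buffer with clock enable (Xilinx UNISIM primitive).
    BUFGCE bufgce_inst (
        .O  (clk_gated),
        .CE (domain_en),
        .I  (clk_in)
    );

    // Every register in this domain is clocked from the gated global clock.
    always @(posedge clk_gated)
        q <= d;
endmodule
```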

40 DSP Slice Use adder chains instead of adder trees
Adder trees tend to have varying size. This usually makes larger adders in the last stages, which increases logic levels. The Virtex-5 FPGA uses adder chains, which obtain peak performance and use minimal power. Requires pipelining. Adds latency. For more information on this, refer to UG073: XtremeDSP for Virtex-4 FPGAs User Guide or UG193: Virtex-5 FPGA XtremeDSP Design Considerations. Tip: Use adder chains instead of adder trees
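A four-input adder chain can be sketched as below. The module name, the `W` parameter, and the register names are assumptions. Each stage adds one new operand to the running sum, and the later operands are delayed so that they arrive in step with the partial sum they join. This is where the extra pipelining and latency come from.

```verilog
// Illustration only: names and widths are assumptions, not from the slides.
module adder_chain4 #(
    parameter W = 16
) (
    input  wire         clk,
    input  wire [W-1:0] a,
    input  wire [W-1:0] b,
    input  wire [W-1:0] c,
    input  wire [W-1:0] d,
    output reg  [W+1:0] sum = 0     // valid 3 clocks after the inputs
);
    reg [W:0]   s_ab  = 0;          // stage 1: a + b
    reg [W+1:0] s_abc = 0;          // stage 2: (a + b) + c
    reg [W-1:0] c_d1  = 0;          // c delayed by 1 clock
    reg [W-1:0] d_d1  = 0;          // d delayed by 1 clock
    reg [W-1:0] d_d2  = 0;          // d delayed by 2 clocks

    always @(posedge clk) begin
        // Balancing registers keep operands aligned with the chain.
        c_d1  <= c;
        d_d1  <= d;
        d_d2  <= d_d1;

        // The chain: one two-input add per stage.
        s_ab  <= a + b;
        s_abc <= s_ab + c_d1;
        sum   <= s_abc + d_d2;
    end
endmodule
```

Every stage is the same two-input adder followed by a register, which is the structure that maps onto cascaded DSP slices.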

41 Block RAM Avoid “read before write” mode for fastest performance
This is easily inferred from the coding style of your memory or by instantiation from the CORE Generator™ tool. Synplify and other third-party synthesis tools can insert bypass logic to prevent a possible mismatch between your RTL and hardware behavior. This logic is intended to force RAM outputs to a known value when read and write operations occur on the same memory cell. If you know this will never happen, you can use an attribute to prevent this logic from being added and damaging your performance: attribute syn_ramstyle of mem : signal is "no_rw_check"; Tip: Infer or instantiate the memory that is most appropriate
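As an alternative to read-before-write behavior, a write-first single-port RAM can be described as in the sketch below. The module name and sizes are assumptions; whether a tool maps this to block RAM in write-first mode depends on the tool.

```verilog
// Illustration only: names and sizes are assumptions, not from the slides.
module ram_write_first (
    input  wire        clk,
    input  wire        we,
    input  wire [9:0]  addr,
    input  wire [15:0] din,
    output reg  [15:0] dout
);
    reg [15:0] mem [0:1023];

    always @(posedge clk) begin
        if (we) begin
            mem[addr] <= din;
            dout      <= din;         // write-first: new data appears on the output
        end else begin
            dout      <= mem[addr];   // normal synchronous read
        end
    end
endmodule
```

A read-before-write description would instead assign `dout <= mem[addr];` unconditionally, so that a write returns the old contents of the cell.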

42 I/O Registers IOB registers provide fixed setup and clock-to-output times. They are the fastest way to capture input data and clock data off the device. IOB registers can make it difficult to meet internal timing: their use can lengthen route delays to internal logic. Only use IOB registers when it is necessary to meet I/O timing. It is best to allow your synthesis tool to put registers into IOBs based on timing constraints (if your tool supports this). Otherwise complete the following steps: (1) Disable global I/O register usage in your synthesis tool. (2) Disable the MAP option to pack registers into IOBs (PAR). (3) Selectively move registers into IOBs with a UCF attribute. XST does NOT support migrating registers to the IOBs based on timing constraints. Synplify does support this. You can also assign the registers to the IOBs in your HDL or NCF (synthesis constraints file). UCF and NCF syntax: INST "instance_name" IOB={TRUE|FALSE}; where TRUE allows the flip-flop or latch to be pulled into an IOB and FALSE indicates not to pull it into an IOB. Example: the following statement instructs the mapper to place the foo/bar instance into an IOB component: INST "foo/bar" IOB=TRUE; The instance name for each register can be found in the synthesis tool's schematic viewer or in a timing report. Also note that -timing (timing-driven packing, another MAP option) does NOT move registers into the IOBs based on the timing constraints. Tip: Only use IOB registers when necessary to meet I/O timing
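The notes mention that the IOB assignment can also be made in the HDL. One hedged way to do this in Verilog is a Verilog-2001 attribute on the register, as sketched below. The attribute spelling follows the XST convention and should be treated as an assumption for other synthesis tools; the module and signal names are hypothetical.

```verilog
// Illustration only: names are assumptions; attribute support is tool-dependent.
module iob_reg_example (
    input  wire clk,
    input  wire pad_in,    // signal arriving from an input pad
    output wire data_q
);
    // Ask the tools to place this register in the IOB.
    (* IOB = "TRUE" *) reg in_ff = 1'b0;

    always @(posedge clk)
        in_ff <= pad_in;

    assign data_q = in_ff;
endmodule
```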

43 Design Hierarchy Register all inputs and outputs to each hierarchical block Or at least register the outputs Place all I/O components at the top level This includes I/O registers, DDR, SERDES, and delay elements If not, place them in one block of hierarchy Any logic that needs to be placed in a single resource (such as a single DSP slice) should be contained in a single hierarchical block Any logic that needs the synthesis tool to use resource sharing should be placed in a single hierarchical block Manually duplicate registers with high fanout at a hierarchical boundary Tip: Following these guidelines ensures that your design is less likely to interfere with design optimization and incremental design practices

44 Synthesis Options Replicate registers with high fan-out
This allows high fan-out logic to be moved closer to destinations This can be determined from a timing report Manual duplication or replication constraints with the synthesis tools should be applied Retiming option should be used, especially if design has been pipelined Pipelining is still encouraged, but not as essential Synthesis tools react to timing constraints by replicating and making designs bigger, which is how they improve performance. Tip: Duplicate high fan-out logic, pipeline as needed, and if you pipeline use retiming
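
Manual duplication can be sketched as follows: one control input is registered twice so that each copy drives only its own group of loads. All names are hypothetical. The attribute that keeps the tool from merging the two identical registers is an assumption about the tool in use (XST documents `equivalent_register_removal`; Synplify and other tools use different attributes), so treat it as a placeholder for whatever your synthesis tool provides.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical block with a duplicated select register.
entity sel_dup is
  port (
    clk   : in  std_logic;
    sel   : in  std_logic;
    a0    : in  std_logic_vector(7 downto 0);
    a1    : in  std_logic_vector(7 downto 0);
    b0    : in  std_logic_vector(7 downto 0);
    b1    : in  std_logic_vector(7 downto 0);
    a_out : out std_logic_vector(7 downto 0);
    b_out : out std_logic_vector(7 downto 0)
  );
end entity sel_dup;

architecture rtl of sel_dup is
  signal sel_a, sel_b : std_logic;

  -- Assumed XST attribute: keep both copies instead of merging them.
  attribute equivalent_register_removal : string;
  attribute equivalent_register_removal of sel_a : signal is "no";
  attribute equivalent_register_removal of sel_b : signal is "no";
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Two copies of the same register, one per group of loads.
      sel_a <= sel;
      sel_b <= sel;

      -- Group A is driven only by sel_a.
      if sel_a = '1' then
        a_out <= a1;
      else
        a_out <= a0;
      end if;

      -- Group B is driven only by sel_b.
      if sel_b = '1' then
        b_out <= b1;
      else
        b_out <= b0;
      end if;
    end if;
  end process;
end architecture rtl;
```

Because each copy has its own loads, the placer is free to put each register near the logic it drives.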

45 Synthesis Options Over-constraining during synthesis can significantly increase register use, seen as an average increase of 1–5 percent. Do NOT over-constrain during synthesis. Global optimization can lead to mixed results: it can achieve a ~10 percent flip-flop reduction, but it gives back much of that (and sometimes more) due to control signals. FSM optimization: turning off FSM optimization can yield a small flip-flop savings, and one-hot encoding is not as useful. Do NOT use slice or LUT compression switches. In some cases, latch-thrus are used and consume registers. Synthesis tools react to timing constraints by replicating and making designs bigger, which is how they improve performance. Remember that the implementation tools are already using worst-case temperature and voltage, so you already have some built-in slack. Only apply the timing constraints you need. Global optimization widens out your logic trees in terms of global fan-in. It also uses your sets and resets as a part of that, which allows set and reset signals to be used as part of the optimization path; this is great at improving performance. Some designs benefit from this option; others do not. Most designs end up losing registers but increase the number of control sets. FSM optimization can yield more significant benefits if your design has many FSMs. With the Virtex-5 FPGA, the 6-LUT means one-hot encoding is not quite as useful as it used to be with the 4-LUT structure. Often, you can still meet performance by using binary or other types of encoding schemes and use far fewer registers. Tip: Do NOT over-constrain and do NOT use slice or LUT compression
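
To illustrate the point about binary encoding, here is a minimal sketch of a five-state machine whose state is coded explicitly as a 3-bit vector, so it needs three state flip-flops where a one-hot version would need five. The entity, ports, and state names are hypothetical. Coding the state vector by hand is only one way to fix the encoding; tool-specific encoding attributes are another.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical five-state controller (all names are assumptions).
entity bin_fsm is
  port (
    clk   : in  std_logic;
    rst   : in  std_logic;   -- synchronous reset
    start : in  std_logic;
    done  : in  std_logic;
    busy  : out std_logic
  );
end entity bin_fsm;

architecture rtl of bin_fsm is
  -- Explicit binary encoding: 3 flip-flops cover 5 states.
  subtype state_t is std_logic_vector(2 downto 0);
  constant S_IDLE  : state_t := "000";
  constant S_LOAD  : state_t := "001";
  constant S_RUN   : state_t := "010";
  constant S_FLUSH : state_t := "011";
  constant S_DONE  : state_t := "100";

  signal state : state_t := S_IDLE;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        state <= S_IDLE;
      else
        case state is
          when S_IDLE =>
            if start = '1' then
              state <= S_LOAD;
            end if;
          when S_LOAD =>
            state <= S_RUN;
          when S_RUN =>
            if done = '1' then
              state <= S_FLUSH;
            end if;
          when S_FLUSH =>
            state <= S_DONE;
          when S_DONE =>
            state <= S_IDLE;
          when others =>
            -- Unused codes return to idle.
            state <= S_IDLE;
        end case;
      end if;
    end if;
  end process;

  busy <= '0' when state = S_IDLE else '1';
end architecture rtl;
```

Whether this meets timing as well as a one-hot version depends on the design; the slide's claim is only that the 6-LUT often makes the denser encoding fast enough.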

46 Synthesis Options Summary
To help meet your timing objectives… Turn ON logic replication and retiming Turn OFF resource sharing Turn ON logic optimization (widening deep data paths) Turn OFF FSM optimization Do NOT over constrain during synthesis Do NOT use slice or LUT compression switches These synthesis options make the design larger, but save FFs and give the PAR algorithms more flexibility to meet timing Logic replication is good if you have a high fanout net that can be duplicated. Retiming is great if you have pipelined the design. Resource sharing allows the tools to borrow signals from across the die that have already been made. We would rather have the logic duplicated if necessary. Global logic optimization works with mixed results.

47 Easiest Designs to Migrate to the Virtex-5 FPGA
Designs that can utilize the new hard IP EMAC, DSP slice, block RAM, PowerPC® 440 processor, and PCI™ technology, for example Low-power designs that use the dedicated IP “Slow” designs Designs with several LUT levels generally see greater speed due to the LUT6 and improved routing architecture Tip: Add as much IP to your design as you can

48 Toughest Designs to Migrate to the Virtex-5 FPGA
Structural designs Designs that have not been coded properly (as just discussed) Designs that have NOT been resynthesized Designs that use many old netlists and cores from previous architectures Some types of DSP designs Heavily pipelined designs What do they have in common? They were not optimized! Tip: Use the coding techniques described in these recorded modules and you will get the high-speed design you hoped for

49 Common Questions “Why can’t I code how I want to?”
You can. As long as it is synthesizable (RTL), Xilinx can build it. This module highlights some of the lesser known trade-offs of coding styles in terms of area, power, and performance. “Shouldn’t the tools be able to make my code optimal?” Some coding styles make this more difficult While FPGAs are programmable, the underlying dedicated hardware is fixed

50 Common Questions “The Virtex-5 FPGA should always be a speed grade faster than the Virtex-4 FPGA, right?” No, this is not always true, particularly for heavily pipelined designs. “This design easily fit in the Virtex-4 FPGA and now it can’t fit in the Virtex-5 FPGA. What’s wrong?” Check how many control sets your design has. If you have too many, you may need to evaluate your use of control signals. Also, check that your cores and your use of the dedicated hardware are optimal. “Why can’t the software just optimize my inverters across a partition?” Remember that partitions are there to preserve hierarchy and parts of your design. Allowing any tool to selectively remove an option is counterintuitive.

51 Summary Follow our synthesis recommendations…
Turn ON logic replication and retiming Turn OFF resource sharing Turn ON logic optimization (widening deep data paths) Turn OFF FSM optimization Do NOT over constrain during synthesis Do NOT use slice or LUT compression switches Be careful with coding unnecessary clock enables IOB registers can make it more difficult to meet internal timing Follow our directions to use the IOB registers only for IO timing Follow our guidelines to ensure that your design does not interfere with design optimization and incremental design practices

52 Where Can I Learn More? Xilinx Online Documents
support.xilinx.com White papers for reference WP231 – HDL Coding Practices to Accelerate Design Performance WP248 – Retargeting Guidelines for Virtex-5 FPGAs WP275 – Get your Priorities Right – Make your Design Up to 50% Smaller User guides for reference UG193 - Virtex-5 FPGA XtremeDSP Design Considerations Software Manuals (found from the web or the Help menu) Constraints Guide Additional Online Training Note that when looking for a particular white paper or application note, the letters WP or XAPP must be entered with no space before the numbers (silly). White papers contain concepts or ideas that demonstrate Xilinx product capabilities. Application notes illustrate how to use a Xilinx product in a specialized way.

53 Trademark Information
Xilinx is disclosing this Document and Intellectual Property (hereinafter “the Design”) to you for use in the development of designs to operate on, or interface with Xilinx FPGAs. Except as stated herein, none of the Design may be copied, reproduced, distributed, republished, downloaded, displayed, posted, or transmitted in any form or by any means including, but not limited to, electronic, mechanical, photocopying, recording, or otherwise, without the prior written consent of Xilinx. Any unauthorized use of the Design may violate copyright laws, trademark laws, the laws of privacy and publicity, and communications regulations and statutes. Xilinx does not assume any liability arising out of the application or use of the Design; nor does Xilinx convey any license under its patents, copyrights, or any rights of others. You are responsible for obtaining any rights you may require for your use or implementation of the Design. Xilinx reserves the right to make changes, at any time, to the Design as deemed desirable in the sole discretion of Xilinx. Xilinx assumes no obligation to correct any errors contained herein or to advise you of any correction if such be made. Xilinx will not assume any liability for the accuracy or correctness of any engineering or technical support or assistance provided to you in connection with the Design. THE DESIGN IS PROVIDED “AS IS" WITH ALL FAULTS, AND THE ENTIRE RISK AS TO ITS FUNCTION AND IMPLEMENTATION IS WITH YOU. YOU ACKNOWLEDGE AND AGREE THAT YOU HAVE NOT RELIED ON ANY ORAL OR WRITTEN INFORMATION OR ADVICE, WHETHER GIVEN BY XILINX, OR ITS AGENTS OR EMPLOYEES. XILINX MAKES NO OTHER WARRANTIES, WHETHER EXPRESS, IMPLIED, OR STATUTORY, REGARDING THE DESIGN, INCLUDING ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE, AND NONINFRINGEMENT OF THIRD-PARTY RIGHTS. 
IN NO EVENT WILL XILINX BE LIABLE FOR ANY CONSEQUENTIAL, INDIRECT, EXEMPLARY, SPECIAL, OR INCIDENTAL DAMAGES, INCLUDING ANY LOST DATA AND LOST PROFITS, ARISING FROM OR RELATING TO YOUR USE OF THE DESIGN, EVEN IF YOU HAVE BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. THE TOTAL CUMULATIVE LIABILITY OF XILINX IN CONNECTION WITH YOUR USE OF THE DESIGN, WHETHER IN CONTRACT OR TORT OR OTHERWISE, WILL IN NO EVENT EXCEED THE AMOUNT OF FEES PAID BY YOU TO XILINX HEREUNDER FOR USE OF THE DESIGN. YOU ACKNOWLEDGE THAT THE FEES, IF ANY, REFLECT THE ALLOCATION OF RISK SET FORTH IN THIS AGREEMENT AND THAT XILINX WOULD NOT MAKE AVAILABLE THE DESIGN TO YOU WITHOUT THESE LIMITATIONS OF LIABILITY. The Design is not designed or intended for use in the development of on-line control equipment in hazardous environments requiring fail-safe controls, such as in the operation of nuclear facilities, aircraft navigation or communications systems, air traffic control, life support, or weapons systems (“High-Risk Applications”). Xilinx specifically disclaims any express or implied warranties of fitness for such High-Risk Applications. You represent that use of the Design in such High-Risk Applications is fully at your risk. © 2012 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.

