Optimizing RTL for EFLX Tony Kozaczuk, Shuying Fan December 21, 2016

Slides:



Advertisements
Similar presentations
Basic HDL Coding Techniques
Advertisements

VERILOG: Synthesis - Combinational Logic Combination logic function can be expressed as: logic_output(t) = f(logic_inputs(t)) Rules Avoid technology dependent.
Verilog Fundamentals Shubham Singh Junior Undergrad. Electrical Engineering.
//HDL Example 8-2 // //RTL description of design example (Fig.8-9) module Example_RTL (S,CLK,Clr,E,F,A);
Spartan-3 FPGA HDL Coding Techniques
Verilog Overview. University of Jordan Computer Engineering Department CPE 439: Computer Design Lab.
ECE 551 Digital Design And Synthesis
Combinational Logic.
A Digital Circuit Toolbox
Table 7.1 Verilog Operators.
//HDL Example 6-1 // //Behavioral description of //Universal shift register // Fig. 6-7 and Table 6-3 module shftreg.
ECE 551 Digital System Design & Synthesis Lecture 08 The Synthesis Process Constraints and Design Rules High-Level Synthesis Options.
1 COMP541 Sequential Circuits Montek Singh Sep 17, 2014.
Kazi Spring 2008CSCI 6601 CSCI-660 Introduction to VLSI Design Khurram Kazi.
Useful Things to Know Norm. Administrative Midterm Grading Finished –Stats on course homepage –Pickup after this lab lec. –Regrade requests within 1wk.
Programmable logic and FPGA
ELEN 468 Advanced Logic Design
Advanced Verilog EECS 270 v10/23/06.
Overview Logistics Last lecture Today HW5 due today
ECE 551 Digital System Design & Synthesis Lecture 11 Verilog Design for Synthesis.
Global Timing Constraints FPGA Design Workshop. Objectives  Apply timing constraints to a simple synchronous design  Specify global timing constraints.
Registers CPE 49 RMUTI KOTAT.
Department of Communication Engineering, NCTU 1 Unit 5 Programmable Logic and Storage Devices – RAMs and FPGAs.
Basic Sequential Components CT101 – Computing Systems Organization.
ENG241 Digital Design Week #8 Registers and Counters.
Slide 1 6. VHDL/Verilog Behavioral Description. Slide 2 Verilog for Synthesis: Behavioral description Instead of instantiating components, describe them.
ECE 448: Lab 6 DSP and FPGA Embedded Resources (Digital Downconverter)
Anurag Dwivedi. Basic Block - Gates Gates -> Flip Flops.
Timing and Constraints “The software is the lens through which the user views the FPGA.” -Bill Carter.
Tools - LogiBLOX - Chapter 5 slide 1 FPGA Tools Course The LogiBLOX GUI and the Core Generator LogiBLOX L BX.
CSCE 211: Digital Logic Design Chin-Tser Huang University of South Carolina.
ACCESS IC LAB Graduate Institute of Electronics Engineering, NTU 99-1 Under-Graduate Project Design of Datapath Controllers Speaker: Shao-Wei Feng Adviser:
1 Workshop Topics - Outline Workshop 1 - Introduction Workshop 2 - module instantiation Workshop 3 - Lexical conventions Workshop 4 - Value Logic System.
Overview Logistics Last lecture Today HW5 due today
Hardware Description Languages: Verilog
Sequential Logic Design
Verilog Tutorial Fall
Supplement on Verilog FF circuit examples
Class Exercise 1B.
Figure 8.1. The general form of a sequential circuit.
Lecture 15 Sequential Circuit Design
Last Lecture Talked about combinational logic always statements. e.g.,
EMT 351/4 DIGITAL IC DESIGN Week # Synthesis of Sequential Logic 10.
Introduction Introduction to VHDL Entities Signals Data & Scalar Types
Each I/O pin may be configured as either input or output.
HDL for Sequential Circuits
Advance Skills TYWu.
Hardware Description Languages: Verilog
CS Fall 2005 – Lec. #5 – Sequential Logic - 1
Field Programmable Gate Array
Field Programmable Gate Array
Field Programmable Gate Array
Chapter 9: Sequential Logic Modules
  State Encoding مرتضي صاحب الزماني.
RTL Style در RTL مدار ترتيبي به دو بخش (تركيبي و عناصر حافظه) تقسيم مي شود. مي توان براي هر بخش يك پروسس نوشت يا براي هر دو فقط يك پروسس نوشت. مرتضي صاحب.
ECE 551: Digital System Design & Synthesis
SYNTHESIS OF SEQUENTIAL LOGIC
FSM MODELING MOORE FSM MELAY FSM. Introduction to DIGITAL CIRCUITS MODELING & VERIFICATION using VERILOG [Part-2]
FPGA Tools Course Answers
Programmable Logic- How do they do that?
CSE 370 – Winter Sequential Logic-2 - 1
Data Flow Description of Combinational-Circuit Building Blocks
Win with HDL Slide 4 System Level Design
The Xilinx Virtex Series FPGA
Dr. Tassadaq Hussain Introduction to Verilog – Part-3 Expressing Sequential Circuits and FSM.
Data Flow Description of Combinational-Circuit Building Blocks
The Verilog Hardware Description Language
ECE 551: Digital System Design & Synthesis
Dr. Tassadaq Hussain Introduction to Verilog – Part-4 Expressing FSM in Verilog (contd) and FSM State Encoding Dr. Tassadaq Hussain.
Presentation transcript:

Optimizing RTL for EFLX Tony Kozaczuk, Shuying Fan December 21, 2016 Copyright © 2016 Flex Logix Technologies, Inc.

Introduction Customers using the EFLX compiler for the EFLX arrays have encountered the following with their RTL: Their RTL compiles without any changes or optimizations required Their RTL needs to have changes made because it doesn’t compile Their RTL needs to have changes made because of either the array resources required are too high or the performance is too low This application note addresses the last 2 issues Things to avoid in RTL in order to compile Optimizations in RTL for better utilization and/or performance

EFLX Core Architecture Overview Logic RBB 4 dual 4-input LUTs 8 Flip Flops Carry chain selectable clocks DSP RBB MAC 22-bit pre adder 22x22 multiplier 48-bit post adder (accumulator) I/O RBB 2 inputs, 2 outputs Inputs and outputs can be pass-through or flopped

Verilog Coding for Embedded FPGAs Summary Things to avoid in RTL in order to compile Avoid using latches. Use D flip flops whenever possible Four State values not supported ‘0’, ‘1’, ‘x’, ‘z’ Only ‘0’ and ‘1’ supported Inout data type or tristate buffers not supported Optimizations in RTL for better utilization and/or performance Maintain Synchronous Sub-Blocks by Registering All Outputs Use registered inputs and outputs to the embedded FPGA fabric Use synchronous resets instead of asynchronous Use clock enables instead of multiple clocks Use 1-hot encoding instead of binary encoding for state machines Implement proper synchronization of all asynchronous signals Optimize Code for Embedded FPGA fabric 4

Things to avoid in RTL in order to compile

Inadvertent Latch Inference via incomplete combinatorial process In Verilog, if all conditions are not specified, a latch will be inferred Latches are not supported in the embedded FPGA fabric and will yield unpredictable results 6

Not supported language constructs Four State values within the FPGA fabric not supported (i.e. ‘0’, ‘1’, ‘x’, ‘z’) Only ‘0’ and ‘1’ supported in the fabric Bidirectional buffers need to be generated outside the FPGA fabric Avoid using inout datatype which will incur bidirectional IOBUF instances that are not supported in the fabric Muxes used instead of Hi-Z buffers No Hi-Z Buffers OEn Sel Sel FPGA Fabric In_A Dout In_A Out Out In_B 1 Din I/O Pads In_B Mux Bidirectional Buffer 7

Don’t Use IOBUF or BUFT FPGA Fabric OEn Dout Output Dout The EFLX compiler will treat BUFT as a regular buffer ignoring OE BUFT Input Din FPGA Fabric module test ( output bidirPin2, inout bidirPin, input toPin, input direction, output fromPin ); assign bidirPin2= direction? 1'bz: toPin; assign bidirPin = direction ? 1'bz : toPin; assign fromPin = bidirPin; endmodule BUFT IOBUF

Optimizations in RTL for better utilization and/or performance

Maintain Synchronous Sub-Blocks by Registering All Outputs Arrange the design boundary so that the outputs in each block are registered. Registering outputs helps the synthesis tool implement the combinatorial logic and registers in the same logic block. Registering outputs also makes the application of timing constraints easier since it eliminates possible problems with logic optimization across design boundaries. Using a single clock for each synchronous block significantly reduces the timing consideration in the block. Synchronous Blocks with Registered Outputs Merging Sharable Resource in the Same Block

Use Registered Inputs/Outputs Registered inputs and outputs decouples the FPGA fabric timing from the SOC Registered inputs and outputs also improves performance due to pipelined design EFLX Array Din D Q FPGA Fabric Dout Q D CLK

Synchronous resets Use Synchronous reset instead of asynchronous resets whenever possible Using asynchronous reset may result in sub-optimal logic utilization This is due to LUT/FF packing not able to optimize due to asynchronous reset Asynchronous resets can cause synchronization problems or glitches No reset scheme also possible by using CHIP_RST, initializing registers with ‘0’ on FPGA fabric during power-up Asynchronous Reset Synchronous Reset Always @ (posedge CLK or posedge RESET) If (RESET) Dout<=0; else Dout<=Din; Always @ (posedge CLK) If (RESET) Dout<=0; else Dout<=Din; Din Din Dout Dout D Q D Q RESET CLK CLK CLR CLR Resets FF in RBB that is defined by user RTL RESET Resets all FFs in RBB 12

Use clock enables instead of multiple clocks Avoid using gated, derived, or divided clocks Use clock enables instead of multiple clocks Multiple clock domains cause higher LUT/FF utilization Example: dout clocked out at clk divided by 256 Derived Clock, 2 clock domains Using Clock Enable, 1 clock domain module derived_clock (clk, din, dout); input clk; input [7:0] din; output [7:0] dout; reg [7:0] temp; reg [7:0] dout; always @(posedge clk) temp <=temp+1; always @(negedge temp[7]) dout<=din; endmodule module derived_clock (clk, din, dout); input clk; input [7:0] din; output [7:0] dout; reg [7:0] temp; reg [7:0] dout; always @(posedge clk) temp <=temp+1; if (temp==8’hff) dout<=din; endmodule 11 LUTs 25 FFs 18% LUT reduction 9 LUTs 23 FFs 13

binary state & cmd encoding 1-hot state & cmd encoding Use 1-Hot Encoding 1-hot encoding takes less EFLX array resources than using binary encoding, given the same compiler options. Binary Encoding 1-Hot Encoding Resource comparison for an I2C Controller binary state & cmd encoding 1-hot state & cmd encoding Logic RBBs 94 88 RBB LUTs 256 244 Total LUTs 284 263 RBB FFs 155 153 Total FFs 162 161 IO RBBs 17 Number of Tiles Needed 4 3 Floorplan spec 2x2 3x1 Frequency (MHz) 31.90 31.97 ~7.5% less LUTs & 1 less tile

Implement proper synchronization of all asynchronous signals Asynchronous signals need to be synchronized before used by FPGA fabric Double flop async signals for demetastabilization EFLX compiler will place demetastable flops close together Asynchronous Domain Crossing Demetastabilization Flip Flops Din Dout’ Dout D Q D Q Asynchronous domain Reset of FPGA Fabric CLK Synchronous CLR Always @ (posedge CLK) Dout’<=Din; Dout<=Dout’; 15

Optimizing RTL Code Two methods of optimizing RTL to take advantage of EFLX fabric Rewriting RTL code to fit into fabric in optimal way Using the reconfigurability of the EFLX fabric to minimize configuration/mode registers and functional modes 16

Optimizing code example Non Optimized Serial-to-Parallel Code Optimized Serial-to-Parallel Code 1 mux per D-input 8 muxes total 8 FFs 6 RBB LUTs 0 muxes 8 FFs 4 RBB LUTs 33% LUT reduction Simpler Logic uses less LUTs 17

32-bit FIR Filter RTL Code- Non Optimized 5 Tap FIR Filter Input X(n) C2 C3 C4 C5 X C1 X X X X Output Y(n) Z-1 + Z-1 + Z-1 + Z-1 + Y[n]=C1*X[n] + C2*X[n-1] + C3*X[n-2] + C4*X[n-3] + C5*X[n-4] Utilizes 10 DSP RBBs Plus additional 284 Logic RBBs Additional Logic RBBs required due to 32bit*16bit multiplier not fitting into 22bit*22bit DSP MAC 18

32-Bit FIR Filter RTL Code- Optimized Optimizations of the RTL code to better fit into the EFLX DSP MAC structure (i.e. 22bit*22bit=48bit) Utilizes 10 DSP RBBs and only 8 Logic RBBs Splitting the 32 bit data into upper word and lower word Shifting the upper word accumulator left by 16 and adding to lower word accumulator Result is 48-bits 19

2.2x reduction of total tiles required EFLX Utilization 32-bit FIR Non Optimized 32-bit FIR Optimized DSP RBBs used 10 Logic RBB used 284 8 LUTs used 192 32 RBB FF used 493 48 DSP tiles used 5 Logic tiles- used 6 Total tiles used 11 2.2x reduction of total tiles required 20

FIR Filter with programmable Coefficient Registers Programmable Coefficient Registers take additional LUTs 21

FIR Filter with Coefficients Configured in the EFLX Fabric Registers Not Required Use Programmability of FPGA for Coefficient values Coefficients as assign statements 22

LUT Savings 16-bit FIR with programmable Coefficient Registers 16-bit FIR with Coefficients configured in the fabric DSP RBBs used 5 Logic RBB used 28 RBB LUTs used 52 RBB FF used 80 DSP tiles used 3 Total Available RBB LUTs 264 LUT Utilization 20% 0% 20% LUT utilization reduction 23