Winning with HDL

AGENDA
- Introduction
- HDL coding techniques
- Virtex hardware
- Summary

Coding for Performance
- Gate arrays are relatively tolerant of poor coding styles and design practices
  - 66 MHz is easy for a gate array
- Designs coded for a gate array tend to perform 3x slower when converted to an FPGA
  - Not uncommon to see up to 30 layers of logic and MHz FPGA designs
  - 6-8 FPGA logic levels = 50 MHz
- FPGAs require different coding styles and more effective design methodologies to reach gate array system speeds

Coding for Performance
- A common mistake is to ignore the hardware and start coding as if programming software. To achieve the best performance, the designer must think about the hardware.
- Improve performance by:
  - avoiding unnecessary priority structures in logic
  - optimizing logic for late-arriving signals
  - structuring arithmetic for performance
  - avoiding area-inefficient code
  - buffering high-fanout signals
  - pipelining for high performance
  - exploiting high-performance cores from CoreGen

Effective Coding Style: Case vs. If-Then-Else

[Figure: a 4:1 mux (in0-in3, sel, mux_out) beside a priority chain (in0-in3 gated by sel=00, sel=01, sel=10, driving p_encoder_out)]

    module mux (in0, in1, in2, in3, sel, mux_out);
      input        in0, in1, in2, in3;
      input  [1:0] sel;
      output       mux_out;
      reg          mux_out;

      always @(in0 or in1 or in2 or in3 or sel) begin
        case (sel)
          2'b00:   mux_out = in0;
          2'b01:   mux_out = in1;
          2'b10:   mux_out = in2;
          default: mux_out = in3;
        endcase
      end
    endmodule

    module p_encoder (in0, in1, in2, in3, sel, p_encoder_out);
      input        in0, in1, in2, in3;
      input  [1:0] sel;
      output       p_encoder_out;
      reg          p_encoder_out;

      always @(in0 or in1 or in2 or in3 or sel) begin
        if (sel == 2'b00)      p_encoder_out = in0;
        else if (sel == 2'b01) p_encoder_out = in1;
        else if (sel == 2'b10) p_encoder_out = in2;
        else                   p_encoder_out = in3;
      end
    endmodule

Generally, if-else is slower unless you intend to build a priority encoder!

Priority Encoder “if-then-else”: When to Use?
- Assign highest priority to a late-arriving critical signal
- Nested “if-then-else” can increase area and delay
- Use a “case” statement if possible to describe the same function

    always @(sel or in) begin
      if (sel == 3'h0)      out = in[0];
      else if (sel == 3'h1) out = in[1];
      else if (sel == 3'h2) out = in[2];
      else if (sel == 3'h3) out = in[3];
      else if (sel == 3'h4) out = in[4];
      else                  out = in[5];
    end

[Figure: chain of 2:1 muxes, each selected by S, fed by in[4] down to in[0]]

Benefits of “case” statement

    always @(C or D or E or F or G or H or I or J or S) begin
      case (S)
        3'b000  : Z = C;
        3'b001  : Z = D;
        3'b010  : Z = E;
        3'b011  : Z = F;
        3'b100  : Z = G;
        3'b101  : Z = H;
        3'b110  : Z = I;
        default : Z = J;
      endcase
    end

[Figure: 8:1 mux with inputs C-J, select S, output Z]

- Compact and delay-optimized implementation
  - The 8:1 mux is implemented in a single CLB
- Synthesis maps to MUXF5 and MUXF6 functions
- A 4:1 mux is implemented in a single CLB slice

Effective Coding Style: Optimize for the Critical Path

[Figure: two gate networks with inputs in0-in3 and critical driving out; in the first, critical enters at the first logic level, in the second at the last]

    module critical_bad (in0, in1, in2, in3, critical, out);
      input  in0, in1, in2, in3, critical;
      output out;

      assign out = (((in0&in1) & ~critical) | ~in2) & ~in3;
    endmodule

    module critical_good (in0, in1, in2, in3, critical, out);
      input  in0, in1, in2, in3, critical;
      output out;

      assign out = ((in0&in1) | ~in2) & ~in3 & ~critical;
    endmodule

Minimize the critical path where possible
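The two expressions on this slide illustrate logic depth but are not logically equivalent: with critical = 1, in2 = 0 and in3 = 0, critical_bad drives 1 while critical_good drives 0. If the function of critical_bad has to be preserved, one possible restructuring is to precompute both outcomes from the early inputs and let the late signal drive only a final 2:1 select. The sketch below assumes that approach; the module name critical_mux and the intermediate wire names are illustrative, and how the final select is mapped (for example onto a MUXF5) depends on the synthesis tool.

```verilog
// Sketch: same truth table as critical_bad, with "critical" moved to the
// last logic level. Module and wire names are hypothetical.
module critical_mux (in0, in1, in2, in3, critical, out);
  input  in0, in1, in2, in3, critical;
  output out;

  // Both candidate results depend only on the early-arriving inputs.
  wire out_if_low  = ((in0 & in1) | ~in2) & ~in3;  // value when critical == 0
  wire out_if_high = ~in2 & ~in3;                  // value when critical == 1

  // The late-arriving signal only passes through the final 2:1 select.
  assign out = critical ? out_if_high : out_if_low;
endmodule
```

Setting critical to 0 or 1 in the critical_bad expression yields exactly out_if_low and out_if_high, so the function is unchanged while the late signal sees a single level of logic.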

Structuring Arithmetic for Performance
- Know your tools: use synthesis directives and options (vendor specific)
  - Area, speed, ungrouping and flattening, resource sharing, "DesignWare" libraries
  - Attributes: ripple, look-ahead, fastest, smallest, e.g.
        // exemplar attribute out1 modgen_sel fastest
  - LogiBlox or CORE Generator if the vendor hasn't fully tuned yet
- Use parentheses to control logical structure

    -- No parentheses
    OUT1 <= I1 + I2 + I3 + I4;

    -- With parentheses
    OUT1 <= (I1 + I2) + (I3 + I4);

[Figure: a chain of three adders (I1 through I4 in series) beside a balanced tree of adders, both driving OUT1]

How to use the Carry-In in FPGA Express
- In FPGA Express, concatenate the carry-in onto both operands to get a single adder with carry (ADDER_C). Without concatenation (ADDER_B), you end up with 2 adders.
- In other tools, such as Leonardo, ADDER_B generates a single adder with carry-in; no concatenation is necessary.

    // ADDER_A
    // No carry-in
    AOUT = AIN1 + AIN2;

    // ADDER_B
    // Carry-in used but 2 adders
    BOUT = BIN1 + BIN2 + BCARRYIN;

    // ADDER_C
    // Carry-in used with only 1 adder required
    // Concatenate
    {COUT, CCARRYOUT} = {CIN1,CCARRYIN} + {CIN2,CCARRYIN};
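The concatenation trick works because the two appended LSBs add up to 2 x carry-in, which produces a carry of exactly carry-in into bit 0 of the real operands. The LSB of the widened result is therefore always 0 and can be discarded. A minimal parameterized sketch follows, assuming a hypothetical module adder_cin that also brings out a true carry-out by widening the operands by one more bit at the top; all names here are illustrative and not from the slides.

```verilog
// Sketch: N-bit adder with carry-in and carry-out from a single "+".
// Module, port, and parameter names are hypothetical.
module adder_cin (a, b, cin, sum, cout);
  parameter N = 8;
  input  [N-1:0] a, b;
  input          cin;
  output [N-1:0] sum;
  output         cout;

  wire dummy;  // LSB of the widened sum; always 0 and otherwise unused

  // {1'b0, a, cin} has the value 2*a + cin, and likewise for b, so the
  // (N+2)-bit result is 2*(a + b + cin): bit 0 is 0, bits N..1 hold the
  // sum, and bit N+1 holds the carry-out.
  assign {cout, sum, dummy} = {1'b0, a, cin} + {1'b0, b, cin};
endmodule
```

Whether a given synthesis tool maps this onto one carry chain is tool-dependent, as the slide notes for FPGA Express and Leonardo.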

Verilog Notes
- For CASE statements, be sure to use your synthesis vendor’s syntax to ensure optimum performance.
  - Full_case syntax allows you to avoid unwanted latches
  - Parallel_case syntax allows you to ensure a parallel (as opposed to priority-encoded) hardware implementation in case statements where all cases are mutually exclusive
- Use “Don’t-Cares” to speed up your design and reduce area
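A vendor-neutral way to get a similar effect is to make the case statement complete with a default branch that assigns a don't-care. The sketch below assumes a one-hot select; the module and signal names are hypothetical. The default assignment of x keeps the logic free of latches and leaves the optimizer free to choose the cheapest output for select values that never occur.

```verilog
// Sketch: one-hot select with a don't-care default. Names are hypothetical.
module onehot_mux (sel, a, b, c, z);
  input  [2:0] sel;   // assumed one-hot: exactly one bit set
  input        a, b, c;
  output       z;
  reg          z;

  always @(sel or a or b or c) begin
    // The case items are distinct constants, so they are mutually exclusive.
    // A vendor directive such as "// synopsys full_case parallel_case" is an
    // alternative; its spelling varies by tool.
    case (sel)
      3'b001:  z = a;
      3'b010:  z = b;
      3'b100:  z = c;
      default: z = 1'bx;  // don't-care: every path assigns z, so no latch
    endcase
  end
endmodule
```

In simulation the default branch drives x for illegal select values, which makes a violated one-hot assumption visible instead of hiding it.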

Avoid inefficient code

[Figure: poor_resource_sharing uses two adders (a0 + b0, a1 + b1) followed by a mux on sel; good_resource_sharing uses two muxes on sel followed by a single adder producing sum]

    module poor_resource_sharing (a0, a1, b0, b1, sel, sum);
      input  a0, a1, b0, b1, sel;
      output sum;
      reg    sum;

      always @(a0 or a1 or b0 or b1 or sel) begin
        if (sel) sum = a1 + b1;
        else     sum = a0 + b0;
      end
    endmodule

    module good_resource_sharing (a0, a1, b0, b1, sel, sum);
      input  a0, a1, b0, b1, sel;
      output sum;
      reg    sum;
      reg    a_temp, b_temp;

      always @(a0 or a1 or b0 or b1 or sel) begin
        if (sel) begin
          a_temp = a1;
          b_temp = b1;
        end
        else begin
          a_temp = a0;
          b_temp = b0;
        end
        sum = a_temp + b_temp;
      end
    endmodule

Use 2 muxes rather than 2 adders to reduce resource usage

Duplicate Registers to Reduce Fan-Out

[Figure: in high_fanout a single tri_en register drives 24 loads; in low_fanout, tri_en1 and tri_en2 drive 12 loads each]

    module high_fanout (in, en, clk, out);
      input  [23:0] in;
      input         en, clk;
      output [23:0] out;
      reg    [23:0] out;
      reg           tri_en;

      always @(posedge clk)
        tri_en <= en;

      always @(tri_en or in) begin
        if (tri_en) out = in;
        else        out = 24'bZ;
      end
    endmodule

    module low_fanout (in, en, clk, out);
      input  [23:0] in;
      input         en, clk;
      output [23:0] out;
      reg    [23:0] out;
      reg           tri_en1, tri_en2;

      // Two copies of the enable register, each driving half of the bus
      always @(posedge clk) begin
        tri_en1 <= en;
        tri_en2 <= en;
      end

      always @(tri_en1 or in) begin
        if (tri_en1) out[23:12] = in[23:12];
        else         out[23:12] = 12'bZ;
      end

      always @(tri_en2 or in) begin
        if (tri_en2) out[11:0] = in[11:0];
        else         out[11:0] = 12'bZ;
      end
    endmodule

Design Partition - Reg at Boundary

[Figure: reg_at_boundary registers the adder output (sum) at the module boundary; reg_in_module registers a0 and a1 first and leaves the adder output combinational]

    module reg_at_boundary (a0, a1, clk, sum);
      input  a0, a1, clk;
      output sum;
      reg    sum;

      always @(posedge clk) begin
        sum <= a0 + a1;
      end
    endmodule

    module reg_in_module (a0, a1, clk, sum);
      input  a0, a1, clk;
      output sum;
      reg    sum;
      reg    a0_temp, a1_temp;

      always @(posedge clk) begin
        a0_temp <= a0;
        a1_temp <= a1;
      end

      always @(a0_temp or a1_temp) begin
        sum = a0_temp + a1_temp;
      end
    endmodule

Pipeline for Performance

[Figure: no_pipeline computes a * b + c between registers in 1 cycle; pipeline places a register between the multiplier and the adder, for 2 cycles]

    module no_pipeline (a, b, c, clk, out);
      input  a, b, c, clk;
      output out;
      reg    out;
      reg    a_temp, b_temp, c_temp;

      always @(posedge clk) begin
        out    <= (a_temp * b_temp) + c_temp;
        a_temp <= a;
        b_temp <= b;
        c_temp <= c;
      end
    endmodule

    module pipeline (a, b, c, clk, out);
      input  a, b, c, clk;
      output out;
      reg    out;
      reg    a_temp, b_temp, c_temp1, c_temp2, mult_temp;

      // Stage 1: registered multiply
      always @(posedge clk) begin
        mult_temp <= a_temp * b_temp;
        a_temp    <= a;
        b_temp    <= b;
      end

      // Stage 2: registered add; c is delayed to line up with mult_temp
      always @(posedge clk) begin
        out     <= mult_temp + c_temp2;
        c_temp2 <= c_temp1;
        c_temp1 <= c;
      end
    endmodule

Pipeline to increase performance

Take Advantage of Virtex Hardware
- Use flip-flops and pipeline! FPGAs contain hordes of flip-flops.
- Virtex gives you 4 DLLs that can be used to synchronize clocks for superior system timing
- Use the optimized cores from CoreGen to get high-performance, pipelined arithmetic and sophisticated functional blocks

RTL Flexibility for Register Configurations

Register mapping for:
- Registers with sync/async set and reset
- Clocks, inverted clocks, and clock enable

Positive edge triggered flip-flop with clock enable, sync clear and preset:

    always @(posedge clk or posedge preset) begin
      if (preset)     q <= 1;
      else if (reset) q <= 0;
      else if (CE)    q <= data;
    end

[Figure: flip-flop with data, ce, clk, reset and preset inputs and q output]

Timing Driven Register IOB Mapping
- Technology mapping will not duplicate registers
- Critical signal will not be absorbed in the IOB register

    process (Clk)
    begin
      if (Clk'event and Clk = '1') then
        Tri_R <= Tri;
      end if;
    end process;

    -- Data_out drives the OUT[23:0] pads (OUT itself is a reserved word in VHDL)
    process (Tri_R, Data_in)
    begin
      if (Tri_R = '1') then
        Data_out <= Data_in;
      else
        Data_out <= (others => 'Z');
      end if;
    end process;

[Figure: TRI registered by CLK into TRI_R, which enables the tri-state drivers from DATA[23:0] to OUT[23:0]; fanout = 24]

Timing Driven Register IOB Mapping
- Duplicate the register on the critical path for a fanout of 1
- Mapping will absorb the register in the IOB

    process (Clk)
    begin
      if (Clk'event and Clk = '1') then
        Tri_R1 <= Tri;
        Tri_R2 <= Tri;
      end if;
    end process;

    -- Data_out drives the OUT pads (OUT itself is a reserved word in VHDL)
    process (Tri_R1, Data_in)
    begin
      if (Tri_R1 = '1') then
        Data_out(23) <= Data_in(23);
      else
        Data_out(23) <= 'Z';
      end if;
    end process;

    process (Tri_R2, Data_in)
    begin
      if (Tri_R2 = '1') then
        Data_out(22 downto 0) <= Data_in(22 downto 0);
      else
        Data_out(22 downto 0) <= (others => 'Z');
      end if;
    end process;

[Figure: TRI registered by CLK into TRI_R1, enabling DATA[23] to OUT[23] with fanout = 1; TRI registered by CLK into TRI_R2, enabling DATA[22:0] to OUT[22:0] with fanout = 23]

Area Efficient Muxes using TBUFs
- Improve area efficiency by using tri-states
- Each CLB has 2 TBUFs

Tri-state version:

    assign Q[7:0] = E0 ? A[7:0] : 8'bzzzzzzzz;
    assign Q[7:0] = E1 ? B[7:0] : 8'bzzzzzzzz;
    assign Q[7:0] = E2 ? C[7:0] : 8'bzzzzzzzz;
    assign Q[7:0] = E3 ? D[7:0] : 8'bzzzzzzzz;

Mux version:

    case (E)
      4'b0001 : Q[7:0] = A[7:0];
      4'b0010 : Q[7:0] = B[7:0];
      4'b0100 : Q[7:0] = C[7:0];
      4'b1000 : Q[7:0] = D[7:0];
    endcase

[Figure: a 4:1 mux selecting A[7:0] through D[7:0] with E[3:0] onto Z[7:0], beside four tri-state buffers enabled by E0 through E3 driving a shared Z[7:0] bus]

TBUFs as Muxes: Performance Summary
- Improve area efficiency by using tri-states
- But often slower than equivalent muxes in most circumstances
  - Too much delay getting onto the TBUF
- Each CLB has 2 TBUFs
- PAR can connect tri-states on multiple horizontal long lines to build wide muxes

Distributed RAM Inferencing: System Memory

    module ramtest (q, addr, d, we, clk);
      output [3:0] q;
      input  [3:0] d;
      input  [2:0] addr;
      input        we;
      input        clk;
      reg    [3:0] mem [7:0];

      // Asynchronous read
      assign q = mem[addr];

      // Synchronous write
      always @(posedge clk) begin
        if (we) mem[addr] <= d;
      end
    endmodule

[Figure: Synplicity result (RAM 8x4): addr[2:0], d[3:0], clk, we and q[3:0] mapped onto RAM16x1S primitives with pins A0-A3, D, WCLK, WE]

- Synplify and LeonardoSpectrum can infer distributed RAM
- FPGA Express will support RAM inferencing in the future

Registered IO Mapping: System Interfaces
- System timing
  - Chip-to-chip performance limits system speeds
- No need to instantiate IOB register cells
  - Implementation tools will pack registers in the IO
  - map -pr b, where the option is b (both input and output), i (input only), or o (output only)
  - IOB = TRUE attribute
- Mapping for data and enable ports

[Figure: output register (D, CE, CLK, S/R, Q) driving an OBUF; an IBUF driving an input register (D, CE, CLK, S/R, Q)]

Instantiating Technology Specific Features
- Block RAM: system memory
- CLKDLL: minimizes clock skew
- Special IOs: interfacing with standard buses
- LUTs for datapath pipelining: add latency with minimal area impact

LUTs for Datapath pipelining
- A LUT can be used in place of registers to balance pipeline stages
  - Area-efficient implementation
- SRL16E can delay an input value by up to 16 clock cycles
  - Sync up operands before the next operation

[Figure: A[31:0], B[31:0] and C[31:0] feed blocks F, G and H with latencies of 8 cycles, 5 cycles and 1 cycle, producing Z; SRL16E primitives (D, CE, CLK, A3-A0, Q) balance the paths. Captions: "LUTs replace 256 registers" and "32 LUTs replace 416 registers"]
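The slides reach the SRL16E by instantiation. As a complementary sketch, a shift register written without a reset and with a single tap is a form that many synthesis tools can map onto an SRL16E; whether a particular tool does so is an assumption here, and the module, port, and parameter names are hypothetical.

```verilog
// Sketch: one-bit delay line written so that it can fit a single SRL16E.
// Names are hypothetical; mapping to SRL16E depends on the synthesis tool.
module delay_line (clk, ce, d, q);
  parameter DEPTH = 8;   // delay in enabled clock cycles, 2..16
  input  clk, ce, d;
  output q;

  reg [DEPTH-1:0] sr;

  // No reset: the SRL16E has none, and adding one would force flip-flops.
  always @(posedge clk)
    if (ce) sr <= {sr[DEPTH-2:0], d};

  // Only the last stage is observed, so the intermediate bits need no
  // individual outputs.
  assign q = sr[DEPTH-1];
endmodule
```

A 32-bit operand delayed by 8 cycles would use 32 such delay lines in place of 256 flip-flops, which matches the register count quoted on the slide.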

Block RAM: System Memory

Verilog instantiation:

    RAMB4_S1 U1 (.WE(WE), .EN(EN), .RST(RST), .CLK(CLK), .ADDR(ADDR), .DI(DI), .DO(DO));

VHDL instantiation:

    component RAMB4_S1
      port (WE, EN, RST, CLK : in  STD_LOGIC;
            ADDR             : in  STD_LOGIC_VECTOR(11 downto 0);
            DO               : out STD_LOGIC;
            DI               : in  STD_LOGIC_VECTOR(0 downto 0));
    end component;

    begin
      U1 : RAMB4_S1 port map (WE => WE, EN => EN, RST => RST, CLK => CLK,
                              DI => DI, ADDR => ADDR, DO => DO);

[Figure: RAMB4_S1 block with ports ADDR, WE, EN, RST, DI, CLK and DO connected to signals addr, we, en, rst, di, clk and do]

- Instantiate single and dual port RAM
- Use CoreGen to build RAM and FIFO (Q1 '99)

Virtex CLKDLL

    wire clk_fb;
    BUFGDLL U4 (.I(clkin), .O(clk_fb));

[Figure: BUFGDLL U4 (I = clkin, O = clk_fb) built from an IBUFG, a CLKDLL (inputs CLKIN, CLKFB, RST; outputs CLK0, CLK90, CLK180, CLK270, CLK2X, CLKDV, LOCKED) and a BUFG]

- Minimize clock-to-out pad delay
  - Removes all delay from the external GCLKPAD pin to the registers and RAM
- BUFGDLL is available for instantiation
  - Other configurations can be built by instantiating the CLKDLL macro
- The UCF is the only way to configure CLKDLL or BUFGDLL
  - In the future, generics (VHDL) and parameters (Verilog) would be preferable, but synthesizers don't pass them on yet

Special IO Buffers: System Interfaces
- Default IO buffer is LVTTL (12 mA), available via inference
  - Process technology leads to mixed-voltage systems
  - High-performance, low-power signal standards are emerging
- Instantiate IO buffers for
  - non-default current drive
  - non-default voltage standard
  - non-default slew

    // Accelerated Graphics Port bus interface (Pentium II graphics)
    OBUF_AGP U0 (.I(awire), .O(oport));

    // Fast slew rate and 24 mA drive strength
    OBUF_F_24 U1 (.I(awire), .O(oport));

[Figure: buffers U0 and U1, each driving oport from awire]

Summary
- Efficient HDL coding allows designers to build high-performance designs
- Designers should consider the underlying hardware as they code, to achieve the best results
- Exploit the hardware’s features for best performance