Winning with HDL

AGENDA
- Introduction
- HDL coding techniques
- Virtex hardware
- Summary

Coding for Performance
- Gate arrays are relatively tolerant of poor coding styles and design practices
  - 66 MHz is easy for a gate array
- Designs coded for a gate array tend to run 3x slower when converted to an FPGA
  - It is not uncommon to see up to 30 levels of logic and 10-20 MHz FPGA designs
  - 6-8 FPGA logic levels = 50 MHz
- FPGAs require different coding styles and more effective design methodologies to reach gate array system speeds
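
As a rough check on the figures above, the arithmetic can be worked through under one simplifying assumption: path delay scales linearly with the number of logic levels, ignoring routing variation and clock overhead. The per-level delay below is inferred from the slide's own numbers and is not a datasheet value.

```latex
\begin{align*}
T_{50\,\mathrm{MHz}} &= \frac{1}{50\,\mathrm{MHz}} = 20\ \mathrm{ns} \\
t_{\mathrm{level}} &\approx \frac{20\ \mathrm{ns}}{8} \text{ to } \frac{20\ \mathrm{ns}}{6}
                    = 2.5 \text{ to } 3.3\ \mathrm{ns} \\
T_{30\ \mathrm{levels}} &\approx 30 \times t_{\mathrm{level}} = 75 \text{ to } 100\ \mathrm{ns} \\
f_{30\ \mathrm{levels}} &\approx \frac{1}{T_{30\ \mathrm{levels}}} = 10 \text{ to } 13\ \mathrm{MHz}
\end{align*}
```

Under that assumption, 30 levels of logic land at the low end of the 10-20 MHz range quoted for gate-array-style code on an FPGA.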

Coding for Performance
- A common mistake is to ignore the hardware and start coding as if programming. To achieve the best performance, the designer must think about the hardware.
- Improve performance by:
  - avoiding unnecessary priority structures in logic
  - optimizing logic for late-arriving signals
  - structuring arithmetic for performance
  - avoiding area-inefficient code
  - buffering high fanout signals
  - pipelining for high performance
  - exploiting high performance cores from CoreGen

Effective Coding Style: Case vs. If-Then-Else

[Figure: 4:1 mux (in0-in3, sel, mux_out) next to a priority-encoded chain (in0-in3, sel=00, sel=01, sel=10, p_encoder_out)]

    module mux (in0, in1, in2, in3, sel, mux_out);
      input in0, in1, in2, in3;
      input [1:0] sel;
      output mux_out;
      reg mux_out;

      always @(in0 or in1 or in2 or in3 or sel)
      begin
        case (sel)
          2'b00:   mux_out = in0;
          2'b01:   mux_out = in1;
          2'b10:   mux_out = in2;
          default: mux_out = in3;
        endcase
      end
    endmodule

    module p_encoder (in0, in1, in2, in3, sel, p_encoder_out);
      input in0, in1, in2, in3;
      input [1:0] sel;
      output p_encoder_out;
      reg p_encoder_out;

      always @(in0 or in1 or in2 or in3 or sel)
      begin
        if (sel == 2'b00)
          p_encoder_out = in0;
        else if (sel == 2'b01)
          p_encoder_out = in1;
        else if (sel == 2'b10)
          p_encoder_out = in2;
        else
          p_encoder_out = in3;
      end
    endmodule

Generally, If-Else is slower unless you intend to build a priority encoder!

Priority Encoder: "if-then-else"

When to use?
- Assign highest priority to a late-arriving critical signal
- Nested "if-then-else" can increase area and delay
- Use a "case" statement if possible to describe the same function

    always @(sel or in)
    begin
      if (sel == 3'h0)
        out = in[0];
      else if (sel == 3'h1)
        out = in[1];
      else if (sel == 3'h2)
        out = in[2];
      else if (sel == 3'h3)
        out = in[3];
      else if (sel == 3'h4)
        out = in[4];
      else
        out = in[5];
    end

[Figure: chain of 2:1 muxes, each controlled by S, fed by in[4] down to in[0]]
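
The first bullet can be sketched as follows. This is a minimal illustration, not code from the original deck: the module name late_first and the signals late, sel_a and sel_b are hypothetical, and late is assumed to be the late-arriving control. The first condition tested in an if-else chain typically ends up closest to the output, so the late signal passes through the fewest logic levels.

```verilog
// Hypothetical example: 'late' is assumed to arrive later than sel_a/sel_b.
module late_first (a, b, c, sel_a, sel_b, late, y);
  input a, b, c, sel_a, sel_b, late;
  output y;
  reg y;

  always @(a or b or c or sel_a or sel_b or late)
  begin
    if (late)          // highest priority: tested first
      y = c;
    else if (sel_a)    // earlier-arriving controls sit deeper in the chain
      y = a;
    else if (sel_b)
      y = b;
    else
      y = 1'b0;
  end
endmodule
```

How the chain is actually mapped depends on the synthesis tool and its optimization settings, so the timing report is the place to confirm the result.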

Benefits of the "case" statement

    always @(C or D or E or F or G or H or I or J or S)
    begin
      case (S)
        3'b000  : Z = C;
        3'b001  : Z = D;
        3'b010  : Z = E;
        3'b011  : Z = F;
        3'b100  : Z = G;
        3'b101  : Z = H;
        3'b110  : Z = I;
        default : Z = J;
      endcase
    end

[Figure: 8:1 mux with inputs C-J, select S and output Z]

- 8:1 mux: compact and delay-optimized implementation
  - Implemented in a single CLB
- Synthesis maps to MUXF5 and MUXF6 functions
- A 4:1 mux is implemented in a single CLB slice

Effective Coding Style: Optimize for the Critical Path

[Figure: two gate networks with inputs in0-in3 and critical; in the second, critical enters at the last gate before out]

    module critical_bad (in0, in1, in2, in3, critical, out);
      input in0, in1, in2, in3, critical;
      output out;

      assign out = (((in0&in1) & ~critical) | ~in2) & ~in3;
    endmodule

    module critical_good (in0, in1, in2, in3, critical, out);
      input in0, in1, in2, in3, critical;
      output out;

      assign out = ((in0&in1) | ~in2) & ~in3 & ~critical;
    endmodule

Minimize the critical path where possible

Structuring Arithmetic for Performance
- Know your tools: use synthesis directives and options (vendor specific)
  - Area, speed, ungrouping and flattening, resource sharing, "DesignWare" libraries
  - Attributes: ripple, look-ahead, fastest, smallest, e.g.
        // exemplar attribute out1 modgen_sel fastest
  - LogiBlox or CORE Generator if the vendor hasn't fully tuned yet
- Use parentheses to control logical structure

    -- No parentheses
    OUT1 <= I1 + I2 + I3 + I4;

    -- With parentheses
    OUT1 <= (I1 + I2) + (I3 + I4);

[Figure: adder structures for the two forms, inputs I1-I4, output OUT1]
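
The same parentheses technique can be sketched in Verilog. The module name add4, the port names and the 8-bit operand width are assumptions made for illustration. Both outputs compute the same value; the difference is the structure a synthesis tool is likely to build when it does not rebalance the expression itself.

```verilog
// Hypothetical example: widths and names are illustrative only.
module add4 (i1, i2, i3, i4, sum_chain, sum_tree);
  input  [7:0] i1, i2, i3, i4;
  output [9:0] sum_chain, sum_tree;   // 10 bits hold the full 4-operand sum

  // No parentheses: evaluated left to right as ((i1 + i2) + i3) + i4,
  // which describes three adders in series.
  assign sum_chain = i1 + i2 + i3 + i4;

  // With parentheses: (i1 + i2) and (i3 + i4) can be computed in
  // parallel, leaving two adder delays on the longest path.
  assign sum_tree = (i1 + i2) + (i3 + i4);
endmodule
```

If one operand arrives late, a chain with that operand added last can be the better structure, so the grouping should follow the arrival times of the inputs.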

How to use the Carry-In in FPGA Express
- In FPGA Express, concatenate the carry-in to get an adder with carry (ADDER_C). Without concatenation (ADDER_B), you would end up with 2 adders.
- In other tools, like Leonardo, ADDER_B will generate a single adder with carry-in; no concatenation is necessary.

    // ADDER_A
    // No carry-in
    AOUT = AIN1 + AIN2;

    // ADDER_B
    // Carry-in used but 2 adders
    BOUT = BIN1 + BIN2 + BCARRYIN;

    // ADDER_C
    // Carry-in used with only 1 adder required
    // Concatenate
    {COUT, CCARRYOUT} = {CIN1,CCARRYIN} + {CIN2,CCARRYIN};
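
A self-contained sketch of the concatenation technique, with an explicit carry-out, might look like the following. The module name adder_cin, the 8-bit width and the dummy wire are assumptions for illustration and do not come from the slide. Appending cin as an extra least-significant bit of both operands makes bit 0 generate a carry equal to cin into the real addition; bit 0 of the result is always 0 and is discarded.

```verilog
// Hypothetical example: 8-bit adder with carry-in described as one addition.
module adder_cin (a, b, cin, sum, cout);
  input  [7:0] a, b;
  input        cin;
  output [7:0] sum;
  output       cout;

  wire dummy;  // bit 0 of the widened result (cin ^ cin = 0), unused

  // {a, cin} + {b, cin} = 2 * (a + b + cin), so the bits above bit 0
  // hold a + b + cin. The leading 1'b0 makes room for the carry-out.
  assign {cout, sum, dummy} = {1'b0, a, cin} + {1'b0, b, cin};
endmodule
```

For example, a = 8'hFF, b = 8'h00 and cin = 1 gives sum = 8'h00 with cout = 1. Whether a particular tool needs this form or already merges the carry-in on its own is tool dependent, as the slide notes.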

Verilog Notes
- For CASE statements, be sure to use your synthesis vendor's syntax to ensure optimum performance.
  - Full_case syntax allows you to avoid unwanted latches
  - Parallel_case syntax allows you to ensure a parallel (as opposed to priority-encoded) hardware implementation in case statements where all cases are mutually exclusive
- Use "don't-cares" to speed up your design and reduce area
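
A minimal sketch of these notes follows. The directive is written in the Synopsys comment style; other vendors use their own spelling, so the exact syntax should be taken from the tool documentation. The module name onehot_mux and the one-hot assumption on sel are illustrative. Here a default branch assigning x is used instead of full_case: it makes the case complete (so no latch is inferred) and marks the remaining select values as don't-cares.

```verilog
// Hypothetical example: 'sel' is assumed to be one-hot.
module onehot_mux (sel, a, b, c, y);
  input [2:0] sel;
  input a, b, c;
  output y;
  reg y;

  always @(sel or a or b or c)
  begin
    casez (sel)  // synopsys parallel_case
      3'b??1  : y = a;
      3'b?1?  : y = b;
      3'b1??  : y = c;
      default : y = 1'bx;  // don't-care: free for the optimizer, and no latch
    endcase
  end
endmodule
```

The parallel_case directive is only safe when the one-hot assumption really holds; if two select bits can be high together, simulation (which picks the first match) and the synthesized hardware can disagree.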

Avoid inefficient code

[Figure: two adders (a0 + b0, a1 + b1) feeding a mux selected by sel, versus two muxes selected by sel feeding a single adder that drives sum]

    module poor_resource_sharing (a0, a1, b0, b1, sel, sum);
      input a0, a1, b0, b1, sel;
      output sum;
      reg sum;

      always @(a0 or a1 or b0 or b1 or sel)
      begin
        if (sel)
          sum = a1 + b1;
        else
          sum = a0 + b0;
      end
    endmodule

    module good_resource_sharing (a0, a1, b0, b1, sel, sum);
      input a0, a1, b0, b1, sel;
      output sum;
      reg sum;
      reg a_temp, b_temp;

      always @(a0 or a1 or b0 or b1 or sel)
      begin
        if (sel)
        begin
          a_temp = a1;
          b_temp = b1;
        end
        else
        begin
          a_temp = a0;
          b_temp = b0;
        end
        sum = a_temp + b_temp;
      end
    endmodule

Use 2 muxes rather than 2 adders to reduce resource usage

Duplicate Registers to Reduce Fan-Out

[Figure: a single tri_en register driving 24 loads, versus duplicated registers tri_en1 and tri_en2 driving 12 loads each]

    module low_fanout (in, en, clk, out);
      input [23:0] in;
      input en, clk;
      output [23:0] out;
      reg [23:0] out;
      reg tri_en1, tri_en2;

      // Two copies of the enable register, each driving half of the bus
      always @(posedge clk)
      begin
        tri_en1 <= en;
        tri_en2 <= en;
      end

      always @(tri_en1 or in)
      begin
        if (tri_en1)
          out[23:12] = in[23:12];
        else
          out[23:12] = 12'bZ;
      end

      always @(tri_en2 or in)
      begin
        if (tri_en2)
          out[11:0] = in[11:0];
        else
          out[11:0] = 12'bZ;
      end
    endmodule

    module high_fanout (in, en, clk, out);
      input [23:0] in;
      input en, clk;
      output [23:0] out;
      reg [23:0] out;
      reg tri_en;

      // One enable register drives all 24 tri-state buffers
      always @(posedge clk)
        tri_en <= en;

      always @(tri_en or in)
      begin
        if (tri_en)
          out = in;
        else
          out = 24'bZ;
      end
    endmodule

Design Partition: Reg at Boundary

[Figure: a0 + a1 with the register on the sum output, versus registers on a0 and a1 ahead of the adder]

    module reg_at_boundary (a0, a1, clk, sum);
      input a0, a1, clk;
      output sum;
      reg sum;

      // Output of the module is registered
      always @(posedge clk)
      begin
        sum <= a0 + a1;
      end
    endmodule

    module reg_in_module (a0, a1, clk, sum);
      input a0, a1, clk;
      output sum;
      reg sum;
      reg a0_temp, a1_temp;

      // Inputs are registered; the module output is combinational
      always @(posedge clk)
      begin
        a0_temp <= a0;
        a1_temp <= a1;
      end

      always @(a0_temp or a1_temp)
      begin
        sum = a0_temp + a1_temp;
      end
    endmodule

Pipeline for Performance

[Figure: multiply and add of a, b, c completed in 1 cycle, versus the same operation split across 2 cycles]

    module no_pipeline (a, b, c, clk, out);
      input a, b, c, clk;
      output out;
      reg out;
      reg a_temp, b_temp, c_temp;

      // Multiply and add share a single clock period
      always @(posedge clk)
      begin
        out    <= (a_temp * b_temp) + c_temp;
        a_temp <= a;
        b_temp <= b;
        c_temp <= c;
      end
    endmodule

    module pipeline (a, b, c, clk, out);
      input a, b, c, clk;
      output out;
      reg out;
      reg a_temp, b_temp, c_temp1, c_temp2, mult_temp;

      // Stage 1: multiply
      always @(posedge clk)
      begin
        mult_temp <= a_temp * b_temp;
        a_temp    <= a;
        b_temp    <= b;
      end

      // Stage 2: add; c is delayed to stay aligned with the product
      always @(posedge clk)
      begin
        out     <= mult_temp + c_temp2;
        c_temp2 <= c_temp1;
        c_temp1 <= c;
      end
    endmodule

Pipeline to increase performance

Take Advantage of Virtex Hardware
- Use flip-flops and pipeline! FPGAs contain hordes of flip-flops.
- Virtex gives you 4 DLLs that can be used to synchronize clocks for superior system timing
- Use the optimized cores from CoreGen to get high performance, pipelined arithmetic and sophisticated functional blocks

RTL Flexibility for Register Configurations

Register mapping for
- Registers with sync/async set and reset
- Clocks, inverted clocks, and clock enable

Positive edge triggered flip-flop with clock enable, synchronous clear, and asynchronous preset:

    always @(posedge clk or posedge preset)
    begin
      if (preset)
        q <= 1'b1;      // asynchronous: preset is in the sensitivity list
      else if (reset)
        q <= 1'b0;      // synchronous clear
      else if (CE)
        q <= data;
    end

[Figure: flip-flop with data, clk, ce, reset and preset inputs and q output]
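
As a companion sketch for the "inverted clocks" item, the following describes a falling-edge flip-flop with asynchronous reset and clock enable. The module name ff_negedge is hypothetical, and how the inverted clock is implemented (for example with the register's own clock inversion) depends on the tools and the target device.

```verilog
// Hypothetical example: falling-edge register with async reset and enable.
module ff_negedge (clk, reset, ce, data, q);
  input clk, reset, ce, data;
  output q;
  reg q;

  // negedge clk describes the inverted clock; reset appears in the
  // sensitivity list, which makes it asynchronous.
  always @(negedge clk or posedge reset)
  begin
    if (reset)
      q <= 1'b0;
    else if (ce)
      q <= data;
  end
endmodule
```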

Timing Driven Register IOB Mapping
- Technology mapping will not duplicate registers
- The critical signal will not be absorbed in the IOB register

    -- "out" is a VHDL reserved word, so the output is named Data_out here
    process (Clk)
    begin
      if (Clk'event and Clk = '1') then
        Tri_R <= Tri;
      end if;
    end process;

    process (Tri_R, Data_in)
    begin
      if (Tri_R = '1') then
        Data_out <= Data_in;
      else
        Data_out <= (others => 'Z');
      end if;
    end process;

[Figure: TRI registered by CLK into TRI_R, which enables DATA[23:0] onto OUT[23:0]; fanout = 24]

Timing Driven Register IOB Mapping
- Duplicate the register on the critical path for a fanout of 1
- Mapping will absorb the register in the IOB

    -- "out" is a VHDL reserved word, so the output is named Data_out here
    process (Clk)
    begin
      if (Clk'event and Clk = '1') then
        Tri_R1 <= Tri;
        Tri_R2 <= Tri;
      end if;
    end process;

    process (Tri_R1, Data_in)
    begin
      if (Tri_R1 = '1') then
        Data_out(23) <= Data_in(23);
      else
        Data_out(23) <= 'Z';
      end if;
    end process;

    process (Tri_R2, Data_in)
    begin
      if (Tri_R2 = '1') then
        Data_out(22 downto 0) <= Data_in(22 downto 0);
      else
        Data_out(22 downto 0) <= (others => 'Z');
      end if;
    end process;

[Figure: TRI_R1 enables DATA[23] onto OUT[23], fanout = 1; TRI_R2 enables DATA[22:0] onto OUT[22:0], fanout = 23]

Area Efficient Muxes using TBUFs
- Improve area efficiency by using tri-states
- Each CLB has 2 TBUFs

Tri-state version:

    assign Q[7:0] = E0 ? A[7:0] : 8'bzzzzzzzz;
    assign Q[7:0] = E1 ? B[7:0] : 8'bzzzzzzzz;
    assign Q[7:0] = E2 ? C[7:0] : 8'bzzzzzzzz;
    assign Q[7:0] = E3 ? D[7:0] : 8'bzzzzzzzz;

Mux version:

    case (E)
      4'b0001 : Q[7:0] = A[7:0];
      4'b0010 : Q[7:0] = B[7:0];
      4'b0100 : Q[7:0] = C[7:0];
      4'b1000 : Q[7:0] = D[7:0];
    endcase

[Figure: mux of A[7:0]-D[7:0] selected by E[3:0] driving Z[7:0], versus A[7:0]-D[7:0] on tri-state buffers enabled by E0-E3 sharing Z[7:0]]

TBUFs as Muxes: Performance Summary
- Tri-states improve area efficiency
- But they are often slower than equivalent muxes
  - Too much delay getting onto the TBUF
- Each CLB has 2 TBUFs
- PAR can connect tri-states on multiple horizontal long lines to build wide muxes

Distributed RAM Inferencing: System Memory

    module ramtest (q, addr, d, we, clk);
      output [3:0] q;
      input [3:0] d;
      input [2:0] addr;
      input we;
      input clk;
      reg [3:0] mem [7:0];

      // Asynchronous read
      assign q = mem[addr];

      // Synchronous write
      always @(posedge clk)
      begin
        if (we)
          mem[addr] <= d;
      end
    endmodule

[Figure: Synplicity result (RAM 8x4) built from RAM16x1S primitives, with pins A0-A3, D, WCLK, WE and Q, connected to Addr[2:0], D[3:0], clk, we and q[3:0]]

- Synplify and LeonardoSpectrum can infer distributed RAM
- FPGA Express will support RAM inferencing in the future

Registered IO Mapping: System Interfaces
- System timing
  - Chip-to-chip performance limits system speeds
- No need to instantiate IOB register cells
  - Implementation tools will pack registers in the IO:
        map -pr b
    where the option is b (both input and output), i (input only) or o (output only)
  - IOB = TRUE attribute
- Mapping for data and enable ports

[Figure: output register (D, CE, CLK, S/R, Q) driving an OBUF, and an IBUF driving an input register (D, CE, CLK, S/R, Q)]
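
A minimal sketch of RTL that leaves registers next to the pads, so the implementation tools have something to pack, could look like this. The module and port names are hypothetical. No IOB cells are instantiated; whether the registers end up in the IOBs depends on options such as the map -pr setting or the IOB attribute mentioned above.

```verilog
// Hypothetical example: plain registers adjacent to the input and output pads.
module reg_io (clk, pad_in, core_out, core_in, pad_out);
  input  clk;
  input  pad_in;     // from the input pad
  input  core_out;   // from internal logic
  output core_in;    // to internal logic
  output pad_out;    // to the output pad
  reg core_in, pad_out;

  always @(posedge clk)
  begin
    core_in <= pad_in;     // first flop after the pad: input register candidate
    pad_out <= core_out;   // last flop before the pad: output register candidate
  end
endmodule
```

Keeping logic out of the path between a pad and its register is what makes the register eligible for packing in this sketch.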

Instantiating Technology Specific Features
- Block RAM: system memory
- CLKDLL: minimizes clock skew
- Special IOs: interfacing with standard buses
- LUTs for datapath pipelining: add latency with minimal area impact

LUTs for Datapath Pipelining
- A LUT can be used in place of registers to balance pipeline stages
  - Area efficient implementation
- SRL16E can delay an input value up to 16 clock cycles
  - Sync up operands before the next operation

[Figure: operands A[31:0], B[31:0] and C[31:0] pass through blocks F (8 cycles), G (5 cycles) and H (1 cycle) to produce Z. An SRL16E (D, CE, CLK, A3-A0, Q) with address 7: 32 LUTs replace 256 registers. An SRL16E with address 12: 32 LUTs replace 416 registers.]
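
The behavior described above can be sketched as a 16-deep, 1-bit shift register with an addressable tap. This is a behavioral model written for illustration, with a hypothetical module name; whether a synthesis tool maps it onto an SRL16E or onto 16 flip-flops is tool dependent, and instantiating the primitive is the alternative. No reset is used, since resetting every stage would generally prevent a LUT-based implementation.

```verilog
// Hypothetical example: addressable delay line, delay = addr + 1 clock cycles.
module srl_delay (clk, ce, addr, d, q);
  input        clk, ce, d;
  input  [3:0] addr;
  output       q;
  reg   [15:0] sr;

  always @(posedge clk)
  begin
    if (ce)
      sr <= {sr[14:0], d};   // new data enters at bit 0
  end

  assign q = sr[addr];       // tap selected by the address inputs
endmodule
```

With addr = 7 each bit is delayed 8 cycles, so 32 such delay lines stand in for 8 x 32 = 256 registers; with addr = 12 the delay is 13 cycles, or 13 x 32 = 416 registers, matching the figures on the slide.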

Block RAM: System Memory
- Instantiate single and dual port RAM
- Use CoreGen to build RAM and FIFO (Q1 '99)

Verilog:

    RAMB4_S1 U1 (.WE(WE), .EN(EN), .RST(RST), .CLK(CLK),
                 .ADDR(ADDR), .DI(DI), .DO(DO));

VHDL:

    component RAMB4_S1
      port (WE, EN, RST, CLK : in  STD_LOGIC;
            ADDR             : in  STD_LOGIC_VECTOR(11 downto 0);
            DO               : out STD_LOGIC;
            DI               : in  STD_LOGIC_VECTOR(0 downto 0));
    end component;

    begin
      U1 : RAMB4_S1
        port map (WE => WE, EN => EN, RST => RST, CLK => CLK,
                  DI => DI, ADDR => ADDR, DO => DO);

[Figure: RAMB4_S1 symbol with pins WE, EN, RST, CLK, ADDR, DI and DO]

Virtex CLKDLL
- Minimize clock-to-out pad delay
  - Removes all delay from the external GCLKPAD pin to the registers and RAM
- BUFGDLL is available for instantiation
  - Other configurations can be built by instantiating the CLKDLL macro
- The UCF is the only way to configure CLKDLL or BUFGDLL
  - In the future we would like to use generics (VHDL) and parameters (Verilog), but synthesizers don't pass them on yet

    wire clk_fb;
    BUFGDLL U4 (.I(clkin), .O(clk_fb));

[Figure: BUFGDLL U4 (I = clkin, O = clk_fb), built from IBUFG, CLKDLL (inputs CLKIN, CLKFB, RST; outputs CLK0, CLK90, CLK180, CLK270, CLK2X, CLKDV, LOCKED) and BUFG]

Special IO Buffers: System Interfaces
- Default IO buffer is LVTTL (12 mA), available via inference
  - Process technology leads to mixed voltage systems
  - High performance, low power signal standards are emerging
- Instantiate IO buffers for
  - non-default current drive
  - non-default voltage standard
  - non-default slew

    // Accelerated Graphics Port bus interface (Pentium II graphics)
    OBUF_AGP U0 (.I(awire), .O(oport));

    // Fast slew rate and 24 mA drive strength
    OBUF_F_24 U1 (.I(awire), .O(oport));

Summary
- Efficient HDL coding allows designers to build high performance designs
- Designers should consider the underlying hardware as they code, to achieve the best results
- Exploit the hardware's features for best performance
