Winning with HDL
Agenda
- Introduction
- HDL coding techniques
- Virtex hardware
- Summary
Coding for Performance
- Gate arrays are relatively tolerant of poor coding styles and design practices; 66 MHz is easy for a gate array.
- Designs coded for a gate array tend to run 3x slower when converted to an FPGA.
- Not uncommon to see up to 30 layers of logic and MHz FPGA designs.
- 6-8 FPGA logic levels = 50 MHz.
- FPGAs require different coding styles and more effective design methodologies to reach gate array system speeds.
Coding for Performance
- A common mistake is to ignore the hardware and start coding as if programming.
- To achieve the best performance, the designer must think about the hardware.
- Improve performance by:
  - avoiding unnecessary priority structures in logic
  - optimizing logic for late-arriving signals
  - structuring arithmetic for performance
  - avoiding area-inefficient code
  - buffering high-fanout signals
  - pipelining for high performance
  - exploiting high-performance cores from CoreGen
Effective Coding Style: Case vs. If-Then-Else

[Figure: 4:1 mux (in0..in3, sel, mux_out) next to a priority-encoded chain (sel=00, sel=01, sel=10, p_encoder_out)]

    // case: all branches are selected in parallel, one 4:1 mux
    module mux (in0, in1, in2, in3, sel, mux_out);
      input        in0, in1, in2, in3;
      input  [1:0] sel;
      output       mux_out;
      reg          mux_out;

      always @(in0 or in1 or in2 or in3 or sel) begin
        case (sel)
          2'b00:   mux_out = in0;
          2'b01:   mux_out = in1;
          2'b10:   mux_out = in2;
          default: mux_out = in3;
        endcase
      end
    endmodule

    // if / else if: each test is evaluated in order, a priority chain
    module p_encoder (in0, in1, in2, in3, sel, p_encoder_out);
      input        in0, in1, in2, in3;
      input  [1:0] sel;
      output       p_encoder_out;
      reg          p_encoder_out;

      always @(in0 or in1 or in2 or in3 or sel) begin
        if (sel == 2'b00)
          p_encoder_out = in0;
        else if (sel == 2'b01)
          p_encoder_out = in1;
        else if (sel == 2'b10)
          p_encoder_out = in2;
        else
          p_encoder_out = in3;
      end
    endmodule

Generally, if-else is slower unless you intend to build a priority encoder!
Priority Encoder: "if-then-else"
- When to use? Assign the highest priority to a late-arriving critical signal.
- Nested "if-then-else" can increase area and delay.
- Use a "case" statement if possible to describe the same function.

    always @(sel or in) begin
      if      (sel == 3'h0) out = in[0];
      else if (sel == 3'h1) out = in[1];
      else if (sel == 3'h2) out = in[2];
      else if (sel == 3'h3) out = in[3];
      else if (sel == 3'h4) out = in[4];
      else                  out = in[5];
    end

[Figure: chain of 2:1 muxes, each controlled by a select term S, with in[0] entering last so that it has the shortest path to the output]
Benefits of the "case" statement

    // S is 3 bits wide, so the case items use 3'b literals
    always @(C or D or E or F or G or H or I or J or S) begin
      case (S)
        3'b000  : Z = C;
        3'b001  : Z = D;
        3'b010  : Z = E;
        3'b011  : Z = F;
        3'b100  : Z = G;
        3'b101  : Z = H;
        3'b110  : Z = I;
        default : Z = J;
      endcase
    end

[Figure: 8:1 mux with inputs C, D, E, F, G, H, I, J, select S, output Z]

- Compact and delay-optimized implementation
- The 8:1 mux is implemented in a single CLB
- Synthesis maps to the MUXF5 and MUXF6 functions
- A 4:1 mux is implemented in a single CLB slice
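Effective Coding Style: Optimize for the Critical Path

[Figure: two gate networks over in0, in1, in2, in3, critical, out; in the first, critical enters at the first gate, in the second it enters at the last gate]

    // critical passes through three levels of logic
    module critical_bad (in0, in1, in2, in3, critical, out);
      input  in0, in1, in2, in3, critical;
      output out;
      assign out = (((in0 & in1) & ~critical) | ~in2) & ~in3;
    endmodule

    // critical enters at the final AND
    module critical_good (in0, in1, in2, in3, critical, out);
      input  in0, in1, in2, in3, critical;
      output out;
      assign out = ((in0 & in1) | ~in2) & ~in3 & ~critical;
    endmodule

Minimize the critical path where possible: place the late-arriving signal as close to the output as the function allows.

Note: as written, the two modules are not logically equivalent. With critical = 1 and in2 = in3 = 0, critical_bad drives 1 and critical_good drives 0. The pair illustrates where critical sits in the logic cone; it is not a drop-in rewrite.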
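A minimal sketch of one way to move a late-arriving signal to the end of the logic cone without changing the function of critical_bad: expand the expression on critical, compute the result for critical = 0 and for critical = 1 ahead of time, and let critical only select between the two. The module name critical_late and the wires out_if_0 and out_if_1 are illustrative and not from the original slides.

```verilog
// Same truth table as critical_bad; critical drives only the final 2:1 select.
module critical_late (in0, in1, in2, in3, critical, out);
  input  in0, in1, in2, in3, critical;
  output out;

  // critical_bad evaluated with critical = 0
  wire out_if_0 = ((in0 & in1) | ~in2) & ~in3;
  // critical_bad evaluated with critical = 1 (the in0 & in1 term drops out)
  wire out_if_1 = ~in2 & ~in3;

  assign out = critical ? out_if_1 : out_if_0;
endmodule
```

Both candidate results settle before critical arrives, so critical sees a single mux level. The cost is some duplicated logic, which is the usual area-for-speed trade.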
Structuring Arithmetic for Performance

    -- No parentheses: chain of adders
    OUT1 <= I1 + I2 + I3 + I4;

    -- With parentheses: balanced adder tree
    OUT1 <= (I1 + I2) + (I3 + I4);

[Figure: adder chain over I1, I2, I3, I4 to OUT1, and adder tree pairing (I1, I2) and (I3, I4) to OUT1]

- Know your tools: use synthesis directives and options (vendor specific)
  - Area, speed, ungrouping and flattening, resource sharing, "DesignWare" libraries
  - Attributes: ripple, look-ahead, fastest, smallest, e.g.
        // exemplar attribute out1 modgen_sel fastest
  - Use LogiBlox or CORE Generator if the vendor hasn't fully tuned its arithmetic yet
- Use parentheses to control logical structure
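The same grouping expressed in Verilog, as a sketch. The module names, lower-case port names, and the 8-bit operand width are assumptions for illustration, and whether a given synthesis tool preserves the parenthesized structure depends on the tool and its options.

```verilog
// Chain: three adders in series on the path from i1 to out1.
module add_chain (i1, i2, i3, i4, out1);
  input  [7:0] i1, i2, i3, i4;
  output [9:0] out1;               // 10 bits hold 4 * 255 without overflow
  assign out1 = i1 + i2 + i3 + i4;
endmodule

// Tree: parentheses pair the operands, giving two adder levels.
module add_tree (i1, i2, i3, i4, out1);
  input  [7:0] i1, i2, i3, i4;
  output [9:0] out1;
  assign out1 = (i1 + i2) + (i3 + i4);
endmodule
```

Both modules compute the same sum; only the depth of the adder structure differs. If one operand arrives late, the chain form with that operand placed last can be the better choice, so the grouping should follow the arrival times.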
How to use the Carry-In in FPGA Express
- In FPGA Express, concatenate the carry-in to get a single adder with carry (ADDER_C).
- Without concatenation (ADDER_B), you end up with 2 adders.
- In other tools, like Leonardo, ADDER_B generates a single adder with carry-in; no concatenation is necessary.

    // ADDER_A
    // No carry-in
    AOUT = AIN1 + AIN2;

    // ADDER_B
    // Carry-in used but 2 adders
    BOUT = BIN1 + BIN2 + BCARRYIN;

    // ADDER_C
    // Carry-in used with only 1 adder required
    // Concatenate: the carry-in becomes an extra low-order bit on both operands.
    // Assuming COUT is as wide as CIN1, the low result bit (CCARRYOUT here)
    // is always 0 and can be discarded.
    {COUT, CCARRYOUT} = {CIN1, CCARRYIN} + {CIN2, CCARRYIN};
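A self-contained sketch of the ADDER_C concatenation trick, written out with explicit widths. The module name adder_cin, the WIDTH parameter, and the port names are assumptions for illustration; the slides show only the one-line assignment.

```verilog
module adder_cin (a, b, cin, sum, cout);
  parameter WIDTH = 8;              // assumed operand width
  input  [WIDTH-1:0] a, b;
  input              cin;
  output [WIDTH-1:0] sum;
  output             cout;

  // Append cin as an extra LSB on both operands. That bit position adds
  // cin + cin, which gives 0 with a carry of cin into bit 0 of a + b.
  // The leading 1'b0 makes room for the carry out of the top bit.
  wire [WIDTH+1:0] full = {1'b0, a, cin} + {1'b0, b, cin};

  assign sum  = full[WIDTH:1];      // a + b + cin, truncated to WIDTH bits
  assign cout = full[WIDTH+1];      // carry out of the top bit
  // full[0] is always 0 and is left unconnected.
endmodule
```

The expression contains a single "+" operator, which is why a tool that builds one adder per operator ends up with one adder here instead of two.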
Verilog Notes
- For case statements, be sure to use your synthesis vendor's directive syntax to ensure optimum performance.
  - full_case allows you to avoid unwanted latches.
  - parallel_case allows you to ensure a parallel (as opposed to priority-encoded) hardware implementation in case statements where all cases are mutually exclusive.
- Use "don't-cares" to speed up your design and reduce area.
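A sketch of how these directives are typically attached, using the Synopsys-style comment form as an assumption; other vendors spell the directive differently, so check the tool's documentation. The module onehot_mux and its ports are hypothetical. The casez items use "?" as don't-care bits, and the select is assumed to be one-hot so the items are mutually exclusive.

```verilog
module onehot_mux (sel, a, b, c, z);
  input  [2:0] sel;   // assumed one-hot: exactly one bit is set
  input        a, b, c;
  output       z;
  reg          z;

  always @(sel or a or b or c) begin
    // Directive comment (Synopsys-style spelling, an assumption here):
    // full_case tells synthesis the listed items cover every value that
    // occurs; parallel_case tells it the items never overlap.
    casez (sel) // synopsys full_case parallel_case
      3'b??1: z = a;
      3'b?1?: z = b;
      3'b1??: z = c;
    endcase
  end
endmodule
```

The directives are comments, so a simulator ignores them: in simulation z holds its value when sel is 3'b000, and overlapping items are resolved in priority order. If the one-hot assumption can be violated, simulation and synthesized hardware may disagree, so these directives are best used only where the assumption is guaranteed by design.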
Avoid Inefficient Code

[Figure: two adders (a0 + b0, a1 + b1) feeding a mux on sel, versus two muxes on sel feeding a single adder]

    // Two adders, then a mux on the results
    module poor_resource_sharing (a0, a1, b0, b1, sel, sum);
      input  a0, a1, b0, b1, sel;
      output sum;
      reg    sum;

      always @(a0 or a1 or b0 or b1 or sel) begin
        if (sel)
          sum = a1 + b1;
        else
          sum = a0 + b0;
      end
    endmodule

    // Mux the operands first, then one shared adder
    module good_resource_sharing (a0, a1, b0, b1, sel, sum);
      input  a0, a1, b0, b1, sel;
      output sum;
      reg    sum;
      reg    a_temp, b_temp;

      always @(a0 or a1 or b0 or b1 or sel) begin
        if (sel) begin
          a_temp = a1;
          b_temp = b1;
        end
        else begin
          a_temp = a0;
          b_temp = b0;
        end
        sum = a_temp + b_temp;
      end
    endmodule

Use 2 muxes rather than 2 adders to reduce resource usage.
Duplicate Registers to Reduce Fan-Out

    // Two copies of the enable register, each driving 12 loads
    module low_fanout (in, en, clk, out);
      input  [23:0] in;
      input         en, clk;
      output [23:0] out;
      reg    [23:0] out;
      reg           tri_en1, tri_en2;

      always @(posedge clk) begin
        tri_en1 <= en;
        tri_en2 <= en;
      end

      always @(tri_en1 or in) begin
        if (tri_en1)
          out[23:12] = in[23:12];
        else
          out[23:12] = 12'bZ;
      end

      always @(tri_en2 or in) begin
        if (tri_en2)
          out[11:0] = in[11:0];
        else
          out[11:0] = 12'bZ;
      end
    endmodule

    // One enable register driving all 24 loads
    module high_fanout (in, en, clk, out);
      input  [23:0] in;
      input         en, clk;
      output [23:0] out;
      reg    [23:0] out;
      reg           tri_en;

      always @(posedge clk)
        tri_en <= en;

      always @(tri_en or in) begin
        if (tri_en)
          out = in;
        else
          out = 24'bZ;
      end
    endmodule

[Figure: high_fanout, tri_en drives 24 loads; low_fanout, tri_en1 and tri_en2 drive 12 loads each]
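Design Partition: Register at Boundary

[Figure: adder followed by a register on sum at the module output, versus registers on a0 and a1 at the module input followed by the adder]

    // Register on the module output
    module reg_at_boundary (a0, a1, clk, sum);
      input  a0, a1, clk;
      output sum;
      reg    sum;

      always @(posedge clk) begin
        sum <= a0 + a1;
      end
    endmodule

    // Registers on the inputs, combinational logic drives the output
    module reg_in_module (a0, a1, clk, sum);
      input  a0, a1, clk;
      output sum;
      reg    sum;
      reg    a0_temp, a1_temp;

      always @(posedge clk) begin
        a0_temp <= a0;
        a1_temp <= a1;
      end

      always @(a0_temp or a1_temp) begin
        sum = a0_temp + a1_temp;
      end
    endmodule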
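Pipeline for Performance

[Figure: 1 cycle, multiplier and adder between one pair of register stages (a, b, c to out); 2 cycle, a register between the multiplier and the adder, with c delayed to match]

    // Multiply and add in the same clock cycle
    module no_pipeline (a, b, c, clk, out);
      input  a, b, c, clk;
      output out;
      reg    out;
      reg    a_temp, b_temp, c_temp;

      always @(posedge clk) begin
        out    <= (a_temp * b_temp) + c_temp;
        a_temp <= a;
        b_temp <= b;
        c_temp <= c;
      end
    endmodule

    // Multiply in one cycle, add in the next
    module pipeline (a, b, c, clk, out);
      input  a, b, c, clk;
      output out;
      reg    out;
      reg    a_temp, b_temp, c_temp1, c_temp2, mult_temp;

      always @(posedge clk) begin
        mult_temp <= a_temp * b_temp;
        a_temp    <= a;
        b_temp    <= b;
      end

      // c is delayed by two stages so it lines up with mult_temp
      always @(posedge clk) begin
        out     <= mult_temp + c_temp2;
        c_temp2 <= c_temp1;
        c_temp1 <= c;
      end
    endmodule

Pipeline to increase performance.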
Take Advantage of Virtex Hardware
- Use flip-flops and pipeline! FPGAs contain hordes of flip-flops.
- Virtex gives you 4 DLLs that can be used to synchronize clocks for superior system timing.
- Use the optimized cores from CoreGen to get high-performance, pipelined arithmetic and sophisticated functional blocks.
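RTL Flexibility for Register Configurations
- Register mapping for registers with sync/async set and reset
- Clocks, inverted clocks, and clock enable

Positive-edge-triggered flip-flop with clock enable, sync clear and preset:

    // preset is in the sensitivity list, so it acts asynchronously;
    // reset is sampled on the clock edge
    always @(posedge clk or posedge preset) begin
      if (preset)
        q <= 1;
      else if (reset)
        q <= 0;
      else if (CE)
        q <= data;
    end

[Figure: flip-flop with inputs data, clk, ce, reset, preset and output q]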
Timing-Driven Register IOB Mapping
- Technology mapping will not duplicate registers.
- The critical signal will not be absorbed in the IOB register.

    -- Out is the slide's signal name; "out" is a reserved word in VHDL,
    -- so rename it in real code.
    process (Clk)
    begin
      if (Clk'event and Clk = '1') then
        Tri_R <= Tri;
      end if;
    end process;

    process (Tri_R, Data_in)
    begin
      if (Tri_R = '1') then
        Out <= Data_in;
      else
        Out <= (others => 'Z');
      end if;
    end process;

[Figure: TRI registered by CLK into TRI_R, which enables the tri-state drivers from DATA[23:0] to OUT[23:0]; fanout = 24]
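Timing-Driven Register IOB Mapping
- Duplicate the register on the critical path for a fanout of 1.
- Mapping will absorb the register in the IOB.

    -- Out is the slide's signal name; "out" is a reserved word in VHDL,
    -- so rename it in real code.
    process (Clk)
    begin
      if (Clk'event and Clk = '1') then
        Tri_R1 <= Tri;
        Tri_R2 <= Tri;
      end if;
    end process;

    -- Critical bit: its enable register has a fanout of 1
    process (Tri_R1, Data_in)
    begin
      if (Tri_R1 = '1') then
        Out(23) <= Data_in(23);
      else
        Out(23) <= 'Z';
      end if;
    end process;

    -- Remaining 23 bits share the second enable register
    process (Tri_R2, Data_in)
    begin
      if (Tri_R2 = '1') then
        Out(22 downto 0) <= Data_in(22 downto 0);
      else
        Out(22 downto 0) <= (others => 'Z');
      end if;
    end process;

[Figure: TRI registered by CLK into TRI_R1, enabling DATA[23] to OUT[23], fanout = 1; TRI registered by CLK into TRI_R2, enabling DATA[22:0] to OUT[22:0], fanout = 23]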
Area-Efficient Muxes Using TBUFs
- Improve area efficiency by using tri-states.
- Each CLB has 2 TBUFs.

    // Tri-state version: four drivers share the bus, one enable active at a time
    assign Q[7:0] = E0 ? A[7:0] : 8'bz;
    assign Q[7:0] = E1 ? B[7:0] : 8'bz;
    assign Q[7:0] = E2 ? C[7:0] : 8'bz;
    assign Q[7:0] = E3 ? D[7:0] : 8'bz;

    // Mux version: one-hot select E[3:0]
    case (E)
      4'b0001 : Q[7:0] = A[7:0];
      4'b0010 : Q[7:0] = B[7:0];
      4'b0100 : Q[7:0] = C[7:0];
      4'b1000 : Q[7:0] = D[7:0];
    endcase

[Figure: mux of A[7:0], B[7:0], C[7:0], D[7:0] selected by E[3:0] onto Z[7:0], versus four tri-state buffers enabled by E0, E1, E2, E3 driving Z[7:0]]
TBUFs as Muxes: Performance Summary
- Improve area efficiency by using tri-states.
- But they are often slower than equivalent muxes under most circumstances: there is too much delay getting onto the TBUF.
- Each CLB has 2 TBUFs.
- PAR can connect tri-states on multiple horizontal long lines to build wide muxes.
Distributed RAM Inferencing: System Memory

    module ramtest (q, addr, d, we, clk);
      output [3:0] q;
      input  [3:0] d;
      input  [2:0] addr;
      input        we;
      input        clk;

      reg [3:0] mem [7:0];

      // Asynchronous read
      assign q = mem[addr];

      // Synchronous write
      always @(posedge clk) begin
        if (we)
          mem[addr] <= d;
      end
    endmodule

[Figure: Synplicity result (RAM 8x4) built from RAM16x1S primitives with pins A0-A3, D, WCLK, WE, Q; module ports Addr[2:0], D[3:0], clk, we, q[3:0]]

- Synplify and LeonardoSpectrum can infer distributed RAM.
- FPGA Express will support RAM inferencing in the future.
Registered IO Mapping: System Interfaces
- System timing: chip-to-chip performance limits system speeds.
- No need to instantiate IOB register cells.
- Implementation tools will pack registers in the IO:
  - map -pr b
    - b (both input and output)
    - i (input only)
    - o (output only)
  - IOB = TRUE attribute
- Mapping for data and enable ports

[Figure: output register (D, CE, CLK, S/R, Q) driving an OBUF, and an IBUF driving an input register (D, CE, CLK, S/R, Q)]
Instantiating Technology-Specific Features
- Block RAM: system memory
- CLKDLL: minimizes clock skew
- Special IOs: interfacing with standard buses
- LUTs for datapath pipelining: add latency with minimal area impact
LUTs for Datapath Pipelining
- A LUT can be used in place of registers to balance pipeline stages.
- Area-efficient implementation
- SRL16E can delay an input value up to 16 clock cycles
  - Sync up operands before the next operation

[Figure: datapath with blocks F, G, H on A[31:0], B[31:0], C[31:0] feeding Z; latencies labeled 8 cycles, 5 cycles, 1 cycle; SRL16E symbols with pins D, CE, CLK, A3-A0, Q. Figure labels: "LUTs replace 256 registers", "32 LUTs replace 416 registers"]
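A behavioral sketch of the kind of delay line an SRL16E implements: a reset-free shift register with a clock enable, used to hold an operand back by a fixed number of cycles. The module name delay_line and the WIDTH and DEPTH parameters are assumptions for illustration. Whether a particular synthesis tool maps this description onto SRL16E primitives, rather than flip-flops, is tool dependent and should be confirmed in its report; instantiating SRL16E directly is the alternative.

```verilog
module delay_line (clk, ce, d, q);
  parameter WIDTH = 32;  // assumed bus width
  parameter DEPTH = 8;   // assumed delay in clock cycles (1 to 16 fits one SRL16E per bit)
  input              clk, ce;
  input  [WIDTH-1:0] d;
  output [WIDTH-1:0] q;

  reg [WIDTH-1:0] sr [0:DEPTH-1];
  integer i;

  // No reset on purpose: the SRL16E has no reset input, so a reset here
  // would force the shift register into flip-flops.
  always @(posedge clk) begin
    if (ce) begin
      for (i = DEPTH-1; i > 0; i = i - 1)
        sr[i] <= sr[i-1];
      sr[0] <= d;
    end
  end

  // q shows the value that entered d DEPTH enabled clock edges earlier
  assign q = sr[DEPTH-1];
endmodule
```

For example, delaying a 32-bit operand by 8 cycles this way takes 256 flip-flops if built from registers, which is the register count the slide refers to.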
Block RAM: System Memory

Verilog instantiation:

    RAMB4_S1 U1 (.WE(WE), .EN(EN), .RST(RST), .CLK(CLK),
                 .ADDR(ADDR), .DI(DI), .DO(DO));

VHDL instantiation:

    component RAMB4_S1
      port (WE, EN, RST, CLK : in  STD_LOGIC;
            ADDR             : in  STD_LOGIC_VECTOR(11 downto 0);
            DO               : out STD_LOGIC;
            DI               : in  STD_LOGIC_VECTOR(0 downto 0));
    end component;

    begin
      U1: RAMB4_S1 port map (WE => WE, EN => EN, RST => RST, CLK => CLK,
                             DI => DI, ADDR => ADDR, DO => DO);

[Figure: RAMB4_S1 symbol with pins WE, EN, RST, CLK, ADDR, DI, DO connected to signals we, en, rst, clk, addr, di, do]

- Instantiate single- and dual-port RAM.
- Use CoreGen to build RAM and FIFO (Q1 '99).
Virtex CLKDLL

    wire clk_fb;
    BUFGDLL U4 (.I(clkin), .O(clk_fb));

[Figure: IBUFG feeding CLKDLL (inputs CLKIN, CLKFB, RST; outputs CLK0, CLK90, CLK180, CLK270, CLK2X, CLKDV, LOCKED) with a BUFG in the feedback path; BUFGDLL U4 with I = clkin, O = clk_fb]

- Minimizes clock-to-out pad delay.
- Removes all delay from the external GCLKPAD pin to the registers and RAM.
- BUFGDLL is available for instantiation.
- Other configurations can be built by instantiating the CLKDLL macro.
- The UCF is the only way to configure CLKDLL or BUFGDLL.
- In the future we would like to use generics (VHDL) and parameters (Verilog), but synthesizers don't pass them on yet.
Special IO Buffers: System Interfaces
- The default IO buffer is LVTTL (12 mA), available via inference.
- Process technology leads to mixed-voltage systems.
- High-performance, low-power signal standards are emerging.
- Instantiate IO buffers for:
  - non-default current drive
  - non-default voltage standard
  - non-default slew

    // Accelerated Graphics Port (AGP) bus interface (Pentium II graphics)
    OBUF_AGP  U0 (.I(awire), .O(oport));

    // Fast slew rate and 24 mA drive strength
    OBUF_F_24 U1 (.I(awire), .O(oport));

[Figure: buffers U0 and U1, each driving oport from awire]
Summary
- Efficient HDL coding allows designers to build high-performance designs.
- Designers should consider the underlying hardware as they code, to achieve the best results.
- Exploit the hardware's features for best performance.