George Mason University FPGA Devices & FPGA Design Flow ECE 448 Lecture 7
Required reading
- P. Chu, RTL Hardware Design using VHDL, Chapter 1, Introduction to Digital System Design
- Spartan-6 FPGA CLB User Guide: CLB Overview, Slice Description

Two competing implementation approaches
- ASIC (Application Specific Integrated Circuit): designed all the way from behavioral description to physical layout; designs must be sent for expensive and time-consuming fabrication in a semiconductor foundry
- FPGA (Field Programmable Gate Array): no physical layout design; design ends with a bitstream used to configure a device; bought off the shelf and reconfigured by designers themselves

Which Way to Go?
- ASICs: high performance, low power, low cost in high volumes
- FPGAs: off-the-shelf, low development cost, short time to market, reconfigurability

What is an FPGA?
- Configurable Logic Blocks
- I/O Blocks
- Block RAMs
Modern FPGA
- Logic resources
- Multipliers/DSP units
- RAM blocks
(#Logic resources, #Multipliers/DSP units, #RAM_blocks)
Graphics based on The Design Warrior’s Guide to FPGAs: Devices, Tools, and Flows. Copyright © 2004 Mentor Graphics Corp.

Major FPGA Vendors
SRAM-based FPGAs
- Xilinx, Inc.: ~51% of the market
- Altera Corp.: ~34% of the market
  (Xilinx and Altera together: ~85%)
- Lattice Semiconductor
- Atmel
Flash & antifuse FPGAs
- Actel Corp. (Microsemi SoC Products Group)
- QuickLogic Corp.

Xilinx
- Primary products: FPGAs and the associated CAD software (Programmable Logic Devices; ISE Alliance and Foundation Series Design Software)
- Main headquarters in San Jose, CA
- Fabless* semiconductor and software company; foundries: UMC (Taiwan), Seiko Epson (Japan), TSMC (Taiwan), Samsung (Korea)
  {*Xilinx acquired an equity stake in UMC in 1996}
Xilinx FPGA Families
Technology    Low-cost                  High-performance
220 nm                                  Virtex
180 nm        Spartan II, Spartan IIE
120/150 nm                              Virtex II, Virtex II Pro
90 nm         Spartan 3                 Virtex 4
65 nm                                   Virtex 5
45 nm         Spartan 6
40 nm                                   Virtex 6
28 nm         Artix 7                   Virtex 7

Altera FPGA Families
Technology    Low-cost       Mid-range    High-performance
130 nm        Cyclone                     Stratix
90 nm         Cyclone II                  Stratix II
65 nm         Cyclone III    Arria I      Stratix III
40 nm         Cyclone IV     Arria II     Stratix IV
28 nm         Cyclone V      Arria V      Stratix V
FPGA Family
Spartan 6 FPGA Family
CLB Structure
General structure of an FPGA
Xilinx Spartan 6 CLB
Row & Column Relationship Between CLBs & Slices
Three Different Types of Slices (50%, 25%)
SLICEX
SLICEL
Xilinx Multipurpose LUT (MLUT): 64 x 1 ROM (logic), 64 x 1 RAM, 32-bit SR
4-input LUT (Look-Up Table) in the Basic ROM Mode
- Look-up tables are the primary elements for logic implementation
- Each LUT can implement any function of 4 inputs
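The ROM view of a LUT can be sketched behaviorally in VHDL. This is only an illustration: the entity name `lut4_rom` and the generic `INIT` are hypothetical names chosen here, not vendor primitives. The 16-bit constant is the truth table, and the four inputs act as the ROM address.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical behavioral model of a 4-input LUT used as a 16 x 1 ROM.
entity lut4_rom is
  generic (
    -- Truth table: bit i is the output for input value i.
    -- x"6996" encodes the 4-input XOR (odd parity) function.
    INIT : std_logic_vector(15 downto 0) := x"6996"
  );
  port (
    a : in  std_logic_vector(3 downto 0);  -- LUT inputs, used as ROM address
    y : out std_logic                      -- selected truth-table bit
  );
end entity lut4_rom;

architecture behavioral of lut4_rom is
begin
  -- Any function of 4 inputs is just a different 16-bit INIT value.
  y <= INIT(to_integer(unsigned(a)));
end architecture behavioral;
```

Changing only `INIT` changes the implemented function, which is why one LUT can realize any function of its inputs.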
6-Input LUT of Spartan 6

Reset and Set Configurations
- No set or reset
- Synchronous set
- Synchronous reset
- Asynchronous set (preset)
- Asynchronous reset (clear)
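Two of these configurations can be described in VHDL as follows. This is a minimal sketch with hypothetical entity and port names; the set variants differ only in assigning '1' instead of '0'.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example: one flip-flop with synchronous reset,
-- one with asynchronous reset (clear).
entity ff_reset_styles is
  port (
    clk     : in  std_logic;
    rst     : in  std_logic;
    d       : in  std_logic;
    q_sync  : out std_logic;
    q_async : out std_logic
  );
end entity ff_reset_styles;

architecture rtl of ff_reset_styles is
begin
  -- Synchronous reset: rst is sampled only on the rising clock edge,
  -- so it does not appear in the sensitivity list.
  sync_reset : process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        q_sync <= '0';
      else
        q_sync <= d;
      end if;
    end if;
  end process sync_reset;

  -- Asynchronous reset: rst takes effect immediately,
  -- independent of the clock.
  async_reset : process (clk, rst)
  begin
    if rst = '1' then
      q_async <= '0';
    elsif rising_edge(clk) then
      q_async <= d;
    end if;
  end process async_reset;
end architecture rtl;
```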
MLUT as a 32-bit Shift Register (SRL32)
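A shift register of this kind is usually described behaviorally and left to the synthesis tool to map. The sketch below uses hypothetical names; it assumes, as is commonly the case, that the tool maps such a description to LUT-based shift registers only when the register has no reset.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical 32-bit shift register with an addressable output tap.
entity shift_reg32 is
  port (
    clk  : in  std_logic;
    ce   : in  std_logic;                     -- shift enable
    d    : in  std_logic;                     -- serial input
    addr : in  std_logic_vector(4 downto 0);  -- selects which tap is read
    q    : out std_logic
  );
end entity shift_reg32;

architecture rtl of shift_reg32 is
  signal sr : std_logic_vector(31 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if ce = '1' then
        sr <= sr(30 downto 0) & d;  -- shift by one position, no reset
      end if;
    end if;
  end process;

  -- addr = 0 reads the most recently shifted-in bit, addr = 31 the oldest.
  q <= sr(to_integer(unsigned(addr)));
end architecture rtl;
```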
Fast Carry Logic
- Each CLB contains separate logic and routing for the fast generation of sum & carry signals
- Increases efficiency and performance of adders, subtractors, accumulators, comparators, and counters
- Carry logic is independent of normal logic and routing resources
[Figure: carry logic routing running from LSB to MSB]

Accessing Carry Logic
All major synthesis tools can infer carry logic for arithmetic functions:
- Addition (SUM <= A + B)
- Subtraction (DIFF <= A - B)
- Comparators (if A < B then…)
- Counters (count <= count + 1)
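The four cases can be written as ordinary arithmetic in VHDL. The sketch below is an assumption-laden illustration: the entity name, port names, and the use of `ieee.numeric_std` (rather than `std_logic_unsigned`) are choices made here, and whether carry logic is actually used is decided by the synthesis tool.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity collecting the arithmetic forms listed on the slide.
entity arith_examples is
  generic (N : positive := 8);
  port (
    clk    : in  std_logic;
    rst    : in  std_logic;
    a, b   : in  std_logic_vector(N-1 downto 0);
    sum    : out std_logic_vector(N-1 downto 0);
    diff   : out std_logic_vector(N-1 downto 0);
    a_lt_b : out std_logic;
    count  : out std_logic_vector(N-1 downto 0)
  );
end entity arith_examples;

architecture rtl of arith_examples is
  signal count_r : unsigned(N-1 downto 0) := (others => '0');
begin
  -- Addition and subtraction (results wrap modulo 2**N).
  sum  <= std_logic_vector(unsigned(a) + unsigned(b));
  diff <= std_logic_vector(unsigned(a) - unsigned(b));

  -- Unsigned comparator.
  a_lt_b <= '1' when unsigned(a) < unsigned(b) else '0';

  -- Counter with synchronous reset.
  process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        count_r <= (others => '0');
      else
        count_r <= count_r + 1;
      end if;
    end if;
  end process;

  count <= std_logic_vector(count_r);
end architecture rtl;
```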
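Full-adder (FA)
Inputs: x, y, c_in; outputs: c_out, s
x + y + c_in = (c_out s), where (c_out s) is the two-bit result with c_out as the more significant bit

Carry & Control Logic in Xilinx FPGAs
Propagate = x xor y
Generate = y
Sum = Propagate xor CIN = x xor y xor CIN

Carry & Control Logic in Spartan 6 FPGAs
[Figure: inputs x and y feed a LUT, followed by hardwired (fast) carry logic]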
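Written per bit position i, the propagate/generate scheme can be summarized as below. The indexed symbols p_i, g_i, c_i are notation introduced here for illustration; the carry-out expression is the usual multiplexer form, in which the carry-in passes through when propagate is 1 and the generate value is selected otherwise.

```latex
\begin{align*}
p_i &= x_i \oplus y_i, \qquad g_i = y_i \\
s_i &= p_i \oplus c_i = x_i \oplus y_i \oplus c_i \\
c_{i+1} &= p_i\, c_i + \overline{p_i}\, g_i
\end{align*}
```

When p_i = 0 the two inputs are equal, so selecting y_i gives the correct carry-out; when p_i = 1 exactly one input is 1 and the carry-out equals the carry-in.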
Examples: Determine the amount of Spartan 6 resources needed to implement a given circuit
Circuit 1: Top level
Circuit 1: F – function [Figure: block diagram built from a <<<3 block, a 2-to-4 decoder with enable, and a full adder]
Circuit 2: Top level [Figure: registers R0 to R15, function F, inputs clk and run]
Circuit 2: F – function [Figure: block diagram built from a >>2 block, a priority encoder, and a half adder]
Circuit 3: Top level
Input/Output Blocks (IOBs)

Basic I/O Block Structure
[Figure: three flip-flops (D, EC, Q, SR), one each on the three-state control path, the output path, and the input path; signals: Three-State, Output, Clock, Set/Reset, Direct Input, Registered Input, FF Enable]

IOB Functionality
- IOB provides the interface between the package pins and CLBs
- Each IOB can work as a uni- or bi-directional I/O
- Outputs can be forced into high impedance
- Inputs and outputs can be registered (advised for high-performance I/O)
- Inputs can be delayed
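A bi-directional pin with registered input, output, and three-state control can be described as in the following sketch. All names are hypothetical, and whether the registers end up inside the IOB depends on tool settings and constraints not shown here.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical bi-directional I/O with registers on all three paths.
entity bidir_io is
  port (
    clk      : in    std_logic;
    oe       : in    std_logic;  -- output enable ('1' drives the pad)
    data_out : in    std_logic;  -- value to drive onto the pad
    data_in  : out   std_logic;  -- registered value read from the pad
    pad      : inout std_logic   -- package pin
  );
end entity bidir_io;

architecture rtl of bidir_io is
  signal out_reg : std_logic := '0';
  signal oe_reg  : std_logic := '0';
begin
  process (clk)
  begin
    if rising_edge(clk) then
      out_reg <= data_out;  -- output path register
      oe_reg  <= oe;        -- three-state control register
      data_in <= pad;       -- input path register
    end if;
  end process;

  -- 'Z' forces the output into high impedance.
  pad <= out_reg when oe_reg = '1' else 'Z';
end architecture rtl;
```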
Clock Management
A simple clock tree
Clock Manager
Jitter
Removing Jitter
Frequency Synthesis
Phase shifting (Figure 4-20)
Clock Management Tiles
- DCM: Digital Clock Manager
- PLL: Phase Locked Loop

Spartan-6 Family Attributes
Spartan-6 FPGA Family Members

FPGA device present on the Digilent Nexys 3 board: XC6SLX16-CSG324C
- XC6S: Spartan 6 family
- LX: Logic Optimized
- 16: Size
- CSG: Package type (Ball Chip-Scale)
- 324: 324 pins
- C: Commercial temperature range, 0° C – 85° C
FPGA Design Flow

FPGA Design process (1)
1. Specification / Pseudocode
   Design and implement a simple unit that speeds up encryption with an RC5-like cipher with a fixed key set on an 8031 microcontroller. Unlike in Experiment 5, this time your unit has to be able to perform the encryption algorithm by itself, executing 32 rounds…..
2. On-paper hardware design (Block diagram & ASM chart)
3. VHDL description (Your Source Files)

   library ieee;
   use ieee.std_logic_1164.all;
   use ieee.std_logic_unsigned.all;

   entity RC5_core is
     port(
       clock, reset, encr_decr : in  std_logic;
       data_input              : in  std_logic_vector(31 downto 0);
       data_output             : out std_logic_vector(31 downto 0);
       out_full                : in  std_logic;
       key_input               : in  std_logic_vector(31 downto 0);
       key_read                : out std_logic  -- no semicolon after the last port
     );
   end RC5_core;  -- closing name must match the entity name

4. Functional simulation
5. Synthesis
6. Post-synthesis simulation

FPGA Design process (2)
7. Implementation, followed by timing simulation
8. Configuration
9. On chip testing
Tools used in FPGA Design Flow
- Design: functionally verified VHDL code
- Synthesis (Xilinx XST or Synplify Premier): VHDL code -> netlist
- Implementation (Xilinx ISE): netlist -> bitstream

Synthesis

Synthesis Tools
- Xilinx XST
- Synplify Premier
- … and others
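Logic Synthesis: VHDL description -> Circuit netlist

architecture MLU_DATAFLOW of MLU is
  signal A1 : STD_LOGIC;
  signal B1 : STD_LOGIC;
  signal Y1 : STD_LOGIC;
  signal MUX_0, MUX_1, MUX_2, MUX_3 : STD_LOGIC;
  -- Named selector: "with (L1 & L0) select" can be rejected as
  -- ambiguous by tools that follow pre-2008 VHDL strictly.
  signal SEL : STD_LOGIC_VECTOR(1 downto 0);
begin
  -- Optional inversion of the inputs and of the output
  A1 <= A  when (NEG_A = '0') else not A;
  B1 <= B  when (NEG_B = '0') else not B;
  Y  <= Y1 when (NEG_Y = '0') else not Y1;

  -- Candidate logic functions
  MUX_0 <= A1 and B1;
  MUX_1 <= A1 or B1;
  MUX_2 <= A1 xor B1;
  MUX_3 <= A1 xnor B1;

  -- 4-to-1 multiplexer selecting the function
  SEL <= L1 & L0;
  with SEL select
    Y1 <= MUX_0 when "00",
          MUX_1 when "01",
          MUX_2 when "10",
          MUX_3 when others;
end MLU_DATAFLOW;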
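The slide shows only the architecture. For it to compile, a matching entity has to be analyzed first; a plausible declaration, inferred here from the signal names the architecture uses (the port grouping and order are assumptions), would be:

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Assumed entity for the MLU architecture; port names are taken from
-- the architecture body, port order is a guess.
entity MLU is
  port (
    NEG_A : in  STD_LOGIC;  -- '1' inverts input A
    NEG_B : in  STD_LOGIC;  -- '1' inverts input B
    NEG_Y : in  STD_LOGIC;  -- '1' inverts output Y
    A     : in  STD_LOGIC;
    B     : in  STD_LOGIC;
    L1    : in  STD_LOGIC;  -- function select, MSB
    L0    : in  STD_LOGIC;  -- function select, LSB
    Y     : out STD_LOGIC
  );
end MLU;
```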
Circuit netlist (RTL view)

Mapping
[Figure: netlist mapped onto LUT0, LUT1, LUT2, LUT3, LUT4, LUT5, FF1, FF2]

Xilinx XST Inputs/Outputs

Xilinx XST Inputs
- RTL VHDL and/or Verilog files
- Constraints (XCF): Xilinx constraints file in which you can specify synthesis, timing, and specific implementation constraints that can be propagated to the NGC file.
- Core files: these files can be in either NGC or EDIF format. XST does not modify cores. It uses them to inform area and timing optimization surrounding the cores.

Xilinx XST Outputs
- NGC: netlist file with constraint information.
- NGR: a schematic representation of the pre-optimized design shown at the Register Transfer Level (RTL). This representation is in terms of generic symbols, such as adders, multipliers, counters, AND gates, and OR gates, and is generated after the HDL synthesis phase of the synthesis process.
- LOG: this report contains the results from the synthesis run, including area and timing estimation.
RTL view in Synplify Premier
General logic structures (for example a MUX, an incrementer, or a comparator) can be recognized in the RTL view.

Crossprobing between RTL view and code
- Each port, net or block can be chosen by mouse click from the browser or directly from the RTL View.
- By double-clicking on an element, its source code can be seen.
- Reverse crossprobing is also possible: if a section of code is marked, the corresponding element of the RTL View is marked too.

Technology View in Synplify Pro
- Technology view is a mapped RTL view. It can be opened by pressing its toolbar button or by double-clicking the ".srm" file.
- As in the "RTL View", the same buttons can be used here.
- Two additional buttons are enabled:
  - show critical path
  - open timing analyst
- Technology view is presented using device primitives.
- Ports, nets and blocks browser.
- Pay attention: the technology view is usually large and presented on a number of sheets.

Viewing critical path
- The critical path can be viewed by pressing the show critical path button.
- Delay values are written near each component of the path.
Implementation

Implementation
After synthesis the entire implementation process is performed by FPGA vendor tools.

Implementation

Translation
- Inputs: circuit netlist (from synthesis); timing constraints in a UCF (User Constraint File), written with the Constraint Editor or a text editor
- Output: NGD (Native Generic Database) file

Mapping
[Figure: netlist mapped onto LUT0, LUT1, LUT2, LUT3, LUT4, LUT5, FF1, FF2]

Placing
[Figure: CLB slices placed on the FPGA]

Routing
[Figure: programmable connections on the FPGA]

Configuration
- Once a design is implemented, you must create a file that the FPGA can understand.
- This file is called a bitstream: a BIT file (.bit extension).
- The BIT file can be downloaded directly to the FPGA, or can be converted into a PROM file which stores the programming information.

Two main stages of the FPGA Design Flow
Synthesis
  RTL Synthesis (technology independent)
  - Code analysis
  - Derivation of main logic constructions
  - Technology independent optimization
  - Creation of “RTL View”
  Map (technology dependent)
  - Mapping of extracted logic structures to device primitives
  - Technology dependent optimization
  - Application of “synthesis constraints”
  - Netlist generation
  - Creation of “Technology View”
Implementation (technology dependent)
  Place & Route
  - Placement of generated netlist onto the device
  - Choosing best interconnect structure for the placed design
  - Application of “physical constraints”
  Configure
  - Bitstream generation
  - Burning device
Synthesis Report Example – Resource Utilization (1)
Device utilization summary:
Selected Device : 6slx4tqg144-3
Slice Logic Utilization:
  Number of Slice Registers:              53 out of %
  Number of Slice LUTs:                  163 out of %
    Number used as Logic:                163 out of %
Slice Logic Distribution:
  Number of LUT Flip Flop pairs used:    198
    Number with an unused Flip Flop:     145 out of %
    Number with an unused LUT:            35 out of %
    Number of fully used LUT-FF pairs:    18 out of 198   9%
  Number of unique control sets:           7

Synthesis Report Example – Resource Utilization (2)
IO Utilization:
  Number of IOs:                          43
  Number of bonded IOBs:                  43 out of %
Specific Feature Utilization:
  Number of BUFG/BUFGCTRLs:                1 out of 16    6%
  Number of DSP48A1s:                      5 out of 8    62%

Synthesis Report Example – Timing
Timing Summary:
Speed Grade: -3
  Minimum period: 6.031ns (Maximum Frequency: MHz)
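The maximum frequency in a timing summary is the reciprocal of the minimum period. As a worked instance using the period reported above (the numeric result is computed here and rounded, not copied from the report):

```latex
\[
f_{\max} = \frac{1}{T_{\min}} = \frac{1}{6.031\ \text{ns}} \approx 165.81\ \text{MHz}
\]
```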
Map Report Example – Resource Utilization (1)
Design Summary
Slice Logic Utilization:
  Number of Slice Registers:                  54 out of 4,800   1%
    Number used as Flip Flops:                53
    Number used as Latches:                    0
    Number used as Latch-thrus:                0
    Number used as AND/OR logics:              1
  Number of Slice LUTs:                      149 out of 2,400   6%
    Number used as logic:                    148 out of 2,400   6%
      Number using O6 output only:           133
      Number using O5 output only:             0
      Number using O5 and O6:                 15
      Number used as ROM:                      0
    Number used as Memory:                     0 out of 1,200   0%
    Number used exclusively as route-thrus:    1

Map Report Example – Resource Utilization (2)
Slice Logic Distribution:
  Number of occupied Slices:                  58 out of 600     9%
  Number of MUXCYs used:                      32 out of 1,200   2%
  Number of LUT Flip Flop pairs used:        162
    Number with an unused Flip Flop:         109 out of %
    Number with an unused LUT:                13 out of 162     8%
    Number of fully used LUT-FF pairs:        40 out of %
  Number of unique control sets:               7
  Number of slice register sites lost
    to control set restrictions:              35 out of 4,800   1%
IO Utilization:
  Number of bonded IOBs:                      43 out of %

Map Report Example – Resource Utilization (3)
Specific Feature Utilization:
  Number of RAMB16BWERs:                       0 out of 12      0%
  Number of RAMB8BWERs:                        0 out of 24      0%
  …….
  Number of DSP48A1s:                          5 out of 8      62%
  …….
Post-PAR Static Timing Report
Clock to Setup on destination clock clk_i
               | Src:Rise| Src:Fall| Src:Rise| Src:Fall|
Source Clock   |Dest:Rise|Dest:Rise|Dest:Fall|Dest:Fall|
clk_i          |    7.530|         |         |         |

PAR Report
Constraint                          | Check | Worst Case | Best Case  | Timing | Timing
                                    |       | Slack      | Achievable | Errors | Score
Autotimespec constraint for clock   | SETUP | N/A        | 7.530ns    | N/A    | 0
net clk_i_BUFGP                     | HOLD  | 0.457ns    |            | 0      |
Timing Report (1)
Timing constraint: Default period analysis for net "clk_i_BUFGP"
  3354 paths analyzed, 309 endpoints analyzed, 0 failing endpoints
  0 timing errors detected. (0 setup errors, 0 hold errors)
  Minimum period is 7.530ns
Delay (setup path):    7.530ns (data path - clock path skew + uncertainty)
  Source:              a_register/q_o_4 (FF)
  Destination:         x_reg_inst/q_o_3 (FF)
  Data Path Delay:     7.453ns (Levels of Logic = 2)
  Clock Path Skew:     ns ( )
  Source Clock:        clk_i_BUFGP rising
  Destination Clock:   clk_i_BUFGP rising
  Clock Uncertainty:   0.035ns

Timing Report (2)
Maximum Data Path at Slow Process Corner: a_register/q_o_4 to x_reg_inst/q_o_3
  Location           Delay type        Delay(ns)  Physical Resource / Logical Resource(s)
  SLICE_X4Y36.AQ     Tcko                         a_register/q_o / a_register/q_o_4
  DSP48_X0Y3.B4      net (fanout=21)              a_register/q_o
  DSP48_X0Y3.M3      Tdspdo_B_M                   Mmult_mult_unsigned / Mmult_mult_unsigned
  SLICE_X8Y39.C4     net (fanout=1)               mult_unsigned
  SLICE_X8Y39.CLK    Tas                          x_reg_inst/q_o / Mmux_x_57, Mmux_x_4_f7_2, Mmux_x_2_f8_2, x_reg_inst/q_o_
  Total              7.453ns (4.209ns logic, 3.244ns route) (56.5% logic, 43.5% route)

Timing Report (3)
Delay (setup path):    7.484ns (data path - clock path skew + uncertainty)
  Source:              a_register/q_o_7_1 (FF)
  Destination:         x_reg_inst/q_o_3 (FF)
  Data Path Delay:     7.391ns (Levels of Logic = 2)
  Clock Path Skew:     ns ( )
  Source Clock:        clk_i_BUFGP rising
  Destination Clock:   clk_i_BUFGP rising
  Clock Uncertainty:   0.035ns

  Clock Uncertainty:           0.035ns ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
    Total System Jitter (TSJ): 0.070ns
    Total Input Jitter (TIJ):  0.000ns
    Discrete Jitter (DJ):      0.000ns
    Phase Error (PE):          0.000ns
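The clock uncertainty formula printed in the report can be checked by substituting the jitter values listed with it:

```latex
\[
\text{Clock Uncertainty}
  = \frac{\sqrt{TSJ^2 + TIJ^2} + DJ}{2} + PE
  = \frac{\sqrt{0.070^2 + 0.000^2} + 0.000}{2} + 0.000
  = 0.035\ \text{ns}
\]
```

With no input jitter, discrete jitter, or phase error, the uncertainty is simply half of the total system jitter.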
Timing Report (4)
Maximum Data Path at Slow Process Corner: a_register/q_o_7_1 to x_reg_inst/q_o_3
  Location           Delay type        Delay(ns)  Physical Resource / Logical Resource(s)
  SLICE_X2Y33.AQ     Tcko                         a_register/q_o_7_2 / a_register/q_o_7_1
  DSP48_X0Y3.B7      net (fanout=13)              a_register/q_o_7_1
  DSP48_X0Y3.M3      Tdspdo_B_M                   Mmult_mult_unsigned / Mmult_mult_unsigned
  SLICE_X8Y39.C4     net (fanout=1)               mult_unsigned
  SLICE_X8Y39.CLK    Tas                          x_reg_inst/q_o / Mmux_x_57, Mmux_x_4_f7_2, Mmux_x_2_f8_2, x_reg_inst/q_o_
  Total              7.391ns (4.209ns logic, 3.182ns route) (56.9% logic, 43.1% route)
Xilinx FPGA Memories
Recommended reading
- Spartan-6 FPGA Block RAM Resources: User Guide (Google search: UG383)
- Spartan-6 FPGA Configurable Logic Block: User Guide (Google search: UG384)
- Xilinx FPGA Embedded Memory Advantages: White Paper (Google search: WP360)
- ISE In-Depth Tutorial, Section: Creating a CORE Generator Tool Module (Google search: ISE In-Depth Tutorial)

Memory Types

Memory Types
- Memory: RAM or ROM
- Memory: single port or dual port
- Memory: with asynchronous read or with synchronous read

Memory Types specific to Xilinx FPGAs
- Memory: distributed (MLUT-based) or block RAM-based (BRAM-based)
- Memory: inferred or instantiated (instantiated manually or using CORE Generator)
CORE Generator
FPGA Distributed Memory
Location of Distributed RAM: logic resources (CLB slices) [Figure labels: Logic resources, DSP units, RAM blocks; (#Logic resources, #Multipliers/DSP units, #RAM_blocks)]
Three Different Types of Slices (50%, 25%)
Spartan-6 Multipurpose LUT (MLUT): 64 x 1 ROM (logic), 64 x 1 RAM, 32-bit SR
Single-port 64 x 1-bit RAM
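A memory of this shape can be inferred from a behavioral description. The sketch below uses hypothetical names and assumes the usual inference pattern: a synchronous write combined with an asynchronous read is what typically steers synthesis toward LUT-based (distributed) RAM rather than block RAM.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical inferred single-port 64 x 1-bit RAM.
entity ram64x1s_infer is
  port (
    clk  : in  std_logic;
    we   : in  std_logic;                     -- write enable
    addr : in  std_logic_vector(5 downto 0);  -- 64 locations
    d    : in  std_logic;                     -- write data
    o    : out std_logic                      -- read data
  );
end entity ram64x1s_infer;

architecture rtl of ram64x1s_infer is
  type ram_t is array (0 to 63) of std_logic;
  signal ram : ram_t := (others => '0');
begin
  -- Synchronous write.
  process (clk)
  begin
    if rising_edge(clk) then
      if we = '1' then
        ram(to_integer(unsigned(addr))) <= d;
      end if;
    end if;
  end process;

  -- Asynchronous read: the output follows the address without a clock.
  o <= ram(to_integer(unsigned(addr)));
end architecture rtl;
```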
Memories Built of Neighboring MLUTs
Memories built of 2 MLUTs:
- Single-port 128 x 1-bit RAM: RAM128x1S
- Dual-port 64 x 1-bit RAM: RAM64x1D
Memories built of 4 MLUTs:
- Single-port 256 x 1-bit RAM: RAM256x1S
- Dual-port 128 x 1-bit RAM: RAM128x1D
- Quad-port 64 x 1-bit RAM: RAM64x1Q
- Simple-dual-port 64 x 3-bit RAM: RAM64x3SDP (one address for read, one address for write)

Dual-port 64 x 1 RAM
- Dual-port 64 x 1-bit RAM: 64x1D
- Single-port 128 x 1-bit RAM: 128x1S
ECE 448 – FPGA and ASIC Design with VHDL
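A dual-port distributed RAM adds a second, read-only address. The following is a minimal inference sketch under the same assumptions as for single-port distributed RAM (synchronous write, asynchronous reads); the entity name is hypothetical and the port names are chosen here for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical inferred dual-port 64 x 1-bit RAM:
-- one read/write port and one read-only port.
entity ram64x1d_infer is
  port (
    clk  : in  std_logic;
    we   : in  std_logic;                     -- write enable
    a    : in  std_logic_vector(5 downto 0);  -- read/write address
    dpra : in  std_logic_vector(5 downto 0);  -- read-only address
    d    : in  std_logic;                     -- write data
    spo  : out std_logic;                     -- data at address a
    dpo  : out std_logic                      -- data at address dpra
  );
end entity ram64x1d_infer;

architecture rtl of ram64x1d_infer is
  type ram_t is array (0 to 63) of std_logic;
  signal ram : ram_t := (others => '0');
begin
  -- Writes use only the first address.
  process (clk)
  begin
    if rising_edge(clk) then
      if we = '1' then
        ram(to_integer(unsigned(a))) <= d;
      end if;
    end if;
  end process;

  -- Two independent asynchronous reads.
  spo <= ram(to_integer(unsigned(a)));
  dpo <= ram(to_integer(unsigned(dpra)));
end architecture rtl;
```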
Total Size of Distributed RAM
FPGA Block RAM
Location of Block RAMs: RAM blocks [Figure labels: Logic resources (CLB slices), DSP units, RAM blocks; (#Logic resources, #Multipliers/DSP units, #RAM_blocks)]
Spartan-6 Block RAM Amounts
Block RAM can have various configurations (port aspect ratios): 16k x 1, 8k x 2, 4k x 4, 2k x (8+1), 1024 x (16+2)
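The listed aspect ratios all describe the same storage seen through ports of different widths. A quick capacity check, reading "+1" and "+2" as extra parity bits per word (an interpretation assumed here):

```latex
\begin{align*}
16\text{k} \times 1 &= 16384 \times 1 = 16384 \text{ bits} \\
8\text{k} \times 2 &= 8192 \times 2 = 16384 \text{ bits} \\
4\text{k} \times 4 &= 4096 \times 4 = 16384 \text{ bits} \\
2\text{k} \times (8+1) &= 2048 \times 9 = 18432 \text{ bits} \\
1024 \times (16+2) &= 1024 \times 18 = 18432 \text{ bits}
\end{align*}
```

The narrow configurations expose 16 Kbit, while the two widest ones expose the full 18 Kbit because the additional bits become addressable.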
Block RAM Port Aspect Ratios
Block RAM Interface
Block RAM Ports
Block RAM with synchronous read in Read-First Mode [Figure labels: CE, EN]
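Read-first behavior can be expressed in VHDL by reading the memory signal in the same clocked process as the write. In this sketch the entity, generic, and port names are hypothetical, and the 1024 x 18 default size is only an example; synthesis tools commonly map this pattern to block RAM, but that depends on the tool and its settings.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical inferred RAM with synchronous read in read-first mode.
entity bram_read_first is
  generic (
    ADDR_WIDTH : positive := 10;  -- 1024 words
    DATA_WIDTH : positive := 18   -- e.g. 16 data bits + 2 parity bits
  );
  port (
    clk  : in  std_logic;
    en   : in  std_logic;  -- port enable
    we   : in  std_logic;  -- write enable
    addr : in  std_logic_vector(ADDR_WIDTH-1 downto 0);
    din  : in  std_logic_vector(DATA_WIDTH-1 downto 0);
    dout : out std_logic_vector(DATA_WIDTH-1 downto 0)
  );
end entity bram_read_first;

architecture rtl of bram_read_first is
  type ram_t is array (0 to 2**ADDR_WIDTH - 1)
    of std_logic_vector(DATA_WIDTH-1 downto 0);
  signal ram : ram_t := (others => (others => '0'));
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if en = '1' then
        if we = '1' then
          ram(to_integer(unsigned(addr))) <= din;
        end if;
        -- The signal assignment above takes effect after this process
        -- suspends, so this read returns the OLD contents of the
        -- addressed word: read-first.
        dout <= ram(to_integer(unsigned(addr)));
      end if;
    end if;
  end process;
end architecture rtl;
```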
Features of Block RAMs in Spartan-6 FPGAs