Yu Hu1, Satyaki Das2, Steve Trimberger2, and Lei He1


1 Design, Synthesis and Evaluation of Heterogeneous FPGA with Mixed LUTs and Macro-Gates
Yu Hu1, Satyaki Das2, Steve Trimberger2, and Lei He1 1. Electrical Engineering Dept., UCLA 2. Research Labs, Xilinx Inc. Presented by Yu Hu Address comments to

2 Outline Introduction Design of the Macro-gates
Synthesis for the Proposed FPGA Architecture Comparison of Heterogeneous FPGA Architectures Conclusions and Future Work Below, I’ll first give the design methodology of the macro-gate, and then describe our synthesis flow for the proposed new architecture, followed by the experimental results and conclusions.

3 Heterogeneity in FPGA Architectures
Heterogeneity among SLICEs Programmable logic and routing Tiles are not identical soft logic fabric [Kaviani, FPGA’96] hard structures [Jamieson, FPL’05] Dedicated hard structures e.g. DSP e.g. memory block Heterogeneity within a SLICE Tiles (SLICEs) are identical Different logics exist within a SLICE e.g. LUTs with different size [Cong, FPGA’99] e.g. mixed PLAs and LUTs [Cong, TODAES’05] e.g. mixed macro-gates and LUTs (source: As we know, most modern FPGA architectures are heterogeneous. There are two categories of heterogeneity. One is heterogeneity among SLICEs, where a logic block can be a DSP block, a memory block or LUTs. The other is heterogeneity within a SLICE, where every SLICE shares the same structure but contains different embedded elements, for example LUTs with different sizes, mixed PLAs and LUTs, or mixed macro-gates and LUTs. In this paper, we study the heterogeneous FPGA with mixed macro-gates and LUTs.

4 Heterogeneous FPGA with Macro-Gates
There is a programmability and cost trade-off between LUTs and macro-gates Xilinx V4 benefits from small gates (MUX2, XOR2) built into SLICEs. The benefit of wider macro-gates The effectiveness of incorporating wider logic functions (macro-gates) is not clear. Our contributions Design a new FPGA architecture with mixed LUTs and macro-gates Propose a new automatic synthesis flow for mapping a circuit to the proposed FPGA architecture Evaluate the architecture and show that the proposed architecture reduces delay and area by 16.5% and 30%, respectively, compared to the LUT-only architecture. A macro-gate is an embedded logic cell with fixed functions. Since there is a programmability and cost trade-off between LUTs and macro-gates, industrial FPGAs have been using small macro-gates such as MUX2 and XOR2 to reduce delay and increase logic density. However, it is not clear whether it is beneficial to integrate macro-gates with wider logic functions into FPGAs. To answer this question, this paper proposes a new FPGA architecture with mixed LUTs and macro-gates and provides a set of synthesis tools to make full use of the macro-gates.

5 Outline Introduction Design of the Macro-gates
Synthesis for the Proposed FPGA Architecture Comparison of Heterogeneous FPGA Architectures Conclusions and Future Work Below, I’ll first give the design methodology of the macro-gate, and then describe our synthesis flow for the proposed new architecture, followed by the experimental results and conclusions.

6 Overview of Macro-Gate Design
Key problem Select the logic functions for the macro-gate Problem formulation: Input: a set of training circuits, which have been mapped to K-input LUTs Output: N K-input Boolean functions: f1 , … , fN Objective: Maximize the number of logic functions (in the training circuit set) which can be implemented by f1 , … , fN The proposed solution Ranking of the logic functions for a set of training circuits The key problem in the macro-gate design is to select the logic functions in the macro-gate. This problem can be formulated as follows: given a set of training circuits, which have been mapped to K-input LUTs, we need to find N K-input Boolean functions which maximize the number of logic functions that can be implemented by them. In the following slides, I’ll present our methodology for ranking logic functions for the macro-gate design.

7 NPN-Class Diagram: Organization of Logics
Canonical and efficient representation of all NPN classes NPN-Equivalent: functional equivalence under input negation, input permutation or output negation E.g., f(a,b,c)=a+bc, g(a,b,c)=b’a+b’c NPN-Cofactor relationship is indicated DAG: easy to manipulate It becomes impractical to compute for functions with more than 6 inputs! Solution: Utilization NPN-Class Diagram [Figure: NPN-class diagram with levels Level0: constant, Level1: 1-input, Level2: 2-input, Level3: 3-input, and wider inputs] Before describing how we rank the logic functions, we first describe the data structure that we use to organize the logic functions, namely the NPN-class diagram. As we know, NPN-equivalence gives a canonical representation of logic functions. Two logic functions are NPN-equivalent if they are functionally equivalent under input negation, input permutation or output negation. For example, the two Boolean functions f and g above are NPN-equivalent, which means that they can be implemented by the same logic gate with input/output negations. The NPN-class diagram organizes all NPN classes in a DAG. For example, there is one NPN-class for the constant function, one for 1-input functions, 2 for 2-input functions and 10 for 3-input functions. Each edge indicates that the upper-level function can implement the lower-level function. Unfortunately, it becomes impractical to compute all NPN-classes for wider functions, and the solution is to use the utilization NPN-class diagram.
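The NPN-equivalence check described above can be sketched by brute force for small input counts. This is a minimal illustration, not the authors' implementation: the helper names (`truth_table`, `transform`, `npn_canon`, `depends_on_all`) are hypothetical, and the canonical form is assumed to be the numerically smallest truth table over all input permutations, input negations and output negations.

```python
from itertools import permutations

def truth_table(fn, n):
    """Pack fn's outputs into an int: bit i holds fn applied to the bits of i."""
    tt = 0
    for i in range(1 << n):
        args = [(i >> j) & 1 for j in range(n)]
        tt |= (fn(*args) & 1) << i
    return tt

def transform(tt, n, perm, in_neg, out_neg):
    """Apply one input permutation, input-negation mask and output negation."""
    out = 0
    for i in range(1 << n):
        src = 0
        for j in range(n):
            bit = ((i >> j) & 1) ^ ((in_neg >> j) & 1)
            src |= bit << perm[j]
        out |= (((tt >> src) & 1) ^ out_neg) << i
    return out

def npn_canon(tt, n):
    """Smallest truth table in the NPN orbit, used as the class representative."""
    return min(
        transform(tt, n, perm, in_neg, out_neg)
        for perm in permutations(range(n))
        for in_neg in range(1 << n)
        for out_neg in (0, 1)
    )

def depends_on_all(tt, n):
    """True if every input changes the output for at least one assignment."""
    return all(
        any(((tt >> i) & 1) != ((tt >> (i ^ (1 << j))) & 1) for i in range(1 << n))
        for j in range(n)
    )

# The two example functions from the slide: f = a + bc, g = b'a + b'c.
f = truth_table(lambda a, b, c: a | (b & c), 3)
g = truth_table(lambda a, b, c: ((1 - b) & a) | ((1 - b) & c), 3)
same_class = npn_canon(f, 3) == npn_canon(g, 3)

# Enumerate all NPN classes of functions over three inputs.
classes = {npn_canon(t, 3) for t in range(256)}
full_support = [c for c in classes if depends_on_all(c, 3)]
```

Here `same_class` confirms that f and g fall into one class, and `len(full_support)` gives the number of classes that truly depend on all three inputs, which corresponds to the count of 3-input classes quoted in the notes. The search space grows as n! · 2^(n+1) transforms per function, which is one way to see why exhaustive enumeration stops being practical for wider functions.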

8 UND: Utilization NPN-Class Diagram
UND is a DAG, a sub-graph of the NCD Helps with scoring and ranking functions [Figure: UND under construction. Each node is labeled functionality / appearance frequency / implementation capability, e.g. abc / 1 / xx%, ab’c’+a’bc’ / 1 / xx%, ab’+a’b / 0 / xx%, ab / 0 / xx%, a / 0 / xx%, -0- / 0 / xx%] The utilization NPN-class diagram (UND) is a sub-graph of the NPN-class diagram (NCD). Let’s take a look at an example to see how we build the UND. Suppose we have four logic functions in the training circuits. For each function, we calculate its NPN-class and add all nodes in its fanout cone to the UND according to the NPN-class diagram. There are three fields for each node in the diagram: functionality, appearance frequency and implementation capability. Whenever a node is added, its appearance frequency is updated. For instance, a frequency of 1 means that this function appears once in the training circuits. Then we process the next function, and so on. When all functions have been processed, we calculate the implementation capability for each node in the diagram. This number is calculated recursively from the nodes in its fanout cone, and it gives the percentage of logic functions that can be implemented by this NPN-class. Based on the implementation capability, the ranking of the NPN-classes is obtained. Of course, we can define other interesting metrics on the utilization NPN-class diagram for logic ranking.

9 UND: Utilization NPN-Class Diagram
[Figure: UND with updated appearance frequencies. Nodes (functionality / appearance frequency / implementation capability): abc / 1 / xx%, ab’c’+a’bc’ / 1 / xx%, ab’+a’b / 1 / xx%, ab / 0 / xx%, a / 1 / xx%, -0- / 0 / xx%]

10 UND: Utilization NPN-Class Diagram
Calculate Implementation Capability [Figure: UND with implementation capability filled in. Nodes (functionality / appearance frequency / implementation capability): abc / 1 / 50%, ab’c’+a’bc’ / 1 / 75%, ab’+a’b / 1 / 50%, ab / 0 / 25%, a / 1 / 25%, -0- / 0 / xx%; the fanout cone of ab’c’+a’bc’ is highlighted] The topology (DAG) of the UND enables us to efficiently explore different metrics for functionality ranking, e.g., utilization rate.
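The percentages on this slide are consistent with one simple reading of implementation capability: the summed appearance frequency of a node and everything in its fanout cone, divided by the total number of training functions. The sketch below follows that reading. The edge list is an assumption made for illustration (the exact edges of the diagram are not recoverable from the transcript), and the names `freq`, `edges`, `cone` and `capability` are hypothetical.

```python
# Appearance frequency of each NPN class in the toy training set (from the slide).
freq = {"abc": 1, "ab'c'+a'bc'": 1, "ab'+a'b": 1, "ab": 0, "a": 1, "0": 0}

# Assumed "upper node can implement lower node" edges of the UND.
edges = {
    "abc": ["ab"],
    "ab'c'+a'bc'": ["ab'+a'b", "ab"],
    "ab'+a'b": ["a"],
    "ab": ["a"],
    "a": ["0"],
    "0": [],
}

def cone(node):
    """All nodes reachable from `node`, including `node` itself."""
    seen = {node}
    stack = [node]
    while stack:
        for child in edges[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def capability(node):
    """Fraction of training functions this class can implement."""
    total = sum(freq.values())
    return sum(freq[v] for v in cone(node)) / total

# Rank the classes from most to least capable.
ranking = sorted(freq, key=capability, reverse=True)
```

With these inputs `capability` reproduces the values shown on the slide for the five non-constant nodes, and `ranking[0]` is `ab'c'+a'bc'`. Collecting the cone as a set avoids counting a shared descendant twice when a node is reachable along more than one path.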

11 Recap: Overall Flow for Macro-Gate Design
[Figure: flow chart. Map with LUT-N, then extract logic functions (e.g. and2(3), inv(1), nand2(2), ……), then generate the Utilization NPN Diagram, then calculate the score for logic functions, then rank logic functions. Nodes before scoring: abc / 1 / xx%, ab’c’+a’bc’ / 1 / xx%, ab’+a’b / 1 / xx%, ab / 0 / xx%, a / 1 / xx%, -0- / 0 / xx%. Nodes after scoring: abc / 1 / 50%, ab’c’+a’bc’ / 1 / 75%, ab’+a’b / 1 / 50%, ab / 0 / 25%, a / 1 / 25%, -0- / 0 / xx%. Score annotations: 1+1*1/2=1.5; 1; 1*1/2=0.5; 1+1*1/3=1.33; 1+1*2/3+1*1/3=2. Best function: ab’c’+a’bc’] Now let’s recap the flow for macro-gate design. Given a LUT-mapped design, we first extract all logic functions, then build the utilization NPN-class diagram. After that we calculate the implementation capability, perform the logic ranking and select the best logic functions to build into macro-gates.

12 Proposed Macro-Gates and FPGA Architecture
For IWLS’05 benchmarks, the following four 6-input functions have the highest ranks GI1=a b c d e f (AND-6) GI2=a’ b’ c’ + b c f’ + b c’ d’ + b’ c e (MUX-4) GI3=a b’ c d’ e + b c e f + d e f GI4=a b’ + a’ c d’ + b’ c’ + e’ + f’ Together they can implement over 50% of logic functions in IWLS’05 benchmarks. [Figure: architecture of the proposed macro-gate and FPGA SLICE] We perform the above procedure on the IWLS’05 benchmarks and select the four best 6-input NPN-classes based on our logic ranking result. Interestingly, the first two are a 6-input AND gate and a 4-1 MUX, and the remaining two are random logic. In our experiments, we find that over 50% of logic functions can be implemented by the combination of these four NPN-classes. The figure shows the proposed macro-gate design and the SLICE design. Basically, the four NPN-classes are fed into a 4-1 MUX with input/output negation. The SLICE includes one LUT, one macro-gate and two FFs.

13 Outline Design of the Embedded Macro-gates
Synthesis for the Proposed FPGA Architecture Technology Mapping for Heterogeneous FPGAs SAT-based Packing Place and Routing Comparison of Heterogeneous FPGA Architectures Conclusions and Future Work Having designed the macro-gate, I’ll now describe our synthesis flow for the proposed architecture. The complete synthesis flow includes technology mapping, packing and place&route. In this work, I study the first two problems. Since homogeneous SLICEs are assumed, we can use a traditional place&route tool such as VPR to perform the physical design. Below, I’ll start with technology mapping.

14 Functional & Structural Cut Enumeration
[Figure: example netlist with node labels w, x, y, z, a, b, c, d, where a=(x+y)’ and b=y+wz. Cut function: d=ab=(x+y)’(y+wz)=x’y’wz. Check against the 4-input macro-gate library: is x’y’wz in the library? Yes] Phase 1: Enumerate and label cuts from PIs to POs Check the feasibility of a cut w.r.t. the macro-gate Phase 2: Select the best choice from POs to PIs A general yet efficient solution is SAT-based Boolean matching Exploiting Symmetry in SAT-Based Boolean Matching for Heterogeneous FPGA Technology Mapping, Session 5C.1, ICCAD 07 For technology mapping, we adapt the traditional cut-based algorithm to handle macro-gates. There are two phases in the algorithm: first we enumerate and label cuts from PIs to POs, and then we select the best choice from POs to PIs. In our implementation, we first store all logic functions that can be implemented by the macro-gate in a lookup table. During the first phase, we calculate the logic function of each cut and label it as a macro-gate candidate if it can be found in the lookup table. A general yet efficient solution for this is SAT-based Boolean matching. We have a paper, to be presented in Session 5C.1, that discusses this issue.
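The feasibility check in phase 1 can be illustrated on the cut from the slide. This is a small sketch under stated assumptions, not the mapper described in the talk: the library is assumed to hold a single 4-input AND gate (the transcript does not say which functions the 4-input library contains), matching is done by trying every input-negation mask plus output negation rather than by SAT, and input permutation is skipped because AND is symmetric. All function names are hypothetical.

```python
def truth_table(fn, n):
    """Pack fn's outputs into an int: bit i holds fn applied to the bits of i."""
    tt = 0
    for i in range(1 << n):
        args = [(i >> j) & 1 for j in range(n)]
        tt |= (fn(*args) & 1) << i
    return tt

def negate_inputs(tt, n, mask):
    """Truth table of the function with the inputs selected by `mask` inverted."""
    out = 0
    for i in range(1 << n):
        out |= ((tt >> (i ^ mask)) & 1) << i
    return out

def matches_macro_gate(cut_tt, gate_tt, n):
    """True if the gate realizes the cut under some input/output negation."""
    all_ones = (1 << (1 << n)) - 1
    for mask in range(1 << n):
        candidate = negate_inputs(gate_tt, n, mask)
        if cut_tt == candidate or cut_tt == candidate ^ all_ones:
            return True
    return False

def cut_function(w, x, y, z):
    """The cut from the slide: a = (x+y)', b = y + wz, d = ab."""
    a = 1 - (x | y)
    b = y | (w & z)
    return a & b

cut_tt = truth_table(cut_function, 4)
simplified_tt = truth_table(lambda w, x, y, z: (1 - x) & (1 - y) & w & z, 4)

# Hypothetical library content: one 4-input AND macro-gate.
and4 = truth_table(lambda w, x, y, z: w & x & y & z, 4)
library = [and4]
feasible = any(matches_macro_gate(cut_tt, gate, 4) for gate in library)

# A function the assumed library cannot realize, for contrast.
xor4 = truth_table(lambda w, x, y, z: w ^ x ^ y ^ z, 4)
```

Comparing `cut_tt` with `simplified_tt` checks the algebra d = x’y’wz, and `feasible` reports whether the cut can be labeled as a macro-gate candidate. A lookup of this kind scales poorly with input count, which is the motivation the notes give for SAT-based Boolean matching.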

15 Key in Technology Mapping: Balance Resource Utilization
Asymmetric architecture causes problems for resource utilization Exclusive use of one logic resource leads to lots of unused fabric Simple yet effective solution: Change the LUT-MG ratio by adjusting their area weights. Precise calibration is hard to reach by this approach. Total# too large! Objective architecture: LUT6:MacroGate6 = 1:1 Hard to obtain precise calibration Best LUT-MG ratio = 1:1 LUT-MG ratio = LUT#/MG# The proposed new architecture is asymmetric, and exclusive use of one logic resource may leave much of the fabric unused. Therefore, after the cut-based mapping, we need to balance the resource utilization of LUTs and macro-gates. One way to balance the resources is to adjust the area weights during technology mapping, biasing the mapper toward one resource. As shown in this example, given an architecture with one LUT and one macro-gate in a SLICE, the most area-efficient resource utilization ratio is 1:1. The table shows the resource utilization for different area weights for LUTs and macro-gates. It turns out that it is hard to achieve a perfect resource balance by simply adjusting the area weights.

16 Post-Mapping Area Recovery (motivation example)
Given: Target architecture = LUT6 + MG6 LUT-MG ratio in target architecture = 1:1 LUT# < MG# in the mapped design Intrinsic delay (LUT6 : MG6) = 5:4 Objective: balance the LUT and MG counts without increasing delay [Figure: mapped netlist from PI to PO with one LUT6 and six MG6 nodes; node labels 5 / 5, 9 / 13, 17 / 17, 9 / 9, 13 / 13, 4 / 5, 8 / 9] We propose to perform post-mapping area recovery. Consider the following example. There are six macro-gates and one LUT. The goal is to re-map the circuit and achieve a roughly equal number of macro-gates and LUTs without increasing the delay.

17 Post-Mapping Area Recovery (motivation example)
Given: Target architecture = LUT6 + MG6 LUT-MG ratio in target architecture = 1:1 LUT# < MG# in the mapped design Intrinsic delay (LUT6 : MG6) = 5:4 Objective: balance the LUT and MG counts without increasing delay [Figure: the same netlist from PI to PO, now with two LUT6 and five MG6 nodes; node labels 5 / 5, 10 / 13, 17 / 17, 9 / 9, 13 / 13, 4 / 5, 8 / 9]

18 Post-Mapping Area Recovery (motivation example)
Given: Target architecture = LUT6 + MG6 LUT-MG ratio in target architecture = 1:1 LUT# < MG# in the mapped design Intrinsic delay (LUT6 : MG6) = 5:4 Objective: balance the LUT and MG counts without increasing delay Timing slack budgeting is necessary! [Figure: the same netlist from PI to PO, now with four LUT6 and three MG6 nodes; node labels 5 / 5, 10 / 13, 18 / 17, 9 / 9, 14 / 13, 5 / 5, 10 / 9. Timing target violation!] To achieve this, we can perform static timing analysis and greedily replace non-critical macro-gates with LUTs. However, it turns out that the timing constraint is then violated.

19 Post Mapping Area Recovery by Timing Budgeting
Formulated as an Integer Linear Programming (ILP) problem Objective (minimize the gap between target and actual LUT-MG ratios): min |m2+…+m7−7/2| Arrival time constraints: ai+dj+bj ≤ aj Clock period target: ai ≤ 17 LUT assignment with given timing slack: (5−4)·mj ≤ bj, mj ∈ {0,1} Easily generalized to architectures with multiple macro-gates with different input pin numbers [Figure: the example netlist from PI to PO with one LUT6 and six MG6 nodes, arrival times labeled a1 to a7] Therefore, we propose the following resource utilization balancing algorithm based on binary integer linear programming. A timing slack budgeting framework is used in our formulation. The formulation is easily generalized to architectures with multiple macro-gates with different input pin numbers. Please refer to the paper for the detailed formulation of our algorithm.
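The effect of the formulation can be seen on the seven-node motivation example. The sketch below is an illustration under assumptions rather than the ILP from the paper: the netlist topology is reconstructed from the node labels on the preceding slides and should be treated as hypothetical, the node names n1 to n7 are invented, and an exhaustive search over the binary variables stands in for an ILP solver. The objective |m2+…+m7−7/2|, the delays 5 and 4, and the clock target 17 are taken from the slides.

```python
from itertools import product

LUT_DELAY, MG_DELAY, CLOCK_TARGET = 5, 4, 17

# Assumed topology: n1 is the fixed LUT6, n2..n7 start as MG6.
fanin = {
    "n1": [],
    "n2": ["n1"],
    "n3": ["n2", "n7"],
    "n4": ["n3", "n5"],   # n4 drives the PO
    "n5": ["n1"],
    "n6": [],
    "n7": ["n6"],
}
topo_order = ["n1", "n6", "n2", "n5", "n7", "n3", "n4"]
movable = ["n2", "n3", "n4", "n5", "n6", "n7"]

def critical_delay(is_lut):
    """Longest PI-to-PO arrival time for a given LUT/MG assignment."""
    arrival = {}
    for node in topo_order:
        delay = LUT_DELAY if is_lut[node] else MG_DELAY
        arrival[node] = max((arrival[u] for u in fanin[node]), default=0) + delay
    return max(arrival.values())

def best_assignment():
    """Exhaustively pick m2..m7 that minimize the objective within the clock target."""
    best = None
    for bits in product([0, 1], repeat=len(movable)):
        is_lut = {"n1": True}
        is_lut.update(zip(movable, map(bool, bits)))
        if critical_delay(is_lut) > CLOCK_TARGET:
            continue  # timing target violated, assignment is infeasible
        cost = abs(sum(bits) - 7 / 2)
        if best is None or cost < best[0]:
            best = (cost, is_lut)
    return best

all_mg = {n: n == "n1" for n in fanin}
greedy = dict(all_mg, n5=True, n6=True, n7=True)
cost, chosen = best_assignment()
```

Under this reconstruction `critical_delay(all_mg)` is 17 and `critical_delay(greedy)` is 18, matching the violation shown on the previous slide. The search converts n5 and exactly one of n6 and n7, because those two nodes share a single unit of slack, which is the kind of trade-off the slack budgeting variables bj capture.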

20 Outline Design of the Embedded Macro-gates
Synthesis for the Proposed FPGA Architecture Technology Mapping for Heterogeneous FPGAs SAT-based Packing Comparison of Heterogeneous FPGA Architectures Conclusions and Future Work Having mapped the application to the proposed FPGA, I’ll propose a flexible packing algorithm based on SAT.

21 SAT-Based Packing Motivation
Traditional packing tools, e.g., T-VPack, hard-code the architecture specification of a SLICE Re-implement from scratch when the architecture changes Propose a unified implementation of the packers for different architectures: easy to perform architecture exploration! The architecture-dependent sub-problem in packing Structural feasibility checking of a sub-circuit against the SLICE Solution Solve the problem of validating SLICE packing as a local place&route problem A SAT solver is used to carry out the validation checking Traditional packing tools such as T-VPack hard-code the architecture information of a SLICE, so the packer has to be re-implemented from scratch when the architecture changes. Below, I’ll propose a unified implementation of the packers for different architectures, which makes architecture exploration easy. In fact, the architecture-dependent sub-problem in packing is the structural feasibility check of a sub-circuit against a SLICE. We can treat it as a local place&route problem, which is formulated and solved as a SAT problem.

22 Example of SAT-Based SLICE Packing
Examples of constraints: (for each class of constraint…) Placement and routing choice variables: Exclusivity constraint: ∨ Presence constraint: ∨ Input/Output constraint: → Routing constraint: G0 →out ∧ → Here is an example of how we solve the feasibility checking problem as a local place&route problem. Suppose the target SLICE is shown in figure (a) and we need to check whether sub-circuit (b) fits into this SLICE. We first define new variables based on the LUT placement options. For example, LUT X can be placed at either A or B, so we define one variable for each of these two choices. We also define variables based on the routing options. For example, net U5 can be implemented on net N10 in the SLICE, so we define a variable for that choice. Then we define a few constraints to validate the packing. For example, LUT X can be placed at either A or B but not both, so we have the exclusivity constraint, together with presence constraints. If LUT X is placed in site A, then its output U5 must be implemented by N10 in the SLICE, so we have the input/output constraint. And so forth. Please refer to the paper for the complete list of the formulations.
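The constraint classes described above can be written as CNF clauses and checked on a toy instance. This is a minimal sketch, not the packer from the talk: the variable names are invented, the wire N11 and the clauses involving it are hypothetical additions to make the example non-trivial, and a brute-force enumerator replaces the MiniSat solver mentioned later in the talk.

```python
from itertools import product

# Hypothetical choice variables, numbered from 1 as in DIMACS-style literals.
VARS = ["X_at_A", "X_at_B", "U5_on_N10", "U5_on_N11"]

def models(clauses, n_vars):
    """Enumerate every assignment that satisfies all clauses."""
    found = []
    for bits in product([False, True], repeat=n_vars):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            found.append(bits)
    return found

clauses = [
    [1, 2],     # presence: LUT X is placed at A or at B
    [-1, -2],   # exclusivity: LUT X is not placed at both sites
    [-1, 3],    # input/output: X at A implies U5 is routed on N10
    [-2, 4],    # input/output: X at B implies U5 is routed on N11 (assumed)
    [-3, -4],   # routing: U5 uses at most one SLICE wire (assumed)
]

all_packings = models(clauses, len(VARS))

# Suppose N11 is unavailable, e.g. taken by another net: add a unit clause.
blocked = models(clauses + [[-4]], len(VARS))
```

An empty result from `models` would mean the sub-circuit does not fit the SLICE. Because only the clause list depends on the SLICE structure, a new architecture changes the data passed to the solver rather than the packer code, which is the flexibility argument made in the notes.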

23 Recap: Overall Synthesis Flow
[Figure: flow chart. Area weight setting, then cut-based mapping (LUT6, MG6), then the decision “Area-Balance Trade-off?” (Y / N), then post-mapping area recovery (LUT6, MG6), then packing (LUT6, MG6)] Having presented the technology mapping and packing, let’s now recap the overall synthesis flow for the FPGA with mixed macro-gates and LUTs. We first define the delay and area cost for a LUT and a macro-gate, then perform the cut-based technology mapping. If the utilization of the resources (LUTs and macro-gates) is not balanced, a resource balancing algorithm is performed to adjust the mapping result. Finally, a packing procedure is conducted.

24 Outline Motivation and Objectives
Methodology for Logic Function Exploration Technology Mapping for Heterogeneous FPGAs Evaluation of Heterogeneous FPGA Architectures Conclusions and Future Work Below I’ll give the experimental results

25 Experimental Setting Benchmark set: IWLS 2005
Design library parameters [Cong, TODAES’05] Four architectures are compared: LUT4, LUT4 + macro-gate, LUT6, and LUT6 + macro-gate Synthesize the proposed macro-gate with SIS 1.2 Delay and area model Interconnect delay is ignored We’ve implemented the mapper in Berkeley ABC and the packer in LISP with the MiniSat package. The design library used to evaluate area and delay is taken from existing work, and the benchmark set is IWLS 2005. We compare four architectures: LUT4-only, LUT4+macro-gate, LUT6-only, and LUT6+macro-gate. The proposed macro-gate is synthesized with Berkeley SIS, and the delay and area model used in the experiments is shown in this table. Basically, the area and delay of the macro-gate are comparable to those of LUT-4, and 3x and 1x smaller than those of LUT-6.

26 Delay Comparisons
Compared to LUT4, LUT4+MG reduces both logic depth and delay by 9.2%. Compared to LUT6, LUT6+MG reduces delay by 30% while increasing logic depth by 36.5%. A LUT6 can implement more logic than a macro-gate Let’s first take a look at the delay comparisons. The mixed LUT4 and macro-gate architecture reduces both logic depth and delay by 9.2% compared to the LUT4-only architecture. The mixed LUT6 and macro-gate architecture reduces logic delay by 30% while increasing logic depth by 36.5%. The logic depth increases because a LUT6 can implement more logic than a macro-gate.

27 Logic Area Comparisons
Compared to LUT4, LUT4+MG reduces logic area by 12.5%. Compared to LUT6, LUT6+MG reduces logic area by 16.9%. For the logic area comparison, LUT4+MG architecture reduces logic area by 12.5% and LUT6+MG architecture reduces logic area by 16.9%.

28 Outline Motivation and Objectives
Methodology for Logic Function Exploration Technology Mapping for Heterogeneous FPGAs Comparison of Heterogeneous FPGA Architectures Conclusions and Future Work

29 Conclusions
Conclusions A novel FPGA architecture with mixed LUTs and macro-gates is proposed A synthesis flow for the proposed architecture is implemented The preliminary experimental results show the effectiveness of the proposed architecture for area and delay reduction Future Work Perform the physical design for the synthesized circuits and compare the routing costs, architecture evaluation considering interconnect delay Study the effectiveness of the proposed architecture for power reduction Macro-gates with wider inputs will be examined In conclusion, we have presented a novel FPGA architecture with mixed LUTs and macro-gates, and proposed a set of synthesis tools. The results show the effectiveness of the proposed architecture for area and delay reduction. In the future, we will perform the physical design for the synthesized circuits and compare the routing cost. We will also study the effectiveness of the proposed architecture for power reduction. In addition, macro-gates with wider inputs will be examined.

