Alan Mishchenko UC Berkeley

Slides:

Advertisements

Similar presentations

ECE 667 Synthesis & Verificatioin - FPGA Mapping 1 ECE 667 Synthesis and Verification of Digital Systems Technology Mapping for FPGAs D.Chen, J.Cong, DAOMap.

Advertisements

FPGA Technology Mapping Dr. Philip Brisk Department of Computer Science and Engineering University of California, Riverside CS 223.

ECE 667 Synthesis and Verification of Digital Systems

Combining Technology Mapping and Retiming EECS 290A Sequential Logic Synthesis and Verification.

1 DAOmap: A Depth-optimal Area Optimization Mapping Algorithm for FPGA Designs Deming Chen, Jacon Cong ICCAD 2004 Presented by: Wei Chen.

Technology Mapping.

Introduction to Logic Synthesis Alan Mishchenko UC Berkeley.

DAG-Aware AIG Rewriting Alan Mishchenko, Satrajit Chatterjee, Robert Brayton Department of EECS, University of California Berkeley Presented by Rozana.

05/04/06 1 Integrating Logic Synthesis, Tech mapping and Retiming Presented by Atchuthan Perinkulam Based on the above paper by A. Mishchenko et al, UCAL.

Combinational and Sequential Mapping with Priority Cuts Alan Mishchenko Sungmin Cho Satrajit Chatterjee Robert Brayton UC Berkeley.

DAOmap: A Depth-optimal Area Optimization Mapping Algorithm for FPGA Designs Deming Chen and Jason Cong Computer Science Department University of California,

Enumeration of Irredundant Circuit Structures Alan Mishchenko Department of EECS UC Berkeley UC Berkeley.

1 Stephen Jang Kevin Chung Xilinx Inc. Alan Mishchenko Robert Brayton UC Berkeley Power Optimization Toolbox for Logic Synthesis and Mapping.

Technology Mapping. 2 Technology mapping is the phase of logic synthesis when gates are selected from a technology library to implement the circuit. Technology.

Wenlong Yang Lingli Wang State Key Lab of ASIC and System Fudan University, Shanghai, China Alan Mishchenko Department of EECS University of California,

Technology Mapping with Choices, Priority Cuts, and Placement-Aware Heuristics Alan Mishchenko UC Berkeley.

1 WireMap FPGA Technology Mapping for Improved Routability Stephen Jang, Xilinx Inc. Billy Chan, Xilinx Inc. Kevin Chung, Xilinx Inc. Alan Mishchenko,

DAOmap: A Depth-optimal Area Optimization Mapping Algorithm for FPGA Designs Deming Chen, Jason Cong ， Computer Science Department ， UCLA Presented.

A Semi-Canonical Form for Sequential Circuits Alan Mishchenko Niklas Een Robert Brayton UC Berkeley Michael Case Pankaj Chauhan Nikhil Sharma Calypto Design.

Global Delay Optimization using Structural Choices Alan Mishchenko Robert Brayton UC Berkeley Stephen Jang Xilinx Inc.

Reducing Structural Bias in Technology Mapping

Synthesis for Verification

Technology Mapping into General Programmable Cells

Power Optimization Toolbox for Logic Synthesis and Mapping

Alan Mishchenko UC Berkeley

Mapping into LUT Structures

Delay Optimization using SOP Balancing

Faster Logic Manipulation for Large Designs

SAT-Based Logic Optimization and Resynthesis

Robert Brayton Alan Mishchenko Niklas Een

Alan Mishchenko Satrajit Chatterjee Robert Brayton UC Berkeley

Introduction to Logic Synthesis with ABC

Magic An Industrial-Strength Logic Optimization, Technology Mapping, and Formal Verification System Alan Mishchenko UC Berkeley.

Logic Synthesis: Past, Present, and Future

A Semi-Canonical Form for Sequential AIGs

Applying Logic Synthesis for Speeding Up SAT

Integrating an AIG Package, Simulator, and SAT Solver

Synthesis for Verification

Standard-Cell Mapping Revisited

Faster Logic Manipulation for Large Designs

LUT Structure for Delay: Cluster or Cascade?

SAT-Based Area Recovery in Technology Mapping

Polynomial Construction for Arithmetic Circuits

Alan Mishchenko University of California, Berkeley

SAT-Based Optimization with Don’t-Cares Revisited

Scalable and Scalably-Verifiable Sequential Synthesis

Mapping into LUT Structures

ECE 667 Synthesis and Verification of Digital Systems

FPGA Glitch Power Analysis and Reduction

Alan Mishchenko UC Berkeley (With many thanks to Donald Knuth,

Reinventing The Wheel: Developing a New Standard-Cell Synthesis Flow

Alan Mishchenko UC Berkeley (With many thanks to Donald Knuth for

Integrating an AIG Package, Simulator, and SAT Solver

Introduction to Logic Synthesis

Improvements in FPGA Technology Mapping

Technology Mapping I based on tree covering

Recording Synthesis History for Sequential Verification

Logic Synthesis: Past, Present, and Future

Delay Optimization using SOP Balancing

Logic Synthesis: Past and Future

Reinventing The Wheel: Developing a New Standard-Cell Synthesis Flow

Magic An Industrial-Strength Logic Optimization, Technology Mapping, and Formal Verification System Alan Mishchenko UC Berkeley.

Innovative Sequential Synthesis and Verification

Robert Brayton Alan Mishchenko Niklas Een

SAT-based Methods: Logic Synthesis and Technology Mapping

Fast Min-Register Retiming Through Binary Max-Flow

Introduction to Logic Synthesis with ABC

Robert Brayton Alan Mishchenko Niklas Een

Alan Mishchenko Department of EECS UC Berkeley

Integrating AIG Package, Simulator, and SAT Solver

Presentation transcript:

Alan Mishchenko UC Berkeley FPGA Technology Mapping for LUTs, LUT Structures, and Programmable Cells Alan Mishchenko UC Berkeley

Overview Introduction Technology mapping into LUTs Priority cuts Structural choices Mapping into LUT structures Mapping into complex programmable cells Conclusion

(1) Introduction Terminology And-Inverter Graphs

Terminology Logic network Structural cut of a node Primary inputs/outputs (PIs/POs) Logic nodes Fanins/fanouts Transitive fanin/fanout cone (TFI/TFO) Structural cut of a node Cut is a boundary in the network separating the node from the PIs Boundary nodes are the leaves The node is the root K-feasible cut has K or less leaves Function of the cut is function of the root in terms of the leaves Primary inputs Primary outputs Fanins Fanouts TFO TFI Primary inputs Leaves Root Cut

AIG Definition and Examples AIG is a Boolean network composed of two-input ANDs and inverters. cdab 00 01 11 10 1 F(a,b,c,d) = ab + d(ac’+bc) b c a d 6 nodes 4 levels cdab 00 01 11 10 1 F(a,b,c,d) = ac’(b’d’)’ + c(a’d’)’ = ac’(b+d) + bc(a+d) a c b d 7 nodes 3 levels

(2) Technology Mapping Traditional LUT mapping Delay-optimal mapping Area recovery Drawbacks of the traditional mapping Excessive memory and runtime Structural bias Ways to mitigate the drawbacks Priority cuts Structural choices

Traditional LUT Mapping Algorithm Input: And-Inverter Graph Compute K-feasible cuts for each node Compute best arrival time at each node In topological order (from PI to PO) Compute the depth of all cuts and choose the best one Perform area recovery Using area flow Using exact local area Chose the best cover In reverse topological order (from PO to PI) Output: Mapped Netlist

Delay-Optimal Mapping Cut size K = 3 Input: AIG and K-cuts computed for all nodes Algorithm: For all nodes in a topological order Compute arrival time of each cut using fanin arrival times Select one cut with min arrival time Set the arrival time of the node to be the arrival time of this cut Output: Delay-optimal mapping for all nodes f 3 Cut {pqr} of node f has arrival time 3 s r p q 1 1 2 1 a b c d e f f 2 s Cut {stu} of node f has arrival time 2 1 t u 1 1 a b c d e f

Area Recovery During Mapping Delay-optimal mapping is performed first Best match is assigned at each node Some nodes are used in the mapping; others are not used Arrival and required times are computed for all AIG nodes Required time for all used nodes is determined If a node is not used, its required time is set to +infinity Slack is a difference between required time and arrival time If a node has positive slack, its current best match can be updated to reduce the total area of mapping This process is called area recovery Exact area recovery is exponential in the circuit size A number of area recovery heuristics can be used Heuristic area recovery is iterative Typically involved 3-5 iterations Next, we discuss cost functions used during area recovery They are used to decide what is the best match at each node

How to Measure Area? Suppose we use the naïve definition: Area (cut) = 1 + [ Σ area (fanin) ] (assuming that each LUT has one unit of area) y x x y p q r p q r a b c d e f a b c d e f Area of cut {pcd} = 1 + [1 + 0 + 0] = 2 Area of cut {abq} = 1 + [ 0 + 0 + 1] = 2 Naïve definition says both cuts are equally good in area Naïve definition ignores sharing due to multiple fanouts

Area-flow Area-flow “correctly” accounts for sharing area-flow (cut) = 1 + [ Σ ( area-flow ( fanin ) / fanout_num( fanin ) ) ] y x x y p q r p q r a b c d e f a b c d e f Area-flow of cut {pcd} = 1 + [1 + 0 + 0] = 2 Area-flow of cut {abq} = 1 + [ 0/1 + 0/1 + ½] = 1.5 Area-flow recognizes that cut {abq} is better Area-flow “correctly” accounts for sharing (Cong ’99, Manohara-rajah ’04)

Exact Local Area Exact-local-area (cut) = 1 + [ Σ exact-local-area (fanin with no other fanout) ] f f p p 6 6 6 6 q q s t s t a b c d e f a b c d e f Cut {pef} Area flow = 1+ [(.25+.25+3)/2] = 2.75 Exact area = 1 + 0 (p is used elsewhere) Exact area will choose this cut. Cut {stq} Area flow = 1+ [.25+.25 +1] = 2.5 Exact area = 1 + 1 = 2 (due to q) Area flow will choose this cut.

Area Recovery Summary Area recovery heuristics Area-flow (global view) Chooses cuts with better logic sharing Exact local area (local view) Minimizes the number of LUTs by looking one node at a time The results of area recovery depends on The order of processing nodes The order of applying two passes The number of iterations Implementation details This scheme works for the constant-delay model Any change off the critical path does not affect critical path

Drawbacks of Traditional Mapping Excessive memory and runtime requirements Exhaustive cut enumeration leads to many cuts (especially when K  6) Structural bias The structure of the object graph does not allow good mapping to be found

Excessive Memory and Runtime For large designs, there may be too many K-feasible cuts 1M node AIG has ~50M 6-cuts Requires ~2GB of storage memory and takes ~30 sec to compute Past ways of tackling the problem Detect and remove dominated cuts Does not help much Perform cut pruning (store N cuts/node) Throws away useful cuts even if N = 1000 Store only cuts on the frontier Reduces memory but increases runtime k Average number of cuts per node 4 6 5 25 50 7 120 8 250

Structural Bias Consider mapping 4:1 MUX into 4-LUTs The naïve approach results in 3 LUTs After logic structuring, mapping with 2 LUTs can be found

Ways to Mitigate the Drawbacks Excessive memory and runtime requirements Compute only a small number of “useful” cuts Leads to mapping with priority cuts Structural bias Perform mapping over multiple circuit structures Leads to mapping with structural choices

(3) Priority Cuts Structural cuts Exhaustive cut enumeration Prioritizing cuts Implementation tricks

Structural Cuts in AIG n A cut of a node n is a set of nodes in transitive fanin such that every path from the node to PIs is blocked by nodes in the cut. A k-feasible cut has no more than k leaves. p q a b c The set {pbc} is a 3-feasible cut of node n. (It is also a 4-feasible cut.) k-feasible cuts are important in LUT mapping because the logic between root n and the cut leaves {pbc} can be replaced by a k-LUT.

Exhaustive Cut Enumeration { p, ab } { q, bc } { a } { c } { b } a b c p q n { n, pq, pbc, abq, abc } Computation is done bottom-up The set of cuts of a node is a ‘cross product’ of the sets of cuts of its children. Any cut that is of size greater than k is discarded. (P. Pan et al, FPGA ’98; J. Cong et al, FPGA ’99)

Experiment with K-Cut Computation C/N is the number of cuts per node; T is time in seconds; L/N is the ratio of nodes with the number of cuts exceeding the limit (N=1000); for K < 8, the number of cuts did not exceed 1000

Computing Priority Cuts Consider nodes in a topological order At each node, merge two sets of fanin cuts (each containing C cuts) resulting in (C+1) * (C+1) + 1 cuts Sort these cuts using a given cost function, select C best cuts, and use them for computing priority cuts of the fanouts Select one best cut, and use it to map the node Sorting criteria The tie-breaking criterion denoted “fanin refs” means “prefer cuts with larger average fanin reference counters”.

Priority Cuts: A Bag of Tricks Compute and use priority cuts (a subset of all cuts) Dynamically update the cuts in each mapping pass Use different sorting criteria in each mapping pass Include the best cut from the previous pass into the set of candidate cuts of the current pass Consider several depth-oriented mappings to get a good starting point for area recovery Use complementary heuristics for area recovery Perform cut expansion as part of area recovery Use efficient memory management

Priority-Cut-Based Mapping Input: And-Inverter Graph Compute K-feasible cuts for each node Compute arrival time at each node In topological order (from PI to PO) Compute the depth of all cuts and choose the best one Compute at most C good cuts and choose the best one Perform area recovery Using area flow Using exact local area In each iteration, re-compute at most C good cuts and choose the best one Chose the best cover In reverse topological order (from PO to PI) Output: Mapped Netlist

Complexity Analysis The worst-case complexity of traditional mapping FlowMap O(Kmn) (J. Cong et al, TCAD ’94) CutMap O(2KmnK) (J. Cong et al, FPGA ’95) DAOmap O(KnK) (J. Cong et al, ICCAD’04) Mapping with priority cuts O(KC2n) K is max cut size C is max number of cuts n is number of nodes m is number of edges

(4) Structural Choices Structural bias Ways to overcome structural bias Need some form of (re)synthesis to get multiple circuit structures Computing and using several synthesis snapshots Running several scripts and combining the resulting networks Performing Boolean decomposition during mapping Multiple circuit structures = structural choices Questions: How to efficiently detect and store structural choices? How to perform technology mapping with structural choices?

Structural Bias The mapped netlist very closely resembles the subject graph f f p LUT p Technology Mapping LUT m m LUT a b c d e a b c d e Every input of every LUT in the mapped netlist must be present in the subject graph - otherwise technology mapping will not find the match

Example of Structural Bias A better match may not be found f f This match is not found p LUT p f LUT LUT q m m LUT LUT a b c d e a b c d e a b c d e Since the point q is not present in the subject graph, the match on the right is not found

Example of Structural Bias The better match can be found with a different subject graph f a b c d f e p m p f synthesis LUT  q q LUT a b c d e a b c d e

Synthesis with Structural Choices Traditional synthesis produces one “optimized” network Synthesis with choices produces several networks These can be different snapshot of the same synthesis flow These can be results of synthesizing the design with different options For example, area-oriented and delay-oriented scripts Synthesis D1 D2 D3 D4 Synthesis with structural choices D1 HAIG D4 D2 D3

Mapping with Structural Choices Two questions have to be answered How to store multiple circuit structures? How to perform mapping with multiple circuit structures? Both questions can be solved due to the following: The subject graph is an AIG Structural hashing quickly merges isomorphic circuit structures There are powerful equivalence checking methods They can be used to prove equivalence Cut computation can be extended to work with structural choices The modification is straight-forward

Detecting Choices Given two Boolean networks, create a network with choices Network 1 x = (a + b)c y = bcd Network 2 x = ac + bc y = bcd Step 1: Make And-Inverter decomposition of networks x y x y a b c d a b c d

Detecting Choices Step 2: Use combinational equivalence to detect functionally equivalent nodes up to complementation (A. Kuehlmann, TCAD’02) Random simulation to detect possibly equivalent nodes SAT-based decision procedure to prove equivalence Network 1 x = (a + b)c y = bcd Network 2 x = ac + bc y = bcd x y x y a b c d a b c d

Detecting Choices Step 3: Merge equivalent nodes with choice edges x y b c d x y a b c d x y x now represents a class of nodes that are functionally equivalent up to complementation a b c d

Cut Computation with Choices Cuts are now computed for equivalence classes of nodes { x1, pr, pbc, acr, abc } { x2, qc, abc } x y x1 x2 p q r a b c d Cuts ( x ) = Cuts ( x1 )  Cuts( x2 ) = { x1, pr, pbc, acr, abc, x2, qc }

Mapping Algorithm with Choices Only Step 1 has to be changed Input: And-Inverter Graph with choices Compute K-feasible cuts with choices Compute best arrival time at each node In topological order (from PI to PO) Compute the depth of all cuts and choose the best one Perform area recovery Using area flow Using exact local area Chose the best cover In reverse topological order (from PO to PI) Output: Mapped Netlist

Mapping into LUT Structures

Introduction Delay optimization is a high priority Can be addressed by CAD algorithms FPGA hardware We propose a two-fold solution Improved LUT mapping algorithm Modification to the hardware 38

Two LUT Structures 7-input LUT structure “44” 39

Contributions Developed of an efficient matching algorithm to check whether a given Boolean function can be implemented using a given LUT structure Modified the priority-cut-based technology mapper in ABC to perform mapping into the LUT structures Evaluated new algorithm and new LUT structures Collected statistics on implementable K-input Boolean functions appearing in industrial designs 40

Experimental Setup Mapping into dedicated hardware Improved traditional mapping Baseline (if) and mapping with structural choices (MSC) (dch; if -j)4 # k area delay 1-4 1.00 1.00 44: Runs (dch; if -j)4 with LUT library: 5-7 2.00 2.00 444: Runs (dch; if -j)4 with LUT library: 5-10 3.00 2.00 Best 444: Runs (dch; if -j)4 with LUT library: 5-6 2.00 2.00 7 2.50 2.00 8-10 3.00 2.00 Mapping into dedicated hardware 1.20 41

Experimental Results Summary Table 4.1. Improvements to the traditional FPGA mapping. Table 4.2. Delay-optimization using dedicated FPGA architecture with direct connections between adjacent LUTs. 42

Ratios of Implementable Functions “44” structure 4-input – 100% 5-input – 99.99% 6-input – 99% 7-input – 84% “444” structure 5-input – 100% 6-input – 99.99% 7-input – 97.6% 8-input – 94.5% 9-input – 75.5% 10-input – 39.7% 43

Mapping into General Programmable Cells

Overview Boolean matching is used in synthesis for FPGAs To map into current programmable cells To research future cell architectures Boolean matching methods are often specialized, slow, and require manual effort We propose a new Boolean matcher, which is General (works for many types of cells) Fast (due to the use of concurrency) Automatic (does not require manual tuning) Experiments show it is useful in practice 45

Boolean Matching Given a programmable cell and a Boolean function, find if a cell can be programmed to implement a given function Programmable cell Function that can be implemented: F = AND(a, b, c, d, e) Function that cannot be implemented: F = XOR(a, b, c, d, e) e a b c d

Small Practical Functions Classifications of Boolean functions Random functions Specialized classes of functions Symmetric, unate, etc Logic synthesis and technology mapping deal with Functions appearing in the designs Functions having small support (up to 16 variables) We call them small practical functions (SPFs) We concentrate on SPFs in this work

The Proposed Matching Flow Pre-computation Collect SPFs appearing in the design(s) Input and pre-process cell description Perform Boolean matching concurrently for all SPFs and save data into a file During mapping Perform additional Boolean matching step Allow only matchable functions to be used

Programmable Cells Used

Area Comparison

Delay and Runtime Comparison

Summary Reviewed improved LUT mapping Starts with an optimized AIG (with choices) Performs exhaustive (or priority) cut computation Performs heuristic area recovery Uses placement-aware heuristics Presented extentions of LUT mapping Mapping into LUT structures Mapping into programmable cells

FPGA Technology Mapping for LUTs, LUT Structures, and Programmable Cells Abstract: A structural technology mapper is a computation engine used in EDA tools. The mapper takes a simple circuit structure, such as an and-inverter graph, and a technology library. The mapper produces a structural cover of the circuit in terms of the library primitives while optimizing cost functions, such as area, delay, power, etc. This talk describes the design of an efficient cut-based technology mapper and its applications in the design flow for FPGAs. We will discuss three types of primitives: (1) k-input lookup tables (LUTs), which can implement any k-input logic function, (2) n-input structures composed of several k-LUTs, and (3) n-input single-output programmable cells composed of several k-LUTs and other cells such as MUXes and AND/OR gates, where k is typically 4 or 6, while n can be up to 16. Bio: Alan Mishchenko graduated from Moscow Institute of Physics and Technology (Moscow, Russia) in 1993 with MS and received his PhD from the Glushkov Institute of Cybernetics (Kiev, Ukraine) in 1997. In 2002, he joined the EECS Department at UC Berkeley, where he is currently a full researcher. Alan’s research interests are in developing computationally efficient methods for synthesis and verification.