ECE 697F Reconfigurable Computing Lecture 5 Technology Mapping: Packing Logic into LUTs
Overview Logic synthesis LUT Clustering LUT capacity Chortle – example technology mapper Architecture-specific optimization
Boolean network A Boolean network is the main representation of the logic functions for technology-independent optimizations. Each node can be represented as sum-of-products (or product-of-sums). Provides multi-level structure, but functions in the network need not correspond to logic gates.
Boolean network example out1 = k2 + x2’ out2 = k3 + x1 k2 = x1’ x2 x4 + k1 k3 = k1 x4’ k1 = x2 + x3 x1 x2 x3 x4 primary outputs primary inputs
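The example network can be evaluated directly, which is a quick way to check a transcription of it. The following is a minimal sketch in Python, assuming the usual reading of the slide notation (apostrophe is complement, + is OR, juxtaposition is AND); the function name `evaluate` and the `table` dictionary are illustrative only.

```python
from itertools import product

def evaluate(x1, x2, x3, x4):
    """Evaluate the example network; intermediate nodes are computed first."""
    k1 = x2 or x3
    k2 = ((not x1) and x2 and x4) or k1
    k3 = k1 and (not x4)
    out1 = k2 or (not x2)
    out2 = k3 or x1
    return bool(out1), bool(out2)

# Full truth table over the four primary inputs.
table = {bits: evaluate(*bits) for bits in product([False, True], repeat=4)}

for bits, outs in sorted(table.items()):
    print(bits, outs)
```

With the equations as written here, out1 contains both x2 (through k1 and k2) and x2’, so it evaluates to 1 for every input. That is the kind of redundancy the simplification step is meant to remove.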
Terms Support: set of variables used by a function. Transitive fanout: all the primary outputs and intermediate variables that depend on a function. Transitive fanin: all the primary inputs and intermediate variables used by a function. Transitive fanin determines a cone of logic. cone primary inputs output
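These terms can be made concrete on the Boolean network example. The sketch below assumes a hypothetical dictionary encoding `FANIN` that lists the immediate inputs of each node; the helper names are illustrative, not taken from any tool.

```python
# Immediate fanin of each node in the example network (hypothetical encoding).
FANIN = {
    "k1": ["x2", "x3"],
    "k2": ["x1", "x2", "x4", "k1"],
    "k3": ["k1", "x4"],
    "out1": ["k2", "x2"],
    "out2": ["k3", "x1"],
}

def support(node):
    """Variables that appear directly in the node's function."""
    return set(FANIN[node])

def transitive_fanin(node):
    """All primary inputs and intermediate nodes the node depends on (its cone)."""
    seen = set()
    stack = list(FANIN.get(node, []))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(FANIN.get(n, []))  # primary inputs have no fanin
    return seen

def transitive_fanout(node):
    """All intermediate nodes and primary outputs that depend on the node."""
    return {n for n in FANIN if node in transitive_fanin(n)}

print(sorted(support("out2")))
print(sorted(transitive_fanin("out2")))
print(sorted(transitive_fanout("k1")))
```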
Partially-specified function x1 x2 x3 1 don’t care
Optimizations Simplification. Changing the way a function is represented. Network restructuring. Adding and removing nodes. Delay restructuring. Optimizations that reduce the height of critical paths.
Partial collapsing f1 f4 F f4 f2 f3 f3 before after
Technology mapping Cover the function:
FPGA tech mapping Cost (number of inputs) doesn’t always increase with added functions:
FPGAs vs. custom logic Cost metric for static gates is literal: ax + bx’ has four literals, requires 8 transistors. Cost metric for FPGAs is logic element: All functions that fit in an LE have the same cost.
LUT-based logic synthesis Find the largest logic cone that will fit into the LUT: r = q + s’ s = d’ q = g’ + h d = a + b
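For the cone above, collapsing s, q and d into r leaves only a, b, g and h as inputs, so the whole cone fits in one 4-input LUT. The sketch below assumes a 4-input LUT and an arbitrary input ordering (a is the most significant index bit); it builds the 16-bit LUT contents by enumeration.

```python
from itertools import product

def r(a, b, g, h):
    """Cone rooted at r, written out node by node."""
    d = a or b
    s = not d
    q = (not g) or h
    return q or (not s)

inputs = ("a", "b", "g", "h")   # support of the collapsed cone
K = 4                           # assumed LUT size
assert len(inputs) <= K         # the cone fits in one LUT

# One configuration bit per input combination, in product() order.
lut_bits = [int(r(*bits)) for bits in product([False, True], repeat=len(inputs))]
print(lut_bits)
```

Since r reduces to g’ + h + a + b, only one of the 16 entries is 0.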
How much fits in a LUT? One 2-input NAND gate frequently used for comparison. Approximately 12 ~ 15 gates per four-input LUT. 2^16 functions -> 80 after IO swapping -> 14 after IO inversion 4-input determined to be optimal [Rose 1990] A B C D A B C D
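The function counts can be reproduced by brute force for small LUTs. The sketch below is one reading of the slide, assuming "IO swapping" means permuting inputs and "IO inversion" means also complementing any inputs and the output; `count_classes` is a hypothetical helper. A k-input LUT implements 2^(2^k) functions, and each function is stored as an integer truth table.

```python
from itertools import permutations

def count_classes(k, allow_inversion):
    """Count k-input functions up to input permutation, optionally also
    up to input and output complementation."""
    n_rows = 1 << k
    transforms = []
    for perm in permutations(range(k)):
        masks = range(n_rows) if allow_inversion else [0]
        for mask in masks:
            mapping = []
            for row in range(n_rows):
                flipped = row ^ mask            # complement selected inputs
                src = 0
                for i in range(k):              # then permute input positions
                    if (flipped >> i) & 1:
                        src |= 1 << perm[i]
                mapping.append(src)
            transforms.append(mapping)
    out_flips = [0, (1 << n_rows) - 1] if allow_inversion else [0]
    seen, classes = set(), 0
    for f in range(1 << n_rows):
        if f in seen:
            continue
        classes += 1                            # f starts a new class
        for mapping in transforms:
            g = 0
            for row in range(n_rows):
                if (f >> mapping[row]) & 1:
                    g |= 1 << row
            for flip in out_flips:
                seen.add(g ^ flip)              # mark the whole orbit
    return classes

print(2 ** (2 ** 4))                 # functions of a 4-input LUT
print(count_classes(3, False), count_classes(3, True))
```

Under these assumptions the enumeration gives 80 and 14 classes for three inputs (256 functions); the corresponding counts for four inputs are larger.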
Technology-Independent Logic Optimization Improve circuit based on cost Keep same functionality Boolean Evaluation/decomposition Simple factoring -> minimizing literals f = ac + ad + bc + bd g = a + b + c e = a + b g = e + c f = e(c + d)
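The factoring example can be checked exhaustively, and the literal saving counted. This is a small sketch; the `literals` helper simply counts letters in an expression string and is only meant for expressions written like the ones on the slide.

```python
from itertools import product

def flat(a, b, c, d):
    f = (a and c) or (a and d) or (b and c) or (b and d)
    g = a or b or c
    return f, g

def factored(a, b, c, d):
    e = a or b                  # shared intermediate function
    g = e or c
    f = e and (c or d)
    return f, g

# Same functionality for all 16 input combinations.
assert all(flat(*v) == factored(*v) for v in product([False, True], repeat=4))

def literals(*exprs):
    """Count variable occurrences (letters) in the given expressions."""
    return sum(ch.isalpha() for expr in exprs for ch in expr)

before = literals("ac + ad + bc + bd", "a + b + c")
after = literals("a + b", "e + c", "e(c + d)")
print(before, after)
```

Extracting e = a + b reduces the network from 11 literals to 7 without changing either output.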
Factorization Based on division: formulate candidate divisor; test how it divides into the function; if g = f/c, we can use c as an intermediate function for f. Algebraic division: doesn’t take Boolean simplification into account. Less expensive than Boolean division.
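Algebraic division can be sketched by treating a function as a set of cubes and each cube as a set of literals. The code below is a minimal illustration under simplifying assumptions: single-letter literals, no complemented literals, and a hypothetical `cubes` parser for expressions such as "ac + ad". It returns a quotient and remainder with f = quotient * divisor + remainder.

```python
def cubes(expr):
    """Parse 'ac + ad' into a set of cubes, each a frozenset of literals."""
    return {frozenset(term) for term in expr.replace(" ", "").split("+")}

def algebraic_divide(f, d):
    """Divide cube set f by cube set d without any Boolean simplification."""
    quotient = None
    for d_cube in d:
        # Cubes of f that contain this divisor cube, with it removed.
        partial = {cube - d_cube for cube in f if d_cube <= cube}
        quotient = partial if quotient is None else quotient & partial
    quotient = quotient or set()
    covered = {q | d_cube for q in quotient for d_cube in d}
    remainder = f - covered
    return quotient, remainder

q, rem = algebraic_divide(cubes("ac + ad + bc + bd"), cubes("a + b"))
print(q, rem)
```

Dividing ac + ad + bc + bd by a + b yields the quotient c + d with an empty remainder, which matches the factoring f = (a + b)(c + d).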
Library-based Technology Mapping – MIS II Three steps: decomposition, matching, covering Circuit first decomposed into NAND representations Different collections of NANDs can be implemented differently in VLSI Inv, cost 2 NAND2, cost 3 AOI-21, cost 4
MIS II Cost = Decompose into NAND-2 using Boolean techniques Use dynamic programming to match subtrees with libraries Choose lowest cost implementation that covers all primitives.
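The matching and covering steps can be illustrated with a small dynamic program over a NAND2/inverter tree. This is a sketch only, not the MIS II implementation: it assumes the three library cells and costs listed on the previous slide, a hypothetical tuple encoding for tree nodes, and the identity AOI-21(a, b, c) = INV(NAND(NAND(a, b), INV(c))).

```python
from functools import lru_cache

INV, NAND2, AOI21 = 2, 3, 4     # cell costs from the library slide

# Node encoding (hypothetical): ("in", name), ("inv", x), ("nand", x, y)

@lru_cache(maxsize=None)
def best_cost(node):
    """Minimum-cost cover of the subtree rooted at node."""
    kind = node[0]
    if kind == "in":
        return 0                                    # primary input, no gate
    if kind == "nand":
        return NAND2 + best_cost(node[1]) + best_cost(node[2])
    # kind == "inv": a plain inverter, or the root of an AOI-21 match
    options = [INV + best_cost(node[1])]
    inner = node[1]
    if inner[0] == "nand":
        for left, right in ((inner[1], inner[2]), (inner[2], inner[1])):
            if left[0] == "nand" and right[0] == "inv":
                options.append(AOI21 + best_cost(left[1])
                               + best_cost(left[2]) + best_cost(right[1]))
    return min(options)

a, b, c = ("in", "a"), ("in", "b"), ("in", "c")
tree = ("inv", ("nand", ("nand", a, b), ("inv", c)))
print(best_cost(tree))
```

Covering this tree gate by gate would cost 2 + 3 + 3 + 2 = 10, while the single AOI-21 match costs 4, so the dynamic program selects the AOI-21.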
Tech Mapping for LUTs Minimize total number of LUTs Minimize the number of levels of LUTs Many different approaches Partitioning -> Flowmap BDDs -> XMAP Chortle -> Covering Basic Xilinx tech mapping follows Chortle with modification to handle registers.
Chortle-crf Dynamic programming approach Minimize # LUTs (primary goal) Minimize # of inputs the circuit root uses (secondary goal) Operates on AND-OR circuits. Locate boundaries [Figure: network with signals A through M and nodes w, x, y, z]
Chortle-crf Major innovation is bin packing Simultaneously addresses decomposition and matching Goal: Find decomposition of every node in the network that minimizes # LUTs in final circuit Without decomp 4-LUTs With decomposition 2-LUTs
Mapping Each Tree Dynamically visit each node in the graph Fanin nodes drive the node under evaluation Boxes -> fanin LUTs, cost is number of inputs Bins -> N input LUT (in this case 5) First Fit Decreasing
/* construct 2-level decomp */
box list <- fanin LUTs sorted by size
bin list <- 0
while (box list is not 0) {
    box <- largest LUT remaining in box list
    find first bin that will contain box
    if bin doesn’t exist
        create new bin and pack box into it
    else
        pack box into existing bin
}
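The First Fit Decreasing step can be written out as runnable code. The sketch below assumes each box is described only by its size (number of inputs used by a fanin LUT) and that no inputs are shared between boxes; the function name is illustrative.

```python
def first_fit_decreasing(box_sizes, k):
    """Pack boxes into bins of capacity k, largest box first.

    Returns a list of bins; each bin is a list of box sizes summing to <= k.
    """
    bins = []
    for box in sorted(box_sizes, reverse=True):
        if box > k:
            raise ValueError("box does not fit in any bin")
        for b in bins:
            if sum(b) + box <= k:       # first bin with enough free inputs
                b.append(box)
                break
        else:
            bins.append([box])          # no bin fits: open a new one
    return bins

print(first_fit_decreasing([3, 2, 2, 2, 1], k=5))
```

For five fanin LUTs of sizes 3, 2, 2, 2 and 1 packed into 5-input bins, this produces two second-level LUTs, both completely filled.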
Multi-Level Decomposition Chain LUTs together Output of largest second level LUT connected to LUT with unused input May need to add a new LUT Leads to min LUTs and fanout LUT with smallest # input This fanout LUT used as input to next stage
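The chaining step can be sketched as follows. This is an interpretation of the bullets above rather than the published procedure: each second-level LUT is represented only by its number of used inputs, the most-filled LUT is removed and its output is connected to the first remaining LUT with a free input, and a new LUT is created when none has room. The function name and return values are illustrative.

```python
def chain_luts(bin_fill, k):
    """Chain second-level LUTs into a multi-level decomposition.

    bin_fill: used inputs of each second-level LUT (non-empty list).
    Returns (total number of LUTs, used inputs of the final root LUT).
    """
    if not bin_fill or k < 2:
        raise ValueError("need at least one LUT and k >= 2")
    luts = sorted(bin_fill, reverse=True)
    total = len(luts)
    while len(luts) > 1:
        luts.pop(0)                     # most-filled LUT: its output needs a home
        for i, used in enumerate(luts):
            if used < k:                # first LUT with an unused input
                luts[i] = used + 1
                break
        else:
            luts.append(1)              # add a new LUT fed by that output
            total += 1
        luts.sort(reverse=True)
    return total, luts[0]

print(chain_luts([5, 5], k=5))
print(chain_luts([5, 3], k=5))
```

Two full 5-input LUTs need a third LUT to combine them, whereas a full LUT and a LUT with three used inputs can be chained without adding any hardware.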
Decomposition Examples [Figure: a) Fanin LUTs u, v, w, x, y; b) Two-level decomposition with second-level LUTs z.1 and z.2; c) Multi-level decomposition]
Optimality For LUTs with fewer than 6 inputs Chortle will create an optimal result for subtree Combination of sub-trees is not optimized. Local optimizations needed to ensure global optimality. Reconvergent paths -> net drives multiple gates. Replicating logic -> creating additional fanout
Translating a Design to an FPGA Improve 2-level decomposition to take fanout into account Replace FFD with an exhaustive search that repeatedly invokes FFD. Try both with and without reconvergent path and select best mapping (forced merging) Inputs must reconverge at node being decomposed.
Reconvergent Paths Frequently, more than one pair of fan-in LUTs share inputs For each combination of pairs that share inputs, perform FFD. Two-level decomp with fewest bins and smallest least filled bin retained
Reconverge
pair list <- all pairs of fanin LUTs with shared inputs
best LUTs <- 0
for all possible combinations of pairs from pair list {
    merged LUTs <- copy of fanin LUTs with forced merge
    FFD(merged LUTs)
    if result is better than best LUTs, best LUTs <- result /* best combo */
}
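A simplified version of this search can be sketched in a few lines. To stay short it departs from the slide in one respect, stated here as an assumption: it tries the unmerged case and then one forced merge at a time, rather than every combination of pairs. Boxes are sets of input names, a merged box costs the size of the union, and all names are illustrative.

```python
from itertools import combinations

def ffd(sizes, k):
    """First Fit Decreasing on box sizes with bin capacity k."""
    bins = []
    for s in sorted(sizes, reverse=True):
        for b in bins:
            if sum(b) + s <= k:
                b.append(s)
                break
        else:
            bins.append([s])
    return bins

def score(bins):
    """Fewest bins first, then the smallest least-filled bin."""
    return (len(bins), min(sum(b) for b in bins))

def best_two_level(boxes, k):
    """boxes: non-empty list of sets of input names, each of size <= k."""
    best = ffd([len(b) for b in boxes], k)          # no forced merge
    for i, j in combinations(range(len(boxes)), 2):
        if not boxes[i] & boxes[j]:
            continue                                # pair shares no inputs
        merged = boxes[i] | boxes[j]                # shared inputs counted once
        if len(merged) > k:
            continue                                # merge would not fit a LUT
        rest = [len(b) for n, b in enumerate(boxes) if n not in (i, j)]
        candidate = ffd([len(merged)] + rest, k)
        if score(candidate) < score(best):
            best = candidate
    return best

boxes = [{"a", "b", "c"}, {"a", "b", "d"}, {"e"}]
print(best_two_level(boxes, k=5))
```

Without merging, the box sizes 3, 3 and 1 need two 5-input bins. Forcing the first two boxes together costs only four inputs because a and b are shared, so everything fits in one bin.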
Maximum Share Decreasing Exhaustive search prohibitive Select box using following criteria Greatest # inputs Shares greatest # inputs with any existing bin Shares greatest # of inputs with existing (remaining) boxes Reduces to FFD for no input sharing Points 2 and 3 optimize network sharing
Node Replication Without Replication With Replication Apply replication to fanout nodes Map without replication first Locally decompose fanout nodes to determine savings Ordering important
Results – Chortle-crf 20 netlists mapped to 5-input LUTs Reconvergence reduced LUTs by 2.7% Replication reduced LUTs by 3.7% Combined 14% reduction achieved Replication exposes reconvergent paths creating additional opportunities for optimization.
Chortle-d Minimize delay through circuit Generally increases hardware required Reduced logic levels by 38% Increased # LUTs by 79% Note most delay in FPGA in interconnect
Other Approaches MIS-PGA Groups inputs into LUTs Decompose into 4-LUTs (Roth-Karp) 47 times slower than Chortle 14% fewer LUTs XMAP Represent circuit as BDDs Effective for multiplexer based devices. Also, BDS-PGA
Flowmap 1. Use network flow to partition circuit. 2. Find the minimum cut, whose size equals the maximum flow. 3. Cut until LUTs of size N achieved.
Taking Flip flops into Account FPGA devices contain fixed resources – FFs Technology mapping should take these into account Consider fanout nodes. FF
LUT Packing - VPACK Seed BLE – choose BLE with most inputs. Select next BLE -> BLE which shares most inputs and outputs with cluster Continue until cluster is full or adding any BLE will overflow I -> # inputs Hill Climbing – exceed I limit temporarily to find better minimum.
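The greedy part of this clustering can be sketched as below. It is an illustration under stated assumptions, not the VPACK source: each BLE is a (set of input nets, output net) pair, `n` is the number of BLEs per cluster, `i_max` is the cluster input limit I, and the hill-climbing step that temporarily exceeds I is left out. All names are hypothetical.

```python
def external_inputs(members, bles):
    """Nets the cluster must bring in from outside."""
    ins = set().union(*(bles[m][0] for m in members))
    outs = {bles[m][1] for m in members}
    return ins - outs                   # internally generated nets are free

def pack_one_cluster(bles, n, i_max):
    """bles: dict name -> (set of input nets, output net). Returns one cluster."""
    remaining = dict(bles)
    seed = max(remaining, key=lambda b: len(remaining[b][0]))   # most inputs
    cluster = [seed]
    del remaining[seed]
    while len(cluster) < n and remaining:
        nets = set()
        for m in cluster:
            nets |= bles[m][0] | {bles[m][1]}
        feasible = [b for b in remaining
                    if len(external_inputs(cluster + [b], bles)) <= i_max]
        if not feasible:
            break                       # any further BLE would overflow I
        # Pick the BLE sharing the most input/output nets with the cluster.
        pick = max(feasible,
                   key=lambda b: len((bles[b][0] | {bles[b][1]}) & nets))
        cluster.append(pick)
        del remaining[pick]
    return cluster

bles = {
    "A": ({"a", "b", "c", "d"}, "n1"),
    "B": ({"n1", "a", "e"}, "n2"),
    "C": ({"x", "y"}, "n3"),
    "D": ({"n2", "b"}, "n4"),
}
print(pack_one_cluster(bles, n=3, i_max=6))
```

In this small example A is the seed, B is attracted next because it uses A's output and shares input a, and D follows because it uses B's output. C shares nothing with the cluster and would push the input count past the limit.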
Summary Many tech mapping algorithms exist to minimize delay/area Chortle uses a dynamic programming heuristic to perform mapping Largely a solved problem More sophisticated techniques evaluated recently