DAOmap: A Depth-optimal Area Optimization Mapping Algorithm for FPGA Designs Deming Chen and Jason Cong Computer Science Department University of California,

Presentation transcript:

DAOmap: A Depth-optimal Area Optimization Mapping Algorithm for FPGA Designs
Deming Chen and Jason Cong
Computer Science Department, University of California, Los Angeles
This work is partially supported by the California MICRO program and the NSF Grant CCR

Outline
- Introduction
- Related Works
- Definitions and Problem Formulation
- Algorithm Description
  - Cut Enumeration
  - Delay and Area Propagation
  - Cost Function for a Cut
  - Global and Local Cost Adjustments
  - Iterative Cut Selection
- Experimental Results
- Conclusions and Future Work

Introduction
- Field-programmable gate arrays (FPGAs) have become increasingly popular
  - Fast time to market
  - No or very low NRE (non-recurring engineering) cost
- The LUT-based FPGA architecture dominates the programmable chip industry
- FPGA technology mapping converts a given Boolean circuit into a functionally equivalent network composed only of LUTs
- FPGA technology mapping is a crucial optimization step in the FPGA design flow

Related Works on FPGA Mapping
- Area Minimization
  - Chortle-crf [Francis et al, DAC’91]
  - MIS-pga [Murgai et al, ICCAD’91]
  - Praetor [Cong et al, FPGA’99]
  - Anti-fuse FPGA Mapper [Kang et al, ASPDAC’04]
- Delay Minimization
  - DAG-Map [Chen et al, DTC’92]
  - FlowMap [Cong et al, ICCAD’92]
  - Edge-map [Yang et al, ICCAD’94]
- Power Minimization
  - PowerMinMap [Li et al, ASPDAC’03]
  - Emap [Lamoureux et al, ICCAD’03]
  - DVmap [Chen et al, FPGA’04]
- Simultaneous Delay and Area Minimization
  - FlowMap-r [Cong et al, TVLSI’94]
  - CutMap [Cong et al, FPGA’95]
  - BoolMap-D [Legl et al, DAC’96]

Definitions
- DAG: a Boolean network
- Cone $C_v$: a sub-network rooted on a node v
- K-feasible cone: $|\mathrm{input}(C_v)| \le K$
- Fanin cone $F_v$: the largest $C_v$
- K-feasible cut: a K-feasible $C_v$
  - Occupies a K-LUT
- Unit delay model
  - One LUT contributes one unit delay
  - No edge delay
[Figure: example network with nodes a, b, c, d, e and root v, labeling the PIs, the fanin cone $F_v$, a 3-feasible cone $C_v$, and a delay of 2]

Problem Formulation
- Delay-optimal area optimization problem
  - Given: a Boolean network and an integer K
  - Goal: cover the network with K-feasible cones (K-LUTs) such that
    - the mapping depth is optimal
    - the area (number of LUTs) is minimized
- Area minimization is an NP-hard problem

Highlights of Our Algorithm
- Consider potential node duplications and make mapping-area estimation close to reality
- Search the solution space considering both global and local optimality information
- Carry out an iterative cut selection procedure on top of cost adjustment to further improve solution quality
- Each technique used is simple and intuitive
  - The key is the right combination of them

Cut Enumeration
- Combine sub-cuts on the inputs of the gate
- Process each gate in topological order from PIs to POs
[Figure: two copies of an example network with nodes a, b, c, d, w, x, y, z, showing a sub-cut and another sub-cut combined into a new cut]
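The enumeration step described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the DAOmap implementation (which the slides say is written in C): the function name `enumerate_cuts`, the dictionary-based network representation, and the small example network are all assumptions made for the sketch.

```python
from itertools import product


def enumerate_cuts(fanins, order, k):
    """Return {node: set of cuts}; each cut is a frozenset of at most k leaves.

    fanins: dict mapping each gate to its list of fanin nodes
            (primary inputs have no entry). Hypothetical representation.
    order:  all nodes in topological order, primary inputs first.
    """
    cuts = {}
    for node in order:
        trivial = frozenset([node])
        ins = fanins.get(node, [])
        if not ins:
            # A primary input has only its trivial cut.
            cuts[node] = {trivial}
            continue
        merged = set()
        # Pick one sub-cut per fanin and take the union of their leaves.
        for combo in product(*(cuts[i] for i in ins)):
            leaves = frozenset().union(*combo)
            if len(leaves) <= k:          # keep only K-feasible cuts
                merged.add(leaves)
        merged.add(trivial)               # lets fanout gates use this node as a leaf
        cuts[node] = merged
    return cuts


# Hypothetical network: x = f(a, b), y = g(c, d), z = h(x, y).
fanins = {"x": ["a", "b"], "y": ["c", "d"], "z": ["x", "y"]}
order = ["a", "b", "c", "d", "x", "y", "z"]
for cut in sorted(enumerate_cuts(fanins, order, 3)["z"], key=sorted):
    print(sorted(cut))
```

With K = 3 the union {a, b, c, d} is rejected for z because it has four leaves, which is why larger K values increase the number of cuts kept per node.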

Complexity Analysis
- The worst-case number of cuts on a node is $O(n^K)$
- In practice, it is a small constant for small K
- Average over the 20 largest MCNC benchmarks

Delay and Area Propagation
- The propagation process visits cuts and nodes iteratively
- The longest best delay on the POs is the optimal mapping delay
[Figure: example network with nodes a, b, c, d, e, f, g, w, x, y, z, annotated with per-cut and per-node values (delay 1, area 1; delay 2, area 2; delay 2, area 3)]
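Under the unit delay model, the delay part of this propagation can be sketched as follows. The sketch assumes that primary inputs arrive at time 0, that the delay of a cut is one plus the largest best delay among its leaves, and that a node's best delay is the minimum over its non-trivial cuts; the function name and data layout are hypothetical.

```python
def propagate_delay(cuts, order, primary_inputs):
    """Return {node: best achievable LUT depth} under the unit delay model.

    cuts:  dict mapping each gate to a set of frozenset cuts (trivial cut allowed).
    order: all nodes in topological order, primary inputs first.
    """
    best = {}
    for node in order:
        if node in primary_inputs:
            best[node] = 0                # assumption: PIs arrive at time 0
            continue
        best[node] = min(
            1 + max(best[leaf] for leaf in cut)   # one LUT on top of the slowest leaf
            for cut in cuts[node]
            if cut != frozenset([node])           # the trivial cut is not a LUT
        )
    return best


# Hypothetical 3-feasible cuts for x = f(a, b), y = g(c, d), z = h(x, y).
cuts = {
    "x": {frozenset("x"), frozenset("ab")},
    "y": {frozenset("y"), frozenset("cd")},
    "z": {frozenset("z"), frozenset("xy"), frozenset("aby"), frozenset("xcd")},
}
order = ["a", "b", "c", "d", "x", "y", "z"]
best = propagate_delay(cuts, order, {"a", "b", "c", "d"})
print(best["z"])  # mapping depth if z is the only primary output
```

The optimal mapping depth is then the maximum of these best delays over the primary outputs.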

Area Estimation
- $A_C = \sum_{i \in \mathrm{input}(C)} \frac{A_i}{f(i)} + U_C$
  - $A_i$: estimated area of the fanin cone on signal i
  - $f(i)$: fanout number of i
  - $U_C$: area of the cut itself
- Tries to estimate area considering the fanout effect
  - Praetor [Cong et al, FPGA’99]
- Can underestimate the area because of node duplications
[Figure: example with nodes m, n, o, p, q, r, s, t, u, showing cuts $C_t$, $C_u$ and $C$, with f(p) = 2 and the terms $A_p$ and $A_s / 2$]
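The area estimate above is a one-line computation once the per-signal areas and fanout counts are known. The following sketch evaluates it for a single cut; the function name, the choice of a unit cut area of 1 (one LUT), and the example numbers are assumptions for illustration.

```python
def cut_area(cut, area, fanout, unit_area=1.0):
    """Estimated area of a cut: sum of A_i / f(i) over its inputs, plus U_C.

    cut:       iterable of input signals of the cut.
    area:      dict giving A_i, the estimated area of the fanin cone on signal i.
    fanout:    dict giving f(i), the fanout number of signal i.
    unit_area: U_C, the area of the cut itself (assumed to be one LUT here).
    """
    return sum(area[i] / fanout[i] for i in cut) + unit_area


# Hypothetical numbers: p drives two fanouts, so only half of its cone area
# is charged to this cut; q is a primary input with no area of its own.
area = {"p": 2.0, "q": 0.0}
fanout = {"p": 2, "q": 1}
print(cut_area({"p", "q"}, area, fanout))
```

Dividing by the fanout shares a cone's area among its consumers, which is also why the estimate can come out too low when the mapper later duplicates that cone.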

Cost (Area) Function of a Cut
- Some key parameters
  - $I_C$: cut size of C
  - $N_C$: number of nodes covered by C
  - $f(v)$: fanout number of the root node v
  - $P_f$: duplication cost
[Figure: root node v with fanin1 and fanin2 and nodes a, b, c, d, e, showing cuts $C_1$, $C_2$ and $C_3$]

Duplication Cost Adjustment
- Consider potential node duplications
  - Check the sub-cuts for multiple fanouts
  - Propagate adjusted cost globally
- Duplication cost:
  - $N_{C_f}$: number of nodes the sub-cut $C_f$ contains
  - $I_C$: cut size of C
[Figure: nodes m, n, o, p, q, r, s; sub-cut $C_{f1}$ and sub-cut $C_{f2}$ ($N_{C_{f2}} = 1$, multiple fanouts) combined into a new cut C with $I_C = 4$]

Cut Selection – Mapping Generation
- From POs to PIs
- Critical paths: optimal delay + best area available
- Non-critical paths: relaxed delay + better area
[Figure: example network with nodes a, b, c, d, e, f, g, w, x, y, z, marking a critical LUT and a non-critical LUT]
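A backward pass of this kind can be sketched as below. It is only an outline under stated assumptions: nodes are visited in reverse topological order so that every LUT reading a node has already set that node's required time, each root picks the cheapest cut that still meets its required time, and the `cost` callable is a hypothetical stand-in, not DAOmap's actual cost function.

```python
def generate_mapping(cuts, best_delay, cost, order, primary_inputs, primary_outputs):
    """Select one cut per LUT root, walking from POs back to PIs.

    cuts:       dict mapping each gate to a set of frozenset cuts.
    best_delay: dict of best achievable depth per node (0 for PIs).
    cost:       hypothetical callable cost(node, cut) -> number, lower is better.
    order:      all nodes in topological order, primary inputs first.
    """
    depth = max(best_delay[po] for po in primary_outputs)
    required = {po: depth for po in primary_outputs if po not in primary_inputs}
    chosen = {}
    for node in reversed(order):
        if node not in required:          # not a LUT root in this mapping
            continue
        # Cuts whose arrival time still meets the root's required time.
        feasible = [
            c for c in cuts[node]
            if c != frozenset([node])
            and 1 + max(best_delay[leaf] for leaf in c) <= required[node]
        ]
        chosen[node] = min(feasible, key=lambda c: cost(node, c))
        for leaf in chosen[node]:
            if leaf not in primary_inputs:
                # A leaf becomes a LUT root; keep its tightest required time.
                required[leaf] = min(required.get(leaf, depth), required[node] - 1)
    return chosen


cuts = {
    "x": {frozenset("x"), frozenset("ab")},
    "y": {frozenset("y"), frozenset("cd")},
    "z": {frozenset("z"), frozenset("xy"), frozenset("aby"), frozenset("xcd")},
}
best_delay = {"a": 0, "b": 0, "c": 0, "d": 0, "x": 1, "y": 1, "z": 2}
order = ["a", "b", "c", "d", "x", "y", "z"]
mapping = generate_mapping(cuts, best_delay, lambda n, c: len(c), order,
                           {"a", "b", "c", "d"}, ["z"])
print({root: sorted(cut) for root, cut in mapping.items()})
```

On a critical path the required time equals the best delay, so only delay-optimal cuts are feasible; on a non-critical path the required time is looser and cheaper cuts become eligible.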

Techniques for Better Cut Selection
- Cut selection is equivalent to a min-cover problem
  - A greedy approach will not work well
  - Use heuristics to guide the selection
- Iterative cut selection procedure
- Local cost adjustment
  - Input sharing
  - Slack distribution
  - Cut probing

Iterative Cut Selection (ICS)
- Some valuable information on area is unknown until after mapping
  - Mapped LUT root nodes
  - Duplicated nodes
- ICS carries out multiple mapping iterations
[Figure: flowchart with the steps Start, Mapping, Iteration i (i++), Profiling data, and Adjust cut cost; the loop repeats while i < threshold and exits if i = threshold]
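The control flow of the iteration loop can be sketched as follows. Only the loop structure is taken from the slide; the callables `map_once` and `adjust_costs`, the shape of the profiling data, and the toy cost table are all hypothetical placeholders for the real mapping and cost-adjustment steps.

```python
def iterative_cut_selection(map_once, adjust_costs, costs, threshold=3):
    """Run `threshold` mapping iterations, feeding profiling data back into costs.

    map_once:     hypothetical callable costs -> (mapping, profile).
    adjust_costs: hypothetical callable (costs, profile) -> new costs.
    """
    mapping = None
    for i in range(threshold):
        mapping, profile = map_once(costs)
        if i + 1 < threshold:             # no adjustment needed after the last pass
            costs = adjust_costs(costs, profile)
    return mapping


# Toy stand-ins: two candidate cuts, and a profile reporting that c1 causes
# a node duplication, which makes c1 more expensive in the next iteration.
def map_once(costs):
    choice = min(costs, key=costs.get)
    return choice, {"duplicated": [choice] if choice == "c1" else []}


def adjust_costs(costs, profile):
    updated = dict(costs)
    for cut in profile["duplicated"]:
        updated[cut] += 1.0
    return updated


print(iterative_cut_selection(map_once, adjust_costs, {"c1": 2.0, "c2": 2.5}))
```

The first iteration has no profile to draw on, so its choice reflects only the estimated costs; later iterations can correct choices that turned out to duplicate nodes.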

Local Cost Adjustment – Input Sharing
- Takes advantage of existing resources
  - Considers roots from previous iterations
- The more inputs a cut shares with others, the better for the cut
[Figure: nodes d, e, f, g, marking nodes that become LUT roots, inputs shared with existing LUTs, and a duplicated node]

Local Cost Adjustment – Slack Distribution
- $Slack_C = Req_v - 1 - \max_{i \in \mathrm{input}(C)} (Arr_i)$
- If $Slack_C < 0$, C is not a timing-feasible cut
- The larger $Slack_C$ is, the better for C in terms of the slack distribution effect
[Figure: example network with nodes a, b, c, d, w, x, y, z, labeling the largest arrival time among the inputs and $Req_d$, the required time of the root of C]
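As a worked instance of the slack formula, the sketch below computes the slack of a cut from the required time of its root and the arrival times of its inputs. The function name and the example numbers are assumptions; the subtraction of 1 is the unit delay of the LUT that implements the cut.

```python
def cut_slack(cut, required_root, arrival):
    """Slack of a cut: Req_v - 1 - max(Arr_i) over the cut's inputs.

    cut:           iterable of input signals of the cut.
    required_root: Req_v, required time of the root node of the cut.
    arrival:       dict giving Arr_i, the arrival time of each input signal.
    """
    return required_root - 1 - max(arrival[i] for i in cut)


# Hypothetical numbers: the root must be ready by time 3.
arrival = {"a": 0, "b": 1, "w": 3}
print(cut_slack({"a", "b"}, 3, arrival))   # inputs are early, positive slack
print(cut_slack({"a", "w"}, 3, arrival))   # w arrives too late, negative slack
```

A cut with negative slack cannot be used without violating the required time at its root, so only cuts with slack of zero or more remain candidates.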

Local Cost Adjustment – Cut Probing
- Probe the amount of area gain locally before making decisions about a cut
  - Reduce connections between LUTs
  - Reduce potential node duplications based on previous duplication profiling
  - Reconvergent paths handling
- Use $C_{final}$ to guide cut selection

Experimental Results – Settings
- DAOmap is implemented in C within the UCLA RASP system
- Compare LUT counts and runtime to CutMap [Cong et al, FPGA’95]
  - Run on a 750 MHz SunBlade-1000 Solaris machine
  - Test on LUT input numbers from 4 to 6
- Benchmarks
  - 20 largest MCNC benchmarks
  - A set of large industrial benchmarks

Experimental Results of DAOmap over CutMap on MCNC Benchmarks
After mapping
  LUT size   Average Area Reduction   Average Run Time Improvement
  4-LUT      -13.98%                  13.2X
  5-LUT      -16.02%                  24.2X
  6-LUT      -12.44%                  4.7X
After mapping + packing: (daomap + mpack) vs. (“cutmap –x” + mpack)
  LUT size   Average Area Reduction   Average Run Time Improvement
  4-LUT      -7.50%                   57.7X
  5-LUT      -11.31%                  38.7X
  6-LUT      -7.90%                   10.1X

Detailed Experimental Results on Industrial Benchmarks
After mapping into 5-LUTs
Columns: Benchmarks | CutMap LUT No. | CutMap Run Time (s) | DAOmap LUT No. | DAOmap Run Time (s) | Comparison: LUT (Reduce) | Comparison: Run Time (Improve)
  big %3.2
  big2->10H
  big %272.9
  big %3.7
  big5->10H
  big %35.9
  Ave %78.9X

Individual Technique Analysis
  Techniques                          % dropped
  Cut Enumeration
    Min-cost propagation              4.35%
    Global cost adjustment            2.68%
  Cut Selection
    Input sharing                     4.55%
    Iterative cut selection (ICS)     2.04%
    Others                            <1%

Mapping Iteration Analysis
[Figure: improvement (%), on an axis from 0.0% to 2.5%, versus number of mapping iterations]
- For a single iteration only (the base case), use manual profiling [Chen et al, FPGA’04]
- More than 3 iterations give no further improvement

Conclusions and Future Work
- We presented a new mapping algorithm, DAOmap, to minimize FPGA delay and area
- We built several cost-adjustment heuristics and used an iterative mapping procedure
- DAOmap achieved significant area and runtime reductions over CutMap, a state-of-the-art algorithm
- Future work includes adding cut-pruning techniques for mapping with larger K values