Mapping for Better Than Worst-Case Delays in LUT-Based FPGA Designs
Kirill Minkovich and Jason Cong
VLSI CAD Lab, Computer Science Department, University of California, Los Angeles
Supported by the National Science Foundation under grant CCF-0530261
Variation and its effects
- Environmental variation
  - Causes: overheating and voltage fluctuations
  - Addressed (in part) by: cooling and better power supplies
- Process variation
  - Causes: dopant density, edge geometry, stress during manufacturing, and much more (e.g., intra-die variations in ILD thickness)
  - Addressed (in part) by: adding a slack of as much as 3 sigma for delay variation
- Data variation
  - Causes: output stabilization time varying greatly between different data
  - Addressed by: highly restrictive asynchronous designs and the Razor architecture
- Solutions
  - Speed binning and more accurate estimates: only deals with process variation
  - Variable clocking (Razor architecture): deals with all 3 variations!
High-performance circuits
- Worst-case delay minimization
  - Hitting a wall due to feature-size limits
  - Can't keep up with Moore's law
  - Conservative timing due to variation
- Typical-case delay minimization
  - Defined: delay for the expected data to propagate through the circuit
  - Usually much smaller than the worst-case delay
  - Harder to optimize circuits: requires a change in thinking about circuit optimization
  - Requires a special architecture, like the Razor architecture (MICRO '03)
Razor flip-flop implementation
[Figure: Razor FF between logic stages L1 and L2; a main flip-flop clocked by clk and a shadow latch clocked by the delayed clk_del both sample D, with an error comparator and a mux restoring the shadow value on a mismatch. Slide borrowed from the Razor (MICRO '03) presentation.]
- Main flip-flop: clocked faster than the worst-case delay
- Shadow latch: clocked with a delayed clock, to catch any errors
- Error: occurs when the main flip-flop and the shadow latch differ
  - On the next clock cycle, the shadow-latch value moves into the main flip-flop (a small behavioral sketch follows below)
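To make the capture / compare / restore behavior concrete, here is a minimal behavioral sketch in Python. The original mechanism is hardware, so this is not an RTL model; the function name and interface are made up for illustration.

```python
def razor_sample(d_at_main_clock, d_at_shadow_clock):
    """One Razor flip-flop cycle, behaviorally.

    d_at_main_clock:   value of D at the fast (overclocked) clock edge
    d_at_shadow_clock: value of D at the delayed clock edge, by which time
                       the logic has had a full worst-case period to settle
    Returns (value presented to the next stage, error flag).
    """
    main_ff = d_at_main_clock            # speculative capture by the main flip-flop
    shadow_latch = d_at_shadow_clock     # guaranteed-correct capture by the shadow latch
    error = (main_ff != shadow_latch)    # comparator output
    if error:
        main_ff = shadow_latch           # the shadow value replaces the bad one next cycle
    return main_ff, error

# The logic was still settling at the fast edge (read 0) but reached 1 by the
# delayed edge, so the speculative value is wrong and an error is flagged.
print(razor_sample(0, 1))                # -> (1, True)
```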
Razor timing error detection
- A second sample of the logic value is used to validate the earlier sample
- Key design issues:
  - Maintaining forward progress
  - Short-path impact on the shadow latch
  - Overhead of error detection and correction
[Figure: pipeline timing diagram showing main FF and shadow latch samples on clk and clk_del. Slide borrowed from the Razor (MICRO '03) presentation.]
FSM to Razor transformation
- Possible to convert most circuits to Razor
[Figure: an FSM built from input registers, combinational logic, state registers, and output registers is transformed into a Razor blackbox with Razor input, state, stabilization, and output registers, a stallable buffer, and Data / Data Valid / Data Ready handshake signals.]
Problem formulation
- Definitions
  - Maximum depth: the depth clocked by the shadow latch (worst-case delay)
  - Target depth: the depth clocked by the main (overclocked) flip-flop
- Measuring performance
  - Can't use the clock period directly, due to errors!
  - Errors come from overclocking (any switching between the target depth and the max depth)
  - So we have to use expected delay instead of delay:
    ExpDelay(target depth d) = d * Pr(no error | using clock d) + (d + time_recover) * Pr(error | using clock d)
- Finding d: linear search (sketched below)
  - BestExpDelay = min { ExpDelay(d) : max_depth/2 <= d < max_depth }
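As a sketch of this formulation (not the released tool's code), the snippet below evaluates ExpDelay for each candidate target depth and performs the linear search. The per-depth error probabilities are made-up inputs, and treating recovery as one worst-case clock period is an assumption.

```python
def expected_delay(d, pr_error, time_recover):
    """ExpDelay(d) = d * Pr(no error | clock d) + (d + time_recover) * Pr(error | clock d)."""
    p = pr_error[d]
    return d * (1.0 - p) + (d + time_recover) * p

def best_target_depth(max_depth, pr_error, time_recover):
    """Linear search over max_depth/2 <= d < max_depth."""
    candidates = range(max_depth // 2, max_depth)
    return min(candidates, key=lambda d: expected_delay(d, pr_error, time_recover))

# Illustrative numbers: max depth 6, recovery assumed to cost one worst-case
# period, and made-up error probabilities for target depths 3, 4 and 5.
pr_error = {3: 0.67, 4: 0.21, 5: 0.04}
d = best_target_depth(6, pr_error, time_recover=6.0)
print(d, round(expected_delay(d, pr_error, 6.0), 2))   # -> 5 5.24
```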
Optimization goals
- The expected delay: the total delay for data to propagate plus recovery from any errors
- Reduce the probability of an error
  - Straightforward if we are given a target depth: minimize the probability of switching after the target depth
- What can we do without the target depth?
  - Try to get the switching to occur as early as possible
  - Extra area overhead
  - Hard to compare solutions (a special cost function is needed)
[Figure: switching activity vs. clock]
BTWMap algorithm overview
- Decompose the circuit into 2-input gates
- Cut selection (iterated 400 times):
  - Simulate 256 random input values over all cuts (a simulation sketch follows below)
  - Assign each cut a cost based on switching and depth
  - Choose cuts to minimize cost
  - Save the scaled simulation data for the next iteration
- Area recovery: target clock assignment, then the area/performance tradeoff
- Done!
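A minimal sketch of the random-simulation step, assuming the decomposed netlist is a list of 2-input AND gates with optional input inversions (an AIG-style view). The 256 random patterns are packed into one Python integer per signal so every gate is evaluated once per pass. The netlist format and function names are assumptions for illustration, not BTWMap's data structures.

```python
import random

N_VECTORS = 256
MASK = (1 << N_VECTORS) - 1

def simulate(gates, primary_inputs):
    """Bit-parallel simulation of 256 random input vectors.

    gates: [(name, in0, inv0, in1, inv1), ...] in topological order, each a
    2-input AND with optionally inverted inputs.
    Returns {signal name: 256-bit integer holding its simulated values}.
    """
    values = {pi: random.getrandbits(N_VECTORS) for pi in primary_inputs}
    for name, in0, inv0, in1, inv1 in gates:
        a = values[in0] ^ (MASK if inv0 else 0)
        b = values[in1] ^ (MASK if inv1 else 0)
        values[name] = a & b                      # all 256 patterns at once
    return values

def toggle_probability(values):
    """Per-signal probability of toggling between consecutive random vectors;
    combined with each node's depth, this is the raw material for the
    switching-vs-depth cost used during cut selection."""
    pair_mask = MASK >> 1                          # the 255 adjacent vector pairs
    return {s: bin((v ^ (v >> 1)) & pair_mask).count("1") / (N_VECTORS - 1)
            for s, v in values.items()}

# Tiny example: y = a AND (NOT b).
vals = simulate([("y", "a", False, "b", True)], ["a", "b"])
print(round(toggle_probability(vals)["y"], 2))     # roughly 0.38 for random inputs
```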
Cut selection
- Cut ranking
  - Can't look at just the switching activity at each depth
  - For example, cut 2 is better than cut 1 (a cost sketch follows below):

    Depth | Pr(switching), cut 1 | Pr(switching), cut 2
      3   |          3%          |          4%
      2   |         50%          |          5%
      1   |         70%          |

- Expired simulation data
  - Keep the old data: assume the previous iteration's costs are still valid, but scale them down
  - This allows the algorithm to converge on a solution
  - Keeping the old data decreases Pr(error) by an average of 3.5%, a huge improvement since for us Pr(error) <= 5%
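The ranking idea above (weight switching by how late it happens, and keep the previous iteration's costs scaled down rather than throwing them away) could be sketched as below. The linear depth weighting and the 0.5 decay factor are assumptions for illustration, not the paper's exact cost function.

```python
def cut_cost(switch_prob_by_depth, target_depth=0):
    """Cost of a cut from its per-depth switching probabilities.

    switch_prob_by_depth: {depth: Pr(the cut's output switches at that depth)}.
    Switching at or below the target depth is free; later switching costs more
    the deeper it happens (illustrative linear weighting).
    """
    return sum(p * (d - target_depth)
               for d, p in switch_prob_by_depth.items() if d > target_depth)

def blended_cost(new_cost, previous_cost, decay=0.5):
    """Keep the previous iteration's cost, scaled down, so cut choices converge
    instead of oscillating from one iteration to the next."""
    return new_cost + decay * previous_cost

# The slide's example (depth-1 switching omitted): cut 2 switches slightly more
# often at depth 3 but far less at depth 2, so it ends up cheaper than cut 1.
cut1 = {3: 0.03, 2: 0.50}
cut2 = {3: 0.04, 2: 0.05}
print(round(cut_cost(cut1), 2), round(cut_cost(cut2), 2))   # -> 1.09 0.22
```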
BTWMap: area recovery (target depth)
- Idea
  - Find a target depth (how much to overclock)
  - Ignore the switching that happens below this depth
- Implementation (sketched below)
  - Set the outputs' target depth
  - Select cuts PO->PI while propagating the target depth
  - Works similarly to the worst-case depth computation, but calculated PO->PI using MIN instead of PI->PO using MAX
- Benefits: moderate reduction in area
[Figure: small example DAG annotated with node depths and the propagated target depths]
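A sketch of the PO-to-PI propagation, assuming the mapped design is a DAG described by per-node fanin lists (the representation and names are illustrative): each node's depth budget is the minimum of its fanouts' budgets minus one, mirroring the usual depth computation with MIN in place of MAX and the traversal reversed.

```python
from collections import defaultdict

def propagate_target_depth(fanins, primary_outputs, target_depth):
    """Propagate the target depth from primary outputs toward primary inputs.

    fanins: {node: [fanin nodes]} for the mapped DAG (primary inputs have no
    entry).  Returns {node: depth budget}: the depth by which that node must
    have settled for the outputs to meet the target depth.
    """
    fanouts = defaultdict(list)
    for node, ins in fanins.items():
        for i in ins:
            fanouts[i].append(node)

    budget = {}

    def visit(node):                      # memoized walk toward the outputs
        if node in budget:
            return budget[node]
        if node in primary_outputs or not fanouts[node]:
            budget[node] = target_depth   # outputs get the chosen target depth
        else:
            # MIN over fanouts, minus one level, instead of MAX over fanins plus one.
            budget[node] = min(visit(f) for f in fanouts[node]) - 1
        return budget[node]

    every_node = set(fanins) | {i for ins in fanins.values() for i in ins}
    for node in every_node:
        visit(node)
    return budget

# Tiny example: y = LUT(z, c), z = LUT(a, b); target depth 2 at the output y.
print(propagate_target_depth({"z": ["a", "b"], "y": ["z", "c"]}, {"y"}, 2))
# -> budget 2 for y, 1 for z and c, 0 for a and b
```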
BTWMap: area-performance tradeoff
- Idea
  - Relax the minimum switching cost of each gate
  - Gives the area-recovery techniques room to work
- Implementation
  - Set the outputs to the initial amount they can be relaxed
  - Make a relaxation and propagate how much the inputs can change using:
    - The depth of the inputs
    - How much switching slack is left
    - The input-to-output switching correlation; for example, Pr(y switching | x1 switched) = 75% while Pr(y switching | x2 switched) = 50% (see the sketch below)
- Benefits
  - Accurate relaxation estimates
  - Large reduction in area
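One plausible way to read the correlation bullet, sketched below with made-up names: an output's remaining switching slack translates into a larger allowed relaxation for inputs that rarely propagate their switching to the output. This is an illustrative model only, not BTWMap's actual relaxation rule.

```python
def input_relaxation(output_slack, switch_correlation):
    """Turn an output's remaining switching slack into per-input relaxation.

    output_slack: extra late-switching probability the output can tolerate.
    switch_correlation: {input: Pr(output switches | that input switched)}.
    A weakly correlated input can be relaxed more before the output's budget
    is used up.
    """
    return {x: output_slack / max(corr, 1e-9)        # avoid division by zero
            for x, corr in switch_correlation.items()}

# Slide's example: Pr(y | x1 switched) = 75%, Pr(y | x2 switched) = 50%, so x2
# can absorb 1.5x as much relaxation as x1 for the same 1% of output slack.
print(input_relaxation(0.01, {"x1": 0.75, "x2": 0.50}))
```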
BTWMap results example
- BTWMap mapping comparison; the test circuit is PDC from the MCNC benchmark suite
- Comparing 4 methods:
  A. Depth-optimal mapping with depth relaxation on non-critical paths for area saving
  B. Depth-optimal mapping without depth relaxation
  C. BTWMap
  D. BTWMap with area recovery
What circuits can't be optimized
- Maximum Razor clock = (max depth) / 2
- Already good: switching < 2% at the maximum Razor clock
  - Very low switching at the maximum Razor clock
  - 4 of the MCNC suite
- Too bad: switching > 90% at the max depth
  - All the switching happens at the very last depth; very hard to optimize
  - Would have to reduce the switching activity at that depth by at least 20x
  - 5 of the MCNC suite
- Easy to test and exclude: map using ABC and check the switching probabilities (sketched below)
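The screening test follows directly from the two thresholds above; the exact meaning of the per-depth switching numbers (probability that some register input still switches at or after that depth) is an assumption about how the ABC mapping would be profiled.

```python
def razor_suitability(switch_prob_at_depth, max_depth):
    """Classify a circuit from the switching profile of a depth-optimal mapping.

    switch_prob_at_depth: {depth d: Pr(some register input still switches at
    or after depth d)}, with probabilities in [0, 1].
    """
    max_razor_clock = max_depth // 2                       # fastest usable Razor clock
    if switch_prob_at_depth.get(max_razor_clock, 0.0) < 0.02:
        return "already good"        # almost no switching even at the max Razor clock
    if switch_prob_at_depth.get(max_depth, 0.0) > 0.90:
        return "too bad"             # nearly everything switches at the last depth
    return "worth optimizing"

print(razor_suitability({3: 0.45, 6: 0.05}, max_depth=6))   # -> worth optimizing
```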
Sample results
The example below is for the MCNC benchmark SEQ; the max depth is 6 for all three mappings.

ABC (area 1000):
  Depth | Switch. Prob. | Prob. Error | Expected Delay
    5   |     45.49     |             |      7.73
    4   |     50.03     |    95.52    |      9.73
    3   |     72.51     |   100.00    |      9.00
  Ave delay 5.09, best pipeline delay 6.00

BTWMap (area 1258):
  Depth | Switch. Prob. | Prob. Error | Expected Delay
    5   |      3.85     |             |      5.23
    4   |     17.53     |    21.38    |      5.28
    3   |     45.91     |    67.30    |      7.04
  Ave delay 3.80, best pipeline delay 5.23

BTWMap+area (area 1111):
  Depth | Switch. Prob. | Prob. Error | Expected Delay
    5   |      4.99     |             |      5.30
    4   |     47.04     |    52.02    |      7.12
    3   |     52.80     |   100.00    |      9.00
  Ave delay 4.28, best pipeline delay 5.30
Results: expected delay and area
- Performance improvement: 13% with BTWMap and 11% after area recovery
- The area recovery saves over 16% of the lost area
- In the best case (ignoring switching), we're still 3% away from ABC: trading 7% for much better switching activity

Circuit     Expected Delay          Ratio            Area                    Increase
            ABC   BTWMap  +area     BTWMap  +area    ABC    BTWMap  +area    BTWMap  +area
alu4        7.0   6.3     6.4       90%     92%      722    922     776      28%     8%
apex2       6.8   6.4     6.6       94%     97%      972    1200    1040     24%     7%
apex4       7.0   6.1     6.1       87%     87%      793    1028    842      30%     6%
clma        8.4   8.2     8.0       97%     95%      4216   5674    5091     35%     21%
misex3      6.0   6.0     6.0       100%    100%     742    899     819      21%     10%
pdc         8.5   6.5     7.2       77%     85%      2234   2934    2498     31%     12%
s298        2.3   2.0     2.0       87%     87%      44     50      49       14%     11%
s38417      8.0   6.0     6.0       75%     75%      2909   3972    3234     37%     11%
s38584.1    6.4   5.3               83%     84%      3592   4543    3856     27%     7%
seq         6.0   4.9     5.3       81%     88%      1000   1258    1111     26%     11%
spla        7.3   6.5     6.7       89%     92%      2127   2735    2399     29%     13%
Geomean                             87.0%   88.8%                            26.4%   10.1%
Conclusion
- The BTWMap work includes:
  - Methodologies for measuring performance on circuits optimized for average-case delay
  - Algorithms for optimizing circuits for average-case delay
  - Implementation and release of these tools (alpha version): http://cadlab.cs.ucla.edu/software_release/btwmap/
- Results summary: BTWMap (and the area-recovery version)
  - 14% (and 8%) average delay reduction
  - 13% (and 11%) pipeline improvement
  - 26% (and 10%) area increase