Reinventing The Wheel: Developing a New Standard-Cell Synthesis Flow
Alan Mishchenko, Niklas Een, Hamid Savoj, Robert Brayton
University of California, Berkeley
Outline
- Motivation
- The flow
  - Technology-independent synthesis
  - Technology mapping
  - Buffering
  - Sizing
- Experimental results
- Conclusion
Motivation
Synthesis tools are out there, but they are
- slow
- suboptimal
- complicated
- expensive
ABC
- It is a public-domain tool developed by our research group since 2005
- It addresses both synthesis and verification of synchronous hardware
- It is based on years of experience in developing efficient data structures and algorithms
- It is used in industry and academia
- For more information, visit https://bitbucket.org/alanmi/abc
The Flow
- Technology-independent synthesis
- Technology mapping
- Buffering
- Sizing
These steps are not disconnected; they overlap
- Synthesis talks to mapping through structural choices
- Mapping talks to buffering through fanout estimations
- Buffering and sizing can be interleaved
Synthesis: Old and New
Old: “AIG rewriting”
- Delay/area costs: AND2 levels/nodes
- Restructuring: for all 4-input cuts, try all AIG subgraphs and choose the one with the fewest nodes under the delay constraint
- Results: acceptable quality, acceptable runtime
- Problems: “over-restructuring”; slow for large, deep logic
New: “AIG reshaping”
- Delay/area costs: user-specified cost for n-input AND/XOR/MUX/MAJ
- Restructuring: iterate “mapping” and “unmapping” several times
- Results: comparable quality, 3-10x faster
- Problems: none so far
Mapping: Old and New
Old: “Traditional” cut-based mapping
- Iterate over the subject graph
  - re-compute priority cuts
  - use structural or functional matching (ICCAD’97)
- For standard-cell mapping
  - use a gain-based library
  - map both phases (positive and negative) of each node into gates
  - select best cuts (gates)
- Results: acceptable quality, tolerable runtime
New: “Improved” cut-based mapping
- Pre-compute priority cuts
- Iterate over the subject graph
  - evaluate cuts using different costs
  - use structural or functional matching
- For standard-cell mapping
  - use a gain-based library
  - map into NPN classes of functions from the library
  - select best cuts (NPN classes)
  - perform phase assignment and determine gates during buffering
- Results: quality not known yet; runtime is expected to be 3-10x faster
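The cut computation that both mappers rely on can be illustrated with a minimal sketch. It is not ABC's implementation: the function name `enumerate_cuts`, the dictionary encoding of the AIG, and the use of cut size as the only priority are assumptions made for illustration. A real mapper would rank cuts by delay or area cost.

```python
from itertools import product


def enumerate_cuts(fanins, k=4, max_cuts=8):
    """Bottom-up k-feasible cut enumeration on an AND-inverter graph.

    fanins maps each AND node to its two fanins, listed in topological
    order. Names that never appear as keys are primary inputs.
    (Hypothetical encoding, chosen only for this sketch.)
    """
    cuts = {}

    def cuts_of(node):
        # A primary input has only its trivial cut.
        if node not in cuts:
            cuts[node] = [frozenset([node])]
        return cuts[node]

    for node, (a, b) in fanins.items():
        # Merge every pair of fanin cuts and keep those with at most k leaves.
        merged = set()
        for cut_a, cut_b in product(cuts_of(a), cuts_of(b)):
            cut = cut_a | cut_b
            if len(cut) <= k:
                merged.add(cut)
        # Drop dominated cuts (proper supersets of another cut).
        pruned = [c for c in merged if not any(d < c for d in merged)]
        # "Priority" here is simply fewer leaves first (an assumption).
        pruned.sort(key=lambda c: (len(c), sorted(c)))
        # Keep a bounded number of cuts, plus the trivial cut of the node.
        cuts[node] = pruned[:max_cuts] + [frozenset([node])]
    return cuts


if __name__ == "__main__":
    aig = {"n1": ("a", "b"), "n2": ("c", "d"), "n3": ("n1", "n2")}
    for cut in enumerate_cuts(aig)["n3"]:
        print(sorted(cut))
```

Bounding the number of stored cuts per node is what keeps the runtime under control on large graphs. Pre-computing the cuts once, as the improved mapper does, avoids repeating this merge in every mapping pass.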
Buffering: Old and New
Old: several ideas tried, none is a clear winner
- Enumerating buffer tree topologies
- Buffering for near-continuous libraries
- Other incremental local fanout optimization methods
New: several ideas tried, none is a clear winner
- “Technology-independent” buffering after the gain-based library
- Buffer-tree construction given required times and loads of the fanouts
- Incremental buffering interleaved with incremental sizing
- Results are mixed
Incremental Buffering Illustrated
(Figure panels: Growing, Bypassing)
Sizing: Old and New
Old
- Non-linear programming
- Linear programming
- Lagrangian multipliers
New: incremental sizing
- Find critical region
- Find best gates to resize
- Perform the resizing
- Incrementally update timing
- Iterate until no improvement
- Can be combined with incremental buffering
Results
- Reasonable
- Surprisingly fast
- If an optimum solution is known, seems to converge to it
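The incremental sizing loop can be sketched on a toy example. Everything in this sketch is an assumption made for illustration: the design is a single chain of gates, the delay model is a simple intrinsic-plus-load-over-size formula, the names `path_delay` and `incremental_sizing` are hypothetical, and timing is recomputed from scratch instead of being updated incrementally. Area is ignored.

```python
def path_delay(sizes, out_load=16.0, cin=1.0, intrinsic=1.0):
    """Delay of a gate chain under a toy model (not a library model).

    Each stage costs intrinsic + load / size, where the load is the input
    capacitance of the next gate (cin * its size) or out_load at the end.
    """
    # A unit-size input driver has to charge the first gate's input.
    total = sizes[0] * cin
    for i, size in enumerate(sizes):
        load = sizes[i + 1] * cin if i + 1 < len(sizes) else out_load
        total += intrinsic + load / size
    return total


def incremental_sizing(sizes, options=(1, 2, 4, 8)):
    """Greedy loop: apply the single best resize until nothing improves."""
    sizes = list(sizes)
    best = path_delay(sizes)
    while True:
        move = None
        # The whole chain is the critical region in this toy example.
        for i, old in enumerate(sizes):
            for new in options:
                if new == old:
                    continue
                trial = sizes[:i] + [new] + sizes[i + 1:]
                delay = path_delay(trial)
                if delay < best and (move is None or delay < move[0]):
                    move = (delay, i, new)
        if move is None:
            return sizes, best  # no resize improves the delay
        best, i, new = move
        sizes[i] = new          # commit the move, then re-evaluate


if __name__ == "__main__":
    start = [1, 1, 1]
    final, delay = incremental_sizing(start)
    print(path_delay(start), final, delay)
```

Because each committed move strictly reduces the delay and the number of size assignments is finite, the loop terminates at a point where no single resize helps. Such a point is a local optimum and is not guaranteed to be the global one.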
Commands of The Flow
read_lib, write_lib, print_lib, read_scl, write_scl, dump_genlib, print_gs, stime, buffer, unbuffer, minsize, maxsize, upsize, dnsize, print_buf, read_constr, print_constr, reset_constr
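As a rough illustration of how these commands might be chained, the sketch below assembles an ABC script string. The ordering is an assumption and not a recipe given in the slides. The commands `read`, `strash`, `map`, and `topo` are general ABC commands that do not appear in the list above, no command options are used, and the file names and the helper name `abc_flow_script` are hypothetical.

```python
def abc_flow_script(lib_path, design_path):
    """Join ABC commands into one semicolon-separated script string."""
    steps = [
        f"read_lib {lib_path}",  # load the Liberty cell library
        f"read {design_path}",   # load the design
        "strash",                # convert the logic to an AIG
        "map",                   # technology mapping
        "topo",                  # put mapped nodes in topological order
        "stime",                 # timing report before optimization
        "buffer",                # fanout buffering
        "upsize",                # upsize gates to reduce delay
        "dnsize",                # downsize gates to recover area
        "stime",                 # timing report after optimization
    ]
    return "; ".join(steps)


if __name__ == "__main__":
    # Hypothetical file names, used only for illustration.
    print(abc_flow_script("cells.lib", "design.blif"))
```

The resulting string could be passed to ABC on the command line, for example with `abc -c "<script>"`.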
Experimental Setting
- 19 OpenCore designs were synthesized and mapped by an industrial tool using the public library vsclib013.lib from http://www.vlsitechnology.org/
- Delay, area, and runtime were collected and used as a reference
- Sizing was tested by applying min-sizing, followed by re-sizing
- Buffering was tested by un-buffering and min-sizing, followed by re-buffering and re-sizing
- The flow was tested by restructuring the design, followed by mapping, buffering, and sizing
Experimental Results
Comments on The Table
- Column “Gate” shows the number of gates produced by the industrial tool
- The other “Gate” columns show the percentage change in the number of gates after each transform, compared to the result produced by the industrial tool
  - Positive is improvement; negative is degradation
- Similarly, columns “Area” and “Delay” show the percentage change in area and delay, respectively
- The flows are tuned differently
  - This is why the area increase after buffering/sizing is larger than after synthesis/buffering/sizing
- Runtimes are in seconds on an old desktop computer
  - On a new computer, the runtimes are expected to be 2x smaller
Potential Issues
- Not specifying input driving cells and output loads
  - This was addressed and experiments show it is fine
- Over-tuning for one particular library
  - Not sure heuristics will hold for submicron libraries
- Not looking at power
- Not taking high and low Vt cells into account
- Not mapping into multi-output cells
- Not mapping sequential elements
- Not considering multiple clock domains
Conclusion
- A new synthesis flow is being developed and implemented in ABC
- An opportunity to
  - rethink some of the classical problems
  - improve on some of the known solutions
  - come up with a new public implementation
- Results are encouraging, compared to the industrial-tool reference
  - delay (in delay-oriented synthesis) is within 5-15%
  - area (in area-oriented synthesis) is within 1-3%
  - runtime is about 20-50x better
Abstract
This presentation focuses on adding new capabilities for synthesizing standard-cell designs to the public-domain synthesis/verification tool ABC. An optimization flow has been developed, which includes gain-based technology mapping, fanout optimization by buffering and gate duplication, and gate sizing. Novel heuristic algorithms have been proposed for several well-known optimization steps. For example, buffer tree construction can be performed not as a separate step, but concurrently with gate sizing by reshaping initial well-balanced buffer trees. Each tree reshaping and each gate resizing transform is evaluated for delay/area improvement using a common cost function, and the most promising one is selected. The delay is measured by a lookup-table-based delay model, which computes the delay of a gate from its input slew and output capacitance. Experiments show that the flow produces results within 10% of those of industrial tools while running 20x faster.
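A lookup-table delay model of this kind is commonly evaluated by bilinear interpolation over a table indexed by input slew and output capacitance. The sketch below shows that evaluation under stated assumptions: the function names, the table layout (rows indexed by slew, columns by load), and the sample numbers are hypothetical and are not taken from ABC or from vsclib013.lib.

```python
from bisect import bisect_right


def _bracket(axis, x):
    """Index i such that axis[i] and axis[i + 1] bracket x (clamped)."""
    i = bisect_right(axis, x) - 1
    return max(0, min(i, len(axis) - 2))


def lut_delay(slews, loads, table, slew, load):
    """Bilinear interpolation in a 2D delay table.

    table[i][j] is the delay at input slew slews[i] and output load
    loads[j]. Points outside the table are extrapolated linearly from
    the nearest table cell.
    """
    i = _bracket(slews, slew)
    j = _bracket(loads, load)
    # Fractional position of the query point inside the table cell.
    ts = (slew - slews[i]) / (slews[i + 1] - slews[i])
    tl = (load - loads[j]) / (loads[j + 1] - loads[j])
    # Interpolate along the load axis on both slew rows, then across rows.
    d0 = table[i][j] + tl * (table[i][j + 1] - table[i][j])
    d1 = table[i + 1][j] + tl * (table[i + 1][j + 1] - table[i + 1][j])
    return d0 + ts * (d1 - d0)


if __name__ == "__main__":
    slews = [0.0, 1.0]                # illustrative axis values
    loads = [0.0, 2.0]
    table = [[1.0, 3.0], [2.0, 4.0]]  # illustrative delays
    print(lut_delay(slews, loads, table, 0.5, 1.0))
```

During sizing and buffering, a gate's output capacitance changes with every move, so a cheap table lookup of this kind is what makes it practical to re-evaluate delay after each candidate transform.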