Global Delay Optimization using Structural Choices
  Alan Mishchenko, Robert Brayton (UC Berkeley)
  Stephen Jang (Xilinx Inc.)
2 Overview
  Motivation
  Timing criticality
  Restructuring for delay
  Algorithm
  Experimental results
  Conclusions
  Future work
3 Motivation
  An AIG is an And-Inverter Graph
  AIG-based combinational logic synthesis is fast and effective
  AIG-based synthesis is area-oriented (except balancing)
  Needed: delay optimization in AIG-based synthesis
  AIGs allow for accumulation of structural choices [Lehman et al, TCAD’97; Chatterjee et al, ICCAD’05]
  Can leverage an efficient technology mapper with choices
  Can lead to fast delay optimization (~10% of mapping time)
4 Distinctive Features
  Traditional approach
    For all timing-critical areas
      Perform timing analysis
      Generate alternative structures
      Evaluate the improvement and decide whether the transformation is accepted
  Proposed approach
    Perform timing analysis only once
    For all timing-critical areas
      Generate and store structural choices
    Use the technology mapper to pick and choose good structures
  Characteristics of the proposed approach
    Fast – because there is no repeated timing analysis
    Simple – because it leverages the AIG package and the LUT mapper
    Effective – because it makes decisions in the global space
5 Timing Criticality
  Critical nodes
    Used by many traditional algorithms
  Critical edges
    Used by our algorithm
  We pre-compute critical edges of critical nodes
    Reduces computation
  An edge between critical nodes may not be critical
    See illustration: edge 1 → 3
  [Figure labels: Primary inputs, Primary outputs]
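The distinction between critical nodes and critical edges can be made concrete with a small sketch. It assumes a unit-delay model and a hypothetical five-node network (the node names, the `analyze_timing` helper, and the `window` parameter are illustrative, not taken from the slides). An edge is treated as critical when its own slack, the required time at its sink minus one unit of delay minus the arrival time at its source, falls within the window.

```python
def analyze_timing(fanins, outputs, window=0):
    """Unit-delay timing analysis of a DAG given as {node: [fanin nodes]}."""
    order, seen = [], set()

    def visit(node):  # depth-first topological sort (fanins before fanouts)
        if node in seen:
            return
        seen.add(node)
        for fanin in fanins[node]:
            visit(fanin)
        order.append(node)

    for node in fanins:
        visit(node)

    # Forward pass: arrival times (primary inputs arrive at time 0).
    arrival = {}
    for node in order:
        ins = fanins[node]
        arrival[node] = 1 + max(arrival[f] for f in ins) if ins else 0
    latest = max(arrival[o] for o in outputs)

    # Backward pass: required times, propagated from the outputs.
    required = {node: float("inf") for node in fanins}
    for o in outputs:
        required[o] = latest
    for node in reversed(order):
        for f in fanins[node]:
            required[f] = min(required[f], required[node] - 1)

    critical_nodes = {n for n in fanins if required[n] - arrival[n] <= window}
    critical_edges = {
        (f, n)
        for n in critical_nodes
        for f in fanins[n]
        if required[n] - 1 - arrival[f] <= window  # slack of the edge itself
    }
    return critical_nodes, critical_edges


# Hypothetical network: n1 feeds n3 both directly and through n2.
fanins = {"a": [], "b": [], "n1": ["a", "b"], "n2": ["n1", "b"], "n3": ["n1", "n2"]}
critical_nodes, critical_edges = analyze_timing(fanins, outputs=["n3"])
print(sorted(critical_edges))
# [('a', 'n1'), ('b', 'n1'), ('n1', 'n2'), ('n2', 'n3')]
```

Here n1 and n3 are both critical, yet the direct edge n1 → n3 is not, because the longest path runs through n2. Restructuring around that edge would not shorten the critical path.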
6 Delay-Oriented Restructuring
  Uses traditional MUX-restructuring
  Also known as the generalized select transform
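A minimal sketch of MUX-restructuring on a single late-arriving input, assuming expressions are encoded as nested tuples (this encoding and the helper names `evaluate`, `substitute`, `depth_of`, and `mux_restructure` are hypothetical). The function is rewritten as a multiplexer that selects between its two cofactors, which moves the late input from the bottom of the cone to the select pin at the top.

```python
from itertools import product


def evaluate(expr, env):
    """Evaluate a nested-tuple expression under the assignment `env`."""
    if isinstance(expr, str):
        return env[expr]
    if isinstance(expr, int):
        return expr
    op, *args = expr
    vals = [evaluate(a, env) for a in args]
    if op == "and":
        return vals[0] & vals[1]
    if op == "or":
        return vals[0] | vals[1]
    if op == "xor":
        return vals[0] ^ vals[1]
    if op == "mux":  # ("mux", select, then_branch, else_branch)
        return vals[1] if vals[0] else vals[2]
    raise ValueError(op)


def substitute(expr, var, value):
    """Cofactor: replace every occurrence of `var` by the constant `value`."""
    if expr == var:
        return value
    if isinstance(expr, (str, int)):
        return expr
    op, *args = expr
    return (op, *[substitute(a, var, value) for a in args])


def depth_of(expr, var):
    """Operators on the longest path from `var` to the root (-1 if absent)."""
    if expr == var:
        return 0
    if isinstance(expr, (str, int)):
        return -1
    best = max(depth_of(a, var) for a in expr[1:])
    return -1 if best < 0 else best + 1


def mux_restructure(expr, var):
    """Rewrite expr as var ? expr|var=1 : expr|var=0."""
    return ("mux", var, substitute(expr, var, 1), substitute(expr, var, 0))


# Hypothetical cone in which input "c" arrives late and sits three levels deep.
f = ("xor", ("or", ("and", "c", "a"), "b"), "d")
g = mux_restructure(f, "c")

names = ["a", "b", "c", "d"]
same = all(
    evaluate(f, dict(zip(names, bits))) == evaluate(g, dict(zip(names, bits)))
    for bits in product((0, 1), repeat=len(names))
)
print(same, depth_of(f, "c"), depth_of(g, "c"))
```

The sketch counts a MUX as one operator. In an AIG a MUX is built from three AND nodes on two levels, so the real saving is smaller than the operator count suggests, and the cofactors duplicate logic, which costs area.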
7 Overall Algorithm
mapped netlist performSpeedup (
    subject graph S,   // S is an And-Inverter Graph
    mapped netlist M,  // M was previously derived by tech-mapping of S
    timing window w,   // w is used to detect the critical paths
    logic depth l,     // l is used to detect a logic cone rooted at a node
    edge count p )     // p limits the number of critical edges of the cone
{
    perform timing analysis of M with unit-delay or LUT-library model;
    pre-compute critical section of M as nodes n such that 0 ≤ slack(n) ≤ w;
    pre-compute timing-critical edges connecting these nodes;
    for each timing critical node n {
        find cone C of M that extends l levels down from n;
        pick the set of timing-critical edges V feeding into C;
        if the number of edges in V exceeds p, continue;
        find logic cone C’ in S corresponding to C in M;
        find variables V’ in S corresponding to V in M;
        derive cofactors of the function of C’ w.r.t. variables in V’;
        build multiplexer tree C’’ of the cofactors using variables in V’;
        add structural choice C’ = C’’ to the subject graph S;
    }
    return mapped netlist M’ derived by mapping subject graph S with added choices;
}
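The inner steps of the loop (derive cofactors, build the multiplexer tree, record the choice) can be sketched on plain Python functions standing in for cones of the subject graph. Everything here is an illustrative assumption: the helpers `cofactors`, `mux_tree`, and `add_choice`, the dictionary used as a choice store, and the example cone. A real implementation would operate on AIG nodes and leave the selection among choices to the mapper.

```python
from itertools import product


def cofactors(f, crit_vars):
    """Map each 0/1 assignment of the critical inputs to a cofactor of f."""
    table = {}
    for values in product((0, 1), repeat=len(crit_vars)):
        def cof(*xs, _values=values):
            xs = list(xs)
            for index, value in zip(crit_vars, _values):
                xs[index] = value  # fix the critical inputs to constants
            return f(*xs)
        table[values] = cof
    return table


def mux_tree(table, crit_vars):
    """Select the cofactor addressed by the critical inputs' actual values."""
    def g(*xs):
        return table[tuple(xs[i] for i in crit_vars)](*xs)
    return g


def add_choice(choices, name, original, alternative, num_inputs):
    """Record `alternative` as an equivalent structure for `original`."""
    for xs in product((0, 1), repeat=num_inputs):
        if original(*xs) != alternative(*xs):
            raise ValueError("structures are not functionally equivalent")
    choices.setdefault(name, [original]).append(alternative)


def cone(a, b, c, d):
    """Hypothetical cone function; inputs 0 and 2 are assumed critical."""
    return ((a & b) | c) ^ d


crit_vars = (0, 2)
table = cofactors(cone, crit_vars)
restructured = mux_tree(table, crit_vars)

choices = {}
add_choice(choices, "cone", cone, restructured, num_inputs=4)
print(len(table), len(choices["cone"]))
```

A cone with k critical inputs has 2^k cofactors, so the multiplexer tree grows exponentially in the number of critical edges. This is one reason to bound that number with a parameter such as p.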
8 Experimental Setup
  Implemented in ABC as command speedup
  Used the FPGA technology mapper if
  Verified the results using the CEC engine cec
  Experiments targeting 6-LUTs were run on an Intel Xeon 2-CPU 4-core computer with 8 GB RAM
  Experimentally compared the following scripts (the exponent is the number of repetitions)
    Without delay optimization:
      (st; dchoice; if -C 16 -F 2)^8
    With delay optimization:
      (st; dchoice; if -C 16 -F 2)^4
      (speedup; if -C 16 -F 2)^3
      (st; dchoice; if -C 16 -F 2)^4
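The two scripts can be expanded into flat command strings with a few lines of Python. This sketch assumes that the number following each parenthesized script is a repetition count. The command names and options are copied from the slide as written, and how the resulting string is passed to ABC is left open.

```python
def repeat(script, times):
    """Concatenate `times` copies of a semicolon-separated command script."""
    return "; ".join([script] * times)


MAP_STEP = "st; dchoice; if -C 16 -F 2"   # choice computation plus LUT mapping
SPEEDUP_STEP = "speedup; if -C 16 -F 2"   # add delay choices, then remap

# Without delay optimization: eight rounds of mapping with choices.
baseline = repeat(MAP_STEP, 8)

# With delay optimization: four rounds, three speedup rounds, four more rounds.
optimized = "; ".join(
    [repeat(MAP_STEP, 4), repeat(SPEEDUP_STEP, 3), repeat(MAP_STEP, 4)]
)

print(baseline)
print(optimized)
```

Both scripts run the mapper the same number of times in the choice-based rounds (eight), so the comparison isolates the effect of the three inserted speedup rounds.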
9 Examples of LUT Libraries
  A variable-pin-delay LUT library
  The unit-delay LUT library
  A variable-pin-delay LUT library with wire-delays
  [Library listings not preserved; columns: LUT size, LUT area, LUT pin delays]
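To illustrate how a unit-delay library differs from a variable-pin-delay one, the sketch below computes the arrival time at a LUT output as the maximum over inputs of input arrival plus pin delay. The pin-delay values are made up for the example and are not the library entries from the slide. Assigning the latest-arriving signal to the fastest pin is one possible policy, assumed here for illustration rather than taken from the mapper.

```python
def lut_output_arrival(input_arrivals, pin_delays):
    """Arrival time at a LUT output, pairing late inputs with fast pins.

    Assumes len(pin_delays) >= len(input_arrivals); unused pins are the
    slowest ones.
    """
    latest_first = sorted(input_arrivals, reverse=True)
    fastest_first = sorted(pin_delays)
    return max(a + d for a, d in zip(latest_first, fastest_first))


arrivals = [3.0, 1.0, 0.0]          # hypothetical input arrival times

unit_library = [1.0, 1.0, 1.0]      # every pin costs one unit of delay
variable_library = [0.5, 0.8, 1.0]  # hypothetical per-pin delays

unit = lut_output_arrival(arrivals, unit_library)
variable = lut_output_arrival(arrivals, variable_library)
print(unit, variable)
```

Under the unit-delay model the result depends only on the latest input, while with per-pin delays the pin assignment also matters.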
10 Experimental Results
  LUT – number of LUTs
  Lev – number of LUT levels
  Delay – delay using LUT library
  Total – total runtime of Baseline
  Time1 – the runtime of AIG restructuring only
  Time2 – the total runtime of Speedup
  Geomean – geometric averages of columns
  Ratios – ratios of geometric averages
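The Geomean and Ratios rows can be computed as in the sketch below. The numbers are placeholders chosen for easy arithmetic and are not results from the slides.

```python
import math


def geomean(values):
    """Geometric mean of positive numbers, computed in the log domain."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Placeholder per-benchmark delays, NOT measured data.
baseline_delay = [8.0, 2.0]
speedup_delay = [4.0, 1.0]

# A ratio below 1.0 means the second flow is faster on geometric average.
ratio = geomean(speedup_delay) / geomean(baseline_delay)
print(round(ratio, 3))
```

The geometric mean is the usual choice for such summaries because the ratio of two geometric means equals the geometric mean of the per-benchmark ratios, so no single large design dominates the result.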
11 Conclusions and Future Work
  Developed a method that is
    Fast – because there is no repeated timing analysis
    Simple – because it leverages the AIG package and the LUT mapper
    Effective – because it makes decisions in the global space
  Future work may include
    measuring improvements after place-and-route
    extending the algorithm to work for sequential circuits
    applying similar optimization to cost functions other than delay (e.g. switching activity minimization)