Bitwidth-Aware Scheduling and Binding in High-Level Synthesis
X. Cheng+, J. Cong, Y. Fan, G. Han, J. Lin, J. Xu+, Z. Zhang
Computer Science Department, UCLA
+ Microprocessor Development and Research Center, PKU
Outline Motivation Bitwidth-aware synthesis flow Scheduling and binding to minimize total bits of functional units (FU) Minimum weighted-interval-graph coloring problem for register allocation and binding Experimental results Conclusion
Motivation
  High-level languages
    Big gap between design productivity and design complexity
    Alleviate the design complexity
    Need to produce high-quality products
  Need to consider multiple bitwidths
    Recent research shows there are 40% redundant bits in programs written in high-level languages [Stephenson et al, SIGPLAN’00]
    Hardware resource cost is reduced when multiple bitwidths are considered
      Area is proportional to the input bitwidth for adders and registers, and to the square of the input bitwidth for multipliers
      Wire-length is reduced accordingly
    Conventional high-level synthesis only considers resources with uniform bitwidth
Motivational Example - Impact of Bitwidth
  [Figure: two schedules of the same DFG under an execution time of 8 clock cycles, with multiplications (*) taking 3 clock cycles and additions (+) taking 1 clock cycle. Each schedule lists the adders and multipliers it requires; the second is annotated "30% saving" and "31% saving".]
Related Work
  High-level synthesis with consideration of bitwidth
    ILP formulation [Constantinides et al, IEEE Electronics Letters’00]
    Heuristic solutions [Kum et al ’01] [Constantinides et al, DATE’01]
    Splitting adders into 1-bit adders [Molina et al, DAC’02]
    Partially guarded computation [Choi et al, ISLPED’00]
  Limitations
    No consideration of interconnect delay in scheduling and binding
      Interconnect delays dominate the timing in DSM technologies
    No optimality evaluation of the proposed solutions for register allocation and binding
Bitwidth-Aware Synthesis Flow
  Multiple-bitwidth scheduling and binding problem
  Given: (1) a DFG annotated with bitwidths, (2) a time constraint, (3) placement information of functional units, and (4) a resource IP library in which each resource type has arbitrary bitwidth configurations, each associated with an area cost.
  Objective: Schedule and bind the DFG onto the library resources, taking into account the interconnect delay derived from placement and without violating the time constraint, such that the final area of the required resources is minimized.
RDR+MCAS
  One solution for multi-cycle on-chip communication
    Regular Distributed Register (RDR) micro-architecture [Cong et al, ISPD’03] [Cong et al, ICCAD’03]
      The whole chip is divided into an array of islands
      Choose the island size such that local computation and communication in each island can be done in a single cycle
    MCAS: Architectural Synthesis for Multi-cycle Communication
      Efficiently maps behavioral descriptions to the RDR uArch
      Integrates architectural synthesis with physical planning
      Placement information of functional units
  [Figure: RDR micro-architecture. Islands of width Wi and height Hi connected by global interconnect taking 1 cycle, 2 cycles, ..., K cycles; each island is a cluster with an area constraint containing a Local Computational Cluster (LCC) with ALU, MUL, and MUX, a register file, and an FSM.]
Scheduling and Binding Lower bound estimation of FU bitwidth for a DFG Prior works focus on the number of FUs Lower-bound-based simultaneous scheduling and binding Time constrained Consider the interconnect delay obtained from placement information given by MCAS
Lower Bound Estimation
  Extend the interval-based technique of [Sharma et al, ’93] to support multi-bitwidth FUs
  Main idea
    Compute the minimum resource requirement R(p, q) for each time interval [p, q] ⊆ [1, T]
    The maximum of R(p, q) over all intervals is the final bitwidth lower bound
Example of Lower-bound Estimation
  Goal: the minimum bitwidth requirement for multipliers in interval [4, 7]
  Theorem: For any feasible schedule, the minimum overlap between operation o and interval [p, q] is
    O(o, p, q) = min{ |Lifetime_ASAP ∩ [p, q]|, |Lifetime_ALAP ∩ [p, q]| }
  Minimum overlaps between the multiplications a (18*6), b (24*16), c (32*16), and d (16*4) and interval [4, 7]
    O(a, 4, 7) = 1, O(c, 4, 7) = 1, O(d, 4, 7) = 1; b accounts for the two entries of 24 in the set below
  The operation bitwidths that must be executed in [4, 7] are {18, 24, 24, 32, 16}; sorted: {32, 24, 24, 18, 16}
  The minimum bitwidth requirement for multipliers in [4, 7] is R(4, 7) = {32, 16}
  [Figure: ASAP and ALAP schedules of the multiplications a, b, c, and d over the control steps]
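The interval-based estimate can be sketched in Python as follows. This is one reading that reproduces the numbers on the slide, not necessarily the authors' implementation. It assumes that each control step of overlap counts as one demand of the operation's bitwidth, that an FU can serve at most q - p + 1 such demands inside [p, q], and that requirements of different intervals are compared by their total bits. The function names and the tuple layout of `ops` are hypothetical.

```python
def overlap(life, p, q):
    """Number of control steps shared by lifetime (start, end) and [p, q]."""
    s, e = life
    return max(0, min(e, q) - max(s, p) + 1)


def min_overlap(asap_life, alap_life, p, q):
    """O(o, p, q): the smaller of the ASAP and ALAP overlaps with [p, q]."""
    return min(overlap(asap_life, p, q), overlap(alap_life, p, q))


def requirement(demands, p, q):
    """R(p, q) for a list of per-step bitwidth demands inside [p, q].

    One FU covers at most (q - p + 1) demands, so after sorting in
    descending order every (q - p + 1)-th entry opens a new FU and
    fixes that FU's bitwidth.
    """
    length = q - p + 1
    return sorted(demands, reverse=True)[::length]


def lower_bound(ops, T):
    """ops: list of (bitwidth, asap_start, alap_start, latency).

    Assumption: intervals are compared by the total bits of R(p, q).
    """
    best = []
    for p in range(1, T + 1):
        for q in range(p, T + 1):
            demands = []
            for width, asap, alap, lat in ops:
                k = min_overlap((asap, asap + lat - 1),
                                (alap, alap + lat - 1), p, q)
                demands.extend([width] * k)
            r = requirement(demands, p, q)
            if sum(r) > sum(best):
                best = r
    return best


# The multiset from the slide for interval [4, 7]
print(requirement([18, 24, 24, 32, 16], 4, 7))
```

With the slide's multiset {18, 24, 24, 32, 16} and an interval of four steps, the first FU absorbs the four widest demands and must be 32 bits wide, and the remaining demand needs a second, 16-bit FU.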
Area Cost
  The weighted-area lower bound of an unscheduled DFG is defined as
    A = A_add + r * A_mul
  where A_add is the area for adders, A_mul is the area for multipliers, and r is the ratio weight of multiplier area over adder area
  For a partially scheduled DFG, the scheduling status S records the control steps of scheduled operations and the feasible control steps of unscheduled operations
  A is calculated the same way and is denoted A(S)
Scheduling and Binding Algorithm-1
  Goal: Minimize the area cost of the required FUs
    Consider interconnect delay
  Basic idea
    In each step, schedule an operation at a control step such that the resulting weighted-area lower bound A(S) is kept as small as possible
  How to choose an operation and one of its feasible control steps
    [Figure: control steps 1 to 3 with add-16 (feasible control steps [1, 2]) and add-32 (feasible control steps [2, 3])]
    A(16, 1) = 48, A(16, 2) = 48, A(32, 2) = 64, A(32, 3) = 48
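The basic idea can be illustrated with a small greedy loop. The sketch below is a simplification under stated assumptions: all operations are single-cycle adders, data dependences and interconnect delay are ignored, ties are broken by iteration order, and the lower bound is a plain interval-based estimate rather than the weighted area A(S) used in the slides. The names `area_lower_bound` and `greedy_schedule` and the example operations are hypothetical and do not reproduce the A(...) values shown in the figure.

```python
def area_lower_bound(ranges, widths, T):
    """Lower bound on total adder bits for single-cycle operations.

    ranges maps each operation to its feasible control steps (lo, hi).
    An operation whose range lies inside [p, q] must execute there; one
    adder serves at most (q - p + 1) of them.
    """
    best = 0
    for p in range(1, T + 1):
        for q in range(p, T + 1):
            forced = sorted((widths[o] for o, (lo, hi) in ranges.items()
                             if p <= lo and hi <= q), reverse=True)
            best = max(best, sum(forced[::q - p + 1]))
    return best


def greedy_schedule(ranges, widths, T):
    """Fix one (operation, step) pair at a time, keeping the bound minimal."""
    ranges = dict(ranges)
    while True:
        candidates = [(o, c) for o, (lo, hi) in ranges.items() if lo < hi
                      for c in range(lo, hi + 1)]
        if not candidates:
            return {o: lo for o, (lo, _) in ranges.items()}

        def cost(choice):
            o, c = choice
            return area_lower_bound({**ranges, o: (c, c)}, widths, T)

        o, c = min(candidates, key=cost)
        ranges[o] = (c, c)


widths = {"x": 16, "y": 32, "z": 16}
ranges = {"x": (1, 2), "y": (2, 3), "z": (2, 2)}  # z is already scheduled
schedule = greedy_schedule(ranges, widths, T=3)
print(schedule)
```

In this toy instance, placing x at step 1 and y at step 3 lets all three additions share a single 32-bit adder, whereas placing either of them at step 2 next to z would force a second adder.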
Scheduling and Binding Algorithm-2
  Simultaneous scheduling and binding with consideration of interconnect delay
    After an operation o and a control step c are chosen, FU binding is performed to decide whether o can finally be scheduled at step c
      There is an available FU usable by o at step c
      Data dependences between o and its scheduled and bound predecessors and successors are maintained
  [Figure: control steps 1 to 3 with * and + operations bound to MUL and ADD units; labels: island, 1 clock cycle]
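The two binding conditions can be written as a small feasibility check. The sketch below makes several assumptions that the slides do not state: the interconnect delay between two FUs is the Manhattan distance between their islands in cycles (zero inside one island), FU compatibility by type and bitwidth is not modeled, and the layout of the `state` dictionary is hypothetical.

```python
def island_delay(a, b):
    """Hypothetical delay model: one cycle per island hop, zero if local."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def can_schedule(op, step, fu, state):
    """Check whether op may start at control step `step` on functional unit fu."""
    lat, island = state["latency"], state["island"]
    sched, bound = state["sched"], state["bound"]

    # Condition 1: the FU is free for every step op would occupy.
    if any((fu, s) in state["busy"] for s in range(step, step + lat[op])):
        return False

    # Condition 2a: results of scheduled predecessors arrive in time.
    for p in state["preds"].get(op, ()):
        if p in sched:
            ready = sched[p] + lat[p] + island_delay(island[bound[p]], island[fu])
            if ready > step:
                return False

    # Condition 2b: op's result reaches scheduled successors in time.
    for s in state["succs"].get(op, ()):
        if s in sched:
            ready = step + lat[op] + island_delay(island[fu], island[bound[s]])
            if ready > sched[s]:
                return False
    return True


state = {
    "latency": {"m": 3, "a": 1},
    "island": {"MUL0": (0, 0), "ADD0": (0, 1), "ADD1": (0, 0)},
    "sched": {"m": 1},                      # multiplication m starts at step 1
    "bound": {"m": "MUL0"},
    "busy": {("MUL0", 1), ("MUL0", 2), ("MUL0", 3)},
    "preds": {"a": ["m"]},
    "succs": {"m": ["a"]},
}
print(can_schedule("a", 4, "ADD1", state), can_schedule("a", 4, "ADD0", state))
```

Here the addition can start at step 4 on an adder in the multiplier's island, but needs to wait until step 5 on an adder one island away.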
Register Allocation and Binding Problem formulation Given: A scheduled DFG annotated with bitwidth Objective: Perform register allocation and binding to minimize the total bitwidth of registers Register allocation Decide the minimum required registers Register binding Explicitly map variables to register instances
Preliminaries
  Lifetime of a variable
    s(o): the control step where variable o is produced
    e(o): the last control step where variable o is consumed
  Weighted interval graph
    A proper coloring of G corresponds to a register allocation and binding scheme
  Weight of a coloring scheme
    The weight of color c: W(c) = max{ w(v) | v is colored with c }
    The weight of the coloring scheme P is defined as W(G, P) = Σ_{c ∈ P} W(c); in the example, W(G, P) = 58
  [Figure: scheduled DFG and lifetimes of variables]
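The definitions above translate directly into a few lines of Python. This is a minimal sketch: the variable names and lifetimes are made up for illustration, and it assumes that a variable whose lifetime ends at step t can share a register with one produced at step t (lifetimes are treated as half-open intervals (s, e]).

```python
def conflicts(a, b):
    """Two lifetimes (s, e] conflict when they share a control step."""
    (sa, ea), (sb, eb) = a, b
    return sa < eb and sb < ea


def is_proper(lifetimes, coloring):
    """A coloring is proper if no two conflicting variables share a color."""
    names = list(lifetimes)
    return not any(
        coloring[u] == coloring[v] and conflicts(lifetimes[u], lifetimes[v])
        for i, u in enumerate(names) for v in names[i + 1:])


def coloring_weight(widths, coloring):
    """W(G, P): sum over colors of the widest variable using that color."""
    per_color = {}
    for v, c in coloring.items():
        per_color[c] = max(per_color.get(c, 0), widths[v])
    return sum(per_color.values())


lifetimes = {"v24": (1, 3), "v18": (3, 5), "v16": (2, 4), "v5": (2, 3)}
widths = {"v24": 24, "v18": 18, "v16": 16, "v5": 5}

good = {"v24": 0, "v18": 0, "v16": 1, "v5": 2}   # registers of 24, 16, 5 bits
poor = {"v24": 0, "v18": 1, "v16": 2, "v5": 1}   # registers of 24, 18, 16 bits
print(coloring_weight(widths, good), coloring_weight(widths, poor))
```

Both bindings use three registers, yet their total bitwidths differ, which is why minimizing the number of registers alone is not enough.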
Coloring Problem
  Weighted-interval-graph coloring problem
    Given: A weighted interval graph G(V, E)
    Objective: Find a coloring scheme P of G such that the weight of the coloring scheme, W(G, P), is minimized
  Uniform weights
    Can be solved in polynomial time (left-edge algorithm)
  Various weights
    The complexity remains unknown
    We propose a lower-bound estimation and an efficient algorithm
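For the uniform-weight case, a first-fit variant of the left-edge algorithm is short enough to show in full. This is a generic textbook-style sketch rather than code from the presented flow; it assumes lifetimes are given as (s, e) pairs and that a register becomes free for a variable produced in the step where its previous variable was last consumed.

```python
def left_edge(lifetimes):
    """Bind variables to registers, visiting them by increasing start step.

    lifetimes maps a variable name to (s, e). Returns name -> register index.
    """
    last_end = []   # last_end[r] is the end of the last lifetime in register r
    binding = {}
    for name, (s, e) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
        for r, end in enumerate(last_end):
            if end <= s:            # register r is free again at step s
                last_end[r] = e
                binding[name] = r
                break
        else:                       # no existing register is free
            last_end.append(e)
            binding[name] = len(last_end) - 1
    return binding


lifetimes = {"a": (1, 3), "b": (3, 5), "c": (2, 4), "d": (2, 3)}
binding = left_edge(lifetimes)
print(binding)
```

The result uses as many registers as the largest number of simultaneously live variables, which is optimal when all variables have the same bitwidth; with different bitwidths it says nothing about the total number of bits.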
Lower-Bound Estimation
  |C_24| = 1, |C_18| = 1, |C_16| = 2, |C_5| = 3
  Bitwidth lower bound: 24*1 + 16*1 + 5*1 = 45
  [Figure: scheduled DFG and lifetimes of variables]
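One reading that is consistent with these numbers takes C_w to be a maximum clique among the variables of bitwidth at least w, and charges each weight w for the growth of that clique. The slide does not define C_w, so this interpretation, the function names, and the example lifetimes below are all assumptions. Under this reading the bound is valid because any proper coloring needs at least |C_w| registers of width w or more.

```python
def max_clique(lifetimes):
    """Largest number of lifetimes (s, e] that are live in the same step."""
    points = {e for _, e in lifetimes}
    return max((sum(s < t <= e for s, e in lifetimes) for t in points),
               default=0)


def weight_lower_bound(lifetimes, widths):
    """Sum over distinct widths w (descending) of w * growth of |C_w|."""
    bound, prev = 0, 0
    for w in sorted(set(widths.values()), reverse=True):
        heavy = [lifetimes[v] for v in lifetimes if widths[v] >= w]
        size = max_clique(heavy)
        bound += w * (size - prev)
        prev = size
    return bound


# Hypothetical lifetimes chosen so that |C_24| = 1, |C_18| = 1,
# |C_16| = 2, and |C_5| = 3, matching the clique sizes on the slide.
lifetimes = {"v24": (1, 3), "v18": (3, 5), "v16": (2, 4), "v5": (2, 3)}
widths = {"v24": 24, "v18": 18, "v16": 16, "v5": 5}
print(weight_lower_bound(lifetimes, widths))
```

For these lifetimes the computation is 24*1 + 18*0 + 16*1 + 5*1, the same sum as on the slide.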
Coloring Algorithm
  Weight of coloring: 24*1 + 16*1 + 5*1 = 45
  [Figure: scheduled and bound DFG and lifetimes of variables]
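The slide shows only the result of the coloring algorithm, not its steps. As a stand-in, the sketch below uses a simple heaviest-first, first-fit heuristic, which is a swapped-in technique and not the algorithm proposed in this work. The lifetimes are hypothetical, and lifetimes are again treated as half-open intervals (s, e].

```python
def heaviest_first_coloring(lifetimes, widths):
    """Visit variables by decreasing bitwidth; reuse the first compatible register.

    Returns (binding, total_bits), where binding maps name -> register index.
    """
    registers = []   # registers[r] is the list of variables bound to register r
    binding = {}
    for v in sorted(lifetimes, key=lambda name: -widths[name]):
        s, e = lifetimes[v]
        for r, members in enumerate(registers):
            clash = any(s < lifetimes[m][1] and lifetimes[m][0] < e
                        for m in members)
            if not clash:
                members.append(v)
                binding[v] = r
                break
        else:
            registers.append([v])
            binding[v] = len(registers) - 1
    total_bits = sum(max(widths[m] for m in members) for members in registers)
    return binding, total_bits


lifetimes = {"v24": (1, 3), "v18": (3, 5), "v16": (2, 4), "v5": (2, 3)}
widths = {"v24": 24, "v18": 18, "v16": 16, "v5": 5}
binding, total_bits = heaviest_first_coloring(lifetimes, widths)
print(binding, total_bits)
```

Because the widest variable in each register is placed first, every later variable fits without widening the register. On this small instance the heuristic yields registers of 24, 16, and 5 bits; it carries no optimality guarantee in general.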
Experimental Results - Weighted Interval-Graph Coloring
  Columns: Designs | Lower Bound | Left-Edge+PostProcess | [Kum et al ’01] | Weighted IGC
  Designs: aircraft, chem, dir, honda, lee, mcm, pr, u5ml, wang
  Ave gap: - | +6.6% | +7.5% | +0.05%
Experimental Results - Three Synthesis Flows
  Flow1 (MCAS)
    MCAS generates the scheduling and binding results and the placement information. All operations and variables have a uniform bitwidth (32 bits).
  Flow2 (MCAS+MB-PP)
    Bitwidth post-processing is performed after Flow1: the bitwidth of an FU is set to the maximum bitwidth of all operations executed on it, and the bitwidth of a register is set to the maximum bitwidth of all variables stored in it.
  Flow3 (MCAS-MB)
    After MCAS generates the scheduling and binding results and the placement, the lower-bound-based scheduling and binding and the bitwidth-aware register allocation and binding are performed.
  All three flows share the same backend to generate the datapath and controllers
  Altera’s Quartus II version is used to synthesize the resulting RTL VHDL onto the FPGA device Stratix™ EP1S80F1508C6
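The post-processing step of Flow2 amounts to taking a maximum per resource. A minimal sketch, with hypothetical operation names, FU names, and bitwidths:

```python
def postprocess_bitwidths(binding, widths):
    """Set each resource's bitwidth to the widest item bound to it.

    binding maps an operation (or variable) to its FU (or register);
    widths maps each operation (or variable) to its bitwidth.
    """
    resource_width = {}
    for item, resource in binding.items():
        resource_width[resource] = max(resource_width.get(resource, 0),
                                       widths[item])
    return resource_width


# Hypothetical binding produced with uniform 32-bit resources (Flow1)
binding = {"add1": "ADD0", "add2": "ADD0", "add3": "ADD1", "mul1": "MUL0"}
widths = {"add1": 16, "add2": 24, "add3": 8, "mul1": 18}

fu_width = postprocess_bitwidths(binding, widths)
print(fu_width)
```

The same function applies to registers when `binding` maps variables to register instances. Since the binding itself is unchanged, a narrow operation that shares an FU with a wide one gains nothing, which is the gap a bitwidth-aware binding such as Flow3 aims to close.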
Experimental Results - Comparison of the Three Synthesis Flows
  Columns: Design | Node# | MCAS: LE, WL(k) | MCAS+MB-PP: LE, WL(k) | MCAS-MB: LE, WL(k)
  Designs: aircraft, chem, dir, honda, lee, mcm, pr, u5ml, wang
  Ave Red %: -34.5%, -36.3%, -51.5%
  LE: area results for datapath and control logic in terms of logic elements
  WL: wire-length
Conclusions
  We presented a complete bitwidth-aware high-level synthesis flow based on the MCAS synthesis system
  Experimental results
    Our bitwidth-aware synthesis flow achieves a significant reduction in area and wire-length
References
  J. Choi, J. Jeon, and K. Choi, “Power Minimization of Functional Units by Partially Guarded Computation,” Proc. of ISLPED, 2000.
  J. Cong, Y. Fan, X. Yang, and Z. Zhang, “Architecture and Synthesis for Multi-Cycle Communication,” Proc. of International Symposium on Physical Design.
  J. Cong, Y. Fan, G. Han, X. Yang, and Z. Zhang, “Architecture and Synthesis for On-Chip Multicycle Communication,” IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 2004.
  G. A. Constantinides, P. Y. K. Cheung, and W. Luk, “Optimal Datapath Allocation for Multiple-Wordlength Systems,” IEEE Electronics Letters, 2000.
  G. A. Constantinides, P. Y. K. Cheung, and W. Luk, “Heuristic Datapath Allocation for Multiple Wordlength Systems,” Proc. of Design, Automation and Test in Europe (DATE), 2001.
  K. Kum and W. Sung, “Combined Word-Length Optimization and High-Level Synthesis of Digital Signal Processing Systems,” IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 2001.
  M. C. Molina, J. M. Mendias, and R. Hermida, “High-Level Synthesis of Multiple-Precision Circuits Independent of Data-Objects Length,” Proc. of the 39th Design Automation Conference, 2002.
  A. Sharma and R. Jain, “Estimating Architectural Resources and Performance for High-Level Synthesis Applications,” IEEE Trans. on VLSI Systems, 1993.
  M. Stephenson, J. Babb, and S. Amarasinghe, “Bitwidth Analysis with Application to Silicon Compilation,” Proc. of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation, 2000.