xPilot: A Platform-Based Behavioral Synthesis System
Prof. Jason Cong
Students: Deming Chen, Yiping Fan, Guoling Han, Wei Jiang, Zhiru Zhang
August 2005
Supported by NSF, GSRC, Altera, Xilinx.
2 Outline
- Motivation
- xPilot system framework
- Overview of the synthesis engine
  - Scheduling
  - Resource binding
- Experimental results
3 Motivation (1)
- Design complexity is outgrowing the traditional RTL method
  - It is feasible to build SoC devices with 500M transistors; billion-transistor chips are on the horizon
  - Behavioral synthesis is a critical technology for enabling the move to a higher level of abstraction
  - Reasons for previous failures
    - Lack of a compelling reason: design complexity was still manageable a decade ago
    - Lack of a solid RTL foundation
    - Lack of consideration of physical reality
4 Motivation (2)
- Behavioral synthesis provides combined advantages
  - Better complexity management
    - Code size: RTL design ~300KL vs. behavioral design 40KL [NEC, ASPDAC04]
  - Shorter verification/simulation cycle
    - Simulation speed 100X faster than the RTL-based method
  - Rapid system exploration
    - Quick evaluation of different hardware/software boundaries
    - Fast exploration of multiple micro-architecture alternatives
  - Higher quality of results
    - Full consideration of physical reality
5 xPilot: Platform-Based Behavioral-to-RTL Synthesis Flow
[Flow diagram blocks: Behavioral spec. in C/SystemC; Frontend compiler; SSDM; Platform description; RTL; FPGAs/ASICs]
- Presynthesis optimizations
  - Loop unrolling/shifting
  - Strength reduction / tree height reduction
  - Bitwidth analysis
  - Memory analysis
  - …
- Core synthesis optimizations
  - Scheduling
  - Resource binding, e.g., functional unit binding, register/port binding
- Arch-generation & RTL/constraints generation
  - Verilog/VHDL/SystemC
  - FPGAs: Altera, Xilinx
  - ASICs: Magma, Synopsys, …
6 System-level Synthesis Data Model
- SSDM (System-level Synthesis Data Model)
  - Hierarchical netlist of concurrent processes and communication channels
  - Each leaf process contains a sequential program, represented by an extended LLVM IR with hardware-specific semantics
    - Port / IO interfaces, bit-vector manipulations, cycle-level notations
7 Platform Modeling & Characterization
- Target platform specification
  - High-level resource library with delay/latency/area/power curves for various input/bitwidth configurations
    - Functional units: adders, ALUs, multipliers, comparators, etc.
    - Connectors: mux, demux, etc.
    - Memories: registers, synchronous memories, etc.
  - Chip layout description
    - On-chip resource distributions
    - On-chip interconnect delay/power estimation
8 Scheduling Goals
- A highly versatile scheduling engine
  - Applicable to a wide range of application domains
    - Computation-intensive, data/memory-intensive, control-intensive, etc.
    - Mixed behavioral & RTL
  - Amenable to a rich set of scheduling constraints
    - Data dependency constraints
    - Resource constraints: IO port constraints, memory port constraints, functional unit constraints, etc.
    - Timing constraints: frequency constraints, latency constraints, etc.
    - Relative IO timing constraints: cycle-fixed mode, superstate-fixed mode, free-floating mode, etc.
  - Retargetable to a variety of design objectives
    - High performance, small area, low power, etc.
9 Scheduling Optimization Capabilities
- Offers a variety of optimization techniques in a unified framework
  - Combinational / sequential non-pipelined / pipelined multi-cycle operations
  - Unconditional/conditional operation chaining
  - Relative scheduling
  - Consideration of branching probabilities and repetitions
  - Multi-cycle communication (under development)
  - Code motion & speculation (under development)
  - Functional / loop pipelining (under development)
  - Physical layout integration (to be supported)
10 Scheduling: Current Status
- Design objective
  - Focus on high-performance designs
- Overall approach
  - Use a system of pairwise difference constraints to express all kinds of scheduling constraints
  - Represent the design objective as a linear function
  - The resulting system can be solved directly by any linear programming solver and yields integral solutions
11 Scheduling Design Framework
[Diagram: xPilot scheduler]
- Inputs: CDFG; user-specified design constraints & assignments; target platform modeling (resource library & chip layout)
- Scheduler steps: constraint equations generation and objective function generation; linear programming solver; LP solution interpretation
- System of pairwise difference constraints: relative timing constraints, dependency constraints, frequency constraints, resource constraints, …
- Output: STG (State Transition Graph)
12 Example: Greatest Common Divisor
- GCD C description

    x = inport1;
    y = inport2;
    while (x != y) {
      if (x > y) x = x - y;
      else y = y - x;
    }
    *outport = x;

- Control/data flow graph by basic block (branch edges labeled T)

    BB1: x_0 = inport1; y_0 = inport2; cond1 = (x_0 != y_0);
    BB2: x_1 = φ(x_0, x_1, x_2); y_1 = φ(y_0, y_1, y_2); cond2 = (x_1 > y_1);
    BB3: x_2 = x_1 - y_1; cond3 = (x_2 != y_1);
    BB4: y_2 = y_1 - x_1; cond4 = (x_1 != y_2);
    BB5: x_3 = φ(x_0, x_1, x_2); *outport = x_3;
13 Constraints Generation
- Data dependency constraint
  - If operation v is data dependent on operation u, i.e., $(u, v) \in E$, then $s(v) - s(u) \ge 0$, where the schedule variable $s(v)$ represents the relative schedule of node v
  - Example: u: x_1 = φ(x_0, x_1, x_2); v: cond2 = (x_1 > y_1);
  - Other constraints can be represented in a similar way
- The constraint equations form a system of pairwise difference constraints
  - The constraint matrix A is totally unimodular
  - The feasibility check can be formulated as a single-source shortest path problem
  - Optimization can be performed via any LP solver; the dual problem is equivalent to a min-cost network flow problem
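The feasibility check can be sketched with the textbook Bellman-Ford reduction for difference constraints. This is a minimal illustration, not xPilot code: the function name `solve_difference_constraints`, the tuple encoding `(u, v, c)` for `s(u) - s(v) <= c`, and the extra one-step separation between `x_0` and `x_1` are all assumptions made for the example. A dependency `s(v) - s(u) >= 0` is rewritten as `s(u) - s(v) <= 0`.

```python
def solve_difference_constraints(nodes, constraints):
    """Each constraint (u, v, c) means s(u) - s(v) <= c.

    Returns a dict of integer schedule values (minimum shifted to 0),
    or None if the system is infeasible.
    """
    # A virtual source with 0-weight edges to every node is modeled by
    # starting every distance at 0.
    dist = {n: 0 for n in nodes}
    # Constraint s(u) - s(v) <= c becomes an edge v -> u with weight c.
    edges = [(v, u, c) for (u, v, c) in constraints]
    for _ in range(len(nodes)):
        changed = False
        for a, b, w in edges:
            if dist[a] + w < dist[b]:
                dist[b] = dist[a] + w
                changed = True
        if not changed:
            # Converged: shortest-path distances satisfy every constraint.
            lowest = min(dist.values())
            return {n: d - lowest for n, d in dist.items()}
    return None  # still relaxing after |V| passes: negative cycle


# Dependency from the slide: cond2 depends on x_1, so s(x_1) - s(cond2) <= 0.
# Assumed extra constraint: x_1 at least one step after x_0,
# i.e. s(x_0) - s(x_1) <= -1.
schedule = solve_difference_constraints(
    ["x_0", "x_1", "cond2"],
    [("x_1", "cond2", 0), ("x_0", "x_1", -1)],
)
print(schedule)
```

Because every edge weight is an integer, the shortest-path distances are integers too, which matches the integrality property stated on the slide. A real flow would hand the same constraints to an LP solver together with the linear objective instead of accepting whichever feasible point Bellman-Ford returns.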
14 Solution by LP Solver
[Figure: the GCD control/data flow graph (BB1 to BB5), annotated with control steps 0 and 1 from the LP solution]
- Scheduling is performed across basic block boundaries
15 Schedule Interpretation
- Operations scheduled in control step 0:

    x_0 = inport1; y_0 = inport2; cond1 = (x_0 != y_0);

- Operations scheduled in control step 1:

    x_1 = φ(x_0, x_1, x_2); y_1 = φ(y_0, y_1, y_2); cond2 = (x_1 > y_1);
    x_2 = x_1 - y_1; cond3 = (x_2 != y_1);
    y_2 = y_1 - x_1; cond4 = (x_1 != y_2);
    x_3 = φ(x_0, x_1, x_2); *outport = x_3;

- Control step 1 interpreted as guarded code:

    if (cond1) {
      x_1 = φ(x_0, x_1, x_2); y_1 = φ(y_0, y_1, y_2);
      cond2 = (x_1 > y_1);
      if (cond2) { x_2 = x_1 - y_1; cond3 = (x_2 != y_1); }
      else { y_2 = y_1 - x_1; cond4 = (x_1 != y_2); }
    }
    if (!cond1 || (!cond3 && !cond4)) {
      x_3 = φ(x_0, x_1, x_2); *outport = x_3;
    }
16 Deriving State Transition Graph
- Final STG for GCD
  - First state:

    x_0 = inport1; y_0 = inport2; cond1 = (x_0 != y_0);

  - Second state:

    if (cond1) {
      x_1 = φ(x_0, x_1, x_2); y_1 = φ(y_0, y_1, y_2);
      cond2 = (x_1 > y_1);
      if (cond2) { x_2 = x_1 - y_1; cond3 = (x_2 != y_1); }
      else { y_2 = y_1 - x_1; cond4 = (x_1 != y_2); }
    }
    if (!cond1 || (!cond3 && !cond4)) {
      x_3 = φ(x_0, x_1, x_2); *outport = x_3;
    }

  - Transition condition on the loop edge: cond3 || cond4
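A small software model can help check a derived STG against the original C description. The sketch below assumes one reading of the slide: a first state that reads the inputs and computes cond1, and a second state that executes one loop iteration per cycle and repeats while cond3 || cond4 holds. The function name `run_gcd_stg`, the merged `loop` flag, and the cycle counting are illustrative assumptions; inputs are assumed to be positive integers.

```python
import math


def run_gcd_stg(inport1, inport2):
    """Simulate the assumed two-state GCD STG; returns (result, cycles)."""
    # First state: read the inputs and evaluate the loop entry condition.
    x, y = inport1, inport2
    loop = (x != y)              # cond1
    cycles = 1
    # Second state: one loop iteration per cycle (compare and subtract chained).
    while True:
        cycles += 1
        if loop:
            if x > y:            # cond2
                x = x - y
            else:
                y = y - x
            loop = (x != y)      # cond3 or cond4, depending on the branch
        if not loop:             # !cond1 || (!cond3 && !cond4)
            return x, cycles     # *outport = x_3


result, cycles = run_gcd_stg(12, 18)
print(result, cycles)
```

Comparing `run_gcd_stg` against `math.gcd` over a range of inputs is a cheap sanity check that the guarded code in the second state preserves the behavior of the original while loop.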
17 Unified Resource Binding
- Provides a unified resource sharing framework to optimize for various design objectives
  - Simultaneous functional unit binding, register binding, and port binding
  - Equipped with advanced techniques to optimize the interconnect and steering logic networks
  - Guided by a flexible cost evaluation engine to achieve different objectives, e.g., performance, area, power, etc.
  - Extendable to exploit physical layout information
18 An FU/Register Binding Example
- Observations:
  - Binding has a large impact on the resulting performance and cost
  - Functional unit binding and register binding are highly correlated
- Note: assume all operations and variables are compatible for sharing
19 Drawbacks of Previous Work
- Many existing algorithms focus on functional-unit or register "number" minimization
  - Technology advances: interconnect effect is increasing
    - 51% of the total dynamic power of a microprocessor in 0.13um technology
    - Up to 80% of the dynamic power in future technologies
  - May generate a larger number of multiplexers and interconnects
  - Unfavorable performance and cost results
- Optimization for unrealistic goals
  - Minimize the "number" of FUs, registers, or multiplexers
    - Should have detailed datapath models to guide the optimization
  - No technology-specific consideration
    - Should have platform-specific characterizations
20 Resource Binding in xPilot
[Diagram: xPilot architecture exploration]
- Inputs: STG (State Transition Graph); user-specified design constraints; target platform (resource library & chip layout)
- Exploration blocks: baseline register binding; FU allocation/binding; register allocation/binding; iteration controlled by an "Improved?" check (Yes/No)
- Guidance: datapath model for performance-cost estimation
- Output: STG + best datapath models
21 Design Space Exploration
[Figure: a State Transition Graph (STG) with operations 1*, 2*, 3*, 4*, 5*, 6+ and labels C1, C1', C2, C2'; compatibility graphs; MUL datapath for solution (1, 2, 4) (3); datapath model curve (power vs. delay) for design space pruning, with pruned points marked]
- Exploration phases:
  - Exploring node 2:
    - (1) (2): two mul
    - (1, 2): one mul
  - Exploring node 3:
    - (1) (2) (3): three mul
    - (1, 2) (3): two mul
    - (1, 3) (2): two mul
  - Exploring node 4:
    - (1) (2) (3) (4)
    - (1, 2, 4) (3)
    - (1, 2) (3, 4)
    - (1, 2) (3) (4)
    - (1, 3, 4) (2)
    - (1, 3) (2, 4)
    - (1, 3) (2) (4)
    - …
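The node-by-node exploration with pruning can be sketched as follows. This is an illustrative reconstruction, not the xPilot algorithm: `explore_bindings`, `pareto_prune`, and especially `toy_cost` (power taken as the number of functional units, delay as the largest sharing group) are assumptions standing in for the datapath model curve. Each new operation either gets its own unit or joins a compatible existing group, and only one non-dominated candidate per (power, delay) point is kept.

```python
def pareto_prune(candidates, cost):
    """Keep one non-dominated candidate per (power, delay) point."""
    # Stable sort by cost: lower power first, then lower delay.
    scored = sorted(((cost(c), c) for c in candidates), key=lambda t: t[0])
    kept, best_delay = [], float("inf")
    for (power, delay), cand in scored:
        if delay < best_delay:  # strictly faster than every cheaper candidate
            kept.append(cand)
            best_delay = delay
    return kept


def explore_bindings(ops, compatible, cost):
    """Add one operation at a time, pruning dominated partial bindings."""
    solutions = [[]]  # a solution is a list of groups, one group per unit
    for op in ops:
        candidates = []
        for sol in solutions:
            # Option 1: allocate a new functional unit for op.
            candidates.append(sol + [[op]])
            # Option 2: share a unit whose operations are all compatible with op.
            for i, group in enumerate(sol):
                if all(compatible(op, other) for other in group):
                    candidates.append(sol[:i] + [group + [op]] + sol[i + 1:])
        solutions = pareto_prune(candidates, cost)
    return solutions


def toy_cost(sol):
    # Hypothetical model: power ~ unit count, delay ~ widest input mux.
    return (len(sol), max(len(group) for group in sol))


front = explore_bindings([1, 2, 3], lambda a, b: True, toy_cost)
print(front)
```

Pruning partial solutions this way is a heuristic: a binding that looks dominated after three operations could in principle lead to a better complete binding, so the quality of the result depends on how well the cost model predicts the final datapath.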
22 Experimental Results: Benchmark Suite
- Benchmark suite
  - PR, MCM: DSP kernels; pure additions/subtractions and multiplications
  - CACHE: cache controller; control-intensive design with cycle-accurate I/O operations
  - MOTION: motion compensation algorithm for MPEG-1 decoder; control-intensive with a modest amount of computation
  - IDCT: JPEG inverse discrete cosine transform; computation-intensive
  - DWT: JPEG2000 discrete wavelet transform; computation-intensive with modest control flow
  - EDGELOOP: extracted from an H.264 decoder; a very complex design featuring a mix of computation, control, and memory accesses
23 Experimental Results Code Size Reduction
24 Experimental Results: Comparison with SPARK on Scheduling
- SPARK [UCI/UCSD, 2004], a state-of-the-art academic high-level synthesis tool
- Table columns: Designs | Tool/Flow | Synthesis report: state#, reg# | Altera Quartus II report: fmax (MHz), LE, register, mem, dsp
- Table rows:
  - MOTION: spark, xpilot
  - PR: spark, xpilot
  - IDCT: spark, xpilot, xpilot-mem
  - CACHE: spark (memory unsupported), xpilot-mem
[Most table values were not recovered. Surviving fragments: IDCT spark "176~ ,8474,"; IDCT xpilot ",4815,"; IDCT xpilot-mem ",3516,0981,02464"]
25 Experimental Results: Comparison with SPARK on Binding
- On average, xPilot resource binding achieves designs with similar area and 2.48x higher frequency than SPARK
- Table columns: Designs | SPARK: resource usage (LE, COMB, Lonely-Reg, Comb-Reg, DSP), Fmax (MHz) | xPilot: resource usage (LE, COMB, Lonely-Reg, Comb-Reg, DSP), Fmax (MHz) | Fmax ratio xPilot/SPARK
- Table rows: PR, WANG, LEE, MCM, DIR, FEIG, Total, Ave, Ratio
[Most table values were not recovered. Surviving fragment of the Ratio row: n/a*, 2.96, n/a*, 2.48]
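26 Synthesis Results for DWT (JPEG2000)
- Settings
  - Target platform: Altera Stratix
  - RTL synthesis & place-and-route: Altera Quartus II v5.0
  - Simulation: Mentor ModelSim SE 6.0
- Design alternatives
  - Table columns: Target cycle time | State# | fmax (MHz) | Cycle# | Latency (ns) | LE# | DSP#
  [Table values were not recovered. Three target cycle times are listed; only the first, 9ns, survives]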
27 Experimental Results: ASIC Flow
- Magma RTL-to-GDSII flow
- Technology library: Cadence Generic Standard Cell Library, 0.18um
- Tradeoff study:
  - 1st column: delay constraint enforced in xPilot
  - 2nd column: control step count of xPilot-generated RTL
  - 3rd-5th columns: data reported after mapping by the Magma tool
- DIR table columns: Delay constraint | State # | Cell count | Area (um^2) | Delay (ps) | Fmax (MHz) | Latency (ps)
  [Table values were not recovered. Five delay constraints are listed; only the first, 5ns, survives]
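28 Experimental Results: ASIC Flow (cont.)
- LEE table columns: Delay constraint | State# | Cell count | Area (um^2) | Delay (ps) | Fmax (MHz) | Latency (ps)
  [Table values were not recovered. Five delay constraints are listed; only the first, 5ns, survives]
- Motion table columns: Delay constraint | State# | Area (um^2) | Delay (ps) | Fmax (MHz) | Latency (ps)
  [Table values were not recovered. Four delay constraints are listed; only the first, 10ns, survives]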