xPilot: A Platform-Based Behavioral Synthesis System. Prof. Jason Cong. Students: Deming Chen, Yiping Fan, Guoling Han, Wei Jiang, Zhiru Zhang. August, 2005.


1 xPilot: A Platform-Based Behavioral Synthesis System
Prof. Jason Cong
Students: Deming Chen, Yiping Fan, Guoling Han, Wei Jiang, Zhiru Zhang
August, 2005
Supported by NSF, GSRC, Altera, Xilinx.

2 Outline
- Motivation
- xPilot system framework
- Overview of the synthesis engine
  - Scheduling
  - Resource binding
- Experimental results

3 Motivation (1)
- Design complexity is outgrowing the traditional RTL method
  - Feasible to build an SoC device with 500M transistors; billion-transistor chips are on the horizon
  - Behavioral synthesis: a critical technology for enabling the move to a higher level of abstraction
  - Reasons for previous failures
    - Lack of a compelling reason: design complexity was still manageable a decade ago
    - Lack of a solid RTL foundation
    - Lack of consideration of physical reality

4 Motivation (2)
- Behavioral synthesis provides combined advantages
  - Better complexity management
    - Code size: RTL design ~300KL → behavioral design 40KL [NEC, ASPDAC04]
  - Shorter verification/simulation cycle
    - Simulation speed 100X faster than RTL-based method
  - Rapid system exploration
    - Quick evaluation of different hardware/software boundaries
    - Fast exploration of multiple micro-architecture alternatives
  - Higher quality of results
    - Full consideration of physical reality

5 xPilot: Platform-Based Behavioral to RTL Synthesis Flow
Flow diagram boxes: Behavioral spec. in C/SystemC; Frontend compiler; Platform description; SSDM; RTL; FPGAs/ASICs
- Presynthesis optimizations
  - Loop unrolling/shifting
  - Strength reduction / Tree height reduction
  - Bitwidth analysis
  - Memory analysis
  - …
- Core synthesis optimizations
  - Scheduling
  - Resource binding, e.g., functional unit binding, register/port binding
- Arch-generation & RTL/constraints generation
  - Verilog/VHDL/SystemC
  - FPGAs: Altera, Xilinx
  - ASICs: Magma, Synopsys, …

6 System-level Synthesis Data Model
- SSDM (System-level Synthesis Data Model)
  - Hierarchical netlist of concurrent processes and communication channels
  - Each leaf process contains a sequential program, represented by an extended LLVM IR with hardware-specific semantics
    - Port / IO interfaces, bit-vector manipulations, cycle-level notations

7 Platform Modeling & Characterization
- Target platform specification
  - High-level resource library with delay/latency/area/power curves for various input/bitwidth configurations
    - Functional units: adders, ALUs, multipliers, comparators, etc.
    - Connectors: mux, demux, etc.
    - Memories: registers, synchronous memories, etc.
  - Chip layout description
    - On-chip resource distributions
    - On-chip interconnect delay/power estimation

8 Scheduling: Goals
- A highly versatile scheduling engine
  - Applicable to a wide range of application domains
    - Computation-intensive, data/memory-intensive, control-intensive, etc.
    - Mixed behavioral & RTL
  - Amenable to a rich set of scheduling constraints
    - Data dependency constraints
    - Resource constraints: IO port constraints, memory port constraints, functional unit constraints, etc.
    - Timing constraints: frequency constraints, latency constraints, etc.
    - Relative IO timing constraints: cycle-fixed mode, superstate-fixed mode, free-floating mode, etc.
  - Retargetable to a variety of design objectives
    - High performance, small area, low power, etc.

9 Scheduling: Optimization Capabilities
- Offers a variety of optimization techniques in a unified framework
  - Combinational/Sequential non-pipelined/pipelined multi-cycle operation
  - Unconditional/Conditional operation chaining
  - Relative scheduling
  - Considerations of branching probabilities and repetitions
  - Multi-cycle communication (under development)
  - Code motion & speculation (under development)
  - Functional / loop pipelining (under development)
  - Physical layout integration (to be supported)

10 Scheduling: Current Status
- Design objective
  - Focus on high-performance designs
- Overall approach
  - Use a system of pairwise difference constraints to express all kinds of scheduling constraints
  - Represent the design objective as a linear function
  - The system can be solved directly by any linear programming solver, which yields integral solutions
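As a hedged sketch of what such a formulation might look like, with one integer schedule variable s(v) per operation: the cost weights c_v and the bounds d_{uv} below are illustrative symbols, not notation taken from the slides.

```latex
\begin{aligned}
\min_{s} \quad & \sum_{v \in V} c_v \, s(v) \\
\text{s.t.} \quad & s(u) - s(v) \le d_{uv} \quad \text{for each constrained pair } (u, v), \\
& s(v) \in \mathbb{Z} \quad \text{for all } v \in V .
\end{aligned}
```

A data dependency of v on u, written s(v) - s(u) ≥ 0, fits this form as s(u) - s(v) ≤ 0, i.e. d_{uv} = 0. Every constraint row has exactly one +1 and one -1 coefficient, so with integer d_{uv} the vertices of the LP relaxation are integral.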

11 Scheduling: Design Framework
Framework diagram (xPilot scheduler):
- Inputs: CDFG; user-specified design constraints & assignments; target platform modeling (resource library & chip layout)
- Steps: constraint equations generation; objective function generation; linear programming solver; LP solution interpretation
- System of pairwise difference constraints: relative timing constraints, dependency constraints, frequency constraints, resource constraints, …
- Output: STG (State Transition Graph)

12 Example: Greatest Common Divisor
- GCD C description

    x = inport1;
    y = inport2;
    while (x != y) {
      if (x > y) x = x - y;
      else y = y - x;
    }
    *outport = x;

- Control flow graph in SSA form (branch edges labeled T in the figure)

    BB1: x_0 = inport1; y_0 = inport2; cond1 = (x_0 != y_0);
    BB2: x_1 = φ(x_0, x_1, x_2); y_1 = φ(y_0, y_1, y_2); cond2 = (x_1 > y_1);
    BB3: x_2 = x_1 - y_1; cond3 = (x_2 != y_1);
    BB4: y_2 = y_1 - x_1; cond4 = (x_1 != y_2);
    BB5: x_3 = φ(x_0, x_1, x_2); *outport = x_3;

13 Constraints Generation
- Data dependency constraint
  - Operation v is data dependent on operation u, i.e., (u, v) ∈ E:
      s(v) - s(u) ≥ 0
    where schedule variable s(v) represents the relative schedule of node v
  - Example: u: x_1 = φ(x_0, x_1, x_2); v: cond2 = (x_1 > y_1);
  - Other constraints can be represented in a similar way …
- The constraint equations form a system of pairwise difference constraints
  - Matrix A is totally unimodular
  - Feasibility check can be formulated as a single-source shortest path problem
  - Optimizations can be performed via any LP solver; the dual problem is equivalent to a min-cost network flow problem
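The shortest-path feasibility check can be sketched in Python as follows. This is a textbook Bellman-Ford formulation under stated assumptions, not xPilot's implementation: the function name, the `(u, v, d)` tuple encoding, and the one-cycle separations in the usage example are all hypothetical.

```python
from typing import List, Optional, Tuple


def solve_difference_constraints(
    num_ops: int, constraints: List[Tuple[int, int, int]]
) -> Optional[List[int]]:
    """Each (u, v, d) means s[v] - s[u] >= d. Return a schedule, or None if infeasible."""
    # s[v] - s[u] >= d  is the same as  s[u] - s[v] <= -d,
    # which becomes a graph edge v -> u with weight -d.
    edges = [(v, u, -d) for (u, v, d) in constraints]

    # Starting every distance at 0 is equivalent to adding a virtual source
    # with a zero-weight edge to every operation.
    dist = [0] * num_ops
    for _ in range(num_ops):
        changed = False
        for src, dst, weight in edges:
            if dist[src] + weight < dist[dst]:
                dist[dst] = dist[src] + weight
                changed = True
        if not changed:
            # Converged: shift so the earliest operation sits at step 0.
            earliest = min(dist)
            return [d - earliest for d in dist]
    # Still relaxing after num_ops passes: a negative cycle, so no schedule exists.
    return None


# Hypothetical chain of three operations, each at least one step after the previous one.
chain = solve_difference_constraints(3, [(0, 1, 1), (1, 2, 1)])
# Contradictory pair: op 1 at least one step after op 0, and op 0 no earlier than op 1.
conflict = solve_difference_constraints(2, [(0, 1, 1), (1, 0, 0)])
```

The dependency from the slide (u: the φ for x_1, v: cond2) would be encoded as `(u, v, 0)`.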

14 Solution by LP Solver
- Scheduling is performed across basic block boundaries
Figure: the GCD control flow graph in SSA form (BB1 to BB5, branch edges labeled T), annotated with control steps 0 and 1.

15 Schedule Interpretation

    x_1 = φ(x_0, x_1, x_2); y_1 = φ(y_0, y_1, y_2); cond2 = (x_1 > y_1);
    x_2 = x_1 - y_1; cond3 = (x_2 != y_1);
    y_2 = y_1 - x_1; cond4 = (x_1 != y_2);
    x_3 = φ(x_0, x_1, x_2); *outport = x_3;

    if (cond1) {
      x_1 = φ(x_0, x_1, x_2);
      y_1 = φ(y_0, y_1, y_2);
      cond2 = (x_1 > y_1);
      if (cond2) {
        x_2 = x_1 - y_1;
        cond3 = (x_2 != y_1);
      } else {
        y_2 = y_1 - x_1;
        cond4 = (x_1 != y_2);
      }
    }
    if (!cond1 || !cond3 && !cond4) {
      x_3 = φ(x_0, x_1, x_2);
      *outport = x_3;
    }

    x_0 = inport1; y_0 = inport2; cond1 = (x_0 != y_0);

16 Deriving State Transition Graph
- Final STG for GCD

    x_0 = inport1; y_0 = inport2; cond1 = (x_0 != y_0);

    if (cond1) {
      x_1 = φ(x_0, x_1, x_2);
      y_1 = φ(y_0, y_1, y_2);
      cond2 = (x_1 > y_1);
      if (cond2) {
        x_2 = x_1 - y_1;
        cond3 = (x_2 != y_1);
      } else {
        y_2 = y_1 - x_1;
        cond4 = (x_1 != y_2);
      }
    }
    if (!cond1 || !cond3 && !cond4) {
      x_3 = φ(x_0, x_1, x_2);
      *outport = x_3;
    }

  Transition label in the figure: cond3 || cond4

17 Unified Resource Binding
- Provides a unified resource sharing framework to optimize for various design objectives
  - Simultaneous functional unit binding, register binding and port binding
  - Equipped with advanced techniques to optimize the interconnect and steering logic networks
  - Guided by a flexible cost evaluation engine to achieve different objectives, e.g., performance, area, power, etc.
  - Extendable to exploit physical layout information

18 An FU/Register Binding Example
- Observations:
  - Binding has a large impact on the resulting performance and cost
  - Functional unit and register binding are highly correlated
Note: Assume all operations and variables are compatible for sharing

19 Drawbacks of Previous Work
- Many existing algorithms focus on functional-unit or register "number" minimization
  - Technology advances: interconnect effect increasing
    - 51% of the total dynamic power of a microprocessor in 0.13um tech.
    - Up to 80% of the dynamic power in future technologies
  - May generate a larger amount of multiplexers and interconnects
  - Unfavorable performance and cost results
- Optimization for unrealistic goals
  - Minimize "number" of FUs, registers, or multiplexers
    - Should have detailed datapath models to guide the optimization
  - No technology-specific consideration
    - Should have platform-specific characterizations

20 Resource Binding in xPilot
Flow diagram (xPilot architecture exploration):
- Inputs: STG (State Transition Graph); user-specified design constraints; target platform (resource library & chip layout)
- Steps: Baseline Register Binding; FU Allocation/Binding; Register Allocation/Binding; "Improved?" decision (Yes/No) controlling the iteration
- Supporting model: datapath model for performance-cost estimation
- Output: STG + Best Datapath Models

21 Design Space Exploration
- Exploration phases:
  - Exploring Node 2:
    - (1) (2): two mul
    - (1, 2): one mul
  - Exploring Node 3:
    - (1) (2) (3): three mul
    - (1, 2) (3): two mul
    - (1, 3) (2): two mul
  - Exploring Node 4:
    - (1) (2) (3) (4)
    - (1, 2, 4) (3)
    - (1, 2) (3, 4)
    - (1, 2) (3) (4)
    - (1, 3, 4) (2)
    - (1, 3) (2, 4)
    - (1, 3) (2) (4)
  - …
Figure panels: A State Transition Graph (STG); Compatible Graphs; Datapath for solution (1, 2, 4) (3); Datapath Model Curve for Design Space Pruning (axes: power, delay; pruned points marked)
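The node-by-node exploration above can be sketched as an incremental enumeration: each new multiplication either joins an existing functional unit it is compatible with or gets a unit of its own. This is a minimal illustration, not xPilot's algorithm. The names `extend` and `explore` are hypothetical, the incompatibility of operations 2 and 3 is an assumption inferred from the Node 3 list, and the power/delay pruning step is left out.

```python
from typing import FrozenSet, List, Set, Tuple

# One binding = a tuple of groups; each group lists the operations sharing one unit.
Binding = Tuple[Tuple[int, ...], ...]


def extend(binding: Binding, op: int, incompatible: Set[FrozenSet[int]]) -> List[Binding]:
    """Return every binding obtained by adding `op` to `binding`."""
    results: List[Binding] = []
    for i, group in enumerate(binding):
        # `op` may share a unit only if it is compatible with every member.
        if all(frozenset((op, other)) not in incompatible for other in group):
            results.append(binding[:i] + (group + (op,),) + binding[i + 1:])
    # Allocating a fresh unit for `op` is always possible.
    results.append(binding + ((op,),))
    return results


def explore(ops: List[int], incompatible: Set[FrozenSet[int]]) -> List[Binding]:
    """Enumerate bindings by adding operations one at a time (no pruning)."""
    frontier: List[Binding] = [((ops[0],),)]
    for op in ops[1:]:
        frontier = [new for old in frontier for new in extend(old, op, incompatible)]
    return frontier


# Assumption: operations 2 and 3 cannot share a multiplier.
after_node_3 = explore([1, 2, 3], {frozenset((2, 3))})
for binding in after_node_3:
    print(binding, "->", len(binding), "mul")
```

Under that assumption the three bindings produced for nodes 1 to 3 match the Node 3 list on the slide. Without pruning, adding node 4 yields more candidates than the slide lists, which is where a datapath cost model would discard dominated solutions.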

22 Experimental Results: Benchmark Suite
- Benchmark suite
  - PR, MCM
    - DSP kernels: pure additions/subtractions and multiplications
  - CACHE
    - Cache controller: control-intensive design with cycle-accurate I/O operations
  - MOTION
    - Motion compensation algorithm for MPEG-1 decoder: control-intensive with a modest amount of computation
  - IDCT
    - JPEG inverse discrete cosine transform: computation-intensive
  - DWT
    - JPEG2000 discrete wavelet transform: computation-intensive with modest control flow
  - EDGELOOP
    - Extracted from H.264 decoder: a very complex design featuring a mix of computation, control, and memory accesses

23 Experimental Results: Code Size Reduction

24 Experimental Results: Comparison with SPARK on Scheduling
- SPARK [UCI/UCSD, 2004], a state-of-the-art academic high-level synthesis tool
Columns: Design; Tool/Flow; Synthesis Report (state#, reg#); Altera Quartus II report (fmax (MHz), LE, register, mem, dsp)
  MOTION  spark       1318170.866636704
  MOTION  xpilot      2411161.288826604
  PR      spark       1336130.6508491032
  PR      xpilot      1340178.71,34978300
  IDCT    spark       176~40072.0110,8474,5470138
  IDCT    xpilot      141413105.5311,4815,627064
  IDCT    xpilot-mem  334451162.99,3516,0981,02464
  CACHE   spark       Memory unsupported
  CACHE   xpilot-mem  4716161.637126530720

25 Experimental Results: Comparison with SPARK on Binding
- On average, xPilot resource binding achieves designs with similar area and 2.48x higher frequency than SPARK

           SPARK                                                 xPilot                                                Fmax ratio
Design     LE     COMB   Lonely-Reg  Comb-Reg  DSP   Fmax (MHz)  LE     COMB   Lonely-Reg  Comb-Reg  DSP   Fmax (MHz)  xPilot/SPARK
PR         1108   815    0           293       0     123.53      1349   713    84          552       0     178.7       1.45
WANG       1217   942    0           275       0     118.89      1105   527    62          516       8     166.11      1.40
LEE        1367   1052   0           315       0     119.32      1585   691    207         687       4     166.61      1.40
MCM        2808   2248   0           560       0     74.87       2402   981    73          1348      0     152.56      2.04
DIR        2425   2034   0           391       6     69.38       3489   1752   110         1627      4     146.8       2.12
FEIG       16170  13136  0           3034      6     37.17       10539  2295   240         8004      4     173.49      4.67
Total      25095  20227  0           4868      12    543.16      20469  6959   776         12734     20    984.27      1.81
Ave Ratio  1      1      1           1         1     1           1.17   0.65   n/a*        2.96      n/a*  2.48

26 Synthesis Results for DWT (JPEG2000)
- Settings
  - Target platform: Altera Stratix
  - RTL synthesis & place-and-route: Altera Quartus II v5.0
  - Simulation: Mentor ModelSim SE 6.0
- Design alternatives

Target cycle time  State#  fmax (MHz)  Cycle#  Latency (µs)  LE#   DSP#
9ns                34      123.56      4830    39.1          1777  128
7ns                36      147.28      5211    35.4          1862  128
5.5ns              51      183.62      6926    37.8          1926  128
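As a quick sanity check on the DWT table, dividing the cycle count by fmax reproduces the latency column when the result is read in microseconds. The helper name `latency_us` is hypothetical. This relation is an observation that fits the reported numbers, not something the slides state.

```python
def latency_us(cycles: int, fmax_mhz: float) -> float:
    """Cycles divided by a frequency in MHz gives a time in microseconds."""
    return cycles / fmax_mhz


# (target cycle time, Cycle#, fmax in MHz) taken from the DWT table.
dwt_rows = [("9ns", 4830, 123.56), ("7ns", 5211, 147.28), ("5.5ns", 6926, 183.62)]
for target, cycles, fmax in dwt_rows:
    print(target, round(latency_us(cycles, fmax), 1))
```

The first two rows round to the reported 39.1 and 35.4; the third comes out near 37.7 against the reported 37.8, which may reflect rounding in the source.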

27 Experimental Results: ASIC Flow
- Magma RTL to GDSII flow
- Technology library: Cadence Generic Standard Cell Library 0.18um
- Tradeoff study:
  - 1st column: delay constraint enforced in xPilot
  - 2nd column: control step count of xPilot-generated RTL
  - 3rd-5th columns: data reported after mapping by Magma tool

DIR    State#  Cell count  Area (um^2)  Delay (ps)  Fmax (MHz)  Latency (ps)
5ns    5       17555       1256584      2111        473.71      10555
10ns   3       23077       1332203      2139        467.51      6417
15ns   2       28381       1486487      2181        458.51      4362
20ns   2       27189       1394451      2514        397.77      5028
30ns   1       27797       1401642      2725        366.97      2725

28 Experimental Results: ASIC Flow (cont.)

LEE    State#  Cell count  Area (um^2)  Delay (ps)  Fmax (MHz)  Latency (ps)
5ns    8       8242        509807       2066        484.03      16528
10ns   4       15989       708870       2254        443.66      9016
15ns   2       16698       703381       3423        292.14      6846
20ns   2       15256       656147       4226        236.63      8452
30ns   1       16085       697363       5070        197.24      5070

Motion State#  Cell count  Area (um^2)  Delay (ps)  Fmax (MHz)  Latency (ps)
10ns   35      16474       909721       2107        474.61      73745
15ns   30      15695       847262       2358        424.09      70740
20ns   28      16463       867898       2498        400.32      69944
30ns   28      15807       852573       2563        390.17      71764
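The ASIC tables are internally consistent with two simple relations: Fmax is the reciprocal of the reported delay, and latency is the state count times that delay. The sketch below checks this on a few rows; the function names are hypothetical and the relations are inferred from the numbers rather than stated on the slides.

```python
def fmax_mhz(delay_ps: float) -> float:
    """A clock period in picoseconds corresponds to 1e6 / period MHz."""
    return 1e6 / delay_ps


def latency_ps(states: int, delay_ps: int) -> int:
    """Total latency when each state takes one clock period."""
    return states * delay_ps


# (design, constraint, State#, Delay in ps) taken from the ASIC tables.
asic_rows = [("DIR", "5ns", 5, 2111), ("LEE", "5ns", 8, 2066), ("Motion", "10ns", 35, 2107)]
for design, constraint, states, delay in asic_rows:
    print(design, constraint, round(fmax_mhz(delay), 2), latency_ps(states, delay))
```

Read this way, the tradeoff in the tables is that a tighter delay constraint raises Fmax but adds states, so total latency does not necessarily improve.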

