XPilot: A Platform-Based System-Level Synthesis for Reconfigurable SOCs Prof. Jason Cong UCLA Computer Science Department.

xPilot: A Platform-Based System-Level Synthesis for Reconfigurable SOCs Prof. Jason Cong cong@cs.ucla.edu UCLA Computer Science Department

Motivation
- Design complexity is outgrowing the traditional RTL method, even in current CMOS technologies
- Nanotechnology will enable a 10-100x increase in device density and degree of integration
- Need to enable a higher level of design abstraction
  - Start from behavioral descriptions (e.g., C or SystemC)
  - Use and/or reuse more complex functional units (e.g., processor cores instead of standard cells)

ESL Tools: A Lot of Interest …

xPilot: Platform-Based Synthesis System
[Figure: xPilot system diagram. Blocks: xPilot front end (inputs: SystemC/C, platform description & constraints); SSDM (System-Level Synthesis Data Model); profiling, analysis, mapping; behavioral synthesis, interface synthesis, processor & architecture synthesis; outputs for the FPSoC: processor cores + executables, drivers + glue logic, custom logic.]
- Uniqueness of xPilot
  - Platform-based synthesis and optimization
  - Communication-centric synthesis with interconnect optimization

xPilot: Behavioral-to-RTL Synthesis Flow
[Figure: flow from a behavioral spec. in C/SystemC and a platform description, through the frontend compiler and SSDM, to RTL + constraints for FPGAs/ASICs.]
- Presynthesis optimizations
  - Loop unrolling/shifting
  - Strength reduction / tree height reduction
  - Bitwidth analysis
  - Memory analysis …
- Core synthesis optimizations
  - Scheduling
  - Resource binding, e.g., functional unit binding, register/port binding
- Arch-generation & RTL/constraints generation
  - Verilog/VHDL/SystemC
  - FPGAs: Altera, Xilinx
  - ASICs: Magma, Synopsys, …
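As a rough illustration of one presynthesis optimization named on this slide, tree height reduction, the sketch below rebuilds a chained sum as a balanced adder tree and compares the number of adder levels. It is not xPilot code; the tuple-based expression encoding and the names `chain`, `balance`, and `depth` are assumptions made for this example.

```python
from functools import reduce

def chain(operands):
    """Left-to-right sum ((a + b) + c) + ... encoded as nested tuples."""
    return reduce(lambda left, right: ("+", left, right), operands)

def balance(operands):
    """Rebuild the same sum as a balanced binary tree."""
    level = list(operands)
    while len(level) > 1:
        paired = [("+", level[i], level[i + 1])
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an odd operand is carried to the next level
            paired.append(level[-1])
        level = paired
    return level[0]

def depth(tree):
    """Number of adder levels on the longest path (leaves are strings)."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(depth(tree[1]), depth(tree[2]))

operands = ["a", "b", "c", "d", "e", "f", "g", "h"]
chain_depth = depth(chain(operands))       # adder levels in the chained form
balanced_depth = depth(balance(operands))  # adder levels after balancing
```

For eight operands the chained form needs seven adder levels and the balanced form three, which is why the transformation can shorten the critical path before scheduling.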

System-Level Exploration Using xPilot for Heterogeneous MPSoC Platforms
- Heterogeneous MPSoC exploration
  - Processors
    - Heterogeneous vs. homogeneous
    - General-purpose vs. application-specific
  - On-chip communication architecture (OCA)
    - Bus (e.g., AMBA, CoreConnect), packet-switching network (e.g., Alpha 21364)
  - Memory hierarchy
[Figure: MPSoC platform in which μPs (with OS, drivers, tasks), IP, FPGA, and DSP blocks attach through network interfaces to a communication network.]

Outline
- xPilot Overview
  - Behavior-level synthesis in xPilot
  - System-level synthesis in xPilot
- Recent Progress in xPilot
  - Interface synthesis
  - Resource binding based on distributed register architecture
- Conclusions

Advantages of Behavioral Synthesis
- Shorter verification/simulation cycle
- Better complexity management, faster time to market
- Rapid system exploration
  - Quick evaluation of different hardware/software boundaries
  - Fast exploration of multiple micro-architecture alternatives
- Higher quality of results
  - Platform-based synthesis & optimization
  - Full consideration of physical reality

Example: Better Complexity Management
- Shorter verification/simulation cycle
  - Simulation speed 100X faster than RTL-based method [NEC, ASPDAC04]
- Significant code size reduction
  - RTL design: ~300KL
  - Behavioral design: 40KL [NEC, ASPDAC04]
  - VHDL code generated by UCLA xPilot targeting the Altera Stratix platform
  - Over 10x code size reduction can be achieved

Unique Features of xPilot (1): Platform-based Synthesis & Optimization
- Platform-based synthesis & optimization
  - The quality of an RTL design is platform-dependent
  - Designers often lack complete and detailed knowledge of the target platform

Resource           Area            Delay (ns)
ADDSUB-24b         25 LUTs         2.27
ADDSUB-32b         33 LUTs         2.61
MUX8to1-24b        120 LUTs        2.92
MUX16to1-24b       264 LUTs        4.658
DSPMUL-18bx18b     2 DSP Blocks    3.833
DSPMUL-24bx24b     8 DSP Blocks    7.688

- Platform: Altera Stratix
- RTL synthesis & place-and-route: Altera Quartus II v5.0

3X3 Delay Matrix (figure corner labels: (0,0) and (95,61))
0.58   1.8   2.8
2.0    2.9   3.7
2.8    3.8   4.7

Unique Features of xPilot (2): Communication-Centric Synthesis & Optimization
- System performance & power are dominated by interconnect
- It is difficult for designers to consider physical layout at the RT level
- Layout-aware performance optimization
  - Overlap computation with communication
  [Figure: data transfers among add 1, mul 1, add 2, mul 2.]
- Layout-aware power optimization
  - Binding solution 1: Both multipliers keep active
  - Binding solution 2: mul 2 can be powered off when the false branch is taken
  [Figure: CDFG with a conditional (T/F branches), comparison nodes, and multiplications 2* to 6*; two bindings shown: mul 1 (2,5,6), mul 2 (3,4) and mul 1 (2,4,5), mul 2 (3,6).]

Unique Features of xPilot (3): Highly Scalable and Optimized Synthesis Algorithms
- Use of highly scalable and optimized synthesis algorithms for best quality of results
  - Interface synthesis: simultaneous data and communication scheduling for latency minimization
  - Scheduling: a unified framework for multi-constraint and multi-objective scheduling based on the system of difference constraints (SDC)
  - Resource binding: use of distributed register architectures for interconnect/communication optimization
  - Power optimization: optimal functional module and voltage binding
  - …
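To make the idea of scheduling with a system of difference constraints more concrete, here is a minimal sketch that is not xPilot's scheduler. It assumes every constraint has the form s[u] - s[v] <= c over integer start cycles and finds the earliest non-negative schedule with Bellman-Ford-style relaxation. The function name `solve_sdc` and the three-operation example are hypothetical.

```python
def solve_sdc(variables, constraints):
    """Return the earliest non-negative schedule, or None if infeasible.

    Each constraint (u, v, c) means s[u] - s[v] <= c.
    """
    start = {v: 0 for v in variables}
    for _ in range(len(variables)):
        changed = False
        for u, v, c in constraints:
            # s[u] - s[v] <= c is the same as s[v] >= s[u] - c
            if start[v] < start[u] - c:
                start[v] = start[u] - c
                changed = True
        if not changed:
            return start
    return None  # still changing after |V| passes: contradictory constraints

# Hypothetical example: a is a 2-cycle multiply, b a 1-cycle add, c uses both.
constraints = [
    ("a", "c", -2),  # dependency: c starts at least 2 cycles after a
    ("b", "c", -1),  # dependency: c starts at least 1 cycle after b
    ("c", "b", 1),   # relative timing: c starts at most 1 cycle after b
]
schedule = solve_sdc(["a", "b", "c"], constraints)
infeasible = solve_sdc(["a", "b", "c"], constraints + [("c", "b", 0)])
```

Dependencies and relative timing requirements are both written as difference constraints here, which is the sense in which such a formulation can unify several constraint types; objectives are left out of this sketch.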

Behavior and Communication Co-Optimization for Systems with SCM
- SCM: Sequential Communication Media
  - FIFOs (e.g., Xilinx FSLs), buses (e.g., Xilinx CoreConnect, Altera Avalon, etc.)
  - Data must be read and written in the same order
  - Order may have a dramatic impact on performance
    - The best order should guarantee that no data transmission on the critical path is delayed by a non-critical transmission
[Figure: DCT example. Custom logic 1 and custom logic 2 (processes P1 and P2, on PE1 and PE2) communicate data[8] over a FIFO channel C.]
    for (int i = 0; i < 8; i++) {
      S1: data[i] = …;
    }
    int s07 = data[0] + data[7];
    int s16 = data[1] + data[6];
    …..

SCM Co-Optimization: Problem Formulation
- Given:
  - A set of processes P connected by a set of channels C
  - A set of data D = {d_1, d_2, …, d_m} to be transmitted on each channel c_j
- Goal:
  - Find the optimal transmission order for each process, so that the overall latency of the process network is minimized subject to the given design constraints and platform specifications
  - At the same time, automatically generate the drivers and glue logic for each process
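The effect of transmission order can be shown with a deliberately simplified toy model, which is not the SCOOP formulation. It assumes one channel that delivers one data item per cycle, and a consumer that needs a fixed number of further cycles (a hypothetical `tail`) after each item arrives. The sketch brute-forces all orders, so it is only meant for tiny examples.

```python
from itertools import permutations

def latency(order, tail):
    """Item k in the order arrives at cycle k + 1, then needs tail[item] more cycles."""
    return max(position + 1 + tail[item] for position, item in enumerate(order))

def best_order(items, tail):
    """Exhaustively pick a transmission order with minimum latency."""
    return min(permutations(items), key=lambda order: latency(order, tail))

# Hypothetical data items; d2 feeds the longest downstream computation.
tail = {"d1": 1, "d2": 4, "d3": 2, "d4": 1}
program_order = ("d1", "d2", "d3", "d4")
optimal = best_order(program_order, tail)

program_latency = latency(program_order, tail)
optimal_latency = latency(optimal, tail)
```

In this model the program order takes 6 cycles while the best order takes 5, because the item on the critical path (d2) is no longer delayed behind a non-critical one.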

Proposed SCM Co-Optimization Design Flow
[Figure: SCOOP (SCM CO-Optimization) flow. Inputs: process network, process behavior, platform description & constraints. Blocks: front end; System-Level Synthesis Data Model; communication order detection; indices compression for loop reordering; code transformation and interface generation. Output: drivers + glue logic.]

Communication Order Detection
- Step 1. Construct a global CDFG by merging the individual CDFGs of each process
- Step 2. Solve a resource-constrained min-latency scheduling problem to optimize the total latency of the global CDFG
[Figure: Process 1 and Process 2 connected by transfers T1, T2, T3 (Ti: FIFO), shown with schedules of latency = 5 cycles and latency = 7 cycles.]
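The two steps above can be sketched on a toy example. This is a hedged illustration only: the slide does not say which scheduler is used, so a simple unit-latency list-scheduling heuristic is swapped in here, and it does not guarantee minimum latency in general. The graph encoding, the modelling of the FIFO as a single shared resource of kind "fifo", and all operation names are assumptions.

```python
def merge_cdfgs(graphs, channel_edges):
    """Step 1: union the per-process graphs and add producer-to-consumer edges."""
    merged = {op: (kind, list(preds))
              for g in graphs for op, (kind, preds) in g.items()}
    for src, dst in channel_edges:
        merged[dst][1].append(src)
    return merged

def list_schedule(graph, units):
    """Step 2 (heuristic): unit-latency list scheduling under resource limits."""
    assert all(units.get(kind, 0) > 0 for kind, _ in graph.values())
    succs = {op: [] for op in graph}
    for op, (_, preds) in graph.items():
        for p in preds:
            succs[p].append(op)
    height = {}
    def longest_path(op):  # priority: longest path from op to any sink
        if op not in height:
            height[op] = 1 + max((longest_path(s) for s in succs[op]), default=0)
        return height[op]
    start, cycle = {}, 0
    while len(start) < len(graph):
        free = dict(units)
        ready = [op for op in graph if op not in start
                 and all(p in start and start[p] < cycle for p in graph[op][1])]
        for op in sorted(ready, key=longest_path, reverse=True):
            kind = graph[op][0]
            if free[kind] > 0:
                free[kind] -= 1
                start[op] = cycle
        cycle += 1
    return start, cycle

# Hypothetical producer (p1), consumer (p2), and two FIFO transfers.
p1 = {"a1": ("add", []), "a2": ("add", [])}
fifo = {"T1": ("fifo", []), "T2": ("fifo", [])}
p2 = {"m": ("mul", []), "s": ("add", ["m"])}
edges = [("a1", "T1"), ("a2", "T2"), ("T1", "m"), ("T2", "s")]
global_cdfg = merge_cdfgs([p1, fifo, p2], edges)
start, total = list_schedule(global_cdfg, {"add": 2, "mul": 1, "fifo": 1})
```

The resulting start cycles of T1 and T2 give the transmission order on the FIFO; in this example T1 is sent first because it feeds the longer path in the consumer.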

Loop Indices Compression
- Given the optimal order, we try to generate restructured loops for code compression
  - I.e., given the original iteration order and the reordered iteration order, find the minimum number of linear intervals to represent the new iteration space
- Example
  - Original order: (0,0), (0,1), (1,0), (1,1)
  - After reordering: (0,0), (1,0), (0,1), (1,1)
  - Need to solve the linear system
  - Solution: i' = j, j' = i
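The example on this slide can be checked with a small search. The sketch below is an assumption-laden stand-in for solving the linear system: it tries affine maps (i', j') = (a*i + b*j + e, c*i + d*j + f) with coefficients restricted to -1, 0, 1 and returns the first one that sends every original iteration to its reordered counterpart. The function name `find_affine_map` is hypothetical.

```python
from itertools import product

def find_affine_map(original, reordered, coeffs=(-1, 0, 1)):
    """Search for (i', j') = (a*i + b*j + e, c*i + d*j + f) matching every pair."""
    for a, b, e, c, d, f in product(coeffs, repeat=6):
        if all((a * i + b * j + e, c * i + d * j + f) == target
               for (i, j), target in zip(original, reordered)):
            return (a, b, e), (c, d, f)
    return None  # no single affine map with these coefficients fits

original = [(0, 0), (0, 1), (1, 0), (1, 1)]
reordered = [(0, 0), (1, 0), (0, 1), (1, 1)]
mapping = find_affine_map(original, reordered)
```

Here `mapping` is ((0, 1, 0), (1, 0, 0)), i.e. i' = j and j' = i, so a single interchanged loop nest reproduces the reordered iteration space.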

Preliminary Experimental Results

            Total latency (Cycle#)              RAs Compress
Designs     Trad.    SCOOP    Reduction         Before After
DCT1        325      290      10.77%            00
Haar        142      134      5.63%             00
DWT         689      617      10.45%            00
Mat_mul     408      339      16.91%            9620
DCT2        483      419      13.25%            8064
Masking     620      420      32.26%            1920
Dot         1903     1084     43.04%            3000

An average of 26% improvement in total latency can be achieved.
- Experimental setting
  - Target communication model: two-process producer-consumer model
  - Behavioral synthesizer: UCLA xPilot
  - RTL simulator: Mentor ModelSim

Advantage of Register-File Microarchitectures
- (a) A scheduled DFG with register binding indicated on each variable
- (b) Binding using discrete registers
- (c) Binding using a register file
[Figure: panels (a), (b), (c).]

Distributed Register-File Microarchitecture
[Figure: FP-SoC with Islands A, B, C and on-chip memory blocks. Island labels: input buffers, local register file, data-routing logic, MUX, functional unit pool (FUP) with MUL, ALU, ALU'.]

On-chip RAM resource (Virtex II and Stratix)

Xilinx XC-2V        2000    3000    4000    6000    8000
#18Kb BRAM          56      96      120     144     168
Dist. RAM (Kb)      336     448     720     1,056   1,456

Altera EP1S         25      30      40      60      80
#M512 (512b)        224     295     384     574     767
#M4K (4Kb)          138     171     183     292     364
#M-RAM (512Kb)      2       4       4       6       9

Resource Binding for DRFM
- Facts under simplified assumptions
  - Operations bound onto an island form a chain in the given scheduled DFG
  - Inter-chain data transfers may share a physical inter-island connection
- The number of inter-island connections is crucial to the QoR of a DRFM instance
[Figure: scheduled DFG with operations v1 to v10, with labels A, B, C, D and 1 to 4.]
- Inter-island connections
  - (A,B) = (A,D) = 1
  - (A,C) = 1, two data transfers share one connection
  - (C,D) = 2

Resource Binding Problem for DRFM
- General DRFM binding problem
  - Given a scheduled DFG G and a DRFM M, find a feasible resource binding B(G,M) such that the quality of B is optimized
    - Hard to characterize the quality of binding solution B
    - The problem is too ad hoc
- Relaxed problem: DRFM binding for minimizing inter-island connections
  - Given a scheduled DFG G and a DRFM M, find a feasible resource binding B(G,M) such that the total number of inter-island connections of B is minimized
  - Solution: control step by control step binding with min-cost bipartite matching
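A minimal sketch of step-by-step binding, under several assumptions not taken from the slides: connections are treated as directed (source island, destination island) pairs, the cost of placing an operation on an island is the number of new connections its operands would require, every operand is scheduled in an earlier control step, and exhaustive enumeration over island assignments stands in for a real min-cost bipartite matching solver. The function name `bind_drfm` and the five-operation DFG are hypothetical.

```python
from itertools import permutations

def bind_drfm(ops, islands):
    """ops maps name -> (control_step, [predecessor names]).

    Returns (binding, connections): op -> island, and the set of directed
    inter-island connections that the binding requires.
    """
    binding, connections = {}, set()

    def needed(op, island):
        """Connections required if op is placed on island."""
        return {(binding[p], island) for p in ops[op][1] if binding[p] != island}

    for step in sorted({s for s, _ in ops.values()}):
        current = sorted(op for op, (s, _) in ops.items() if s == step)
        # Brute-force stand-in for min-cost bipartite matching (tiny inputs only).
        best = min(
            permutations(islands, len(current)),
            key=lambda perm: sum(len(needed(op, isl) - connections)
                                 for op, isl in zip(current, perm)),
        )
        for op, island in zip(current, best):
            connections |= needed(op, island)
            binding[op] = island
    return binding, connections

# Hypothetical scheduled DFG with three control steps and two islands.
ops = {
    "v1": (1, []), "v2": (1, []),
    "v3": (2, ["v2"]), "v4": (2, ["v1"]),
    "v5": (3, ["v3", "v4"]),
}
binding, connections = bind_drfm(ops, ["A", "B"])
```

In this example v3 follows its operand v2 onto island B and v4 follows v1 onto island A, so only the final operation v5 needs one inter-island connection.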

Three Experimental Flows for Comparison
[Figure: xPilot behavioral synthesis system. Blocks: xPilot frontend; SSDM/CDFG; scheduling algorithms; scheduled CDFG (STG); one of the three binding flows below; RTL generation; Xilinx Virtex II.]
1) Binding on Discrete-Register Microarchitecture
2) Baseline (Random) DRFM Binding
3) DRFM Binding for Minimizing Inter-Island Connections

Experimental Results
- Xilinx ISE 7.1; Virtex II; target clock period: 8ns
- The baseline DRFM binding achieves a 46.70% slice reduction over the discrete-register approach
- Optimized DRFM binding reduces slices by a further 12.21%
- Overall, more than 2X logic slice reduction with a better clock period (7.8%)
[Figure: charts of area (slices; DRF solutions use on-chip RAM blocks) and clock period (ns).]

Conclusions
- xPilot can automatically synthesize behavior-level C or SystemC representations to RTL code with the necessary design constraints
- Platform-based synthesis with physical planning provides
  - Shorter verification/simulation cycle
  - Better complexity management, faster time to market
  - Rapid system exploration
  - Higher quality of results
- xPilot can help to explore the efficient use of (multiple) on-chip processors
- xPilot can efficiently optimize the software for reconfigurable processors
- We are interested in engaging with selected industrial partners to further validate and enhance the technology

Acknowledgements
- We would like to thank the following for their support:
  - National Science Foundation (NSF)
  - Gigascale Systems Research Center (GSRC)
  - Semiconductor Research Corporation (SRC)
  - Industrial sponsors under the California MICRO programs (Altera, Xilinx)
- Team members: Yiping Fan, Zhiru Zhang, Wei Jiang, Guoling Han
