Platform-Based Behavior-Level and System-Level Synthesis

Prof. Jason Cong UCLA Computer Science Department

Outline
Motivation
xPilot system framework
Behavior-level synthesis in xPilot
  Advantages of behavioral synthesis
  Scheduling
  Resource binding
System-level synthesis in xPilot
  Synthesis for ASIP platforms
  Design exploration for heterogeneous MPSoCs
Conclusions

ASICs SOC Example: Philips Nexperia
Philips Nexperia SoC platform for high-end digital video (courtesy Philips)
MIPS™ CPU (PRxxxx): general-purpose scalable RISC processor, 50 to 300+ MHz, 32-bit or 64-bit
TriMedia™ CPU (TM-xxxx): scalable VLIW media processor, 100 to 300+ MHz, 32-bit or 64-bit
Library of device IP blocks: image coprocessors, DSPs, UART, 1394, USB, …
Nexperia™ system buses
[Figure: DVP system silicon block diagram with MIPS CPU, TriMedia CPU (I$, D$), device IP blocks, PI bus, DVP memory bus, MMI, SDRAM, access control, MPEG, VLIW, video, MSP]
Note: point out the general processor core, light-weight micro-engines, and acceleration logic

Field-Programmable SOC Example: Xilinx Virtex-4 FPGA
PowerPC 405 (PPC405) core: 450 MHz, 700+ DMIPS, RISC core (32-bit Harvard architecture)
MicroBlaze soft-core processor: 180 MHz, 166 DMIPS, < ~1300 LUTs
H.264/AVC hardware blocks
[Figure: IBM CoreConnect™ bus with MicroBlaze and IP blocks]
Courtesy Xilinx

IC Design Steps
[Figure: design flow with the stages System-Level Specification, Behavior-level Description, RT-Level Description, Generic Logic Description, Gate/Circuit Design, Placed & Routed Design, Fabrication, Packaging; transformations labeled Synthesis, Technology Mapping, Physical Design] [©Sherwani]
Example logic expressions: X=(AB*CD)+(A+D)+(A(B+C)); Y=(A(B+C)+AC+D+A(BC+D))
Notes:
  These steps are not etched in stone; there are many variations
  The logic description is usually generated by computer-aided design (CAD) tools
  The gate-level design (also known as a "netlist") describes the design in the atomic entities of the technology: transistors for a CMOS design, look-up tables (LUTs) for an FPGA design
  The process of converting a logic description to a gate-level design is called technology mapping
  Many optimizations are involved after technology mapping
  We might go back and forth between these steps (e.g., after the gate-level description we might simulate and find bugs, then go back to the RTL or high-level description and fix them)
  Testing and verification are not shown
  This course helps you understand the methods and algorithms used for automatic high-level synthesis and physical design
  You will develop small CAD tools that do these steps automatically

xPilot: Platform-Based Synthesis System
Inputs: SystemC/C; platform description & constraints
xPilot
  xPilot front end, profiling
  SSDM (System-Level Synthesis Data Model)
  Analysis, mapping
  Processor & architecture synthesis, interface synthesis, behavioral synthesis
Outputs: custom logic; processor cores + executables; drivers + glue logic (embedded SoC)
Uniqueness of xPilot
  Platform-based synthesis and optimization
  Communication-centric synthesis with interconnect optimization

xPilot: Behavioral-to-RTL Synthesis Flow
Inputs: behavioral spec. in C/SystemC; platform description
Frontend compiler
SSDM
  Presynthesis optimizations
    Loop unrolling/shifting
    Strength reduction / tree height reduction
    Bitwidth analysis
    Memory analysis
    …
  Core synthesis optimizations
    Scheduling
    Resource binding, e.g., functional unit binding, register/port binding
Arch-generation & RTL/constraints generation
  Verilog/VHDL/SystemC
  FPGAs: Altera, Xilinx
  ASICs: Magma, Synopsys, …
Output: RTL + constraints for FPGAs/ASICs

Advantages of Behavioral Synthesis
Shorter verification/simulation cycle
  100X speedup with behavior-level simulation
Better complexity management, faster time to market
  A 10M-gate design may require 700K lines of RTL code
Rapid system exploration
  Quick evaluation of different hardware/software boundaries
  Fast exploration of multiple micro-architecture alternatives
Higher quality of results
  Platform-based synthesis & optimization
  Full consideration of physical reality

Behavior Synthesis Has Been Tried and Failed – Why?
Reasons for previous failures
  Lack of a compelling reason: design complexity was still manageable a decade ago
  Lack of a solid RTL foundation
  Lack of consideration of physical reality
  Lack of widely accepted behavior models

xPilot Advantages
  Advanced algorithms for platform-based, communication-centric optimization
  Platform-based behavior and system synthesis
  Communication/interconnect-centric approach
  Complete validation through final P&R on FPGAs

Platform Modeling & Characterization
Target platform specification
  High-level resource library with delay/latency/area/power curves for various input/bitwidth configurations
    Functional units: adders, ALUs, multipliers, comparators, etc.
    Connectors: mux, demux, etc.
    Memories: registers, synchronous memories, etc.
  Chip layout description
    On-chip resource distributions
    On-chip interconnect delay/power estimation
Two binding solutions for the same behavior (built from ALUs and a MUX): which one is better?
  The answer is platform-dependent: how large/fast are the MUX and the ALU?
[Figure: 3X3 delay matrix for Stratix-EP1S40; values shown: 0.58 1.8 2.8 2.0 2.9 3.7 3.8 4.7]

Advanced Behavior System Algorithms: Example: Versatile Scheduling Algorithm Based on SDC
The scheduling problem in behavioral synthesis is NP-complete under general design constraints
ILP-based solutions are versatile but very inefficient
  Exponential time complexity
[Figure: operations *1, +2, +3, +4, *5 assigned to control steps CS0 and CS1]

Existing Scheduling Techniques for Behavioral Synthesis
Heuristic approach: fast, but ad hoc (efficiency limited to specific applications)
  Data-flow-based scheduling (targets data-flow-intensive designs, e.g., DSP applications, image processing applications, etc.)
  Control-flow-based scheduling (targets control-flow-intensive designs, e.g., controllers, network protocol processors, etc.)
Exact approach: versatile, but inefficient (poor scalability)
  ILP-based scheduling, e.g., [Huang et al., TCAD'91], etc.
  BDD-based symbolic scheduling, e.g., [Radivojevic and Brewer, TCAD'96]
  …

Scheduling: Our Approach
Overall approach
  Current objective: high performance
  Use a system of integer difference constraints to express all kinds of scheduling constraints
  Represent the design objective as a linear function
Example (xi is the control step assigned to operation vi)
  Dependency constraints
    v1 → v3: x3 - x1 ≥ 0
    v2 → v3: x3 - x2 ≥ 0
    v3 → v5: x5 - x3 ≥ 0
    v4 → v5: x5 - x4 ≥ 0
  Frequency constraint
    <v2, v5>: x5 - x2 ≥ 1
  Resource constraint
    <v2, v3>: x3 - x2 ≥ 1
  Platform characterization: adder (+/-) 2 ns; multiplier (*) 5 ns
  Target cycle time: 10 ns
  Resource constraint: only ONE multiplier is available
In matrix form: A x ≥ b, where A is a totally unimodular matrix, which guarantees integral solutions
[Figure: data-flow graph with v1 (+), v2 (*), v3 (*), v4 (+), v5, and the constraint matrix over x1 … x5]
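As a rough illustration of how such a system of difference constraints can be solved, the sketch below encodes the example constraints of this slide, reading each one as x[v] - x[u] >= c and assuming the dependency edges v1→v3, v2→v3, v3→v5, and v4→v5. It uses Bellman-Ford-style longest-path relaxation to find the earliest feasible schedule; this is a stand-in for the linear programming solver the slides describe, and the name `solve_sdc` is hypothetical.

```python
def solve_sdc(num_vars, constraints):
    """Return the earliest non-negative integer schedule, or None if infeasible.

    Each constraint (u, v, c) means x[v] - x[u] >= c.
    """
    x = [0] * num_vars                 # every operation may start at step 0
    for _ in range(num_vars):          # longest-path relaxation passes
        changed = False
        for u, v, c in constraints:
            if x[u] + c > x[v]:        # constraint violated: push v later
                x[v] = x[u] + c
                changed = True
        if not changed:
            return x                   # all constraints hold
    return None                        # still changing: positive cycle


# Indices 0..4 stand for v1..v5 (assumed reading of the slide's example).
constraints = [
    (0, 2, 0), (1, 2, 0), (2, 4, 0), (3, 4, 0),  # dependency constraints
    (1, 4, 1),                                    # frequency constraint <v2, v5>
    (1, 2, 1),                                    # resource constraint <v2, v3>
]
schedule = solve_sdc(5, constraints)
print(schedule)  # [0, 0, 1, 0, 1]
```

Under these assumptions v1, v2, and v4 land in control step 0 and v3 and v5 in step 1. Because every constraint involves only a difference of two variables, integer right-hand sides always yield an integer schedule, which is the property the totally unimodular matrix guarantees for the LP formulation.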

UPS Scheduling: Overall Framework
Inputs
  CDFG
  Target platform modeling (resource library & chip layout)
  User-specified design constraints & assignments
xPilot scheduler
  Constraint equations generation
    Relative timing constraints
    Dependency constraints
    Frequency constraints
    Resource constraints
    …
  Objective function generation
  System of pairwise difference constraints
  Linear programming solver
  LP solution interpretation
Output: STG (State Transition Graph)

UPS vs. SPARK: Results on SPARK’s Benchmarks
Mult (*): 2 cycles; Div (/): 5 cycles; rest: one cycle
Target clock period: 7.5 ns

Benchmark        SPARK                UPS                  UPS / SPARK
                 State#   W. Cycle#   State#   W. Cycle#
MPEG2-dpframe    32       424         35       352         0.83
GIMP-tiler       27       2234                 1877        0.84
ADPCM-decoder    15       327         13       278         0.85
ADPCM-encoder    16       133                  112
Average Ratio

UPS achieves a 16% cycle count reduction over SPARK
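The 16% figure can be cross-checked from the cycle counts on this slide. The short sketch below assumes that, in each row, the larger cycle count belongs to SPARK and the smaller one to UPS; the variable names are illustrative only.

```python
# (SPARK cycles, UPS cycles) per benchmark, as read from the slide.
cycles = {
    "MPEG2-dpframe": (424, 352),
    "GIMP-tiler": (2234, 1877),
    "ADPCM-decoder": (327, 278),
    "ADPCM-encoder": (133, 112),
}

# Per-benchmark cycle ratio UPS / SPARK.
ratios = {name: ups / spark for name, (spark, ups) in cycles.items()}
average = sum(ratios.values()) / len(ratios)

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
print(f"average ratio: {average:.2f}")         # 0.84
print(f"average reduction: {1 - average:.0%}")  # 16%
```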

Platform-Based Interface Synthesis
Focus on sequential communication media (SCM)
  FIFOs (e.g., Xilinx FSLs), buses (e.g., Xilinx CoreConnect, Altera Avalon, etc.)
  Order may have a dramatic impact on performance
  The best order should guarantee that no data transmission on the critical path is delayed by a non-critical transmission
Interface synthesis for SCM
  Consider both behavior and communication to determine the optimal transmission order
DCT example
  Producer:
    for (int i = 0; i < 8; i++) {
      S1: data[i] = …;
    }
  Consumer:
    int s07 = data[0] + data[7];
    int s16 = data[1] + data[6];
    …
[Figure: processes P1 and P2 (PE1, PE2; custom logic 1 and 2) connected by a FIFO carrying data[8]]

SCM Co-Optimization: Problem Formulation
Given:
  A set of processes P connected by a set of channels C
  A set of data D = {d1, d2, …, dm} to be transmitted on each channel cj
Goal:
  Find the optimal transmission order for each process so that the overall latency of the process network is minimized, subject to the given design constraints and platform specifications
  At the same time, generate the drivers and glue logic for each process automatically
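To make the effect of transmission order concrete, here is a minimal sketch under a deliberately simple, assumed cost model: one FIFO delivers one data item per cycle, and each consumer operation starts when its last operand arrives and then needs a fixed number of further cycles (its "tail"). The functions `latency` and `best_order` and all numbers are hypothetical; the search is brute force, which is only practical for tiny examples.

```python
from itertools import permutations


def latency(order, consumers):
    """Overall latency for one transmission order over a single FIFO."""
    # The item sent in cycle t is available to the consumer at time t + 1.
    arrival = {item: t + 1 for t, item in enumerate(order)}
    return max(
        max(arrival[item] for item in operands) + tail
        for operands, tail in consumers
    )


def best_order(items, consumers):
    """Try every order and keep the first one with minimum latency."""
    return min(permutations(items), key=lambda order: latency(order, consumers))


# Two consumer operations: (operands, remaining cycles after the operands arrive).
consumers = [((0, 3), 1), ((1, 2), 4)]
in_order = (0, 1, 2, 3)
best = best_order(in_order, consumers)

print(latency(in_order, consumers))  # 7
print(best, latency(best, consumers))  # (1, 2, 0, 3) 6
```

Sending the operands of the operation with the longer tail first keeps the critical transfer from waiting behind non-critical ones, which is the intuition behind the goal stated above.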

SystemC/C-to-RTL Design Flow
SystemC/C specification
Front-end compiler
SSDM (System-Level Synthesis Data Model); platform description & constraints
xPilot behavioral synthesis
  SSDM/CDFG
  Behavioral synthesis
  SSDM/FSMD
  RTL generation
Outputs: FSM with datapath in VHDL; floorplan and/or multi-cycle path constraints
RTL synthesis
ASICs/FPGAs platform

Preliminary Results of xPilot: Better Complexity Management
Significant code size reduction
  RTL design → behavioral design: 10x code size reduction
[Figure: VHDL code generated by UCLA xPilot targeting the Altera Stratix platform]

Design Exploration for Heterogeneous MPSoC Platforms
Heterogeneous MPSoC exploration
  Processors
    Heterogeneous vs. homogeneous
    General-purpose vs. application-specific
  On-chip communication architecture (OCA)
    Bus (e.g., AMBA, CoreConnect), packet-switching network (e.g., Alpha 21364)
  Memory hierarchy
[Figure: μPs, a DSP, an IP block, and an FPGA running tasks on top of OS and driver layers, each attached through a network interface to a communication network]

Configurable SoC Platforms
General-purpose processor cores + programmable fabric
  Tight integration using extended instructions (ASIPs)
    Example: Altera Nios / Nios II
  Loose integration using FIFOs/buses for communication
    Example: Xilinx MicroBlaze, etc.
[Figures: custom instruction logic for Nios II [source: …]; Xilinx MicroBlaze [source: …]]

ASIP Compilation: Problem Statement
Given:
  CDFG G(V, E)
  The basic instruction set I
  Pattern constraints:
    Number of inputs: |PI(pi)| ≤ Nin
    Number of outputs: |PO(pi)| = 1
    Total area
Objective:
  Generate a pattern library P
  Map G to the extended instruction set I ∪ P so that the total execution time is minimized
Example (*: 2 clock cycles; +: 1 clock cycle)
  Original code:
    t1 = a * b;
    t2 = b * c;
    t3 = d * e;
    t4 = t1 + t2;
    t5 = t2 + t3;
    t6 = t5 + t4;
  With extended instructions ext-inst1 (MAC1: 2 cycles) and ext-inst2 (MAC2: 2 cycles):
    t4 = ext-inst1(a, b, c);
    t5 = ext-inst2(b, c, d, e);
    t6 = t4 + t5;
  Performance speedup = 9 / 5 = 1.8X
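The 9 / 5 speedup in the example follows from adding up per-instruction cycle counts on a single-issue core. The sketch below reproduces that arithmetic and checks that the two extended instructions compute the same value as the original code; the Python function names are illustrative, and the bodies of `ext_inst1` and `ext_inst2` are inferred from the example (t4 = a*b + b*c, t5 = b*c + d*e).

```python
CYCLES = {"mul": 2, "add": 1, "ext-inst1": 2, "ext-inst2": 2}


def total_cycles(instructions):
    """Cycle count on a single-issue core: instructions execute one after another."""
    return sum(CYCLES[name] for name in instructions)


def original(a, b, c, d, e):
    t1 = a * b
    t2 = b * c
    t3 = d * e
    t4 = t1 + t2
    t5 = t2 + t3
    return t5 + t4


def ext_inst1(a, b, c):      # MAC1 pattern: a*b + b*c
    return a * b + b * c


def ext_inst2(b, c, d, e):   # MAC2 pattern: b*c + d*e
    return b * c + d * e


def extended(a, b, c, d, e):
    return ext_inst1(a, b, c) + ext_inst2(b, c, d, e)


before = total_cycles(["mul", "mul", "mul", "add", "add", "add"])
after = total_cycles(["ext-inst1", "ext-inst2", "add"])
print(before, after, before / after)  # 9 5 1.8
print(original(1, 2, 3, 4, 5) == extended(1, 2, 3, 4, 5))  # True
```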

Target Core Processor Model
Classic single-issue pipelined RISC core (fetch / decode / execute / mem / write-back)
The number of input and output operands of an instruction is predetermined
An instruction reads the core register file during the execute stage and commits the result during the write-back stage
[Figure: core processor pipeline (IF/ID, ID/EX, EX/MEM, MEM/WB) with PC, adder, instruction cache, register file (RS1, RS2), ALU (OP1, OP2), memory, and result MUX, plus custom logic]

ASIP Compilation Flow
C code
Front-end compilation
CDFG
1. Pattern generation
   Satisfying input/output constraints (arch constraint)
2. Pattern selection
   Select a subset to maximize the potential speedup while satisfying the resource constraint
   Pattern library
3. Application mapping & graph covering
   Graph covering to minimize the total execution time
   Optimized CDFG
Backend compilation
Optimized assembly

Experimental Results on Altera Nios
Altera Nios is used for ASIP implementation
  5 extended instruction formats
  Up to 2048 instructions for each format
Small DSP applications are taken as benchmarks

Benchmark   Extended        Speedup                Resource Overhead
            Instruction#    (Estimation, Nios)     LE             Memory            DSP Block
fft_br      9               2.65                   408 (6.06%)    65,536 (9.79%)
iir         7               3.18, 3.73             255 (3.79%)    4,736 (0.71%)     40
fir                         2.40, 2.14             51 (0.76%)     1,024 (0.15%)     8
pr                          1.57, 1.75             71 (1.05%)                       14
dir         2               3.28, 3.02             54 (0.80%)                       16
mcm         4               4.75, 3.22             186 (2.76%)    (0.00%)           56
Average                     3.08, 2.75             (2.54%)        (1.77%)           -

Architecture Extension for ASIPs
Data bandwidth problem
  Limited register file bandwidth (two read ports, one write port)
  ~40% of the ideal performance speedup will be lost
Shadow-register-based architectural extension
  Core registers are augmented by an extra set of shadow registers
    Conditionally written during the write-back stage
    Low power/area overhead
  Novel shadow-register binding algorithms are developed
[Figure: core processor pipeline extended with a hashing unit and shadow registers SR1 … SRK feeding the custom logic; k = hash(j)]

Ongoing Work -- Mapping for Heterogeneous Integration with Multiple Processing Cores
Given:
  A library of processing cores P and a communication library C
  Task graph G(V, E)
    For each v in V, the execution time t(v, pi) on pi
    For each (u, v) in E, the communication data size s(u, v)
  Throughput constraint
Problem:
  Select and instantiate the processing elements and communication channels from P and C, respectively
  Map the tasks onto the processing elements and the communications onto the channels so that
    The optimal latency is achieved subject to the throughput constraint
    The implementation cost is minimized
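A heavily simplified sketch of this formulation is shown below. It assumes a linear pipeline of tasks, one processing element instance per task, no communication cost, and a throughput constraint expressed as a maximum stage time; it then enumerates every mapping and keeps the one with the lowest latency, breaking ties by cost. The function `explore`, the core names, and all numbers are hypothetical and are not taken from the xPilot flow.

```python
from itertools import product


def explore(tasks, cores, exec_time, core_cost, max_stage_time):
    """Exhaustively map each task to a core type; return (latency, cost, mapping)."""
    best = None
    for choice in product(cores, repeat=len(tasks)):
        times = [exec_time[task][core] for task, core in zip(tasks, choice)]
        if max(times) > max_stage_time:      # throughput constraint violated
            continue
        candidate = (sum(times), sum(core_cost[core] for core in choice), choice)
        if best is None or candidate < best:  # lower latency first, then lower cost
            best = candidate
    return best


tasks = ["dct", "quant"]
cores = ["cpu", "hw"]
exec_time = {"dct": {"cpu": 10, "hw": 2}, "quant": {"cpu": 3, "hw": 2}}
core_cost = {"cpu": 1, "hw": 4}

best = explore(tasks, cores, exec_time, core_cost, max_stage_time=5)
print(best)  # (4, 8, ('hw', 'hw'))
```

Exhaustive enumeration grows exponentially with the number of tasks, so it only serves to show the shape of the search space.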

Preliminary Results on Motion-JPEG Example
Pipeline: RAW images, Preprocess, DCT, Quant, Huffman, encoded JPEG images (plus Table Modification)
Model #1: 5 MicroBlazes, FSL-based communication
Model #2: 4 MicroBlazes + HW-DCT on FPGA fabric
Platform: Xilinx XUP board

System     Cycle#   Fmax (MHz)   Exe Time (ms)    Area (Slice#)
Model #1   23812    126          0.189            4306
Model #2                         0.117 (-38%)     6345

Conclusions
xPilot has a fairly mature and advanced behavior synthesis capability, from C or SystemC to RTL code with the necessary design constraints
xPilot advantages include
  Platform-based behavior and system synthesis
  Communication/interconnect-centric approach
  Advanced algorithms for platform-based, communication-centric optimization
  Promising results demonstrated on available FPGAs
xPilot system synthesis capabilities
  Performance simulation of multi-processor systems
  Exploration of the efficient use of (multiple) on-chip processors
  Compilation and optimization for reconfigurable processors

Acknowledgements
We would like to thank the following for their support:
  Gigascale Systems Research Center (GSRC)
  National Science Foundation (NSF)
  Semiconductor Research Corporation (SRC)
  Industrial sponsors under the California MICRO programs (Altera, Xilinx)
Team members: Yiping Fan, Guoling Han, Wei Jiang, Zhiru Zhang

EDA: Electronic Design Automation
Idea/concept (high-level specification)
Compilation & synthesis
Example: high-end automotive A/V application
  10M gates
  70 clocks (320 MHz)
  Technology: 0.13u, 6LM
Courtesy of Magma Design Automation

EDA Is A Key Enabling Technology for Semiconductor Industry
Computer-aided design (CAD) of very large-scale integrated (VLSI) circuits

Field-Programmable SOC Example: Altera Stratix II FPGA
Nios II/f soft-core processor: 185 MHz, 218 max DMIPS, < 900 ALMs (< 1800 LEs)
90 nm Stratix II 2S60
Software-defined radio (SDR) baseband data path reconfiguration
[Figure: Avalon™ bus with Nios II processors and IP blocks]
Courtesy Altera

Electronic System-Level (ESL) Design Automation
Modeling
  SystemC (open source)
  SystemVerilog
Simulation and verification
  Behavior-level simulation & verification
  System-level simulation & verification
  SystemC provides behavior-level and system-level synthesis capabilities for free and is rapidly gaining popularity
Synthesis
  Behavior-level synthesis: from behavior specification (e.g., C, SystemC, or Matlab) to RTL or netlists
  System-level synthesis: from system specification to system implementation

ESL Tools – A Lot of Interest …

Communication- and Interconnect-Centric Synthesis: Example: Use of Distributed Register-File Architectures
Distributed register-file micro-architecture
  Efficiently uses on-chip embedded memories
  Fully explores operation and data-transfer parallelism
Example: a scheduled DFG with register binding indicated on each variable (assuming a one-functional-unit constraint)
  Binding using discrete registers
  Binding using a register file: more efficient design!
[Figure: islands A, B, and C; each island contains input buffers, data-routing logic, a local register file, MUXes, and a functional unit pool (FUP) with MUL, ALU, ALU']

Distributed Register-File Microarchitecture
[Figure: FP-SoC with on-chip memory blocks and islands A, B, C; each island contains input buffers, data-routing logic, a local register file, MUXes, and a functional unit pool (FUP) with MUL, ALU, ALU']

On-chip RAM resources on Virtex II
Xilinx XC-2V      2000   3000   4000   6000    8000
#18Kb BRAM        56     96     120    144     168
Dist. RAM (Kb)    336    448    720    1,056   1,456

Resource Binding for DRF-Microarchitecture
Facts under simplified assumptions
  Operations bound onto an island form a chain in the given scheduled DFG
  Inter-chain data transfers may share a physical inter-island connection
  The number of inter-island connections (IIC) is crucial to the QoR of a DRFM instance
Example: scheduled DFG with intra-island and inter-island transfers
  C-step   Operations
  1        v1, v6
  2        v2, v7
  3        v3, v9
  4        v4, v5, v8, v10
  Islands (chains): A, B, C, D
  Inter-island connections = 5
    (A,B) = (A,D) = 1
    (A,C) = 1, two data transfers share one connection
    (C,D) = 2

DRFM Binding Solution
Overview:
  Proceed in a step-by-step fashion
  Use weighted bipartite matching to solve each step optimally
Example: c-steps 1 and 2 are handled; for c-step 3:
  Construct a weighted bipartite graph (operations v3, v9 against islands A, B, C, D)
    Edge weight = number of newly introduced inter-island connections (IIC)
    Min-weight matching → optimal binding in this step
  Solution of this step:
    Matching: v3 → Island A; v9 → Island C
    Number of newly introduced IICs = 0
Final inter-island connections = 4
[Figure: weighted bipartite graph for c-step 3 next to the scheduled DFG with islands (chains) A, B, C, D]
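One binding step of this kind can be sketched as a small assignment problem. The code below is a minimal illustration, not the xPilot implementation: it assumes directed inter-island connections stored as (source island, destination island) pairs, uses made-up producer/consumer edges (the slide does not list them), and finds the min-weight matching by brute force over permutations. For realistic sizes a polynomial assignment solver such as SciPy's `scipy.optimize.linear_sum_assignment` would be the usual choice.

```python
from itertools import permutations


def new_connections(op, island, producers, bound, existing):
    """Inter-island connections that binding `op` to `island` would add."""
    needed = {(bound[p], island) for p in producers[op] if bound[p] != island}
    return len(needed - existing)


def bind_step(ops, islands, producers, bound, existing):
    """Min-weight matching of this c-step's operations to distinct islands."""
    best_cost, best_binding = None, None
    for chosen in permutations(islands, len(ops)):
        cost = sum(
            new_connections(op, island, producers, bound, existing)
            for op, island in zip(ops, chosen)
        )
        if best_cost is None or cost < best_cost:
            best_cost, best_binding = cost, dict(zip(ops, chosen))
    return best_cost, best_binding


# Hypothetical state after c-steps 1 and 2.
bound = {"v1": "A", "v6": "C", "v2": "A", "v7": "C"}   # operation -> island
producers = {"v3": ["v2"], "v9": ["v7", "v1"]}          # assumed data dependencies
existing = {("A", "C")}                                 # connections already paid for

cost, binding = bind_step(["v3", "v9"], ["A", "B", "C", "D"], producers, bound, existing)
print(cost, binding)  # 0 {'v3': 'A', 'v9': 'C'}
```

With these assumed edges the step reproduces the outcome stated on the slide: v3 goes to island A, v9 to island C, and no new inter-island connection is introduced.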

DRF Experimental Results: Three Experimental Flows for Comparison
xPilot behavioral synthesis system
  xPilot frontend
  SSDM/CDFG
  Scheduling algorithms
  Scheduled CDFG (STG)
  Binding, one of:
    1) Binding on discrete-register microarchitecture
    2) Baseline (random) DRF binding
    3) DRF binding for minimizing inter-island connections
  RTL generation
Target: Xilinx Virtex II

DRF Experimental Results
Xilinx ISE 7.1; Virtex II; target clock period: 8 ns
The baseline DRF binding results achieve a 46.70% slice reduction over the discrete-register approach
Optimized DRF binding reduces slices by a further 12.21%
Overall, more than 2X logic slice reduction with a better clock period (7.8%)
[Charts: area (slices; DRF solutions use on-chip RAM blocks); clock period (ns)]
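The "more than 2X" claim can be checked from the two percentages. The slide does not say whether the further 12.21% is measured relative to the baseline DRF result or to the discrete-register design, so the sketch below, with illustrative variable names, evaluates both readings.

```python
baseline_reduction = 0.4670   # baseline DRF vs. discrete registers
further_reduction = 0.1221    # optimized DRF, "further" reduction

# Reading 1: the further reduction is relative to the baseline DRF result.
remaining_relative = (1 - baseline_reduction) * (1 - further_reduction)
# Reading 2: both reductions are relative to the discrete-register design.
remaining_absolute = 1 - baseline_reduction - further_reduction

print(round(1 / remaining_relative, 2))  # 2.14
print(round(1 / remaining_absolute, 2))  # 2.43
```

Either way the remaining slice count is below half of the discrete-register design, which agrees with the stated overall reduction.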

Preliminary Result of xPilot: Better QoR (Comparison with UCI/UCSD SPARK)
Device setting: Xilinx Virtex-II Pro (xc2v…)
Target frequency: 200 MHz

            SPARK                                     xPilot                                    Delay Ratio
Designs     Slice   LUT    FF    DSP   Fmax (MHz)     Slice   LUT    FF     DSP   Fmax (MHz)    xPilot/SPARK
PR          588     981    247         92.85          331     416    564    16    146.84        1.58
WANG        660     1157   265         109.29         357     464           15    133.51        1.22
LEE         574     996    220         109.17         356     484    659    19    131.93        1.21
MCM         1062    1857   479         99.40          887     1207   1282   30    110.38        1.11
DIR         1323    2256   494   3     79.30          979     1002   1732   56    98.81         1.25
Ave Ratio   1                          1.00           0.66    0.48   2.74   n/a   1.27

Proposed SCM Co-Optimization Design Flow
Inputs: process network; platform description & constraints
Front end
System-Level Synthesis Data Model
SCOOP (SCM Co-Optimization)
  Communication order detection
  Code transformation and interface generation
  Indices compression for loop reordering
Outputs: process behavior; drivers + glue logic

Communication Order Detection
Step 1. Construct a global CDFG by merging the individual CDFGs of each process
Step 2. Solve a resource-constrained min-latency scheduling problem to optimize the total latency of the global CDFG
[Figure: Process 1 and Process 2 exchanging transfers T1, T2, T3 (Ti: FIFO); two transmission orders compared, latency = 5 cycles vs. latency = 7 cycles]

Loop Indices Compression
Given the optimal order, we try to generate restructured loops for code compression
  I.e., given the original iteration order and the reordered iteration order, find the minimum number of linear intervals to represent the new iteration space
Example
  Original order: (0,0), (0,1), (1,0), (1,1)
  After reordering: (0,0), (1,0), (0,1), (1,1)
  Need to solve the linear system
  Solution: i' = j, j' = i
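The example solution i' = j, j' = i amounts to a loop interchange. The small sketch below, with hypothetical function names, generates both iteration orders for an n x n space and confirms that the restructured loop nest visits the points in the reordered sequence given above.

```python
def original_order(n):
    """Iteration points (i, j) visited by the original loop nest."""
    return [(i, j) for i in range(n) for j in range(n)]


def reordered(n):
    """Restructured loop nest using i' = j (outer loop) and j' = i (inner loop)."""
    points = []
    for i_new in range(n):          # i' = j
        for j_new in range(n):      # j' = i
            i, j = j_new, i_new     # recover the original indices
            points.append((i, j))
    return points


print(original_order(2))  # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(reordered(2))       # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

A single pair of loops reproduces the whole reordered sequence, so no per-element index table is needed for this case.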

Initial Results of Interface Synthesis
Target: sequential communication channels, in particular FSL in Virtex-II
Consider two communicating processes

            Total latency (Cycle#)            RAs Compress
Designs     Trad.    SCOOP    Reduction       Before    After
DCT1        325      290      10.77%
Haar        142      134      5.63%
DWT         689      617      10.45%
Mat_mul     408      339      16.91%          96        20
DCT2        483      419      13.25%          80        64
Masking     620      420      32.26%          192
Dot         1903     1084     43.04%          300

An average of 26% improvement in total latency can be achieved.

MPEG-4 Simple Profile Decoder: Architecture Profiling
C specification overview
Module Name          Orig. C Source File      Orig. C line #
Copy Controller      copyControl.c            287
Display Controller   displayControl.c         358
Motion Comp.         Motion-Compensation.c    312
Parser/VLD           parser.c                 1092
                     texture_vld.c            508
Texture/IDCT         texture_idct.c           1901
Texture Update       textureUpdate.c          220

Runtime profiling (PowerPC/XUP board)
Parser/VLD         59.0%
Texture/IDCT       18.1%
Motion Comp.       15.7%
Copy Controller    3.6%

MPEG-4 Simple Profile Decoder: Hybrid HW/SW Implementation
HW block integrated with the PowerPC single-process design: 15% speed improvement
Software blocks running on the PowerPC

MPEG-4 Simple Profile Decoder: Alternate Implementations
Implementation                        Throughput (frames per second)   Improvement
Single uBlaze                         0.59                             -
7-uBlaze                              1.18                             +209%
Single PowerPC                        3.06                             +68.4%
Single PowerPC w/ HW Motion Comp.     3.53                             +15.3%

xPilot synthesis report of HW blocks
                 Line counts
Module           C      RTL SystemC   RTL VHDL   Slices (FFs, LUTs)    MUL   Clock period (ns)   Latency (cycles)
Motion Comp.     210    9903          5655       986 (1111, 1017)      2     7.97                505
Block IDCT       200    9534          2731       1877 (2376, 2438)     26    7.963               280
Texture Update   160    8227          4475       1551 (1696, 1931)     4     7.913               335

Advantages of Our Scheduling Algorithm
A highly versatile scheduling engine (UPS)
Supports a wide spectrum of applications with high complexity
  Data-intensive, control-intensive, memory-intensive, mixed, etc.
Honors a rich set of design constraints
  Resource constraints, relative timing constraints, frequency constraints, latency constraints, etc.
Offers a variety of optimization techniques
  Operation chaining, pipelined multi-cycle operation, awareness of repetitions, behavioral templates, speculation, functional/loop pipelining, multi-cycle communication
Accounts for physical reality
  Optimizes communications simultaneously with computations

Preliminary Results of xPilot: Rapid System Exploration
Quick evaluation of various amounts of process-level concurrency and different hardware/software boundaries
Example: Motion-JPEG implementation
  All-HW implementation
  All-SW implementation (using embedded processors)
  SW/HW co-design: optimal partitioning?
Repeated manual RTL coding is not a solution!

Preliminary Results of xPilot Shorter Simulation/Verification Cycle
From other projects: Simulation speed on behavior model 100X faster than RTL-based method [NEC, ASPDAC04] Our experience: Motion-compensation module in a Mpeg4-decoder Behavior level (in C language) simulation Less than 1 second per frame RTL SystemC simulation About 310 second per frame

Ongoing Work: Design Exploration for MPSoCs
A scalable architecture simulation infrastructure for architecture evaluation & performance/power estimation
  Need for structural abstraction of processors and interconnects
    Recent work such as Liberty is an effort in this direction
    Complete structural abstraction makes the simulation very slow
      Liberty is about 10X slower than SimpleScalar on an Itanium model
  Hybrid approach
    Trade-off between accuracy and simulation time
    Model the interconnect accurately using SystemC (for accuracy)
    Model the cores using SimpleScalar (for simulation speed)
Communication network synthesis
  Automatic interface synthesis is required
  Physical planning is needed for interconnect latency/power estimation
