xPilot: A Platform-Based Behavioral Synthesis System. Prof. Jason Cong. Students: Deming Chen, Yiping Fan, Guoling Han, Wei Jiang, Zhiru Zhang. August 2005.

1 xPilot: A Platform-Based Behavioral Synthesis System
Prof. Jason Cong
Students: Deming Chen, Yiping Fan, Guoling Han, Wei Jiang, Zhiru Zhang
August 2005
Supported by NSF, GSRC, Altera, Xilinx.

2 Outline
- Motivation
- xPilot system framework
- Overview of the synthesis engine
  - Scheduling
  - Resource binding
- Experimental results

3 Motivation (1)
- Design complexity is outgrowing the traditional RTL method
  - It is feasible to build an SoC device with 500M transistors; billion-transistor chips are on the horizon
  - Behavioral synthesis is a critical technology for enabling the move to a higher level of abstraction
  - Reasons for previous failures:
    - Lack of a compelling reason: design complexity was still manageable a decade ago
    - Lack of a solid RTL foundation
    - Lack of consideration of physical reality

4 Motivation (2)
- Behavioral synthesis provides combined advantages
  - Better complexity management
    - Code size: RTL design ~300KL → behavioral design 40KL [NEC, ASPDAC04]
  - Shorter verification/simulation cycle
    - Simulation speed 100X faster than the RTL-based method
  - Rapid system exploration
    - Quick evaluation of different hardware/software boundaries
    - Fast exploration of multiple micro-architecture alternatives
  - Higher quality of results
    - Full consideration of physical reality

5 xPilot: Platform-Based Behavioral-to-RTL Synthesis Flow
[Flow diagram: behavioral spec. in C/SystemC and platform description → frontend compiler → SSDM → presynthesis and core synthesis optimizations → arch-generation & RTL/constraints generation → RTL → FPGAs/ASICs]
- Presynthesis optimizations
  - Loop unrolling/shifting
  - Strength reduction / tree height reduction
  - Bitwidth analysis
  - Memory analysis …
- Core synthesis optimizations
  - Scheduling
  - Resource binding, e.g., functional unit binding, register/port binding
- Arch-generation & RTL/constraints generation
  - Verilog/VHDL/SystemC
  - FPGAs: Altera, Xilinx
  - ASICs: Magma, Synopsys, …

6 System-level Synthesis Data Model
- SSDM (System-level Synthesis Data Model)
  - Hierarchical netlist of concurrent processes and communication channels
  - Each leaf process contains a sequential program, represented by an extended LLVM IR with hardware-specific semantics
    - Port / IO interfaces, bit-vector manipulations, cycle-level notations

7 Platform Modeling & Characterization
- Target platform specification
  - High-level resource library with delay/latency/area/power curves for various input/bitwidth configurations
    - Functional units: adders, ALUs, multipliers, comparators, etc.
    - Connectors: mux, demux, etc.
    - Memories: registers, synchronous memories, etc.
  - Chip layout description
    - On-chip resource distributions
    - On-chip interconnect delay/power estimation

8 Scheduling: Goals
- A highly versatile scheduling engine
  - Applicable to a wide range of application domains
    - Computation-intensive, data/memory-intensive, control-intensive, etc.
    - Mixed behavioral & RTL
  - Amenable to a rich set of scheduling constraints
    - Data dependency constraints
    - Resource constraints: IO port constraints, memory port constraints, functional unit constraints, etc.
    - Timing constraints: frequency constraints, latency constraints, etc.
    - Relative IO timing constraints: cycle-fixed mode, superstate-fixed mode, free-floating mode, etc.
  - Retargetable to a variety of design objectives
    - High performance, small area, low power, etc.

9 Scheduling: Optimization Capabilities
- Offers a variety of optimization techniques in a unified framework
  - Combinational/sequential, non-pipelined/pipelined multi-cycle operations
  - Unconditional/conditional operation chaining
  - Relative scheduling
  - Consideration of branching probabilities and repetitions
  - Multi-cycle communication (under development)
  - Code motion & speculation (under development)
  - Functional / loop pipelining (under development)
  - Physical layout integration (to be supported)

10 Scheduling: Current Status
- Design objective
  - Focus on high-performance designs
- Overall approach
  - Use a system of pairwise difference constraints to express all kinds of scheduling constraints
  - Represent the design objective as a linear function
  - The system is immediately solvable via any linear programming solver, with integral solutions
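As a rough sketch of what such a formulation looks like in general (the symbols c_v and b_uv are assumptions introduced here, not notation from the slides), a linear objective over a system of pairwise difference constraints can be written as:

```latex
\begin{aligned}
\text{minimize}\quad   & \sum_{v \in V} c_v \, s(v) \\
\text{subject to}\quad & s(u) - s(v) \le b_{uv} \quad \text{for each constrained pair } (u, v), \\
                       & s(v) \in \mathbb{Z} \quad \text{for all } v \in V .
\end{aligned}
```

Here s(v) is the schedule variable of operation v, c_v is a hypothetical objective weight, and b_uv is a hypothetical integer bound. Each constraint involves exactly two schedule variables with coefficients +1 and -1, which is what makes the system "pairwise".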

11 Scheduling: Design Framework
[Flow diagram of the xPilot scheduler]
- Inputs
  - CDFG
  - User-specified design constraints & assignments
  - Target platform modeling (resource library & chip layout)
- xPilot scheduler
  - Constraint equations generation: relative timing constraints, dependency constraints, frequency constraints, resource constraints, …
  - Objective function generation
  - System of pairwise difference constraints
  - Linear programming solver
  - LP solution interpretation
- Output
  - STG (State Transition Graph)

12 Example: Greatest Common Divisor
- GCD C description:

    x = inport1;
    y = inport2;
    while (x != y) {
        if (x > y) x = x - y;
        else y = y - x;
    }
    *outport = x;

- SSA-form control-flow graph (basic blocks BB1 to BB5; branch edges labeled T):

    BB1: x_0 = inport1; y_0 = inport2; cond1 = (x_0 != y_0);
    BB2: x_1 = φ(x_0, x_1, x_2); y_1 = φ(y_0, y_1, y_2); cond2 = (x_1 > y_1);
    BB3: x_2 = x_1 - y_1; cond3 = (x_2 != y_1);
    BB4: y_2 = y_1 - x_1; cond4 = (x_1 != y_2);
    BB5: x_3 = φ(x_0, x_1, x_2); *outport = x_3;
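For readers who want to run the algorithm, here is a minimal Python mirror of the subtraction-based GCD loop above. The function name `gcd_by_subtraction` and the input check are assumptions added for illustration; the slide's C code has no such check, and its loop would not terminate if either input were zero or negative.

```python
import math


def gcd_by_subtraction(x, y):
    """Mirror of the slide's loop: repeatedly subtract the smaller value from the larger."""
    # Assumption added here: the loop only terminates for positive integers.
    if x <= 0 or y <= 0:
        raise ValueError("inputs must be positive integers")
    while x != y:
        if x > y:
            x = x - y
        else:
            y = y - x
    return x


if __name__ == "__main__":
    # Cross-check against the standard library on a small range of inputs.
    for a in range(1, 30):
        for b in range(1, 30):
            assert gcd_by_subtraction(a, b) == math.gcd(a, b)
    print(gcd_by_subtraction(12, 18))  # prints 6
```

The number of loop iterations depends on the data, which is why the synthesized hardware needs a state machine and not a fixed-length pipeline.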

13 Constraints Generation
- Data dependency constraint
  - Operation v is data dependent on operation u, i.e., $(u, v) \in E$:
    $s(v) - s(u) \ge 0$
    where the schedule variable $s(v)$ represents the relative schedule of node v
  - Example: u: x_1 = φ(x_0, x_1, x_2); v: cond2 = (x_1 > y_1);
  - Other constraints can be represented in a similar way …
- The constraint equations form a system of pairwise difference constraints
  - The constraint matrix A is totally unimodular
  - The feasibility check can be formulated as a single-source shortest path problem
  - Optimizations can be performed via any LP solver; the dual problem is equivalent to a min-cost network flow problem
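The shortest-path feasibility check mentioned above can be sketched with the textbook Bellman-Ford method for difference constraints. This is a generic illustration, not xPilot's implementation: the function name, the `(i, j, bound)` encoding of a constraint x[j] - x[i] <= bound, and the three-operation example are all assumptions made here.

```python
def solve_difference_constraints(num_vars, constraints):
    """Return integers x with x[j] - x[i] <= bound for every (i, j, bound), or None.

    Each constraint is an edge i -> j of weight `bound` in the constraint graph.
    A virtual source with 0-weight edges to every variable is modeled by
    starting all distances at 0.
    """
    if num_vars == 0:
        return []
    dist = [0] * num_vars
    for _ in range(num_vars):
        changed = False
        for i, j, bound in constraints:
            if dist[i] + bound < dist[j]:
                dist[j] = dist[i] + bound
                changed = True
        if not changed:
            # Shifting every value by the same amount keeps all differences intact.
            lowest = min(dist)
            return [d - lowest for d in dist]
    return None  # still relaxing after num_vars passes: negative cycle, infeasible


# Hypothetical operations a, b, c with schedule variables s[A], s[B], s[C].
A, B, C = 0, 1, 2
feasible = [
    (B, A, -1),  # s[A] - s[B] <= -1, i.e. b starts at least one step after a
    (C, B, 0),   # s[B] - s[C] <= 0,  i.e. c may be chained in the same step as b
    (A, C, 1),   # s[C] - s[A] <= 1,  i.e. c finishes within one step of a
]
infeasible = feasible[:2] + [(A, C, 0)]  # now c must share a's step: contradiction

print(solve_difference_constraints(3, feasible))    # prints [0, 1, 1]
print(solve_difference_constraints(3, infeasible))  # prints None
```

The result is one feasible schedule, not necessarily the one an LP solver would pick for a given objective.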

14 Solution by LP Solver
- Scheduling is performed across basic block boundaries
[Figure: the SSA-form GCD control-flow graph (BB1 to BB5) annotated with the LP solution, labels 0 and 1]

15 Schedule Interpretation
- The scheduled operations are interpreted as guarded code:

    x_0 = inport1;
    y_0 = inport2;
    cond1 = (x_0 != y_0);

    if (cond1) {
        x_1 = φ(x_0, x_1, x_2);
        y_1 = φ(y_0, y_1, y_2);
        cond2 = (x_1 > y_1);
        if (cond2) {
            x_2 = x_1 - y_1;
            cond3 = (x_2 != y_1);
        } else {
            y_2 = y_1 - x_1;
            cond4 = (x_1 != y_2);
        }
    }
    if (!cond1 || !cond3 && !cond4) {
        x_3 = φ(x_0, x_1, x_2);
        *outport = x_3;
    }

16 Deriving State Transition Graph
- Final STG for GCD (transition label: cond3 || cond4):

    x_0 = inport1;
    y_0 = inport2;
    cond1 = (x_0 != y_0);

    if (cond1) {
        x_1 = φ(x_0, x_1, x_2);
        y_1 = φ(y_0, y_1, y_2);
        cond2 = (x_1 > y_1);
        if (cond2) {
            x_2 = x_1 - y_1;
            cond3 = (x_2 != y_1);
        } else {
            y_2 = y_1 - x_1;
            cond4 = (x_1 != y_2);
        }
    }
    if (!cond1 || !cond3 && !cond4) {
        x_3 = φ(x_0, x_1, x_2);
        *outport = x_3;
    }
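One way to read this STG is as a two-state machine: a first state that samples the inputs and computes cond1, and a second state that repeats while cond3 || cond4 holds. That two-state reading is an assumption, since the slide's figure is not part of the transcript. The Python sketch below simulates it and counts cycles; `run_gcd_stg` is a hypothetical name.

```python
def run_gcd_stg(inport1, inport2):
    """Simulate an assumed two-state STG for GCD; return (result, cycles)."""
    # Assumption added here: positive inputs, otherwise the machine never exits.
    if inport1 <= 0 or inport2 <= 0:
        raise ValueError("inputs must be positive integers")

    # First state: sample the inputs and evaluate cond1.
    x, y = inport1, inport2
    cond1 = x != y
    cycles = 1

    # Second state: one pass per cycle; the phi nodes become the x and y registers.
    while True:
        cycles += 1
        cond3 = cond4 = False
        if cond1:
            cond2 = x > y
            if cond2:
                x = x - y
                cond3 = x != y
            else:
                y = y - x
                cond4 = x != y
        if not cond1 or (not cond3 and not cond4):
            return x, cycles  # *outport = x_3
        # Otherwise cond3 || cond4 holds and the machine stays in this state.


print(run_gcd_stg(12, 18))  # prints (6, 3)
print(run_gcd_stg(7, 7))    # prints (7, 2)
```

Under this reading each subtraction takes one cycle, and the result is written in the same cycle as the last subtraction.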

17 Unified Resource Binding
- Provides a unified resource-sharing framework to optimize for various design objectives
  - Simultaneous functional unit binding, register binding, and port binding
  - Equipped with advanced techniques to optimize the interconnect and steering logic networks
  - Guided by a flexible cost evaluation engine to achieve different objectives, e.g., performance, area, power, etc.
  - Extendable to exploit physical layout information

18 An FU/Register Binding Example
- Observations:
  - Binding has a large impact on the resulting performance and cost
  - Functional unit binding and register binding are highly correlated
Note: assume all operations and variables are compatible for sharing

19 Drawbacks of Previous Work
- Many existing algorithms focus on minimizing the "number" of functional units or registers
  - Technology advances: the interconnect effect is increasing
    - 51% of the total dynamic power of a microprocessor in 0.13um technology
    - Up to 80% of the dynamic power in future technologies
  - May generate a larger number of multiplexers and interconnects
  - Unfavorable performance and cost results
- Optimization for unrealistic goals
  - Minimizing the "number" of FUs, registers, or multiplexers
    - Should have detailed datapath models to guide the optimization
  - No technology-specific consideration
    - Should have platform-specific characterizations

20 Resource Binding in xPilot
[Flow diagram of xPilot architecture exploration]
- Inputs
  - STG (State Transition Graph)
  - User-specified design constraints
  - Target platform (resource library & chip layout)
- xPilot architecture exploration
  - Baseline register binding
  - Iteration: FU allocation/binding and register allocation/binding, repeated while the result improves ("Improved?" Yes/No)
  - Datapath model for performance-cost estimation
- Output
  - STG + best datapath models

21 Design Space Exploration
[Figure: a State Transition Graph (STG) with multiplications 1* to 5* and addition 6+ under conditions C1/C1' and C2/C2'; compatibility graphs; MUL datapath for solution (1, 2, 4) (3); datapath model curve (power vs. delay) for design space pruning, with pruned points marked]
- Exploration phases:
  - Exploring node 2:
    - (1) (2): two mul
    - (1, 2): one mul
  - Exploring node 3:
    - (1) (2) (3): three mul
    - (1, 2) (3): two mul
    - (1, 3) (2): two mul
  - Exploring node 4:
    - (1) (2) (3) (4)
    - (1, 2, 4) (3)
    - (1, 2) (3, 4)
    - (1, 2) (3) (4)
    - (1, 3, 4) (2)
    - (1, 3) (2, 4)
    - (1, 3) (2) (4)
    - …
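The node-by-node exploration above can be sketched as incremental enumeration: each new operation either joins an existing functional unit it is compatible with or gets a unit of its own. The Python below is an illustration only. The name `enumerate_bindings` and the conflict list are assumptions; the conflict between operations 2 and 3 is inferred from the fact that the slide never groups them together. The cost-based pruning shown on the slide is omitted.

```python
def enumerate_bindings(num_ops, conflicts):
    """List every grouping of operations 1..num_ops onto shared functional units.

    `conflicts` holds pairs of operations that cannot share a unit.
    Each binding is a list of groups; each group is a list of operation ids.
    """
    conflict_set = {frozenset(pair) for pair in conflicts}
    solutions = [[]]  # one empty binding before any operation is explored
    for op in range(1, num_ops + 1):
        extended = []
        for binding in solutions:
            # Option 1: share an existing unit with compatible operations.
            for idx, group in enumerate(binding):
                if all(frozenset((op, other)) not in conflict_set for other in group):
                    extended.append(
                        [g + [op] if k == idx else list(g) for k, g in enumerate(binding)]
                    )
            # Option 2: allocate a new unit for this operation.
            extended.append([list(g) for g in binding] + [[op]])
        solutions = extended
    return solutions


# Exploring up to node 3, assuming operations 2 and 3 cannot share a multiplier.
for binding in enumerate_bindings(3, [(2, 3)]):
    print(binding, "->", len(binding), "mul")
```

With these assumptions the three bindings for node 3 match the slide: (1, 2) (3), (1, 3) (2), and (1) (2) (3). Without pruning the number of candidates grows quickly, which is the motivation for discarding dominated points on the power/delay curve.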

22 Experimental Results: Benchmark Suite
- Benchmark suite
  - PR, MCM
    - DSP kernels: pure additions/subtractions and multiplications
  - CACHE
    - Cache controller: control-intensive design with cycle-accurate I/O operations
  - MOTION
    - Motion compensation algorithm for MPEG-1 decoder: control-intensive with a modest amount of computation
  - IDCT
    - JPEG inverse discrete cosine transform: computation-intensive
  - DWT
    - JPEG2000 discrete wavelet transform: computation-intensive with modest control flow
  - EDGELOOP
    - Extracted from an H.264 decoder: a very complex design that features a mix of computation, control, and memory accesses

23 Experimental Results: Code Size Reduction

24 Experimental Results: Comparison with SPARK on Scheduling
- SPARK [UCI/UCSD, 2004] is a state-of-the-art academic high-level synthesis tool
[Table: designs MOTION, PR, IDCT, CACHE; tool/flow rows spark, xpilot, xpilot-mem; synthesis report columns state#, reg#; Altera Quartus II report columns fmax (MHz), LE, register, mem, dsp. For CACHE, spark is listed as "Memory unsupported". Numeric entries not recoverable.]

25 Experimental Results: Comparison with SPARK on Binding
- On average, xPilot resource binding achieves designs with similar area and 2.48x higher frequency than SPARK
[Table: designs PR, WANG, LEE, MCM, DIR, FEIG, plus Total and Ave Ratio rows; for each of SPARK and xPilot, resource usage columns LE, COMB, Lonely-Reg, Comb-Reg, DSP and Fmax (MHz); final column Fmax ratio xPilot/SPARK. Ave Ratio row: n/a*, 2.96, n/a*, 2.48. Other numeric entries not recoverable.]

26 Synthesis Results for DWT (JPEG2000)
- Settings
  - Target platform: Altera Stratix
  - RTL synthesis & place-and-route: Altera Quartus II v5.0
  - Simulation: Mentor ModelSim SE 6.0
- Design alternatives
[Table: columns Target cycle time, State#, fmax (MHz), Cycle#, Latency (ns), LE#, DSP#; three rows, the first for a 9 ns target cycle time. Numeric entries not recoverable.]

27 Experimental Results: ASIC Flow
- Magma RTL-to-GDSII flow
- Technology library: Cadence Generic Standard Cell Library, 0.18um
- Tradeoff study:
  - 1st column: delay constraint enforced in xPilot
  - 2nd column: control step count of the xPilot-generated RTL
  - 3rd-5th columns: data reported after mapping by the Magma tool
[Table for DIR: columns delay constraint, State#, Cell count, Area (u2), Delay (ps), Fmax (MHz), Latency (ps); five rows, the first for a 5 ns constraint. Numeric entries not recoverable.]

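28 Experimental Results: ASIC Flow (cont.)
[Table for LEE: columns delay constraint, State#, Cell count, Area (u2), Delay (ps), Fmax (MHz), Latency (ps); five rows, the first for a 5 ns constraint. Numeric entries not recoverable.]
[Table for Motion: columns delay constraint, State#, Area (u2), Delay (ps), Fmax (MHz), Latency (ps); four rows, the first for a 10 ns constraint. Numeric entries not recoverable.]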