Platform-Based Behavior-Level and System-Level Synthesis

Presentation transcript:

Platform-Based Behavior-Level and System-Level Synthesis Prof. Jason Cong cong@cs.ucla.edu UCLA Computer Science Department

Outline
  Motivation
  xPilot system framework
  Behavior-level synthesis in xPilot
    Advantages of behavioral synthesis
    Scheduling
    Resource binding
  System-level synthesis in xPilot
    Synthesis for ASIP platforms
    Design exploration for heterogeneous MPSoCs
  Conclusions

SOC Example: Philips Nexperia
Philips Nexperia SoC platform for high-end digital video (courtesy Philips)
  MIPS CPU (PRxxxx): general-purpose scalable RISC processor, 50 to 300+ MHz, 32-bit or 64-bit
  TriMedia CPU (TM-xxxx): scalable VLIW media processor, 100 to 300+ MHz, 32-bit or 64-bit
  Library of device IP blocks: image coprocessors, DSPs, UART, 1394, USB, ...
  Nexperia system buses: 32-128 bit
  Points to note: the general processor core, light-weight micro-engines, acceleration logic
[Figure: DVP system silicon block diagram with the MIPS CPU and TriMedia CPU (each with I$ and D$), device IP blocks, access control, PI bus, DVP memory bus, MMI, and SDRAM.]

Field-Programmable SOC Example: Xilinx Virtex-4 FPGA
  PowerPC 405 (PPC405) core: 450 MHz, 700+ DMIPS RISC core (32-bit Harvard architecture)
  MicroBlaze soft-core processor: 180 MHz, 166 DMIPS, < ~1300 LUTs
  IBM CoreConnect bus connecting the processors and IP blocks
  Example IP: H.264/AVC hardware blocks
Courtesy Xilinx

IC Design Steps
Flow: System-Level Specification → Behavior-Level Description → RT-Level Description → (synthesis) Generic Logic Description → (technology mapping) Gate/Circuit Design → (physical design) Placed & Routed Design → Fabrication → Packaging [©Sherwani]
Example generic logic description: X = (AB*CD) + (A+D) + (A(B+C)); Y = (A(B+C) + AC + D + A(BC+D))
Notes:
  These steps are not etched in stone: there are many variations.
  The logic description is usually generated by Computer-Aided Design (CAD) tools.
  The gate-level design (also known as the "netlist") describes the design in the atomic entities of the technology. For a CMOS design, transistors are used. In an FPGA design, look-up tables (LUTs) are used.
  The process of converting a logic description to a gate-level design is called technology mapping.
  Many optimizations are involved after technology mapping.
  We might go back and forth between these steps (e.g., after the gate-level description, we might simulate and find bugs, then go back to the RTL or high-level description and fix them).
  Testing/verification is not shown.
  This course helps you understand the methods and algorithms used for automatic high-level synthesis and physical design. You will develop small CAD tools that do these steps automatically.

xPilot: Platform-Based Synthesis System SystemC/C Platform Description & Constraints xPilot xPilot Front End Profiling SSDM (System-Level Synthesis Data Model) Analysis Mapping Processor & Architecture Synthesis Interface Synthesis Behavioral Synthesis Custom Logic Processor Cores + Executables Drivers + Glue Logic Embedded SoC Uniqueness of xPilot Platform-based synthesis and optimization Communication-centric synthesis with interconnect optimization


xPilot: Behavioral-to-RTL Synthesis Flow
  Inputs: behavioral spec. in C/SystemC; platform description
  Frontend compiler → SSDM
  Presynthesis optimizations: loop unrolling/shifting, strength reduction / tree height reduction, bitwidth analysis, memory analysis, ...
  Core synthesis optimizations: scheduling; resource binding, e.g., functional unit binding, register/port binding
  Arch-generation & RTL/constraints generation: Verilog/VHDL/SystemC
  Output: RTL + constraints for FPGAs (Altera, Xilinx) and ASICs (Magma, Synopsys, ...)
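As a small illustration of one presynthesis optimization named above, tree height reduction, the sketch below compares the number of adder levels on the critical path when a sum is evaluated as a left-to-right chain versus regrouped as a balanced tree. The function names are hypothetical and this is not xPilot's implementation; it only shows why the transformation shortens the critical path.

```python
def chain_depth(n_operands):
    """Adder levels when n operands are summed left to right: ((a+b)+c)+..."""
    return max(n_operands - 1, 0)


def balanced_depth(n_operands):
    """Adder levels when the same sum is regrouped as a balanced tree."""
    depth = 0
    while n_operands > 1:
        # Pair up the remaining operands; an odd one passes through unchanged.
        n_operands = (n_operands + 1) // 2
        depth += 1
    return depth


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, "operands: chain =", chain_depth(n), "levels, tree =", balanced_depth(n), "levels")
```

For eight operands the chain needs seven adder levels while the balanced tree needs three, at the same operation count.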

Advantages of Behavioral Synthesis Shorter verification/simulation cycle 100X speed up with behavior-level simulation Better complexity management, faster time to market 10M gate design may require 700K lines of RTL code Rapid system exploration Quick evaluation of different hardware/software boundaries Fast exploration of multiple micro-architecture alternatives Higher quality of results Platform-based synthesis & optimization Full consideration of physical reality

Behavior Synthesis Has Been Tried and Failed – Why? Reasons for previous failures Lack of a compelling reason: design complexity was still manageable a decade ago Lack of a solid RTL foundation Lack of consideration of physical reality Lack of widely accepted behavior models

xPilot Advantages Advanced algorithms for platform-based, communication-centric optimization Platform-based behavior and system synthesis Communication/interconnect-centric approach Complete validation through final P&R on FPGAs

Platform Modeling & Characterization Target platform specification High-level resource library with delay/latency/area/power curve for various input/bitwidth configurations Functional units: adders, ALUs, multipliers, comparators, etc. Connectors: mux, demux, etc. Memories: registers, synchronous memories, etc. Chip layout description On-chip resource distributions On-chip interconnect delay/power estimation ALU MUX ALU Two binding solutions for same behavior: Which one is better? Answer is platform-dependent: How large/fast are the MUX and ALU? 0.58 1.8 2.8 2.0 2.9 3.7 3.8 4.7 3X3 Delay Matrix for Stratix-EP1S40

Advanced Behavior System Algorithms: Example: Versatile Scheduling Algorithm Based on SDC
  Scheduling problem in behavioral synthesis is NP-Complete under general design constraints
  ILP-based solutions are versatile but very inefficient: exponential time complexity
[Figure: a data-flow graph of multiplications and additions (*1, +2, +3, +4, *5) assigned to control steps CS0 and CS1.]

Existing Scheduling Techniques for Behavioral Synthesis Heuristic approach: Fast, but ad hoc (limited efficiency to specific applications) Data-flow-based scheduling (Targets data-flow-intensive designs, e.g., DSP applications, image processing applications, etc.) Control-flow-based scheduling (Targets control-flow-intensive designs e.g., controllers, network protocol processors, etc.) Exact approach: Versatile, but inefficient (poor scalability) ILP-based scheduling, e.g., [Huang et al., TCAD’91], etc. BDD-based symbolic scheduling, e.g., [Radivojevic and Brewer, TCAD’96] …

Scheduling: Our Approach
Overall approach
  Current objective: high performance
  Use a system of integer difference constraints to express all kinds of scheduling constraints
  Represent the design objective as a linear function
Example (x_i is the control step assigned to operation v_i; operations include v1 (+), v2 (*), v3 (*), v4 (+), v5)
  Platform characterization: adder (+/–): 2 ns; multiplier (*): 5 ns
  Target cycle time: 10 ns
  Resource constraint: only ONE multiplier is available
  Dependency constraints
    v1 → v3: $x_3 - x_1 \ge 0$
    v2 → v3: $x_3 - x_2 \ge 0$
    v3 → v4: $x_4 - x_3 \ge 0$
    v4 → v5: $x_5 - x_4 \ge 0$
  Frequency constraint
    <v2, v5>: $x_5 - x_2 \ge 1$
  Resource constraint
    <v2, v3>: $x_3 - x_2 \ge 1$
  Dependency and frequency constraints in matrix form, $A x \le b$:
    $$
    \begin{pmatrix}
    1 & 0 & -1 & 0 & 0 \\
    0 & 1 & -1 & 0 & 0 \\
    0 & 0 & 1 & -1 & 0 \\
    0 & 0 & 0 & 1 & -1 \\
    0 & 1 & 0 & 0 & -1
    \end{pmatrix}
    \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{pmatrix}
    \le
    \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \\ -1 \end{pmatrix}
    $$
  Totally unimodular matrix: guarantees integral solutions
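The example constraints can be checked with a few lines of code. The sketch below reads each constraint as a lower bound on the later operation's control step and computes the earliest feasible step for every operation by repeated relaxation (a Bellman-Ford-style longest-path pass). This is a stand-in for the linear programming solver the slides describe, chosen because a system of difference constraints can be solved this way; the function name and data layout are hypothetical.

```python
# Each tuple (u, v, d) means x[v] - x[u] >= d, i.e. v starts at least d steps after u.
CONSTRAINTS = [
    ("v1", "v3", 0),  # dependency v1 -> v3
    ("v2", "v3", 0),  # dependency v2 -> v3
    ("v3", "v4", 0),  # dependency v3 -> v4
    ("v4", "v5", 0),  # dependency v4 -> v5
    ("v2", "v5", 1),  # frequency: the v2..v5 path does not fit in one 10 ns cycle
    ("v2", "v3", 1),  # resource: v2 and v3 share the single multiplier
]
NODES = ["v1", "v2", "v3", "v4", "v5"]


def asap_schedule(nodes, constraints):
    """Earliest non-negative control step per node, or None if infeasible."""
    step = {n: 0 for n in nodes}
    for _ in range(len(nodes)):
        changed = False
        for u, v, d in constraints:
            if step[v] < step[u] + d:
                step[v] = step[u] + d  # push v later to satisfy the constraint
                changed = True
        if not changed:
            return step
    return None  # still changing after |V| passes: the constraints form a positive cycle


if __name__ == "__main__":
    print(asap_schedule(NODES, CONSTRAINTS))
```

Under these constraints v1 and v2 land in control step 0 and v3, v4, v5 in control step 1, so the single multiplier is used once per step.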

UPS Scheduling: Overall Framework
  Inputs to the xPilot scheduler: CDFG; target platform modeling (resource library & chip layout); user-specified design constraints & assignments
  Constraint equations generation: relative timing constraints, dependency constraints, frequency constraints, resource constraints, ...
  Objective function generation
  System of pairwise difference constraints → linear programming solver → LP solution interpretation
  Output: STG (State Transition Graph)

UPS vs. SPARK: Results on SPARK's Benchmarks
Mult (*): 2 cycles; Div (/): 5 cycles; rest: one cycle. Target clock period: 7.5 ns

Benchmark        SPARK State#   SPARK W. Cycle#   UPS State#   UPS W. Cycle#   UPS / SPARK
MPEG2-dpframe    32             424               35           352             0.83
GIMP-tiler       27             2234                           1877            0.84
ADPCM-decoder    15             327               13           278             0.85
ADPCM-encoder    16             133                            112             0.84
Average Ratio                                                                  0.84

UPS achieves a 16% cycle count reduction over SPARK

Platform-Based Interface Synthesis
Focus on sequential communication media (SCM)
  FIFOs (e.g., Xilinx FSLs), buses (e.g., Xilinx CoreConnect, Altera Avalon, etc.)
  Order may have a dramatic impact on performance
  The best order should guarantee that no data transmission on the critical path is delayed by a non-critical transmission
Interface synthesis for SCM
  Consider both behavior and communication to determine the optimal transmission order
DCT example: process P1 (PE1, custom logic 1) sends data[8] to process P2 (PE2, custom logic 2) through a FIFO
  P1: for (int i = 0; i < 8; i++) { S1: data[i] = ...; }
  P2: int s07 = data[0] + data[7]; int s16 = data[1] + data[6]; ...
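To see why transmission order matters in the DCT example, the sketch below uses a deliberately simple timing model that is an assumption of this illustration, not taken from the slides: the FIFO delivers one element per cycle, and the consumer performs its additions in program order, one per cycle, each waiting until both operands have arrived. The function name is hypothetical.

```python
def finish_cycle(send_order, pairs):
    """Cycle in which the consumer's last addition completes.

    send_order[t] is the index of the element sent in cycle t + 1.
    pairs lists the operand indices of each addition, in program order.
    """
    arrival = {idx: t + 1 for t, idx in enumerate(send_order)}
    done = 0
    for a, b in pairs:
        ready = max(arrival[a], arrival[b])  # both operands must have arrived
        done = max(done, ready) + 1          # one cycle per addition
    return done


# s07 = data[0] + data[7]; s16 = data[1] + data[6]; ...
PAIRS = [(0, 7), (1, 6), (2, 5), (3, 4)]
IN_ORDER = [0, 1, 2, 3, 4, 5, 6, 7]
REORDERED = [0, 7, 1, 6, 2, 5, 3, 4]

if __name__ == "__main__":
    print("in-order: ", finish_cycle(IN_ORDER, PAIRS))
    print("reordered:", finish_cycle(REORDERED, PAIRS))
```

In this model the in-order transmission finishes in cycle 12, because the first addition waits for data[7], while sending the operands in the order the consumer needs them finishes in cycle 9.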

SCM Co-Optimization: Problem Formulation
Given:
  A set of processes P connected by a set of channels C
  A set of data D = {d1, d2, …, dm} to be transmitted on each channel cj
Goal:
  Find the optimal transmission order of each process, so that the overall latency of the process network is minimized subject to the given design constraints and platform specifications
  At the same time, generate the drivers and glue logic for each process automatically

SystemC/C-to-RTL Design Flow SystemC/C specification Front-end compiler xPilot behavioral synthesis SSDM (System-Level Synthesis Data Model) Platform description & constraints SSDM/CDFG Behavioral synthesis SSDM/FSMD RTL generation FSM with Datapath in VHDL Floorplan and/or multi- cycle path constraints RTL synthesis ASICs/FPGAs platform

Preliminary Results of xPilot: Better Complexity Management
  Significant code size reduction
  RTL design → behavioral design: 10x code size reduction
  VHDL code generated by UCLA xPilot targeting the Altera Stratix platform


Design Exploration for Heterogeneous MPSoC Platforms
Heterogeneous MPSoCs exploration
  Processors: heterogeneous vs. homogeneous; general-purpose vs. application-specific
  On-chip communication architecture (OCA): bus (e.g. AMBA, CoreConnect), packet switching network (e.g. Alpha 21364)
  Memory hierarchy
[Figure: an MPSoC with μP, DSP, IP, and FPGA tiles, each running tasks over an OS and driver, attached through network interfaces to a communication network.]

Configurable SoC Platforms General purpose processor cores + programmable fabric Tight integration using extended instructions (ASIPs) Example: Altera Nios / Nios II Loose integration using FIFOs/busses for communications Example: Xilinx MicroBlaze, etc. Custom instruction logic for Nios II [source: www.altera.com] Xilinx MicroBlaze [source: www.xilinx.com]

ASIP Compilation: Problem Statement
Given:
  CDFG G(V, E)
  The basic instruction set I
  Pattern constraints: number of inputs |PI(pi)| ≤ Nin; number of outputs |PO(pi)| = 1; total area
Objective:
  Generate a pattern library P
  Map G to the extended instruction set I ∪ P, so that the total execution time is minimized
Example:
  Basic instructions: t1 = a * b; t2 = b * c; t3 = d * e; t4 = t1 + t2; t5 = t2 + t3; t6 = t5 + t4;
  With extended instructions: t4 = ext-inst1(a, b, c); t5 = ext-inst2(b, c, d, e); t6 = t4 + t5;
  Costs: * takes 2 clock cycles; + takes 1 clock cycle; ext-inst1 (MAC1) and ext-inst2 (MAC2) take 2 cycles each
  Performance speedup = 9 / 5 = 1.8X
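The 1.8X figure follows from summing per-instruction cycle counts for the two code sequences. The sketch below redoes that arithmetic with the latencies given on the slide (multiply 2 cycles, add 1 cycle, each extended instruction 2 cycles); it assumes purely sequential single-issue execution with no overlap, and the names are hypothetical.

```python
# Latencies in clock cycles, as stated in the example.
CYCLES = {"mul": 2, "add": 1, "ext-inst1": 2, "ext-inst2": 2}

# t1..t3 are multiplies, t4..t6 are adds.
BASIC = ["mul", "mul", "mul", "add", "add", "add"]
# t4 = ext-inst1(...), t5 = ext-inst2(...), t6 = t4 + t5.
EXTENDED = ["ext-inst1", "ext-inst2", "add"]


def total_cycles(sequence):
    """Cycle count of an instruction sequence executed one after another."""
    return sum(CYCLES[op] for op in sequence)


if __name__ == "__main__":
    basic, extended = total_cycles(BASIC), total_cycles(EXTENDED)
    print(f"{basic} cycles -> {extended} cycles, speedup {basic / extended:.1f}X")
```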

Target Core Processor Model
  Classic single-issue pipelined RISC core (fetch / decode / execute / mem / write-back)
  The number of input and output operands of an instruction is pre-determined
  An instruction reads the core register file during the execute stage, and commits the result during the write-back stage
[Figure: core processor pipeline (IF/ID, ID/EX, EX/MEM, MEM/WB) with PC, instruction cache, register file (RS1, RS2), ALU, memory, and result MUX, plus custom logic taking operands OP1 and OP2.]

ASIP Compilation Flow
  Inputs: C code; architecture constraints
  Front-end compilation → CDFG
  1. Pattern generation: generate patterns satisfying the input/output constraints → pattern library
  2. Pattern selection: select a subset to maximize the potential speedup while satisfying the resource constraint
  3. Application mapping & graph covering: graph covering to minimize the total execution time → optimized CDFG
  Backend compilation → optimized assembly

Experimental Results on Altera Nios
  Altera Nios is used for ASIP implementation
  5 extended instruction formats, up to 2048 instructions for each format
  Small DSP applications are taken as benchmarks
Columns: Benchmark; Extended Instruction#; Speedup (Estimation, Nios); Resource Overhead (LE, Memory, DSP Block)
  fft_br    9  2.65  408  6.06%  65,536  9.79%
  iir       7  3.18  3.73  255  3.79%  4,736  0.71%  40
  fir       2.40  2.14  51  0.76%  1,024  0.15%  8
  pr        1.57  1.75  71  1.05%  14
  dir       2  3.28  3.02  54  0.80%  16
  mcm       4  4.75  3.22  186  2.76%  0.00%  56
  Average   3.08  2.75  2.54%  1.77%  -

Architecture Extension for ASIPs
Data bandwidth problem
  Limited register file bandwidth (two read ports, one write port)
  ~40% of the ideal performance speedup will be lost
Shadow-register-based architectural extension
  Core registers are augmented by an extra set of shadow registers
  Conditionally written during the write-back stage
  Low power/area overhead
  Novel shadow-register binding algorithms are developed
[Figure: the core processor pipeline extended with shadow registers SR1 … SRK feeding the custom logic, selected by a hashing unit with k = hash(j).]

Ongoing Work -- Mapping for Heterogeneous Integration with Multiple Processing Cores Given: A library of processing cores P and communication library C Task graph G(V, E) For each v in V, execution time t(v, pi) on pi For each (u, v) in E, communication data size s(u,v) Throughput constraint Problem: Select and instantiate the processing elements and communication channels from P and C respectively Map the tasks onto the processing elements and communications to the channels so that The optimal latency is achieved subject to the throughput constraint The implementation cost is minimized

Preliminary Results on Motion-JPEG Example
Pipeline: RAW images → Preprocess → DCT → Quant → Huffman → encoded JPEG images, with table modification
  Model #1: 5 MicroBlazes, FSL-based communication
  Model #2: 4 MicroBlazes + DCT on FPGA fabric (HW-DCT)
Platform: Xilinx XUP Board

System     Cycle#          Fmax (MHz)   Exe Time (ms)   Area (Slice#)
Model #1   23812           126          0.189           4306
Model #2   14800 (-38%)    126          0.117           6345
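The execution times in the table follow from cycle count divided by clock frequency. The sketch below reproduces them, with both models clocked at 126 MHz; applying that Fmax figure to Model #2 is an assumption of this sketch, though it is consistent with the reported 0.117 ms. The function name is hypothetical.

```python
def exe_time_ms(cycles, fmax_mhz):
    """Execution time in milliseconds for a cycle count at a clock in MHz."""
    return cycles / (fmax_mhz * 1e6) * 1e3


MODEL_1 = {"cycles": 23812, "fmax_mhz": 126}
MODEL_2 = {"cycles": 14800, "fmax_mhz": 126}  # Fmax assumed equal to Model #1

if __name__ == "__main__":
    t1 = exe_time_ms(**MODEL_1)
    t2 = exe_time_ms(**MODEL_2)
    reduction = 1 - MODEL_2["cycles"] / MODEL_1["cycles"]
    print(f"Model #1: {t1:.3f} ms, Model #2: {t2:.3f} ms, cycle reduction {reduction:.0%}")
```

The cycle reduction of about 38% comes at the cost of roughly 47% more slices (6345 vs. 4306), which is the kind of trade-off the exploration is meant to expose.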

Conclusions
xPilot has fairly mature and advanced behavior synthesis capability, from C or SystemC to RTL code with the necessary design constraints
xPilot advantages include
  Platform-based behavior and system synthesis
  Communication/interconnect-centric approach
  Advanced algorithms for platform-based, communication-centric optimization
  Promising results demonstrated on available FPGAs
xPilot system synthesis capabilities
  Performance simulation of multi-processor systems
  Exploration of the efficient use of (multiple) on-chip processors
  Compilation and optimization for reconfigurable processors

Acknowledgements
We would like to thank the following for their support:
  Gigascale Systems Research Center (GSRC)
  National Science Foundation (NSF)
  Semiconductor Research Corporation (SRC)
  Industrial sponsors under the California MICRO programs (Altera, Xilinx)
Team members: Yiping Fan, Guoling Han, Wei Jiang, Zhiru Zhang

EDA: Electronic Design Automation
  Idea/concept (high-level specification) → compilation & synthesis → chip
  Example: high-end automotive A/V application, 10M gates, 70 clocks (320 MHz), technology: 0.13 µm, 6LM
Courtesy of Magma Design Automation

EDA Is A Key Enabling Technology for Semiconductor Industry Computer-aided design (CAD) of very large-scale integrated (VLSI) circuits

Field-Programmable SOC Example: Altera Stratix II FPGA
  90nm Stratix II 2S60
  Nios II/f soft-core processor: 185 MHz, 218 max DMIPS, < 900 ALMs (< 1800 LEs)
  Avalon bus connecting Nios II processors and IP blocks
  Example application: software-defined radio (SDR) baseband data path reconfiguration
Courtesy Altera

Electronic System-Level (ESL) Design Automation Modeling SystemC -- OpenSource SystemVerilog Simulation and Verification Behavior-level simulation & verification System-level simulation & verification SystemC provides behavior-level and system-level synthesis capabilities for free -- rapidly gaining popularity Synthesis Behavior-level synthesis: from behavior specification (e.g. C, SystemC, or Matlab) to RTL or netlists System-level synthesis: from system specification to system implementation

ESL Tools – A Lot of Interests …

Communication- and Interconnect-Centric Synthesis: Example: Use of Distributed Register-File Architectures
Distributed register-file micro-architecture:
  Efficiently uses on-chip embedded memories
  Fully explores operation and data-transfer parallelism
  Each island contains input buffers, data-routing logic, a local register file, a MUX, and a functional unit pool (FUP), e.g. MUL, ALU, ALU'
Example: a scheduled DFG with register binding indicated on each variable (assume a one-functional-unit constraint)
  Binding using discrete registers
  Binding using a register file: more efficient design!
[Figure: islands A, B, and C of the distributed register-file micro-architecture, and the two bindings of the example DFG.]

Distributed Register-File Microarchitecture
[Figure: islands A, B, and C mapped onto an FP-SoC with on-chip memory blocks; each island contains input buffers, data-routing logic, a local register file, a MUX, and a functional unit pool (MUL, ALU, ALU').]

On-chip RAM resource on Virtex II
Xilinx XC-2V     2000   3000   4000   6000    8000
#18Kb BRAM       56     96     120    144     168
Dist. RAM (Kb)   336    448    720    1,056   1,456

Resource Binding for DRF-Microarchitecture
Facts under simplified assumptions
  Operations bound onto an island form a chain in the given scheduled DFG (intra-island transfers)
  Inter-chain data transfers may share a physical inter-island connection (inter-island transfers)
  The number of inter-island connections (IIC) is crucial to the QoR of a DRFM instance
Example schedule
  C-step 1: v1, v6
  C-step 2: v2, v7
  C-step 3: v3, v9
  C-step 4: v4, v5, v8, v10
Islands (chains): A, B, C, D
Inter-island connections = 5
  (A,B) = (A,D) = 1
  (A,C) = 1, two data transfers share one connection
  (C,D) = 2

DRFM Binding Solution
Overview:
  Proceed in a step-by-step fashion, one c-step at a time
  Use weighted bipartite matching to solve each step optimally
Example: c-steps 1 and 2 are handled. For c-step 3 (operations v3 and v9):
  Construct a weighted bipartite graph between the operations and the islands (chains) A, B, C, D
  Edge weight = number of newly introduced inter-island connections (IIC)
  Min-weight matching → optimal binding in this step
  Solution of this step: matching v3 → Island A, v9 → Island C; newly introduced IIC = 0
Final inter-island connections = 4
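One step of this binding can be sketched as a small min-weight assignment problem. The edge weights below are illustrative placeholders rather than values taken from the slides, and the exhaustive search over assignments is a stand-in for a proper bipartite matching algorithm, which would be needed at realistic sizes. The function name is hypothetical.

```python
from itertools import permutations


def min_weight_binding(ops, islands, weight):
    """Bind each operation to a distinct island at minimum total weight.

    weight[op][island] is the number of new inter-island connections that
    binding op to island would introduce. Exhaustive search: small inputs only.
    """
    best_cost, best = None, None
    for chosen in permutations(islands, len(ops)):
        cost = sum(weight[op][isl] for op, isl in zip(ops, chosen))
        if best_cost is None or cost < best_cost:
            best_cost, best = cost, dict(zip(ops, chosen))
    return best, best_cost


# Hypothetical weights for c-step 3: v3 fits island A for free, v9 fits island C.
WEIGHT = {
    "v3": {"A": 0, "B": 1, "C": 1, "D": 1},
    "v9": {"A": 2, "B": 2, "C": 0, "D": 2},
}

if __name__ == "__main__":
    binding, new_iic = min_weight_binding(["v3", "v9"], ["A", "B", "C", "D"], WEIGHT)
    print(binding, "new IIC =", new_iic)
```

With these weights the search returns v3 on island A and v9 on island C with no new inter-island connections, matching the solution stated for this step.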

DRF Experimental Results: Three Experimental Flows for Comparison xPilot Frontend xPilot behavioral synthesis system SSDM/CDFG Scheduling algorithms Scheduled CDFG (STG) 1) Binding on Discrete-Register Microarchitecture 2) Baseline (Random) DRF Binding 3) DRF Binding for Minimizing Inter-Island Connections RTL generation Xilinx Virtex II

DRF Experimental Results
Xilinx ISE 7.1; Virtex II; target clock period: 8 ns
  The baseline DRF binding results achieve a 46.70% slice reduction over the discrete-register approach
  Optimized DRF binding reduces slices by a further 12.21%
  Overall, more than 2X logic slice reduction with a better clock period (7.8%)
[Figure: two charts, area in slices (DRF solutions use on-chip RAM blocks) and clock period in ns.]

Preliminary Result of xPilot: Better QoR (Comparison with UCI/UCSD SPARK)
Device setting: Xilinx Virtex-II Pro (xc2v4000-6). Target frequency: 200 MHz

            SPARK                                    xPilot                                   Fmax ratio
Designs     Slice   LUT    FF    DSP   Fmax (MHz)    Slice   LUT    FF     DSP   Fmax (MHz)   xPilot/SPARK
PR          588     981    247         92.85         331     416    564    16    146.84       1.58
WANG        660     1157   265         109.29        357     464           15    133.51       1.22
LEE         574     996    220         109.17        356     484    659    19    131.93       1.21
MCM         1062    1857   479         99.40         887     1207   1282   30    110.38       1.11
DIR         1323    2256   494   3     79.30         979     1002   1732   56    98.81        1.25
Ave Ratio   1.00 (baseline)                          0.66    0.48   2.74   n/a                1.27

Proposed SCM Co-Optimization Design Flow Platform Description & Constraints Process Network Front End System-Level Synthesis Data Model SCOOP (SCM CO-Optimization) Communication order detection Code transformation and interface generation Indices compression for loop reordering Drivers + Glue Logics Process Behavior

Communication Order Detection Step 1. Construct a global CDFG by merging the individual CDFGs of each process Step 2. Solve a resource-constrained min-latency scheduling problem to optimize the total latency of the global CDFG Process 1 Process 2 + T1 T2 T3  * Latency = 5 cycles Latency = 7 cycles Ti : FIFO

Loop Indices Compression Given the optimal order, we try to generate restructured loops for code compression i.e., given the original iteration and reordered iteration, find the minimum number of linear intervals to represent the new iteration space Original order: (0,0), (0,1), (1,0), (1,1) After reordering: (0,0), (1,0), (0,1), (1,1) Need to solve the linear system Solution: i’=j, j’ = i;
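The 2x2 example amounts to a loop interchange: the reordered sequence is what the original loop nest produces once the roles of the two indices are swapped (i' = j, j' = i). The sketch below enumerates both iteration orders to confirm this; the function names are hypothetical and general reorderings may need more than one linear interval.

```python
def original_order(n, m):
    """Iteration order of: for i in range(n): for j in range(m)."""
    return [(i, j) for i in range(n) for j in range(m)]


def interchanged_order(n, m):
    """Iteration order after substituting i' = j, j' = i.

    The new outer index i' walks the old j; the new inner index j' walks the
    old i. Each visited point is reported in the original (i, j) coordinates.
    """
    order = []
    for i_new in range(m):
        for j_new in range(n):
            order.append((j_new, i_new))  # (i, j) = (j', i')
    return order


if __name__ == "__main__":
    print("original: ", original_order(2, 2))
    print("reordered:", interchanged_order(2, 2))
```

Expressing the new order as a single interchanged loop nest, rather than as an explicit list of indices, is what keeps the generated code compact.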

Initial Results of Interface Synthesis
  Targets sequential communication channels, in particular FSL in Virtex-II
  Considers two communicating processes

            Total latency (Cycle#)          RAs Compress
Designs     Trad.   SCOOP   Reduction       Before   After
DCT1        325     290     10.77%
Haar        142     134     5.63%
DWT         689     617     10.45%
Mat_mul     408     339     16.91%          96       20
DCT2        483     419     13.25%          80       64
Masking     620     420     32.26%          192
Dot         1903    1084    43.04%          300

An average of 26% improvement in total latency can be achieved.

MPEG-4 Simple Profile Decoder: Architecture Profiling

C specification overview
Module Name          Orig. C Source File      Orig. C line #
Copy Controller      copyControl.c            287
Display Controller   displayControl.c         358
Motion Comp.         Motion-Compensation.c    312
Parser/VLD           parser.c                 1092
                     texture_vld.c            508
Texture/IDCT         texture_idct.c           1901
Texture Update       textureUpdate.c          220

Runtime profiling (PowerPC/XUP board)
Parser/VLD           59.0%
Texture/IDCT         18.1%
Motion Comp.         15.7%
Copy Controller      3.6%

MPEG-4 Simple Profile Decoder: Hybrid HW/SW Implementation
  HW block integrated with a single-process PowerPC design: 15% speed improvement
  Software blocks running on the PowerPC

MPEG-4 Simple Profile Decoder: Alternate Implementations

                                Single uBlaze   7-uBlaze   Single PowerPC   Single PowerPC w/ HW Motion Comp.
Throughput (frames per second)  0.59            1.18       3.06             3.53
Improvement                     -               +209%      +68.4%           +15.3%

xPilot synthesis report of HW blocks
                 Line counts                  Slices (FFs, LUTs)    MUL   Clock period (ns)   Latency (cycles)
                 C     RTL SystemC  RTL VHDL
Motion Comp.     210   9903         5655      986 (1111, 1017)      2     7.97                505
Block IDCT       200   9534         2731      1877 (2376, 2438)     26    7.963               280
Texture Update   160   8227         4475      1551 (1696, 1931)     4     7.913               335

Advantages of Our Scheduling Algorithm A highly versatile scheduling engine (UPS) Supports a wide spectrum of applications with high complexity Data-intensive, control-intensive, memory-intensive, mixed, etc. Honors a rich set of design constraints Resource constraints, relative timing constraints, frequency constraints, latency constraints, etc. Offers a variety of optimization techniques Operation chaining, pipelined multi-cycle operation, awareness of repetitions, behavioral templates, speculation, functional/loop pipelining, multi-cycle communication Accounts for physical reality Optimizes communications simultaneously with computations

Preliminary Results of xPilot: Rapid System Exploration
  Quick evaluation of various amounts of process-level concurrency and different hardware/software boundaries
  Example: Motion-JPEG implementation
    All-HW implementation
    All-SW implementation (using embedded processors)
    SW/HW co-design: optimal partitioning?
  Repeated manual RTL coding is not a solution!

Preliminary Results of xPilot Shorter Simulation/Verification Cycle From other projects: Simulation speed on behavior model 100X faster than RTL-based method [NEC, ASPDAC04] Our experience: Motion-compensation module in a Mpeg4-decoder Behavior level (in C language) simulation Less than 1 second per frame RTL SystemC simulation About 310 second per frame

Ongoing Work: Design Exploration for MPSoCs
A scalable architecture simulation infrastructure for architecture evaluation & performance/power estimation
  Need for structural abstraction of processors and interconnects
    Recent work such as Liberty is an effort in this direction
  Complete structural abstraction makes the simulation very slow
    Liberty is about 10X slower than SimpleScalar on the Itanium model
  Hybrid approach: trade-off between accuracy and simulation time
    Interconnect modeled accurately using SystemC (for accuracy)
    Cores modeled using SimpleScalar (for simulation speed)
Communication network synthesis
  Automatic interface synthesis is required
  Physical planning is needed for interconnect latency/power estimation