
Hardware and Interface Synthesis of FPGA Blocks using Parallelizing Code Transformations
Sumit Gupta, Manev Luthra, Nikil Dutt, Alex Nicolau, Rajesh Gupta
Center for Embedded Computer Systems, University of California, Irvine and San Diego
Supported by Semiconductor Research Corporation

Copyright Sumit Gupta

Overview of a HW/SW Co-design Flow
[Figure: co-design flow diagram. Block labels: Application Specs, HW/SW Partitioning, Software Behavioral Description, Software Compiler, Hardware Behavioral Description, High Level Synthesis, Interface Synthesis, Logic Synthesis and P&R. Edge labels: C, RTL VHDL. Targets: ASIC, FPGA. Platform blocks: Processor Core, Memory, I/O, Hardware.]

Outline
- Motivation and Background
- Our Approach to Parallelizing High-Level Synthesis
- Our Approach to Interface Synthesis using Memory Mapping
- Experimental Results
  - HLS: Multimedia and Image Processing Applications
  - Interface Synthesis: Case Study of MPEG-1 block
- Conclusions and Future Work

High Level Synthesis
- Transform behavioral descriptions to RTL/gate level
- From C to CDFG to Architecture

Example behavioral description:
    x = a + b;
    c = a < b;
    if (c) then d = e - f;
    else g = h + i;
    j = d * g;
    l = e + x;

[Figure: the corresponding CDFG, with operations x = a + b and c = a < b, an If Node on c whose T and F branches hold d = e - f and g = h + i, followed by j = d * g and l = e + x; and an architecture with Control, Data path, ALU, and Memory.]

- Problem #1: Poor quality of HLS results beyond straight-line behavioral descriptions
- Problem #2: Poor/No controllability of the HLS results
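The step from a behavioral description to a schedule can be sketched on the slide's six-operation example. The snippet below is only an illustration under stated assumptions: every operation takes one control step, resources are unlimited, and the operations inside the conditional wait for the condition `c`. The names `OPS`, `CONTROL`, and `asap_schedule` are hypothetical and are not taken from any HLS tool.

```python
# Hypothetical encoding of the slide's example as a tiny dependence graph.
# Each operation maps to the values it reads; names not in OPS are primary inputs.
OPS = {
    "x": ["a", "b"],   # x = a + b
    "c": ["a", "b"],   # c = a < b
    "d": ["e", "f"],   # d = e - f   (true branch of the If node)
    "g": ["h", "i"],   # g = h + i   (false branch of the If node)
    "j": ["d", "g"],   # j = d * g
    "l": ["e", "x"],   # l = e + x
}
# Assumed control dependences: branch operations wait for the condition c.
CONTROL = {"d": "c", "g": "c"}


def asap_schedule(ops, control):
    """Return the earliest control step (1-based) for every operation."""
    step = {}

    def visit(name):
        if name not in ops:          # primary input: ready before step 1
            return 0
        if name not in step:
            preds = list(ops[name])
            if name in control:
                preds.append(control[name])
            step[name] = 1 + max(visit(p) for p in preds)
        return step[name]

    for name in ops:
        visit(name)
    return step


schedule = asap_schedule(OPS, CONTROL)
for name in sorted(schedule, key=lambda n: (schedule[n], n)):
    print(f"step {schedule[name]}: {name}")
```

Under these assumptions the conditional forces `d` and `g` into the second step, which is the kind of control-flow limit the two problems on this slide refer to.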

High-level Synthesis
- Well-researched area: from the early 1980s
- Renewed interest due to new system level design methodologies
- A large number of synthesis optimizations have been proposed
  - Either operation level: algebraic transformations on DSP codes
  - or logic level: Don't Care based control optimizations
- In contrast, compiler transformations operate at both operation level (fine-grain) and source level (coarse-grain)
- Parallelizing Compiler Transformations
  - Different optimization objectives and cost models than HLS
- Our aim: Develop Synthesis and Parallelizing Compiler Transformations that are "useful" for HLS
  - Beyond scheduling results: in Circuit Area and Delay
  - For large designs with complex control flow (nested conditionals/loops)

Our Approach: Parallelizing HLS (PHLS)
[Figure: PHLS flow. Stage labels: C Input, Original CDFG, Optimized CDFG, Scheduling & Binding, VHDL Output. Transformation labels: Source-Level Compiler Transformations, Scheduling Compiler & Dynamic Transformations.]
- Optimizing Compiler and Parallelizing Compiler transformations applied at Source-level (Pre-synthesis) and during Scheduling
  - Source-level code refinement using Pre-synthesis transformations
  - Code Restructuring by Speculative Code Motions
  - Operation replication to improve concurrency
  - Dynamic transformations: exploit new opportunities during scheduling
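As a rough illustration of what a speculative code motion does, the sketch below hoists both branch computations above the condition and then selects one result. It is a hand-written before/after pair assuming side-effect-free operations, not output from SPARK; the function names `original` and `speculated` are hypothetical.

```python
def original(a, b, e, f, h, i):
    """Conditional as written: each branch runs only after c is known."""
    c = a < b
    if c:
        r = e - f
    else:
        r = h + i
    return r


def speculated(a, b, e, f, h, i):
    """Both branch operations execute speculatively, before c is resolved."""
    t1 = e - f          # speculated true-branch operation
    t2 = h + i          # speculated false-branch operation
    c = a < b           # the comparison can now run alongside t1 and t2
    return t1 if c else t2   # in hardware, a multiplexer steered by c


# The transformation must preserve behavior for every input.
for args in [(1, 2, 9, 4, 3, 5), (2, 1, 9, 4, 3, 5), (0, 0, -1, 1, 7, 7)]:
    assert original(*args) == speculated(*args)
```

The trade-off in such a rewrite is extra functional-unit work (one of the two results is discarded) in exchange for removing the wait on the condition.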

SPARK High Level Synthesis Framework

SPARK Parallelizing HLS Framework
- C input and Synthesizable RTL VHDL output
- Tool-box of Transformations and Heuristics
  - Each of these can be developed independently of the other
- Script based control over transformations & heuristics
- Hierarchical Intermediate Representation (HTGs)
  - Retains structural information about design (conditional blocks, loops)
  - Enables efficient and structured application of transformations
- Complete HLS tool: Does Binding, Control Synthesis and Backend VHDL generation
  - Interconnect Minimizing Resource Binding
- Enables Graphical Visualization of Design description and intermediate results
- 100,000+ lines of C++ code
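To make the idea of a hierarchical intermediate representation concrete, here is a minimal sketch of a tree that keeps conditional blocks and loops as explicit nodes. The classes `BasicBlock`, `IfNode`, `LoopNode` and the helper `summarize` are hypothetical illustrations and do not reflect SPARK's actual C++ data structures.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class BasicBlock:
    ops: List[str] = field(default_factory=list)


@dataclass
class IfNode:
    cond: str
    then_branch: List["Node"] = field(default_factory=list)
    else_branch: List["Node"] = field(default_factory=list)


@dataclass
class LoopNode:
    cond: str
    body: List["Node"] = field(default_factory=list)


Node = Union[BasicBlock, IfNode, LoopNode]


def summarize(nodes):
    """Walk the hierarchy and count ifs, loops, non-empty blocks, operations."""
    total = {"ifs": 0, "loops": 0, "blocks": 0, "ops": 0}
    for node in nodes:
        if isinstance(node, BasicBlock):
            total["blocks"] += 1 if node.ops else 0
            total["ops"] += len(node.ops)
            continue
        if isinstance(node, IfNode):
            total["ifs"] += 1
            children = node.then_branch + node.else_branch
        else:
            total["loops"] += 1
            children = node.body
        for key, value in summarize(children).items():
            total[key] += value
    return total


# The six-operation example with one conditional, kept as a hierarchy.
design = [
    BasicBlock(["x = a + b", "c = a < b"]),
    IfNode("c", [BasicBlock(["d = e - f"])], [BasicBlock(["g = h + i"])]),
    BasicBlock(["j = d * g", "l = e + x"]),
]
stats = summarize(design)
print(stats)
```

Because the conditional stays a single node with its branches attached, a transformation can reason about "everything inside this if" without rediscovering the structure from a flat graph.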

Interface Synthesis
- "Glue" logic to make the hardware and software talk to each other
- Data must be transferred or shared between the hardware and software
- Our approach to Interface Synthesis:
  - Map common data to a shared memory
  - Our focus is on optimizing the number and size of memory banks used for mapping
- Target architecture: Platform with FPGA and Processor core on same FPGA fabric
  - Take advantage of the Memory Blocks available on such FPGA platforms for implementing the shared memory

The Interface
[Figure: system architecture illustrating the interface HW. Block labels: Processor Core, CPU Bus, Main Memory, Peripheral, Peripheral, Hardware, Interface Logic, Custom Logic, Shared Memory, Interface Hardware.]
- Data exchanged between HW and SW resides in the shared memory
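One way to picture data residing in a shared memory is a fixed address map that both sides agree on. The sketch below simulates that in software; it assumes 32-bit little-endian words, and the variable names (`coeffs`, `result`), the sizes, and the `SharedMemory` class are all hypothetical, not the interface generated in this work.

```python
import struct

WORD_BYTES = 4  # assumed word size: 32 bits


def build_address_map(variables):
    """Assign consecutive word offsets to (name, length_in_words) pairs."""
    address_map, offset = {}, 0
    for name, words in variables:
        address_map[name] = (offset, words)
        offset += words
    return address_map, offset


class SharedMemory:
    """Byte buffer standing in for the memory block seen by both HW and SW."""

    def __init__(self, total_words):
        self.buf = bytearray(total_words * WORD_BYTES)

    def write_word(self, word_index, value):
        struct.pack_into("<i", self.buf, word_index * WORD_BYTES, value)

    def read_word(self, word_index):
        return struct.unpack_from("<i", self.buf, word_index * WORD_BYTES)[0]


# Hypothetical shared variables: 4 input words and 2 output words.
address_map, total_words = build_address_map([("coeffs", 4), ("result", 2)])
mem = SharedMemory(total_words)

# "Software" side writes the inputs at the agreed offsets.
base, length = address_map["coeffs"]
for k in range(length):
    mem.write_word(base + k, 10 * (k + 1))

# "Hardware" side reads them and stores a sum at the result offset.
out_base, _ = address_map["result"]
mem.write_word(out_base, sum(mem.read_word(base + k) for k in range(length)))
print(mem.read_word(out_base))
```

The point of the fixed map is that neither side needs a handshake to locate a variable; only the offsets have to match.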

Target HW Interface Architecture
[Figure: FPGA Fabric containing the Appl. Kernel, Control Logic, Mapped Local Memory, Memory Controller, and Bus Interface, together with the Processor Core and CPU Bus. Port labels: Multiple Ports, Single Port.]

An Example of Memory Mapping
- Inefficient use of the FPGA fabric by implementing independent registers
  [Figure: registers Var 1, Var 2, ..., Var N connected through an N:2M Mux to functional units FU 1 ... FU M.]
- Efficient use of FPGA resources by utilizing its abundant memory blocks
  [Figure: variables Var 1, Var 2, Var 3, ..., Var N held in local memories Mem 1, Mem 2, ..., Mem K, connected through a K:2M Mux to functional units FU 1 ... FU M.]

Mapping Designs with Control Flow
- Mapping variables to memory banks is a well-known technique
- What's new here: Use of
  - Scheduling info from High Level Synthesis tool
  - Mutual exclusivity info in designs with control flow
- Data access patterns resolved only during run time
  - So how do we perform memory mapping at compile time?
[Figure: an If Node with T and F branches over basic blocks BB 1, BB 2, and BB 3, which access variables a, b, and c in cycles C0 and C1. The condition is resolved at run time.]
Access Pattern: {(a, b), (a, c)} or {(a, b, c), (a, b)}

Our Approach to Memory Mapping
- We use
  - Scheduling information from SPARK Synthesis tool
    - Cycles in which variables are accessed (read/write)
  - Mutual exclusivity information from the control flow graph
    - The conditions under which variables are accessed
- Capture this information using Multi-color Conflict Graphs
- Then create a Variable to Memory Bank Mapping that minimizes the number of memory instances
- Details available in ICCD 2003 paper
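The general idea of mapping variables to banks from scheduled accesses can be sketched with an ordinary conflict graph and greedy coloring. This is a deliberate simplification and not the multi-color conflict graph algorithm of the ICCD 2003 paper: it assumes single-port banks, so two variables accessed in the same cycle of the same access pattern must sit in different banks, and accesses from mutually exclusive patterns never conflict with each other. The function names are hypothetical.

```python
from itertools import combinations


def build_conflicts(patterns):
    """Collect variable pairs accessed in the same cycle of the same pattern.

    patterns: list of access patterns, one per mutually exclusive outcome;
    each pattern is a list of per-cycle tuples of variable names.
    """
    conflicts = set()
    for pattern in patterns:
        for cycle_accesses in pattern:
            for u, v in combinations(sorted(set(cycle_accesses)), 2):
                conflicts.add((u, v))
    return conflicts


def map_to_banks(variables, conflicts):
    """Greedy coloring: conflicting variables never share a bank."""
    neighbors = {v: set() for v in variables}
    for u, v in conflicts:
        neighbors[u].add(v)
        neighbors[v].add(u)
    bank = {}
    # Place the most constrained variables first (a common heuristic).
    for v in sorted(variables, key=lambda n: (-len(neighbors[n]), n)):
        used = {bank[n] for n in neighbors[v] if n in bank}
        b = 0
        while b in used:
            b += 1
        bank[v] = b
    return bank


# The two run-time alternatives from the access-pattern example.
pattern_1 = [("a", "b"), ("a", "c")]
pattern_2 = [("a", "b", "c"), ("a", "b")]

both = map_to_banks("abc", build_conflicts([pattern_1, pattern_2]))
only_first = map_to_banks("abc", build_conflicts([pattern_1]))
print(len(set(both.values())), len(set(only_first.values())))
```

Since the branch outcome is unknown at compile time, the mapping has to satisfy every pattern at once, which is why the second pattern's three-way access in one cycle drives the bank count here. Greedy coloring is a heuristic and does not guarantee the minimum number of banks in general.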

Experiments for SPARK HLS Tool
- Results presented here for
  - Pre-synthesis transformations
  - Speculative Code Motions
  - Dynamic CSE
- We used SPARK to synthesize designs derived from several industrial designs
  - MPEG-1, MPEG-2, GIMP Image Processing software
  - Case Study of Intel Instruction Length Decoder
- Scheduling Results
  - Number of States in FSM
  - Cycles on Longest Path through Design
- VHDL: Logic Synthesis
  - Critical Path Length (ns)
  - Unit Area

Target Applications
[Table: columns Design, # of Ifs, # of Loops, # Non-Empty Basic Blocks, # of Operations; rows MPEG-1 pred, MPEG-1 pred, MPEG-2 dp_frame, GIMP tiler; cell values are missing from the transcript.]

Scheduling & Logic Synthesis Results
[Figure: results chart. Legend: Non-speculative CMs: Within BBs & Across Hier Blocks; + Speculative Code Motions; + Pre-Synthesis Transforms; + Dynamic CSE. Percentage labels on the chart: 42%, 10%, 36%, 8%, 39%.]
- Overall: % improvement in Delay
- Almost constant Area

Scheduling & Logic Synthesis Results
[Figure: results chart. Legend: Non-speculative CMs: Within BBs & Across Hier Blocks; + Speculative Code Motions; + Pre-Synthesis Transforms; + Dynamic CSE. Percentage labels on the chart: 14%, 20%, 1%, 33%, 41%, 52%.]
- Overall: % improvement in Delay
- Almost constant Area

Interface Synthesis: Case Study
- Target Platform: Altera's Nios development [...] MHz
  - Soft core of the Nios embedded processor (no caches)
  - Library of peripheral IPs like timer, UART, SRAM controller
  - Altera FPGA with 8320 Logic Cells
- Tools
  - The SPARK parallelizing high level synthesis framework for high level synthesis
  - Leonardo Spectrum and Quartus II for logic synthesis and P&R respectively
  - The Nios specific GNU toolkit for compiling C code

Example: MPEG-1 Prediction Block
[Table: columns Design, # of Ifs, # of Loops, # Non-Empty Basic Blocks, # of Operations; rows MPEG-1 pred, MPEG-1 pred; cell values are missing from the transcript.]
- MPEG-1 pred1 used 1 out of 3 kernels mapped to HW
- MPEG-2 pred2 used the remaining 2 kernels mapped to HW
- HW/SW interface consisted of 181 words of data

Conclusions
- Parallelizing code transformations enable a new range of HLS transformations
  - Provide the needed improvement in quality of HLS results
  - Possible to be competitive against manually designed circuits
  - Can enable productivity improvements in microelectronic design
- Built a C-to-VHDL synthesis system with a range of code transformations
  - Platform for applying Coarse and Fine-grain Optimizations
  - Tool-box approach where transformations and heuristics can be developed
  - Script-based synthesis enables designer to control HLS process
- Performance improvements of % across a number of designs
- Presented an interface synthesis approach targeted at FPGA based platforms
  - Based on a memory mapping algorithm that utilizes scheduling information for synthesizing designs with control flow

SPARK Release
- Available for download
- User Manual
  - Running the tool
  - Customizing the synthesis scripts
- Tutorial
  - Synthesis of Portion of Motion Compensation algorithm in MPEG-1 player (along with logic synthesis scripts)

Thank You