Center for Embedded Computer Systems University of California, Irvine and San Diego SPARK: A Parallelizing High-Level Synthesis.

Presentation transcript:

SPARK: A Parallelizing High-Level Synthesis Framework
Sumit Gupta, Rajesh Gupta, Nikil Dutt, Alex Nicolau
Center for Embedded Computer Systems, University of California, Irvine and San Diego
Supported by Semiconductor Research Corporation & Intel Inc

Copyright Sumit Gupta

System Level Synthesis
[Figure: system-level synthesis flow. Labels: System Level Model, Task Analysis, HW/SW Partitioning, Hardware Behavioral Description, Software Behavioral Description, High Level Synthesis, Software Compiler. Platform blocks: ASIC, Processor Core, Memory, FPGA, I/O.]

High Level Synthesis
- Transform behavioral descriptions to RTL/gate level
- From C to CDFG to Architecture

Example behavioral description:

    x = a + b;
    c = a < b;
    if (c) then d = e - f;
    else g = h + i;
    j = d * g;
    l = e + x;

[Figure: the CDFG for this fragment, with an If Node on c (T and F branches holding d = e - f and g = h + i), mapped to an architecture of Memory, ALU, Control, and Data path.]

- Problem #1: Poor quality of HLS results beyond straight-line behavioral descriptions
- Problem #2: Poor/No controllability of the HLS results
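
To make the "C to CDFG" step concrete, the fragment on this slide can be modelled as basic blocks joined by control edges. The following is only a minimal Python sketch: the `Block` class, the block names, and the `run` helper are hypothetical illustrations, not SPARK's internal data structures.

```python
from dataclasses import dataclass
import operator

# Operators used by the slide's fragment.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "<": operator.lt}

@dataclass
class Block:
    """A basic block: straight-line operations plus an optional branch."""
    ops: list            # (dest, op, src1, src2) tuples in program order
    cond: str = None     # variable tested at the end of the block, if any
    succ: tuple = ()     # (true_target, false_target) or (next_block,)

# Hypothetical CDFG for the fragment on the slide.
CDFG = {
    "entry": Block([("x", "+", "a", "b"), ("c", "<", "a", "b")],
                   cond="c", succ=("then", "else")),
    "then": Block([("d", "-", "e", "f")], succ=("join",)),
    "else": Block([("g", "+", "h", "i")], succ=("join",)),
    "join": Block([("j", "*", "d", "g"), ("l", "+", "e", "x")]),
}

def run(cdfg, env, start="entry"):
    """Walk the graph one block at a time; return final values and op count."""
    env = dict(env)
    name, executed = start, 0
    while name is not None:
        block = cdfg[name]
        for dest, op, s1, s2 in block.ops:
            env[dest] = OPS[op](env[s1], env[s2])
            executed += 1
        if block.cond is not None:
            name = block.succ[0] if env[block.cond] else block.succ[1]
        elif block.succ:
            name = block.succ[0]
        else:
            name = None
    return env, executed

# d and g need initial values because each path assigns only one of them.
values, executed = run(CDFG, dict(a=1, b=2, e=5, f=3, h=4, i=6, d=0, g=10))
print(values["j"], values["l"], executed)
```

Executed strictly block by block, every path through this graph is serialized behind the test on `c`, which is the kind of control-dominated structure the slide's Problem #1 refers to.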

High-level Synthesis
- Well-researched area: from early 1980's
  - Renewed interest due to new system level design methodologies
- A large number of synthesis optimizations have been proposed
  - Either operation level: algebraic transformations on DSP codes
  - or logic level: Don't Care based control optimizations
- In contrast, compiler transformations operate at both operation level (fine-grain) and source level (coarse-grain)
- Parallelizing Compiler Transformations
  - Different optimization objectives and cost models than HLS
- Our aim: Develop Synthesis and Parallelizing Compiler Transformations that are "useful" for HLS
  - Beyond scheduling results: in Circuit Area and Delay
  - For large designs with complex control flow (nested conditionals/loops)

Our Approach: Parallelizing HLS (PHLS)
[Figure: PHLS flow. Labels: C Input, Original CDFG, Source-Level Compiler Transformations, Optimized CDFG, Scheduling & Binding, Scheduling Compiler & Dynamic Transformations, VHDL Output.]
- Optimizing Compiler and Parallelizing Compiler transformations applied at Source-level (Pre-synthesis) and during Scheduling
  - Source-level code refinement using Pre-synthesis transformations
  - Code Restructuring by Speculative Code Motions
  - Operation replication to improve concurrency
  - Dynamic transformations: exploit new opportunities during scheduling
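
A speculative code motion can be illustrated on the small conditional from the earlier High Level Synthesis slide. In the sketch below (Python, with hypothetical function names `original` and `speculated`), the operations inside both branches are hoisted above the test and only the commit stays conditional. This is an assumed, simplified picture of the idea, not SPARK's implementation.

```python
def original(a, b, e, f, h, i, d, g):
    """The fragment as written: each branch computes only its own result."""
    x = a + b
    c = a < b
    if c:
        d = e - f
    else:
        g = h + i
    return d * g, e + x          # (j, l)

def speculated(a, b, e, f, h, i, d, g):
    """After speculation: both branch operations are hoisted above the test.

    d_spec and g_spec do not depend on c, so a scheduler is free to place
    them in the same step as x and c when enough functional units exist.
    """
    x = a + b
    c = a < b
    d_spec = e - f               # executed speculatively
    g_spec = h + i               # executed speculatively
    d = d_spec if c else d       # commit: only the taken branch updates
    g = g if c else g_spec       # its variable, as in the original
    return d * g, e + x          # (j, l)

print(original(1, 2, 5, 3, 4, 6, 0, 10), speculated(1, 2, 5, 3, 4, 6, 0, 10))
```

The trade-off is that both `e - f` and `h + i` now execute on every run, so the shorter schedule is paid for with extra operations and the multiplexing needed for the commit.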

SPARK High Level Synthesis Framework

SPARK Parallelizing HLS Framework
- C input and Synthesizable RTL VHDL output
- Tool-box of Transformations and Heuristics
  - Each of these can be developed independently of the other
- Script-based application of transformations, passes, and heuristics: similar to Synopsys Design Compiler
- Hierarchical Intermediate Representation (HTGs)
  - Retains structural information about design (conditional blocks, loops)
  - Enables efficient and structured application of transformations
- Complete HLS tool: Does Resource Binding & Control Synthesis
- Enables Graphical Visualization of Design description and intermediate results (CDFG, DFG, HTG)
- Benchmarked on large set of multimedia & image processing designs
- SPARK System Release available for download
  - User Manual for running tool and changing synthesis scripts
  - Tutorial for the synthesis of a portion of an MPEG player
  - 100,000+ lines of C++ code
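
A hierarchical representation of the kind described here keeps conditionals and loops as compound nodes that contain their bodies, instead of flattening everything into basic blocks and edges. The Python sketch below is an assumption-laden illustration: the class names `BasicBlock`, `IfNode`, `LoopNode` and the helpers `count_ops` and `depth` are hypothetical and are not taken from SPARK.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class BasicBlock:
    ops: List[str]                       # straight-line operations

@dataclass
class IfNode:
    cond: str                            # compound node: condition plus bodies
    then_body: List["Node"]
    else_body: List["Node"] = field(default_factory=list)

@dataclass
class LoopNode:
    header: str                          # compound node: loop test plus body
    body: List["Node"]

Node = Union[BasicBlock, IfNode, LoopNode]

def count_ops(nodes):
    """Count operations, charging one op for each condition or loop test."""
    total = 0
    for n in nodes:
        if isinstance(n, BasicBlock):
            total += len(n.ops)
        elif isinstance(n, IfNode):
            total += 1 + count_ops(n.then_body) + count_ops(n.else_body)
        else:
            total += 1 + count_ops(n.body)
    return total

def depth(nodes):
    """Maximum nesting depth of conditionals and loops."""
    best = 0
    for n in nodes:
        if isinstance(n, IfNode):
            best = max(best, 1 + max(depth(n.then_body), depth(n.else_body)))
        elif isinstance(n, LoopNode):
            best = max(best, 1 + depth(n.body))
    return best

design = [
    BasicBlock(["x = a + b", "c = a < b"]),
    IfNode("c", [BasicBlock(["d = e - f"])], [BasicBlock(["g = h + i"])]),
    BasicBlock(["j = d * g", "l = e + x"]),
]
looped = [LoopNode("k < n", design)]     # the same body wrapped in a loop
print(count_ops(design), depth(design), count_ops(looped), depth(looped))
```

Because an entire `IfNode` or `LoopNode` is a single object, a coarse-grain transformation can move, duplicate, or unroll it as a unit without first rediscovering its boundaries from control edges.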

PHLS Transformations Organized into Four Groups
1. Pre-Synthesis: Source-to-Source Transformations
   - Loop-Invariant Code Motions, Loop Unrolling, CSE
2. Scheduling: synthesis & compiler transformations
   - Speculative Code Motions, Multi-cycling, Operation Chaining, Loop Shifting (Incremental Loop Pipelining technique)
3. Dynamic: Transformations applied dynamically during scheduling
   - Dynamic CSE & Copy Propagation, Dynamic Branch Balancing
4. Basic Compiler Transformations
   - Copy Propagation, Dead Code Elimination, Constant Propagation

Application of these transformations is guided by Synthesis Scripts
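
The toolbox-plus-script idea can be sketched with three of the basic transformations named above: CSE, copy propagation, and dead code elimination. The Python below is a hypothetical miniature, not SPARK's script syntax or pass implementation. It assumes straight-line code in which every variable is assigned exactly once, and the pass names and `run_script` helper are invented for illustration.

```python
def cse(code, live_out):
    """Replace a repeated (op, a, b) expression with a copy of the first result."""
    seen, out = {}, []
    for dest, op, a, b in code:
        key = (op, a, b)
        if op != "copy" and key in seen:
            out.append((dest, "copy", seen[key], None))
        else:
            seen.setdefault(key, dest)
            out.append((dest, op, a, b))
    return out

def copy_propagation(code, live_out):
    """Rewrite uses of a copied variable to refer to the copy's source."""
    alias, out = {}, []
    for dest, op, a, b in code:
        a, b = alias.get(a, a), alias.get(b, b)
        if op == "copy":
            alias[dest] = a
        out.append((dest, op, a, b))
    return out

def dead_code_elimination(code, live_out):
    """Backward sweep that keeps only operations feeding a live value."""
    live, kept = set(live_out), []
    for dest, op, a, b in reversed(code):
        if dest in live:
            kept.append((dest, op, a, b))
            live.discard(dest)
            live.update(v for v in (a, b) if v is not None)
    return kept[::-1]

PASSES = {"cse": cse, "copy_propagation": copy_propagation,
          "dce": dead_code_elimination}

def run_script(script, code, live_out):
    """Apply the passes in the order the script lists them."""
    for name in script:
        code = PASSES[name](code, live_out)
    return code

code = [("t1", "+", "a", "b"), ("t2", "+", "a", "b"),
        ("t3", "*", "t1", "c"), ("t4", "*", "t2", "c"),
        ("y", "+", "t3", "t4")]
script = ["cse", "copy_propagation", "cse", "copy_propagation", "dce"]
for op in run_script(script, code, {"y"}):
    print(op)
```

Running CSE a second time after copy propagation is what exposes `t4` as a duplicate of `t3`; a script that omitted the repeat would leave that multiplication in place, which shows why pass order is worth controlling from a script.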

Experiments
- We used SPARK to synthesize designs derived from several industrial designs
  - Example: MPEG-1, MPEG-2, GIMP Image Processing software
  - Case Study of Intel Instruction Length Decoder
- Quantified effects of individual transformations on quality of results (QOR)
  - Pre-synthesis transformations
  - Speculative Code Motions, Loop Pipelining
  - Dynamic Transformations
- Scheduling Results
  - Number of States in FSM
  - Cycles on Longest Path through Design
- VHDL: Logic Synthesis
  - Critical Path Length (ns)
  - Unit Area
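
The scheduling and logic synthesis metrics listed here combine into a single delay figure: a rough estimate of the time through the design is the number of cycles on the longest path multiplied by the clock period, taken here to be the critical path length. The Python sketch below uses made-up numbers purely for illustration. They are not results from these slides, and the function names are hypothetical.

```python
def total_delay_ns(cycles_on_longest_path, critical_path_ns):
    """Estimated delay through the design: cycles times clock period."""
    return cycles_on_longest_path * critical_path_ns

def improvement_percent(baseline, optimized):
    """Relative reduction of optimized versus baseline, in percent."""
    return 100.0 * (baseline - optimized) / baseline

# Hypothetical numbers chosen only to show the arithmetic.
baseline = total_delay_ns(cycles_on_longest_path=100, critical_path_ns=10.0)
optimized = total_delay_ns(cycles_on_longest_path=60, critical_path_ns=12.0)

print(improvement_percent(100, 60))              # reduction in cycles
print(improvement_percent(baseline, optimized))  # reduction in delay
```

With these assumed values the cycle count falls by 40% but the delay falls by only 28%, because the clock period grew from 10 ns to 12 ns. Reporting both the scheduling numbers and the post-synthesis critical path makes that gap visible.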

Scheduling & Logic Synthesis Results
[Chart: results as transformations are added. Series: Non-speculative CMs (Within BBs & Across Hier Blocks), + Speculative Code Motions, + Pre-Synthesis Transforms, + Dynamic CSE. Annotations on the chart: 42%, 10%, 36%, 8%, 39%.]
- Overall: % improvement in Delay
- Almost constant Area

Example Design: ILD Block from Intel
- Case Study: A design derived from the Instruction Length Decoder of the Intel Pentium® class of processors
- Characteristics of Microprocessor functional blocks
  - Low Latency: Single or Dual cycle implementation
  - Consist of several small computations
  - Intermix of control and data logic
- Starting with a sequential, multi-cycle specification, we achieved a fully parallel, single-cycle design
- Our toolbox approach enables us to develop a script to synthesize applications from different domains
- Final design looks close to the actual implementation done by Intel

Key Insights from Project
- Coarse-grain and Fine-grain Parallelizing transformations and basic compiler transformations are essential and key to achieving high quality of synthesis results
- Language-level pre-synthesis optimizations are important due to the high level of abstraction at the level of behavioral C
  - Also important for coarse-grain design space exploration
- Although a range of (compiler & synthesis) optimizations exist, they have to be carefully guided by heuristics and scripts to achieve desired results
- Transformations from compilers and parallelizing compilers do not directly translate over to synthesis
  - Need to be radically changed with completely different cost models and guiding principles
  - New parallelizing transformations (or transformations that are not useful for compilers) have to be developed for synthesis

Key Insights from Project
- Designers want script-based control over transformations and passes, similar to Synopsys Design Compiler
- Designer insights can be used to guide transformations, especially coarse-grain code restructuring for design space exploration
- Optimizations that improve schedule length (cycles) do not necessarily improve circuit delay (due to longer critical paths, i.e., clock period)
  - For example, loop unrolling and loop pipelining: they increase the number of operations in the design and hence resource utilization, and in turn the size of multiplexers and controllers increases
- Traditional CDFG and DFG representations used in high-level synthesis are not sufficient for designs with complex control flow
  - A hierarchical intermediate representation (Hierarchical Task Graphs, HTGs) is required for retaining control and structural information for efficient coarse-level optimizations
- Full set of data dependencies (RAW, WAR, WAW) are required for correlating output VHDL and C with input C.
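
The three dependence kinds named in the last point can be computed for straight-line code with a simple pairwise check. The Python sketch below is a conservative illustration under stated assumptions: `dependences` is a hypothetical helper, it compares every earlier operation with every later one, and it does not prune dependences that an intervening write would make redundant.

```python
def dependences(ops):
    """ops: list of (dest, sources) in program order.

    Returns a set of (kind, earlier_index, later_index) triples:
      RAW - later op reads what the earlier op wrote
      WAR - later op overwrites what the earlier op read
      WAW - both ops write the same variable
    """
    deps = set()
    for j, (dest_j, srcs_j) in enumerate(ops):
        for i in range(j):
            dest_i, srcs_i = ops[i]
            if dest_i in srcs_j:
                deps.add(("RAW", i, j))
            if dest_j in srcs_i:
                deps.add(("WAR", i, j))
            if dest_i == dest_j:
                deps.add(("WAW", i, j))
    return deps

program = [
    ("x", ("a", "b")),   # 0: x = a + b
    ("y", ("x", "c")),   # 1: y = x * c
    ("x", ("d", "e")),   # 2: x = d - e
]
for dep in sorted(dependences(program)):
    print(dep)
```

Here operation 2 may not move above operation 1 (WAR on `x`) or above operation 0 (WAW on `x`), even though no value flows between them. Tracking only RAW edges would miss both constraints once a variable name is reused.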

Conclusions
- Parallelizing code transformations enable a new range of HLS transformations
  - Provide the needed improvement in quality of HLS results
  - Possible to be competitive against manually designed circuits.
  - Can enable productivity improvements in microelectronic design
- Built a C-to-VHDL synthesis system with a range of code transformations
  - Platform for applying Coarse and Fine-grain Optimizations
  - Tool-box approach where transformations and heuristics can be developed
  - Enables the designer to find the right synthesis script for different application domains
- Performance improvements of % across a number of designs
- We have shown its effectiveness on an Intel design

SPARK Release
- Available for download
- User Manual
  - Running the tool
  - Customizing the synthesis scripts
- Tutorial
  - Synthesis of Portion of the Motion Compensation algorithm in MPEG-1 player

Thank You

Ongoing Work: Interface Synthesis
- Co-Design Targeting a FPGA Platform
- Developed novel memory mapping algorithm to fit memory elements/application onto FPGA platform
[Figure: co-design flow. Labels: C Input (MPEG-1 Pred Block), Execution Profiling, Manual HW/SW Partitioning, Hardware C Description, Software C Description, SPARK High-Level Synthesis, Software Compiler, Interface Synthesis. FPGA Platform blocks: Processor Core, Memory, I/O, FPGA.]

Future Plans or What is Missing
- Need for ability to specify timing of signals
- Interface with logic synthesis tools to enable better module selection, operator chaining/merging
- Time-constrained synthesis
- Power Analysis of parallelizing optimizations
- More transformations such as loop fusion, range analysis required

Synthesizable C
- ANSI-C front end from Edison Design Group (EDG)
- Features of C not supported for synthesis
  - Pointers
    - However, Arrays and passing by reference are supported
  - Recursive Function Calls
  - Gotos
- Features for which support has not been implemented
  - Multi-dimensional arrays
  - Structs
  - Continue, Breaks
- Hardware component generated for each function
  - A called function is instantiated as a hardware component in calling function

Graph Visualization
[Figure: panels labeled HTG and DFG.]

Scheduling
[Figure: Resource Utilization Graph.]

Example of Complex HTG
- Example of a real design: MPEG-1 pred2 function
- Just for demonstration; you are not expected to read the text
- Multiple nested loops and conditionals

Target Applications
[Table. Columns: Design | # of Ifs | # of Loops | # Non-Empty Basic Blocks | # of Operations. Rows: MPEG-1 pred; MPEG-1 pred; MPEG-2 dp_frame; GIMP tiler.]

Scheduling & Logic Synthesis Results
[Chart: results as transformations are added. Series: Non-speculative CMs (Within BBs & Across Hier Blocks), + Speculative Code Motions, + Pre-Synthesis Transforms, + Dynamic CSE. Annotations on the chart: 14%, 20%, 1%, 33%, 41%, 52%.]
- Overall: % improvement in Delay
- Almost constant Area

Case Study: Intel Instruction Length Decoder
[Figure: a Stream of Instructions enters the Instruction Length Decoder; the Instruction Buffer is marked with First Insn, Second Insn, Third Instruction.]

ILD Synthesis: Resulting Architecture
[Figure: Multi-cycle Sequential Architecture transformed into Single cycle Parallel Architecture. Transformations applied: Speculate Operations, Fully Unroll Loop, Eliminate Loop Index Variable.]
- Our toolbox approach enables us to develop a script to synthesize applications from different domains
- Final design looks close to the actual implementation done by Intel
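
The move from a sequential loop to a parallel structure can be sketched on a toy length decoder. Everything specific in the Python below is an assumption made for illustration: `length_of` is an invented encoding rule, not the Intel instruction format, and `ild_parallel` only mimics the effect of speculation plus full unrolling. It is not the architecture SPARK produced.

```python
def length_of(byte):
    """Hypothetical rule: the low two bits encode a length of 1 to 4 bytes."""
    return (byte & 0x3) + 1

def ild_sequential(buf):
    """Loop form: each iteration must wait for the previous length."""
    starts, pos = [], 0
    while pos < len(buf):
        starts.append(pos)
        pos += length_of(buf[pos])
    return starts

def ild_parallel(buf):
    """Speculated and unrolled form of the same computation."""
    n = len(buf)
    # Speculation: decode a length at every byte position, whether or not
    # an instruction really starts there. These are mutually independent.
    lengths = [length_of(b) for b in buf]
    # Unrolled marking: position k starts an instruction if it is position 0
    # or if an earlier start p ends exactly at k. No running index is carried.
    is_start = [False] * n
    for k in range(n):
        is_start[k] = (k == 0) or any(
            is_start[p] and p + lengths[p] == k for p in range(k))
    return [k for k in range(n) if is_start[k]]

buf = bytes([0x01, 0xAA, 0x02, 0x00, 0x00, 0x00, 0x03, 0x09, 0x09, 0x09])
print(ild_sequential(buf), ild_parallel(buf))
```

In the second version the expensive step, decoding a length, no longer sits on the loop-carried path; only the cheap start-marking logic chains from one position to the next. In hardware the `for k` loop would be fully unrolled into combinational logic for a fixed buffer size.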