SPARK: A Parallelizing High-Level Synthesis Framework
Sumit Gupta
Rajesh Gupta, Nikil Dutt, Alex Nicolau
Center for Embedded Computer Systems, University of California, Irvine and San Diego
Supported by Semiconductor Research Corporation & Intel Inc
Copyright Sumit Gupta
System Level Synthesis
[Figure: system-level design flow. System Level Model -> Task Analysis -> HW/SW Partitioning; Hardware Behavioral Description -> High Level Synthesis; Software Behavioral Description -> Software Compiler; target components: ASIC, FPGA, Processor Core, Memory, I/O]
High Level Synthesis
- Transform behavioral descriptions to RTL/gate level
- From C to CDFG to Architecture
[Figure: the behavioral code below, its CDFG (an If Node on c with T and F branches), and the resulting architecture (Memory, ALU, Control, Data path)]
    x = a + b;
    c = a < b;
    if (c) then d = e - f;
    else g = h + i;
    j = d * g;
    l = e + x;
- Problem #1: Poor quality of HLS results beyond straight-line behavioral descriptions
- Problem #2: Poor/no controllability of the HLS results
High-level Synthesis
- Well-researched area: from early 1980s
- Renewed interest due to new system level design methodologies
- Large number of synthesis optimizations have been proposed
  - Either operation level: algebraic transformations on DSP codes
  - or logic level: Don't Care based control optimizations
- In contrast, compiler transformations operate at both operation level (fine-grain) and source level (coarse-grain)
- Parallelizing Compiler Transformations
  - Different optimization objectives and cost models than HLS
- Our aim: Develop Synthesis and Parallelizing Compiler Transformations that are "useful" for HLS
  - Beyond scheduling results: in Circuit Area and Delay
  - For large designs with complex control flow (nested conditionals/loops)
Our Approach: Parallelizing HLS (PHLS)
[Figure: C Input -> Original CDFG -> Source-Level Compiler Transformations -> Optimized CDFG -> Scheduling & Binding (with Scheduling, Compiler & Dynamic Transformations) -> VHDL Output]
- Optimizing Compiler and Parallelizing Compiler transformations applied at Source-level (Pre-synthesis) and during Scheduling
  - Source-level code refinement using Pre-synthesis transformations
  - Code Restructuring by Speculative Code Motions
  - Operation replication to improve concurrency
  - Dynamic transformations: exploit new opportunities during scheduling
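
Speculative code motion can be illustrated on a small conditional. The following is a minimal sketch, not SPARK output: the function names `before_speculation` and `after_speculation` are hypothetical, and the rewrite is done by hand. In the second version both branch operations are hoisted above the conditional, so they no longer wait for the comparison and only a selection remains after it.

```c
/* Original: each branch operation waits for the comparison c = a < b. */
int before_speculation(int a, int b, int e, int f, int h, int i)
{
    int c = a < b;
    int r;
    if (c) {
        r = e - f;
    } else {
        r = h + i;
    }
    return r;
}

/* After speculation (hand-written illustration): both operations are
   computed unconditionally into temporaries; only the final selection
   depends on c. */
int after_speculation(int a, int b, int e, int f, int h, int i)
{
    int c = a < b;
    int t1 = e - f;   /* speculated: executed even when c is false */
    int t2 = h + i;   /* speculated: executed even when c is true  */
    return c ? t1 : t2;
}
```

Both functions return the same value for any inputs that do not overflow; the speculated version trades an extra functional unit for a shorter dependence chain.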
SPARK High Level Synthesis Framework
SPARK Parallelizing HLS Framework
- C input and Synthesizable RTL VHDL output
- Tool-box of Transformations and Heuristics
  - Each of these can be developed independently of the other
- Script-based application of transformations, passes, and heuristics: similar to Synopsys Design Compiler
- Hierarchical Intermediate Representation (HTGs)
  - Retains structural information about design (conditional blocks, loops)
  - Enables efficient and structured application of transformations
- Complete HLS tool: Does Resource Binding & Control Synthesis
- Enables Graphical Visualization of Design description and intermediate results (CDFG, DFG, HTG)
- Benchmarked on large set of multimedia & image processing designs
- SPARK System Release available for download
  - User Manual for running tool and changing synthesis scripts
  - Tutorial for the synthesis of a portion of an MPEG player
  - 100,000+ lines of C++ code
PHLS Transformations Organized into Four Groups
- Pre-Synthesis Source-to-Source Transformations
  - Loop-Invariant Code Motions, Loop Unrolling, CSE
- Scheduling: synthesis & compiler transformations
  - Speculative Code Motions, Multi-cycling, Operation Chaining, Loop Shifting (Incremental Loop Pipelining technique)
- Dynamic: Transformations applied dynamically during scheduling
  - Dynamic CSE & Copy Propagation, Dynamic Branch Balancing
- Basic Compiler Transformations
  - Copy Propagation, Dead Code Elimination, Constant Propagation
- Application of these transformations is guided by Synthesis Scripts
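
Two of the pre-synthesis transformations named above, loop-invariant code motion and common subexpression elimination (CSE), can be shown on a small loop. This is a hand-written sketch under assumed names (`before_transform`, `after_transform`); it is not taken from SPARK and does not show how SPARK itself applies them.

```c
/* Before: a * b does not change across iterations, and (x[k] + c)
   is computed twice per iteration. */
void before_transform(const int x[], int y[], int n, int a, int b, int c)
{
    int k;
    for (k = 0; k < n; k++) {
        y[k] = (a * b) + (x[k] + c) * (x[k] + c);
    }
}

/* After: the invariant product is hoisted out of the loop and the
   repeated sum is computed once and reused. */
void after_transform(const int x[], int y[], int n, int a, int b, int c)
{
    int k;
    int inv = a * b;          /* loop-invariant code motion */
    for (k = 0; k < n; k++) {
        int t = x[k] + c;     /* common subexpression elimination */
        y[k] = inv + t * t;
    }
}
```

In a hardware setting the saving is one multiplier use per loop instead of one per iteration, and one adder use fewer per iteration.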
Experiments
- We used SPARK to synthesize designs derived from several industrial designs
  - Example: MPEG-1, MPEG-2, GIMP Image Processing software
  - Case Study of Intel Instruction Length Decoder
- Quantified effects of individual transformations on quality of results (QOR)
  - Pre-synthesis transformations
  - Speculative Code Motions, Loop Pipelining
  - Dynamic Transformations
- Scheduling Results
  - Number of States in FSM
  - Cycles on Longest Path through Design
- VHDL: Logic Synthesis
  - Critical Path Length (ns)
  - Unit Area
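
The scheduling and logic synthesis metrics listed above can be combined into a single delay figure. A minimal sketch, assuming the clock period is set by the critical path length; the function name `total_delay_ns` and the numbers in the usage note are made up for illustration and are not results from the experiments.

```c
/* Delay through the design in ns: cycles on the longest path times the
   clock period, taken here to equal the critical path length.
   Hypothetical helper, not part of SPARK. */
double total_delay_ns(int cycles_longest_path, double critical_path_ns)
{
    return cycles_longest_path * critical_path_ns;
}
```

With these assumptions, a schedule of 10 cycles at 20 ns takes 200 ns, while a schedule of 8 cycles at 27 ns takes 216 ns: fewer cycles, but a longer total delay because the critical path grew.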
Scheduling & Logic Synthesis Results
[Figure: results chart. Series: Non-speculative CMs (Within BBs & Across Hier Blocks); + Speculative Code Motions; + Pre-Synthesis Transforms; + Dynamic CSE. Percentage annotations on the chart: 42%, 10%, 36%, 8%, 39%]
- Overall: % improvement in Delay
- Almost constant Area
Example Design: ILD Block from Intel
- Case Study: A design derived from the Instruction Length Decoder of the Intel Pentium® class of processors
- Characteristics of Microprocessor functional blocks
  - Low Latency: Single or Dual cycle implementation
  - Consist of several small computations
  - Intermix of control and data logic
- Starting with a sequential, multi-cycle specification, we achieved a fully parallel, single-cycle design
- Our toolbox approach enables us to develop a script to synthesize applications from different domains
- Final design looks close to the actual implementation done by Intel
Key Insights from Project
- Coarse-grain and fine-grain parallelizing transformations and basic compiler transformations are essential to achieving high quality of synthesis results
- Language-level pre-synthesis optimizations are important due to the high level of abstraction of behavioral C
  - Also important for coarse-grain design space exploration
- Although a range of (compiler & synthesis) optimizations exist, they have to be carefully guided by heuristics and scripts to achieve desired results
- Transformations from compilers and parallelizing compilers do not directly translate over to synthesis
  - Need to be radically changed with completely different cost models and guiding principles
  - New parallelizing transformations (or transformations that are not useful for compilers) have to be developed for synthesis
Key Insights from Project
- Designers want script-based control over transformations and passes, similar to Synopsys Design Compiler
- Designer insights can be used to guide transformations, especially coarse-grain code restructuring for design space exploration
- Optimizations that improve schedule length (cycles) do not necessarily improve circuit delay, because they can lengthen the critical path (i.e., the clock period)
  - For example, loop unrolling and loop pipelining increase the number of operations in the design and hence resource utilization; in turn, the size of multiplexers and controllers increases
- Traditional CDFG and DFG representations used in high-level synthesis are not sufficient for designs with complex control flow
  - A hierarchical intermediate representation (Hierarchical Task Graphs, HTGs) is required for retaining control and structural information for efficient coarse-level optimizations
- The full set of data dependencies (RAW, WAR, WAW) is required for correlating output VHDL and C with input C
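
The three dependence types named in the last point can be read off a three-statement fragment. This is an illustrative sketch; the function name `dependence_demo` is hypothetical.

```c
/* Each statement is annotated with the dependence that fixes its
   position relative to an earlier statement. */
int dependence_demo(int a, int b)
{
    int x, y;
    x = a + b;   /* S1: writes x */
    y = x * 2;   /* S2: reads x, so RAW (true) dependence on S1 */
    x = b - 1;   /* S3: writes x again, so WAR (anti) dependence on S2
                        and WAW (output) dependence on S1 */
    return x + y;
}
```

Moving S3 above S2 would change the value S2 reads, and moving S3 above S1 would change the final value of x, so a tool that reorders or renames operations has to track all three kinds to keep the generated code traceable to the input.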
Conclusions
- Parallelizing code transformations enable a new range of HLS transformations
  - Provide the needed improvement in quality of HLS results
  - Possible to be competitive against manually designed circuits
  - Can enable productivity improvements in microelectronic design
- Built a C-to-VHDL synthesis system with a range of code transformations
  - Platform for applying Coarse and Fine-grain Optimizations
  - Tool-box approach where transformations and heuristics can be developed
  - Enables the designer to find the right synthesis script for different application domains
- Performance improvements of % across a number of designs
- We have shown its effectiveness on an Intel design
SPARK Release
- Available for download
- User Manual
  - Running the tool
  - Customizing the synthesis scripts
- Tutorial
  - Synthesis of a portion of the Motion Compensation algorithm in the MPEG-1 player
Thank You
Ongoing Work: Interface Synthesis
- Co-Design Targeting an FPGA Platform
- Developed novel memory mapping algorithm to fit memory elements/application onto FPGA platform
[Figure: C Input (MPEG-1 Pred Block) -> Execution Profiling -> Manual HW/SW Partitioning; Hardware C Description -> SPARK High-Level Synthesis -> FPGA; Software C Description -> Software Compiler -> Processor Core; FPGA Platform with Memory and I/O; Interface Synthesis]
Future Plans or What is Missing
- Need for ability to specify timing of signals
- Interface with logic synthesis tools to enable better module selection, operator chaining/merging
- Time-constrained synthesis
- Power Analysis of parallelizing optimizations
- More transformations such as loop fusion, range analysis required
Synthesizable C
- ANSI-C front end from Edison Design Group (EDG)
- Features of C not supported for synthesis
  - Pointers
    - However, Arrays and passing by reference are supported
  - Recursive Function Calls
  - Gotos
- Features for which support has not been implemented
  - Multi-dimensional arrays
  - Structs
  - Continue, Breaks
- Hardware component generated for each function
  - A called function is instantiated as a hardware component in calling function
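
As an illustration of the restrictions above, the fragment below stays within them: one-dimensional arrays only, and no explicit pointer operations, recursion, goto, structs, break, or continue. It is a sketch written for this purpose, with hypothetical names (`sad8`, `sad_pair`), and has not been checked against the SPARK front end.

```c
#define N 8

/* Sum of absolute differences over two fixed-size, one-dimensional
   arrays. Avoids every construct the slide lists as unsupported or
   not yet implemented. */
int sad8(int cur[N], int ref[N])
{
    int k;
    int sum = 0;
    for (k = 0; k < N; k++) {
        int d = cur[k] - ref[k];
        if (d < 0) {
            d = -d;
        }
        sum = sum + d;
    }
    return sum;
}

/* Calls sad8 twice. Per the slide, a called function is instantiated
   as a hardware component inside the calling function. */
int sad_pair(int a[N], int b[N], int c[N])
{
    return sad8(a, b) + sad8(a, c);
}
```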
Graph Visualization
[Figure: HTG and DFG]
Resource Utilization Graph Scheduling
Example of Complex HTG
- Example of a real design: MPEG-1 pred2 function
  - Just for demonstration; you are not expected to read the text
- Multiple nested loops and conditionals
Target Applications
[Table: columns are Design, # of Ifs, # of Loops, # of Non-Empty Basic Blocks, # of Operations; rows are MPEG-1 pred, MPEG-1 pred, MPEG-2 dp_frame, GIMP tiler]
Scheduling & Logic Synthesis Results
[Figure: results chart. Series: Non-speculative CMs (Within BBs & Across Hier Blocks); + Speculative Code Motions; + Pre-Synthesis Transforms; + Dynamic CSE. Percentage annotations on the chart: 14%, 20%, 1%, 33%, 41%, 52%]
- Overall: % improvement in Delay
- Almost constant Area
Case Study: Intel Instruction Length Decoder
[Figure: Stream of Instructions -> Instruction Length Decoder -> Instruction Buffer (First Insn, Second Insn, Third Instruction)]
ILD Synthesis: Resulting Architecture
[Figure: Multi-cycle Sequential Architecture -> (Speculate Operations, Fully Unroll Loop, Eliminate Loop Index Variable) -> Single cycle Parallel Architecture]
- Our toolbox approach enables us to develop a script to synthesize applications from different domains
- Final design looks close to the actual implementation done by Intel
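
The three steps named on this slide (speculate operations, fully unroll the loop, eliminate the loop index variable) can be sketched on a toy length decoder. Everything here is an assumption made for illustration: the 4-byte buffer, the names `mark_sequential` and `mark_parallel`, and in particular `toy_length`, which is an invented rule and not the Intel instruction length calculation.

```c
#define BUF 4

/* Hypothetical length rule: 1 to 4 bytes, from the low two bits. */
static int toy_length(unsigned char byte)
{
    return (byte & 3) + 1;
}

/* Sequential form: one byte per iteration; the start of the next
   instruction is known only after the previous length is decoded. */
void mark_sequential(const unsigned char buf[BUF], int mark[BUF])
{
    int i;
    int next = 0;
    for (i = 0; i < BUF; i++) {
        mark[i] = 0;
        if (i == next) {
            mark[i] = 1;
            next = next + toy_length(buf[i]);
        }
    }
}

/* Parallel form: lengths are computed speculatively for every byte
   that could start an instruction, the loop is fully unrolled, and
   the loop index is replaced by constants. */
void mark_parallel(const unsigned char buf[BUF], int mark[BUF])
{
    int len0 = toy_length(buf[0]);   /* speculated, independent */
    int len1 = toy_length(buf[1]);   /* speculated, independent */
    int len2 = toy_length(buf[2]);   /* speculated, independent */
    int next1, next2;

    mark[0] = 1;
    mark[1] = (len0 == 1);
    next1 = mark[1] ? 1 + len1 : len0;
    mark[2] = (next1 == 2);
    next2 = mark[2] ? 2 + len2 : next1;
    mark[3] = (next2 == 3);
}
```

The two functions produce the same marks for every input; what remains sequential in the second form is only the short chain of comparisons and selections, which is the kind of logic that can fit in a single cycle.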