
SPARK: A C-to-VHDL Parallelizing High-Level Synthesis Framework
Center for Embedded Computer Systems, University of California, Irvine and San Diego
Sumit Gupta
Rajesh Gupta, Nikil Dutt, Alex Nicolau
Supported by Semiconductor Research Corporation & Intel Inc.

System-Level Synthesis
- System-level model -> task analysis -> HW/SW partitioning onto ASICs, processor cores, memories, FPGAs, and I/O
- Hardware behavioral description -> high-level synthesis
- Software behavioral description -> software compiler

High-Level Synthesis
- Transform behavioral descriptions to RTL/gate level: from C to CDFG to architecture (control, datapath, ALU, memory)
- Example input:
    x = a + b;
    c = a < b;
    if (c) then d = e - f;
    else     g = h + i;
    j = d x g;
    l = e + x;
- Problem #1: poor quality of HLS results beyond straight-line behavioral descriptions
- Problem #2: poor or no controllability of the HLS results

Outline
- Motivation and Background
- Our Approach to Parallelizing High-Level Synthesis
- Code Transformation Techniques for PHLS
  - Parallelizing Transformations
  - Dynamic Transformations
- The PHLS Framework and Experimental Results
  - Multimedia and Image Processing Applications
  - Case Study: Intel Instruction Length Decoder
- Conclusions and Future Work
- Compilation for Coarse-Grain Reconfigurable Architectures

High-Level Synthesis
- Well-researched area, from the early 1980s; renewed interest due to new system-level design methodologies
- A large number of synthesis optimizations have been proposed, either at the operation level (algebraic transformations on DSP codes) or at the logic level (don't-care based control optimizations)
- In contrast, compiler transformations operate at both the operation level (fine-grain) and the source level (coarse-grain)
- Parallelizing compiler transformations have different optimization objectives and cost models than HLS
- Our aim: develop synthesis and parallelizing compiler transformations that are "useful" for HLS
  - Beyond scheduling results: in circuit area and delay
  - For large designs with complex control flow (nested conditionals and loops)

Our Approach: Parallelizing HLS (PHLS)
Flow: C input -> original CDFG -> source-level compiler transformations -> optimized CDFG -> scheduling (compiler and dynamic transformations) -> scheduling & binding -> VHDL output
- Optimizing and parallelizing compiler transformations applied at the source level (pre-synthesis) and during scheduling
- Source-level code refinement using pre-synthesis transformations
- Code restructuring by speculative code motions; operation replication to improve concurrency
- Dynamic transformations: exploit new opportunities during scheduling

PHLS Transformations Organized into Four Groups
1. Pre-synthesis: loop-invariant code motions, loop unrolling, CSE
2. Scheduling: speculative code motions, multi-cycling, operation chaining, loop pipelining
3. Dynamic: transformations applied dynamically during scheduling: dynamic CSE, dynamic copy propagation, dynamic branch balancing
4. Basic compiler transformations: copy propagation, dead code elimination
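
To make the pre-synthesis group concrete, below is a minimal C sketch of loop-invariant code motion and CSE. The function and variable names are hypothetical, and SPARK applies these transformations on its intermediate representation; this only illustrates the source-level effect.

    /* Before: dx * dx is loop-invariant, and (a + b) is a common
       sub-expression computed twice per iteration. */
    int before(const int *v, int n, int dx, int a, int b) {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += v[i] * (dx * dx) + (a + b) - (a + b) / 2;
        return sum;
    }

    /* After loop-invariant code motion and CSE: each expression is
       computed once, leaving fewer operations for the scheduler. */
    int after(const int *v, int n, int dx, int a, int b) {
        int dx2 = dx * dx;   /* hoisted loop-invariant expression */
        int ab  = a + b;     /* shared common sub-expression      */
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += v[i] * dx2 + ab - ab / 2;
        return sum;
    }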

Speculative Code Motions
- Operation movement to reduce the impact of programming style on the quality of HLS results
- Speculation: move an operation above the conditional that guards it
- Reverse speculation: move an operation from before a conditional into its branches
- Conditional speculation: duplicate an operation into the branches of a conditional
- Code motions also operate across hierarchical blocks
- Early condition execution: evaluates conditions as soon as possible
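
As an illustration only (hypothetical names; the real transformation operates on the CDFG during scheduling), speculation turns a value computed under a condition into one computed unconditionally ahead of it, so the conditional reduces to a selection:

    /* Before: each branch computes d only after the condition c is
       resolved. */
    int no_speculation(int a, int b, int c) {
        int d;
        if (c)
            d = a - b;
        else
            d = a + b;
        return d;
    }

    /* After speculation: both candidate values are computed before
       (in hardware, in parallel with) the comparison, and the branch
       becomes a simple selection -- effectively a multiplexer. */
    int with_speculation(int a, int b, int c) {
        int d_true  = a - b;   /* speculated out of the true branch  */
        int d_false = a + b;   /* speculated out of the false branch */
        return c ? d_true : d_false;
    }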

Dynamic Transformations
- Called "dynamic" because they are applied during scheduling (versus a pass before/after scheduling)
- Dynamic branch balancing: increases the scope of code motions and reduces the impact of programming style on HLS results
- Dynamic CSE and dynamic copy propagation: exploit the operation movement and duplication caused by speculative code motions, creating new opportunities to apply these transformations and reducing the number of operations

Dynamic Branch Balancing
(Diagram: original vs. scheduled design for an if-node with blocks BB0-BB4; additions a and b in one branch, subtractions c and d in the other, subtraction e after the join, scheduling steps S0-S3 under a fixed resource allocation.)
- After scheduling, the conditional is unbalanced: one branch needs more scheduling steps than the other and forms the longest path through the design

Insert New Scheduling Step in Shorter Branch
(Diagram: the same design after balancing; a new step holding a duplicated copy of operation e is inserted into the shorter branch.)
- Dynamic branch balancing inserts new scheduling steps in the shorter branch
- This enables conditional speculation (here, of operation e into both branches), which leads to further code compaction
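
A hypothetical C-level view of the same idea (the actual transformation operates on scheduling steps in the CDFG): balancing the shorter branch creates a slot into which a later operation can be conditionally speculated.

    /* Before: the true branch needs two scheduling steps, the false
       branch only one, and e = x - y waits until after the join. */
    int unbalanced(int p, int a, int b, int c, int x, int y) {
        int t;
        if (p) { t = a + b; t = t + c; }  /* two steps  */
        else   { t = a - b; }             /* one step   */
        int e = x - y;                    /* after join */
        return t + e;
    }

    /* After dynamic branch balancing: a new step is inserted in the
       shorter branch, and e = x - y is conditionally speculated into
       both branches, shortening the path through the join. */
    int balanced(int p, int a, int b, int c, int x, int y) {
        int t, e;
        if (p) { t = a + b; t = t + c; e = x - y; }
        else   { t = a - b; e = x - y; }  /* fills the new step */
        return t + e;
    }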

Dynamic CSE: Going beyond Traditional CSE
C description:
    a = b + c;
    cd = b < c;
    if (cd)
        d = b + c;
    else
        e = g + h;
In the HTG representation, a = b + c sits in BB1, d = b + c in the true branch (BB2), and e = g + h in the false branch (BB3); after CSE, d = b + c becomes d = a.
- We use the notion of dominance of basic blocks: basic block BBi dominates BBj if all control paths from the initial basic block of the design graph leading to BBj go through BBi
- We can eliminate an operation opj in BBj using the common expression in opi if BBi dominates BBj
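
The slides do not show how the dominance test itself is implemented; one standard way to phrase it, sketched here with a hypothetical minimal CFG type, is that BBi dominates BBj exactly when BBj becomes unreachable from the entry block once BBi is removed.

    #include <stdbool.h>

    #define MAX_BBS 64

    /* Hypothetical minimal CFG: each basic block lists its successors
       by index. */
    typedef struct {
        int succ[4];
        int num_succ;
    } BasicBlock;

    /* Depth-first search from 'cur' that refuses to enter 'skip'. */
    static void dfs(const BasicBlock *cfg, int cur, int skip, bool *seen) {
        if (cur == skip || seen[cur])
            return;
        seen[cur] = true;
        for (int k = 0; k < cfg[cur].num_succ; k++)
            dfs(cfg, cfg[cur].succ[k], skip, seen);
    }

    /* BBi dominates BBj iff every control path from the entry block
       to BBj goes through BBi, i.e. BBj is unreachable when BBi is
       skipped. */
    bool dominates(const BasicBlock *cfg, int entry, int i, int j) {
        bool seen[MAX_BBS] = { false };
        if (i == j)
            return true;   /* a block dominates itself */
        dfs(cfg, entry, i, seen);
        return !seen[j];
    }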

New Opportunities for "Dynamic" CSE Due to Code Motions
- In the original design, a = b + c is in BB2 and d = b + c in BB6; CSE is not possible since BB2 does not dominate BB6
- The scheduler decides to speculate the computation as dcse = b + c into BB0; CSE is possible now since BB0 dominates BB6, giving a = dcse and d = dcse
- Rule: if the scheduler moves or duplicates an operation op, apply CSE on the remaining operations using op
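
Seen at the source level (hypothetical variable names apart from dcse, which follows the slide), the speculated copy lands in a block that dominates both original computations, which is what lets dynamic CSE fire:

    /* Before scheduling: b + c occurs in two conditionals, and the
       block holding the first (BB2) does not dominate the block
       holding the second (BB6), so classical CSE cannot merge them. */
    int before(int p, int q, int b, int c) {
        int a = 0, d = 0;
        if (p) a = b + c;   /* BB2 */
        if (q) d = b + c;   /* BB6 */
        return a + d;
    }

    /* After the scheduler speculates the computation into the common
       ancestor BB0, that block dominates both uses, and dynamic CSE
       replaces both computations with the speculated result. */
    int after(int p, int q, int b, int c) {
        int dcse = b + c;   /* speculated into BB0 */
        int a = 0, d = 0;
        if (p) a = dcse;
        if (q) d = dcse;
        return a + d;
    }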

Conditional Speculation & Dynamic CSE
- The scheduler decides to conditionally speculate a = b + c as a' = b + c into both branches BB1 and BB2
- We use the notion of dominance by groups of basic blocks: all control paths leading up to BB8 come from either BB1 or BB2, so BB1 and BB2 together dominate BB8
- Hence the later operation d = b + c in BB8 can be eliminated as d = a'
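
A small hypothetical C rendering of group dominance (the slide's a' is written a1 for C syntax): once conditional speculation has placed a copy of b + c on every path, the later computation can reuse it even though no single copy dominates the use.

    int group_dominance(int p, int b, int c) {
        int a, a1;
        if (p) { a1 = b + c; a = a1; }  /* copy in BB1 */
        else   { a1 = b + c; a = a1; }  /* copy in BB2 */
        /* Every path to this point passes through BB1 or BB2, so BB1
           and BB2 together dominate it: d = b + c becomes d = a1. */
        int d = a1;
        return a + d;
    }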

Loop Shifting: An Incremental Loop Pipelining Technique
(Diagram: a loop node with body blocks BB1-BB3 and loop exit BB4; operations a and c from the start of the loop body are shifted across the back-edge to the end of the body, with one copy placed in the loop head BB0.)
- Shifting is followed by compaction: the shifted operations are scheduled in parallel with the remaining body operations b and d
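
At the source level the effect looks roughly like the hypothetical sketch below: one operation is shifted across the loop's back-edge (with a copy in the loop header), after which compaction can schedule it in parallel with the rest of the body.

    /* Before shifting: op A (the increment) and op C (the subtraction)
       execute in sequence within each iteration. */
    void before_shift(int *x, int *y, int n) {
        for (int i = 0; i < n; i++) {
            x[i] = x[i] + 1;   /* op A */
            y[i] = x[i] - 2;   /* op C, depends on op A */
        }
    }

    /* After one shift: op A for iteration i+1 runs alongside op C for
       iteration i, exposing parallelism without the code growth of
       full unrolling. Behavior is unchanged. */
    void after_shift(int *x, int *y, int n) {
        if (n <= 0)
            return;
        x[0] = x[0] + 1;                  /* shifted copy in loop head */
        for (int i = 0; i < n; i++) {
            y[i] = x[i] - 2;              /* op C, iteration i   */
            if (i + 1 < n)
                x[i + 1] = x[i + 1] + 1;  /* op A, iteration i+1 */
        }
    }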

SPARK High-Level Synthesis Framework

SPARK Parallelizing HLS Framework
- C input and synthesizable RTL VHDL output
- Tool-box of transformations and heuristics; each can be developed independently of the other
- Script-based control over transformations & heuristics
- Hierarchical intermediate representation (HTGs): retains structural information about the design (conditional blocks, loops) and enables efficient, structured application of transformations
- Complete HLS tool: does binding, control synthesis, and backend VHDL generation
- Interconnect-minimizing resource binding
- Enables graphical visualization of the design description and intermediate results
- 100,000+ lines of C++ code

Synthesizable C
- ANSI-C front end from Edison Design Group (EDG)
- Features of C not supported for synthesis: pointers (however, arrays and passing by reference are supported), recursive function calls, gotos
- Features for which support has not been implemented: multi-dimensional arrays, structs, continue, break
- A hardware component is generated for each function; a called function is instantiated as a hardware component in the calling function
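
A small hypothetical example of an input in this subset: no pointer arithmetic, no recursion, no gotos, with arrays and by-reference parameters allowed, and each function mapping to its own hardware component.

    /* Becomes one hardware component. The array and the by-reference
       result parameter are within the supported subset. */
    void accumulate(int v[8], int *sum) {
        *sum = 0;
        for (int i = 0; i < 8; i++)
            *sum = *sum + v[i];
    }

    /* Top-level component: the call to accumulate() is instantiated
       as a sub-component inside it. */
    void top(int v[8], int *out) {
        int s;
        accumulate(v, &s);
        *out = s * 2;
    }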

Graph Visualization
(Screenshots: HTG and DFG views of a design.)

Scheduling
(Screenshot: resource utilization graph of a scheduled design.)

Example of a Complex HTG
- Example of a real design: the MPEG-1 pred2 function
- Multiple nested loops and conditionals
- Shown for demonstration only; the text is not meant to be readable

Experiments
- Results presented here for: pre-synthesis transformations, speculative code motions, dynamic CSE
- We used SPARK to synthesize designs derived from several industrial designs: MPEG-1, MPEG-2, and the GIMP image processing software, plus a case study of the Intel Instruction Length Decoder
- Scheduling results: number of states in the FSM, cycles on the longest path through the design
- VHDL logic synthesis results: critical path length (ns), unit area

Target Applications
(Table: for each design, the number of ifs, loops, non-empty basic blocks, and operations.)
- Designs: MPEG-1 pred1, MPEG-1 pred2, MPEG-2 dp_frame, GIMP tiler

Scheduling & Logic Synthesis Results
(Bar charts: results for non-speculative code motions (within basic blocks & across hierarchical blocks), then cumulatively adding speculative code motions, pre-synthesis transforms, and dynamic CSE.)
- Overall: improvement in delay, with almost constant area

Case Study: Intel Instruction Length Decoder
(Diagram: a stream of instructions from the instruction buffer feeds the instruction length decoder, which marks the first, second, and third instructions.)

Example Design: ILD Block from Intel
- Case study: a design derived from the Instruction Length Decoder of the Intel Pentium® class of processors
- Decodes the length of instructions streaming from memory; has to look at up to 4 bytes at a time
- Has to execute in one cycle and decode about 64 bytes of instructions
- Characteristics of microprocessor functional blocks: low latency (single- or dual-cycle implementation), several small computations, an intermix of control and data logic

ILD Synthesis: Resulting Architecture
- Transformations applied: speculate operations, fully unroll the loop, eliminate the loop index variable
- Result: from a multi-cycle sequential architecture to a single-cycle parallel architecture
- Our toolbox approach enables us to develop a script to synthesize applications from different domains
- The final design looks close to the actual implementation done by Intel

Conclusions
- Parallelizing code transformations enable a new range of HLS transformations and provide the needed improvement in the quality of HLS results
  - Possible to be competitive against manually designed circuits; can enable productivity improvements in microelectronic design
- Built a synthesis system with a range of code transformations
  - A platform for applying coarse-grain and fine-grain optimizations
  - Tool-box approach in which transformations and heuristics can be developed independently, enabling the designer to find the right synthesis script for different application domains
- Substantial performance improvements across a number of designs; we have shown the system's effectiveness on an Intel design

Acknowledgements
- Advisors: Professors Rajesh Gupta, Nikil Dutt, Alex Nicolau
- Contributors to the SPARK framework: Nick Savoiu, Mehrdad Reshadi, Sunwoo Kim
- Intel Strategic CAD Labs (SCL): Timothy Kam, Mike Kishinevsky
- Supported by Semiconductor Research Corporation and Intel SCL

SPARK Release
- Binaries for Linux, Solaris, and Windows are available at:
- Includes a user manual and a tutorial based on an MPEG-1 player

Compilation for Coarse-Grain Reconfigurable Architectures

Use of Coarse-Grain Reconfigurable Fabrics as Co-Processors
(Diagram: a system with CPU, DSP, DMA, cache, I/O, and a reconfigurable co-processor.)
- Focus of this work: build a compiler framework to map applications to coarse-grain reconfigurable architectures
- These architectures consist of processing elements (PEs) connected together in a network

Different Topology Traversals
(Figure: (a) zig-zag, (b) reverse-S, and (c) spiral traversals of the PE array.)
- Spiral traversal performs best because each PE traversed is adjacent to the previous PE, and PEs with more neighbors/connections are traversed first

Ongoing Work
- Different ways to traverse the topology during scheduling
- Effect of different connection delay models: as technology improves, processing elements (PEs) will be much faster than the connections between them
- Effects of different PE configurations: different numbers of functional units per PE, different cycle times for adds versus multiplies
- Taking memory bandwidth & registers into consideration during scheduling

Thank You

Case Study: Implementation of the MPEG-1 Prediction Block on an FPGA Platform
- Developed a novel memory mapping algorithm to fit the memory elements of the application onto the FPGA platform
- In collaboration with Manev Luthra
(Diagram: the C input for the MPEG-1 pred block goes through execution profiling and manual HW/SW partitioning; the software C description goes to a software compiler targeting the processor core, and the hardware C description goes through SPARK high-level synthesis onto the FPGA, alongside memory and I/O.)

Recent Related Work
- Mostly related to code scheduling in the presence of conditionals: Condition Vector List Scheduling [Wakabayashi 89], Symbolic Scheduling [Radivojevic 96], WaveSched Scheduler [Lakshminarayana 98], Basic Block Control Graph Scheduling [Santos 99]
- Limitations: arbitrary nesting of conditionals and loops not handled or handled poorly; ad hoc optimizations applied in isolation; limited or no analysis of logic and control costs; not clear whether an optimization has positive impact beyond scheduling

Basic Instruction Length Decoder: Initial Description
(Diagram: bytes 1-4 each produce a length contribution; "need byte 2?", "need byte 3?", and "need byte 4?" decisions chain the contributions sequentially, and their sum gives the total length of the instruction.)

Instruction Length Decoder: Decoding the 2nd Instruction
- After decoding the length of an instruction, start looking from the next byte and again examine up to 4 bytes to determine the length of the next instruction
(Diagram: with the first instruction decoded, the same length-contribution logic is applied starting at byte 3, over bytes 3-6.)

Instruction Length Decoder: Parallelized Description
- Speculatively calculate the length contributions of all 4 bytes at a time
- Determine the actual total length of the instruction based on this data

ILD: Extracting Further Parallelism
- Speculatively calculate the length of instructions assuming a new instruction starts at each byte; do this calculation for all bytes in parallel
- Then traverse from the 1st byte to the last, determining the lengths of the instructions starting from the 1st till the last, and discard the unused calculations
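
A sketch of that decode strategy in C, under stated assumptions: insn_length() is a toy stand-in for the real per-byte length logic (the actual ILD examines up to four bytes and runs the speculative step as parallel hardware rather than a loop), and boundary handling at the end of the buffer is omitted.

    #define NBYTES 64

    /* Toy stand-in: here an instruction is 1 byte, or 2 if the top
       bit of its first byte is set. The real logic inspects up to
       4 bytes. */
    static int insn_length(const unsigned char *p) {
        return (p[0] & 0x80) ? 2 : 1;
    }

    /* Returns the number of instructions found; start[] receives the
       byte offset at which each instruction begins. */
    int decode_lengths(const unsigned char buf[NBYTES], int start[NBYTES]) {
        int spec_len[NBYTES];

        /* Step 1 (parallel in hardware): speculatively compute a
           length for every byte, as if an instruction started there. */
        for (int i = 0; i < NBYTES; i++)
            spec_len[i] = insn_length(&buf[i]);

        /* Step 2: traverse from the first byte to the last, keeping
           only the lengths at true instruction starts; the remaining
           speculative calculations are discarded. */
        int n = 0;
        for (int i = 0; i < NBYTES; i += spec_len[i])
            start[n++] = i;
        return n;
    }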

Initial: Multi-Cycle Sequential Architecture
(Diagram: the initial sequential datapath, stepping through the "need byte 2/3/4?" decisions and the length contributions of bytes 1-4 over multiple cycles.)

How Code Motions are Employed by the Scheduler
(Diagram: an HTG with blocks BB0-BB9; additions feeding variables a, b, c, and d are first speculated upward across the HTG, and then conditionally speculated, i.e. duplicated, into conditional branches.)

Integrating Transformations into the Scheduler
(Diagram: scheduler components: the IR Walker; the Candidate Fetcher, whose Candidate Walker and Candidate Validater produce the available operations; the Candidate Chooser, which determines the code motions required to schedule an operation; the Candidate Mover, which applies the speculative code motions; and the Dynamic Transforms. Branch balancing is invoked both during traversal and during code motion.)

Architecture of the PHLS Scheduler
- IR Walker: traverses the design to find the next basic block to schedule
- Candidate Fetcher: traverses the design to find candidate operations to schedule
- Candidate Chooser: calculates the cost of the operations and chooses the operation with the lowest cost for scheduling
- Candidate Mover: moves, duplicates, and schedules the chosen operation
- Dynamic Transforms: dynamically apply transformations such as CSE on the remaining candidate operations using the scheduled operation
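
Schematically, the loop tying these components together might look like the C sketch below; every type and helper here is a hypothetical stand-in for illustration, not SPARK's actual API.

    typedef struct BasicBlock BasicBlock;
    typedef struct Operation Operation;

    /* Hypothetical component interfaces, one per box in the slide. */
    BasicBlock *walk_next_bb(void);                     /* IR Walker         */
    int  fetch_candidates(BasicBlock *bb,
                          Operation **cands, int max);  /* Candidate Fetcher */
    Operation *choose_cheapest(Operation **c, int n);   /* Candidate Chooser */
    void move_and_schedule(Operation *op,
                           BasicBlock *bb, int step);   /* Candidate Mover   */
    void dynamic_cse(Operation *op,
                     Operation **rest, int n);          /* Dynamic Transforms*/
    int  num_steps(BasicBlock *bb);

    void schedule_design(void) {
        enum { MAX_CANDS = 256 };
        Operation *cands[MAX_CANDS];
        BasicBlock *bb;

        while ((bb = walk_next_bb()) != NULL) {          /* IR Walker */
            for (int step = 0; step < num_steps(bb); step++) {
                int n = fetch_candidates(bb, cands, MAX_CANDS);
                while (n > 0) {
                    /* Pick the operation whose required code motions
                       have the lowest cost, then move/duplicate and
                       schedule it into this step. */
                    Operation *op = choose_cheapest(cands, n);
                    move_and_schedule(op, bb, step);
                    /* Apply dynamic transformations (e.g. CSE) on the
                       remaining candidates using the scheduled op. */
                    dynamic_cse(op, cands, n);
                    n = fetch_candidates(bb, cands, MAX_CANDS);
                }
            }
        }
    }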

Publications
- Dynamic Conditional Branch Balancing during the High-Level Synthesis of Control-Intensive Designs. S. Gupta, N.D. Dutt, R.K. Gupta, A. Nicolau. DATE, March 2003.
- SPARK: A High-Level Synthesis Framework for Applying Parallelizing Compiler Transformations. S. Gupta, N.D. Dutt, R.K. Gupta, A. Nicolau. VLSI Design 2003. Best Paper Award.
- Dynamic Common Sub-Expression Elimination during Scheduling in High-Level Synthesis. S. Gupta, M. Reshadi, N. Savoiu, N.D. Dutt, R.K. Gupta, A. Nicolau. ISSS 2002.
- Coordinated Transformations for High-Level Synthesis of High Performance Microprocessor Blocks. S. Gupta, T. Kam, M. Kishinevsky, S. Rotem, N. Savoiu, N.D. Dutt, R.K. Gupta, A. Nicolau. DAC 2002.
- Conditional Speculation and its Effects on Performance and Area for High-Level Synthesis. S. Gupta, N. Savoiu, N.D. Dutt, R.K. Gupta, A. Nicolau. ISSS 2001.
- Speculation Techniques for High Level Synthesis of Control Intensive Designs. S. Gupta, N. Savoiu, S. Kim, N.D. Dutt, R.K. Gupta, A. Nicolau. DAC 2001.
- Analysis of High-Level Address Code Transformations for Programmable Processors. S. Gupta, M. Miranda, F. Catthoor, R.K. Gupta. DATE 2000.
- Synthesis of Testable RTL Designs using Adaptive Simulated Annealing Algorithm. C.P. Ravikumar, S. Gupta, A. Jajoo. Intl. Conf. on VLSI Design, 1998. Best Student Paper Award.
- Book chapter: ASIC Design. S. Gupta, R.K. Gupta. Chapter 64, The VLSI Handbook, edited by Wai-Kai Chen.
- 2 journal papers and 1 conference paper under submission.