
1 Coordinated Coarse-Grain and Fine-Grain Optimizations for High-Level Synthesis
Sumit Gupta
Center for Embedded Computer Systems, University of California, Irvine
http://www.cecs.uci.edu/~spark
Supported by the Semiconductor Research Corporation

2 High-Level Synthesis
Transform behavioral descriptions to RTL/gate level: from C to CDFG to architecture.
[Figure: the C fragment below as a CDFG (If node with T/F branches) feeding a datapath with memory, ALU, and control.]
x = a + b; c = a < b;
if (c) then d = e - f;
else g = h + i;
j = d x g; l = e + x;

3 High-Level Synthesis
[Same figure as the previous slide, annotated with two problems.]
Problem #1: Poor quality of HLS results beyond straight-line behavioral descriptions.
Problem #2: Poor/no controllability of the HLS results.

4 Outline
- Motivation and Background
- Our Approach to Parallelizing High-Level Synthesis
- Code Transformation Techniques for PHLS
  - Parallelizing Transformations
  - Dynamic Transformations
- The PHLS Framework and Experimental Results
  - Multimedia and Image Processing Applications
  - Case Study: Intel Instruction Length Decoder
- Contributions
- Future Work

5 High-Level Synthesis
- Well-researched area since the early 1980s.
- Recently, the level of design entry has moved up from schematic entry to coding in HDLs (VHDL, Verilog) and in C/C++.
- A large number of optimizations have been proposed over the years.
- Many HLS optimizations are either at the operation level (e.g., algebraic transformations on DSP codes) or at the logic level (e.g., don't-care-based control optimizations).
- In contrast, compiler transformations operate at both the operation level (fine-grain) and the source level (coarse-grain).
- Few (if any) compiler optimizations have been applied to HLS.
=> Quality of synthesis results is severely affected by complex control flow, characterized by nested ifs and loops.

6 Recent HLS Optimizations
- Mostly related to code scheduling in the presence of conditionals:
  - Condition Vector List Scheduling [Wakabayashi 89]
  - Symbolic Scheduling [Radivojevic 96]
  - WaveSched Scheduler [Lakshminarayana 98]
  - Basic Block Control Graph Scheduling [Santos 99]
- Limitations:
  - Arbitrary nesting of conditionals and loops not handled, or handled poorly
  - Ad hoc optimizations: optimizations applied in isolation
  - Limited/no analysis of logic and control costs
  - Not clear whether an optimization has a positive impact beyond scheduling

7 Focus of this Work
Target applications:
- Descriptions with complex and nested conditionals and loops:
  - Multimedia and image processing applications with a mix of data and control operations
  - Computationally expensive microprocessor blocks: resource rich, tightly packed into a few cycles
Objectives:
- Improve quality of overall HLS results (circuit delay and area)
- Find a balance between parallelism and hardware costs under resource constraints
- Improve controllability of the HLS solutions
- Analyze control and area costs of transformations
Approach:
- Explore and analyze compiler and parallelizing-compiler transformations that are useful for high-level synthesis
- Develop scheduling and control generation algorithms that use these (and new) transformations to solve Problems 1 and 2
- Build an experimental PHLS framework to enable high-level design exploration for HLS

8 Focus of this Work (continued)
[Same content as the previous slide, recalling the two problems being solved:]
Problem #1: Poor quality of HLS results beyond straight-line behavioral descriptions.
Problem #2: Poor/no controllability of the HLS results.

9 Important Parallelizing Compiler Techniques
- Transformations to exploit instruction-level parallelism:
  - Speculative code motions: attempt to move operations out of, and sometimes even duplicate into, conditional blocks
  - Percolation Scheduling and Trailblazing: can produce an optimal schedule given enough resources
- Loop transformations:
  - Loop-invariant code motion: reduces the number of operations executed
  - Loop unrolling and pipelining: expose inter-iteration parallelism
  - Induction variable analysis: operation strength reduction
  - Loop fusion, interchange: increase the scope of transformations
- Partial evaluation and removal of redundant and useless operations:
  - CSE, copy propagation, constant folding, dead code elimination

10 Useful, but Important Differences with HLS
- Different cost models between programmable processors and synthesized hardware:
  - For instance, routing resources and control logic can be significant costs compared to functional unit costs.
- Transformations have implications on hardware:
  - Non-trivial control and area costs
  - Operation duplication allows more flexible scheduling, but can lead to higher control costs
- Integration with synthesis transformations:
  - Operation chaining
  - Notion of mutual exclusivity of operations
  - Resource sharing

11 Our Approach to Parallelizing HLS (PHLS)
Flow: C input -> original CDFG -> [source-level compiler transformations] -> optimized CDFG -> [scheduling, compiler & dynamic transformations] -> scheduling & binding -> VHDL output.
- Optimizing and parallelizing compiler transformations applied at the source level (pre-synthesis) and during scheduling
- Source-level code refinement using pre-synthesis transformations
- Code restructuring by speculative code motions
- Operation replication to improve concurrency
- Dynamic transformations: exploit new opportunities during scheduling

12 Our Approach to PHLS (continued)
- Optimizing and parallelizing compiler transformations applied at the source level (pre-synthesis) and during scheduling, as on the previous slide
- Develop heuristics that balance the parallelism extracted by code transformations against the hardware costs: improve overall QOR
- Increase code compaction (improve resource utilization)
  => Reduce the impact of programming style/control constructs on HLS results
  => Useful in descriptions with nested conditionals and loops
- Choice of the transformations we explore is based on: improving performance, increasing code compaction, invariance to programming style, extracting parallelism

13 PHLS Transformations Organized into Four Groups
1. Pre-synthesis: loop-invariant code motion, loop unrolling, CSE
2. Scheduling: speculative code motions, multi-cycling, operation chaining, loop pipelining
3. Dynamic: transformations applied dynamically during scheduling: dynamic CSE, dynamic copy propagation, dynamic branch balancing
4. Basic compiler transformations: copy propagation, dead code elimination

14 1. Pre-synthesis: Loop-Invariant Code Motion
[Figure: a loop node (BB1-BB4, condition i < n, increment i = i + 1) before and after the transformation. Before: operations 1: a = b + c, 2: d = a + c, 3: e = e + d all execute inside the loop. After: operations 1 and 2 are hoisted before the loop node; only 3: e = e + d remains inside.]

15 1. Pre-synthesis: Loop-Invariant Code Motion (continued)
[Same figure as the previous slide.]
- Reduces the number of operations that execute inside loops
  => Putting invariant code inside loops is a programming convenience
- A common situation in media applications; see the C sketch below.
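A minimal C sketch of the figure's example; the wrapping functions and the loop bound n are illustrative, the operations are the slide's own:

/* Before: operations 1 and 2 recompute the same values on every iteration. */
int before(int b, int c, int e, int n) {
    for (int i = 0; i < n; i++) {
        int a = b + c;  /* 1: loop-invariant */
        int d = a + c;  /* 2: loop-invariant */
        e = e + d;      /* 3: genuinely depends on the loop */
    }
    return e;
}

/* After loop-invariant code motion: 1 and 2 execute once, before the loop. */
int after(int b, int c, int e, int n) {
    int a = b + c;      /* 1: hoisted */
    int d = a + c;      /* 2: hoisted */
    for (int i = 0; i < n; i++)
        e = e + d;      /* 3 */
    return e;
}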

16 1. Common Sub-Expression Elimination
C description:
a = b + c;
c = b < c;
if (c) d = b + c;
else e = g + h;
[Figure: HTG representation. BB0 is the If node; BB1 computes a = b + c; the true branch BB2 computes d = b + c; the false branch BB3 computes e = g + h; BB4 is the join. After CSE, BB2 becomes d = a.]

17 1. Common Sub-Expression Elimination (continued)
[Same example as the previous slide.]
- We use the notion of dominance of basic blocks:
  - A basic block BBi dominates another basic block BBj if all control paths from the initial basic block of the design graph leading to BBj go through BBi.
- We can eliminate an operation opj in BBj using the common expression in opi if BBi dominates BBj.
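The same example as a C sketch. The slide reuses c as the branch condition; it is renamed cond here so the common subexpression b + c stays well defined, and the wrapping functions and return values are illustrative:

/* Before CSE: b + c is computed in BB1 and again in BB2. */
int cse_before(int b, int c, int g, int h) {
    int d = 0, e = 0;
    int a = b + c;       /* BB1 */
    int cond = b < c;
    if (cond)
        d = b + c;       /* BB2: recomputation; BB1 dominates BB2 */
    else
        e = g + h;       /* BB3 */
    return a + d + e;    /* BB4 */
}

/* After CSE: since BB1 dominates BB2, the recomputation becomes a copy. */
int cse_after(int b, int c, int g, int h) {
    int d = 0, e = 0;
    int a = b + c;       /* BB1 */
    int cond = b < c;
    if (cond)
        d = a;           /* BB2: reuses a */
    else
        e = g + h;       /* BB3 */
    return a + d + e;    /* BB4 */
}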

18 2. Scheduling Transformations: Speculative Code Motions
[Figure: a hierarchical task graph (If node with T/F branches, operations a, b, c) annotated with the code motions: speculation (moving an operation up and out of a branch), reverse speculation (moving an operation down into the branches), conditional speculation (duplicating an operation into both branches), and motion across entire hierarchical blocks. A resource-utilization chart shows the schedule under resource constraints (BB0-BB3).]

19 2. Scheduling Transformations: Speculative Code Motions (continued)
[Same figure as the previous slide.]
- Lead to shorter schedule lengths by utilizing resources that are "idle" in earlier cycles
- Reduce the impact of programming style (operation placement) on the quality of HLS results

20 Hardware Costs of Speculative Code Motions: Speculation
[Figure: before speculation. In state S0, c = a < b; in state S1, the true branch (BB1) computes d = e + f and the false branch (BB2) computes g = h + i on a shared ALU, with input multiplexers steered by S1.c / S1.!c.]

21 Hardware Costs of Speculative Code Motions: Speculation (continued)
[Figure: after speculation. d' = e + f and c = a < b execute in S0; the true branch commits d = d' in S1 while the false branch computes g = h + i. The speculated result needs an extra register d', and the multiplexer select conditions change (S0, S1.!c).]
- Might be able to eliminate the extra register by careful (interconnect-aware) resource binding.
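In C terms, the transformation on slides 20-21 looks roughly like this; the wrapping functions and the final use of d and g are illustrative:

/* Before: e + f executes only when the condition is true. */
int spec_before(int a, int b, int e, int f, int h, int i) {
    int d = 0, g = 0;
    int c = a < b;
    if (c) d = e + f;     /* BB1 */
    else   g = h + i;     /* BB2 */
    return d + g;
}

/* After speculation: e + f runs unconditionally in the earlier, idle cycle;
 * only the commit d = d1 remains under the condition. The speculated value
 * d1 plays the role of the extra register d' on the slide. */
int spec_after(int a, int b, int e, int f, int h, int i) {
    int d = 0, g = 0;
    int d1 = e + f;       /* speculated before the branch */
    int c = a < b;
    if (c) d = d1;        /* commit the speculated result */
    else   g = h + i;
    return d + g;
}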

22 Hardware Costs of Speculative Code Motions: Conditional Speculation
[Figure: before conditional speculation. In state S0, g = h + i and c = a < b (BB0); in S1 the true branch copies e = g (BB1); in S3, after the join, d = e + f (BB3). ALU inputs are steered across states S0-S3.]

23 Hardware Costs of Speculative Code Motions: Conditional Speculation (continued)
[Figure: after conditional speculation. d = e + f is duplicated into both branches: the true branch computes d = g + f (after the copy e = g), the false branch computes d = e + f, under the select condition S1.c + S2.!c.]
- Fewer cycles, but more complex control and multiplexing.
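A C sketch of slides 22-23; the wrapping functions and the final use of d and g are illustrative. The operation after the join is duplicated into both branches, with the copy e = g propagated into the true-branch copy:

/* Before: d = e + f waits until after the join. */
int cspec_before(int a, int b, int e, int f, int h, int i) {
    int g = h + i;        /* BB0, state S0 */
    int c = a < b;        /* BB0, state S0 */
    if (c) e = g;         /* BB1: copy */
    int d = e + f;        /* BB3: after the join */
    return d + g;
}

/* After conditional speculation: d is computed in both branches, one cycle
 * earlier; the true-branch duplicate uses g directly (copy propagation),
 * so the copy e = g becomes dead in this sketch. */
int cspec_after(int a, int b, int e, int f, int h, int i) {
    int g = h + i;
    int c = a < b;
    int d;
    if (c) d = g + f;     /* duplicated, with e = g propagated */
    else   d = e + f;     /* duplicated */
    return d + g;
}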

24 Hardware Costs of Speculative Code Motions
Speculative code motions have two opposing effects on the control, multiplexing, and area costs:
1. Shorter schedule lengths
   - Lead to smaller controllers (fewer states in the state machine)
   - This leads to smaller area
2. More multiplexing and control cost to steer the data
   - Particularly when operations are conditionally speculated
   - Leads to longer critical paths (control logic driven by the current state and conditions sits on the critical path to the ALU)

25 3. Dynamic Transformations
- Called "dynamic" because they are applied during scheduling (versus in a pass before/after scheduling)
- Dynamic branch balancing:
  - Increases the scope of code motions
  - Reduces the impact of programming style on HLS results
- Dynamic CSE and dynamic copy propagation:
  - Exploit the operation movement and duplication done by the speculative code motions
  - Create new opportunities to apply these transformations
  - Reduce the number of operations

26 3. Dynamic Branch Balancing
[Figure: original and scheduled designs. An If node (BB0) with one branch (BB1) containing additions a and b, the other branch (BB2) containing subtraction c, then subtraction d and a subtraction e after the join (BB3, BB4). Resource allocation: two adders and one subtractor; states S0-S3. The conditional is unbalanced, and the operations after the join lengthen the longest path.]

27 Insert New Scheduling Step in Shorter Branch
[Same figure: the scheduler inserts a new scheduling step in the shorter branch of the conditional.]

28 Insert New Scheduling Step in Shorter Branch (continued)
[Figure: operation e is duplicated into the newly inserted step of the shorter branch and into the corresponding step of the longer branch, removing the post-join step from the longest path.]
Dynamic branch balancing is done:
1. While traversing the design
2. And if it enables conditional speculation
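A minimal C sketch of the idea, assuming two adders and one subtractor as in the figure; the schedule step of each operation is given in comments, and all names are illustrative:

/* Before: the else-branch finishes after one step, so operation e must wait
 * for the join and occupies a scheduling step of its own. */
int bb_before(int c0, int p, int q, int r, int s) {
    int a = 0, b = 0, m = 0;
    if (c0) { a = p + q;          /* S1 (adder) */
              b = p + r; }        /* S2 (adder) */
    else    { m = p - q; }        /* S1 (subtractor); no S2 in this branch */
    int e = p - s;                /* S3: own step after the join */
    return a + b + m + e;
}

/* After dynamic branch balancing: a new step S2 is inserted in the shorter
 * (else) branch, which enables conditional speculation of e into S2 of both
 * branches; the post-join step drops off the longest path. */
int bb_after(int c0, int p, int q, int r, int s) {
    int a = 0, b = 0, m = 0, e;
    if (c0) { a = p + q;          /* S1 */
              b = p + r;          /* S2 (adder) */
              e = p - s; }        /* S2 (subtractor, duplicated) */
    else    { m = p - q;          /* S1 */
              e = p - s; }        /* S2 (newly inserted step) */
    return a + b + m + e;
}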

29 3. Dynamic CSE: Going Beyond Traditional CSE
[Recap of the CSE example from slides 16-17: a = b + c in BB1 dominates d = b + c in BB2, so the latter is replaced by d = a.]

30 New Opportunities for "Dynamic" CSE Due to Code Motions
[Figure: a = b + c sits in branch BB2, d = b + c in a later branch BB6. CSE is not possible since BB2 does not dominate BB6. The scheduler then decides to speculate the first addition into BB0 as dcse = b + c, leaving a = dcse in BB2.]

31 New Opportunities for "Dynamic" CSE Due to Code Motions (continued)
[Figure: after the speculation, dynamic CSE replaces d = b + c in BB6 with d = dcse. CSE was not possible before because BB2 does not dominate BB6; it is possible now because BB0 dominates BB6.]
If the scheduler moves or duplicates an operation op, apply CSE on the remaining operations using op.
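A C sketch of slides 30-31; the branch conditions p and q and the wrapping functions are illustrative:

/* Before: b + c is computed in two branches; neither basic block dominates
 * the other, so classical CSE cannot eliminate either occurrence. */
int dcse_before(int b, int c, int p, int q) {
    int a = 0, d = 0;
    if (p) a = b + c;    /* BB2 */
    /* ... */
    if (q) d = b + c;    /* BB6: BB2 does not dominate BB6 */
    return a + d;
}

/* After the scheduler speculates the first addition into BB0: BB0 dominates
 * BB6, so dynamic CSE replaces the second computation with a copy. */
int dcse_after(int b, int c, int p, int q) {
    int a = 0, d = 0;
    int dcse = b + c;    /* speculated into BB0 */
    if (p) a = dcse;
    if (q) d = dcse;     /* dynamic CSE: reuse dcse */
    return a + d;
}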

32 Conditional Speculation & Dynamic CSE
[Figure: d = b + c sits in BB8 after a join; the scheduler decides to conditionally speculate a = b + c, duplicating it as a' = b + c into branches BB1 and BB2, with a = a' at the original site.]

33 Conditional Speculation & Dynamic CSE (continued)
[Figure: dynamic CSE then replaces d = b + c in BB8 with d = a'.]
- Use the notion of dominance by groups of basic blocks:
  - All control paths leading up to BB8 come from either BB1 or BB2 => BB1 and BB2 together dominate BB8.

34 Integrating the Parallelizing Transformations into an HLS Scheduler
- Employ speculative code motions during scheduling
- Perform branch balancing:
  - While traversing the design
  - If it enables a code motion (during scheduling)
- Perform dynamic CSE after scheduling an operation:
  - After the scheduled operation has been moved and possibly duplicated

35 Architecture of the PHLS Scheduler
[Figure: the scheduler pipeline. IR Walker: traverses the design to find the next basic block to schedule. Candidate Fetcher: traverses the design to find candidate operations to schedule. Candidate Chooser: calculates the cost of each operation and chooses the operation with the lowest cost for scheduling. Candidate Mover: moves, duplicates, and schedules the chosen operation. Dynamic Transforms: dynamically apply transformations such as CSE on the remaining candidate operations using the scheduled operation.]

36 Integrating Transformations into the Scheduler
[Figure: the same scheduler pipeline, refined. The Candidate Fetcher is split into a Candidate Walker and a Candidate Validater that determine the code motions required to schedule each available operation. Branch balancing is applied both during traversal (IR Walker) and during code motion (Candidate Mover), where the speculative code motions are applied.]

37 Scheduling Problem Formulation
Given:
- Data flow graph Gd(Vop, Edata)
- Control flow graph Gc(Vbb, Econtrol)
- Mapping Φ: Vop -> Vbb; resource list R
Find a new mapping Φsched: Vop -> Vbb and start times T of all operations in Vop such that:
- Each vop starts executing after all its predecessor operations in Gd have finished executing
- Each vop is mapped to a resource in R
- With vbbs = BBsched(vop) and vbbun = BBunsched(vop): either vbbs dominates vbbun or vbbun dominates vbbs (speculation and movement along control paths without duplication),
- Or vop is duplicated into a set of basic blocks β such that either β dominates vbbun or vbbun dominates β (conditional and reverse speculation: duplication into multiple conditional branches)
Hierarchical task graphs are used to enable efficient code motions.
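The same formulation restated compactly in notation (a sketch; delay() is introduced here to stand for an operation's execution time):

Given G_d(V_op, E_data), G_c(V_bb, E_control), Φ : V_op -> V_bb and resources R, find Φ_sched and start times T such that for every v in V_op, with b_s = Φ_sched(v) and b_u = Φ(v):

\begin{align*}
& T(v) \ge T(u) + \mathrm{delay}(u) && \forall (u, v) \in E_{data},\\
& v \text{ is bound to some resource in } R,\\
& b_s \,\mathrm{dom}\, b_u \;\lor\; b_u \,\mathrm{dom}\, b_s \;\lor\;
  \big(v \text{ duplicated into } \beta \subseteq V_{bb} \text{ with }
  \beta \,\mathrm{dom}\, b_u \lor b_u \,\mathrm{dom}\, \beta\big).
\end{align*}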

38 SPARK High-Level Synthesis Framework
[Figure: tool-flow overview.]

39 SPARK Parallelizing HLS Framework
- C input and synthesizable RTL VHDL output
- Range of compiler, parallelizing compiler, and HLS transformations applied during the pre-synthesis and scheduling phases
- Toolbox of transformations and heuristics: each can be developed independently of the others
- Complete HLS tool: does binding, control synthesis, and backend VHDL generation
- Interconnect-minimizing resource binding
- Enables graphical visualization of the design description and intermediate results
- More than 100,000 lines of C++ code

40 Graph Visualization
[Figure: HTG and DFG views of a design.]

41 Scheduling
[Figure: resource utilization graph of a scheduled design.]

42 Example of a Complex HTG
- Example of a real design: the MPEG-1 pred2 function
- Multiple nested loops and conditionals

43 Experiments
- Experiments for several transformations:
  - Pre-synthesis transformations
  - Speculative code motions
  - Dynamic CSE
- We used SPARK to synthesize designs derived from several industrial designs:
  - MPEG-1, MPEG-2, GIMP image processing software
  - Case study of the Intel Instruction Length Decoder
- Scheduling results:
  - Number of states in the FSM
  - Cycles on the longest path through the design
- VHDL: logic synthesis results:
  - Critical path length (ns)
  - Unit area

44 Target Applications
Design          | # of Ifs | # of Loops | # Non-Empty Basic Blocks | # of Operations
MPEG-1 pred1    | 4        | 2          | 17                       | 123
MPEG-1 pred2    | 11       | 6          | 45                       | 287
MPEG-2 dp_frame | 18       | 4          | 61                       | 260
GIMP tiler      | 11       | 2          | 35                       | 150

45 Scheduling & Logic Synthesis Results
[Bar charts comparing four configurations: non-speculative code motions (within BBs & across hierarchical blocks); + speculative code motions; + pre-synthesis transforms; + dynamic CSE. Annotated improvements: 42%, 10%, 36%, 8%, 39%.]

46 Scheduling & Logic Synthesis Results (continued)
[Same charts as the previous slide.]
Overall: 63-66% improvement in delay, with almost constant area.

47 Scheduling & Logic Synthesis Results
[Bar charts for a second set of designs, same four configurations. Annotated improvements: 14%, 20%, 1%, 33%, 41%, 52%.]

48 Scheduling & Logic Synthesis Results (continued)
[Same charts as the previous slide.]
Overall: 48-76% improvement in delay, with almost constant area.

49 Case Study: Intel Instruction Length Decoder
[Figure: a stream of instructions from the instruction buffer enters the instruction length decoder, which marks the first, second, and third instructions.]

50 Case Study: ILD Block from Intel
- A design derived from the Instruction Length Decoder of the Intel Pentium(R) class of processors
- Decodes the length of instructions streaming from memory
- Sequentially looks at up to 4 bytes for each instruction
- Has to execute in one cycle and decode about 64 bytes of instructions
- Characteristics of microprocessor functional blocks:
  - Low latency: single- or dual-cycle implementation
  - Consist of several small computations
  - Intermix of control and data logic

51 Initial: Multi-Cycle Sequential Architecture
[Figure: bytes 1-4 each feed a "length contribution" computation; "Need Byte 2/3/4?" checks chain the stages sequentially.]
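A hypothetical C sketch of the sequential scheme in the figure; length_contrib() and needs_byte() are stand-ins for the real (unpublished) Intel decode logic:

/* Hypothetical stand-ins for the per-byte decode tables. */
extern int length_contrib(unsigned char byte, int pos); /* bytes toward length */
extern int needs_byte(unsigned char byte, int pos);     /* is byte pos+1 needed? */

/* Sequential length decode: examine up to 4 bytes, one per step. */
int decode_length(const unsigned char insn[4]) {
    int len = length_contrib(insn[0], 1);     /* byte 1 is always examined */
    for (int pos = 2; pos <= 4; pos++) {      /* bytes 2..4 only if needed */
        if (!needs_byte(insn[pos - 2], pos - 1))
            break;
        len += length_contrib(insn[pos - 1], pos);
    }
    return len;
}

Fully unrolling this loop and speculating every length_contrib() and needs_byte() call makes all four contributions available in parallel; the conditionals collapse into select logic, which is the single-cycle architecture of the next slide.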

52 ILD Synthesis: Resulting Architecture
- Speculate operations, fully unroll the loop, eliminate the loop index variable
- Multi-cycle sequential architecture => single-cycle parallel architecture

53 ILD Synthesis: Resulting Architecture (continued)
- Extract maximum parallelism: full loop unrolling, speculation
- Pack all operations into a single cycle by operation chaining
- Our toolbox approach enables us to develop a script to synthesize applications from different domains

54 Conclusions
- Parallelizing code transformations enable a new range of HLS transformations:
  - Provide the needed improvement in the quality of HLS results
  - Enable synthesis to dominate embedded system design methodology
  - Can enable productivity improvements in microelectronic design
- We have shown that it is possible to optimize and synthesize designs with complex control flow:
  - Achieve an improvement of 50-75% in performance across a number of designs
  - Also shown effectiveness on an Intel design

55 Contributions of this Work
- Developed a parallelizing high-level synthesis methodology
- Developed and implemented a diverse range of compiler, parallelizing compiler, and HLS transformations
  - Identified a set of transformations that are "useful" for HLS
  - Developed guiding heuristics to improve overall synthesis results
- Demonstrated the utility of the transformations in a C-to-VHDL framework:
  - Platform for applying coarse- and fine-grain optimizations
  - Toolbox approach in which transformations and heuristics can be developed independently
  - Scripts give the designer control over the transformations applied
  - Enables the designer to find the right synthesis script for different application domains
- Scheduling and logic synthesis results (analyzed control costs)
- Experimentation on industrial-strength applications, including an Intel design

56 Future Directions
- Explore inter-loop-iteration parallelism:
  - Loop unrolling and pipelining (some recent initial work done)
- Explore the effects of transformations on power:
  - Need to analyze the power costs of transformations
  - Develop a post-scheduling pass that optimizes power without affecting performance
  - Preliminary experiments suggest power remains almost constant after applying all transformations (power follows the area curve)
  - Since delay reduces by 50-75%, energy (= power x delay) should also reduce by 50-75%
  - However, speculation causes unnecessary execution of operations
- Integrate SPARK into a system-level co-design methodology:
  - Some initial work: manual partitioning, followed by synthesis of the hardware component by SPARK for an FPGA platform

57 Publications
- Dynamic Conditional Branch Balancing during the High-Level Synthesis of Control-Intensive Designs. S. Gupta, N.D. Dutt, R.K. Gupta, A. Nicolau. DATE, March 2003.
- SPARK: A High-Level Synthesis Framework for Applying Parallelizing Compiler Transformations. S. Gupta, N.D. Dutt, R.K. Gupta, A. Nicolau. VLSI Design 2003. Best Paper Award.
- Dynamic Common Sub-Expression Elimination during Scheduling in High-Level Synthesis. S. Gupta, M. Reshadi, N. Savoiu, N.D. Dutt, R.K. Gupta, A. Nicolau. ISSS 2002.
- Coordinated Transformations for High-Level Synthesis of High Performance Microprocessor Blocks. S. Gupta, T. Kam, M. Kishinevsky, S. Rotem, N. Savoiu, N.D. Dutt, R.K. Gupta, A. Nicolau. DAC 2002.
- Conditional Speculation and its Effects on Performance and Area for High-Level Synthesis. S. Gupta, N. Savoiu, N.D. Dutt, R.K. Gupta, A. Nicolau. ISSS 2001.
- Speculation Techniques for High Level Synthesis of Control Intensive Designs. S. Gupta, N. Savoiu, S. Kim, N.D. Dutt, R.K. Gupta, A. Nicolau. DAC 2001.
- Analysis of High-level Address Code Transformations for Programmable Processors. S. Gupta, M. Miranda, F. Catthoor, R.K. Gupta. DATE 2000.
- Synthesis of Testable RTL Designs using Adaptive Simulated Annealing Algorithm. C.P. Ravikumar, S. Gupta, A. Jajoo. Intl. Conf. on VLSI Design, 1998. Best Student Paper Award.
- Book chapter: ASIC Design. S. Gupta, R.K. Gupta. Chapter 64, The VLSI Handbook, edited by Wai-Kai Chen.
- 2 journal papers and 1 conference paper under submission.

58 Thank You

59 Additional Slides

60 How Close Do We Get to the Best Possible Critical Path Length (Cycles)?
Benchmark      | Best Cycles | Longest Dep Chain | % Difference
MPEG-1 pred1   | 767         | 715               | 6.8 %
MPEG-1 pred2   | 1929        | 1355              | 30 %
MPEG-2 dpframe | 486         | 387               | 20.4 %
GIMP tiler     | 2224        | 1701              | 23.5 %

61 [Figure: a doubly nested loop HTG. Outer Loop Node 2 iterates i = 0; i < M; i = i + 1 (BB0-BB2, BB7); inner Loop Node 1 iterates j = 0; j < N; j = j + 1 (BB3-BB6) and contains b = e + f, 1: a = b + c, 2: d = a + c. States S0-S6.]
Longest dependency chain = critical path = 2 * N * M + 1 cycles.
Actual cycles = (3 * N + 3) * M + 1, i.e., ((S3, S4, S5) * N + (S1, S2, S6)) * M + S0.

62 Scheduling Heuristic
[Figure: an HTG (BB0-BB9) with candidate operations a, b, c, d; operation c is speculated across the HTG.]
- Get available ops: A = {a, b, c, d}
- Determine the code motions required for each
- Assign a cost to each operation (cost is based on the data dependency chain)
- Schedule/move the op with the lowest cost
- Apply dynamic CSE on A using that op

63 Scheduling Heuristic (continued)
[Same figure after operation c has been speculated; the remaining candidates are scheduled the same way. A code-form sketch of the loop follows.]
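Putting the heuristic in code form, a minimal C sketch with hypothetical helper functions standing in for the scheduler components of slide 35 (the real SPARK scheduler is considerably more elaborate):

typedef struct Op Op;
typedef struct Step Step;

/* Hypothetical helpers; each corresponds to a box in the scheduler pipeline. */
extern Step *next_step(void);                    /* IR walker; may branch-balance */
extern int collect_candidates(Step *s, Op *out[], int max); /* shrinks as ops schedule */
extern int cost(Op *op);                         /* length of data dependency chain */
extern void move_and_schedule(Op *op, Step *s);  /* may duplicate (cond. speculation) */
extern void dynamic_cse(Op *moved, Op *rest[], int n);

void schedule_design(void) {
    Step *s;
    while ((s = next_step()) != NULL) {
        Op *cand[64];
        int n;
        while ((n = collect_candidates(s, cand, 64)) > 0) {
            int best = 0;                        /* pick lowest-cost candidate */
            for (int i = 1; i < n; i++)
                if (cost(cand[i]) < cost(cand[best]))
                    best = i;
            move_and_schedule(cand[best], s);    /* apply speculative code motion */
            dynamic_cse(cand[best], cand, n);    /* exploit the move/duplication */
        }
    }
}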

64 SPARK: A Parallelizing High-Level Synthesis Framework
Sumit Gupta, Nikil Dutt, Rajesh Gupta, Alex Nicolau
Center for Embedded Computer Systems, University of California, Irvine
http://www.cecs.uci.edu/~spark
Supported by the Semiconductor Research Corporation

