Published by Felix Melton. Modified over 9 years ago.

1 Coordinating Transformations for High-Level Synthesis of High Performance Microprocessor Blocks
SPARK High Level Synthesis System
Sumit Gupta, Timothy Kam, Michael Kishinevsky, Shai Rotem, Nick Savoiu, Nikil Dutt, Rajesh Gupta, Alex Nicolau
Center for Embedded Computer Systems, University of California, Irvine, http://www.cecs.uci.edu/~spark
Strategic CAD Labs, Design Technologies, Intel Inc, Hillsboro, http://www.intel.com/research/scl
Supported by Semiconductor Research Corporation
08/31/2001 Copyright CECS & The Spark Project

2 Classical High Level Synthesis
- From C to CDFG (control-data flow graph) to Architecture
- Classical HLS targets ASIC designs
- Target of this work: Microprocessor Block Design
  - A new domain for the application of high-level synthesis
  - New synthesis methodology has been developed
  - Focus is on code transformations to improve QoR (quality of results)

3 Characteristics of ASIC Design
- Large designs
  - Several ALUs, Multipliers
  - Controller (FSM)
  - Register File
- Multi-cycle implementation
  - Intermediate results stored in latches or pipeline registers

4 HL-Synthesis of ASIC Designs
- Large designs
- Multi-cycle implementation
- Implications on High-Level Synthesis Methodology
  - Resource constrained
  - Extraction of parallelism constrained by area limitations
  - Speculation may lead to additional registers
  - More conservative with transformations such as loop unrolling

5 Microprocessor Architecture
[Figure: Register File, Instruction Decode, Deeply Pipelined Execution Unit, Specialized Unit]
- Microprocessors: Deeply pipelined
- Complex blocks within pipeline stages
- Previous work
  - Pipeline scheduling
  - Mapping applications to a microprocessor architecture

6 Characteristics of Microprocessor Blocks
- Small, Complex Units
  - Several small computation blocks
  - Intermix of control and data logic
- Single or Dual cycle implementation
  - Inputs and outputs are stored in memory elements

7 HL-Synthesis of Microprocessor Blocks
- Small designs with high performance requirements
- Implications on High-Level Synthesis Methodology
  - Area constraints are lax
  - Extract maximal parallelism
  - All loops have to be unrolled
  - Pack all operations into a small number of cycles and in the shortest cycle time
  - Operations within behavior are chained together with no intermediate latching
- Shifts the emphasis to which transformations are “useful” and how they must be applied

8 Loop Unrolling
- Loop unrolling is usually restricted for ASIC designs
  - Leads to code explosion
  - In terms of hardware, it means
    - Large FSM controllers
    - Complex interconnect logic
- For Microprocessor Blocks
  - Loops represent a programming convenience
  - Whole loop is scheduled in one/two cycles
  - All iterations have to execute within one/two cycles
  - In hardware, the loop will be unrolled anyway
[Figure: the loop (i = 0; test i < N; Loop Body LB(i); i = i + 1) placed between two sets of Pipeline Registers, labeled One Cycle]

9 Fully Unroll Loops
[Figure: the loop (i = 0; test i < N; Loop Body LB(i); i = i + 1) is transformed by Unroll Loop into i = 0; 1st Iteration LB(0); 2nd Iteration LB(1); ...; Nth Iteration LB(N-1)]

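The unrolling pictured on this slide can be sketched in software. The Python model below is only an illustration, not Spark input or output: the trip count N = 4 and the function `loop_body` are hypothetical stand-ins for N and LB(i), chosen so that the rolled and fully unrolled forms can be compared.

```python
N = 4  # assumed trip count; full unrolling needs it to be a known constant


def loop_body(data, i):
    """Hypothetical loop body LB(i): any computation indexed by i."""
    return data[i] * 2 + i


def rolled(data):
    """The loop as written: i = 0; while i < N: LB(i); i = i + 1."""
    results = []
    i = 0
    while i < N:
        results.append(loop_body(data, i))
        i = i + 1
    return results


def fully_unrolled(data):
    """The same behavior as straight-line code, one copy of LB per iteration."""
    results = []
    i = 0
    results.append(loop_body(data, i))  # 1st iteration
    i = i + 1
    results.append(loop_body(data, i))  # 2nd iteration
    i = i + 1
    results.append(loop_body(data, i))  # 3rd iteration
    i = i + 1
    results.append(loop_body(data, i))  # Nth iteration (N = 4)
    return results
```

The unrolled form has no loop test and no back edge, only straight-line code, while the index variable i is still present at this stage.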
10 Chaining Operations Across Conditional Boundaries

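11 Inserting “Wire-Variables” to enable Chaining
[Figure, before: BB 0 branches on Cond to BB 1 (True) and BB 2 (False); one branch computes X = a + b and the other X = c; BB 3 computes Z = X + d]
[Figure, after: one branch computes Wv = a + b; X = Wv and the other Wv = c; X = Wv; BB 3 computes Z = Wv + d]
[Figure, hardware: labels ALU, a, b, c, d, Cond, Wv, X, Z]
- Wv is mapped to a wire; all other variables are mapped to registers
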
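The wire-variable rewrite on this slide can be checked with a small behavioral model. This is a sketch under assumptions: the function names are hypothetical, and placing X = a + b on the True side is a guess from the figure labels. Python has no notion of wires or registers, so the comments only mark which name would map to which.

```python
def original(a, b, c, d, cond):
    """Before: Z reads X, which is written inside the conditional."""
    if cond:
        x = a + b  # assumed True branch
    else:
        x = c      # assumed False branch
    z = x + d      # BB 3 reads the register-mapped variable X
    return x, z


def with_wire_variable(a, b, c, d, cond):
    """After: the branch result goes to Wv first, and Z reads Wv."""
    if cond:
        wv = a + b  # Wv would be mapped to a wire
        x = wv      # X would still be mapped to a register
    else:
        wv = c
        x = wv
    z = wv + d      # BB 3 now reads the wire, not the register
    return x, z
```

Both versions compute the same X and Z. The difference is only in which value the final addition reads, which is what allows it to be chained after the conditional.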
12 Supporting Transformations: Beyond Basic Block Code Motions
[Figure: code motions of "+" operations around an If Node with T and F branches, labeled Speculation, Reverse Speculation, Conditional Speculation, Across Hierarchical Blocks]

13 A Case Study: Instruction Length Decoder
- Validated this methodology using a design derived from the Instruction Length Decoder of the Intel Pentium® class of processors
  - Takes a stream of instructions from memory
  - Decodes the length of these instructions
  - Has to look at up to 4 bytes at a time
  - Has to execute in one cycle
- Implemented this methodology along with supporting transformations in the Spark high-level synthesis (HLS) framework
  - Takes a behavioral description in C as input and produces synthesizable VHDL
  - Has various supporting code optimizations
    - Constant propagation, Dead code elimination

14 Basic Instruction Length Decoder: Initial Description
[Figure: Byte 1 to Byte 4 produce Length Contribution 1 to 4; the checks Need Byte 2?, Need Byte 3?, Need Byte 4? decide which contributions are added (+ + +) to give the Total Length Of Instruction]
- Single Cycle implementation
- Natural behavioral description is sequential and slow
- Must be parallelized and compacted into one cycle with a short cycle time

15 Instruction Length Decoder: Parallelized Description
- Speculatively calculate the length contribution of all 4 bytes at a time
- Determine actual total length of instruction based on this data
[Figure: Length Contribution 1 to 4 are computed from Byte 1 to Byte 4 at the same time; Need Byte 2?, Need Byte 3?, Need Byte 4? then select which contributions are added (+ + +) to give the Total Length Of Instruction]

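A minimal sketch of this parallelized form follows. The two encoding rules are hypothetical toy rules invented for the example and are not the real Pentium instruction encoding; only the shape (compute all four contributions first, then select) comes from the slide.

```python
def length_contribution(byte):
    """Hypothetical rule: low two bits plus one. Not the real encoding."""
    return (byte & 0x03) + 1


def needs_next_byte(byte):
    """Hypothetical rule: top bit set means the next byte is needed."""
    return (byte & 0x80) != 0


def total_length_speculative(b1, b2, b3, b4):
    # Data calculation: all four contributions are computed up front,
    # whether or not they turn out to be needed.
    lc1 = length_contribution(b1)
    lc2 = length_contribution(b2)
    lc3 = length_contribution(b3)
    lc4 = length_contribution(b4)

    # Control logic: decide which speculated contributions actually count.
    use2 = needs_next_byte(b1)
    use3 = use2 and needs_next_byte(b2)
    use4 = use3 and needs_next_byte(b3)

    return (lc1
            + (lc2 if use2 else 0)
            + (lc3 if use3 else 0)
            + (lc4 if use4 else 0))
```

None of the four contribution calculations depends on a condition, so in this form they no longer have to wait for the Need Byte checks.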
16 Instruction Length Decoder: Parallelized Description
[Figure: Byte 1 to Byte 5, each feeding its own Insn. Len Calc block]
- Speculatively calculate length of instructions assuming a new instruction starts at each byte
- Do this calculation for all bytes in parallel
- Traverse from 1st byte to last
  - Determine length of instructions starting from the 1st till the last
  - Discard unused calculations

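The two phases on this slide (speculate at every byte, then traverse and discard) can be sketched as below. The encoding inside `insn_length` is a hypothetical toy rule, and treating bytes past the end of the buffer as 0 is likewise an assumption made only to keep the example self-contained.

```python
def insn_length(buf, i):
    """Length of an instruction assumed to start at byte i (toy encoding)."""
    total = 0
    for k in range(4):  # look at up to 4 bytes
        b = buf[i + k] if i + k < len(buf) else 0  # assumed padding with 0
        total += (b & 0x03) + 1  # hypothetical length contribution
        if not (b & 0x80):       # hypothetical "need next byte" flag
            break
    return total


def speculative_lengths(buf):
    """Phase 1: one independent calculation per byte position."""
    return [insn_length(buf, i) for i in range(len(buf))]


def select_instructions(spec_len):
    """Phase 2: walk from the first byte, keeping only real instruction starts."""
    found = []
    i = 0
    while i < len(spec_len):
        found.append((i, spec_len[i]))  # (start offset, length)
        i += spec_len[i]                # skipped positions are discarded
    return found
```

For example, `select_instructions(speculative_lengths(bytes([0x00, 0x80, 0x00, 0x01, 0x00])))` keeps the results at offsets 0, 1 and 3 and discards those computed at offsets 2 and 4.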
17 Steps Involved in Synthesis of the ILD
- Speculatively calculate all possible lengths of an instruction at byte “i”
  - Achieved by speculative code motions
- Speculatively calculate length of instructions assuming an instruction starts at each byte
  - Achieved by loop unrolling, loop index variable elimination and speculative code motions
- Pack all operations into one cycle
  - Achieved by chaining all operations across conditional boundaries
- Step-by-step code refinement is presented in the paper

18 Initial: Multi-Cycle Sequential Architecture
[Figure: Byte 1 to Byte 4 with Length Contribution 1 to 4, connected through the checks Need Byte 2?, Need Byte 3?, Need Byte 4?]

19 ILD Synthesis: Resultant Architecture
[Figure: Multi-cycle Sequential Architecture, transformed by Speculate Operations, Fully Unroll Loop, Eliminate Loop Index Variable, into Single cycle Parallel Architecture]

20 Conclusions
- Demonstrated a high-level synthesis methodology for a new domain: Microprocessor Block Design
  - Small number of Cycles, Short Cycle Times: Very Low Latency
- Extract Maximal Parallelism
  - Aggressive Speculative Code Motions
  - Unrolling loops fully + other loop transformations
- Pack all operations in behavior into a few cycles
  - Chaining operations across conditionals
- Implemented in the Spark HL Synthesis Framework
  - Takes C input and produces synthesizable VHDL
  - Industrial case study: Instruction Length Decoder
- Ongoing work => Broaden the application base of this methodology and develop more supporting transformations

21 Thank You !

22 Additional Slides

23 Loop Index Variable Elimination
[Figure, after unrolling: i = 0; R1(i) = Op1(i); R1(i+1) = Op1(i+1); ...; R1(i+N-1) = Op1(i+N-1)]
[Figure, after Propagate Constant i = 0: R1(0) = Op1(0); R1(1) = Op1(1); ...; R1(N-1) = Op1(N-1)]

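The constant propagation step on this slide can be illustrated with a small text rewriter. This is a toy sketch, not how Spark represents code: the statements are plain strings in the slide's notation, both function names are hypothetical, and a concrete count n stands in for the symbolic N.

```python
import re


def unrolled_statements(n):
    """Statements left after full unrolling; the index is still the symbol i."""
    return ["R1(i) = Op1(i)"] + [f"R1(i+{k}) = Op1(i+{k})" for k in range(1, n)]


def eliminate_index(statements, i_value=0):
    """Propagate the constant i = i_value and fold each i+k into a literal."""
    def fold(match):
        offset = int(match.group(1) or 0)  # bare "i" means offset 0
        return str(i_value + offset)

    # Matches the standalone name i, optionally followed by +<digits>.
    pattern = r"\bi(?:\+(\d+))?\b"
    return [re.sub(pattern, fold, s) for s in statements]
```

After the rewrite no statement mentions i, so the index variable and its increments are no longer needed.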
24 Original Specification
[Figure: the original specification, transformed by Speculate into Speculatively Calculate all possible lengths at i; regions labeled Data Calculation and Control Logic]

25 After Speculative Calculation at each byte
[Figure: Unroll Loop and Propagate Loop Index Var lead to Speculative Calculation of All Instruction Lengths Assuming an Instruction Starts at each Byte]

26 ILD: Final Architecture

27 ILD: Algorithmic Description
Do in a loop, starting with the 1st byte till the Nth byte:
- Calculate LC1
- Need 2nd Byte?
  - No: Length = LC1
  - Yes: Calculate LC2
  - Need 3rd Byte?
    - No: Length = LC1 + LC2
    - Yes: Calculate LC3
    - Need 4th Byte?
      - No: Length = LC1 + LC2 + LC3
      - Yes: Calculate LC4; Length = LC1 + LC2 + LC3 + LC4

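The flowchart on this slide maps directly onto nested conditionals inside a loop. The sketch below follows that shape. The helpers `lc`, `need_next` and `byte_at` are hypothetical, the encoding rules are toy rules rather than the real Pentium encoding, and advancing the loop by the decoded length is an assumption about how "do in a loop" proceeds.

```python
def lc(byte):
    """Hypothetical length contribution: low two bits plus one."""
    return (byte & 0x03) + 1


def need_next(byte):
    """Hypothetical rule: top bit set means the next byte is needed."""
    return (byte & 0x80) != 0


def byte_at(buf, i):
    """Assumed padding: bytes past the end of the buffer read as 0."""
    return buf[i] if i < len(buf) else 0


def instruction_length(buf, i):
    """Sequential description: each check waits for the previous one."""
    length = lc(byte_at(buf, i))                       # LC1
    if need_next(byte_at(buf, i)):                     # Need 2nd byte?
        length += lc(byte_at(buf, i + 1))              # LC2
        if need_next(byte_at(buf, i + 1)):             # Need 3rd byte?
            length += lc(byte_at(buf, i + 2))          # LC3
            if need_next(byte_at(buf, i + 2)):         # Need 4th byte?
                length += lc(byte_at(buf, i + 3))      # LC4
    return length


def decode_lengths(buf):
    """Loop from the 1st byte onward, one instruction at a time."""
    found = []
    i = 0
    while i < len(buf):
        n = instruction_length(buf, i)
        found.append((i, n))  # (start offset, length)
        i += n
    return found
```

Each contribution here is computed only after the preceding check, which is the sequential dependence the slides describe as slow.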