1
ECE 565 High-Level Synthesis—An Introduction
Shantanu Dutt ECE Dept., UIC
2
HLS Flow
Code/Algorithm → Architecture (functional units (FUs) and memory units (MUs) interconnected via muxes, demuxes, tristate buffers, buses, and dedicated interconnects).
Classically, these 3 stages were performed sequentially, but they are now performed together (which leads to better optimization).
3
HLS Flow (contd)
4
HLS Flow (contd)
[Figure annotations:] Taken into consideration during register allocation (post scheduling). Taken into consideration during scheduling. (Binding) Allocation: simple counting of FUs after the above 2 stages.
5
Simple HLS Examples
6
Simple HLS Examples (contd)
2) Mapping to h/w w/ constraints: use only 1 multiplier (X) and 1 adder (+), w/ X delay of 2 cc’s and + delay of 1 cc.
(a) Scheduling. i) Non-overlapped pipelined scheduling: schedule an operation when its i/p data and an FU are available (may need to break ties between competing operations).
[Figure: schedule over cc’s 1 to 6. X: c1(1), c1(2). +: c2(1), c3(1), c2(2), c3(2).]
(b) Arch. Synthesis: binding & FU, reg, mux/demux allocation and interconnection.
[Figure: datapath with registers a, b, c, d, x, y, z (load signals lda, ldb, ldc, ldd, ldx, ldy, ldz), one X and one + FU, mux1 and mux2 (inputs I0, I1), and a demux (outputs O0, O1).]
(c) Controller FSM Synthesis.
[Figure: controller FSM entered on Reset, with states cc 3i, cc 3i+1 and cc 3i+2. Labels in the figure: [x ← a X b] (c1); [y ← c+d] (c2) with mux1=0, mux2=0, demux=0, ldy=1; [z ← x+y] (c3); lda=1, ldb=1, ldc=1, ldd=1, mux1=1, mux2=1, demux=1, ldz=1; ldx=1.]
Note: A register is loaded at the +ve/-ve edge (in a +ve/-ve edge-triggered system) of the cc after the one in which its load signal is asserted (figure illustration: lda = 1, then reg. “a” loaded).
Note: Unspecified control signals (cs) take their inactive value or, if no such concept exists for the cs, the don’t-care value.
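The scheduling rule in (a) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary encoding of the DFG, the function name `schedule`, and the tie-breaking by dictionary order are all hypothetical choices, and the FUs are assumed to be non-pipelined.

```python
# Hypothetical encoding of the example DFG: op -> (FU type, delay in cc's, predecessor ops)
DFG = {
    "c1": ("X", 2, []),            # x <- a X b
    "c2": ("+", 1, []),            # y <- c + d
    "c3": ("+", 1, ["c1", "c2"]),  # z <- x + y
}

def schedule(dfg, fu_count):
    """Start each op in the first cc in which its inputs are ready and an FU of its type is free."""
    start, finish = {}, {}
    busy_until = {fu: [0] * n for fu, n in fu_count.items()}  # last busy cc per FU instance
    cc = 1
    while len(start) < len(dfg):
        for op, (fu, delay, preds) in dfg.items():  # ties broken by dictionary order (assumption)
            if op in start:
                continue
            inputs_ready = all(p in finish and finish[p] < cc for p in preds)
            free = [i for i, t in enumerate(busy_until[fu]) if t < cc]
            if inputs_ready and free:
                start[op] = cc
                finish[op] = cc + delay - 1
                busy_until[fu][free[0]] = finish[op]
        cc += 1
    return start, finish

start, finish = schedule(DFG, {"X": 1, "+": 1})
print(start)
print("latency:", max(finish.values()), "cc's")
```

With one X and one +, c1 and c2 start in cc 1 and c3 starts in cc 3, which agrees with the 3 cc’s per iteration of the non-overlapped schedule.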
7
Simple HLS Examples (contd)
2) Mapping to h/w w/ constraints: use only 1 multiplier (X) and 1 adder (+) (cont’d)
(a) Scheduling. ii) Overlapped pipelined scheduling.
[Figure: schedule over cc’s 1 to 6. X: c1(1), c1(2). +: c2(1), c3(1), c2(2), c3(2).]
(b) Arch. Synthesis.
[Figure: datapath with registers a, b, c, d, x, y, z (load signals lda, ldb, ldc, ldd, ldx, ldy, ldz), one X and one + FU, mux1 and mux2 (inputs I0, I1), and a demux.]
(c) Controller FSM Synthesis.
[Figure: controller FSM entered on Reset, with states labeled cc 3i and cc 3i+1. Labels in the figure: [y ← c+d, x ← a X b] (c1, c2) with lda=1, ldb=1, mux1=0, mux2=0, demux=0, ldy=1, ldx=1; [z ← x+y] (c3) with ldc=1, ldd=1, mux1=1, mux2=1, demux=1, ldz=1.]
For 4 iterations, the overlapped schedule takes 9 cc’s versus 12 cc’s for the non-overlapped schedule.
Overlapped schedule: time for n iterations = 2n+1 cc’s; throughput = n/(2n+1) ~ 0.5 outputs/cc.
Non-overlapped schedule: time for n iterations = 3n cc’s; throughput = n/3n ~ 0.33 outputs/cc.
For large n, the overlapped schedule thus improves throughput by ~50% (0.5 vs. 0.33 outputs/cc); equivalently, it needs ~33% fewer cc’s.
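The timing formulas on this slide are easy to check numerically. The short sketch below uses only the two formulas stated above (3n and 2n+1 cc’s); the function names are hypothetical. Measured relative to the non-overlapped schedule, the throughput gain is one third for n = 4 and approaches one half as n grows.

```python
def nonoverlapped_time(n):
    """cc's needed for n iterations with the non-overlapped schedule."""
    return 3 * n

def overlapped_time(n):
    """cc's needed for n iterations with the overlapped schedule."""
    return 2 * n + 1

def throughput(n, time_fn):
    """Outputs per cc for n iterations."""
    return n / time_fn(n)

def relative_gain(n):
    """Throughput gain of the overlapped schedule relative to the non-overlapped one."""
    return throughput(n, overlapped_time) / throughput(n, nonoverlapped_time) - 1

for n in (4, 100, 10_000):
    print(n, nonoverlapped_time(n), overlapped_time(n), f"{relative_gain(n):.1%}")
```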
8
Simple HLS Examples (contd)
Some DFG control operation nodes:
[Figure: a Distributor node (input in, a T/F condition, outputs out1 and out2 for the T and F cases) and a Selector node (inputs in1 and in2, a T/F condition, output out).]
Conditional code: if (a > b) then c ← a-b; else c ← b-a;
Possible DFGs corresponding to the above conditional code: [figure not recovered]
Note that the 2 subtractions in the left DFG do not mean that an HLS algorithm will use 2 subtractors/adders. A good one will use 1, shared in a mutually exclusive way between the two subtractions, which lie in different branches of the if-then-else.
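The sharing argument can be made concrete with a small behavioral model. The sketch below is an assumption about how such sharing might look, not the slide’s architecture: two 2-to-1 muxes steer the operands so that a single subtraction serves both branches. The names `mux` and `abs_diff_shared_sub` are hypothetical.

```python
def mux(sel, in0, in1):
    """2-to-1 multiplexer: returns in1 when sel is true, else in0."""
    return in1 if sel else in0

def abs_diff_shared_sub(a, b):
    """if (a > b) c <- a-b else c <- b-a, using one subtractor."""
    cond = a > b                # comparison node
    left = mux(cond, b, a)      # minuend: a in the then-branch, b in the else-branch
    right = mux(cond, a, b)     # subtrahend: b in the then-branch, a in the else-branch
    return left - right         # the single, mutually exclusively shared subtractor

print(abs_diff_shared_sub(7, 3), abs_diff_shared_sub(3, 7))
```

Because the two branches never execute in the same pass, the operand muxes are the only extra hardware in this model.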
9
Simple HLS Examples (contd)
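Iterative code: while (a > b) a ← a-b;
[Figure: DFG with a > comparison node, a - (subtract) node, and dist and sel nodes with T/F branches; the sel condition is initialized to F.]
Scheduling & binding: (a) Scheduling (using only 1 adder/sub). [Figure: c1 and c2 scheduled on the single + over successive cc’s.]
(b) Arch. Synthesis. [Figure: datapath with registers a and b (load signals lda, ldb), a mux feeding the adder, which receives b’ and cin = 1 (b’+1 = 2’s complement of b, i.e., -b), a result register r1 (ldr1), and a demux to the final a (ldfina). The adder’s sign bit s and overflow bit ovfl give s xor ovfl (= 1: -ve, = 0: +ve), which goes to the FSM.]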
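A bit-level sketch may help show how one adder can serve both the comparison and the subtraction. Everything here is an assumption for illustration: the 8-bit word width, the names `add_sub` and `reduce_loop`, and the choice to test a > b by computing b - a and checking s xor ovfl. The loop is also restricted to b > 0 so that it terminates.

```python
WIDTH = 8                       # assumed word width
MASK = (1 << WIDTH) - 1

def add_sub(x, y, subtract):
    """x + y, or x + y' + 1 (i.e., x - y) when subtract is set, on WIDTH-bit words.
    Returns (result word, s xor ovfl); s xor ovfl = 1 means the true result is -ve."""
    y_eff = (~y & MASK) if subtract else (y & MASK)
    total = (x & MASK) + y_eff + (1 if subtract else 0)   # cin = 1 for subtraction
    result = total & MASK
    s = result >> (WIDTH - 1)
    sx = (x & MASK) >> (WIDTH - 1)
    sy = y_eff >> (WIDTH - 1)
    ovfl = int(sx == sy and s != sx)   # operand signs agree but the result sign differs
    return result, s ^ ovfl

def to_signed(word):
    """Interpret a WIDTH-bit word as a 2's complement integer."""
    return word - (1 << WIDTH) if word >> (WIDTH - 1) else word

def reduce_loop(a, b):
    """while (a > b) a <- a - b, for b > 0, using only add_sub."""
    assert b > 0, "b must be positive for the loop to terminate"
    a &= MASK
    b &= MASK
    while True:
        _, a_gt_b = add_sub(b, a, subtract=True)   # c1: b - a is -ve exactly when a > b
        if not a_gt_b:
            return to_signed(a)
        a, _ = add_sub(a, b, subtract=True)        # c2: a <- a - b

print(reduce_loop(17, 5), reduce_loop(10, 5), reduce_loop(-3, 5))
```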
10
Delay Nodes in DFGs
A delay node is generally implemented as a register (or a series of registers if clock period < T0); a delay node thus becomes a state variable.
11
Delay Nodes in DFGs (contd)
[Figure: Transformation in the DFG; label: register.]
Mapping to the architecture w/ the register decoupling input and output, s.t. register i/p = o/p of the combinational part and register o/p = i/p of the combinational part. These can be treated as independent of each other, as they become available in different time steps (e.g., clock cycles).
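The decoupling role of the register can be illustrated with a cycle-by-cycle simulation. The accumulator used as the combinational part below is a hypothetical example, not the DFG from the slide; the point is only that the register output read in a cycle is the register input written in the previous one.

```python
def combinational(x, state):
    """Hypothetical combinational part: an accumulator, y = x + previous y."""
    y = x + state
    return y, y                 # (primary output, register input)

def run(inputs, reset_value=0):
    reg = reset_value           # register o/p = i/p of the combinational part
    outputs = []
    for x in inputs:            # one loop pass models one clock cycle
        y, reg_in = combinational(x, reg)
        outputs.append(y)
        reg = reg_in            # clock edge: register i/p becomes next cycle's o/p
    return outputs

print(run([1, 2, 3, 4]))
```

Within one loop pass, `reg` (the register output) and `reg_in` (the register input) are separate values, which is why they can be treated as independent when scheduling.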
12
Detailed HLS Example
13
Detailed HLS Example (contd)
Scheduling heuristic: Among available opers, schedule on the available FUs those whose delay to o/p is the highest, breaking ties in favor of those opers u whose “sibling” o/ps (o/ps to the same children) are avail. or will be available closest to u’s earliest finish (i.e., the asap time of the child is earliest); otherwise the FU(s) will be idle unnecessarily, leading to a larger latency (this also reduces the lifetimes of sibling o/ps).
[Figure: Different paths (i/p → o/p) in the DFG.]
(a) Scheduling w/ one X (2 cc’s) & one + (1 cc); Goal: Minimize latency.
(b) Reg. alloc. for o/p of operations. [Figure annotation: For WAR constraint (can’t store in d1 as would be natural, as d1’s current data is yet to be consumed by c6, which has not been scheduled yet).]
(c) Arch. synthesis. [Figure: The synthesized architecture.]
Note: The above register allocation has been done w/ separate regs for multiplier and adder o/ps. It is sub-optimal (4 non-primary i/p regs. needed).
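The primary priority of the heuristic, the delay to o/p, is a longest-path computation. The sketch below computes it on a hypothetical DFG (the slide’s own DFG is not reproduced here), using the X = 2 cc’s and + = 1 cc delays from the slides. The sibling-based tie-break is not modeled; ties are broken by name, which is an arbitrary choice.

```python
from functools import lru_cache

DELAY = {"X": 2, "+": 1}        # cc's per FU type, as in the slides

# Hypothetical DFG (not the one in the slides): op -> (FU type, child ops)
DFG = {
    "c1": ("X", ["c3"]),
    "c2": ("+", ["c3"]),
    "c3": ("+", ["c5"]),
    "c4": ("X", ["c5"]),
    "c5": ("X", []),
}

@lru_cache(maxsize=None)
def delay_to_output(op):
    """Longest path delay from the start of op to a primary output, op included."""
    fu, children = DFG[op]
    return DELAY[fu] + max((delay_to_output(c) for c in children), default=0)

def by_priority(available):
    """Order available ops by decreasing delay to o/p (ties broken by name here)."""
    return sorted(available, key=lambda op: (-delay_to_output(op), op))

print({op: delay_to_output(op) for op in DFG})
print(by_priority(["c1", "c2", "c4"]))
```

A scheduler would take, for each FU type, the first op of that type in this order whenever an FU of that type is free.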
14
Detailed HLS Example (contd)
15
Detailed HLS Example—Register Allocation
16
Detailed HLS Example—Register Allocation (contd)
3 non-primary i/p regs. needed. Scheduling heuristic: As stated earlier.
In the conflict graph (one per FU [as here] if regs are grouped by FU, one per FU type if regs are shared across each FU type, or only one [global] if regs are shared across all FUs), there is an edge between 2 variable nodes if their lifetimes overlap (indicating that different registers need to be allocated to them).
Graph coloring (using the min. # of colors to color the nodes s.t. connected node pairs have different colors) is NP-hard in general.
The above type of conflict graph is called an interval graph (derived from the 1-dimensional intervals of the lifetimes).
Min. graph coloring can be solved optimally in linear time for interval graphs (using the left-edge algorithm that we will see later for channel routing).
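Register allocation on an interval graph can be sketched as follows. This is a greedy variant in the spirit of the left-edge algorithm (variables are processed in order of their left edge, i.e., lifetime start, and each is placed in the first register that is free); the variable names and lifetimes are hypothetical, and lifetimes are assumed to be half-open intervals [start, end).

```python
def left_edge(lifetimes):
    """lifetimes: {var: (start, end)}, half-open [start, end).
    Returns {var: register index}; lifetimes sharing a register never overlap."""
    reg_free_at = []            # reg_free_at[r] = end of the last lifetime placed in register r
    assignment = {}
    for var, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
        for r, free_at in enumerate(reg_free_at):
            if free_at <= start:            # no overlap: reuse register r
                reg_free_at[r] = end
                assignment[var] = r
                break
        else:
            reg_free_at.append(end)         # every register is live: allocate a new one
            assignment[var] = len(reg_free_at) - 1
    return assignment

# Hypothetical lifetimes in cc's
lifetimes = {"v1": (1, 3), "v2": (2, 4), "v3": (3, 5), "v4": (4, 6)}
regs = left_edge(lifetimes)
print(regs, "registers used:", len(set(regs.values())))
```

A new register is opened only when every existing register holds a variable that is live at the current start time, so the register count equals the maximum number of simultaneously live variables, which is the minimum possible.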
17
Detailed HLS Example—Register Allocation (contd)
3 non-primary i/p regs. needed.
Scheduling heuristic: Among available opers, schedule on the available FUs those whose delay to o/p is the highest, breaking ties arbitrarily: B’s lifetime increases, but D’s (dep. of B) decreases similarly. The heuristic should therefore be based on more global information.