ECE 565 High-Level Synthesis—An Introduction Shantanu Dutt ECE Dept., UIC
HLS Flow Code/Algorithm to Architecture (interconnected functional units (FUs) and memory units (MUs), connected via muxes, demuxes, tristate buffers, buses, and dedicated interconnects). Classically, these 3 stages were performed sequentially, but they are now performed together, which leads to better optimization.
HLS Flow (contd)
Allocation: Simple counting of FUs after the above 2 stages (Binding)
Simple HLS Examples
Simple HLS Examples (contd)
2) Mapping to h/w w/ constraints: use only 1 multiplier (X) and 1 adder (+), w/ an X delay of 2 cc's and a + delay of 1 cc.
i) Non-overlapped pipelined scheduling
(a) Scheduling [Figure: schedule chart over cc's, with c1(1), c1(2) on X and c2(1), c3(1), c2(2), c3(2) on +.]
(b) Arch. Synthesis [Figure: datapath with registers a, b, c, d, x, y, z and load signals lda, ldb, ldc, ldd, ldx, ldy, ldz; X and + FUs; mux1 and mux2 (inputs I0, I1) and a demux (outputs O0, O1).]
(c) Controller FSM Synthesis (entered from Reset; timing labels in the figure: cc 3i, cc 3(i+1), cc 3(i+2)):
  (c1) [x ← a × b]: ldx=1
  (c2) [y ← c + d]: mux1=0, mux2=0, demux=0, ldy=1
  (c3) [z ← x + y]: lda=1, ldb=1, ldc=1, ldd=1, mux1=1, mux2=1, demux=1, ldz=1
Note: A register is loaded at the +ve/-ve edge (in a +ve/-ve edge-triggered system) of the cc after the one in which its load signal is asserted (e.g., lda = 1 in one cc, reg. "a" loaded at the following edge).
Note: Unspecified control signals have either an inactive value or, if no such concept exists for the control signal, the don't-care value.
Simple HLS Examples (contd)
2) Mapping to h/w w/ constraints: use only 1 multiplier (X) and 1 adder (+) (cont'd)
ii) Overlapped pipelined scheduling
(a) Scheduling [Figure: schedule chart over cc's, with c1(1), c1(2) on X and c2(1), c3(1), c2(2), c3(2) on +.]
(b) Arch. Synthesis [Figure: datapath with registers a, b, c, d, x, y, z and load signals lda, ldb, ldc, ldd, ldx, ldy, ldz; X and + FUs; mux1 and mux2 (inputs I0, I1) and a demux.]
(c) Controller FSM Synthesis (entered from Reset; timing labels in the figure: cc 3i, cc 3(i+1)):
  (c1, c2) [y ← c + d, x ← a × b]: lda=1, ldb=1, mux1=0, mux2=0, demux=0, ldy=1, ldx=1
  (c3) [z ← x + y]: ldc=1, ldd=1, mux1=1, mux2=1, demux=1, ldz=1
For 4 iterations, the overlapped schedule takes 9 cc's versus 12 cc's for the non-overlapped schedule.
  Overlapped sched.: time for n iterations = 2n+1 cc's; throughput = n/(2n+1) ≈ 0.5 outputs/cc for large n.
  Non-overlapped sched.: time for n iterations = 3n cc's; throughput = n/3n ≈ 0.33 outputs/cc.
The overlapped schedule gains about 0.17 outputs/cc, i.e., ~34% when measured against the overlapped throughput, or 50% when measured against the non-overlapped throughput.
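The cycle counts and throughputs quoted above can be checked with a few lines of Python. This is only a sketch of the two formulas from the slide (3n and 2n+1 cc's); the function names are made up for illustration.

```python
def cycles_nonoverlapped(n):
    """cc's for n iterations when each iteration takes 3 cc's with no overlap."""
    return 3 * n


def cycles_overlapped(n):
    """cc's for n iterations with the overlapped schedule: 2 cc's each, plus 1."""
    return 2 * n + 1


def throughput(n, cycles):
    """Outputs per cc for n iterations under the given cycle-count function."""
    return n / cycles(n)


if __name__ == "__main__":
    for n in (4, 100, 10_000):
        print(n,
              cycles_nonoverlapped(n), round(throughput(n, cycles_nonoverlapped), 3),
              cycles_overlapped(n), round(throughput(n, cycles_overlapped), 3))
```

For n = 4 this reproduces the 12 cc's versus 9 cc's comparison; as n grows, the overlapped throughput approaches 0.5 outputs/cc while the non-overlapped one stays at 1/3.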
Simple HLS Examples (contd)
Some DFG control operation nodes:
  Distributor: inputs Condition (T/F) and in; outputs out1 and out2 (T and F branches).
  Selector: inputs Condition (T/F), in1 and in2 (T and F branches); output out.
Conditional code:
  If (a > b) then c ← a-b;
  Else c ← b-a;
Possible DFGs corresponding to the above conditional code: [figures]
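As a rough illustration of how the two control nodes behave, the sketch below models a distributor and a selector as Python functions and uses them to evaluate the conditional code in two ways: computing both differences and selecting one, or distributing the operands to only the branch that is taken. The function names and the use of None for an unused output are assumptions made for this sketch, not something taken from the slides.

```python
def distributor(condition, value):
    """Route value to the T output if condition holds, else to the F output.

    Returns (out_T, out_F); the output that is not driven is None (an assumption).
    """
    return (value, None) if condition else (None, value)


def selector(condition, in_true, in_false):
    """Pass in_true to the output if condition holds, else in_false."""
    return in_true if condition else in_false


def cond_with_selector(a, b):
    """Compute both a-b and b-a, then select one (speculative form)."""
    return selector(a > b, a - b, b - a)


def cond_with_distributors(a, b):
    """Distribute a and b to the branch that is taken, then merge the result."""
    cond = a > b
    a_t, a_f = distributor(cond, a)
    b_t, b_f = distributor(cond, b)
    t_result = a_t - b_t if cond else None      # T branch: c <- a-b
    f_result = b_f - a_f if not cond else None  # F branch: c <- b-a
    return selector(cond, t_result, f_result)
```

Both functions return the same value for any a and b; the selector-only form uses two subtractions per evaluation, while the distributor form activates only one branch.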
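Simple HLS Examples (contd)
Iterative code:
  while (a > b) a ← a-b;
[Figure: DFG with a ">" comparison of a and b, distributor (dist) and selector (sel) nodes with T/F branches, a "-" node, a condition initialized to F, and the output "final a".]
Scheduling & binding:
(a) Scheduling (using only 1 adder/subtractor) [Figure: c1 and c2 scheduled on + over cc's.]
(b) Arch. Synthesis [Figure: registers a, b, r1 and "final a" with load signals lda, ldb, ldr1, ldfina; a mux and a demux; the adder takes b' with cin = 1, where b' + 1 is the 2's-complement representation of -b; s xor ovfl (1 = -ve, 0 = +ve) goes to the FSM.]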
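The datapath idea in this example (one adder doing subtraction as a + b' + 1, with s xor ovfl as the true sign sent to the FSM) can be sketched in Python as follows. The 8-bit width, the function names, and the choice to test a > b as "b - a is negative" are assumptions made for this sketch; the slide does not spell out how the comparison is derived from the sign.

```python
def add_sub(x, y, subtract, width=8):
    """width-bit 2's-complement adder: x + y, or x + y' + 1 when subtract is set.

    Returns (result_bits, negative), where negative = s xor ovfl is the true
    sign of the result even when the width-bit result overflows.
    """
    mask = (1 << width) - 1
    x &= mask
    y_in = (~y & mask) if subtract else (y & mask)  # y' (bitwise complement)
    cin = 1 if subtract else 0
    total = (x + y_in + cin) & mask
    s = total >> (width - 1)                         # sign bit of the result
    sx, sy = x >> (width - 1), y_in >> (width - 1)
    ovfl = 1 if (sx == sy and s != sx) else 0        # signed overflow
    return total, s ^ ovfl


def iterate(a, b, width=8):
    """while (a > b) a <- a - b, using only add_sub.

    Assumes 0 <= a, 0 < b, both representable in width bits (b = 0 with a > 0
    would loop forever, exactly as the source code would).
    """
    while add_sub(b, a, True, width)[1] == 1:  # b - a negative  <=>  a > b
        a, _ = add_sub(a, b, True, width)      # a <- a - b
    return a
```

For example, iterate(17, 5) steps through 12, 7, 2 and stops at 2, since 2 > 5 is false.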
Delay Nodes in DFGs A delay node is generally implemented as a register; a delay node thus becomes a state variable.
Delay Nodes in DFGs (contd) [Figure: transformation of a delay node in the DFG, and its mapping to a register in the architecture.]
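To make the "delay node = register = state variable" point concrete, here is a small Python sketch. The DFG used, y[n] = x[n] + y[n-1], is a hypothetical example chosen for illustration (it is not the one in the slide figure), and the class and function names are likewise made up.

```python
class DelayNode:
    """A one-sample delay, modelled as a register holding a state variable."""

    def __init__(self, initial=0):
        self.q = initial  # register output: the value stored at the last clock

    def clock(self, d):
        """Latch d at the clock edge and return the previously stored value."""
        old, self.q = self.q, d
        return old


def running_sum(samples):
    """Evaluate y[n] = x[n] + y[n-1], with y[-1] = 0 held in a delay node."""
    reg = DelayNode(0)
    outputs = []
    for x in samples:
        y = x + reg.q   # the adder reads the delayed value from the register
        reg.clock(y)    # the register is loaded with y for the next iteration
        outputs.append(y)
    return outputs
```

Here the register contents are the only state carried from one iteration to the next, which is what makes the delay node a state variable of the synthesized circuit.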
Detailed HLS Example
Detailed HLS Example (contd)
(a) Scheduling w/ one X (2 cc's) & one + (1 cc); goal: min. latency. [Figure also marks the different paths (i/p to o/p) in the DFG.]
(b) Reg. alloc. for o/p of operations [Figure annotation: "For WAR constraint".]
(c) Arch. synthesis: the synthesized architecture.
Note: Not clear how register allocation has been done. It is sub-optimal (4 non-primary i/p regs. needed).
Scheduling heuristic: Among the available opers, schedule on the available FUs those whose delay to o/p is the highest, breaking ties in favor of those opers u whose "sibling" o/ps (o/ps to the same children) that are avail., or will be avail. at u's earliest finish, will have the largest lifetime at that point.
Detailed HLS Example (contd)
Detailed HLS Example—Register Allocation
Detailed HLS Example—Register Allocation (contd)
Result: 3 non-primary i/p regs. needed.
In the conflict graph (one per FU), there is an edge between 2 variable nodes if their lifetimes overlap (indicating that different registers need to be allocated to them).
Graph coloring (using the min. # of colors to color the nodes s.t. connected node pairs have different colors) is NP-hard in general.
The above type of conflict graph is called an interval graph (derived from the 1-dimensional intervals of the lifetimes).
Min. graph coloring can be solved optimally in linear time for interval graphs once the intervals are sorted by left endpoint (using the left-edge algorithm that we will see later for channel routing).
Scheduling heuristic: Among the available opers, schedule on the available FUs those whose delay to o/p is the highest, breaking ties in favor of those opers u whose "sibling" o/ps (o/ps to the same children) that are avail., or will be avail. at u's earliest finish, will have the largest lifetime at that point.
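A minimal sketch of left-edge register allocation is given below, assuming each variable's lifetime is a half-open interval [start, end) in cc's, so a variable that dies at cc t can share a register with one born at cc t. The lifetimes in the usage note are hypothetical and are not those of the slide example. This simple version rescans the leftover intervals once per register, so it is not the linear-time variant.

```python
def left_edge(lifetimes):
    """Assign variables to registers with the left-edge algorithm.

    lifetimes: dict mapping variable name -> (start, end), half-open [start, end).
    Returns (assignment, num_registers), where assignment maps name -> register index.
    """
    # Sort variables by the left edge (start) of their lifetime.
    remaining = sorted(lifetimes, key=lambda v: lifetimes[v][0])
    assignment = {}
    num_registers = 0
    while remaining:
        last_end = None
        leftover = []
        # Fill one register with as many non-overlapping lifetimes as possible.
        for v in remaining:
            start, end = lifetimes[v]
            if last_end is None or start >= last_end:
                assignment[v] = num_registers
                last_end = end
            else:
                leftover.append(v)
        num_registers += 1
        remaining = leftover
    return assignment, num_registers
```

For instance, with the hypothetical lifetimes {"A": (0, 2), "B": (1, 3), "C": (2, 4), "D": (3, 5)}, A and C share one register and B and D share another, which matches the maximum number of simultaneously live variables (2).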
Detailed HLS Example—Register Allocation (contd)
Result: 3 non-primary i/p regs. needed.
Scheduling heuristic: Among the available opers, schedule on the available FUs those whose delay to o/p is the highest, breaking ties arbitrarily. B's lifetime increases, but D's (dep. of B) decreases similarly; the heuristic should be based on more global information.
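The simpler form of the heuristic (highest delay to o/p first, ties broken arbitrarily) can be sketched as a resource-constrained list scheduler. The sketch assumes non-pipelined FUs, a DAG, and at least one FU per operation type; ties are broken by operation name, which is one arbitrary but deterministic choice. The DFG in the usage note (m1, m2, s1, s2) is hypothetical and is not the one from the detailed example.

```python
def list_schedule(ops, preds, delay, fu_count):
    """Resource-constrained list scheduling.

    ops: dict op name -> op type; preds: dict op name -> list of predecessor ops;
    delay: dict op type -> cc's; fu_count: dict op type -> number of FUs.
    Returns (start, finish) dicts giving the start and finish cc of each op.
    """
    succs = {u: [] for u in ops}
    for v, ps in preds.items():
        for p in ps:
            succs[p].append(v)

    memo = {}

    def delay_to_output(u):
        """Longest path delay from the start of u to any output."""
        if u not in memo:
            memo[u] = delay[ops[u]] + max(
                (delay_to_output(s) for s in succs[u]), default=0)
        return memo[u]

    start, finish = {}, {}
    free_at = {t: [0] * n for t, n in fu_count.items()}  # cc at which each FU frees up
    cc = 0
    while len(start) < len(ops):
        for t, frees in free_at.items():
            # Ops of type t whose predecessors have all finished by this cc.
            ready = [u for u in ops if ops[u] == t and u not in start
                     and all(p in finish and finish[p] <= cc
                             for p in preds.get(u, []))]
            # Highest delay to output first; ties broken by name (arbitrary).
            ready.sort(key=lambda u: (-delay_to_output(u), u))
            for i in range(len(frees)):
                if frees[i] <= cc and ready:
                    u = ready.pop(0)
                    start[u], finish[u] = cc, cc + delay[t]
                    frees[i] = finish[u]
        cc += 1
    return start, finish


if __name__ == "__main__":
    ops = {"m1": "*", "m2": "*", "s1": "+", "s2": "+"}
    preds = {"s1": ["m1", "m2"], "s2": ["s1"]}
    print(list_schedule(ops, preds, {"*": 2, "+": 1}, {"*": 1, "+": 1}))
```

With one 2-cc multiplier and one 1-cc adder, the two multiplications serialize (cc 0 and cc 2), followed by s1 at cc 4 and s2 at cc 5. Replacing the name-based tie-break with a lifetime-aware rule is where the sibling-o/p criterion would plug in.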