Published by Dominic Owens. Modified over 6 years ago.
1
High-Level Synthesis: Creating Custom Circuits from High-Level Code
Hao Zheng, Comp Sci & Eng, University of South Florida
2
Existing FPGA Tool Flow
Register-transfer (RT) synthesis
  Specify RT structure (muxes, registers, etc.)
  Allows precise specification
  But time-consuming, difficult, and error-prone
[Figure: tool flow. Synthesizable HDL -> RT Synthesis -> Netlist -> Physical Design (Technology Mapping, Placement, Routing) -> Bitfile -> Processor + FPGA]
3
Future FPGA Tool Flow
[Figure: tool flow. C/C++, Java, etc. -> High-level Synthesis -> Synthesizable HDL -> RT Synthesis -> Netlist -> Physical Design (Technology Mapping, Placement, Routing) -> Bitfile -> Processor + FPGA]
4
High-level Synthesis - Benefits
Ratio of C to VHDL developers (10000:1?)
Easier to specify complex designs
Technology/architecture-independent design
Manual HW design is potentially slower
  Similar to the assembly-code era: programmers could always beat the compiler, but that is no longer the case
Ease of HW/SW partitioning enhances overall system efficiency
More efficient verification and validation (V&V)
  High-level code is easier to verify and validate
5
High-level Synthesis: More Challenging than SW Compilation
Compilation maps behavior into assembly instructions
  Architecture is known to the compiler
HLS creates a custom architecture to execute the specified behavior
  Huge hardware exploration space
  Best solution may include microprocessors
  Ideally, should handle any high-level code
    But not all code is appropriate for hardware
6
High-level Synthesis: An Example
First, consider how to manually convert high-level code into a circuit
Steps:
  1) Build an FSM for the controller
  2) Build a datapath based on the FSM
Code:
  acc = 0;
  for (i=0; i < 128; i++)
    acc += a[i];
7
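A Manual Example
Build an FSM (controller)
  Decompose the code into states
Code: acc = 0; for (i=0; i < 128; i++) acc += a[i];
[Figure: FSM. Initial state "acc=0, i=0"; while i < 128 the controller visits "load a[i]", "acc += a[i]", and "i++"; when i >= 128 it enters "Done".]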
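The state decomposition above can be mimicked by a small software model. The following is a minimal sketch, not generated hardware: the state names, the explicit loop-test state S_TEST, the parameter n (128 on the slide), and the function name run_acc_fsm are all assumptions made for illustration, and one state per clock cycle is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Controller states, roughly one per FSM bubble on the slide.
 * S_TEST is an assumed extra state that evaluates the loop condition. */
enum state { S_INIT, S_TEST, S_LOAD, S_ADD, S_INC, S_DONE };

/* Steps the FSM until Done and returns acc. If cycles is non-NULL it
 * receives the number of states visited before Done. */
int run_acc_fsm(const int *a, int n, int *cycles)
{
    enum state s = S_INIT;
    int acc = 0, i = 0, loaded = 0, steps = 0;

    assert(a != NULL && n >= 0);
    while (s != S_DONE) {
        switch (s) {
        case S_INIT: acc = 0; i = 0;  s = S_TEST;                 break;
        case S_TEST:                  s = (i < n) ? S_LOAD : S_DONE; break;
        case S_LOAD: loaded = a[i];   s = S_ADD;                  break;
        case S_ADD:  acc += loaded;   s = S_INC;                  break;
        case S_INC:  i++;             s = S_TEST;                 break;
        case S_DONE:                                              break;
        }
        steps++;
    }
    if (cycles != NULL)
        *cycles = steps;
    return acc;
}
```

With this decomposition each loop iteration costs four states, which is the kind of cost that later scheduling decisions try to reduce.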
8
A Manual Example
Build a datapath
  Allocate resources for each state
Code: acc = 0; for (i=0; i < 128; i++) acc += a[i];
[Figure: the FSM next to the allocated datapath resources: registers i, acc, a[i], addr; constants 1, 128, 1; three adders (+) and one comparator (<).]
9
A Manual Example
Build a datapath
  Determine register inputs
Code: acc = 0; for (i=0; i < 128; i++) acc += a[i];
[Figure: the same datapath with register inputs added: three 2x1 muxes, the base address &a, and the "In from memory" input.]
10
A Manual Example
Build a datapath
  Add outputs
Code: acc = 0; for (i=0; i < 128; i++) acc += a[i];
[Figure: the same datapath with outputs added: acc and Memory address.]
11
A Manual Example
Build a datapath
  Add control signals
Code: acc = 0; for (i=0; i < 128; i++) acc += a[i];
[Figure: the same datapath with control signals added to the muxes and registers.]
12
A Manual Example
Combine controller + datapath
Code: acc = 0; for (i=0; i < 128; i++) acc += a[i];
[Figure: the controller (input Start; outputs Done and Memory Read) driving the datapath (input "In from memory"; outputs acc and Memory address).]
13
A Manual Example
Alternatives
  Use one adder (plus muxes)
[Figure: datapath with registers i, acc, a[i], addr, one comparator (<), and a single adder (+) whose two inputs are selected by muxes; outputs acc and Memory address.]
14
A Manual Example
Comparison with high-level synthesis
  Determining when to perform each operation => Scheduling
  Allocating resources for each operation => Resource allocation
  Mapping operations onto resources => Binding
15
Another Example: Try it at home
Your turn
  x = 0;
  for (i=0; i < 100; i++) {
    if (a[i] > 0)
      x++;
    else
      x--;
    a[i] = x;
  }
  // output x
Steps:
  1) Build FSM (do not perform if-conversion)
  2) Build datapath based on FSM
16
High-Level Synthesis
[Figure: high-level code -> High-Level Synthesis -> custom circuit]
Input: high-level code; could be C, C++, Java, Perl, Python, SystemC, ImpulseC, etc.
Output: custom circuit; usually an RT-level VHDL/Verilog description, but could be as low-level as a bitfile
17
High-Level Synthesis – Overview
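Code: acc = 0; for (i=0; i < 128; i++) acc += a[i];
[Figure: the code passes through High-Level Synthesis and yields the controller + datapath from the manual example: registers acc, i, addr, a[i]; adders; a comparator against 128; 2x1 muxes; base address &a; signals "In from memory", Memory address, Memory Read, and Done.]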
18
Main Steps
[Figure: flow. High-level Code -> Syntactic Analysis (front end) -> Intermediate Representation -> Optimization -> Scheduling/Resource Allocation -> Binding/Resource Sharing (back end) -> Cycle-accurate RTL code]
Syntactic Analysis: converts code to an intermediate representation, which lets all following steps use a language-independent format
Scheduling/Resource Allocation: determines when each operation will execute and which resources are used
Binding/Resource Sharing: maps operations onto physical resources
19
Syntactic Analysis
Definition: analysis of code to verify syntactic correctness
  Converts code into an intermediate representation
Steps (similar to SW compilation):
  1) Lexical analysis (lexing)
  2) Parsing
  3) Code generation: intermediate representation
[Figure: High-level Code -> Lexical Analysis -> Parsing -> Intermediate Representation; lexing and parsing together form Syntactic Analysis.]
20
Intermediate Representation
Parser converts tokens to an intermediate representation
  Usually an abstract syntax tree
Code:
  x = 0;
  if (y < z)
    x = 1;
  d = 6;
[Figure: abstract syntax tree with three subtrees: assign(x, 0); if(cond(y < z), assign(x, 1)); assign(d, 6).]
21
Intermediate Representation
Why use an intermediate representation?
  Easier to analyze/optimize than source code
  Theoretically can be used for all languages
  Makes the synthesis back end language-independent
[Figure: C code, Java, and Perl each pass through their own Syntactic Analysis into a shared Intermediate Representation, which feeds the Back End. Scheduling, resource allocation, and binding are independent of the source language; sometimes optimizations are too.]
22
Intermediate Representation
Different types
  Abstract Syntax Tree
  Control/Data Flow Graph (CDFG)
  Sequencing Graph
  ...
We will focus on the CDFG
  Combines a control flow graph (CFG) and a data flow graph (DFG)
23
Control Flow Graphs (CFGs)
Represents control flow dependencies of basic blocks
A basic block is a section of code that always executes from beginning to end
  I.e., no jumps into or out of the block
Code: acc = 0; for (i=0; i < 128; i++) acc += a[i];
[Figure: CFG. "acc=0, i=0" -> "i < 128?"; yes -> "acc += a[i]; i++" -> back to "i < 128?"; no -> "Done".]
24
Control Flow Graphs: Your Turn
Find a CFG for the following code.
  i = 0;
  while (i < 10) {
    if (x < 5)
      y = 2;
    else if (z < 10)
      y = 6;
    i++;
  }
25
Data Flow Graphs
Represents data dependencies between operations
Code: x = a+b; y = c*d; z = x - y;
[Figure: DFG. a and b feed + (producing x); c and d feed * (producing y); x and y feed - (producing z).]
26
Control/Data Flow Graph
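Combines CFG and DFG
  Maintains a DFG for each node of the CFG
Code: acc = 0; for (i=0; i < 128; i++) acc += a[i];
[Figure: CDFG. CFG nodes "acc=0; i=0;", "if (i < 128)", "acc += a[i]; i++", and "Done"; the loop-body node holds a DFG with two adders, acc + a[i] -> acc and i + 1 -> i.]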
27
Optimization
28
Synthesis Optimizations
After creating the CDFG, high-level synthesis optimizes it with the following goals:
  Reduce area
  Improve latency
  Increase parallelism
  Reduce power/energy
2 types of optimizations:
  Data flow optimizations
  Control flow optimizations
29
Data Flow Optimizations
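Tree-height reduction
  Generally made possible by commutativity, associativity, and distributivity
  Example: x = a + b + c + d
[Figure: DFGs over inputs a, b, c, d before and after reduction: a chain of dependent adders becomes a shallower adder tree. A second before/after pair mixes * and + nodes.]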
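A small sketch of the x = a + b + c + d example, written so that the dependency depth is visible in the code. The function names are hypothetical, and the equivalence relies on associativity of integer addition (overflow is assumed not to occur).

```c
#include <assert.h>

/* Chain form: ((a + b) + c) + d. Each addition waits for the previous
 * one, so the DFG is 3 adders deep. */
int sum_chain(int a, int b, int c, int d)
{
    int t1 = a + b;
    int t2 = t1 + c;
    return t2 + d;
}

/* Tree form: (a + b) + (c + d). t1 and t2 do not depend on each other,
 * so they can execute in the same cycle; the DFG is 2 adders deep. */
int sum_tree(int a, int b, int c, int d)
{
    int t1 = a + b;
    int t2 = c + d;
    return t1 + t2;
}

/* Sanity check that both forms compute the same value. */
void check_forms_agree(void)
{
    assert(sum_chain(1, 2, 3, 4) == sum_tree(1, 2, 3, 4));
    assert(sum_chain(-5, 7, 0, 9) == sum_tree(-5, 7, 0, 9));
}
```

The tree form uses the same number of adders but shortens the critical path, at the cost of needing two adders in the first cycle.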
30
Data Flow Optimizations
Operator strength reduction
  Replacing an expensive ("strong") operation with a faster one
  Common example: replacing multiply/divide with shift
Before (1 multiplication)    After (0 multiplications)
b[i] = a[i] * 8;             b[i] = a[i] << 3;
a = b * 5;                   c = b << 2; a = b + c;
a = b * 13;                  c = b << 2; d = b << 3; a = c + d + b;
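The shift-and-add rewrites for the constants 5 and 13 can be checked directly. This is a minimal sketch with hypothetical function names; unsigned operands are an assumption chosen so that shifts are well defined for every input.

```c
#include <assert.h>

/* Reference versions using one multiplication each. */
unsigned times5_mul(unsigned b)  { return b * 5u; }
unsigned times13_mul(unsigned b) { return b * 13u; }

/* Strength-reduced versions: 5 = 4 + 1 and 13 = 4 + 8 + 1. */
unsigned times5_shift(unsigned b)
{
    unsigned c = b << 2;          /* 4b */
    return b + c;
}

unsigned times13_shift(unsigned b)
{
    unsigned c = b << 2;          /* 4b */
    unsigned d = b << 3;          /* 8b */
    return c + d + b;
}

/* Compares both versions over a range of inputs. */
void check_strength_reduction(void)
{
    for (unsigned b = 0; b < 1000u; b++) {
        assert(times5_mul(b) == times5_shift(b));
        assert(times13_mul(b) == times13_shift(b));
    }
}
```

In hardware a shift by a constant is only wiring, so the reduced forms trade a multiplier for one or two adders.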
31
Data Flow Optimizations
Constant propagation
  Statically evaluate expressions with constants
Before: x = 0; y = x * 15; z = y + 10;
After:  x = 0; y = 0; z = 10;
32
Data Flow Optimizations
Function specialization
  Create specialized code for common inputs
    Treat common inputs as constants
  If inputs are not known statically, must include an if statement for each call to the specialized function
Before:
  int f (int x) { y = x * 15; return y + 10; }
  for (I=0; I < 1000; I++) f(0); … }
After (treat the frequent input as a constant):
  int f (int x) { y = x * 15; return y + 10; }
  int f_opt () { return 10; }
  for (I=0; I < 1000; I++) f_opt(); … }
33
Data Flow Optimizations
Common sub-expression elimination
  If an expression appears more than once, the repetitions can be replaced
Before: a = x + y; b = c * 25 + x + y;
After:  a = x + y; b = c * 25 + a;   (x + y already determined)
34
Data Flow Optimizations
Dead code elimination
  Remove code that is never executed
  Such code may look contrived, but it often results from constant propagation or function specialization
Before:
  int f (int x) { if (x > 0) a = b * 15; else a = b / 4; return a; }
After:
  int f_opt () { a = b * 15; return a; }
The specialized version for x > 0 does not need the else branch: it is "dead code".
35
Data Flow Optimizations
Code motion (hoisting/sinking)
  Avoid repeated computation
Before:
  for (i=0; i < 100; i++) { z = x + y; b[i] = a[i] + z; }
After:
  z = x + y;
  for (i=0; i < 100; i++) { b[i] = a[i] + z; }
36
Control Flow Optimizations
Loop unrolling
  Replicate the body of the loop
  May increase parallelism
Before:
  for (i=0; i < 128; i++) a[i] = b[i] + c[i];
After:
  for (i=0; i < 128; i+=2) { a[i] = b[i] + c[i]; a[i+1] = b[i+1] + c[i+1]; }
37
Control Flow Optimizations
Function inlining: replace a function call with the body of the function
  Common for both SW and HW
  SW: eliminates function call instructions
  HW: eliminates unnecessary control states
Before:
  for (i=0; i < 128; i++) a[i] = f( b[i], c[i] );
  int f (int a, int b) { return a + b * 15; }
After:
  for (i=0; i < 128; i++) a[i] = b[i] + c[i] * 15;
38
Control Flow Optimizations
Conditional expansion: replace an if with a logic expression
  Execute if/else bodies in parallel
Here * and + are Boolean AND and OR, and a’ is the complement of a.
Before:
  y = a*b
  if (a) x = b+d else x = b*d
After [DeMicheli]:
  y = a*b
  x = a*(b+d) + a’*b*d
Can be further optimized to:
  y = a*b
  x = y + d*(a+b)
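The three forms of x can be compared exhaustively, since a, b, and d are single bits. This is a minimal sketch that assumes the slide's * and + denote Boolean AND and OR; the function names are hypothetical.

```c
#include <assert.h>

/* Original control flow: the value of x depends on a branch over a. */
int x_branch(int a, int b, int d)
{
    if (a)
        return b | d;
    return b & d;
}

/* Expanded form: x = a(b+d) + a'bd, with no branch. */
int x_expanded(int a, int b, int d)
{
    return (a & (b | d)) | ((!a) & b & d);
}

/* Further optimized form: y = ab; x = y + d(a+b). */
int x_optimized(int a, int b, int d)
{
    int y = a & b;
    return y | (d & (a | b));
}

/* Checks all 8 input combinations. */
void check_conditional_expansion(void)
{
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            for (int d = 0; d <= 1; d++) {
                assert(x_branch(a, b, d) == x_expanded(a, b, d));
                assert(x_branch(a, b, d) == x_optimized(a, b, d));
            }
}
```

The optimized form reuses y, which is already computed, so the branch costs no extra control state and only a little extra logic.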
39
Example
Optimize this:
  x = 0;
  y = a + b;
  if (x < 15)
    z = a + b - c;
  else
    z = x + 12;
  output = z * 12;
40
Scheduling/Resource Allocation
41
Scheduling
Scheduling assigns a start time to each operation in the DFG
  Start times must not violate dependencies in the DFG
  Start times must meet performance constraints
    Alternatively, resource constraints
Performed on the DFG of each CFG node
  Cannot execute multiple CFG nodes in parallel
42
Examples
[Figure: two example schedules of additions over inputs a, b, c, d: one spread over Cycle1 to Cycle3, the other finishing in Cycle2.]
43
Scheduling Problems
Several types of scheduling problems
  Usually some combination of performance and resource constraints
Problems:
  Unconstrained
    Not very useful: every schedule is valid
  Minimum latency
  Latency-constrained
  Minimum-latency, resource-constrained
    I.e., find the schedule with the shortest latency that uses no more than a specified # of resources
    NP-complete
  Minimum-resource, latency-constrained
    I.e., find the schedule that meets the latency constraint (which may be anything) and uses the minimum # of resources
44
Minimum Latency Scheduling
ASAP (as soon as possible) algorithm
  1) Find a candidate node
     A candidate is a node whose predecessors have been scheduled and completed (or that has no predecessors)
  2) Schedule the node one cycle later than the max cycle of its predecessors
  3) Repeat until all nodes are scheduled
[Figure: ASAP schedule over inputs a to h. Cycle1: +, +, -, <; Cycle2: *; Cycle3: *; Cycle4: +. Minimum possible latency: 4 cycles.]
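The ASAP steps can be sketched in a few lines. This is an illustrative implementation under stated assumptions: unit-latency operations, an acyclic graph, and a DFG stored as an edge list. The struct dfg layout, the size limits, and the name asap_schedule are hypothetical.

```c
#include <assert.h>

#define MAX_NODES 32
#define MAX_EDGES 64

/* A DFG as an edge list: edge k runs from src[k] to dst[k]. */
struct dfg {
    int n_nodes, n_edges;
    int src[MAX_EDGES], dst[MAX_EDGES];
};

/* Fills cycle[v] (1-based) with the ASAP cycle of every node and
 * returns the schedule latency, i.e. the largest cycle used. */
int asap_schedule(const struct dfg *g, int cycle[])
{
    int scheduled = 0, latency = 0;

    assert(g->n_nodes <= MAX_NODES && g->n_edges <= MAX_EDGES);
    for (int v = 0; v < g->n_nodes; v++)
        cycle[v] = 0;                         /* 0 marks "unscheduled" */

    while (scheduled < g->n_nodes) {
        int progress = 0;
        for (int v = 0; v < g->n_nodes; v++) {
            if (cycle[v] != 0)
                continue;
            /* v is a candidate only if every predecessor is scheduled. */
            int ready = 1, max_pred = 0;
            for (int e = 0; e < g->n_edges; e++) {
                if (g->dst[e] != v)
                    continue;
                int p = g->src[e];
                if (cycle[p] == 0) { ready = 0; break; }
                if (cycle[p] > max_pred) max_pred = cycle[p];
            }
            if (!ready)
                continue;
            cycle[v] = max_pred + 1;          /* one cycle after latest pred */
            if (cycle[v] > latency) latency = cycle[v];
            scheduled++;
            progress = 1;
        }
        assert(progress);   /* no progress would mean the graph has a cycle */
    }
    return latency;
}
```

For a chain such as (+, +) -> * -> * -> + plus one independent operation, the independent operation lands in cycle 1 and the latency equals the length of the chain.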
45
Minimum Latency Scheduling
ALAP (as late as possible) algorithm
  1) Run ASAP to get the minimum latency L
  2) Find a candidate
     A candidate is a node whose successors are scheduled (or that has none)
  3) Schedule the node one cycle before the min cycle of its successors
     Nodes with no successors are scheduled to cycle L
  4) Repeat until all nodes are scheduled
[Figure: starting point, the ASAP schedule over inputs a to h with L = 4 cycles. Cycle1: +, +, -, <; Cycle2: *; Cycle3: *; Cycle4: +.]
46
Minimum Latency Scheduling
ALAP (as late as possible) algorithm, continued
[Figure: resulting ALAP schedule over inputs a to h with L = 4 cycles. Cycle1: +, +; Cycle2: *; Cycle3: -, *; Cycle4: +, <.]
47
Minimum Latency Scheduling
ALAP
  Has to run ASAP first, which seems pointless
  But many heuristics need the mobility/slack of each operation
    ASAP gives the earliest possible time for an operation
    ALAP gives the latest possible time for an operation
  Slack = difference between the earliest and latest possible schedule
    Slack = 0 implies the operation has to be done in the currently scheduled cycle
    The larger the slack, the more options a heuristic has for scheduling the operation
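Slack can be computed from one forward (ASAP) pass and one backward (ALAP) pass. The sketch below is an assumption-laden illustration: the six-node DFG is made up, operations take one cycle, and nodes are numbered so that every edge goes from a lower to a higher number, which makes the node order a valid topological order.

```c
#include <assert.h>

#define N_NODES 6
#define N_EDGES 4

/* Hypothetical DFG: nodes 0 and 1 feed 2, then 2 -> 3 -> 4.
 * Node 5 has no predecessors and no successors. */
static const int src[N_EDGES] = {0, 1, 2, 3};
static const int dst[N_EDGES] = {2, 2, 3, 4};

/* Earliest cycle of each node; returns the minimum latency L. */
int asap(int early[N_NODES])
{
    int latency = 1;
    for (int v = 0; v < N_NODES; v++)
        early[v] = 1;
    for (int v = 0; v < N_NODES; v++)             /* forward pass */
        for (int e = 0; e < N_EDGES; e++)
            if (src[e] == v && early[dst[e]] < early[v] + 1)
                early[dst[e]] = early[v] + 1;
    for (int v = 0; v < N_NODES; v++)
        if (early[v] > latency)
            latency = early[v];
    return latency;
}

/* Latest cycle of each node for a latency bound L. */
void alap(int L, int late[N_NODES])
{
    for (int v = 0; v < N_NODES; v++)
        late[v] = L;                              /* sinks go to cycle L */
    for (int v = N_NODES - 1; v >= 0; v--)        /* backward pass */
        for (int e = 0; e < N_EDGES; e++)
            if (dst[e] == v && late[src[e]] > late[v] - 1)
                late[src[e]] = late[v] - 1;
}

/* Slack (mobility) = latest cycle - earliest cycle. */
void slack(int out[N_NODES])
{
    int early[N_NODES], late[N_NODES];
    int L = asap(early);
    alap(L, late);
    for (int v = 0; v < N_NODES; v++) {
        out[v] = late[v] - early[v];
        assert(out[v] >= 0);
    }
}
```

Nodes on the critical path get slack 0, while the isolated node can be placed in any of the L cycles.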
48
Latency-Constrained Scheduling
Instead of finding the minimum latency, find a schedule with latency no greater than L
Solutions:
  Use ASAP and verify that the minimum latency is no greater than L
  Use ALAP starting with cycle L instead of the minimum latency (don’t need ASAP)
49
Scheduling with Resource Constraints
Schedule must use no more than the specified number of resources
Constraints: 1 ALU (+/-), 1 Multiplier
[Figure: DFG over inputs a to g with +, - and * operations, drawn against Cycle1 to Cycle5.]
50
Scheduling with Resource Constraints
Schedule must use no more than the specified number of resources
Constraints: 2 ALU (+/-), 1 Multiplier
[Figure: the same DFG over inputs a to g, drawn against Cycle1 to Cycle4.]
51
Minimum-Latency, Resource-Constrained Scheduling
Definition: given resource constraints, find the schedule that has the minimum latency
Example constraints: 1 ALU (+/-), 1 Multiplier
[Figure: a schedule of the DFG over inputs a to g that extends to Cycle6.]
52
Minimum-Latency, Resource-Constrained Scheduling
Definition: given resource constraints, find the schedule that has the minimum latency
Example constraints: 1 ALU (+/-), 1 Multiplier
[Figure: a different schedule of the same DFG that finishes in Cycle5.]
Different schedules may use the same resources but have different latencies
53
Minimum-Latency, Resource-Constrained Scheduling
Hu’s Algorithm
  Assumes one type of resource
Basic idea
  Input: graph, # of resources r
  1) Label each node by its max distance from the output
     I.e., use path length as priority
  2) Determine C, the set of scheduling candidates
     A node is a candidate if it has no predecessors or all its predecessors are scheduled
  3) From C, schedule up to r nodes to the current cycle, using label as priority
  4) Increment the current cycle; repeat from 2) until all nodes are scheduled
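The four steps map fairly directly onto code. The following is a sketch under assumptions, not a reference implementation: operations take one cycle, the graph is acyclic, ties between equal labels go to the lower node number, and the struct dfg layout and function names are hypothetical.

```c
#include <assert.h>

#define MAX_NODES 32
#define MAX_EDGES 64

struct dfg {
    int n_nodes, n_edges;
    int src[MAX_EDGES], dst[MAX_EDGES];   /* edge k: src[k] -> dst[k] */
};

/* Step 1: label[v] = length of the longest path from v to an output
 * (a node without successors gets label 1). */
static void label_nodes(const struct dfg *g, int label[])
{
    for (int v = 0; v < g->n_nodes; v++)
        label[v] = 1;
    /* Repeated relaxation; n_nodes passes suffice for an acyclic graph. */
    for (int pass = 0; pass < g->n_nodes; pass++)
        for (int e = 0; e < g->n_edges; e++)
            if (label[g->src[e]] < label[g->dst[e]] + 1)
                label[g->src[e]] = label[g->dst[e]] + 1;
}

/* Steps 2-4: per cycle, schedule up to r candidates, highest label first.
 * Fills cycle[] (1-based) and returns the schedule latency. */
int hu_schedule(const struct dfg *g, int r, int cycle[])
{
    int label[MAX_NODES], done = 0, now = 0;

    assert(g->n_nodes <= MAX_NODES && g->n_edges <= MAX_EDGES && r > 0);
    label_nodes(g, label);
    for (int v = 0; v < g->n_nodes; v++)
        cycle[v] = 0;                          /* 0 marks "unscheduled" */

    while (done < g->n_nodes) {
        now++;                                 /* step 4: next cycle */
        for (int slot = 0; slot < r; slot++) {
            int best = -1;
            for (int v = 0; v < g->n_nodes; v++) {
                if (cycle[v] != 0)
                    continue;
                int ready = 1;                 /* step 2: candidate test */
                for (int e = 0; e < g->n_edges; e++)
                    if (g->dst[e] == v &&
                        (cycle[g->src[e]] == 0 || cycle[g->src[e]] >= now))
                        ready = 0;  /* pred unscheduled or in this cycle */
                if (ready && (best < 0 || label[v] > label[best]))
                    best = v;
            }
            if (best < 0)
                break;                         /* no candidates left */
            cycle[best] = now;                 /* step 3: schedule */
            done++;
        }
    }
    return now;
}
```

Raising r lets more candidates run per cycle until the latency bottoms out at the longest path through the graph.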
54
Minimum-Latency, Resource-Constrained Scheduling
Hu’s Algorithm: example
[Figure: DFG over inputs a to k with +, - and * operations; r = 3.]
55
Minimum-Latency, Resource-Constrained Scheduling
Hu’s Algorithm
Step 1: label each node by its max distance from the output, i.e., use path length as priority
[Figure: the example DFG with each node labeled by a priority from 1 to 4; r = 3.]
56
Minimum-Latency, Resource-Constrained Scheduling
Hu’s Algorithm
Step 2: determine C, the set of scheduling candidates
[Figure: the labeled DFG with the candidate set C marked for Cycle 1; r = 3.]
57
Minimum-Latency, Resource-Constrained Scheduling
Hu’s Algorithm
Step 3: from C, schedule up to r nodes to the current cycle, using label as priority
[Figure: the labeled DFG with Cycle1 filled in; one candidate is not scheduled due to lower priority; r = 3.]
58
Minimum-Latency, Resource-Constrained Scheduling
Hu’s Algorithm
Step 2 (Cycle 2)
[Figure: the labeled DFG with Cycle1 scheduled and the candidate set C marked for Cycle 2; r = 3.]
59
Minimum-Latency, Resource-Constrained Scheduling
Hu’s Algorithm
Step 3 (Cycle 2)
[Figure: the labeled DFG with Cycle1 and Cycle2 filled in; r = 3.]
60
Minimum-Latency, Resource-Constrained Scheduling
Hu’s Algorithm
Skipping to finish
[Figure: the completed schedule spanning Cycle1 to Cycle4; r = 3.]
61
Minimum-Latency, Resource-Constrained Scheduling
Hu’s is a simplified problem
Common extensions:
  Multiple resource types
  Multi-cycle operations
[Figure: DFG over inputs a to d with +, - and * in Cycle1 and / in Cycle2.]
62
Minimum-Latency, Resource-Constrained Scheduling
List scheduling (minimum-latency, resource-constrained version)
  Extension for multiple resource types
  Basic idea: Hu’s algorithm for each resource type
Input: graph, set of constraints R for each resource type
  1) Label nodes based on max distance to output
  2) For each resource type t
  3)   Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors)
  4)   Schedule up to Rt operations from C based on priority, to the current cycle
       Rt is the constraint on resource type t
  5) Increment cycle, repeat from 2) until all nodes are scheduled
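A compact sketch of the list-scheduling loop with two resource types. Everything concrete here is an assumption for illustration: the type encoding (0 = ALU, 1 = multiplier), the struct dfg layout, unit-latency operations, and the tie-break toward lower node numbers.

```c
#include <assert.h>

#define MAX_NODES 32
#define MAX_EDGES 64
#define N_TYPES   2     /* hypothetical encoding: 0 = ALU, 1 = multiplier */

struct dfg {
    int n_nodes, n_edges;
    int type[MAX_NODES];                  /* resource type of each op */
    int src[MAX_EDGES], dst[MAX_EDGES];   /* edge k: src[k] -> dst[k] */
};

/* limit[t] is Rt, the number of units of type t. Fills cycle[] (1-based)
 * and returns the schedule latency. */
int list_schedule(const struct dfg *g, const int limit[N_TYPES], int cycle[])
{
    int label[MAX_NODES], done = 0, now = 0;

    assert(g->n_nodes <= MAX_NODES && g->n_edges <= MAX_EDGES);
    for (int t = 0; t < N_TYPES; t++)
        assert(limit[t] > 0);

    /* 1) priority = longest path to an output */
    for (int v = 0; v < g->n_nodes; v++) {
        assert(g->type[v] >= 0 && g->type[v] < N_TYPES);
        label[v] = 1;
        cycle[v] = 0;                     /* 0 marks "unscheduled" */
    }
    for (int pass = 0; pass < g->n_nodes; pass++)
        for (int e = 0; e < g->n_edges; e++)
            if (label[g->src[e]] < label[g->dst[e]] + 1)
                label[g->src[e]] = label[g->dst[e]] + 1;

    while (done < g->n_nodes) {
        now++;                                         /* 5) next cycle */
        for (int t = 0; t < N_TYPES; t++) {            /* 2) each type */
            for (int used = 0; used < limit[t]; used++) {
                int best = -1;
                for (int v = 0; v < g->n_nodes; v++) { /* 3) candidates */
                    if (cycle[v] != 0 || g->type[v] != t)
                        continue;
                    int ready = 1;
                    for (int e = 0; e < g->n_edges; e++)
                        if (g->dst[e] == v && (cycle[g->src[e]] == 0 ||
                                               cycle[g->src[e]] >= now))
                            ready = 0;
                    if (ready && (best < 0 || label[v] > label[best]))
                        best = v;
                }
                if (best < 0)
                    break;
                cycle[best] = now;                     /* 4) schedule */
                done++;
            }
        }
    }
    return now;
}
```

Running the same graph with tighter limits shows the latency growing as operations of one type queue up for the shared unit.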
63
Minimum-Latency, Resource-Constrained Scheduling
List scheduling - minimum latency
Step 1: label nodes based on max distance to output (not shown, so you can see the operations)
  Nodes are given IDs for illustration purposes
Constraints: 2 ALUs (+/-), 2 Multipliers
[Figure: DFG over inputs a to k with eight operations (+, -, *), IDs 1 to 8.]
64
Minimum-Latency, Resource-Constrained Scheduling
List scheduling - minimum latency
For each resource type t:
  3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors)
  4) Schedule up to Rt operations from C based on priority, to the current cycle
     Rt is the constraint on resource type t
Constraints: 2 ALUs (+/-), 2 Multipliers
[Figure: the DFG next to a candidate table with columns Cycle, Mult, and ALU, being filled in for cycle 1.]
65
Minimum-Latency, Resource-Constrained Scheduling
List scheduling - minimum latency (steps 3 and 4, continued)
Constraints: 2 ALUs (+/-), 2 Multipliers
[Figure: Cycle1 filled in; one ALU operation is a candidate but is not scheduled due to low priority.]
66
Minimum-Latency, Resource-Constrained Scheduling
List scheduling - minimum latency (steps 3 and 4, continued)
Constraints: 2 ALUs (+/-), 2 Multipliers
[Figure: candidate table updated with the candidates for cycle 2.]
67
Minimum-Latency, Resource-Constrained Scheduling
List scheduling - minimum latency (steps 3 and 4, continued)
Constraints: 2 ALUs (+/-), 2 Multipliers
[Figure: Cycle1 and Cycle2 filled in.]
69
Minimum-Latency, Resource-Constrained Scheduling
List scheduling (minimum latency)
Final schedule
  Note: ASAP would require more resources here
  ALAP would not in this case, but in general it would
Constraints: 2 ALUs (+/-), 2 Multipliers
[Figure: final schedule. Cycle1: 1, 2, 3; Cycle2: 4, 5, 6; Cycle3: 7; Cycle4: 8.]
70
Minimum-Latency, Resource-Constrained Scheduling
Extension for multicycle operations
  Same idea (differences were shown in red on the original slide)
Input: graph, set of constraints R for each resource type
  1) Label nodes based on max cycle latency to output
  2) For each resource type t
  3)   Determine candidate nodes, C (those w/ no predecessors or w/ scheduled and completed predecessors)
  4)   Schedule up to (Rt - nt) operations from C based on priority, one cycle after predecessor
       Rt is the constraint on resource type t
       nt is the number of resources of type t in use from previous cycles
  5) Repeat from 2) until all nodes are scheduled
71
Minimum-Latency, Resource-Constrained Scheduling
Example: 2 ALUs (+/-), 2 Multipliers
[Figure: multicycle schedule of operations 1 to 8 over inputs a to k, spanning Cycle1 to Cycle7.]
72
List Scheduling (Min Latency)
Your turn (2 ALUs, 1 Mult)
Steps (will be on test):
  1) Label nodes with priority
  2) Update candidate list for each cycle
  3) Redraw graph to show schedule
[Figure: DFG with eleven operations (+, -, *), IDs 1 to 11.]
73
List Scheduling (Min Latency)
Your turn (2 ALUs, 1 Mult, Mults take 2 cycles)
[Figure: DFG over inputs a to g with six operations (+, *), IDs 1 to 6.]
74
Minimum-Resource, Latency-Constrained
Note that if no resource constraints are given, the schedule determines the number of required resources
  Max # of each resource type used in a single cycle
[Figure: schedule over inputs a to g spanning Cycle1 to Cycle4; it requires 3 ALUs and 2 Mults.]
75
Minimum-Resource, Latency-Constrained
Minimum-resource, latency-constrained scheduling:
  Of all schedules whose latency meets the constraint, find the one that uses the fewest resources
[Figure: two schedules of the same DFG (inputs a to g), both under latency constraint <= 4; one needs 2 ALUs and 1 Mult, the other 3 ALUs and 2 Mults.]
76
Minimum-Resource, Latency-Constrained
List scheduling (minimum-resource version)
Basic idea
  1) Compute latest start times for each op using ALAP with the specified latency constraint
     Latest start times must account for multicycle operations
  2) For each resource type
  3)   Determine candidate nodes
  4)   Compute slack for each candidate
       Slack = latest possible cycle - current cycle
  5)   Schedule ops with 0 slack
       Update required number of resources (assume 1 of each to start with)
  6)   Schedule ops that require no extra resources
  7) Repeat from 2) until all nodes are scheduled
77
Minimum-Resource, Latency-Constrained
1) Find ALAP schedule
  Latency constraint = 3 cycles
  Defines the last possible cycle (LPC) for each operation
[Figure: DFG over inputs a to k with operations 1 to 7, its ALAP schedule (Cycle1: 1, 2, 3; Cycle2: 5, 6; Cycle3: 4, 7), and a Node/LPC table.]
78
Minimum-Resource, Latency-Constrained
2) For each resource type
3)   Determine candidate nodes C
4)   Compute slack for each candidate
     Slack = latest possible cycle - current cycle
Cycle 1: Candidates = {1,2,3,4}
Initial resources = 1 Mult, 1 ALU
[Figure: the DFG next to a Node/LPC/Slack/Cycle table being filled in.]
79
Minimum-Resource, Latency-Constrained
5) Schedule ops with 0 slack
   Update required number of resources
6) Schedule ops that require no extra resources
Cycle 1: Candidates = {1,2,3,4}
Resources = 1 Mult, 2 ALU
  4 requires 1 more ALU: not scheduled
[Figure: the DFG next to the Node/LPC/Slack/Cycle table.]
80
Minimum-Resource, Latency-Constrained
2) For each resource type
3)   Determine candidate nodes C
4)   Compute slack for each candidate
     Slack = latest possible cycle - current cycle
Cycle 2: Candidates = {4,5,6}
Resources = 1 Mult, 2 ALU
[Figure: the DFG next to the Node/LPC/Slack/Cycle table.]
81
Minimum-Resource, Latency-Constrained
5) Schedule ops with 0 slack
   Update required number of resources
6) Schedule ops that require no extra resources
Cycle 2: Candidates = {4,5,6}
Resources = 2 Mult, 2 ALU
  An ALU is already available: 4 can be scheduled
[Figure: the DFG next to the Node/LPC/Slack/Cycle table.]
82
Minimum-Resource, Latency-Constrained
2) For each resource type
3)   Determine candidate nodes C
4)   Compute slack for each candidate
     Slack = latest possible cycle - current cycle
Cycle 3: Candidates = {7}
Resources = 2 Mult, 2 ALU
[Figure: the DFG next to the Node/LPC/Slack/Cycle table.]
83
Minimum-Resource, Latency-Constrained
Final schedule
Required resources = 2 Mult, 2 ALU
[Figure: final schedule with the completed Node/LPC/Slack/Cycle table. Cycle1: 1, 2, 3; Cycle2: 4, 5, 6; Cycle3: 7.]
84
Other extensions
Chaining
  Multiple operations in a single cycle
  Multiple adds may be faster than 1 divide: perform the adds in one cycle
Pipelining
  Input: DFG, data delivery rate
  For a fully pipelined circuit, must have one resource per operation (remember systolic arrays)
[Figure: DFG over inputs a to f with +, / and - operations, illustrating chained adds.]
85
Summary
Scheduling assigns each operation in a DFG a start time
  Done for each DFG in the CDFG
Different types
  Minimum latency
    ASAP, ALAP
  Latency-constrained
  Minimum-latency, resource-constrained
    Hu’s Algorithm
    List scheduling
  Minimum-resource, latency-constrained
86
Binding/Resource Sharing
87
Binding
During scheduling, we determined:
  When ops will execute
  How many resources are needed
We still need to decide which ops execute on which resources
  => Binding
  If multiple ops use the same resource => Resource sharing
88
Binding
Basic idea: map operations onto resources such that operations in the same cycle don’t use the same resource
Constraints: 2 ALUs (+/-), 2 Multipliers
[Figure: the four-cycle schedule of operations 1 to 8, each operation assigned to one of Mult1, Mult2, ALU1, ALU2.]
89
Binding
Many possibilities
  Bad binding may increase resources, require huge steering logic, reduce clock frequency, etc.
Constraints: 2 ALUs (+/-), 2 Multipliers
[Figure: the same four-cycle schedule with a different assignment of operations to Mult1, Mult2, ALU1, ALU2.]
90
Binding
Can’t do this
  1 resource can’t perform multiple ops simultaneously!
Constraints: 2 ALUs (+/-), 2 Multipliers
[Figure: the same four-cycle schedule with an invalid binding that maps operations from one cycle onto a single resource.]
91
Binding
How to automate?
  More graph theory
Compatibility graph
  Each node is an operation
  Edges represent compatible operations
    Compatible: two ops can share a resource
    I.e., ops that use the same type of resource (ALU, etc.) and are scheduled to different cycles
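The compatibility test is simple enough to state as code. This is a minimal sketch assuming single-cycle operations; the struct op layout, the type encoding (0 = ALU, 1 = multiplier), and the adjacency-matrix representation are hypothetical choices.

```c
#include <assert.h>

#define MAX_OPS 16

struct op {
    int type;    /* hypothetical encoding: 0 = ALU, 1 = multiplier */
    int cycle;   /* cycle assigned by the scheduler */
};

/* Fills compat[i][j] with 1 when ops i and j may share a resource:
 * same resource type and scheduled to different cycles. */
void build_compat(const struct op ops[], int n,
                  int compat[MAX_OPS][MAX_OPS])
{
    assert(n >= 0 && n <= MAX_OPS);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            compat[i][j] = (i != j) &&
                           ops[i].type == ops[j].type &&
                           ops[i].cycle != ops[j].cycle;
}
```

Any set of operations that are pairwise compatible in this matrix forms a clique and can be bound to one physical unit.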
92
Compatibility Graph
[Figure: the four-cycle schedule of operations 1 to 8 next to its compatibility graph, which splits into an ALU subgraph (nodes 2, 3, 4, 7, 8) and a Mult subgraph (nodes 1, 5, 6).]
  5 and 6 not compatible (same cycle)
  2 and 3 not compatible (same cycle)
93
Compatibility Graph
[Figure: the same schedule and compatibility graph with a fully connected subgraph highlighted.]
Note: fully connected subgraphs can share a resource (all involved nodes are compatible)
96
Compatibility Graph
Binding: find the minimum number of fully connected subgraphs that cover the entire graph
  Well-known problem: clique partitioning (NP-complete)
Cliques = { {2,8,7,4}, {3}, {1,5}, {6} }
  ALU1 executes 2,8,7,4
  ALU2 executes 3
  MULT1 executes 1,5
  MULT2 executes 6
97
Compatibility Graph
Final binding:
[Figure: the four-cycle schedule and the compatibility graph (ALUs and Mults) with the chosen cliques marked.]
98
Compatibility Graph
Alternative final binding:
[Figure: the same schedule and compatibility graph with a different choice of cliques marked.]
99
Translation to Datapath
  1) Add resources and registers
  2) Add a mux for each input
  3) Add an input to the left mux for each left input in the DFG
  4) Do the same for the right mux
  5) If only 1 input, remove the mux
[Figure: the bound four-cycle schedule and the resulting datapath with units Mult(1,5), ALU(2,7,8,4), Mult(6), and ALU(3), each fed through muxes and followed by a register.]
100
Left Edge Algorithm
Alternative to clique partitioning
  Take the scheduled DFG and rotate it 90 degrees
Constraints: 2 ALUs (+/-), 2 Multipliers
[Figure: the multicycle schedule of operations 1 to 8 over Cycle1 to Cycle7, drawn horizontally.]
101
Left Edge Algorithm
Constraints: 2 ALUs (+/-), 2 Multipliers
  1) Initialize right_edge to 0
  2) Find a node N whose left edge is >= right_edge
  3) Bind N to a particular resource
  4) Update right_edge to the right edge of N
  5) Repeat from 2) for nodes using the same resource type until right_edge passes all nodes
  6) Repeat from 1) until all nodes are bound
[Figure: operations 1 to 8 drawn as horizontal intervals over Cycle1 to Cycle7, with right_edge sweeping from left to right.]
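The six steps can be sketched for one resource type at a time. The code below is an illustration under assumptions: each operation occupies a half-open interval of cycles [start, end), ties go to the lower index, and struct op, the type encoding, and left_edge_bind are hypothetical names.

```c
#include <assert.h>

#define MAX_OPS 16

struct op {
    int type;        /* hypothetical encoding: 0 = ALU, 1 = multiplier */
    int start, end;  /* occupies cycles start .. end-1 */
};

/* Binds every op of the given type. unit[i] receives the 0-based
 * resource instance for op i; ops of other types are left untouched.
 * Returns the number of resource instances used. */
int left_edge_bind(const struct op ops[], int n, int type, int unit[])
{
    int bound = 0, total = 0, units = 0;

    assert(n >= 0 && n <= MAX_OPS);
    for (int i = 0; i < n; i++) {
        assert(ops[i].start >= 0 && ops[i].end > ops[i].start);
        if (ops[i].type == type) { unit[i] = -1; total++; }
    }

    while (bound < total) {                  /* 6) until all nodes bound */
        int right_edge = 0;                  /* 1) start a new resource */
        for (;;) {                           /* 5) sweep to the right */
            int pick = -1;
            for (int i = 0; i < n; i++) {    /* 2) leftmost op that fits */
                if (ops[i].type != type || unit[i] != -1)
                    continue;
                if (ops[i].start < right_edge)
                    continue;
                if (pick < 0 || ops[i].start < ops[pick].start)
                    pick = i;
            }
            if (pick < 0)
                break;                       /* right_edge passed all nodes */
            unit[pick] = units;              /* 3) bind N */
            right_edge = ops[pick].end;      /* 4) move right_edge */
            bound++;
        }
        units++;
    }
    return units;
}
```

Calling the function once per resource type yields a binding in which no two operations on the same unit overlap in time.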
110
Extensions
Algorithms presented so far find a valid binding
  But do not consider the amount of steering logic required
  Different bindings can require significantly different # of muxes
One solution
  Extend the compatibility graph
    Use weighted edges/nodes: cost function representing steering logic
  Perform clique partitioning, finding the set of cliques that minimizes weight
111
Binding Summary
Binding maps operations onto physical resources
  Determines sharing among resources
  Binding may greatly affect steering logic
Trivial for fully pipelined circuits
  1 resource per operation
Straightforward translation from bound DFG to datapath
112
Summary
113
Main Steps
Front-end (lexing/parsing) converts code into an intermediate representation
  We looked at the CDFG
Scheduling assigns a start time to each operation in the DFG
  CFG node start times are defined by control dependencies
  Resource allocation is determined by the schedule
Binding maps scheduled operations onto physical resources
  Determines how resources are shared
Big picture:
  Scheduled/bound DFG can be translated into a datapath
  CFG can be translated into a controller
  => High-level synthesis can create a custom circuit for any CDFG!
114
Limitations
Task-level parallelism
  Parallelism in the CDFG is limited to individual control states
    Can’t have multiple states executing concurrently
  Potential solution: use a model other than the CDFG
    Kahn Process Networks
      Nodes represent parallel processes/tasks
      Edges represent communication between processes
    High-level synthesis can create a controller+datapath for each process
      Must also consider communication buffers
  Challenge:
    Most high-level code does not have explicit parallelism
      Difficult/impossible to extract task-level parallelism from code
115
Limitations
Coding practices limit circuit performance
  Very often, languages contain constructs not appropriate for circuit implementation
    Recursion, pointers, virtual functions, etc.
  Potential solution: use specialized languages
    Remove problematic constructs, add task-level parallelism
  Challenge:
    Difficult to learn new languages
    Many designers resist changes to tool flow
116
Limitations
Expert designers can achieve better circuits
  High-level synthesis has to work with the specification in code
    Can be difficult to automatically create an efficient pipeline
    May require dozens of optimizations applied in a particular order
  Expert designer can transform the algorithm
    Synthesis can transform code, but can’t change the algorithm
  Potential solution: ???
    New language?
    New methodology?
    New tools?