1
High-Level Synthesis Creating Custom Circuits from High-Level Code Hao Zheng Comp Sci & Eng University of South Florida
2
Existing FPGA Tool Flow Register-transfer (RT) synthesis - Specify RT structure (muxes, registers, etc.) - Allows precise specification - But time consuming, difficult, error prone (Flow: Synthesizable HDL → RT Synthesis → Netlist → Technology Mapping → Placement → Routing (Physical Design) → Bitfile → FPGA/Processor)
3
Future FPGA Tool Flow (Flow: C/C++, Java, etc. → High-level Synthesis → Synthesizable HDL → RT Synthesis → Netlist → Technology Mapping → Placement → Routing (Physical Design) → Bitfile → FPGA/Processor)
4
High-level Synthesis - Benefits Ratio of C to VHDL developers (10000:1 ?) Easier to specify complex designs Technology/architecture independent designs Manual HW design potentially slower -Similar to assembly code era -Programmers could always beat the compiler -But, no longer the case Ease of HW/SW partitioning -enhances overall system efficiency More efficient verification and validation -Easier to verify and validate high-level code
5
High-level Synthesis More challenging than SW compilation -Compilation maps behavior into assembly instructions -Architecture is known to compiler HLS creates a custom architecture to execute specified behavior -Huge hardware exploration space -Best solution may include microprocessors -Ideally, should handle any high-level code + But, not all code appropriate for hardware
6
High-level Synthesis: An Example First, consider how to manually convert high-level code into circuit Steps 1)Build FSM for controller 2)Build datapath based on FSM acc = 0; for (i=0; i < 128; i++) acc += a[i];
7
A Manual Example Build a FSM (controller) - Decompose code into states acc = 0; for (i=0; i < 128; i++) acc += a[i]; States: acc=0, i=0 → test i < 128 → (if true) load a[i] → acc += a[i] → i++ → back to the test; when i >= 128, Done
8
A Manual Example Build a datapath - Allocate resources for each state acc i acc=0, i = 0 load a[i] acc += a[i] i++ Done < addr a[i] + + + 1 128 1 acc = 0; for (i=0; i < 128; i++) acc += a[i]; i < 128 i >= 128
9
A Manual Example Build a datapath - Determine register inputs acc i < addr a[i] + + + 1 128 2x1 0 0 1 &a In from memory acc=0, i = 0 load a[i] acc += a[i] i++ Done i < 128 i >= 128 acc = 0; for (i=0; i < 128; i++) acc += a[i];
10
A Manual Example Build a datapath - Add outputs acc i < addr a[i] + + + 1 128 2x1 0 0 1 &a In from memory Memory address acc acc = 0; for (i=0; i < 128; i++) acc += a[i]; acc=0, i = 0 load a[i] acc += a[i] i++ Done i < 128 i >= 128
11
A Manual Example Build a datapath - Add control signals acc i < addr a[i] + + + 1 128 2x1 0 0 1 &a In from memory Memory address acc acc = 0; for (i=0; i < 128; i++) acc += a[i]; acc=0, i = 0 load a[i] acc += a[i] i++ Done i < 128 i >= 128
12
A Manual Example Combine controller + datapath acc = 0; for (i=0; i < 128; i++) acc += a[i]; The controller steps through the FSM states, driving the datapath's mux selects, register loads, and the memory read; Start launches execution, Done is asserted when the loop completes, and the datapath produces the memory address and acc outputs
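To make the controller/datapath split concrete, the following is a minimal cycle-accurate C++ sketch of this FSMD, executing one FSM state per simulated clock cycle; the state names, the exact state decomposition, and the plain array standing in for memory are illustrative assumptions rather than the circuit above.

#include <cstdio>

// Cycle-accurate sketch of the controller + datapath for:
//   acc = 0; for (i = 0; i < 128; i++) acc += a[i];
// One FSM state executes per simulated clock cycle.
enum State { INIT, CHECK, LOAD, ACCUM, INCR, DONE };

int main() {
    int a[128];                              // array standing in for the memory
    for (int k = 0; k < 128; k++) a[k] = k;

    int acc = 0, i = 0, mem_data = 0;        // datapath registers
    State state = INIT;                      // controller state register
    int cycles = 0;

    while (state != DONE) {
        cycles++;
        switch (state) {
        case INIT:  acc = 0; i = 0;               state = CHECK; break; // acc=0, i=0
        case CHECK: state = (i < 128) ? LOAD : DONE;             break; // test i < 128
        case LOAD:  mem_data = a[i];              state = ACCUM; break; // load a[i]
        case ACCUM: acc += mem_data;              state = INCR;  break; // acc += a[i]
        case INCR:  i++;                          state = CHECK; break; // i++
        case DONE:                                               break;
        }
    }
    printf("acc = %d after %d cycles\n", acc, cycles);
    return 0;
}

Running it accumulates 0..127 into acc and reports how many simulated cycles the FSM took - the kind of cycle count that scheduling later determines automatically.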
13
A Manual Example - Optimization Alternatives -Use one adder (plus muxes) acc i < addr a[i] + 128 2x1 0 0 1 &a In from memory Memory address acc MUX
14
A Manual Example – Summary Comparison with high-level synthesis - Determining when to perform each operation => Scheduling - Allocating resource for each operation => Resource allocation - Mapping operations onto resources => Binding
15
Another Example: Try it at home Your turn x=0; for (i=0; i < 100; i++) { if (a[i] > 0) x ++; else x --; a[i] = x; } //output x 1)Build FSM (do not perform if conversion) 2)Build datapath based on FSM
16
High-Level Synthesis The input could be C, C++, Java, Perl, Python, SystemC, ImpulseC, etc. The output is usually an RT-level VHDL/Verilog description, but could be as low-level as a bitfile (high-level code → High-Level Synthesis → Custom Circuit)
17
High-Level Synthesis – Overview High-Level Synthesis acc = 0; for (i=0; i < 128; i++) acc += a[i]; acc i < addr a[i] + + + 1 128 2x1 0 0 1 &a In from memory Memory address acc Done Memory Read Controller
18
Main Steps Syntactic Analysis (front-end) - converts code to an intermediate representation, allowing all following steps to use a language-independent format Optimization Scheduling/Resource Allocation - determines when each operation will execute, and the resources used Binding/Resource Sharing (back-end) - maps operations onto physical resources Flow: High-level Code → Intermediate Representation → Cycle-accurate RTL code
19
Syntactic Analysis Definition: Analysis of code to verify syntactic correctness -Converts code into intermediate representation Steps: similar to SW compilation 1)Lexical analysis (lexing) 2)Parsing 3)Code generation – intermediate representation (High-level Code → Lexical Analysis → Parsing → Intermediate Representation)
20
Intermediate Representation Parser converts tokens to intermediate representation -Usually, an abstract syntax tree - Example: x = 0; if (y < z) x = 1; d = 6; becomes an assign node (x ← 0), an if node whose condition is y < z and whose body is an assign node (x ← 1), and an assign node (d ← 6)
21
Intermediate Representation Why use intermediate representation? - Easier to analyze/optimize than source code - Theoretically can be used for all languages + Makes synthesis back end language independent (C code, Java, and Perl each pass through their own syntactic analysis into a common intermediate representation) Back End: scheduling, resource allocation, binding are independent of the source language - sometimes optimizations too
22
Intermediate Representation Different Types - Abstract Syntax Tree - Control/Data Flow Graph (CDFG) - Sequencing Graph +... We will focus on CDFG - Combines control flow graph (CFG) and data flow graph (DFG)
23
Control Flow Graphs (CFGs) Represents control flow dependencies of basic blocks A basic block is a section of code that always executes from beginning to end + I.e. no jumps into or out of block acc = 0; for (i=0; i < 128; i++) acc += a[i]; i < 128? acc=0, i = 0 acc += a[i] i ++ Done yes no
24
Control Flow Graphs: Your Turn Find a CFG for the following code. i = 0; while (i < 10) { if (x < 5) y = 2; else if (z < 10) y = 6; i++; }
25
Data Flow Graphs Represents data dependencies between operations x = a+b; y = c*d; z = x - y; + * - a b c d x z y
26
Control/Data Flow Graph Combines CFG and DFG - Maintains a DFG for each node of the CFG acc = 0; for (i=0; i < 128; i++) acc += a[i]; CFG nodes: acc=0; i=0 / if (i < 128) / acc += a[i]; i++ / Done - each node carries a small DFG (e.g., the body node's DFG adds a[i] into acc and increments i)
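For illustration only, here is one possible C++ encoding of a CDFG, in which every control-flow node owns a small data-flow graph; the struct and field names are assumptions, not any particular tool's internal representation.

#include <string>
#include <vector>

// Sketch: a CDFG is a control-flow graph whose nodes each carry a data-flow graph.
struct DfgOp {
    int id;
    std::string op;               // "+", "*", "<", "load", ...
    std::vector<int> operands;    // ids of the producing ops (data dependencies)
};

struct DfgGraph {
    std::vector<DfgOp> ops;       // operations of one basic block
};

struct CfgNode {
    std::string label;            // e.g. "acc=0; i=0", "i<128?", "acc+=a[i]; i++"
    DfgGraph dfg;                 // data-flow graph for this basic block
    std::vector<int> successors;  // indices of successor CFG nodes
};

struct Cdfg {
    std::vector<CfgNode> nodes;
    int entry = 0;
};

int main() {
    Cdfg g;
    g.nodes.push_back({"acc=0; i=0", {}, {1}});
    g.nodes.push_back({"i<128?", {}, {2, 3}});        // taken -> body, not taken -> done
    g.nodes.push_back({"acc+=a[i]; i++", {}, {1}});   // back edge to the loop test
    g.nodes.push_back({"done", {}, {}});
    return g.nodes.size() == 4 ? 0 : 1;
}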
27
Optimization
28
Synthesis Optimizations After creating CDFG, high-level synthesis optimizes it with the following goals -Reduce area -Improve latency -Increase parallelism -Reduce power/energy 2 types of optimizations -Data flow optimizations -Control flow optimizations
29
Data Flow Optimizations Tree-height reduction -Generally made possible by commutativity, associativity, and distributivity - Example: x = a + b + c + d evaluated as a serial chain needs three dependent additions; rebalancing it as x = (a + b) + (c + d) gives a tree of depth two (the same idea applies when a multiply is mixed in with the additions)
30
Data Flow Optimizations Operator Strength Reduction -Replacing an expensive (“strong”) operation with a faster one -Common example: replacing a multiply/divide by a power of two with a shift: b[i] = a[i] * 8; becomes b[i] = a[i] << 3; -Multiplies by other constants decompose into shifts and adds: a = b * 5; becomes c = b << 2; a = b + c; (1 multiplication becomes 0 multiplications), and a = b * 13; becomes c = b << 2; d = b << 3; a = c + d + b;
31
Constant propagation -Statically evaluate expressions with constants Data Flow Optimizations x = 0; y = x * 15; z = y + 10; x = 0; y = 0; z = 10;
32
Data Flow Optimizations Function Specialization -Create specialized code for common inputs + Treat common inputs as constants + If inputs are not known statically, must include an if statement for each call to the specialized function - Example: int f (int x) { y = x * 15; return y + 10; } called as f(0) inside for (i=0; i < 1000; i++) can be replaced by int f_opt () { return 10; } and calls to f_opt() - the frequent input is treated as a constant
33
Data Flow Optimizations Common sub-expression elimination -If expression appears more than once, repetitions can be replaced a = x + y;...... b = c * 25 + x + y; a = x + y;...... b = c * 25 + a; x + y already determined
34
Data Flow Optimizations Dead code elimination -Remove code that is never executed + May seem like stupid code, but often comes from constant propagation or function specialization int f (int x) { if (x > 0 ) a = b * 15; else a = b / 4; return a; } int f_opt () { a = b * 15; return a; } Specialized version for x > 0 does not need else branch - “dead code”
35
Data Flow Optimizations Code motion (hoisting/sinking) -Avoid repeated computation for (i=0; i < 100; i++) { z = x + y; b[i] = a[i] + z; } becomes z = x + y; for (i=0; i < 100; i++) { b[i] = a[i] + z; }
36
Control Flow Optimizations Loop Unrolling -Replicate body of loop + May increase parallelism for (i=0; i < 128; i++) a[i] = b[i] + c[i]; becomes for (i=0; i < 128; i+=2) { a[i] = b[i] + c[i]; a[i+1] = b[i+1] + c[i+1]; }
37
Control Flow Optimizations Function inlining – replace function call with body of function -Common for both SW and HW -SW: Eliminates function call instructions -HW: Eliminates unnecessary control states for (i=0; i < 128; i++) a[i] = f( b[i], c[i] );.. int f (int a, int b) { return a + b * 15; } for (i=0; i < 128; i++) a[i] = b[i] + c[i] * 15;
38
Control Flow Optimizations Conditional Expansion – replace if with logic expression -Execute if/else bodies in parallel - Example [DeMicheli]: y = ab; if (a) x = b+d; else x = bd; becomes y = ab; x = a(b+d) + a'bd; which can be further optimized to y = ab; x = y + d(a+b)
39
Example Optimize this x = 0; y = a + b; if (x < 15) z = a + b - c; else z = x + 12; output = z * 12;
40
Scheduling/Resource Allocation
41
Scheduling Scheduling assigns a start time to each operation in DFG -Start times must not violate dependencies in DFG -Start times must meet performance constraints +Alternatively, resource constraints Performed on the DFG of each CFG node - Cannot execute multiple CFG nodes in parallel
42
Examples Scheduling x = a + b + c + d: as a serial chain of additions it needs 3 cycles (one add per cycle), while the balanced trees need only 2 cycles
43
Scheduling Problems Several types of scheduling problems - Usually some combination of performance and resource constraints Problems: - Unconstrained + Not very useful, every schedule is valid - Minimum latency - Latency constrained - Minimum-latency, resource constrained + i.e. find the schedule with the shortest latency that uses no more than a specified # of resources + NP-Complete - Minimum-resource, latency constrained + i.e. find the schedule that meets the latency constraint (which may be anything), and uses the minimum # of resources + NP-Complete
44
Minimum Latency Scheduling ASAP (as soon as possible) algorithm - Find a candidate node + Candidate is a node whose predecessors have all been scheduled and completed (or that has no predecessors) - Schedule the node one cycle later than the max cycle of its predecessors - Repeat until all nodes scheduled - Example: a DFG of adds, multiplies, a subtract, and a compare over inputs a-h scheduled into Cycle 1 through Cycle 4 - minimum possible latency: 4 cycles
45
Minimum Latency Scheduling ALAP (as late as possible) algorithm - Run ASAP, get minimum latency L - Find a candidate + Candidate is a node whose successors are all scheduled (or that has none) - Schedule the node one cycle before the min cycle of its successors + Nodes with no successors are scheduled to cycle L - Repeat until all nodes scheduled - Example: the same DFG with L = 4 cycles
46
Minimum Latency Scheduling ALAP - Has to run ASAP first, seems pointless - But, many heuristics need the mobility/slack of each operation + ASAP gives the earliest possible time for an operation + ALAP gives the latest possible time for an operation - Slack = difference between earliest and latest possible schedule + Slack = 0 implies operation has to be done in the current scheduled cycle + The larger the slack, the more options a heuristic has to schedule the operation
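A compact C++ sketch of ASAP and ALAP with slack taken as their difference, assuming unit-latency operations, unlimited resources, and a DFG whose operations are listed in topological order (each op records its predecessor ids); the tiny example DFG is made up.

#include <algorithm>
#include <cstdio>
#include <vector>

// Each operation lists the ids of its predecessors (unit latency, unlimited resources).
struct Op { std::vector<int> preds; };

// ASAP: schedule each op one cycle after its latest predecessor.
std::vector<int> asap(const std::vector<Op>& ops) {
    std::vector<int> start(ops.size(), 1);
    for (size_t i = 0; i < ops.size(); i++)            // assumes topological order
        for (int p : ops[i].preds)
            start[i] = std::max(start[i], start[p] + 1);
    return start;
}

// ALAP: given latency bound L, schedule each op one cycle before its earliest successor.
std::vector<int> alap(const std::vector<Op>& ops, int L) {
    std::vector<int> start(ops.size(), L);
    for (int i = (int)ops.size() - 1; i >= 0; i--)     // reverse topological order
        for (int p : ops[i].preds)
            start[p] = std::min(start[p], start[i] - 1);
    return start;
}

int main() {
    // Made-up DFG: ops 0 and 1 have no predecessors; op 2 uses 0 and 1; op 3 uses 1; op 4 uses 3.
    std::vector<Op> ops = { {}, {}, {{0, 1}}, {{1}}, {{3}} };
    std::vector<int> s = asap(ops);
    int L = *std::max_element(s.begin(), s.end());     // minimum possible latency
    std::vector<int> l = alap(ops, L);
    for (size_t i = 0; i < ops.size(); i++)            // slack = ALAP start - ASAP start
        printf("op %zu: ASAP %d, ALAP %d, slack %d\n", i, s[i], l[i], l[i] - s[i]);
    return 0;
}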
47
Latency-Constrained Scheduling Instead of finding the minimum latency, find a latency less than L Solutions: - Use ASAP, then verify that the minimum latency is less than L - Use ALAP starting with cycle L instead of the minimum latency (ASAP is not needed)
48
Scheduling with Resource Constraints Schedule must use no more than the specified number of resources - Example: with constraints of 1 ALU (+/-) and 1 Multiplier, the example DFG stretches over 5 cycles
49
Scheduling with Resource Constraints Schedule must use no more than the specified number of resources - Example: with constraints of 2 ALUs (+/-) and 1 Multiplier, the same DFG fits in 4 cycles
50
Minimum-Latency, Resource-Constrained Scheduling Definition: Given resource constraints, find the schedule that has the minimum latency - Example: constraints of 1 ALU (+/-) and 1 Multiplier
53
Minimum-Latency, Resource-Constrained Scheduling Hu's Algorithm - Assumes one type of resource Basic Idea - Input: graph, # of resources r - 1) Label each node by max distance from output + i.e. Use path length as priority - 2) Determine C, the set of scheduling candidates + Candidate if either no predecessors, or predecessors scheduled - 3) From C, schedule up to r nodes to current cycle, using label as priority - 4) Increment current cycle, repeat from 2) until all nodes scheduled
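A C++ sketch of Hu's algorithm under its stated assumptions (one resource type, unit-latency operations); the operation list is assumed to be in topological order, and the small DFG in main is made up for illustration, not the one on the slides.

#include <algorithm>
#include <cstdio>
#include <vector>

struct Op {
    std::vector<int> preds;  // data dependencies (ids of predecessor ops)
    int label = 0;           // max distance to an output (priority)
    int cycle = 0;           // assigned start cycle, 0 = unscheduled
};

// Hu's algorithm: one resource type, unit latency, r resources.
void hu_schedule(std::vector<Op>& ops, int r) {
    // 1) Label each node by its max distance to an output (path length as priority).
    for (int i = (int)ops.size() - 1; i >= 0; i--) {   // assumes topological order
        if (ops[i].label == 0) ops[i].label = 1;       // outputs get label 1
        for (int p : ops[i].preds)
            ops[p].label = std::max(ops[p].label, ops[i].label + 1);
    }
    int scheduled = 0, cycle = 0;
    while (scheduled < (int)ops.size()) {
        cycle++;
        // 2) Candidates: unscheduled ops whose predecessors are all in earlier cycles.
        std::vector<int> cand;
        for (int i = 0; i < (int)ops.size(); i++) {
            if (ops[i].cycle != 0) continue;
            bool ready = true;
            for (int p : ops[i].preds)
                if (ops[p].cycle == 0 || ops[p].cycle >= cycle) ready = false;
            if (ready) cand.push_back(i);
        }
        // 3) Schedule up to r candidates, highest label first.
        std::sort(cand.begin(), cand.end(),
                  [&](int a, int b) { return ops[a].label > ops[b].label; });
        for (int k = 0; k < (int)cand.size() && k < r; k++) {
            ops[cand[k]].cycle = cycle;
            scheduled++;
        }
    }
}

int main() {
    // Made-up DFG: ops 0-3 read primary inputs; op 4 uses 0 and 1, op 5 uses 2 and 3, op 6 uses 4 and 5.
    std::vector<Op> ops(7);
    ops[4].preds = {0, 1};
    ops[5].preds = {2, 3};
    ops[6].preds = {4, 5};
    hu_schedule(ops, /*r=*/3);
    for (int i = 0; i < 7; i++) printf("op %d -> cycle %d\n", i, ops[i].cycle);
    return 0;
}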
54
Minimum-Latency, Resource-Constrained Scheduling Hu's Algorithm - Example: a DFG of adds, subtracts, and multiplies over inputs a-g, j, k, with r = 3
55
Minimum-Latency, Resource-Constrained Scheduling Hu's Algorithm - Step 1 - Label each node by max distance from output + i.e. use path length as priority (example labels: 4, 4, 3, 3, 2, 2, 1, 1; r = 3)
56
Minimum-Latency, Resource-Constrained Scheduling Hu's Algorithm - Step 2 - Determine C, the set of scheduling candidates (cycle 1: the nodes with no predecessors; r = 3)
57
Minimum-Latency, Resource-Constrained Scheduling Hu's Algorithm - Step 3 - From C, schedule up to r nodes to the current cycle, using label as priority (cycle 1: one candidate is not scheduled due to lower priority; r = 3)
58
Minimum-Latency, Resource-Constrained Scheduling Hu's Algorithm - Step 2 again - determine C, the candidates for cycle 2 (r = 3)
59
Minimum-Latency, Resource-Constrained Scheduling Hu's Algorithm - Step 3 again - schedule up to r candidates to cycle 2 (r = 3)
60
Minimum-Latency, Resource-Constrained Scheduling Hu's Algorithm - Skipping to finish: the complete schedule uses 4 cycles (Cycle 1 - Cycle 4, r = 3)
61
Minimum-Latency, Resource-Constrained Scheduling Hu's solves a simplified problem Common Extensions: - Multiple resource types - Multi-cycle operations (e.g., a divide or multiply that occupies more than one cycle)
62
Minimum-Latency, Resource-Constrained Scheduling List Scheduling - (minimum latency, resource-constrained version) - Extension for multiple resource types - Basic Idea - Hu's algorithm for each resource type + Input: graph, set of constraints R_t for each resource type t + 1) Label nodes based on max distance to output + 2) For each resource type t: 3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors) 4) Schedule up to R_t operations from C based on priority, to current cycle (R_t is the constraint on resource type t) + 5) Increment cycle, repeat from 2) until all nodes scheduled
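The sketch below generalizes the Hu sketch to multiple resource types: each operation carries a type index and each type t gets its own candidate list and constraint R[t]. Unit latency, topological order, and the example DFG are assumptions.

#include <algorithm>
#include <cstdio>
#include <vector>

struct Op {
    int type;                // resource type index (0 = ALU, 1 = multiplier)
    std::vector<int> preds;  // data dependencies
    int label;               // priority: max distance to an output
    int cycle;               // assigned start cycle, 0 = unscheduled
};

// Minimum-latency list scheduling with a per-type resource constraint R[t], unit latency.
void list_schedule(std::vector<Op>& ops, const std::vector<int>& R) {
    for (int i = (int)ops.size() - 1; i >= 0; i--) {           // label pass (reverse topo order)
        if (ops[i].label == 0) ops[i].label = 1;
        for (int p : ops[i].preds)
            ops[p].label = std::max(ops[p].label, ops[i].label + 1);
    }
    int scheduled = 0, cycle = 0;
    while (scheduled < (int)ops.size()) {
        cycle++;
        for (int t = 0; t < (int)R.size(); t++) {              // for each resource type
            std::vector<int> cand;
            for (int i = 0; i < (int)ops.size(); i++) {
                if (ops[i].cycle != 0 || ops[i].type != t) continue;
                bool ready = true;
                for (int p : ops[i].preds)
                    if (ops[p].cycle == 0 || ops[p].cycle >= cycle) ready = false;
                if (ready) cand.push_back(i);
            }
            std::sort(cand.begin(), cand.end(),
                      [&](int a, int b) { return ops[a].label > ops[b].label; });
            for (int k = 0; k < (int)cand.size() && k < R[t]; k++) {
                ops[cand[k]].cycle = cycle;
                scheduled++;
            }
        }
    }
}

int main() {
    // Made-up DFG: two multiplies (0,1) feed an add (2); a subtract (3) feeds another add (4).
    std::vector<Op> ops = {
        {1, {}, 0, 0}, {1, {}, 0, 0}, {0, {0, 1}, 0, 0},
        {0, {}, 0, 0}, {0, {3}, 0, 0}
    };
    list_schedule(ops, /*R=*/{2, 2});    // 2 ALUs, 2 multipliers
    for (size_t i = 0; i < ops.size(); i++) printf("op %zu -> cycle %d\n", i, ops[i].cycle);
    return 0;
}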
63
Minimum-Latency, Resource-Constrained Scheduling List scheduling - minimum latency - Step 1 - Label nodes based on max distance to output (labels not shown, so you can see the operations) - nodes are given IDs 1-8 for illustration purposes - 2 ALUs (+/-), 2 Multipliers
64
Minimum-Latency, Resource-Constrained Scheduling List scheduling - minimum latency + For each resource type t 3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors) 4) Schedule up to R_t operations from C based on priority, to current cycle (R_t is the constraint on resource type t) a b cd e f g * + + * * + - j k - 1 2 3 4 5 6 7 8 Mult ALU Cycle 1 2,3,4 1 2 ALUs (+/-), 2 Multipliers Candidates
65
Minimum-Latency, Resource-Constrained Scheduling List scheduling - minimum latency + For each resource type t 3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors) 4) Schedule up to R_t operations from C based on priority, to current cycle (R_t is the constraint on resource type t) a b cd e f g * + + * * + - j k - 1 2 3 4 5 6 7 8 Mult ALU Cycle 1 2,3,4 1 2 ALUs (+/-), 2 Multipliers Candidates Cycle1 Candidate, but not scheduled due to low priority
66
Minimum-Latency, Resource-Constrained Scheduling List scheduling - minimum latency + For each resource type t 3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors) 4) Schedule up to R_t operations from C based on priority, to current cycle (R_t is the constraint on resource type t) a b cd e f g * + + * * + - j k - 1 2 3 4 5 6 7 8 Mult ALU Cycle 1 2,3,4 1 5,6 4 2 2 ALUs (+/-), 2 Multipliers Candidates Cycle1
67
Minimum-Latency, Resource-Constrained Scheduling List scheduling - minimum latency + For each resource type t 3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors) 4) Schedule up to R_t operations from C based on priority, to current cycle (R_t is the constraint on resource type t) a b cd e f g * + + * * + - j k - 1 2 3 4 5 6 7 8 Mult ALU Cycle 1 2,3,4 1 5,6 4 2 2 ALUs (+/-), 2 Multipliers Candidates Cycle1 Cycle2
68
Minimum-Latency, Resource-Constrained Scheduling List scheduling - minimum latency + For each resource type t 3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors) 4) Schedule up to R_t operations from C based on priority, to current cycle (R_t is the constraint on resource type t) a b cd e f g * + + * * + - j k - 1 2 3 4 5 6 7 8 Mult ALU Cycle 1 2,3,4 1 5,6 4 2 7 3 2 ALUs (+/-), 2 Multipliers Candidates Cycle1 Cycle2
69
Minimum-Latency, Resource-Constrained Scheduling List scheduling - (minimum latency) - Final schedule - Note - ASAP would require more resources + ALAP wouldn't here, but in general it would a b cd e f g * + + * * + - j k - 1 2 3 4 5 6 7 8 Cycle1 Cycle2 Cycle3 Cycle4 2 ALUs (+/-), 2 Multipliers
70
Minimum-Latency, Resource-Constrained Scheduling Extension for multicycle operations - Same idea (differences shown in red on the slide) - Input: graph, set of constraints R for each resource type - 1) Label nodes based on max cycle latency to output - 2) For each resource type t + 3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled and completed predecessors) + 4) Schedule up to (R_t - n_t) operations from C based on priority, one cycle after the predecessor completes (R_t is the constraint on resource type t, n_t is the number of resources of type t in use from previous cycles) - Repeat from 2) until all nodes scheduled
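A C++ sketch of the multicycle variant: a predecessor must be scheduled and completed before its successor becomes a candidate, and n_t counts the type-t resources still occupied by operations issued in earlier cycles. The latencies and the example DFG are made up.

#include <algorithm>
#include <cstdio>
#include <vector>

struct Op {
    int type;                // resource type (0 = ALU, 1 = multiplier)
    int latency;             // number of cycles the operation occupies its resource
    std::vector<int> preds;  // data dependencies
    int label;               // priority: max cycle distance to an output
    int cycle;               // assigned start cycle, 0 = unscheduled
};

// List scheduling with multicycle operations and per-type constraints R[t].
void list_schedule_multicycle(std::vector<Op>& ops, const std::vector<int>& R) {
    for (int i = (int)ops.size() - 1; i >= 0; i--) {           // label by cycle distance to output
        if (ops[i].label == 0) ops[i].label = ops[i].latency;
        for (int p : ops[i].preds)
            ops[p].label = std::max(ops[p].label, ops[i].label + ops[p].latency);
    }
    int scheduled = 0, cycle = 0;
    while (scheduled < (int)ops.size()) {
        cycle++;
        for (size_t t = 0; t < R.size(); t++) {
            // n_t: type-t resources still occupied by operations issued in earlier cycles.
            int n_t = 0;
            for (const Op& op : ops)
                if (op.type == (int)t && op.cycle != 0 && op.cycle < cycle &&
                    op.cycle + op.latency > cycle)
                    n_t++;
            std::vector<int> cand;
            for (int i = 0; i < (int)ops.size(); i++) {
                if (ops[i].cycle != 0 || ops[i].type != (int)t) continue;
                bool ready = true;
                for (int p : ops[i].preds)   // predecessor must have completed before this cycle
                    if (ops[p].cycle == 0 || ops[p].cycle + ops[p].latency > cycle) ready = false;
                if (ready) cand.push_back(i);
            }
            std::sort(cand.begin(), cand.end(),
                      [&](int a, int b) { return ops[a].label > ops[b].label; });
            for (int k = 0; k < (int)cand.size() && k < R[t] - n_t; k++) {
                ops[cand[k]].cycle = cycle;
                scheduled++;
            }
        }
    }
}

int main() {
    // Made-up DFG: two 2-cycle multiplies feed a 1-cycle add; a subtract depends on the add.
    std::vector<Op> ops = { {1, 2, {}, 0, 0}, {1, 2, {}, 0, 0},
                            {0, 1, {0, 1}, 0, 0}, {0, 1, {2}, 0, 0} };
    list_schedule_multicycle(ops, /*R=*/{2, 2});   // 2 ALUs, 2 multipliers
    for (size_t i = 0; i < ops.size(); i++) printf("op %zu -> cycle %d\n", i, ops[i].cycle);
    return 0;
}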
71
Minimum-Latency, Resource-Constrained Scheduling Example: with 2 ALUs (+/-) and 2 Multipliers and multicycle operations, the schedule spans 7 cycles (Cycle 1 - Cycle 7)
72
List Scheduling (Min Latency) Your turn (2 ALUs, 1 Mult) Steps (will be on test) - 1) Label nodes with priority - 2) Update candidate list for each cycle - 3) Redraw graph to show schedule + * * - + + 1 2 3 7 4 9 - + + * 5 6 8 - 10 11
73
List Scheduling (Min Latency) Your turn (2 ALUs, 1 Mult, Mults take 2 cycles) a b cd e f g + + * * * + 1 2 3 5 4 6
74
Minimum-Resource, Latency-Constrained Note that if no resource constraints given, schedule determines number of required resources - Max # of each resource type used in a single cycle + + * a b cd + - e f g Cycle1 Cycle2 Cycle3 + Cycle4 * 3 ALUs 2 Mults
75
Minimum-Resource, Latency-Constrained Minimum-Resource Latency-Constrained Scheduling: - For all schedules that have latency less than the constraint, find the one that uses the fewest resources + + * a b cd + - e f g Cycle1 Cycle2 Cycle3 + Cycle4 * + + * a b cd + - e f g Cycle1 Cycle2 Cycle3 + Cycle4 * 3 ALUs, 2 Mult 2 ALUs, 1 Mult Latency Constraint <= 4
76
Minimum-Resource, Latency-Constrained List scheduling (Minimum resource version) Basic Idea - 1) Compute latest start times for each op using ALAP with the specified latency constraint + Latest start times must account for multicycle operations - 2) For each resource type + 3) Determine candidate nodes + 4) Compute slack for each candidate (Slack = latest possible cycle - current cycle) + 5) Schedule ops with 0 slack (update the required number of resources; assume 1 of each to start with) + 6) Schedule ops that require no extra resources - 7) Repeat from 2) until all nodes scheduled
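A C++ sketch of the minimum-resource, latency-constrained list scheduler described above: an ALAP pass gives each operation its latest possible start cycle, zero-slack candidates are always scheduled (growing the resource count when necessary), and the remaining candidates are scheduled only if a resource instance is already free. Unit latency, topological order, and the example DFG are assumptions.

#include <algorithm>
#include <cstdio>
#include <vector>

struct Op {
    int type;                // resource type (0 = ALU, 1 = multiplier)
    std::vector<int> preds;  // data dependencies
    int alap;                // latest possible start cycle (filled in by the ALAP pass)
    int cycle;               // assigned start cycle, 0 = unscheduled
};

// Minimum-resource list scheduling under latency constraint L; returns per-type resource counts.
std::vector<int> min_resource_schedule(std::vector<Op>& ops, int L, int ntypes) {
    for (auto& op : ops) op.alap = L;                  // ALAP pass (reverse topological order)
    for (int i = (int)ops.size() - 1; i >= 0; i--)
        for (int p : ops[i].preds)
            ops[p].alap = std::min(ops[p].alap, ops[i].alap - 1);

    std::vector<int> used(ntypes, 1);                  // start by assuming one of each resource
    int scheduled = 0, cycle = 0;
    while (scheduled < (int)ops.size()) {
        cycle++;
        std::vector<int> busy(ntypes, 0);              // resources of each type used this cycle
        for (int t = 0; t < ntypes; t++) {
            std::vector<int> cand;
            for (int i = 0; i < (int)ops.size(); i++) {
                if (ops[i].cycle != 0 || ops[i].type != t) continue;
                bool ready = true;
                for (int p : ops[i].preds)
                    if (ops[p].cycle == 0 || ops[p].cycle >= cycle) ready = false;
                if (ready) cand.push_back(i);
            }
            // Consider zero-slack candidates first (slack = alap - current cycle).
            std::sort(cand.begin(), cand.end(),
                      [&](int a, int b) { return ops[a].alap < ops[b].alap; });
            for (int i : cand) {
                bool must = (ops[i].alap <= cycle);                   // zero slack: must go now
                if (must) used[t] = std::max(used[t], busy[t] + 1);   // grow resources if needed
                if (must || busy[t] < used[t]) {                      // else only if a unit is free
                    ops[i].cycle = cycle;
                    busy[t]++;
                    scheduled++;
                }
            }
        }
    }
    return used;
}

int main() {
    // Made-up DFG: multiply 0 feeds multiply 3; ALU ops 1,2 feed ALU op 4; ALU op 5 consumes op 3.
    std::vector<Op> ops = { {1, {}, 0, 0}, {0, {}, 0, 0}, {0, {}, 0, 0},
                            {1, {0}, 0, 0}, {0, {1, 2}, 0, 0}, {0, {3}, 0, 0} };
    std::vector<int> used = min_resource_schedule(ops, /*L=*/3, /*ntypes=*/2);
    printf("needs %d ALU(s) and %d multiplier(s)\n", used[0], used[1]);
    for (size_t i = 0; i < ops.size(); i++) printf("op %zu -> cycle %d\n", i, ops[i].cycle);
    return 0;
}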
77
Minimum-Resource, Latency-Constrained 1) Find ALAP schedule - Latency constraint = 3 cycles - The ALAP schedule defines the last possible cycle (LPC) for each operation: node 1: 1, node 2: 1, node 3: 1, node 4: 3, node 5: 2, node 6: 2, node 7: 3
78
Minimum-Resource, Latency-Constrained 2) For each resource type + 3) Determine candidate nodes C + 4) Compute slack for each candidate (Slack = latest possible cycle - current cycle) - Cycle 1: Candidates = {1,2,3,4}, slacks = 0, 0, 0, 2 - Initial Resources = 1 Mult, 1 ALU
79
Minimum-Resource, Latency-Constrained 5) Schedule ops with 0 slack - Update required number of resources 6) Schedule ops that require no extra resources - Cycle 1: the zero-slack candidates 1, 2, 3 are scheduled, raising the requirement to Resources = 1 Mult, 2 ALU; op 4 would require 1 more ALU, so it is not scheduled
80
Minimum-Resource, Latency-Constrained 2) For each resource type + 3) Determine candidate nodes C + 4) Compute slack for each candidate (Slack = latest possible cycle - current cycle) - Cycle 2: Candidates = {4,5,6}, slacks = 1, 0, 0 - Resources = 1 Mult, 2 ALU
81
Minimum-Resource, Latency-Constrained 5) Schedule ops with 0 slack - Update required number of resources 6) Schedule ops that require no extra resources - Cycle 2: the zero-slack candidates 5 and 6 are scheduled, raising the requirement to Resources = 2 Mult, 2 ALU; with only 1 ALU in use this cycle, op 4 can also be scheduled without adding resources
82
Minimum-Resource, Latency-Constrained 2) For each resource type + 3) Determine candidate nodes C + 4) Compute slack for each candidate (Slack = latest possible cycle - current cycle) - Cycle 3: Candidates = {7}, slack = 0 - Resources = 2 Mult, 2 ALU
83
Minimum-Resource, Latency-Constrained Final Schedule: 3 cycles (Cycle 1 - Cycle 3) - Required Resources = 2 Mult, 2 ALU
84
Other extensions Chaining - Multiple operations in a single cycle Pipelining - Input: DFG, data delivery rate - For fully pipelined circuit, must have one resource per operation (remember systolic arrays) a b cd + + + - e f / Multiple adds may be faster than 1 divide - perform adds in one cycle
85
Summary Scheduling assigns each operation in a DFG a start time - Done for each DFG in the CDFG Different Types - Minimum Latency + ASAP, ALAP - Latency-constrained + ASAP, ALAP - Minimum-latency, resource-constrained + Hu’s Algorithm + List Scheduling - Minimum-resource, latency-constrained + List Scheduling
86
Binding/Resource Sharing
87
Binding During scheduling, we determined: - When operations will execute - How many resources are needed We still need to decide which operations execute on which resources – binding - If multiple operations use the same resource, we need to decide how resources are shared - resource sharing.
88
Binding Map operations onto resources such that operations in same cycle do not use same resource * + + * * + - - 1 23 4 5 6 7 8 Cycle1 Cycle2 Cycle3 Cycle4 2 ALUs (+/-), 2 Multipliers Mult1 ALU1 ALU2Mult2
89
Binding Many possibilities - Bad binding may increase resources, require huge steering logic, reduce clock, etc. * + + * * + - - 1 23 4 5 6 7 8 Cycle1 Cycle2 Cycle3 Cycle4 2 ALUs (+/-), 2 Multipliers Mult1 ALU1 ALU2 Mult2
90
Binding Can’t do this - 1 resource can’t perform multiple ops simultaneously! * + + * * + - - 1 23 4 5 6 7 8 Cycle1 Cycle2 Cycle3 Cycle4 2 ALUs (+/-), 2 Multipliers
91
Binding How to automate? - More graph theory Compatibility Graph - Each node is an operation - Edges represent compatible operations + Compatible - if two ops can share a resource I.e. Ops that use same type of resource (ALU, etc.) and are scheduled to different cycles
92
Compatibility Graph * + + * * + - - 1 23 4 5 6 7 8 Cycle1 Cycle2 Cycle3 Cycle4 28 74 3 1 6 5 ALUs Mults 5 and 6 not compatible (same cycle) 2 and 3 not compatible (same cycle)
93
Compatibility Graph * + + * * + - - 1 23 4 5 6 7 8 Cycle1 Cycle2 Cycle3 Cycle4 28 74 3 1 6 5 ALUs Mults Note - Fully connected subgraphs can share a resource (all involved nodes are compatible)
96
Compatibility Graph Binding: Find minimum number of fully connected subgraphs that cover entire graph - Well-known problem: Clique partitioning (NP-complete) + Cliques = { {2,8,7,4},{3},{1,5},{6} } ALU1 executes 2,8,7,4 ALU2 executes 3 MULT1 executes 1,5 MULT2 executes 6 28 74 3 1 6 5
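Because exact clique partitioning is NP-complete, practical binders use heuristics. The C++ sketch below checks compatibility directly (same resource type, different cycles) and greedily grows cliques, opening a new resource instance whenever an operation is incompatible with every existing clique; the scheduled operations in main are made up, and the greedy result is not guaranteed to be the minimum number of cliques.

#include <cstdio>
#include <vector>

struct Op {
    int type;   // resource type (0 = ALU, 1 = multiplier)
    int cycle;  // start cycle from the scheduler
};

// Two operations are compatible if they need the same resource type
// and are scheduled in different cycles.
bool compatible(const Op& a, const Op& b) {
    return a.type == b.type && a.cycle != b.cycle;
}

// Greedy clique cover of the compatibility graph: each clique becomes one resource instance.
std::vector<std::vector<int>> clique_bind(const std::vector<Op>& ops) {
    std::vector<std::vector<int>> cliques;
    for (int i = 0; i < (int)ops.size(); i++) {
        bool placed = false;
        for (auto& clique : cliques) {
            bool ok = true;
            for (int j : clique)
                if (!compatible(ops[i], ops[j])) ok = false;   // must be compatible with all members
            if (ok) { clique.push_back(i); placed = true; break; }
        }
        if (!placed) cliques.push_back({i});                   // open a new resource instance
    }
    return cliques;
}

int main() {
    // Made-up scheduled DFG: only types and cycles are needed for binding.
    std::vector<Op> ops = {
        {1, 1}, {0, 1}, {0, 1}, {1, 2}, {0, 2}, {1, 3}, {0, 3}, {0, 4}
    };
    auto cliques = clique_bind(ops);
    for (size_t r = 0; r < cliques.size(); r++) {
        printf("resource %zu (type %d): ops", r, ops[cliques[r][0]].type);
        for (int i : cliques[r]) printf(" %d", i);
        printf("\n");
    }
    return 0;
}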
97
Compatibility Graph * + + * * + - - 1 23 4 5 6 7 8 Cycle1 Cycle2 Cycle3 Cycle4 28 74 3 1 6 5 ALUs Mults Final Binding:
98
Compatibility Graph * + + * * + - - 1 23 4 5 6 7 8 Cycle1 Cycle2 Cycle3 Cycle4 28 74 3 1 6 5 ALUs Mults Alternative Final Binding:
99
Translation to Datapath Bound resources from the schedule: Mult(1,5), Mult(6), ALU(2,7,8,4), ALU(3) 1) Add resources and registers 2) Add a mux for each input 3) Add an input to the left mux for each left input in the DFG 4) Do the same for the right mux 5) If there is only 1 input, remove the mux
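A small C++ sketch of steps 2-5: given a binding, it collects the distinct sources driving each resource's left and right inputs, so a mux is needed only where an input has more than one source. The resource numbering and the operand names are purely illustrative.

#include <cstdio>
#include <set>
#include <string>
#include <vector>

// One operation after scheduling and binding: which resource instance runs it,
// and where its left and right operands come from (a register or another resource).
struct BoundOp {
    int resource;
    std::string left, right;
};

int main() {
    // Illustrative binding: resource 0 is shared by three ops, resource 1 runs one op.
    std::vector<BoundOp> ops = {
        {0, "a", "b"}, {0, "c", "b"}, {0, "reg1", "d"},   // shared ALU
        {1, "e", "f"}                                     // dedicated multiplier
    };
    int nresources = 2;

    // Steps 2-4: for each resource, collect the distinct sources feeding its two inputs.
    for (int r = 0; r < nresources; r++) {
        std::set<std::string> left, right;
        for (const BoundOp& op : ops)
            if (op.resource == r) { left.insert(op.left); right.insert(op.right); }

        // Step 5: a mux is only needed when an input has more than one possible source.
        printf("resource %d: left input needs %s, right input needs %s\n", r,
               left.size() > 1 ? "a mux" : "a plain wire",
               right.size() > 1 ? "a mux" : "a plain wire");
        for (const std::string& s : left)  printf("  left source:  %s\n", s.c_str());
        for (const std::string& s : right) printf("  right source: %s\n", s.c_str());
    }
    return 0;
}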
100
Left Edge Algorithm Alternative to clique partitioning - Take the scheduled DFG and rotate it 90 degrees, so each operation becomes an interval of cycles on its resource type - Example: the earlier multicycle schedule with 2 ALUs (+/-), 2 Multipliers
101
Left Edge Algorithm (2 ALUs (+/-), 2 Multipliers) 1) Initialize right_edge to 0 2) Find a node N whose left edge is >= right_edge 3) Bind N to a particular resource 4) Update right_edge to the right edge of N 5) Repeat from 2) for nodes using the same resource type until right_edge passes all nodes 6) Repeat from 1) until all nodes are bound
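A C++ sketch of the left-edge idea for one resource type: each scheduled operation is treated as an inclusive interval of cycles, and repeated left-to-right sweeps pack non-overlapping intervals onto the same resource instance. The interval convention and the example operations are assumptions.

#include <algorithm>
#include <cstdio>
#include <vector>

// Each scheduled operation occupies a cycle interval [left, right] on one resource type.
struct OpInterval {
    int id;
    int left, right;   // first and last cycle the operation occupies (inclusive)
};

// Left-edge binding for one resource type: repeatedly sweep the operations in
// left-edge order, packing non-overlapping intervals onto the same resource.
std::vector<std::vector<int>> left_edge_bind(std::vector<OpInterval> ops) {
    std::sort(ops.begin(), ops.end(),
              [](const OpInterval& a, const OpInterval& b) { return a.left < b.left; });
    std::vector<std::vector<int>> resources;
    std::vector<bool> bound(ops.size(), false);
    size_t remaining = ops.size();
    while (remaining > 0) {
        resources.push_back({});                 // open a new resource instance
        int right_edge = 0;
        for (size_t i = 0; i < ops.size(); i++) {
            // Skip ops already bound or overlapping the current occupant (intervals inclusive).
            if (bound[i] || ops[i].left <= right_edge) continue;
            resources.back().push_back(ops[i].id);
            right_edge = ops[i].right;
            bound[i] = true;
            remaining--;
        }
    }
    return resources;
}

int main() {
    // Made-up multiplier operations with multicycle intervals (start cycle, end cycle).
    std::vector<OpInterval> mults = { {1, 1, 2}, {2, 1, 2}, {5, 3, 4}, {8, 5, 6} };
    auto binding = left_edge_bind(mults);
    for (size_t r = 0; r < binding.size(); r++) {
        printf("multiplier %zu runs ops:", r);
        for (int id : binding[r]) printf(" %d", id);
        printf("\n");
    }
    return 0;
}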
110
Extensions Algorithms presented so far find a valid binding - But, do not consider amount of steering logic required - Different bindings can require significantly different # of muxes One solution - Extend compatibility graph + Use weighted edges/nodes - cost function representing steering logic + Perform clique partitioning, finding the set of cliques that minimize weight
111
Binding Summary Binding maps operations onto physical resources - Determines sharing among resources Binding may greatly affect steering logic Trivial for fully-pipelined circuits - 1 resource per operation Straightforward translation from bound DFG to datapath
112
Summary
113
Main Steps Front-end (lexing/parsing) converts code into intermediate representation - We looked at CDFG Scheduling assigns a start time for each operation in DFG - CFG node start times defined by control dependencies - Resource allocation determined by schedule Binding maps scheduled operations onto physical resources - Determines how resources are shared Big picture: - Scheduled/Bound DFG can be translated into a datapath - CFG can be translated to a controller - => High-level synthesis can create a custom circuit for any CDFG!
114
Limitations Task-level parallelism - Parallelism in CDFG limited to individual control states + Can’t have multiple states executing concurrently - Potential solution: use a model other than CDFG + Kahn Process Networks Nodes represent parallel processes/tasks Edges represent communication between processes + High-level synthesis can create a controller+datapath for each process Must also consider communication buffers - Challenge: + Most high-level code does not have explicit parallelism Difficult/impossible to extract task-level parallelism from code
115
Limitations Coding practices limit circuit performance - Very often, languages contain constructs not appropriate for circuit implementation + Recursion, pointers, virtual functions, etc. Potential solution: use specialized languages - Remove problematic constructs, add task-level parallelism Challenge: - Difficult to learn new languages - Many designers resist changes to tool flow
116
Limitations Expert designers can achieve better circuits - High-level synthesis has to work with specification in code + Can be difficult to automatically create efficient pipeline + May require dozens of optimizations applied in a particular order - Expert designer can transform algorithm + Synthesis can transform code, but can’t change algorithm Potential Solution: ??? - New language? - New methodology? - New tools?