High-Level Synthesis Creating Custom Circuits from High-Level Code Hao Zheng Comp Sci & Eng University of South Florida.

Presentation on theme: "High-Level Synthesis Creating Custom Circuits from High-Level Code Hao Zheng Comp Sci & Eng University of South Florida."— Presentation transcript:

1 High-Level Synthesis Creating Custom Circuits from High-Level Code Hao Zheng Comp Sci & Eng University of South Florida

2 Existing FPGA Tool Flow Register-transfer (RT) synthesis - Specify RT structure (muxes, registers, etc.) - Allows precise specification - But time-consuming, difficult, and error-prone Synthesizable HDL Netlist Bitfile Processor FPGA RT Synthesis Physical Design Technology Mapping Placement Routing

3 Future FPGA Tool Flow HDL Netlist Bitfile Processor FPGA RT Synthesis Physical Design Technology Mapping Placement Routing High-level Synthesis C/C++, Java, etc. Synthesizable HDL

4 High-level Synthesis - Benefits Ratio of C to VHDL developers (10000:1 ?) Easier to specify complex designs Technology/architecture-independent designs Manual HW design potentially slower -Similar to assembly code era -Programmers could always beat compiler -But, no longer the case Ease of HW/SW partitioning -enhance overall system efficiency More efficient verification and validation -Easier to verify and validate high-level code

5 High-level Synthesis More challenging than SW compilation -Compilation maps behavior into assembly instructions -Architecture is known to compiler HLS creates a custom architecture to execute specified behavior -Huge hardware exploration space -Best solution may include microprocessors -Ideally, should handle any high-level code + But, not all code appropriate for hardware

6 High-level Synthesis: An Example First, consider how to manually convert high-level code into circuit Steps 1)Build FSM for controller 2)Build datapath based on FSM acc = 0; for (i=0; i < 128; i++) acc += a[i];

7 A Manual Example Build an FSM (controller) - Decompose code into states acc = 0; for (i=0; i < 128; i++) acc += a[i]; acc=0, i = 0 load a[i] acc += a[i] i++ Done i < 128 i >= 128

8 A Manual Example Build a datapath - Allocate resources for each state acc i acc=0, i = 0 load a[i] acc += a[i] i++ Done < addr a[i] + + + 1 128 1 acc = 0; for (i=0; i < 128; i++) acc += a[i]; i < 128 i >= 128

9 A Manual Example Build a datapath - Determine register inputs acc i < addr a[i] + + + 1 128 2x1 0 0 1 &a In from memory acc=0, i = 0 load a[i] acc += a[i] i++ Done i < 128 i >= 128 acc = 0; for (i=0; i < 128; i++) acc += a[i];

10 A Manual Example Build a datapath - Add outputs acc i < addr a[i] + + + 1 128 2x1 0 0 1 &a In from memory Memory address acc acc = 0; for (i=0; i < 128; i++) acc += a[i]; acc=0, i = 0 load a[i] acc += a[i] i++ Done i < 128 i >= 128

11 A Manual Example Build a datapath - Add control signals acc i < addr a[i] + + + 1 128 2x1 0 0 1 &a In from memory Memory address acc acc = 0; for (i=0; i < 128; i++) acc += a[i]; acc=0, i = 0 load a[i] acc += a[i] i++ Done i < 128 i >= 128

12 A Manual Example Combine controller + datapath acc i < addr a[i] + + + 1 128 2x1 0 0 1 &a In from memory Memory address acc Done Memory Read Controller acc = 0; for (i=0; i < 128; i++) acc += a[i]; Start

13 A Manual Example - Optimization Alternatives -Use one adder (plus muxes) acc i < addr a[i] + 128 2x1 0 0 1 &a In from memory Memory address acc MUX

14 A Manual Example – Summary Comparison with high-level synthesis - Determining when to perform each operation => Scheduling - Allocating resource for each operation => Resource allocation - Mapping operations onto resources => Binding

15 Another Example: Try it at home Your turn x=0; for (i=0; i < 100; i++) { if (a[i] > 0) x ++; else x --; a[i] = x; } //output x 1)Build FSM (do not perform if conversion) 2)Build datapath based on FSM

16 High-Level Synthesis Input: high-level code - could be C, C++, Java, Perl, Python, SystemC, ImpulseC, etc. Output: custom circuit - usually an RT-level VHDL/Verilog description, but could be as low-level as a bitfile

17 High-Level Synthesis – Overview High-Level Synthesis acc = 0; for (i=0; i < 128; i++) acc += a[i]; acc i < addr a[i] + + + 1 128 2x1 0 0 1 &a In from memory Memory address acc Done Memory Read Controller

18 Main Steps Syntactic Analysis Optimization Scheduling/Resource Allocation Binding/Resource Sharing High-level Code Intermediate Representation Cycle accurate RTL code Converts code to intermediate representation - allows all following steps to use language independent format. Determines when each operation will execute, and resources used Maps operations onto physical resources Front-end Back-end

19 Syntactic Analysis Definition: Analysis of code to verify syntactic correctness -Converts code into intermediate representation Steps: similar to SW compilation 1)Lexical analysis (Lexing) 2)Parsing 3)Code generation – intermediate representation Syntactic Analysis High-level Code Intermediate Representation Lexical Analysis Parsing

20 Intermediate Representation Parser converts tokens to intermediate representation -Usually, an abstract syntax tree x = 0; if (y < z) x = 1; d = 6; Assign if cond assign x 0 x 1 d 6 y < z

21 Intermediate Representation Why use intermediate representation? - Easier to analyze/optimize than source code - Theoretically can be used for all languages + Makes synthesis back end language independent Syntactic Analysis C Code Intermediate Representation Syntactic Analysis Java Syntactic Analysis Perl Back End Scheduling, resource allocation, binding, independent of source language - sometimes optimizations too

22 Intermediate Representation Different Types - Abstract Syntax Tree - Control/Data Flow Graph (CDFG) - Sequencing Graph +... We will focus on CDFG - Combines control flow graph (CFG) and data flow graph (DFG)

23 Control Flow Graphs (CFGs) Represents control flow dependencies of basic blocks A basic block is a section of code that always executes from beginning to end + I.e. no jumps into or out of block acc = 0; for (i=0; i < 128; i++) acc += a[i]; i < 128? acc=0, i = 0 acc += a[i] i ++ Done yes no
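The leader-finding rule behind basic blocks can be sketched in a few lines (Python for brevity; the toy three-address instruction format and the flattened accumulator loop are illustrative assumptions, not the output of any real front end):

```python
# A minimal sketch of basic-block detection over a toy instruction list,
# where each entry is (op, jump_target_or_None).

def find_leaders(instrs):
    """Indices that start a basic block: the first instruction,
    every jump target, and every instruction after a jump/branch."""
    leaders = {0}
    for i, (op, target) in enumerate(instrs):
        if op in ("jump", "branch"):
            leaders.add(target)          # jump target starts a block
            if i + 1 < len(instrs):
                leaders.add(i + 1)       # fall-through after a jump
    return leaders

def basic_blocks(instrs):
    """Split the instruction list into basic blocks at the leaders."""
    bounds = sorted(find_leaders(instrs)) + [len(instrs)]
    return [list(range(bounds[i], bounds[i + 1]))
            for i in range(len(bounds) - 1)]

# The accumulator loop from the slide, flattened:
# 0: acc=0,i=0  1: branch (i>=128 -> 5)  2: acc+=a[i]  3: i++  4: jump 1  5: done
prog = [("assign", None), ("branch", 5), ("assign", None),
        ("assign", None), ("jump", 1), ("assign", None)]
```

The four recovered blocks match the four CFG nodes on the slide: initialization, loop test, loop body, and done.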

24 Control Flow Graphs: Your Turn Find a CFG for the following code. i = 0; while (i < 10) { if (x < 5) y = 2; else if (z < 10) y = 6; i++; }

25 Data Flow Graphs Represents data dependencies between operations x = a+b; y = c*d; z = x - y; + * - a b c d x z y
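A DFG for this exact fragment can be represented and evaluated directly; any order that respects the data dependencies gives the same result (a Python sketch; the dict-based node representation is an illustrative choice, not a standard HLS data structure):

```python
import operator

# node -> (operation, operand nodes), for x = a+b; y = c*d; z = x - y;
dfg = {
    "x": (operator.add, ("a", "b")),
    "y": (operator.mul, ("c", "d")),
    "z": (operator.sub, ("x", "y")),
}

def evaluate(dfg, inputs):
    """Evaluate each DFG node once its operands are available."""
    values = dict(inputs)
    pending = dict(dfg)
    while pending:
        for node, (op, args) in list(pending.items()):
            if all(a in values for a in args):   # data dependencies met
                values[node] = op(*(values[a] for a in args))
                del pending[node]
    return values

v = evaluate(dfg, {"a": 1, "b": 2, "c": 3, "d": 4})
```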

26 Control/Data Flow Graph Combines CFG and DFG - Maintains DFG for each node of CFG acc = 0; for (i=0; i < 128; i++) acc += a[i]; if (i < 128) acc=0; i=0; acc += a[i] i ++ Done acc 0 i 0 + a[i] acc + i 1 i

27 Optimization

28 Synthesis Optimizations After creating CDFG, high-level synthesis optimizes it with the following goals -Reduce area -Improve latency -Increase parallelism -Reduce power/energy 2 types of optimizations -Data flow optimizations -Control flow optimizations

29 Tree-height reduction -Generally made possible from commutativity, associativity, and distributivity Data Flow Optimizations + + + + + + a b cd a b cd + + * a b cd + * + a b cd x = a + b + c + d
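The benefit is easy to quantify on x = a + b + c + d: the left-leaning parse needs three dependent additions, the balanced tree only two (a Python sketch with tuples standing in for expression trees):

```python
def depth(tree):
    """Height of an expression tree; leaves (variable names) have depth 0."""
    if isinstance(tree, str):
        return 0
    _, lhs, rhs = tree
    return 1 + max(depth(lhs), depth(rhs))

chain    = ("+", ("+", ("+", "a", "b"), "c"), "d")   # ((a+b)+c)+d, as written
balanced = ("+", ("+", "a", "b"), ("+", "c", "d"))   # (a+b)+(c+d), after reduction
```

With one adder per level, the balanced form finishes a cycle earlier.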

30 Operator Strength Reduction -Replacing an expensive (“strong”) operation with a faster one -Common example: replacing multiply/divide with shift Data Flow Optimizations b[i] = a[i] * 8; b[i] = a[i] << 3; a = b * 5; c = b << 2; a = b + c; 1 multiplication 0 multiplications a = b * 13; c = b << 2; d = b << 3; a = c + d + b;
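The slide's rewrites can be checked directly; each shift-and-add form equals the original multiply for non-negative values (a Python sketch):

```python
def times8(a):
    return a << 3                    # a * 8

def times5(b):
    return (b << 2) + b              # b * 5 = 4b + b

def times13(b):
    return (b << 2) + (b << 3) + b   # b * 13 = 4b + 8b + b
```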

31 Constant propagation -Statically evaluate expressions with constants Data Flow Optimizations x = 0; y = x * 15; z = y + 10; x = 0; y = 0; z = 10;
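A minimal constant-propagation pass for straight-line code like the slide's can be sketched as below (the three-tuple statement format is an illustrative toy, and only + and * are handled):

```python
def fold_constants(stmts):
    """stmts: list of (target, op, operands); operands are ints or names."""
    env = {}      # names whose values are statically known
    out = []
    for target, op, args in stmts:
        vals = [env.get(a, a) if isinstance(a, str) else a for a in args]
        if all(isinstance(v, int) for v in vals):
            result = vals[0] * vals[1] if op == "*" else vals[0] + vals[1]
            env[target] = result
            out.append((target, "const", [result]))   # folded at compile time
        else:
            out.append((target, op, vals))
    return out

prog = [("x", "+", [0, 0]),        # x = 0
        ("y", "*", ["x", 15]),     # y = x * 15
        ("z", "+", ["y", 10])]     # z = y + 10
folded = fold_constants(prog)
```

The pass reproduces the slide's result: x = 0, y = 0, z = 10, with no multiplier needed.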

32 Function Specialization -Create specialized code for common inputs + Treat common inputs as constants + If inputs not known statically, must include if statement for each call to specialized function Data Flow Optimizations int f (int x) { y = x * 15; return y + 10; } for (I=0; I < 1000; I++) f(0); … } int f_opt () { return 10; } for (I=0; I < 1000; I++) f_opt(); … } Treat frequent input as a constant int f (int x) { y = x * 15; return y + 10; }

33 Data Flow Optimizations Common sub-expression elimination -If expression appears more than once, repetitions can be replaced a = x + y;...... b = c * 25 + x + y; a = x + y;...... b = c * 25 + a; x + y already determined

34 Data Flow Optimizations Dead code elimination -Remove code that is never executed + May seem like stupid code, but often comes from constant propagation or function specialization int f (int x) { if (x > 0 ) a = b * 15; else a = b / 4; return a; } int f_opt () { a = b * 15; return a; } Specialized version for x > 0 does not need else branch - “dead code”

35 Data Flow Optimizations Code motion (hoisting/sinking) -Avoid repeated computation for (I=0; I < 100; I++) { z = x + y; b[i] = a[i] + z ; } z = x + y; for (I=0; I < 100; I++) { b[i] = a[i] + z ; }

36 Control Flow Optimizations Loop Unrolling -Replicate body of loop + May increase parallelism for (i=0; i < 128; i++) a[i] = b[i] + c[i]; for (i=0; i < 128; i+=2) { a[i] = b[i] + c[i]; a[i+1] = b[i+1] + c[i+1] }
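Unrolling by two can be checked directly (a Python sketch of the slide's loops; since the trip count 128 is even, no cleanup iteration is needed):

```python
def rolled(b, c):
    a = [0] * 128
    for i in range(128):
        a[i] = b[i] + c[i]
    return a

def unrolled(b, c):
    a = [0] * 128
    for i in range(0, 128, 2):       # two copies of the body per iteration
        a[i] = b[i] + c[i]
        a[i + 1] = b[i + 1] + c[i + 1]
    return a

b = list(range(128))
c = list(range(128, 0, -1))
```

In hardware, the two body copies have no dependence on each other, so with two adders they can execute in the same cycle.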

37 Control Flow Optimizations Function inlining – replace function call with body of function -Common for both SW and HW -SW: Eliminates function call instructions -HW: Eliminates unnecessary control states for (i=0; i < 128; i++) a[i] = f( b[i], c[i] );.. int f (int a, int b) { return a + b * 15; } for (i=0; i < 128; i++) a[i] = b[i] + c[i] * 15;

38 Conditional Expansion – replace if with logic expression -Execute if/else bodies in parallel Control Flow Optimizations y = ab if (a) x = b+d else x = bd y = ab x = a(b+d) + a‘bd y = ab x = y + d(a+b) [DeMicheli] Can be further optimized to:
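Because the variables are Boolean, the rewrite can be verified exhaustively; all three forms below agree on every input (a Python sketch, with a' modeled as 1 - a):

```python
def original(a, b, d):
    return (b | d) if a else (b & d)             # if (a) x = b+d else x = bd

def expanded(a, b, d):
    return (a & (b | d)) | ((1 - a) & b & d)     # x = a(b+d) + a'bd

def optimized(a, b, d):
    y = a & b
    return y | (d & (a | b))                     # x = y + d(a+b)

all_match = all(original(a, b, d) == expanded(a, b, d) == optimized(a, b, d)
                for a in (0, 1) for b in (0, 1) for d in (0, 1))
```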

39 Example Optimize this x = 0; y = a + b; if (x < 15) z = a + b - c; else z = x + 12; output = z * 12;

40 Scheduling/Resource Allocation

41 Scheduling Scheduling assigns a start time to each operation in DFG -Start times must not violate dependencies in DFG -Start times must meet performance constraints +Alternatively, resource constraints Performed on the DFG of each CFG node - Cannot execute multiple CFG nodes in parallel

42 Examples + + + + + + a b cd a b cd + + + a b cd Cycle1 Cycle2 Cycle3 Cycle1 Cycle2 Cycle1 Cycle2

43 Scheduling Problems Several types of scheduling problems - Usually some combination of performance and resource constraints Problems: - Unconstrained + Not very useful, every schedule is valid - Minimum latency - Latency constrained - Minimum-latency, resource-constrained + i.e. find the schedule with the shortest latency that uses no more than a specified # of resources + NP-Complete - Minimum-resource, latency-constrained + i.e. find the schedule that meets the latency constraint (which may be anything), and uses the minimum # of resources + NP-Complete

44 Minimum Latency Scheduling ASAP (as soon as possible) algorithm - Find a candidate node + Candidate is a node whose predecessors have been scheduled and completed (or has no predecessors) - Schedule node one cycle later than max cycle of its predecessors - Repeat until all nodes scheduled + + * a b cd * - < e f g h Cycle1 Cycle2 Cycle3 + Cycle4 Minimum possible latency - 4 cycles
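The ASAP algorithm is only a few lines once the DFG is a predecessor map (a Python sketch assuming single-cycle operations, so "scheduled" and "completed" coincide; the node names are illustrative):

```python
def asap(preds):
    """Schedule each node one cycle after its latest predecessor."""
    start = {}
    while len(start) < len(preds):
        for node, ps in preds.items():
            if node not in start and all(p in start for p in ps):
                # candidate: all predecessors scheduled (or none)
                start[node] = 1 + max((start[p] for p in ps), default=0)
    return start

# node -> predecessors; a 4-cycle critical path a1 -> a2 -> a3 -> c1
dfg = {"a1": [], "a2": ["a1"], "m1": [], "a3": ["a2", "m1"], "c1": ["a3"]}
schedule = asap(dfg)
```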

45 Minimum Latency Scheduling ALAP (as late as possible) algorithm - Run ASAP, get minimum latency L - Find a candidate + Candidate is node whose successors are scheduled (or has none) - Schedule node one cycle before min cycle of successor + Nodes with no successors scheduled to cycle L - Repeat until all nodes scheduled + + * a b cd * - < e f g h Cycle1 Cycle2 Cycle3 + Cycle4 L = 4 cycles

46 Minimum Latency Scheduling ALAP - Has to run ASAP first, seems pointless - But, many heuristics need the mobility/slack of each operation + ASAP gives the earliest possible time for an operation + ALAP gives the latest possible time for an operation - Slack = difference between earliest and latest possible schedule + Slack = 0 implies operation has to be done in the current scheduled cycle + The larger the slack, the more options a heuristic has to schedule the operation
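ALAP and slack follow the same pattern: run ASAP to get the minimum latency L, schedule each node one cycle before its earliest successor, and subtract (a Python sketch, single-cycle operations assumed; node names are illustrative):

```python
def asap(preds):
    start = {}
    while len(start) < len(preds):
        for n, ps in preds.items():
            if n not in start and all(p in start for p in ps):
                start[n] = 1 + max((start[p] for p in ps), default=0)
    return start

def alap(preds, latency):
    """Schedule each node one cycle before its earliest successor;
    nodes with no successors go to the last cycle."""
    succs = {n: [m for m, ps in preds.items() if n in ps] for n in preds}
    start = {}
    while len(start) < len(preds):
        for n in preds:
            if n not in start and all(s in start for s in succs[n]):
                start[n] = min((start[s] for s in succs[n]),
                               default=latency + 1) - 1
    return start

def slack(preds):
    early = asap(preds)
    late = alap(preds, max(early.values()))    # L from ASAP
    return {n: late[n] - early[n] for n in preds}

dfg = {"a1": [], "a2": ["a1"], "m1": [], "a3": ["a2", "m1"], "c1": ["a3"]}
mobility = slack(dfg)
```

Only m1 is off the critical path, so only it has nonzero slack: it may sit in cycle 1 or 2.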

47 Latency-Constrained Scheduling Instead of finding the minimum latency, find a schedule with latency no greater than L Solutions: - Use ASAP, verify that the minimum latency is no greater than L - Use ALAP starting with cycle L instead of minimum latency (don’t need ASAP)

48 Scheduling with Resource Constraints Schedule must use no more than the specified number of resources + * a b cd + - e f g Cycle1 Cycle3 Cycle4 + Cycle5 + * Cycle2 Constraints: 1 ALU (+/-), 1 Multiplier

49 Scheduling with Resource Constraints Schedule must use no more than the specified number of resources + + * a b cd + - e f g Cycle1 Cycle2 Cycle3 + Cycle4 * Constraints: 2 ALU (+/-), 1 Multiplier

50 Minimum-Latency, Resource-Constrained Scheduling Definition: Given resource constraints, find the schedule that has the minimum latency - Example: a b c d e f g + + Constraints: 1 ALU (+/-), 1 Multiplier + - * +

53 Minimum-Latency, Resource-Constrained Scheduling Hu’s Algorithm - Assumes one type of resource Basic Idea - Input: graph, # of resources r - 1) Label each node by max distance from output + i.e. Use path length as priority - 2) Determine C, the set of scheduling candidates + Candidate if either no predecessors, or predecessors scheduled - 3) From C, schedule up to r nodes to current cycle, using label as priority - 4) Increment current cycle, repeat from 2) until all nodes scheduled
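The steps above can be sketched directly (Python; one resource type with r units and single-cycle operations assumed; the small example graph is illustrative, not the slide's):

```python
def hu_schedule(preds, r):
    succs = {n: [m for m, ps in preds.items() if n in ps] for n in preds}
    label = {}
    def dist(n):                      # 1) longest path from n to an output
        if n not in label:
            label[n] = 1 + max((dist(s) for s in succs[n]), default=0)
        return label[n]
    for n in preds:
        dist(n)
    done, sched, cycle = set(), {}, 1
    while len(sched) < len(preds):
        # 2) candidates: unscheduled, all predecessors done in earlier cycles
        ready = [n for n in preds
                 if n not in done and all(p in done for p in preds[n])]
        ready.sort(key=lambda n: -label[n])   # 3) longest path first
        for n in ready[:r]:                   #    at most r per cycle
            sched[n] = cycle
        done |= set(ready[:r])
        cycle += 1                            # 4) next cycle
    return sched

dfg = {"a": [], "b": [], "c": [], "d": ["a"], "e": ["b", "c"], "f": ["d", "e"]}
sched = hu_schedule(dfg, 2)
```

With r = 2, node c loses the cycle-1 tie and slips to cycle 2, delaying e and f by one cycle each.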

54 Minimum-Latency, Resource-Constrained Scheduling Hu’s Algorithm - Example + + * a b cd + - e f g + * j k - r = 3

55 Minimum-Latency, Resource-Constrained Scheduling Hu’s Algorithm - Step 1 - Label each node by max distance from output + i.e. use path length as priority a b cd e f g 4 4 3 2 3 2 1 j k 1 r = 3

56 Minimum-Latency, Resource-Constrained Scheduling Hu’s Algorithm - Step 2 - Determine C, the set of scheduling candidates a b cd e f g 4 4 3 2 3 2 1 j k 1 C = r = 3 Cycle 1

57 Minimum-Latency, Resource-Constrained Scheduling Hu’s Algorithm - Step 3 - From C, schedule up to r nodes to current cycle, using label as priority a b cd e f g 4 4 3 2 3 2 1 j k 1 r = 3 Not scheduled due to lower priority Cycle1

58 Minimum-Latency, Resource-Constrained Scheduling Hu’s Algorithm - Step 2 a b cd e f g 4 4 3 2 3 2 1 j k 1 r = 3 C = Cycle1 Cycle 2

59 Minimum-Latency, Resource-Constrained Scheduling Hu’s Algorithm - Step 3 a b cd e f g 4 4 3 2 3 2 1 j k 1 r = 3 Cycle1 Cycle2

60 Minimum-Latency, Resource-Constrained Scheduling Hu’s Algorithm - Skipping to finish a b cd e f g 4 4 3 2 3 2 1 j k 1 r = 3 Cycle1 Cycle2 Cycle3 Cycle4

61 Minimum-Latency, Resource-Constrained Scheduling Hu’s algorithm solves a simplified problem Common Extensions: - Multiple resource types - Multi-cycle operations a b cd + - / * Cycle1 Cycle2

62 Minimum-Latency, Resource-Constrained Scheduling List Scheduling - (minimum latency, resource-constrained version) - Extension for multiple resource types - Basic Idea - Hu’s algorithm for each resource type + Input: graph, set of constraints R for each resource type + 1) Label nodes based on max distance to output + 2) For each resource type t 3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors) 4) Schedule up to R t operations from C based on priority, to current cycle R t is the constraint on resource type t + 5) Increment cycle, repeat from 2) until all nodes scheduled
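List scheduling is then Hu's idea run once per resource type each cycle (a Python sketch, single-cycle operations assumed; the small mixed ALU/multiplier example is illustrative):

```python
def list_schedule(preds, optype, R):
    """preds: node -> predecessors; optype: node -> resource type;
    R: resource type -> number of units available."""
    succs = {n: [m for m, ps in preds.items() if n in ps] for n in preds}
    label = {}
    def dist(n):                      # 1) longest path to an output
        if n not in label:
            label[n] = 1 + max((dist(s) for s in succs[n]), default=0)
        return label[n]
    for n in preds:
        dist(n)
    done, sched, cycle = set(), {}, 1
    while len(sched) < len(preds):
        for t, limit in R.items():    # 2) Hu's algorithm per resource type
            ready = [n for n in preds
                     if n not in sched and optype[n] == t
                     and all(p in done for p in preds[n])]
            ready.sort(key=lambda n: -label[n])
            for n in ready[:limit]:   # 4) at most R_t ops of type t
                sched[n] = cycle
        done = set(sched)             # this cycle's ops complete now
        cycle += 1                    # 5) next cycle
    return sched

preds = {"m1": [], "a1": [], "a2": [],
         "m2": ["m1"], "a3": ["a1", "a2"], "a4": ["m2", "a3"]}
optype = {"m1": "mul", "m2": "mul",
          "a1": "alu", "a2": "alu", "a3": "alu", "a4": "alu"}
sched = list_schedule(preds, optype, {"mul": 1, "alu": 2})
```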

63 Minimum-Latency, Resource-Constrained Scheduling List scheduling - minimum latency - Step 1 - Label nodes based on max distance to output (not shown, so you can see operations) - *nodes given IDs for illustration purposes a b cd e f g * + + * * + - j k - 1 2 3 4 5 6 7 8 2 ALUs (+/-), 2 Multipliers

64 Minimum-Latency, Resource-Constrained Scheduling List scheduling - minimum latency + For each resource type t 3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors) 4) Schedule up to R t operations from C based on priority, to current cycle R t is the constraint on resource type t a b cd e f g * + + * * + - j k - 1 2 3 4 5 6 7 8 Mult ALU Cycle 1 2,3,4 1 2 ALUs (+/-), 2 Multipliers Candidates

65 Minimum-Latency, Resource-Constrained Scheduling List scheduling - minimum latency + For each resource type t 3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors) 4) Schedule up to R t operations from C based on priority, to current cycle R t is the constraint on resource type t a b cd e f g * + + * * + - j k - 1 2 3 4 5 6 7 8 Mult ALU Cycle 1 2,3,4 1 2 ALUs (+/-), 2 Multipliers Candidates Cycle1 Candidate, but not scheduled due to low priority

66 Minimum-Latency, Resource-Constrained Scheduling List scheduling - minimum latency + For each resource type t 3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors) 4) Schedule up to R t operations from C based on priority, to current cycle R t is the constraint on resource type t a b cd e f g * + + * * + - j k - 1 2 3 4 5 6 7 8 Mult ALU Cycle 1 2,3,4 1 5,6 4 2 2 ALUs (+/-), 2 Multipliers Candidates Cycle1

67 Minimum-Latency, Resource-Constrained Scheduling List scheduling - minimum latency + For each resource type t 3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors) 4) Schedule up to R t operations from C based on priority, to current cycle R t is the constraint on resource type t a b cd e f g * + + * * + - j k - 1 2 3 4 5 6 7 8 Mult ALU Cycle 1 2,3,4 1 5,6 4 2 2 ALUs (+/-), 2 Multipliers Candidates Cycle1 Cycle2

68 Minimum-Latency, Resource-Constrained Scheduling List scheduling - minimum latency + For each resource type t 3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors) 4) Schedule up to R t operations from C based on priority, to current cycle R t is the constraint on resource type t a b cd e f g * + + * * + - j k - 1 2 3 4 5 6 7 8 Mult ALU Cycle 1 2,3,4 1 5,6 4 2 7 3 2 ALUs (+/-), 2 Multipliers Candidates Cycle1 Cycle2

69 Minimum-Latency, Resource-Constrained Scheduling List scheduling - (minimum latency) - Final schedule - Note - ASAP would require more resources + ALAP wouldn’t here, but in general it would a b cd e f g * + + * * + - j k - 1 2 3 4 5 6 7 8 Cycle1 Cycle2 Cycle3 Cycle4 2 ALUs (+/-), 2 Multipliers

70 Minimum-Latency, Resource-Constrained Scheduling Extension for multicycle operations - Same idea (differences shown in red) - Input: graph, set of constraints R for each resource type - 1) Label nodes based on max cycle latency to output - 2) For each resource type t + 3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled and completed predecessors) + 4) Schedule up to (R t - n t ) operations from C based on priority, one cycle after predecessor R t is the constraint on resource type t n t is the number of resource t in use from previous cycles - Repeat from 2) until all nodes scheduled

71 Minimum-Latency, Resource-Constrained Scheduling Example: a b cd e f g * + + * * + - j k * 1 2 3 4 5 6 7 8 Cycle1 Cycle2 Cycle5 Cycle6 2 ALUs (+/-), 2 Multipliers Cycle4 Cycle3 Cycle7

72 List Scheduling (Min Latency) Your turn (2 ALUs, 1 Mult) Steps (will be on test) - 1) Label nodes with priority - 2) Update candidate list for each cycle - 3) Redraw graph to show schedule + * * - + + 1 2 3 7 4 9 - + + * 5 6 8 - 10 11

73 List Scheduling (Min Latency) Your turn (2 ALUs, 1 Mult, Mults take 2 cycles) a b cd e f g + + * * * + 1 2 3 5 4 6

74 Minimum-Resource, Latency-Constrained Note that if no resource constraints given, schedule determines number of required resources - Max # of each resource type used in a single cycle + + * a b cd + - e f g Cycle1 Cycle2 Cycle3 + Cycle4 * 3 ALUs 2 Mults

75 Minimum-Resource, Latency-Constrained Minimum-Resource Latency-Constrained Scheduling: - For all schedules that have latency no greater than the constraint, find the one that uses the fewest resources + + * a b cd + - e f g Cycle1 Cycle2 Cycle3 + Cycle4 * + + * a b cd + - e f g Cycle1 Cycle2 Cycle3 + Cycle4 * 3 ALUs, 2 Mult 2 ALUs, 1 Mult Latency Constraint <= 4

76 Minimum-Resource, Latency-Constrained List scheduling (Minimum resource version) Basic Idea - 1) Compute latest start times for each op using ALAP with specified latency constraint + Latest start times must include multicycle operations - 2) For each resource type + 3) Determine candidate nodes + 4) Compute slack for each candidate Slack = latest possible cycle - current cycle + 5) Schedule ops with 0 slack Update required number of resources (assume 1 of each to start with) + 6) Schedule ops that require no extra resources - 7) Repeat from 2) until all nodes scheduled
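The minimum-resource version can be sketched as follows (Python, single-cycle operations; slack is taken as latest possible cycle minus current cycle, so zero-slack ops are forced into the current cycle and may grow the resource count, while positive-slack ops are only placed on already-paid-for units; the example graph is illustrative):

```python
def min_resource_schedule(preds, optype, latency):
    succs = {n: [m for m, ps in preds.items() if n in ps] for n in preds}
    late = {}
    def alap(n):                          # 1) latest possible cycle for n
        if n not in late:
            late[n] = min((alap(s) for s in succs[n]),
                          default=latency + 1) - 1
        return late[n]
    for n in preds:
        alap(n)
    count = {t: 1 for t in set(optype.values())}   # assume 1 of each to start
    done, sched, cycle = set(), {}, 1
    while len(sched) < len(preds):
        used = {t: 0 for t in count}
        ready = [n for n in preds
                 if n not in sched and all(p in done for p in preds[n])]
        ready.sort(key=lambda n: late[n])          # zero-slack ops first
        for n in ready:
            t = optype[n]
            if late[n] == cycle:                   # slack 0: must go now
                used[t] += 1
                count[t] = max(count[t], used[t])  # may need an extra unit
                sched[n] = cycle
            elif used[t] < count[t]:               # a free unit: no extra cost
                used[t] += 1
                sched[n] = cycle
        done = set(sched)
        cycle += 1
    return sched, count

ops = {"a": "alu", "b": "alu", "c": "mul", "g": "alu",
       "d": "alu", "e": "mul", "f": "alu"}
deps = {"a": [], "b": [], "c": [], "g": [],
        "d": ["a", "b"], "e": ["c"], "f": ["d", "e"]}
sched, count = min_resource_schedule(deps, ops, 3)
```

Note how g (slack 2) is not scheduled in cycle 1, where it would force a third ALU; it slides into a free ALU in cycle 2 instead.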

77 Minimum-Resource, Latency-Constrained 1) Find ALAP schedule a b cd e f g * + + * * - j k - 1 2 3 4 5 6 7 Node LPC 1 2 1 3 1 4 3 5 2 6 2 7 3 Last Possible Cycle a b cd e f g * + + * * - j k - 1 2 3 4 5 6 7 Cycle1 Cycle2 Cycle3 Defines last possible cycle for each operation Latency Constraint = 3 cycles

78 Minimum-Resource, Latency-Constrained 2) For each resource type + 3) Determine candidate nodes C + 4) Compute slack for each candidate Slack = latest possible cycle - current cycle Node LPC 1 2 1 3 1 4 3 5 2 6 2 7 3 Slack Candidates = {1,2,3,4} Slack = 0 0 0 2 Cycle 1 a b cd e f g * + + * * - j k - 1 2 3 4 5 6 7 Initial Resources = 1 Mult, 1 ALU

79 Minimum-Resource, Latency-Constrained 5)Schedule ops with 0 slack - Update required number of resources 6) Schedule ops that require no extra resources Node LPC 1 2 1 3 1 4 3 5 2 6 2 7 3 Slack Candidates = {1,2,3,4} Cycle 0 1 2 X Resources = 1 Mult, 2 ALU 4 requires 1 more ALU - not scheduled Cycle 1 a b cd e f g * + + * * - j k - 1 2 3 4 5 6 7

80 Minimum-Resource, Latency-Constrained 2) For each resource type + 3) Determine candidate nodes C + 4) Compute slack for each candidate Slack = latest possible cycle - current cycle Node LPC 1 2 1 3 1 4 3 5 2 6 2 7 3 Slack Candidates = {4,5,6} Cycle 1 0 Resources = 1 Mult, 2 ALU Cycle 2 a b cd e f g * + + * * - j k - 1 2 3 4 5 6 7

81 Minimum-Resource, Latency-Constrained 5)Schedule ops with 0 slack - Update required number of resources 6) Schedule ops that require no extra resources Node LPC 1 2 1 3 1 4 3 5 2 6 2 7 3 Slack Candidates = {4,5,6} Cycle 1 1 2 0 2 Resources = 2 Mult, 2 ALU Cycle 2 Already 1 ALU - 4 can be scheduled a b cd e f g * + + * * - j k - 1 2 3 4 5 6 7

82 Minimum-Resource, Latency-Constrained 2) For each resource type + 3) Determine candidate nodes C + 4) Compute slack for each candidate Slack = latest possible cycle - current cycle Node LPC 1 2 1 3 1 4 3 5 2 6 2 7 3 Slack Candidates = {7} Cycle 1 2 0 Resources = 2 Mult, 2 ALU Cycle 3 a b cd e f g * + + * * - j k - 1 2 3 4 5 6 7

83 Minimum-Resource, Latency-Constrained Final Schedule Node LPC 1 2 1 3 1 4 3 5 2 6 2 7 3 Slack Cycle 1 2 3 a b cd e f g * + + * * - j k - 1 2 3 4 5 6 7 Cycle1 Cycle2 Cycle3 Required Resources = 2 Mult, 2 ALU

84 Other extensions Chaining - Multiple operations in a single cycle Pipelining - Input: DFG, data delivery rate - For fully pipelined circuit, must have one resource per operation (remember systolic arrays) a b cd + + + - e f / Multiple adds may be faster than 1 divide - perform adds in one cycle

85 Summary Scheduling assigns each operation in a DFG a start time - Done for each DFG in the CDFG Different Types - Minimum Latency + ASAP, ALAP - Latency-constrained + ASAP, ALAP - Minimum-latency, resource-constrained + Hu’s Algorithm + List Scheduling - Minimum-resource, latency-constrained + List Scheduling

86 Binding/Resource Sharing

87 Binding During scheduling, we determined: - When operations will execute - How many resources are needed We still need to decide which operations execute on which resources – binding - If multiple operations use the same resource, we need to decide how resources are shared - resource sharing.

88 Binding Map operations onto resources such that operations in same cycle do not use same resource * + + * * + - - 1 23 4 5 6 7 8 Cycle1 Cycle2 Cycle3 Cycle4 2 ALUs (+/-), 2 Multipliers Mult1 ALU1 ALU2Mult2

89 Binding Many possibilities - Bad binding may increase resources, require huge steering logic, reduce clock, etc. * + + * * + - - 1 23 4 5 6 7 8 Cycle1 Cycle2 Cycle3 Cycle4 2 ALUs (+/-), 2 Multipliers Mult1 ALU1 ALU2 Mult2

90 Binding Can’t do this - 1 resource can’t perform multiple ops simultaneously! * + + * * + - - 1 23 4 5 6 7 8 Cycle1 Cycle2 Cycle3 Cycle4 2 ALUs (+/-), 2 Multipliers

91 Binding How to automate? - More graph theory Compatibility Graph - Each node is an operation - Edges represent compatible operations + Compatible - if two ops can share a resource I.e. Ops that use same type of resource (ALU, etc.) and are scheduled to different cycles

92 Compatibility Graph * + + * * + - - 1 23 4 5 6 7 8 Cycle1 Cycle2 Cycle3 Cycle4 28 74 3 1 6 5 ALUs Mults 5 and 6 not compatible (same cycle) 2 and 3 not compatible (same cycle)

93 Compatibility Graph * + + * * + - - 1 23 4 5 6 7 8 Cycle1 Cycle2 Cycle3 Cycle4 28 74 3 1 6 5 ALUs Mults Note - Fully connected subgraphs can share a resource (all involved nodes are compatible)

96 Compatibility Graph Binding: Find minimum number of fully connected subgraphs that cover entire graph - Well-known problem: Clique partitioning (NP-complete) + Cliques = { {2,8,7,4},{3},{1,5},{6} } ALU1 executes 2,8,7,4 ALU2 executes 3 MULT1 executes 1,5 MULT2 executes 6 28 74 3 1 6 5
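Optimal clique partitioning is NP-complete, so real tools use heuristics; a simple greedy pass that grows compatible groups in node order happens to recover the slide's partition for this example (a Python sketch; the op table encodes each node's resource type and scheduled cycle):

```python
def bind(ops):
    """ops: {name: (resource_type, cycle)} -> {name: resource index}.
    Two ops are compatible if they use the same resource type
    and are scheduled to different cycles."""
    def compatible(n, group):
        return all(ops[n][0] == ops[m][0] and ops[n][1] != ops[m][1]
                   for m in group)
    groups = []                              # each group shares one resource
    for n in ops:
        for g in groups:
            if compatible(n, g):             # join the first compatible group
                g.append(n)
                break
        else:
            groups.append([n])               # open a new resource
    return {n: i for i, g in enumerate(groups) for n in g}

# The slide's schedule: mults 1, 5, 6 and ALU ops 2, 3, 4, 7, 8
ops = {1: ("mul", 1), 2: ("alu", 1), 3: ("alu", 1), 4: ("alu", 2),
       5: ("mul", 2), 6: ("mul", 2), 7: ("alu", 3), 8: ("alu", 4)}
binding = bind(ops)
```

The result is four resources, matching the cliques {2,8,7,4}, {3}, {1,5}, {6}: 2 ALUs and 2 multipliers.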

97 Compatibility Graph * + + * * + - - 1 23 4 5 6 7 8 Cycle1 Cycle2 Cycle3 Cycle4 28 74 3 1 6 5 ALUs Mults Final Binding:

98 Compatibility Graph * + + * * + - - 1 23 4 5 6 7 8 Cycle1 Cycle2 Cycle3 Cycle4 28 74 3 1 6 5 ALUs Mults Alternative Final Binding:

99 Translation to Datapath * + + * * + - - 1 23 4 5 6 7 8 Cycle1 Cycle2 Cycle3 Cycle4 Mult(1,5)Mult(6)ALU(2,7,8,4)ALU(3) Mux Reg Mux a b c h Reg abcdefghi de i g e f 1)Add resources and registers 2)Add mux for each input 3)Add input to left mux for each left input in DFG 4)Do same for right mux 5)If only 1 input, remove mux

100 Left Edge Algorithm Alternative to clique partitioning - Take scheduled DFG, rotate it 90 degrees 2 ALUs (+/-), 2 Multipliers a b f g * + + * * + - j k * 1 2 3 4 5 6 7 8 Cycle1 Cycle2 Cycle5 Cycle6 Cycle4 Cycle3 Cycle7 c d e

101 Left Edge Algorithm 2 ALUs (+/-), 2 Multipliers * + + * * + - * 1 2 3 4 5 6 7 8 Cycle1 Cycle2 Cycle5 Cycle6Cycle4 Cycle3 Cycle7 1)Initialize right_edge to 0 2)Find a node N whose left edge is >= right_edge 3)Bind N to a particular resource 4)Update right_edge to the right edge of N 5)Repeat from 2) for nodes using the same resource type until right_edge passes all nodes 6)Repeat from 1) until all nodes bound right_edge
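The sweep described above is the classic left-edge algorithm over operation lifetimes (a Python sketch; lifetimes are modeled as half-open [start, end) cycle intervals so that "left edge >= right_edge" means no overlap; the multiplier lifetimes below are illustrative, not the slide's exact schedule):

```python
def left_edge(intervals):
    """intervals: {name: (left, right)}, right exclusive.
    Returns name -> resource index."""
    remaining = sorted(intervals, key=lambda n: intervals[n][0])
    binding, res = {}, 0
    while remaining:                         # each outer pass packs 1 resource
        right_edge, leftover = 0, []
        for n in remaining:
            left, right = intervals[n]
            if left >= right_edge:           # starts after current occupant ends
                binding[n] = res
                right_edge = right           # update right_edge
            else:
                leftover.append(n)           # wait for the next resource
        remaining, res = leftover, res + 1
    return binding

# Two-cycle multiply lifetimes: m1 and m2 overlap, m3 and m4 do not
lifetimes = {"m1": (1, 3), "m2": (1, 3), "m3": (3, 5), "m4": (5, 7)}
packing = left_edge(lifetimes)
```

The first sweep packs m1, m3, m4 onto one multiplier; only the overlapping m2 needs a second.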

110 Extensions Algorithms presented so far find a valid binding - But, do not consider amount of steering logic required - Different bindings can require significantly different # of muxes One solution - Extend compatibility graph + Use weighted edges/nodes - cost function representing steering logic + Perform clique partitioning, finding the set of cliques that minimize weight

111 Binding Summary Binding maps operations onto physical resources - Determines sharing among resources Binding may greatly affect steering logic Trivial for fully-pipelined circuits - 1 resource per operation Straightforward translation from bound DFG to datapath

112 Summary

113 Main Steps Front-end (lexing/parsing) converts code into intermediate representation - We looked at CDFG Scheduling assigns a start time for each operation in DFG - CFG node start times defined by control dependencies - Resource allocation determined by schedule Binding maps scheduled operations onto physical resources - Determines how resources are shared Big picture: - Scheduled/Bound DFG can be translated into a datapath - CFG can be translated to a controller - => High-level synthesis can create a custom circuit for any CDFG!

114 Limitations Task-level parallelism - Parallelism in CDFG limited to individual control states + Can’t have multiple states executing concurrently - Potential solution: use model other than CDFG + Kahn Process Networks Nodes represent parallel processes/tasks Edges represent communication between processes + High-level synthesis can create a controller+datapath for each process Must also consider communication buffers - Challenge: + Most high-level code does not have explicit parallelism Difficult/impossible to extract task-level parallelism from code

115 Limitations Coding practices limit circuit performance - Very often, languages contain constructs not appropriate for circuit implementation + Recursion, pointers, virtual functions, etc. Potential solution: use specialized languages - Remove problematic constructs, add task-level parallelism Challenge: - Difficult to learn new languages - Many designers resist changes to tool flow

116 Limitations Expert designers can achieve better circuits - High-level synthesis has to work with specification in code + Can be difficult to automatically create efficient pipeline + May require dozens of optimizations applied in a particular order - Expert designer can transform algorithm + Synthesis can transform code, but can’t change algorithm Potential Solution: ??? - New language? - New methodology? - New tools?

