High-Level Synthesis Creating Custom Circuits from High-Level Code Hao Zheng Comp Sci & Eng University of South Florida.

Presentation on theme: "High-Level Synthesis Creating Custom Circuits from High-Level Code Hao Zheng Comp Sci & Eng University of South Florida."— Presentation transcript:

1 High-Level Synthesis Creating Custom Circuits from High-Level Code Hao Zheng Comp Sci & Eng University of South Florida

2 Existing FPGA Tool Flow Register-transfer (RT) synthesis - Specify RT structure (muxes, registers, etc.) - Allows precise specification - But time-consuming, difficult, and error-prone Synthesizable HDL Netlist Bitfile Processor FPGA RT Synthesis Physical Design Technology Mapping Placement Routing

3 Future FPGA Tool Flow HDL Netlist Bitfile Processor FPGA RT Synthesis Physical Design Technology Mapping Placement Routing High-level Synthesis C/C++, Java, etc. Synthesizable HDL

4 High-level Synthesis - Benefits Ratio of C to VHDL developers (10000:1 ?) Easier to specify complex designs Technology/architecture-independent designs Manual HW design potentially slower -Similar to assembly code era -Programmers could always beat compiler -But, no longer the case Ease of HW/SW partitioning -enhance overall system efficiency More efficient verification and validation -Easier to verify and validate high-level code

5 High-level Synthesis More challenging than SW compilation -Compilation maps behavior into assembly instructions -Architecture is known to compiler HLS creates a custom architecture to execute specified behavior -Huge hardware exploration space -Best solution may include microprocessors -Ideally, should handle any high-level code + But, not all code appropriate for hardware

6 High-level Synthesis: An Example First, consider how to manually convert high-level code into circuit Steps 1)Build FSM for controller 2)Build datapath based on FSM acc = 0; for (i=0; i < 128; i++) acc += a[i];

7 A Manual Example Build an FSM (controller) - Decompose code into states acc = 0; for (i=0; i < 128; i++) acc += a[i]; acc=0, i = 0 load a[i] acc += a[i] i++ Done i < 128 i >= 128

8 A Manual Example Build a datapath - Allocate resources for each state acc i acc=0, i = 0 load a[i] acc += a[i] i++ Done < addr a[i] + + + 1 128 1 acc = 0; for (i=0; i < 128; i++) acc += a[i]; i < 128 i >= 128

9 A Manual Example Build a datapath - Determine register inputs acc i < addr a[i] + + + 1 128 2x1 0 0 1 &a In from memory acc=0, i = 0 load a[i] acc += a[i] i++ Done i < 128 i >= 128 acc = 0; for (i=0; i < 128; i++) acc += a[i];

10 A Manual Example Build a datapath - Add outputs acc i < addr a[i] + + + 1 128 2x1 0 0 1 &a In from memory Memory address acc acc = 0; for (i=0; i < 128; i++) acc += a[i]; acc=0, i = 0 load a[i] acc += a[i] i++ Done i < 128 i >= 128

11 A Manual Example Build a datapath - Add control signals acc i < addr a[i] + + + 1 128 2x1 0 0 1 &a In from memory Memory address acc acc = 0; for (i=0; i < 128; i++) acc += a[i]; acc=0, i = 0 load a[i] acc += a[i] i++ Done i < 128 i >= 128

12 A Manual Example Combine controller + datapath acc i < addr a[i] + + + 1 128 2x1 0 0 1 &a In from memory Memory address acc Done Memory Read Controller acc = 0; for (i=0; i < 128; i++) acc += a[i]; Start

13 A Manual Example - Optimization Alternatives -Use one adder (plus muxes) acc i < addr a[i] + 128 2x1 0 0 1 &a In from memory Memory address acc MUX

14 A Manual Example – Summary Comparison with high-level synthesis - Determining when to perform each operation => Scheduling - Allocating resource for each operation => Resource allocation - Mapping operations onto resources => Binding

15 Another Example: Try it at home Your turn x=0; for (i=0; i < 100; i++) { if (a[i] > 0) x ++; else x --; a[i] = x; } //output x 1)Build FSM (do not perform if conversion) 2)Build datapath based on FSM

16 High-Level Synthesis Input: high-level code - could be C, C++, Java, Perl, Python, SystemC, ImpulseC, etc. Output: custom circuit - usually an RT-level VHDL/Verilog description, but could be as low-level as a bitfile

17 High-Level Synthesis – Overview High-Level Synthesis acc = 0; for (i=0; i < 128; i++) acc += a[i]; acc i < addr a[i] + + + 1 128 2x1 0 0 1 &a In from memory Memory address acc Done Memory Read Controller

18 Main Steps Syntactic Analysis Optimization Scheduling/Resource Allocation Binding/Resource Sharing High-level Code Intermediate Representation Cycle accurate RTL code Converts code to intermediate representation - allows all following steps to use language independent format. Determines when each operation will execute, and resources used Maps operations onto physical resources Front-end Back-end

19 Syntactic Analysis Definition: Analysis of code to verify syntactic correctness -Converts code into intermediate representation Steps: similar to SW compilation 1)Lexical analysis (Lexing) 2)Parsing 3)Code generation – intermediate representation Syntactic Analysis High-level Code Intermediate Representation Lexical Analysis Parsing

20 Intermediate Representation Parser converts tokens to intermediate representation -Usually, an abstract syntax tree x = 0; if (y < z) x = 1; d = 6; Assign if cond assign x 0 x 1 d 6 y < z

21 Intermediate Representation Why use intermediate representation? - Easier to analyze/optimize than source code - Theoretically can be used for all languages + Makes synthesis back end language independent Syntactic Analysis C Code Intermediate Representation Syntactic Analysis Java Syntactic Analysis Perl Back End Scheduling, resource allocation, binding, independent of source language - sometimes optimizations too

22 Intermediate Representation Different Types - Abstract Syntax Tree - Control/Data Flow Graph (CDFG) - Sequencing Graph +... We will focus on CDFG - Combines control flow graph (CFG) and data flow graph (DFG)

23 Control Flow Graphs (CFGs) Represents control flow dependencies of basic blocks A basic block is a section of code that always executes from beginning to end + I.e. no jumps into or out of block acc = 0; for (i=0; i < 128; i++) acc += a[i]; i < 128? acc=0, i = 0 acc += a[i] i ++ Done yes no
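The leader-finding rule behind basic blocks can be sketched in a few lines (Python for brevity; the toy three-address instruction format and the flattened accumulator loop are illustrative assumptions, not the output of any real front end):

```python
# A minimal sketch of basic-block detection over a toy instruction list,
# where each entry is (op, jump_target_or_None).

def find_leaders(instrs):
    """Indices that start a basic block: the first instruction,
    every jump target, and every instruction after a jump/branch."""
    leaders = {0}
    for i, (op, target) in enumerate(instrs):
        if op in ("jump", "branch"):
            leaders.add(target)          # jump target starts a block
            if i + 1 < len(instrs):
                leaders.add(i + 1)       # fall-through after a jump
    return leaders

def basic_blocks(instrs):
    """Split the instruction list into basic blocks at the leaders."""
    bounds = sorted(find_leaders(instrs)) + [len(instrs)]
    return [list(range(bounds[i], bounds[i + 1]))
            for i in range(len(bounds) - 1)]

# The accumulator loop from the slide, flattened:
# 0: acc=0,i=0  1: branch (i>=128 -> 5)  2: acc+=a[i]  3: i++  4: jump 1  5: done
prog = [("assign", None), ("branch", 5), ("assign", None),
        ("assign", None), ("jump", 1), ("assign", None)]
```

The four recovered blocks match the four CFG nodes on the slide: initialization, loop test, loop body, and done.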

24 Control Flow Graphs: Your Turn Find a CFG for the following code. i = 0; while (i < 10) { if (x < 5) y = 2; else if (z < 10) y = 6; i++; }

25 Data Flow Graphs Represents data dependencies between operations x = a+b; y = c*d; z = x - y; + * - a b c d x z y
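A DFG for this exact fragment can be represented and evaluated directly; any order that respects the data dependencies gives the same result (a Python sketch; the dict-based node representation is an illustrative choice, not a standard HLS data structure):

```python
import operator

# node -> (operation, operand nodes), for x = a+b; y = c*d; z = x - y;
dfg = {
    "x": (operator.add, ("a", "b")),
    "y": (operator.mul, ("c", "d")),
    "z": (operator.sub, ("x", "y")),
}

def evaluate(dfg, inputs):
    """Evaluate each DFG node once its operands are available."""
    values = dict(inputs)
    pending = dict(dfg)
    while pending:
        for node, (op, args) in list(pending.items()):
            if all(a in values for a in args):   # data dependencies met
                values[node] = op(*(values[a] for a in args))
                del pending[node]
    return values

v = evaluate(dfg, {"a": 1, "b": 2, "c": 3, "d": 4})
```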

26 Control/Data Flow Graph Combines CFG and DFG - Maintains DFG for each node of CFG acc = 0; for (i=0; i < 128; i++) acc += a[i]; if (i < 128) acc=0; i=0; acc += a[i] i ++ Done acc 0 i 0 + a[i] acc + i 1 i

27 Optimization

28 Synthesis Optimizations After creating CDFG, high-level synthesis optimizes it with the following goals -Reduce area -Improve latency -Increase parallelism -Reduce power/energy 2 types of optimizations -Data flow optimizations -Control flow optimizations

29 Tree-height reduction -Generally made possible from commutativity, associativity, and distributivity Data Flow Optimizations + + + + + + a b cd a b cd + + * a b cd + * + a b cd x = a + b + c + d
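The benefit is easy to quantify on x = a + b + c + d: the left-leaning parse needs three dependent additions, the balanced tree only two (a Python sketch with tuples standing in for expression trees):

```python
def depth(tree):
    """Height of an expression tree; leaves (variable names) have depth 0."""
    if isinstance(tree, str):
        return 0
    _, lhs, rhs = tree
    return 1 + max(depth(lhs), depth(rhs))

chain    = ("+", ("+", ("+", "a", "b"), "c"), "d")   # ((a+b)+c)+d, as written
balanced = ("+", ("+", "a", "b"), ("+", "c", "d"))   # (a+b)+(c+d), after reduction
```

With one adder per level, the balanced form finishes a cycle earlier.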

30 Operator Strength Reduction -Replacing an expensive (“strong”) operation with a faster one -Common example: replacing multiply/divide with shift Data Flow Optimizations b[i] = a[i] * 8; b[i] = a[i] << 3; a = b * 5; c = b << 2; a = b + c; 1 multiplication 0 multiplications a = b * 13; c = b << 2; d = b << 3; a = c + d + b;
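The slide's rewrites can be checked directly; each shift-and-add form equals the original multiply for non-negative values (a Python sketch):

```python
def times8(a):
    return a << 3                    # a * 8

def times5(b):
    return (b << 2) + b              # b * 5 = 4b + b

def times13(b):
    return (b << 2) + (b << 3) + b   # b * 13 = 4b + 8b + b
```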

31 Constant propagation -Statically evaluate expressions with constants Data Flow Optimizations x = 0; y = x * 15; z = y + 10; x = 0; y = 0; z = 10;
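A minimal constant-propagation pass for straight-line code like the slide's can be sketched as below (the three-tuple statement format is an illustrative toy, and only + and * are handled):

```python
def fold_constants(stmts):
    """stmts: list of (target, op, operands); operands are ints or names."""
    env = {}      # names whose values are statically known
    out = []
    for target, op, args in stmts:
        vals = [env.get(a, a) if isinstance(a, str) else a for a in args]
        if all(isinstance(v, int) for v in vals):
            result = vals[0] * vals[1] if op == "*" else vals[0] + vals[1]
            env[target] = result
            out.append((target, "const", [result]))   # folded at compile time
        else:
            out.append((target, op, vals))
    return out

prog = [("x", "+", [0, 0]),        # x = 0
        ("y", "*", ["x", 15]),     # y = x * 15
        ("z", "+", ["y", 10])]     # z = y + 10
folded = fold_constants(prog)
```

The pass reproduces the slide's result: x = 0, y = 0, z = 10, with no multiplier needed.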

32 Function Specialization -Create specialized code for common inputs + Treat common inputs as constants + If inputs not known statically, must include if statement for each call to specialized function Data Flow Optimizations int f (int x) { y = x * 15; return y + 10; } for (I=0; I < 1000; I++) f(0); … } int f_opt () { return 10; } for (I=0; I < 1000; I++) f_opt(); … } Treat frequent input as a constant int f (int x) { y = x * 15; return y + 10; }

33 Data Flow Optimizations Common sub-expression elimination -If expression appears more than once, repetitions can be replaced a = x + y;...... b = c * 25 + x + y; a = x + y;...... b = c * 25 + a; x + y already determined

34 Data Flow Optimizations Dead code elimination -Remove code that is never executed + May seem like stupid code, but often comes from constant propagation or function specialization int f (int x) { if (x > 0 ) a = b * 15; else a = b / 4; return a; } int f_opt () { a = b * 15; return a; } Specialized version for x > 0 does not need else branch - “dead code”

35 Data Flow Optimizations Code motion (hoisting/sinking) -Avoid repeated computation for (I=0; I < 100; I++) { z = x + y; b[i] = a[i] + z ; } z = x + y; for (I=0; I < 100; I++) { b[i] = a[i] + z ; }

36 Control Flow Optimizations Loop Unrolling -Replicate body of loop + May increase parallelism for (i=0; i < 128; i++) a[i] = b[i] + c[i]; for (i=0; i < 128; i+=2) { a[i] = b[i] + c[i]; a[i+1] = b[i+1] + c[i+1] }
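Unrolling by two can be checked directly (a Python sketch of the slide's loops; since the trip count 128 is even, no cleanup iteration is needed):

```python
def rolled(b, c):
    a = [0] * 128
    for i in range(128):
        a[i] = b[i] + c[i]
    return a

def unrolled(b, c):
    a = [0] * 128
    for i in range(0, 128, 2):       # two copies of the body per iteration
        a[i] = b[i] + c[i]
        a[i + 1] = b[i + 1] + c[i + 1]
    return a

b = list(range(128))
c = list(range(128, 0, -1))
```

In hardware, the two body copies have no dependence on each other, so with two adders they can execute in the same cycle.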

37 Control Flow Optimizations Function inlining – replace function call with body of function -Common for both SW and HW -SW: Eliminates function call instructions -HW: Eliminates unnecessary control states for (i=0; i < 128; i++) a[i] = f( b[i], c[i] );.. int f (int a, int b) { return a + b * 15; } for (i=0; i < 128; i++) a[i] = b[i] + c[i] * 15;

38 Conditional Expansion – replace if with logic expression -Execute if/else bodies in parallel Control Flow Optimizations y = ab if (a) x = b+d else x = bd y = ab x = a(b+d) + a‘bd y = ab x = y + d(a+b) [DeMicheli] Can be further optimized to:
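Because the variables are Boolean, the rewrite can be verified exhaustively; all three forms below agree on every input (a Python sketch, with a' modeled as 1 - a):

```python
def original(a, b, d):
    return (b | d) if a else (b & d)             # if (a) x = b+d else x = bd

def expanded(a, b, d):
    return (a & (b | d)) | ((1 - a) & b & d)     # x = a(b+d) + a'bd

def optimized(a, b, d):
    y = a & b
    return y | (d & (a | b))                     # x = y + d(a+b)

all_match = all(original(a, b, d) == expanded(a, b, d) == optimized(a, b, d)
                for a in (0, 1) for b in (0, 1) for d in (0, 1))
```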

39 Example Optimize this x = 0; y = a + b; if (x < 15) z = a + b - c; else z = x + 12; output = z * 12;

40 Scheduling/Resource Allocation

41 Scheduling Scheduling assigns a start time to each operation in DFG -Start times must not violate dependencies in DFG -Start times must meet performance constraints +Alternatively, resource constraints Performed on the DFG of each CFG node - Cannot execute multiple CFG nodes in parallel

42 Examples + + + + + + a b cd a b cd + + + a b cd Cycle1 Cycle2 Cycle3 Cycle1 Cycle2 Cycle1 Cycle2

43 Scheduling Problems Several types of scheduling problems - Usually some combination of performance and resource constraints Problems: - Unconstrained + Not very useful, every schedule is valid - Minimum latency - Latency constrained - Minimum-latency, resource-constrained + i.e. find the schedule with the shortest latency that uses no more than a specified # of resources + NP-Complete - Minimum-resource, latency-constrained + i.e. find the schedule that meets the latency constraint (which may be anything), and uses the minimum # of resources + NP-Complete

44 Minimum Latency Scheduling ASAP (as soon as possible) algorithm - Find a candidate node + Candidate is a node whose predecessors have been scheduled and completed (or has no predecessors) - Schedule node one cycle later than max cycle of its predecessors - Repeat until all nodes scheduled + + * a b cd * - < e f g h Cycle1 Cycle2 Cycle3 + Cycle4 Minimum possible latency - 4 cycles
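The ASAP algorithm is only a few lines once the DFG is a predecessor map (a Python sketch assuming single-cycle operations, so "scheduled" and "completed" coincide; the node names are illustrative):

```python
def asap(preds):
    """Schedule each node one cycle after its latest predecessor."""
    start = {}
    while len(start) < len(preds):
        for node, ps in preds.items():
            if node not in start and all(p in start for p in ps):
                # candidate: all predecessors scheduled (or none)
                start[node] = 1 + max((start[p] for p in ps), default=0)
    return start

# node -> predecessors; a 4-cycle critical path a1 -> a2 -> a3 -> c1
dfg = {"a1": [], "a2": ["a1"], "m1": [], "a3": ["a2", "m1"], "c1": ["a3"]}
schedule = asap(dfg)
```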

45 Minimum Latency Scheduling ALAP (as late as possible) algorithm - Run ASAP, get minimum latency L - Find a candidate + Candidate is node whose successors are scheduled (or has none) - Schedule node one cycle before min cycle of successor + Nodes with no successors scheduled to cycle L - Repeat until all nodes scheduled + + * a b cd * - < e f g h Cycle1 Cycle2 Cycle3 + Cycle4 L = 4 cycles

46 Minimum Latency Scheduling ALAP - Has to run ASAP first, seems pointless - But, many heuristics need the mobility/slack of each operation + ASAP gives the earliest possible time for an operation + ALAP gives the latest possible time for an operation - Slack = difference between earliest and latest possible schedule + Slack = 0 implies operation has to be done in the current scheduled cycle + The larger the slack, the more options a heuristic has to schedule the operation
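ALAP and slack follow the same pattern: run ASAP to get the minimum latency L, schedule each node one cycle before its earliest successor, and subtract (a Python sketch, single-cycle operations assumed; node names are illustrative):

```python
def asap(preds):
    start = {}
    while len(start) < len(preds):
        for n, ps in preds.items():
            if n not in start and all(p in start for p in ps):
                start[n] = 1 + max((start[p] for p in ps), default=0)
    return start

def alap(preds, latency):
    """Schedule each node one cycle before its earliest successor;
    nodes with no successors go to the last cycle."""
    succs = {n: [m for m, ps in preds.items() if n in ps] for n in preds}
    start = {}
    while len(start) < len(preds):
        for n in preds:
            if n not in start and all(s in start for s in succs[n]):
                start[n] = min((start[s] for s in succs[n]),
                               default=latency + 1) - 1
    return start

def slack(preds):
    early = asap(preds)
    late = alap(preds, max(early.values()))    # L from ASAP
    return {n: late[n] - early[n] for n in preds}

dfg = {"a1": [], "a2": ["a1"], "m1": [], "a3": ["a2", "m1"], "c1": ["a3"]}
mobility = slack(dfg)
```

Only m1 is off the critical path, so only it has nonzero slack: it may sit in cycle 1 or 2.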

47 Latency-Constrained Scheduling Instead of finding the minimum latency, find a schedule with latency no greater than L Solutions: - Use ASAP, verify that the minimum latency is no greater than L - Use ALAP starting with cycle L instead of minimum latency (don’t need ASAP)

48 Scheduling with Resource Constraints Schedule must use no more than the specified number of resources + * a b cd + - e f g Cycle1 Cycle3 Cycle4 + Cycle5 + * Cycle2 Constraints: 1 ALU (+/-), 1 Multiplier

49 Scheduling with Resource Constraints Schedule must use no more than the specified number of resources + + * a b cd + - e f g Cycle1 Cycle2 Cycle3 + Cycle4 * Constraints: 2 ALU (+/-), 1 Multiplier

50 Minimum-Latency, Resource-Constrained Scheduling Definition: Given resource constraints, find the schedule that has the minimum latency - Example: a b c d e f g + + Constraints: 1 ALU (+/-), 1 Multiplier + - * +

53 Minimum-Latency, Resource-Constrained Scheduling Hu’s Algorithm - Assumes one type of resource Basic Idea - Input: graph, # of resources r - 1) Label each node by max distance from output + i.e. Use path length as priority - 2) Determine C, the set of scheduling candidates + Candidate if either no predecessors, or predecessors scheduled - 3) From C, schedule up to r nodes to current cycle, using label as priority - 4) Increment current cycle, repeat from 2) until all nodes scheduled
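The steps above can be sketched directly (Python; one resource type with r units and single-cycle operations assumed; the small example graph is illustrative, not the slide's):

```python
def hu_schedule(preds, r):
    succs = {n: [m for m, ps in preds.items() if n in ps] for n in preds}
    label = {}
    def dist(n):                      # 1) longest path from n to an output
        if n not in label:
            label[n] = 1 + max((dist(s) for s in succs[n]), default=0)
        return label[n]
    for n in preds:
        dist(n)
    done, sched, cycle = set(), {}, 1
    while len(sched) < len(preds):
        # 2) candidates: unscheduled, all predecessors done in earlier cycles
        ready = [n for n in preds
                 if n not in done and all(p in done for p in preds[n])]
        ready.sort(key=lambda n: -label[n])   # 3) longest path first
        for n in ready[:r]:                   #    at most r per cycle
            sched[n] = cycle
        done |= set(ready[:r])
        cycle += 1                            # 4) next cycle
    return sched

dfg = {"a": [], "b": [], "c": [], "d": ["a"], "e": ["b", "c"], "f": ["d", "e"]}
sched = hu_schedule(dfg, 2)
```

With r = 2, node c loses the cycle-1 tie and slips to cycle 2, delaying e and f by one cycle each.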

54 Minimum-Latency, Resource-Constrained Scheduling Hu’s Algorithm - Example + + * a b cd + - e f g + * j k - r = 3

55 Minimum-Latency, Resource-Constrained Scheduling Hu’s Algorithm - Step 1 - Label each node by max distance from output + i.e. use path length as priority a b cd e f g 4 4 3 2 3 2 1 j k 1 r = 3

56 Minimum-Latency, Resource-Constrained Scheduling Hu’s Algorithm - Step 2 - Determine C, the set of scheduling candidates a b cd e f g 4 4 3 2 3 2 1 j k 1 C = r = 3 Cycle 1

57 Minimum-Latency, Resource-Constrained Scheduling Hu’s Algorithm - Step 3 - From C, schedule up to r nodes to current cycle, using label as priority a b cd e f g 4 4 3 2 3 2 1 j k 1 r = 3 Not scheduled due to lower priority Cycle1

58 Minimum-Latency, Resource-Constrained Scheduling Hu’s Algorithm - Step 2 a b cd e f g 4 4 3 2 3 2 1 j k 1 r = 3 C = Cycle1 Cycle 2

59 Minimum-Latency, Resource-Constrained Scheduling Hu’s Algorithm - Step 3 a b cd e f g 4 4 3 2 3 2 1 j k 1 r = 3 Cycle1 Cycle2

60 Minimum-Latency, Resource-Constrained Scheduling Hu’s Algorithm - Skipping to finish a b cd e f g 4 4 3 2 3 2 1 j k 1 r = 3 Cycle1 Cycle2 Cycle3 Cycle4

61 Minimum-Latency, Resource-Constrained Scheduling Hu’s algorithm solves a simplified problem Common Extensions: - Multiple resource types - Multi-cycle operations a b cd + - / * Cycle1 Cycle2

62 Minimum-Latency, Resource-Constrained Scheduling List Scheduling - (minimum latency, resource-constrained version) - Extension for multiple resource types - Basic Idea - Hu’s algorithm for each resource type + Input: graph, set of constraints R for each resource type + 1) Label nodes based on max distance to output + 2) For each resource type t 3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors) 4) Schedule up to R t operations from C based on priority, to current cycle R t is the constraint on resource type t + 5) Increment cycle, repeat from 2) until all nodes scheduled
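List scheduling is then Hu's idea run once per resource type each cycle (a Python sketch, single-cycle operations assumed; the small mixed ALU/multiplier example is illustrative):

```python
def list_schedule(preds, optype, R):
    """preds: node -> predecessors; optype: node -> resource type;
    R: resource type -> number of units available."""
    succs = {n: [m for m, ps in preds.items() if n in ps] for n in preds}
    label = {}
    def dist(n):                      # 1) longest path to an output
        if n not in label:
            label[n] = 1 + max((dist(s) for s in succs[n]), default=0)
        return label[n]
    for n in preds:
        dist(n)
    done, sched, cycle = set(), {}, 1
    while len(sched) < len(preds):
        for t, limit in R.items():    # 2) Hu's algorithm per resource type
            ready = [n for n in preds
                     if n not in sched and optype[n] == t
                     and all(p in done for p in preds[n])]
            ready.sort(key=lambda n: -label[n])
            for n in ready[:limit]:   # 4) at most R_t ops of type t
                sched[n] = cycle
        done = set(sched)             # this cycle's ops complete now
        cycle += 1                    # 5) next cycle
    return sched

preds = {"m1": [], "a1": [], "a2": [],
         "m2": ["m1"], "a3": ["a1", "a2"], "a4": ["m2", "a3"]}
optype = {"m1": "mul", "m2": "mul",
          "a1": "alu", "a2": "alu", "a3": "alu", "a4": "alu"}
sched = list_schedule(preds, optype, {"mul": 1, "alu": 2})
```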

63 Minimum-Latency, Resource-Constrained Scheduling List scheduling - minimum latency - Step 1 - Label nodes based on max distance to output (not shown, so you can see operations) - *nodes given IDs for illustration purposes a b cd e f g * + + * * + - j k - 1 2 3 4 5 6 7 8 2 ALUs (+/-), 2 Multipliers

64 Minimum-Latency, Resource-Constrained Scheduling List scheduling - minimum latency + For each resource type t 3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors) 4) Schedule up to R t operations from C based on priority, to current cycle R t is the constraint on resource type t a b cd e f g * + + * * + - j k - 1 2 3 4 5 6 7 8 Mult ALU Cycle 1 2,3,4 1 2 ALUs (+/-), 2 Multipliers Candidates

65 Minimum-Latency, Resource-Constrained Scheduling List scheduling - minimum latency + For each resource type t 3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors) 4) Schedule up to R t operations from C based on priority, to current cycle R t is the constraint on resource type t a b cd e f g * + + * * + - j k - 1 2 3 4 5 6 7 8 Mult ALU Cycle 1 2,3,4 1 2 ALUs (+/-), 2 Multipliers Candidates Cycle1 Candidate, but not scheduled due to low priority

66 Minimum-Latency, Resource-Constrained Scheduling List scheduling - minimum latency + For each resource type t 3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors) 4) Schedule up to R t operations from C based on priority, to current cycle R t is the constraint on resource type t a b cd e f g * + + * * + - j k - 1 2 3 4 5 6 7 8 Mult ALU Cycle 1 2,3,4 1 5,6 4 2 2 ALUs (+/-), 2 Multipliers Candidates Cycle1

67 Minimum-Latency, Resource-Constrained Scheduling List scheduling - minimum latency + For each resource type t 3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors) 4) Schedule up to R t operations from C based on priority, to current cycle R t is the constraint on resource type t a b cd e f g * + + * * + - j k - 1 2 3 4 5 6 7 8 Mult ALU Cycle 1 2,3,4 1 5,6 4 2 2 ALUs (+/-), 2 Multipliers Candidates Cycle1 Cycle2

68 Minimum-Latency, Resource-Constrained Scheduling List scheduling - minimum latency + For each resource type t 3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors) 4) Schedule up to R t operations from C based on priority, to current cycle R t is the constraint on resource type t a b cd e f g * + + * * + - j k - 1 2 3 4 5 6 7 8 Mult ALU Cycle 1 2,3,4 1 5,6 4 2 7 3 2 ALUs (+/-), 2 Multipliers Candidates Cycle1 Cycle2

69 Minimum-Latency, Resource-Constrained Scheduling List scheduling - (minimum latency) - Final schedule - Note - ASAP would require more resources + ALAP wouldn’t here, but in general it would a b cd e f g * + + * * + - j k - 1 2 3 4 5 6 7 8 Cycle1 Cycle2 Cycle3 Cycle4 2 ALUs (+/-), 2 Multipliers

70 Minimum-Latency, Resource-Constrained Scheduling Extension for multicycle operations - Same idea (differences shown in red) - Input: graph, set of constraints R for each resource type - 1) Label nodes based on max cycle latency to output - 2) For each resource type t + 3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled and completed predecessors) + 4) Schedule up to (R t - n t ) operations from C based on priority, one cycle after predecessor R t is the constraint on resource type t n t is the number of resource t in use from previous cycles - Repeat from 2) until all nodes scheduled

71 Minimum-Latency, Resource-Constrained Scheduling Example: a b cd e f g * + + * * + - j k * 1 2 3 4 5 6 7 8 Cycle1 Cycle2 Cycle5 Cycle6 2 ALUs (+/-), 2 Multipliers Cycle4 Cycle3 Cycle7

72 List Scheduling (Min Latency) Your turn (2 ALUs, 1 Mult) Steps (will be on test) - 1) Label nodes with priority - 2) Update candidate list for each cycle - 3) Redraw graph to show schedule + * * - + + 1 2 3 7 4 9 - + + * 5 6 8 - 10 11

73 List Scheduling (Min Latency) Your turn (2 ALUs, 1 Mult, Mults take 2 cycles) a b cd e f g + + * * * + 1 2 3 5 4 6

74 Minimum-Resource, Latency-Constrained Note that if no resource constraints given, schedule determines number of required resources - Max # of each resource type used in a single cycle + + * a b cd + - e f g Cycle1 Cycle2 Cycle3 + Cycle4 * 3 ALUs 2 Mults

75 Minimum-Resource, Latency-Constrained Minimum-Resource Latency-Constrained Scheduling: - For all schedules that have latency no greater than the constraint, find the one that uses the fewest resources + + * a b cd + - e f g Cycle1 Cycle2 Cycle3 + Cycle4 * + + * a b cd + - e f g Cycle1 Cycle2 Cycle3 + Cycle4 * 3 ALUs, 2 Mult 2 ALUs, 1 Mult Latency Constraint <= 4

76 Minimum-Resource, Latency-Constrained List scheduling (Minimum resource version) Basic Idea - 1) Compute latest start times for each op using ALAP with specified latency constraint + Latest start times must include multicycle operations - 2) For each resource type + 3) Determine candidate nodes + 4) Compute slack for each candidate Slack = latest possible cycle - current cycle + 5) Schedule ops with 0 slack Update required number of resources (assume 1 of each to start with) + 6) Schedule ops that require no extra resources - 7) Repeat from 2) until all nodes scheduled
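The minimum-resource version can be sketched as follows (Python, single-cycle operations; slack is taken as latest possible cycle minus current cycle, so zero-slack ops are forced into the current cycle and may grow the resource count, while positive-slack ops are only placed on already-paid-for units; the example graph is illustrative):

```python
def min_resource_schedule(preds, optype, latency):
    succs = {n: [m for m, ps in preds.items() if n in ps] for n in preds}
    late = {}
    def alap(n):                          # 1) latest possible cycle for n
        if n not in late:
            late[n] = min((alap(s) for s in succs[n]),
                          default=latency + 1) - 1
        return late[n]
    for n in preds:
        alap(n)
    count = {t: 1 for t in set(optype.values())}   # assume 1 of each to start
    done, sched, cycle = set(), {}, 1
    while len(sched) < len(preds):
        used = {t: 0 for t in count}
        ready = [n for n in preds
                 if n not in sched and all(p in done for p in preds[n])]
        ready.sort(key=lambda n: late[n])          # zero-slack ops first
        for n in ready:
            t = optype[n]
            if late[n] == cycle:                   # slack 0: must go now
                used[t] += 1
                count[t] = max(count[t], used[t])  # may need an extra unit
                sched[n] = cycle
            elif used[t] < count[t]:               # a free unit: no extra cost
                used[t] += 1
                sched[n] = cycle
        done = set(sched)
        cycle += 1
    return sched, count

ops = {"a": "alu", "b": "alu", "c": "mul", "g": "alu",
       "d": "alu", "e": "mul", "f": "alu"}
deps = {"a": [], "b": [], "c": [], "g": [],
        "d": ["a", "b"], "e": ["c"], "f": ["d", "e"]}
sched, count = min_resource_schedule(deps, ops, 3)
```

Note how g (slack 2) is not scheduled in cycle 1, where it would force a third ALU; it slides into a free ALU in cycle 2 instead.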

77 Minimum-Resource, Latency-Constrained 1) Find ALAP schedule a b cd e f g * + + * * - j k - 1 2 3 4 5 6 7 Node LPC 1 2 1 3 1 4 3 5 2 6 2 7 3 Last Possible Cycle a b cd e f g * + + * * - j k - 1 2 3 4 5 6 7 Cycle1 Cycle2 Cycle3 Defines last possible cycle for each operation Latency Constraint = 3 cycles

78 Minimum-Resource, Latency-Constrained 2) For each resource type + 3) Determine candidate nodes C + 4) Compute slack for each candidate Slack = latest possible cycle - current cycle Node LPC 1 2 1 3 1 4 3 5 2 6 2 7 3 Slack Candidates = {1,2,3,4} Slack = 0 0 0 2 Cycle 1 a b cd e f g * + + * * - j k - 1 2 3 4 5 6 7 Initial Resources = 1 Mult, 1 ALU

79 Minimum-Resource, Latency-Constrained 5)Schedule ops with 0 slack - Update required number of resources 6) Schedule ops that require no extra resources Node LPC 1 2 1 3 1 4 3 5 2 6 2 7 3 Slack Candidates = {1,2,3,4} Cycle 0 1 2 X Resources = 1 Mult, 2 ALU 4 requires 1 more ALU - not scheduled Cycle 1 a b cd e f g * + + * * - j k - 1 2 3 4 5 6 7

80 Minimum-Resource, Latency-Constrained 2) For each resource type + 3) Determine candidate nodes C + 4) Compute slack for each candidate Slack = latest possible cycle - current cycle Node LPC 1 2 1 3 1 4 3 5 2 6 2 7 3 Slack Candidates = {4,5,6} Cycle 1 0 Resources = 1 Mult, 2 ALU Cycle 2 a b cd e f g * + + * * - j k - 1 2 3 4 5 6 7

81 Minimum-Resource, Latency-Constrained 5)Schedule ops with 0 slack - Update required number of resources 6) Schedule ops that require no extra resources Node LPC 1 2 1 3 1 4 3 5 2 6 2 7 3 Slack Candidates = {4,5,6} Cycle 1 1 2 0 2 Resources = 2 Mult, 2 ALU Cycle 2 Already 1 ALU - 4 can be scheduled a b cd e f g * + + * * - j k - 1 2 3 4 5 6 7

82 Minimum-Resource, Latency-Constrained 2) For each resource type + 3) Determine candidate nodes C + 4) Compute slack for each candidate Slack = latest possible cycle - current cycle Node LPC 1 2 1 3 1 4 3 5 2 6 2 7 3 Slack Candidates = {7} Cycle 1 2 0 Resources = 2 Mult, 2 ALU Cycle 3 a b cd e f g * + + * * - j k - 1 2 3 4 5 6 7

83 Minimum-Resource, Latency-Constrained Final Schedule Node LPC 1 2 1 3 1 4 3 5 2 6 2 7 3 Slack Cycle 1 2 3 a b cd e f g * + + * * - j k - 1 2 3 4 5 6 7 Cycle1 Cycle2 Cycle3 Required Resources = 2 Mult, 2 ALU

84 Other extensions Chaining - Multiple operations in a single cycle Pipelining - Input: DFG, data delivery rate - For fully pipelined circuit, must have one resource per operation (remember systolic arrays) a b cd + + + - e f / Multiple adds may be faster than 1 divide - perform adds in one cycle

85 Summary Scheduling assigns each operation in a DFG a start time - Done for each DFG in the CDFG Different Types - Minimum Latency + ASAP, ALAP - Latency-constrained + ASAP, ALAP - Minimum-latency, resource-constrained + Hu’s Algorithm + List Scheduling - Minimum-resource, latency-constrained + List Scheduling

86 Binding/Resource Sharing

87 Binding During scheduling, we determined: - When operations will execute - How many resources are needed We still need to decide which operations execute on which resources – binding - If multiple operations use the same resource, we need to decide how resources are shared - resource sharing.

88 Binding Map operations onto resources such that operations in same cycle do not use same resource * + + * * + - - 1 23 4 5 6 7 8 Cycle1 Cycle2 Cycle3 Cycle4 2 ALUs (+/-), 2 Multipliers Mult1 ALU1 ALU2Mult2

89 Binding Many possibilities - Bad binding may increase resources, require huge steering logic, reduce clock, etc. * + + * * + - - 1 23 4 5 6 7 8 Cycle1 Cycle2 Cycle3 Cycle4 2 ALUs (+/-), 2 Multipliers Mult1 ALU1 ALU2 Mult2

90 Binding Can’t do this - 1 resource can’t perform multiple ops simultaneously! * + + * * + - - 1 23 4 5 6 7 8 Cycle1 Cycle2 Cycle3 Cycle4 2 ALUs (+/-), 2 Multipliers

91 Binding How to automate? - More graph theory Compatibility Graph - Each node is an operation - Edges represent compatible operations + Compatible - if two ops can share a resource I.e. Ops that use same type of resource (ALU, etc.) and are scheduled to different cycles

92 Compatibility Graph * + + * * + - - 1 23 4 5 6 7 8 Cycle1 Cycle2 Cycle3 Cycle4 28 74 3 1 6 5 ALUs Mults 5 and 6 not compatible (same cycle) 2 and 3 not compatible (same cycle)

93 Compatibility Graph * + + * * + - - 1 23 4 5 6 7 8 Cycle1 Cycle2 Cycle3 Cycle4 28 74 3 1 6 5 ALUs Mults Note - Fully connected subgraphs can share a resource (all involved nodes are compatible)

96 Compatibility Graph Binding: Find minimum number of fully connected subgraphs that cover entire graph - Well-known problem: Clique partitioning (NP-complete) + Cliques = { {2,8,7,4},{3},{1,5},{6} } ALU1 executes 2,8,7,4 ALU2 executes 3 MULT1 executes 1,5 MULT2 executes 6 28 74 3 1 6 5
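Optimal clique partitioning is NP-complete, so real tools use heuristics; a simple greedy pass that grows compatible groups in node order happens to recover the slide's partition for this example (a Python sketch; the op table encodes each node's resource type and scheduled cycle):

```python
def bind(ops):
    """ops: {name: (resource_type, cycle)} -> {name: resource index}.
    Two ops are compatible if they use the same resource type
    and are scheduled to different cycles."""
    def compatible(n, group):
        return all(ops[n][0] == ops[m][0] and ops[n][1] != ops[m][1]
                   for m in group)
    groups = []                              # each group shares one resource
    for n in ops:
        for g in groups:
            if compatible(n, g):             # join the first compatible group
                g.append(n)
                break
        else:
            groups.append([n])               # open a new resource
    return {n: i for i, g in enumerate(groups) for n in g}

# The slide's schedule: mults 1, 5, 6 and ALU ops 2, 3, 4, 7, 8
ops = {1: ("mul", 1), 2: ("alu", 1), 3: ("alu", 1), 4: ("alu", 2),
       5: ("mul", 2), 6: ("mul", 2), 7: ("alu", 3), 8: ("alu", 4)}
binding = bind(ops)
```

The result is four resources, matching the cliques {2,8,7,4}, {3}, {1,5}, {6}: 2 ALUs and 2 multipliers.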

97 Compatibility Graph * + + * * + - - 1 23 4 5 6 7 8 Cycle1 Cycle2 Cycle3 Cycle4 28 74 3 1 6 5 ALUs Mults Final Binding:

98 Compatibility Graph * + + * * + - - 1 23 4 5 6 7 8 Cycle1 Cycle2 Cycle3 Cycle4 28 74 3 1 6 5 ALUs Mults Alternative Final Binding:

99 Translation to Datapath * + + * * + - - 1 23 4 5 6 7 8 Cycle1 Cycle2 Cycle3 Cycle4 Mult(1,5)Mult(6)ALU(2,7,8,4)ALU(3) Mux Reg Mux a b c h Reg abcdefghi de i g e f 1)Add resources and registers 2)Add mux for each input 3)Add input to left mux for each left input in DFG 4)Do same for right mux 5)If only 1 input, remove mux

100 Left Edge Algorithm Alternative to clique partitioning - Take scheduled DFG, rotate it 90 degrees 2 ALUs (+/-), 2 Multipliers a b f g * + + * * + - j k * 1 2 3 4 5 6 7 8 Cycle1 Cycle2 Cycle5 Cycle6 Cycle4 Cycle3 Cycle7 c d e

101 Left Edge Algorithm 2 ALUs (+/-), 2 Multipliers * + + * * + - * 1 2 3 4 5 6 7 8 Cycle1 Cycle2 Cycle5 Cycle6Cycle4 Cycle3 Cycle7 1)Initialize right_edge to 0 2)Find a node N whose left edge is >= right_edge 3)Bind N to a particular resource 4)Update right_edge to the right edge of N 5)Repeat from 2) for nodes using the same resource type until right_edge passes all nodes 6)Repeat from 1) until all nodes bound right_edge
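The sweep described above is the classic left-edge algorithm over operation lifetimes (a Python sketch; lifetimes are modeled as half-open [start, end) cycle intervals so that "left edge >= right_edge" means no overlap; the multiplier lifetimes below are illustrative, not the slide's exact schedule):

```python
def left_edge(intervals):
    """intervals: {name: (left, right)}, right exclusive.
    Returns name -> resource index."""
    remaining = sorted(intervals, key=lambda n: intervals[n][0])
    binding, res = {}, 0
    while remaining:                         # each outer pass packs 1 resource
        right_edge, leftover = 0, []
        for n in remaining:
            left, right = intervals[n]
            if left >= right_edge:           # starts after current occupant ends
                binding[n] = res
                right_edge = right           # update right_edge
            else:
                leftover.append(n)           # wait for the next resource
        remaining, res = leftover, res + 1
    return binding

# Two-cycle multiply lifetimes: m1 and m2 overlap, m3 and m4 do not
lifetimes = {"m1": (1, 3), "m2": (1, 3), "m3": (3, 5), "m4": (5, 7)}
packing = left_edge(lifetimes)
```

The first sweep packs m1, m3, m4 onto one multiplier; only the overlapping m2 needs a second.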

110 Extensions Algorithms presented so far find a valid binding - But, do not consider amount of steering logic required - Different bindings can require significantly different # of muxes One solution - Extend compatibility graph + Use weighted edges/nodes - cost function representing steering logic + Perform clique partitioning, finding the set of cliques that minimize weight

111 Binding Summary Binding maps operations onto physical resources - Determines sharing among resources Binding may greatly affect steering logic Trivial for fully-pipelined circuits - 1 resource per operation Straightforward translation from bound DFG to datapath

112 Summary

113 Main Steps Front-end (lexing/parsing) converts code into intermediate representation - We looked at CDFG Scheduling assigns a start time for each operation in DFG - CFG node start times defined by control dependencies - Resource allocation determined by schedule Binding maps scheduled operations onto physical resources - Determines how resources are shared Big picture: - Scheduled/Bound DFG can be translated into a datapath - CFG can be translated to a controller - => High-level synthesis can create a custom circuit for any CDFG!

114 Limitations Task-level parallelism - Parallelism in CDFG limited to individual control states + Can’t have multiple states executing concurrently - Potential solution: use model other than CDFG + Kahn Process Networks Nodes represent parallel processes/tasks Edges represent communication between processes + High-level synthesis can create a controller+datapath for each process Must also consider communication buffers - Challenge: + Most high-level code does not have explicit parallelism Difficult/impossible to extract task-level parallelism from code

115 Limitations Coding practices limit circuit performance - Very often, languages contain constructs not appropriate for circuit implementation + Recursion, pointers, virtual functions, etc. Potential solution: use specialized languages - Remove problematic constructs, add task-level parallelism Challenge: - Difficult to learn new languages - Many designers resist changes to tool flow

116 Limitations Expert designers can achieve better circuits - High-level synthesis has to work with specification in code + Can be difficult to automatically create efficient pipeline + May require dozens of optimizations applied in a particular order - Expert designer can transform algorithm + Synthesis can transform code, but can’t change algorithm Potential Solution: ??? - New language? - New methodology? - New tools?

