1 High-Level Synthesis for Reconfigurable Systems
2 Agenda
Modeling
1. Dataflow graphs
2. Sequencing graphs
3. Finite state machine with datapath
High-level synthesis and temporal partitioning
1. List-scheduling-based approach
2. Exact methods: integer linear programming
3. Network-flow-based temporal partitioning
3 High-Level Synthesis
New computing platforms: their success depends on the ease of programming
− Microprocessors
− DSPs
Parallel computing: why is it not as successful as sequential programming?
A parallel implementation of a given application requires two steps:
1. The application must be partitioned to identify the dependencies among its parts (to determine which parts can execute in parallel).
− Requires a good knowledge of the structure of the application.
2. Mapping must allocate the independent blocks of the application to the computing hardware resources.
− Requires a good understanding of the underlying hardware structure.
Reconfigurable computing faces a similar problem:
− Code must be written in an HDL.
− It is not easy to decouple the implementation process from the hardware.
High-level descriptions increase the acceptance of a platform.
4 Modeling
High-level descriptions: modeling is a key aspect in the design of a system.
The models used must be powerful enough to
− capture all of the user's needs and
− remain easy to understand and manipulate.
Several powerful models exist: FSMs, StateCharts, DFGs, Petri nets, ...
Focus in this course: dataflow graphs, sequencing graphs, and the finite state machine with datapath (FSMD).
5 Dataflow Graph
DFG: a means to describe a computing task in a streaming mode.
Operators: nodes.
Operands: inputs of the nodes.
A node's output can be used as input to other nodes
− data dependencies in the graph.
Given a set of tasks {t_1, ..., t_k}, a DFG is a DAG G = (V, E), where V (= T) is the set of nodes representing the operators and E is the set of edges representing the data dependencies between tasks.
(The slide shows the DFG for the quadratic root computation.)
6 Definitions
The latency t_i of a node v_i ∈ V:
− the time it takes to compute the function of v_i using its implementation H_vi.
The area of v_i: a_i = l_i × h_i, where l_i and h_i are the length and the height of the rectangular implementation H_vi (figure on slide).
The weight w_ij of an edge e_ij ∈ E:
− the width of the bus connecting the two components H_vi and H_vj.
The latency t_ij of e_ij:
− the time needed to transmit data from H_vi to H_vj.
Simplification: we will simply write v_i both for a node and for its hardware implementation H_vi.
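To make these definitions concrete, here is a minimal sketch of a DFG node/edge structure in Python; the names and attribute values are my own illustration, not from the slides:

```python
# Minimal DFG sketch: nodes carry a latency and an area (a_i = l_i * h_i),
# edges carry a bus width w_ij and a transmission latency t_ij.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    latency: int      # t_i: time to compute the node's function on H_vi
    length: int       # l_i
    height: int       # h_i

    @property
    def area(self) -> int:
        return self.length * self.height   # a_i = l_i * h_i

@dataclass
class Edge:
    src: str
    dst: str
    width: int        # w_ij: width of the bus between H_vi and H_vj
    latency: int      # t_ij: time to transmit data from H_vi to H_vj

# Hypothetical values for a tiny multiply-then-add DFG.
nodes = {n.name: n for n in (Node("mul1", 3, 10, 10), Node("add1", 1, 5, 10))}
edges = [Edge("mul1", "add1", width=32, latency=1)]
print(nodes["mul1"].area)   # 100
```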
7 DFG
Any high-level program can be compiled into a DFG, provided that it contains:
− no loops and
− no branching instructions.
Therefore, a DFG is used only for parts of a program.
Extensions to other models:
− Sequencing graph
− FSMD
− CDFG
8 Sequencing Graph
Sequencing graph: a hierarchical DFG with two different types of nodes:
1. Operation nodes: the normal "task nodes" of a DFG.
2. Link nodes (or branching nodes): point to another sequencing graph at a lower level of the hierarchy.
Properties: linking nodes evaluate conditional clauses. If placed at the tail of alternative paths, they correspond to the possible branches.
9 Sequencing Graph
Example: depending on the condition that node BR evaluates, one of the two sub-sequencing graphs (1 or 2) is activated.
Loop description:
− The body of the loop is described in a single sub-sequencing graph; BR evaluates the loop exit condition.
10 Finite State Machine with Datapath (FSMD)
FSMD: an extension of the DFG with an FSM; it is a 6-tuple <S, I, O, F, H, s_0>:
S = {s_0, ..., s_l}: states,
I = {i_0, ..., i_m}: inputs,
O = {o_0, ..., o_n}: outputs,
F: S × I × O → S: a transition function that maps a tuple (s_i, i_j, o_k) to a state,
H: S → O: an action function that maps the current state to an output,
s_0: the initial state.
FSMD vs. FSM:
− An FSMD operates on arbitrarily complex data types, not just Boolean variables.
− A transition may include arithmetic operations.
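A minimal sketch of an FSMD as data, assuming (as a simplification of F : S × I × O → S) that transitions are keyed only on (state, input); the concrete states and actions are hypothetical:

```python
# Toy FSMD as a 6-tuple; all concrete values are hypothetical.
# F is keyed on (state, input) here -- a simplification of S x I x O -> S.
S = {"s0", "s1"}
I = {0, 1}
O = {"x <- 0", "x <- x + 1"}
F = {("s0", 0): "s0", ("s0", 1): "s1", ("s1", 0): "s0", ("s1", 1): "s1"}
H = {"s0": "x <- 0", "s1": "x <- x + 1"}   # action function H: S -> O
s0 = "s0"                                  # initial state

def step(state, inp):
    """One FSMD step: emit the state's action, then take the transition."""
    return H[state], F[(state, inp)]

action, nxt = step(s0, 1)
print(action, "->", nxt)   # x <- 0 -> s1
```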
11 FSMD
Modeling with FSMD: a program is transformed into an FSMD by
− transforming the statements of the program into FSMD states.
The statements are first classified into three categories:
1. assignment statements,
2. branch statements, and
3. loop statements.
1. Assignment statement: a single state is created that executes the assignment action. An arc is created connecting this state with the state of the next statement.
12 FSMD: Loop
2. Loop statement: a condition state C and a join state J, both with no action, are created.
An arc is added:
− label: the loop condition,
− connects C to the state of the first statement in the loop body.
Another arc is added:
− label: the complement of the loop condition,
− connects C to the state of the first statement after the loop body.
An edge is added:
− connects the state of the last statement in the loop body to the join state.
Another edge is added:
− connects the join state back to the condition state.
A worked example follows below.
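As a hypothetical worked example, applying the loop rule to `while (x < 10) x = x + 1;` followed by `y = x` gives a condition state C, one body state, and a join state J:

```python
# FSMD states for: while (x < 10) x = x + 1;  y = x
# Built with the loop rule: condition state C, join state J (both action-free).
transitions = {
    ("C", "x < 10"):      "body",  # loop condition -> first body statement
    ("C", "not(x < 10)"): "next",  # complement -> first statement after loop
    ("body", ""):         "J",     # last body statement -> join state
    ("J", ""):            "C",     # join state -> back to the condition state
}
actions = {"body": "x <- x + 1", "next": "y <- x"}  # C and J carry no action
print(transitions[("C", "x < 10")])   # body
```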
13 FSMD: Branch
3. Branch statement: a condition state C and a join state J, both with no action, are created.
An arc is added:
− label: the first branch condition,
− connects C to the state of the first statement of the first branch.
Another arc is added:
− label: the complement of the first condition ANDed with the second branch condition,
− connects C to the state of the first statement of the second branch.
...
Each state corresponding to the last statement of a branch is connected to the join state.
The join state is connected to the state corresponding to the first statement after the branch.
A worked example follows below.
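Similarly, a hypothetical two-way branch `if (c1) a = 1; else a = 2;` followed by `b = a` transforms into:

```python
# FSMD states for: if (c1) a = 1; else a = 2;  b = a
transitions = {
    ("C", "c1"):      "then",  # first branch condition
    ("C", "not(c1)"): "else",  # complement (ANDed with the next branch
                               # condition if there were further branches)
    ("then", ""):     "J",     # last statement of each branch -> join state
    ("else", ""):     "J",
    ("J", ""):        "next",  # join -> first statement after the branch
}
actions = {"then": "a <- 1", "else": "a <- 2", "next": "b <- a"}
```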
14 FSMD: Example (figure on slide)
15 FSMD: Datapath Synthesis
Datapath synthesis steps:
1. A register for each variable.
2. An FU for each arithmetic operation.
3. Connecting the FU ports to the registers.
4. Creating a unique identifier for each control input/output of the FUs.
5. (Resource sharing through MUXes.)
16 High-Level Synthesis
High-level synthesis (architectural synthesis): transforming an abstract model of circuit behavior into a datapath and a control unit.
Steps:
1. Allocation: defines the resource types required by the design and, for each type, the number of instances.
2. Binding: maps each operation to an instance of a given resource.
3. Scheduling: sets the temporal assignment of resources to operators.
− Decides which operator owns a resource at a given time.
17 Allocation
Allocation (formal definition): for a given specification with a set of operators or tasks T = {t_1, t_2, ..., t_n} to be implemented on a set of resource types R = {r_1, r_2, ..., r_t}, allocation is a function α : R → Z+, where α(r_i) = z_i is the number of available instances of resource type r_i.
18 Allocation
Example:
T = {*, *, *, -, -, *, *, *, +, +, <}
R = {ALU, MUL} = {1, 2}
− ALU: add, sub, compare
α(1) = 5
α(2) = 6
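The same allocation written as a small Python check; the dict encoding is mine, the numbers are the slide's:

```python
# The example allocation: resource type 1 = ALU (5 instances),
# resource type 2 = MUL (6 instances).
ALU, MUL = 1, 2
alpha = {ALU: 5, MUL: 6}
ops = ["*", "*", "*", "-", "-", "*", "*", "*", "+", "+", "<"]
# Here one instance is allocated per operation of each type.
assert alpha[MUL] == ops.count("*")                 # 6 multiplications
assert alpha[ALU] == len(ops) - ops.count("*")      # 5 ALU operations
```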
19 Allocation
Another definition: α : T → Z+,
− operators in the specification are mapped to resource types.
α(t_1) = 2
α(t_2) = 2
...
α(t_5) = 1
...
α(t_11) = 1
More than one resource type may exist for an operator:
− {CLA, RipAdd, BoothMul, SerMul, ...}
− Resource selection is then needed (α is one-to-many).
20 Binding
Binding (formal definition): for T = {t_1, t_2, ..., t_n} and R = {r_1, r_2, ..., r_t}, binding is a function β : T → R × Z+, where β(t_i) = (r_i, b_i), with 1 ≤ b_i ≤ α(r_i), is the instance of the resource type r_i to which t_i is mapped.
21 Binding
Example:
T = {*, *, *, -, -, *, *, *, +, +, <}
R = {ALU, MUL} = {1, 2}
− ALU: add, sub, compare
β(t_1) = (2, 1)
β(t_2) = (2, 2)
β(t_3) = (2, 3)
β(t_4) = (1, 1)
β(t_5) = (1, 2)
β(t_6) = (2, 4)
β(t_7) = (2, 5)
β(t_8) = (2, 6)
...
β(t_11) = (1, 5)
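The slide's binding written as data, with the validity condition 1 ≤ b_i ≤ α(r_i) checked programmatically; the tuple encoding is mine, and the tasks elided on the slide stay elided here:

```python
# The example binding: beta maps each task to (resource type, instance index).
ALU, MUL = 1, 2
alpha = {ALU: 5, MUL: 6}
beta = {
    "t1": (MUL, 1), "t2": (MUL, 2), "t3": (MUL, 3),
    "t4": (ALU, 1), "t5": (ALU, 2),
    "t6": (MUL, 4), "t7": (MUL, 5), "t8": (MUL, 6),
    # t9 and t10 are elided on the slide and stay elided here.
    "t11": (ALU, 5),
}
# Validity: every instance index must satisfy 1 <= b_i <= alpha(r_i).
assert all(1 <= b <= alpha[r] for (r, b) in beta.values())
```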
22 Scheduling
Scheduling (formal definition): for T = {t_1, t_2, ..., t_n}, scheduling is a function ς : T → Z+, where ς(t_i) denotes the starting time of task t_i.
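One simple way to compute such a function is ASAP (as-soon-as-possible) scheduling; the sketch below is my illustration and ignores resource constraints and the edge latencies t_ij:

```python
# ASAP scheduling sketch: sigma(t_i) = earliest start respecting data
# dependencies only; resource limits and edge latencies t_ij are ignored.
from graphlib import TopologicalSorter

exec_time = {"a": 2, "b": 2, "c": 1}               # T(t_i), hypothetical
preds = {"a": set(), "b": set(), "c": {"a", "b"}}  # c consumes a and b

sigma = {}
for t in TopologicalSorter(preds).static_order():  # predecessors come first
    sigma[t] = max((sigma[p] + exec_time[p] for p in preds[t]), default=0)

print(sigma)   # e.g. {'a': 0, 'b': 0, 'c': 2}
```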
23 High-Level Synthesis
Fundamental differences in RCS:
1. Uniform resources:
− Any task can be implemented on any given part of the device (provided that the available resources are sufficient).
24 General vs. RCS High-Level Synthesis
Example: assumptions for a "fixed-resource" device:
− A multiplication needs 100 basic resource units.
− The adder and the subtractor need 50 units each.
− Allocation selects one instance of each resource type.
Consequences:
− Two subtractors cannot be used in the first level.
− The adder cannot be used in the first step, due to data dependencies.
Minimum execution time: 4 steps.
(Figure: schedule using one adder, one subtractor, and one multiplier.)
25 General vs. RCS High-Level Synthesis
Assumptions for a reconfigurable device:
− A multiplication needs 100 LUTs.
− An adder/subtractor needs 50 LUTs each.
− Total available amount of resources: 200 LUTs.
The two subtractors can be assigned in the first step.
Minimum execution time: 3 steps.
(Figure: schedule on the reconfigurable device.)
26 High-Level Synthesis
Fundamental differences in RCS:
2. In general HLS:
− The application is specified using a structure that encapsulates a datapath and a control part.
− The synthesis process allocates the resources to operators at different times, according to a computed schedule.
− The control part is synthesized.
In RCS:
− Hardware modules implemented as datapaths normally compete for execution on the chip.
− A processor controls the selection of the hardware modules by means of reconfiguration.
− The same processor is also in charge of activating the individual resources in the corresponding hardware accelerators.
27 Temporal Partitioning
The resources on the device are not allocated to a single operator, but to a set of operators that must be placed and removed at the same time.
An application must therefore be partitioned into sets of operators.
The partitions are then implemented successively, at different times, on the device.
28 Configuration
Configuration: given a reconfigurable processing unit H and a set of tasks T = {t_1, ..., t_n} available as cores C = {c_1, ..., c_n}, we define the configuration ζ_i of the RPU at time s_i to be the set of cores ζ_i = {c_i1, ..., c_ik} ⊆ C running on H at time s_i.
A core (module) c_i exists in the library for each t_i:
− hard, soft, or firm module.
29 Firm vs. Hard Modules
(Figures: placement of firm modules vs. placement of hard modules.)
30 Schedule
Schedule: a function ς : V → Z+, where ς(v_i) denotes the starting time of the node v_i that implements a task t_i.
Feasible schedule: ς is feasible if, for every edge e_ij = (v_i, v_j) ∈ E:
ς(t_j) ≥ ς(t_i) + T(t_i) + t_ij
− e_ij defines a data dependency between the tasks t_i and t_j,
− t_ij is the latency of the edge e_ij,
− T(t_i) is the time it takes the node v_i to complete execution.
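The feasibility condition translates directly into a check; the instance at the bottom is hypothetical:

```python
# Feasibility check: for every edge (v_i, v_j) with transmission latency t_ij,
# sigma(t_j) >= sigma(t_i) + T(t_i) + t_ij must hold.
def is_feasible(sigma, exec_time, edges):
    """sigma, exec_time: dicts per task; edges: list of (i, j, t_ij)."""
    return all(sigma[j] >= sigma[i] + exec_time[i] + t_ij
               for (i, j, t_ij) in edges)

# Hypothetical instance: one edge a -> b with t_ab = 1.
print(is_feasible({"a": 0, "b": 3}, {"a": 2, "b": 1}, [("a", "b", 1)]))  # True
```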
31 Ordering Relation
Ordering relation ≤:
v_i ≤ v_j ⟺ for every schedule ς, ς(v_i) ≤ ς(v_j).
− ≤ is a partial ordering, as it is not defined for all pairs of nodes in G.
32 Partition
Partition: a partition P of the graph G = (V, E) is its division into disjoint subsets P_1, ..., P_m such that P_1 ∪ ... ∪ P_m = V.
Feasible partition: a partition is feasible with respect to a reconfigurable device H with area a(H) and pin count p(H) if:
− for every P_k ∈ P: a(P_k) = Σ_{v_i ∈ P_k} a_i ≤ a(H), and
− (1/2) Σ w_ij ≤ p(H), where the sum runs over the crossing edges e_ij ∈ E.
Crossing edge: an edge that connects a component inside a partition with a component outside of it.
A sketch of this check follows below.
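A sketch of the two feasibility conditions as code, under the assumption that the graph is given as node sets plus weighted edge triples (my encoding, not the book's):

```python
# Feasibility of a partitioning against a device H:
#   area: sum of a_i over each block  <=  a(H)
#   pins: (1/2) * sum of w_ij over crossing edges  <=  p(H)
def is_feasible_partitioning(blocks, area, edges, a_H, p_H):
    """blocks: list of node sets; area: node -> a_i; edges: (u, v, w_uv)."""
    block_of = {v: k for k, blk in enumerate(blocks) for v in blk}
    if any(sum(area[v] for v in blk) > a_H for blk in blocks):
        return False
    crossing = sum(w for (u, v, w) in edges if block_of[u] != block_of[v])
    return crossing / 2 <= p_H

blocks = [{"a", "b"}, {"c"}]   # hypothetical two-block partitioning
print(is_feasible_partitioning(blocks, {"a": 50, "b": 50, "c": 100},
                               [("a", "c", 32), ("b", "c", 32)],
                               a_H=120, p_H=40))   # True: 100 <= 120, 32 <= 40
```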
33 Run Time
Run time r(P_i) of a partition: the maximum time from the input of the data to the output of the result.
34 Ordering Relation
Ordering relation for partitions:
P_i ≤ P_j ⟺ for all v_i ∈ P_i and v_j ∈ P_j, either v_i ≤ v_j or v_i and v_j are not in relation.
Ordered partitions: a partitioning P is ordered ⟺ an ordering relation ≤ exists on P.
If P is ordered, then for any pair of partitions, one can always be implemented after the other with respect to any scheduling relation.
35 Temporal Partitioning
Temporal partitioning: given a DFG G = (V, E) and a reconfigurable device H, a temporal partitioning of G on H is an ordered partitioning P of G with respect to H.
36 Temporal Partitioning
Cycles are not allowed in the DFG.
− Otherwise, the resulting partition may not be schedulable on the device.
(Figure: a cycle spanning two partitions.)
37 Configuration Graph
Goal: computation and scheduling of a configuration graph.
A configuration graph is a graph in which:
− the nodes are partitions or bitstreams and
− the edges reflect the precedence constraints in the given DFG.
Formal definition: given a DFG G = (V, E) and a temporal partitioning P = {P_1, ..., P_n} of G, the configuration graph of G relative to P, written Γ(G/P) = (P, E_P), is the graph whose nodes are the partitions in P, with an edge e_P = (P_i, P_j) ∈ E_P ⟺ there exists an edge e = (v_i, v_j) ∈ E with v_i ∈ P_i and v_j ∈ P_j.
Configuration: for a given partitioning P, each node P_i ∈ P has an associated configuration ζ_i, which is the implementation of P_i on the given device H.
38 Temporal Partitioning
Whenever a new partition is downloaded, the partition that was running is destroyed.
Communication happens through inter-configuration registers (or a communication memory), which
− may sit in main memory, or
− may sit at the boundary of the device to hold the input and output values.
The configuration sequence is controlled by the host processor.
(Figure: the device's registers mapped into the processor address space.)
39 Temporal Partitioning
Steps (for P_i and P_j with P_i ≤ P_j):
1. The configuration for P_i is first downloaded into the device.
2. P_i executes.
3. P_i copies all the data it needs to send to other partitions into the communication memory.
4. The device is reconfigured to implement the partition P_j.
5. P_j accesses the communication memory and collects its data.
A host-side sketch of this sequence follows below.
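The host-side control implied by these steps might look like the sketch below; every function in it is a hypothetical placeholder for real driver calls, not an actual reconfiguration API:

```python
# Host-side sketch of the five steps; the helpers are hypothetical stand-ins
# for real driver calls.
def download_bitstream(p):
    print(f"reconfiguring device with partition {p}")     # steps 1 and 4

def run_partition(p, comm_memory):
    inputs = dict(comm_memory)                            # step 5: collect data
    comm_memory[p] = f"results of {p} from {sorted(inputs)}"  # steps 2 and 3

def execute_in_sequence(partitions):
    comm_memory = {}                 # inter-configuration registers / memory
    for p in partitions:             # partitions already ordered: P_i <= P_j
        download_bitstream(p)
        run_partition(p, comm_memory)
    return comm_memory

execute_in_sequence(["P1", "P2", "P3"])
```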
40 Temporal Partitioning
Objectives for optimization:
1. Number of interconnections: very important, since minimizing it reduces
− the amount of exchanged data and
− the amount of memory needed to temporarily store the data.
2. Number of produced blocks (partitions): reduces the number of reconfigurations (total time?).
3. Overall computation delay, which depends on
− the run times of the partitions,
− the processor used for reconfiguration, and
− the speed of the data exchange.
4. Similarity between consecutive partitions (for partial reconfiguration).
5. Overall amount of wasted resources on the chip: when components with shorter run times are placed in the same partition as components with longer run times, the shorter-running components remain idle for a long period of time.
41 Wasted Resources
Wasted resource wr(v_i) of a node v_i: the unused area occupied by the node v_i during the computation of its partition:
wr(v_i) = (t(P_i) − T(t_i)) × a_i
− t(P_i): run time of the partition P_i,
− T(t_i): run time of the component v_i,
− a_i: area of v_i.
Wasted resource wr(P_i) of a partition P_i = {v_i1, ..., v_in}:
wr(P_i) = Σ_{j=1,...,n} wr(v_ij)
Wasted resource of a partitioning P = {P_1, ..., P_k}:
wr(P) = Σ_{j=1,...,k} wr(P_j)
(Figure: run time × area rectangle of a partition.)
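The three wasted-resource formulas collapse into a short computation; the numbers in the example are invented:

```python
# wr(v_i) = (t(P_i) - T(t_i)) * a_i, summed over the nodes of a partition
# (and, analogously, over the partitions of a partitioning).
def wasted_in_partition(runtime, exec_time, area, nodes):
    return sum((runtime - exec_time[v]) * area[v] for v in nodes)

exec_time = {"mul": 3, "add": 1}
area = {"mul": 100, "add": 50}
# The partition's run time is set by its slowest node: t(P) = 3.
print(wasted_in_partition(3, exec_time, area, ["mul", "add"]))  # 0 + 100 = 100
```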
42 Communication Overhead
Communication cost: modeled as graph connectivity.
Connectivity of a graph G = (V, E):
con(G) = 2|E| / (|V|² − |V|)
− |V|² − |V|: the number of all edges that could be built over V.
(Figure: a 10-node example graph with connectivity 0.24.)
43 Communication Overhead
Quality of a partitioning P = {P_1, ..., P_n}: the average connectivity over P:
Q(P) = (1/n) Σ_{i=1,...,n} con(P_i)
(Figure: two partitionings of the example graph, with quality 0.25 and 0.45.)
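Both measures are easy to compute; in this sketch (my encoding) each block's connectivity is taken over the subgraph induced by the block, and blocks are given as node sets:

```python
# con(G) = 2|E| / (|V|^2 - |V|); Q(P) = average connectivity of the
# subgraphs induced by the blocks of the partitioning.
def connectivity(num_nodes, num_edges):
    return 2 * num_edges / (num_nodes**2 - num_nodes)

def quality(blocks, edges):
    """blocks: list of node sets; edges: list of (u, v) pairs."""
    cons = []
    for blk in blocks:
        internal = sum(1 for (u, v) in edges if u in blk and v in blk)
        cons.append(connectivity(len(blk), internal) if len(blk) > 1 else 0.0)
    return sum(cons) / len(cons)

# Hypothetical 4-node chain split into two blocks of two.
edges = [("a", "b"), ("b", "c"), ("c", "d")]
print(quality([{"a", "b"}, {"c", "d"}], edges))   # 1.0: both blocks fully connected
```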
44 Communication Overhead
Minimize the communication overhead by minimizing the weighted sum of the crossing edges among the partitions. This
− minimizes the size of the communication memory and
− minimizes the communication time.
Heuristic: highly connected components are placed in the same partition (high-quality partitioning).
45 References
[Bobda07] Christophe Bobda, "Introduction to Reconfigurable Computing: Architectures, Algorithms and Applications," Springer, 2007.