1
Reconfigurable Computing
Dr. Christophe Bobda CSCE Department University of Arkansas
2
Chapter 4 High-Level Synthesis for Reconfigurable Devices
3
Agenda
Modeling
  Dataflow graphs
  Sequencing graphs
  Finite State Machine with Datapath
High-level synthesis and Temporal partitioning
  The List-Scheduling-based approach
  Exact methods: Integer Linear Programming
  Network-flow-based TP
4
1. Modeling
The success of a hardware platform depends on the ease of programming it (microprocessors, DSPs). High-level descriptions increase the acceptance of a platform.
Modeling is a key aspect in the design of a system. The models used must be powerful enough to:
  capture all the user's needs,
  describe all system parameters,
  and still be easy to understand and manipulate.
Several powerful models exist (FSM, StateCharts, DFG, Petri nets, etc.).
In this course we focus on three models: dataflow graphs, sequencing graphs and the Finite State Machine with Datapath (FSMD).
5
1. Definitions
Given a node v_i and its implementation as a rectangular module h_i in hardware:
  l_i and w_i denote the length and the height of h_i; a_i = l_i x w_i denotes its area.
  The latency t_i of v_i is the time it takes to compute the function of v_i using the module h_i.
  For a given edge e_ij = (v_i, v_j), w_ij is the weight of e_ij, i.e. the width of the bus connecting the two components v_i and v_j.
  The latency of e_ij is the time needed to transmit data from v_i to v_j.
Simplification: we will use v_i both for a node and for its hardware implementation.
6
1.1 Dataflow Graph
A means to describe a computing task in a streaming mode:
  operators are the nodes,
  the inputs of a node are its operands,
  a node's output can be used as input to other nodes,
  so data dependencies appear as edges in the graph.
Given a set of tasks T = {T_1, ..., T_k}, a dataflow graph (DFG) is a directed acyclic graph G = (V, E), where V is the set of nodes representing the operators and E is the set of edges. An edge e = (v_i, v_j) in E is defined through the (data) dependency between task T_i and task T_j.
Example: dataflow graph for the quadratic root computation (formula and figure on the slide).
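A minimal sketch of how such a DFG could be represented in Python, assuming each node stores its operator and its predecessors; the names and the dataclass-based structure are illustrative, not taken from the slides:

from dataclasses import dataclass, field

@dataclass
class Node:
    name: str          # e.g. "mul1"
    op: str            # operator, e.g. "*", "+", "-"
    preds: list = field(default_factory=list)  # data dependencies (incoming edges)

# Dataflow graph for y = (a * b) + (c * d): two multiplications feed one addition.
mul1 = Node("mul1", "*")
mul2 = Node("mul2", "*")
add1 = Node("add1", "+", preds=[mul1, mul2])

# The edge set E is implicit: (v_j, v_i) for every v_j in v_i.preds.
edges = [(p.name, n.name) for n in (mul1, mul2, add1) for p in n.preds]
print(edges)   # [('mul1', 'add1'), ('mul2', 'add1')]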
7
1.2 Sequencing Graph
A hierarchical dataflow graph with two different types of nodes:
  operation nodes, corresponding to the normal "task nodes" of a dataflow graph,
  link nodes (or branching nodes), which point to another sequencing graph at a lower level of the hierarchy.
Linking nodes evaluate conditional clauses; they are placed at the tail of alternative paths corresponding to possible branches.
Loops can be modeled by using a branching node as the tail of two paths: one for the exit from the loop, the other for the return to the body of the loop.
8
1.2 Sequencing graph Example:
According to the condition that node BR evaluates, one of the two sub-sequencing graphs (1 or 2) is activated.
Loop description: the body of the loop is described in only one sub-sequencing graph; BR evaluates the loop exit condition.
9
1.3 Finite State Machine with Datapath (FSMD)
Extension of the dataflow graph with a finite state machine.
Formally, an FSMD is a 7-tuple (S, I, O, V, f, h, s0) where:
  S is a set of states,
  I is a set of inputs,
  O is a set of outputs,
  V is a set of variables,
  f : S x I x V -> S is a transition function that maps a tuple (state, input, variable) to a state,
  h : S -> O x V is an action function that maps the current state to outputs and variables,
  s0 is the initial state.
FSMD vs. FSM:
  an FSMD operates on arbitrarily complex data types,
  transitions may include arithmetic operations.
10
1.3 Finite State Machine with Datapath (FSMD)
Modeling with FSMD: a program is transformed into an FSMD by transforming the statements of the program into FSMD states. The statements are first classified into three categories: assignment statements, branch statements and loop statements.
For an assignment statement, a single state is created that executes the assignment action. An arc connecting the newly created state with the state corresponding to the next program statement is added.
11
1.3 Finite State Machine with Datapath (FSMD)
For a loop statement, a condition state C and a join state J, both with no action, are created. An arc labeled with the loop condition, connecting the condition state C with the state corresponding to the first statement in the loop body, is added to the FSMD. Accordingly, another arc is added from C to the state corresponding to the first statement after the loop body; this arc is labeled with the complement of the loop condition. Finally, an edge is added from the state corresponding to the last statement in the loop body to the join state, and another edge is added from the join state back to the condition state.
12
1.3 Finite State Machine with Datapath (FSMD)
For a branch statement, a condition state C and a join state J, both with no action, are created. An arc, labeled with the first branch condition, is added from the condition state to the state of the first statement of the first branch. A second arc, labeled with the complement of the first condition ANDed with the second branch condition, is added from the condition state to the state of the first statement of the second branch. This process is repeated until the last branch condition. The state corresponding to the last statement of each branch is then connected to the join state. The join state is finally connected to the state corresponding to the first statement after the branch.
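To make the loop rule concrete, here is a small hedged sketch in Python that builds and simulates the FSMD states and arcs for a simple while loop; the state names, the dict-based representation and the example program are illustrative assumptions, not from the slides:

# Hypothetical FSMD for:  while (i < n) { s = s + i; i = i + 1; }
# States are strings, arcs are (src, condition, dst); actions are attached to states.
actions = {
    "C":  None,              # condition state, no action
    "S1": "s = s + i",       # one assignment state per statement in the body
    "S2": "i = i + 1",
    "J":  None,              # join state, no action
    "END": None,             # first statement after the loop
}

arcs = [
    ("C",  "i < n",        "S1"),   # loop condition: enter the body
    ("C",  "not (i < n)",  "END"),  # complement of the loop condition: exit
    ("S1", "True",         "S2"),
    ("S2", "True",         "J"),    # last statement of the body -> join state
    ("J",  "True",         "C"),    # join state back to the condition state
]

def step(state, env):
    """Execute one FSMD step: perform the state's action, then take an arc."""
    if actions[state]:
        exec(actions[state], {}, env)
    for src, cond, dst in arcs:
        if src == state and eval(cond, {}, env):
            return dst
    return state

env = {"i": 0, "n": 3, "s": 0}
state = "C"
while state != "END":
    state = step(state, env)
print(env["s"])   # 0 + 1 + 2 = 3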
13
1.3 Finite State Machine with Datapath (FSMD)
14
2. High-Level Synthesis
Next: compilation (synthesis) onto a set of hardware components, usually done in three different steps:
  Allocation defines the number of resource types required by the design and, for each type, the number of instances.
  Binding maps each operation to an instance of a given resource.
  Scheduling sets the temporal assignment of resources to operators.
Fundamental difference in reconfigurable computing: in contrast to common high-level synthesis, only binding and scheduling are important, since resources can be created as needed on a reconfigurable device.
15
2. High-Level Synthesis
Dataflow graph of the two example functions (formulas not shown).
Assumptions on a "resource-fixed" device:
  A multiplication needs 100 basic resources (e.g. 100 NAND gates or 100 LUTs).
  The adder and the subtractor need 50 LUTs each.
  Despite the availability of resources, two subtractors cannot be used in the first level.
  The adder cannot be used in the first step: it depends on a subtractor that must be executed first.
  Minimum execution time: 4 steps.
16
2. High-Level Synthesis
Dataflow graph of the two example functions (formulas not shown).
Assumptions on a reconfigurable device:
  A multiplication needs 100 basic resources (e.g. 100 NAND gates or 100 LUTs).
  The adder and the subtractor need 50 LUTs each.
  Total available amount of resources: 200 LUTs.
  The two subtractors can be assigned in the first step.
  Minimum execution time: 3 steps.
17
2. Temporal partitioning – Problem definition
Temporal partitioning: given a dataflow graph G = (V, E) and a single reconfigurable device R, compute a temporal partitioning of G for R, i.e. a partition of G whose blocks each satisfy the constraints imposed by R.
A temporal partition can also be defined as an ordered partition of G with the constraints imposed by R. With the ordering relation imposed on the partition, we reduce the solution space to only those partitions which can be scheduled on the device for execution. Therefore, cycles are not allowed in the dataflow graph; otherwise the resulting partition may not be schedulable on the device.
(Figure: a cycle between two partitions.)
18
2. Temporal partitioning
Goal: computation and scheduling of a configuration graph.
A configuration graph is a graph in which:
  nodes are partitions or bitstreams,
  edges reflect the precedence constraints in the given DFG.
Communication is done through inter-configuration registers; the configuration sequence is controlled by the host processor.
On reconfiguration, the register values are saved; after reconfiguration, the values are copied back into the registers.
(Figure: a configuration graph P1...P5 with inter-configuration registers, and the mapping of the device's IO registers into the processor address space: processor, bus, FPGA with IO register blocks.)
19
2. Temporal partitioning
Objectives:
  Minimize the number of interconnections: one of the most important goals, since it minimizes the amount of exchanged data and the amount of memory for temporarily storing the data.
  Minimize the number of produced blocks (partitions).
  Minimize the overall computation delay.
Quality of the result: a means to measure how well an algorithm performs.
  Connectivity of a graph G = (V, E): con(G) = 2|E| / (|V|^2 - |V|).
  Quality of a partitioning P = {P1, ..., Pn}: average connectivity over P.
  High quality means the algorithm performs well; low quality means that the algorithm performs poorly.
(Figure: an example graph with connectivity 0.24 and two partitionings of it with quality 0.25 and 0.45.)
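A small sketch of the two metrics in Python, assuming a graph given as a set of nodes and a list of undirected edges; the helper names are illustrative:

def connectivity(nodes, edges):
    """con(G) = 2*|E| / (|V|^2 - |V|) for a graph G = (V, E)."""
    n = len(nodes)
    return 2 * len(edges) / (n * n - n)

def quality(partitions, edges):
    """Average connectivity over the blocks P1..Pn of a partitioning P."""
    total = 0.0
    for part in partitions:
        # keep only the edges whose two endpoints lie inside this partition
        internal = [(u, v) for (u, v) in edges if u in part and v in part]
        total += connectivity(part, internal)
    return total / len(partitions)

# Tiny example: 4 nodes, a path 1-2-3-4, split into {1,2} and {3,4}.
edges = [(1, 2), (2, 3), (3, 4)]
print(connectivity({1, 2, 3, 4}, edges))   # 2*3/12 = 0.5
print(quality([{1, 2}, {3, 4}], edges))    # (1.0 + 1.0)/2 = 1.0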
20
2. Temporal partitioning
Scheduling: given a DFG and an architecture, i.e. a set of processing elements, compute the starting time of each node on a given resource.
Temporal partitioning: given a DFG and a reconfigurable device, compute the starting time of each node on the device; the starting time of a node is the starting time of the partition to which it belongs.
Solution approaches: list scheduling, Integer Linear Programming, network flow, spectral methods.
21
2.1 Unconstrained Scheduling
ASAP (as soon as possible):
  defines the earliest starting time for each node in the DFG,
  computes a minimal latency.
ALAP (as late as possible):
  defines the latest starting time for each node in the DFG according to a given latency.
The mobility of a node is the difference between its ALAP starting time and its ASAP starting time; a mobility of 0 means the node lies on a critical path.
22
2.1 ASAP Example
(Figure: unconstrained ASAP schedule with optimal latency L = 4. Time 0: four multiplications and one addition; time 1: two multiplications, one addition, one comparison; time 2: one subtraction; time 3: one subtraction.)
23
2.1 ASAP Algorithm

ASAP(G(V,E), d) {
    FOREACH (node vi without predecessor)
        s(vi) := 0;
    REPEAT {
        choose a node vi whose predecessors are all planned;
        s(vi) := max{ s(vj) + dj : (vj, vi) in E };
    } UNTIL (all nodes vi are planned);
    RETURN s;
}
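A runnable Python sketch of the same ASAP procedure, assuming the DFG is given as a list of edges and a per-node delay map; this representation is an assumption for illustration:

def asap(nodes, edges, delay):
    """ASAP schedule: earliest starting time of every node of a DAG.

    nodes: iterable of node ids; edges: list of (pred, succ) pairs;
    delay: dict node -> execution delay d_i.
    """
    preds = {v: [u for (u, w) in edges if w == v] for v in nodes}
    start = {v: 0 for v in nodes if not preds[v]}   # nodes without predecessors
    while len(start) < len(nodes):
        for v in nodes:
            if v not in start and all(p in start for p in preds[v]):
                start[v] = max(start[p] + delay[p] for p in preds[v])
    return start

# Example: mul1 and mul2 feed sub1, which feeds sub2 (unit delays).
nodes = ["mul1", "mul2", "sub1", "sub2"]
edges = [("mul1", "sub1"), ("mul2", "sub1"), ("sub1", "sub2")]
print(asap(nodes, edges, {n: 1 for n in nodes}))
# {'mul1': 0, 'mul2': 0, 'sub1': 1, 'sub2': 2}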
24
2.1 ALAP Example
(Figure: unconstrained ALAP schedule for the same latency L = 4. Time 0: two multiplications; time 1: two multiplications; time 2: one subtraction, two multiplications, one addition; time 3: one subtraction, one addition, one comparison.)
25
2.1 Mobility
(Figure: mobility of the nodes, obtained by comparing the ASAP and ALAP schedules. The multiplications and subtractions on the critical path have mobility 0; the remaining multiplication, addition and comparison have mobilities of 1 and 2.)
26
2.1 ALAP Algorithm

ALAP(G(V,E), d, L) {
    FOREACH (node vi without successor)
        s(vi) := L - di;
    REPEAT {
        choose a node vi whose successors are all planned;
        s(vi) := min{ s(vj) : (vi, vj) in E } - di;
    } UNTIL (all nodes vi are planned);
    RETURN s;
}
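The corresponding Python sketch for ALAP and the resulting mobility, reusing the asap() helper and the graph representation assumed in the previous sketch:

def alap(nodes, edges, delay, latency):
    """ALAP schedule: latest starting time of every node for a given latency L."""
    succs = {v: [w for (u, w) in edges if u == v] for v in nodes}
    start = {v: latency - delay[v] for v in nodes if not succs[v]}  # no successors
    while len(start) < len(nodes):
        for v in nodes:
            if v not in start and all(s in start for s in succs[v]):
                start[v] = min(start[s] for s in succs[v]) - delay[v]
    return start

def mobility(nodes, edges, delay, latency):
    """Mobility = ALAP starting time - ASAP starting time (0 on a critical path)."""
    a = asap(nodes, edges, delay)
    l = alap(nodes, edges, delay, latency)
    return {v: l[v] - a[v] for v in nodes}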
27
2.1 Constrained scheduling
Extended ASAP / ALAP:
  compute ASAP or ALAP,
  then assign the tasks earlier (ALAP) or later (ASAP) until the resource constraints are fulfilled.
List scheduling:
  a list L of ready-to-run tasks is created,
  tasks are placed in L in decreasing priority order,
  at a given step, a free resource is assigned the task with the highest priority.
  The priority criterion can be: number of successors, mobility, connectivity, etc.
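A hedged Python sketch of resource-constrained list scheduling, using the number of successors as priority and a fixed number of functional-unit instances per operator type; the names and the resource model are illustrative assumptions:

def list_schedule(nodes, edges, op, delay, instances):
    """nodes: ids; edges: (pred, succ); op: node -> resource type;
    delay: node -> latency; instances: resource type -> available count."""
    preds = {v: [u for (u, w) in edges if w == v] for v in nodes}
    succs = {v: [w for (u, w) in edges if u == v] for v in nodes}
    priority = {v: len(succs[v]) for v in nodes}   # more successors = higher priority
    start, finish, t = {}, {}, 0
    while len(start) < len(nodes):
        # nodes whose predecessors have all finished by time t
        ready = [v for v in nodes if v not in start
                 and all(p in finish and finish[p] <= t for p in preds[v])]
        ready.sort(key=lambda v: -priority[v])
        busy = {r: sum(1 for v in start if op[v] == r and finish[v] > t)
                for r in instances}
        for v in ready:
            if busy[op[v]] < instances[op[v]]:     # a free instance exists
                start[v], finish[v] = t, t + delay[v]
                busy[op[v]] += 1
        t += 1
    return start

# Example: two multiplications feeding an addition, with 1 multiplier and 1 ALU.
nodes = ["m1", "m2", "a1"]
edges = [("m1", "a1"), ("m2", "a1")]
print(list_schedule(nodes, edges,
                    op={"m1": "mult", "m2": "mult", "a1": "alu"},
                    delay={n: 1 for n in nodes},
                    instances={"mult": 1, "alu": 1}))
# {'m1': 0, 'm2': 1, 'a1': 2}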
28
2.1 Extended ASAP, ALAP
Resources: 2 multipliers, 2 ALUs (+, -, <).
(Figure: resource-constrained schedule. Time 0: two multiplications; time 1: two multiplications, one addition; time 2: one subtraction, two multiplications, one comparison; time 3: one subtraction, one addition; latency 4.)
29
2.1 Constrained scheduling
Criterion: number of successors.
Resources: 1 multiplier, 1 ALU (+, -, <).
(Figure: the DFG annotated with the priority, i.e. the number of successors, of each node.)
30
2.1 Constrained scheduling
(Figure: resulting list schedule with 1 multiplier and 1 ALU. Time 0: *, +; time 1: *, <; time 2: *; time 3: *, -; time 4: *; time 5: *, -; time 6: +; latency 7.)
31
2.1 Temporal partitioning vs constrained scheduling
List scheduling is used to keep an ordering between the nodes.
List-Scheduling (LS) based partitioning:
1. Construct a list L of all nodes, ordered by priority.
2. Create a new empty partition Pact.
  2.1 Remove a node from the list and place it in Pact.
  2.2 If size(Pact) <= size(R) and T(Pact) <= T(R), go to 2.1, else go to 2.3.
  2.3 If the list is empty, stop; else go to 2.
A hedged sketch of these steps is given below.
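A minimal Python sketch of the LS-based partitioning loop above, assuming only an area constraint size(R) and per-node areas; the timing constraint T(Pact) <= T(R) is omitted for brevity and the names are illustrative:

def ls_partition(ordered_nodes, area, device_area):
    """Greedy list-scheduling partitioning under an area budget.

    ordered_nodes: nodes already sorted by priority (e.g. topological order
    refined by number of successors); area: node -> area; device_area: size(R).
    """
    partitions, current, used = [], [], 0
    for v in ordered_nodes:
        if used + area[v] <= device_area:    # node still fits on the device
            current.append(v)
            used += area[v]
        else:                                # close Pact and open a new partition
            partitions.append(current)
            current, used = [v], area[v]
    if current:
        partitions.append(current)
    return partitions

# Example with the slide's sizes: FPGA = 250, mult = 100, add = sub = 20, comp = 10.
area = {"m1": 100, "m2": 100, "a1": 20, "s1": 20, "c1": 10, "m3": 100}
print(ls_partition(["m1", "m2", "a1", "s1", "c1", "m3"], area, 250))
# [['m1', 'm2', 'a1', 's1', 'c1'], ['m3']]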
32
2.1 Temporal partitioning vs constrained scheduling
Criterion: number of successors.
size(FPGA) = 250, size(mult) = 100, size(add) = size(sub) = 20, size(comp) = 10.
(Figure: the DFG annotated with the priority of each node.)
33
2.1 Temporal partitioning vs constrained scheduling
(Figure: a first partitioning P1, P2, P3 of the DFG.)
Connectivity: c(P1) = 1/6, c(P2) = 1/3, c(P3) = 2/6. Quality: 0.83.
34
2.1 Temporal partitioning vs constrained scheduling
(Figure: an alternative partitioning P1, P2, P3 of the DFG.)
Connectivity: c(P1) = 2/10, c(P2) = 2/3, c(P3) = 2/3. Quality: 1.2, i.e. the connectivity is better.
35
2.1 Temporal partitioning vs constrained scheduling
ASAP-based partitioning:
  process the nodes in topological order,
  assign a level number to each node,
  schedule the nodes according to their level number.
Drawback: "levelization", i.e. the assignment of nodes to partitions based only on the level number, increases the data exchange between partitions.
Advantages: fast (linear run-time); local optimization is possible.
(Figure: a DFG with its nodes grouped into levels 0 to 3.)
36
2.1 Temporal partitioning vs constrained scheduling
Local optimization by configuration switching.
If two consecutive partitions P1 and P2 share a common set of operators, then:
  implement the minimal set of operators needed for the two partitions,
  use signal multiplexing to switch from one partition to the next.
Drawback: more resources are needed to implement the signal switching.
Advantages: the reconfiguration time is reduced, and the device's operation is not interrupted.
37
2.1 Temporal partitioning vs constrained scheduling
(Figure: configurations 1 and 2 of an example DFG with operators Add, Sub and Mult and operands a...j; intermediate values are passed through the inter-configuration register.)
40
2.1 Temporal partitioning vs constrained scheduling
Improved list-scheduling algorithm:
1. Generate the list of nodes node_list.
2. Build a first partition P1.
3. while (!node_list.empty())
4.   Build a new partition P2.
5.   If union(P1, P2) fits on the device, then implement configuration switching with P1 and P2,
6.   else set P1 = P2 and go to 3.
7. Exit.
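A hedged sketch of the "union fits" test in Python, under the assumption (consistent with the configuration-switching slide) that shared operators are implemented only once, so the merged configuration needs, per operator type, the maximum of the instance counts of P1 and P2; the names are illustrative:

from collections import Counter

def fits_with_switching(p1, p2, op, unit_area, device_area):
    """True if P1 and P2 can share one configuration with signal switching."""
    c1 = Counter(op[v] for v in p1)
    c2 = Counter(op[v] for v in p2)
    # shared operator types are counted once, at the larger of the two demands
    needed = {t: max(c1[t], c2[t]) for t in set(c1) | set(c2)}
    return sum(unit_area[t] * n for t, n in needed.items()) <= device_area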
41
2.2 Temporal partitioning – ILP
With ILP (Integer Linear Programming), the temporal partitioning constraints are formulated as equations and inequalities, which are then solved using an ILP solver. The constraints usually considered are:
  the uniqueness constraint,
  the temporal order constraint,
  the memory constraint,
  the resource constraint,
  the latency constraint.
Notations:
42
2.2 Temporal partitioning – ILP
Unique assignment constraint: each task must be placed in exactly one partition.
Precedence constraint: for each edge (v_i, v_j) in the graph, v_i must be placed either in the same partition as v_j or in an earlier partition than the one in which v_j is placed.
Resource constraint: the sum of the resources needed to implement the modules of one partition must not exceed the total amount of available resources; this yields:
  a device area constraint,
  device terminal constraints.
A hedged formulation of the first constraints is sketched below.
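One common way to write the assignment, precedence and area constraints, assuming a binary variable y_{ik} that is 1 when task v_i is placed in partition P_k, a_i the area of v_i, and A(R) the device area; this formulation is an illustrative assumption consistent with the descriptions above, not copied from the slides:

% y_{ik} \in \{0,1\}: task v_i is placed in partition P_k

% Unique assignment: each task lies in exactly one partition
\sum_{k=1}^{n} y_{ik} = 1 \qquad \forall v_i \in V

% Precedence: for every edge (v_i, v_j) \in E, v_i is placed no later than v_j
\sum_{k=1}^{n} k \, y_{ik} \;\le\; \sum_{k=1}^{n} k \, y_{jk} \qquad \forall (v_i, v_j) \in E

% Device area: the modules of one partition must fit on the device
\sum_{v_i \in V} a_i \, y_{ik} \;\le\; A(R) \qquad \forall k \in \{1, \dots, n\}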
43
2.3 Temporal partitioning – Network-flow
Recursive bipartitioning: the goal at each step is to generate a unidirectional bipartition which minimizes the edge-cut size between the two partitions.
Network-flow methods are used to compute a bipartition with minimal edge-cut size.
Directly applying the max-flow min-cut theorem may lead to a non-unidirectional cut. Therefore, the original graph G is first transformed into a new graph G' in which each cut is unidirectional.
(Figures: unidirectional recursive bipartitioning; a bidirectional cut.)
44
2.3 Temporal partitioning – Network-flow
Two-terminal net transformation: replace an edge (v1, v2) with two edges, (v1, v2) with capacity 1 and (v2, v1) with infinite capacity.
Multi-terminal net transformation: for a multi-terminal net {v1, v2, ..., vn}:
  introduce a dummy node v with no weight and a bridging edge (v1, v) with capacity 1,
  introduce the edges (v, v2), ..., (v, vn), each of which is assigned capacity 1,
  introduce the edges (v2, v1), ..., (vn, v1), each of which is assigned infinite capacity.
Having computed a min-cut in the transformed graph G', a min-cut can be derived in G: each node of G' assigned to a partition has its counterpart in G assigned to the corresponding partition.
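A small Python sketch of the two-terminal net transformation, assuming the edges of G' are stored in a capacity dictionary; the representation and the INF constant are illustrative, and a standard max-flow routine (e.g. from networkx) could then be run on the result:

INF = float("inf")

def transform_two_terminal(edges):
    """Build the capacity map of G' from the DFG edges of G.

    Each DFG edge (v1, v2) becomes a forward edge of capacity 1 and a
    backward edge of infinite capacity, so that every finite min-cut
    in G' corresponds to a unidirectional cut in G.
    """
    cap = {}
    for (v1, v2) in edges:
        cap[(v1, v2)] = 1      # cutting this edge costs 1
        cap[(v2, v1)] = INF    # the reverse direction can never be cut
    return cap

print(transform_two_terminal([("a", "b"), ("b", "c")]))
# {('a', 'b'): 1, ('b', 'a'): inf, ('b', 'c'): 1, ('c', 'b'): inf}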