Reconfigurable Computing


Reconfigurable Computing
Dr. Christophe Bobda, CSCE Department, University of Arkansas

Chapter 4 High-Level Synthesis for Reconfigurable Devices

Agenda
1. Modeling
- Dataflow graphs
- Sequencing graphs
- Finite state machine with datapath (FSMD)
2. High-level synthesis and temporal partitioning
- The list-scheduling-based approach
- Exact methods: integer linear programming
- Network-flow-based temporal partitioning

1. Modeling
The success of a hardware platform (microprocessors, DSPs) depends on the ease of programming it; high-level descriptions increase the acceptance of a platform. Modeling is therefore a key aspect in the design of a system. The models used must be powerful enough to capture all the users' needs and to describe all system parameters, while remaining easy to understand and manipulate. Several powerful models exist (FSMs, StateCharts, DFGs, Petri nets, etc.). In this course we focus on three models: dataflow graphs, sequencing graphs, and the finite state machine with datapath (FSMD).

1. Definitions
Given a node v and its implementation as a rectangular module h_v in hardware:
- l(h_v) and w(h_v) denote the length and the height of h_v; a(h_v) = l(h_v) * w(h_v) denotes its area.
- The latency t(v) of v is the time it takes to compute the function of v using the module h_v.
- For a given edge e = (v_i, v_j), w(e) is the weight of e, i.e. the width of the bus connecting the two components h_vi and h_vj. The latency t(e) of e is the time needed to transmit data from h_vi to h_vj.
- Simplification: we write v both for a node and for its hardware implementation h_v.

1.1 Dataflow Graph
A means to describe a computing task in a streaming mode:
- Nodes represent operators, the inputs of a node are its operands, and a node's output can be used as input to other nodes, yielding data dependencies in the graph.
Given a set of tasks {T_1, ..., T_k}, a dataflow graph (DFG) is a directed acyclic graph G = (V, E), where V is the set of nodes representing the operators and E is the set of edges. An edge e = (v_i, v_j) is defined through the (data) dependency between task T_i and task T_j.
Example: the dataflow graph for the quadratic-root computation.
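As a sketch, a DFG can be represented as a node list plus an edge list; a topological ordering then makes the data dependencies explicit. The node names and edges below are illustrative, not the slide's quadratic-root graph.

```python
# Minimal DFG representation: a DAG given by nodes and directed edges.
from collections import defaultdict, deque

def topological_order(nodes, edges):
    """Return a topological ordering of the DAG, or raise on a cycle."""
    succ = defaultdict(list)
    indeg = {v: 0 for v in nodes}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    ready = deque(v for v in nodes if indeg[v] == 0)  # nodes with no predecessor
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if len(order) != len(nodes):
        raise ValueError("graph has a cycle, so it is not a valid DFG")
    return order

# Illustrative DFG: two multiplications feeding an add, which feeds a subtract
nodes = ["mul1", "mul2", "add", "sub"]
edges = [("mul1", "add"), ("mul2", "add"), ("add", "sub")]
print(topological_order(nodes, edges))  # ['mul1', 'mul2', 'add', 'sub']
```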

1.2 Sequencing graph
A hierarchical dataflow graph with two different types of nodes:
- Operation nodes, corresponding to the normal "task nodes" of a dataflow graph.
- Link nodes (or branching nodes), which point to another sequencing graph at a lower level of the hierarchy. Linking nodes evaluate conditional clauses and are placed at the tail of alternative paths corresponding to possible branches.
Loops can be modeled by using a branching node as the tail of two paths: one for the exit from the loop, the other for the return to the body of the loop.

1.2 Sequencing graph
Example: according to the condition that node BR evaluates, one of the two sub-sequencing graphs (1 or 2) is activated.
Loop description: the body of the loop is described in a single sub-sequencing graph; BR evaluates the loop-exit condition.

1.3 Finite State Machine with Datapath (FSMD)
An extension of the dataflow graph with a finite state machine. Formally, an FSMD is a 7-tuple (S, I, O, V, f, h, s_0) where:
- S is a set of states,
- I is a set of inputs,
- O is a set of outputs,
- V is a set of variables,
- f is a transition function that maps a tuple (state, input, variable) to a state,
- h is an action function that maps the current state to outputs and variables,
- s_0 is the initial state.
FSMD vs. FSM:
- An FSMD operates on arbitrarily complex data types.
- Transitions may include arithmetic operations.

1.3 Finite State Machine with Datapath (FSMD)
Modeling with an FSMD: a program is transformed into an FSMD by turning the statements of the program into FSMD states. The statements are first classified into three categories: assignment statements, branch statements, and loop statements.
For an assignment statement, a single state is created that executes the assignment action, and an arc is created connecting this state with the state corresponding to the next program statement.

1.3 Finite State Machine with Datapath (FSMD)
For a loop statement, a condition state C and a join state J, both with no action, are created. An arc labeled with the loop condition is added from C to the state corresponding to the first statement of the loop body; another arc, labeled with the complement of the loop condition, is added from C to the state corresponding to the first statement after the loop body. Finally, an edge is added from the state corresponding to the last statement of the loop body to the join state, and another edge from the join state back to the condition state C.

1.3 Finite State Machine with Datapath (FSMD)
For a branch statement, a condition state C and a join state J, both with no action, are created. An arc labeled with the first branch condition is added from C to the state of the first statement of the first branch. A second arc, labeled with the complement of the first condition ANDed with the second branch condition, is added from C to the state of the first statement of the second branch. This process is repeated until the last branch condition. The state corresponding to the last statement of each branch is then connected to the join state, and the join state is finally connected to the state corresponding to the first statement after the branch.
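As a sketch of these transformation rules, the hypothetical loop "i = 0; s = 0; while i < 4: s = s + i; i = i + 1" becomes two assignment states, a condition state C, two body states, and a join state J. The small interpreter below (all state names illustrative) executes the resulting FSMD.

```python
# Hand-translated FSMD for: i = 0; s = 0; while i < 4: s = s + i; i = i + 1
# A1, A2, B1, B2 are assignment states; C is the condition state (no action
# other than branching on the loop test); J is the join state (no action).

def run_fsmd(variables):
    state = "A1"
    while state != "DONE":
        if state == "A1":       # i = 0
            variables["i"] = 0; state = "A2"
        elif state == "A2":     # s = 0
            variables["s"] = 0; state = "C"
        elif state == "C":      # condition state: loop test vs. its complement
            state = "B1" if variables["i"] < 4 else "DONE"
        elif state == "B1":     # s = s + i
            variables["s"] += variables["i"]; state = "B2"
        elif state == "B2":     # i = i + 1
            variables["i"] += 1; state = "J"
        elif state == "J":      # join state: back to the condition state
            state = "C"
    return variables

v = run_fsmd({})
print(v["s"])  # 0 + 1 + 2 + 3 = 6
```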


2. High-Level Synthesis
Next: compilation (synthesis) of the model onto a set of hardware components, usually done in three steps:
- Allocation defines the number of resource types required by the design and, for each type, the number of instances.
- Binding maps each operation to an instance of a given resource.
- Scheduling sets the temporal assignment of resources to operators.
Fundamental difference in reconfigurable computing: resources can be created as needed on the device, so, unlike in common high-level synthesis, only binding and scheduling are important.

2. High-Level Synthesis
Example: the dataflow graph of two functions, under a "fixed resource" assumption:
- A multiplication needs 100 basic resources (e.g. 100 NAND gates or 100 LUTs); the adder and the subtractor need 50 LUTs each.
- Despite the availability of resources, the two subtractors cannot both be used in the first level.
- The adder cannot be used in the first step either: it depends on a subtractor that must execute first.
- Minimum execution time: 4 steps.

2. High-Level Synthesis
The same dataflow graph on a reconfigurable device:
- A multiplication needs 100 basic resources (e.g. 100 NAND gates or 100 LUTs); the adder and the subtractor need 50 LUTs each.
- Total available amount of resources: 200 LUTs.
- The two subtractors can both be assigned in the first step.
- Minimum execution time: 3 steps.

2. Temporal partitioning – Problem definition
Temporal partitioning: given a DFG G = (V, E) and a single reconfigurable device R, compute a partitioning of G such that each block fits on R and the blocks can be executed on the device one after the other.
A temporal partition can also be defined as an ordered partition of G under the constraints imposed by R. With the ordering relation imposed on the partition, we reduce the solution space to only those partitions which can be scheduled on the device for execution. Therefore, cycles are not allowed in the dataflow graph; otherwise the resulting partition may not be schedulable on the device.

2. Temporal partitioning
Goal: computation and scheduling of a configuration graph. A configuration graph is a graph in which:
- nodes are partitions (or bitstreams),
- edges reflect the precedence constraints in the given DFG.
Communication between configurations happens through inter-configuration registers, and the configuration sequence is controlled by the host processor. On reconfiguration, the register values are saved; after reconfiguration, the values are copied back into the registers.
(Figure: a configuration graph P1 to P5 with inter-configuration registers; the device's IO registers are mapped into the processor's address space.)

2. Temporal partitioning
Objectives:
- Minimize the number of interconnections between partitions. This is one of the most important goals, since it minimizes the amount of exchanged data and the amount of memory needed to temporarily store that data.
- Minimize the number of produced blocks.
- Minimize the overall computation delay.
Quality of the result measures how well an algorithm performs:
- Connectivity of a graph G = (V, E): con(G) = 2|E| / (|V|^2 - |V|)
- Quality of a partitioning P = {P1, ..., Pn}: the average connectivity over P.
High quality means the algorithm performs well; low quality means it performs poorly.
(Figure: a 10-node graph with connectivity 0.24, and two partitionings of it with quality 0.25 and 0.45.)
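A sketch of the connectivity measure and, under the stated definition (average connectivity of the induced subgraphs), the partitioning quality:

```python
# con(G) = 2|E| / (|V|^2 - |V|), as defined on the slide.
def connectivity(num_nodes, num_edges):
    return 2 * num_edges / (num_nodes ** 2 - num_nodes)

# Quality of a partitioning: average connectivity of the subgraphs induced
# by the partitions (following the slide's stated definition).
def quality(partitions, edges):
    cons = []
    for part in partitions:
        members = set(part)
        e = sum(1 for u, v in edges if u in members and v in members)
        cons.append(connectivity(len(part), e))
    return sum(cons) / len(cons)

# A 10-node graph with 11 edges has connectivity 2*11/90, about 0.24,
# matching the connectivity quoted for the slide's 10-node example.
print(round(connectivity(10, 11), 2))  # 0.24
```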

2. Temporal partitioning
Scheduling: given a DFG and an architecture, i.e. a set of processing elements, compute the starting time of each node on a given resource.
Temporal partitioning: given a DFG and a reconfigurable device, compute the starting time of each node on the device; the starting time of a node is the starting time of the partition to which it belongs.
Solution approaches:
- List scheduling
- Integer linear programming
- Network flow
- Spectral method

2.1 Unconstrained Scheduling
- ASAP (as soon as possible): defines the earliest starting time for each node in the DFG and computes a minimal latency.
- ALAP (as late as possible): defines the latest starting time for each node in the DFG according to a given latency.
- The mobility of a node is the difference between its ALAP starting time and its ASAP starting time. Mobility 0 means the node is on a critical path.

2.1 ASAP Example
Unconstrained scheduling with optimal latency L = 4:
Time 0: *, *, *, *, +
Time 1: *, *, +, <
Time 2: -
Time 3: -

2.1 ASAP Algorithm
ASAP(G(V,E), d) {
  FOREACH (v_i without predecessor)
    s(v_i) := 0;
  REPEAT {
    choose a node v_i whose predecessors are all planned;
    s(v_i) := max_{j:(v_j,v_i) in E} { s(v_j) + d_j };
  } UNTIL (all nodes v_i are planned);
  RETURN s;
}

2.1 ALAP Example
Unconstrained scheduling with optimal latency L = 4:
Time 0: *, *
Time 1: *, *
Time 2: -, *, *, +
Time 3: -, +, <

2.1 Mobility
(Figure: the ASAP and ALAP schedules side by side; nodes off the critical path have mobility 1 or 2, all others mobility 0.)

2.1 ALAP Algorithm
ALAP(G(V,E), d, L) {
  FOREACH (v_i without successor)
    s(v_i) := L - d_i;
  REPEAT {
    choose a node v_i whose successors are all planned;
    s(v_i) := min_{j:(v_i,v_j) in E} { s(v_j) } - d_i;
  } UNTIL (all nodes v_i are planned);
  RETURN s;
}
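The ASAP and ALAP pseudocode above, together with the mobility computation, can be sketched in Python; the small DFG and unit delays below are illustrative assumptions.

```python
# ASAP/ALAP scheduling of a DAG. Edges (u, v) mean v depends on u;
# d[v] is the delay (latency) of node v; L is the given overall latency.

def asap(nodes, edges, d):
    pred = {v: [u for u, w in edges if w == v] for v in nodes}
    s = {}
    while len(s) < len(nodes):
        for v in nodes:
            if v not in s and all(p in s for p in pred[v]):
                s[v] = max((s[p] + d[p] for p in pred[v]), default=0)
    return s

def alap(nodes, edges, d, L):
    succ = {v: [w for u, w in edges if u == v] for v in nodes}
    s = {}
    while len(s) < len(nodes):
        for v in nodes:
            if v not in s and all(w in s for w in succ[v]):
                s[v] = min((s[w] for w in succ[v]), default=L) - d[v]
    return s

# Illustrative DFG: a, b feed c; c and e feed d. Unit delays, latency L = 3.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "c"), ("b", "c"), ("c", "d"), ("e", "d")]
d = {v: 1 for v in nodes}
sa, sl = asap(nodes, edges, d), alap(nodes, edges, d, L=3)
mobility = {v: sl[v] - sa[v] for v in nodes}
print(mobility)  # e is off the critical path: mobility 1; all others 0
```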

2.1 Constrained scheduling
Extended ASAP/ALAP:
- Compute ASAP or ALAP.
- Shift tasks earlier (ALAP) or later (ASAP) until the resource constraints are fulfilled.
List scheduling:
- A list L of ready-to-run tasks is created.
- Tasks are placed in L in decreasing priority order.
- At each step, a free resource is assigned the task with the highest priority. Priority criteria can be: number of successors, mobility, connectivity, etc.

2.1 Extended ASAP, ALAP
Resources: 2 multipliers, 2 ALUs (+, -, <).
(Figure: the example DFG rescheduled under this resource constraint.)

2.1 Constrained scheduling
Criterion: number of successors. Resources: 1 multiplier, 1 ALU (+, -, <).
(Figure: the example DFG with each node's number of successors annotated as its priority.)

2.1 Constrained scheduling
Time 0: *, +
Time 1: *, <
Time 2: *
Time 3: *, -
Time 4: *
Time 5: -, *
Time 6: +
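A minimal resource-constrained list scheduler in the spirit of these slides, using the number of successors as priority; unit delays and the small DFG below are illustrative assumptions, not the slides' example.

```python
# Resource-constrained list scheduling: at each time step, ready nodes are
# issued to free resource instances in decreasing priority order.

def list_schedule(nodes, edges, res_type, res_count):
    succ = {v: [w for u, w in edges if u == v] for v in nodes}
    pred = {v: [u for u, w in edges if w == v] for v in nodes}
    prio = {v: len(succ[v]) for v in nodes}           # priority: #successors
    start, done, t = {}, set(), 0
    while len(done) < len(nodes):
        ready = [v for v in nodes if v not in start
                 and all(p in done for p in pred[v])]
        ready.sort(key=lambda v: -prio[v])            # highest priority first
        used = {}
        for v in ready:
            r = res_type[v]
            if used.get(r, 0) < res_count[r]:         # a free instance of r?
                start[v] = t
                used[r] = used.get(r, 0) + 1
        done |= {v for v in start if start[v] == t}   # unit delay assumed
        t += 1
    return start

# Three multiplications and one add, with only 1 multiplier and 1 ALU.
nodes = ["m1", "m2", "m3", "a1"]
edges = [("m1", "a1"), ("m2", "a1")]
res_type = {"m1": "mult", "m2": "mult", "m3": "mult", "a1": "alu"}
print(list_schedule(nodes, edges, res_type, {"mult": 1, "alu": 1}))
```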

2.1 Temporal partitioning vs. constrained scheduling
List scheduling is used to keep an ordering between the nodes.
List-scheduling (LS) partitioning:
1. Construct a list L of all nodes, ordered by priority.
2. Create a new empty partition Pact.
3. Remove the next node from the list and place it in Pact.
4. If size(Pact) <= size(R) and T(Pact) <= T(R), go to 3; otherwise remove the last node from Pact again and go to 5.
5. If the list is empty, stop; otherwise go to 2.
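The LS-based partitioning steps above can be sketched as follows, assuming the node list is already in a precedence-respecting priority order and using only the area constraint size(Pact) <= size(R):

```python
# List-scheduling-based temporal partitioning: pack nodes, in priority order,
# into the current partition until the device area would be exceeded.

def ls_partition(order, size, device_size):
    partitions, current, used = [], [], 0
    for v in order:                       # 'order' must respect precedence
        if used + size[v] > device_size:  # node does not fit: close partition
            partitions.append(current)
            current, used = [], 0
        current.append(v)
        used += size[v]
    if current:
        partitions.append(current)
    return partitions

# Sizes from the slides: device 250, mult 100, add = sub = 20, comp 10.
order = ["m1", "m2", "m3", "a1", "s1", "c1"]
size = {"m1": 100, "m2": 100, "m3": 100, "a1": 20, "s1": 20, "c1": 10}
print(ls_partition(order, size, 250))
```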

2.1 Temporal partitioning vs. constrained scheduling
Criterion: number of successors.
Sizes: size(FPGA) = 250, size(mult) = 100, size(add) = size(sub) = 20, size(comp) = 10.
(Figure: the example DFG with node priorities.)

2.1 Temporal partitioning vs. constrained scheduling
(Figure: a partitioning of the example DFG into P1, P2, P3.)
Connectivity: c(P1) = 1/6, c(P2) = 1/3, c(P3) = 2/6. Quality: 0.83.

2.1 Temporal partitioning vs. constrained scheduling
(Figure: an alternative partitioning of the example DFG into P1, P2, P3.)
Connectivity: c(P1) = 2/10, c(P2) = 2/3, c(P3) = 2/3. Quality: 1.2. The connectivity is better.

2.1 Temporal partitioning vs. constrained scheduling
ASAP-based partitioning:
- Process the nodes in topological order, assigning a level number to each node.
- Schedule the nodes according to their level number.
Drawback: "levelization", i.e. assigning nodes to partitions based on level number, increases the data exchange between partitions.
Advantages: fast (linear run time); local optimization is possible.
(Figure: a DFG with nodes assigned to levels 0 through 3.)

2.1 Temporal partitioning vs. constrained scheduling
Local optimization by configuration switching: if two consecutive partitions P1 and P2 share a common set of operators, then:
- implement the minimal set of operators needed for the two partitions, and
- use signal multiplexing to switch from one partition to the next.
Drawback: more resources are needed to implement the signal switching.
Advantages: the reconfiguration time is reduced, and the device's operation is not interrupted.

2.1 Temporal partitioning vs. constrained scheduling
(Figure: two configurations sharing Add, Sub, and Mult operators; the values a through j are exchanged through inter-configuration registers when switching from Configuration 1 to Configuration 2.)


2.1 Temporal partitioning vs. constrained scheduling
Improved list-scheduling algorithm:
1. Generate the list of nodes node_list.
2. Build a first partition P1.
3. While node_list is not empty:
- build a new partition P2;
- if union(P1, P2) fits on the device, implement configuration switching with P1 and P2;
- otherwise set P1 = P2 and repeat step 3.
4. Exit.

2.2 Temporal partitioning – ILP
With integer linear programming (ILP), the temporal partitioning constraints are formulated as equations, which are then solved by an ILP solver. The constraints usually considered are:
- Uniqueness constraint
- Temporal order constraint
- Memory constraint
- Resource constraint
- Latency constraint

2.2 Temporal partitioning – ILP
With y_ik in {0, 1} denoting that task v_i is placed in partition P_k:
- Unique assignment constraint: each task must be placed in exactly one partition: for all i, sum_k y_ik = 1.
- Precedence constraint: for each edge (v_i, v_j) in the graph, v_i must be placed either in the same partition as v_j or in an earlier one: sum_k k * y_ik <= sum_k k * y_jk.
- Resource constraint: the resources needed to implement the modules of one partition must not exceed the total amount of available resources.
- Device area constraint: for all k, sum_i a(v_i) * y_ik <= a(R).
- Device terminal constraint: the signals crossing any partition boundary must not exceed the number of device terminals.
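The unique-assignment, precedence, and area constraints can be checked explicitly. The brute-force search below (usable on tiny instances only) stands in for a real ILP solver; y[v] encodes the index of the partition task v is assigned to, which makes the uniqueness constraint implicit.

```python
# Brute-force feasibility check of the ILP constraints above.
from itertools import product

def feasible(y, nodes, edges, area, device_area, num_parts):
    # Precedence: a task may not be placed later than any of its successors.
    if any(y[u] > y[v] for u, v in edges):
        return False
    # Device area: each partition must fit on the device.
    for k in range(num_parts):
        if sum(area[v] for v in nodes if y[v] == k) > device_area:
            return False
    return True

def min_partitions(nodes, edges, area, device_area):
    """Smallest number of partitions admitting a feasible assignment."""
    for n in range(1, len(nodes) + 1):
        for assign in product(range(n), repeat=len(nodes)):
            y = dict(zip(nodes, assign))
            if feasible(y, nodes, edges, area, device_area, n):
                return n
    return None

# Two 100-unit tasks feeding a 50-unit task, on a 150-unit device:
# a and b cannot share a partition, so two partitions are needed.
print(min_partitions(["a", "b", "c"], [("a", "c"), ("b", "c")],
                     {"a": 100, "b": 100, "c": 50}, 150))  # 2
```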

2.3 Temporal partitioning – Network flow
Recursive bipartitioning: the goal at each step is to compute a unidirectional bipartition which minimizes the edge-cut size between the two blocks. Network-flow methods are used to compute a bipartition with minimal edge-cut size. Directly applying the min-cut max-flow theorem may lead to a non-unidirectional cut; therefore the original graph G is first transformed into a new graph G' in which every cut is unidirectional.
(Figure: unidirectional recursive bipartitioning vs. a bidirectional cut.)

2.3 Temporal partitioning – Network flow
Two-terminal net transformation: replace an edge (v1, v2) with two edges, (v1, v2) with capacity 1 and (v2, v1) with infinite capacity.
Multi-terminal net transformation: for a multi-terminal net {v1, v2, ..., vn}:
- introduce a dummy node v with no weight and a bridging edge (v1, v) with capacity 1;
- introduce the edges (v, v2), ..., (v, vn), each with capacity 1;
- introduce the edges (v2, v1), ..., (vn, v1), each with infinite capacity.
Having computed a min-cut in the transformed graph G', a min-cut can be derived in G: for each node of G' assigned to a partition, its counterpart in G is assigned to the corresponding partition.
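The min-cut computation underlying this method rests on max-flow. Below is a minimal Edmonds-Karp sketch applied to an illustrative two-terminal-transformed chain v1 -> v2 -> v3: each forward edge gets capacity 1 and each reverse edge infinite capacity, so the min cut (= max flow) between v1 and v3 is 1.

```python
# Minimal Edmonds-Karp max-flow on a capacity dict {(u, v): capacity}.
from collections import deque

def max_flow(edges, s, t):
    r = dict(edges)                       # residual capacities
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
        r.setdefault((v, u), 0)           # reverse residual edge
    total = 0
    while True:
        # BFS for an augmenting path in the residual graph
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v in adj.get(u, ()):
                if v not in parent and r[(u, v)] > 0:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return total                  # no augmenting path: flow is maximal
        # augment along the path by its bottleneck capacity
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        b = min(r[e] for e in path)
        for u, v in path:
            r[(u, v)] -= b
            r[(v, u)] += b
        total += b

INF = float("inf")
g = {("v1", "v2"): 1, ("v2", "v1"): INF,
     ("v2", "v3"): 1, ("v3", "v2"): INF}
print(max_flow(g, "v1", "v3"))  # 1
```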