
High-Level Synthesis for Reconfigurable Systems

2 Agenda
Modeling
 1. Dataflow graphs
 2. Sequencing graphs
 3. Finite State Machine with Datapath
High-level synthesis and temporal partitioning
 1. List-scheduling-based approach
 2. Exact methods: integer linear programming
 3. Network-flow-based temporal partitioning

3 High-Level Synthesis
New computing platforms:
 •Their success depends on the ease of programming.
  −Microprocessors
  −DSPs
Parallel computing:
 •Why is it not as successful as sequential programming?
 •A parallel implementation of a given application requires two steps:
  1. The application must be partitioned to identify the dependencies among its parts (i.e., to determine which parts can execute in parallel).
   −Requires good knowledge of the structure of the application.
  2. Mapping must allocate the independent blocks of the application to the computing hardware resources.
   −Requires a good understanding of the underlying hardware structure.
Reconfigurable computing:
 •Faces a similar problem:
  −Code must be written in an HDL.
  −It is not easy to decouple the implementation process from the hardware.
 •High-level descriptions increase the acceptance of a platform.

4 Modeling
High-level descriptions:
 •Modeling is a key aspect in the design of a system.
 •The models used must be powerful enough to
  −capture all of the user's needs, and
  −be easy to understand and manipulate.
Several powerful models exist:
 •FSMs, StateCharts, DFGs, Petri nets, ...
Focus in this course:
 •Dataflow graphs,
 •Sequencing graphs,
 •Finite State Machine with Datapath (FSMD).

5 Dataflow Graph
DFG:
 •A means to describe a computing task in a streaming mode.
 •Operators: the nodes.
 •Operands: the inputs of the nodes.
 •A node's output can be used as input to other nodes
  −data dependencies in the graph.
Given a set of tasks {T_1, ..., T_k}, a DFG is a DAG G(V,E), where
 •V (= T) is the set of nodes representing operators, and
 •E is the set of edges representing data dependencies between tasks.
Example: the DFG for the quadratic-root computation (a code sketch of such a DFG follows below).
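For illustration, a DFG such as the quadratic-root example can be written down as a small adjacency structure. The sketch below (node names t1..t9 and the exact operator decomposition are illustrative assumptions, not taken from the slides) encodes x = (-b + sqrt(b·b − 4·a·c)) / (2·a) as a DAG of operators and data dependencies.

```python
# Minimal sketch of a DFG: each node is an operator; each edge a data dependency.
dfg = {
    # node id : (operator, inputs -- primary inputs or other node ids)
    "t1": ("*", ["b", "b"]),    # b*b
    "t2": ("*", ["4", "a"]),    # 4*a
    "t3": ("*", ["t2", "c"]),   # 4*a*c
    "t4": ("-", ["t1", "t3"]),  # b*b - 4*a*c
    "t5": ("sqrt", ["t4"]),
    "t6": ("neg", ["b"]),       # -b
    "t7": ("+", ["t6", "t5"]),  # -b + sqrt(...)
    "t8": ("*", ["2", "a"]),    # 2*a
    "t9": ("/", ["t7", "t8"]),  # one of the two roots
}

def edges(g):
    """Enumerate the data-dependency edges (producer, consumer)."""
    return [(src, node) for node, (_, ins) in g.items() for src in ins if src in g]

print(edges(dfg))
```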

6 Definitions
 •The latency t_i of a node v_i ∈ V:
  −the time it takes to compute the function of v_i using its hardware implementation H_vi.
 •The area a_i of v_i: a_i = l_i × h_i, where l_i and h_i are the length and height of the rectangular implementation H_vi.
 •The weight w_ij of an edge e_ij ∈ E:
  −the width of the bus connecting the two components H_vi and H_vj.
 •The latency t_ij of e_ij:
  −the time needed to transmit data from H_vi to H_vj.
 •Simplification: we will simply write v_i both for a node and for its hardware implementation H_vi.
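A minimal sketch (class and field names are assumptions for illustration, not from the slides) of how these per-node and per-edge attributes could be captured in code:

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    latency: float   # t_i: time to compute the function of v_i on H_vi
    length: float    # l_i of the rectangular implementation H_vi
    height: float    # h_i of H_vi

    @property
    def area(self) -> float:
        return self.length * self.height   # a_i = l_i * h_i

@dataclass
class Edge:
    src: str         # v_i
    dst: str         # v_j
    width: int       # w_ij: width of the bus between H_vi and H_vj
    latency: float   # t_ij: time to transmit data from H_vi to H_vj
```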

7 DFG
 •Any high-level program can be compiled into a DFG, provided that it contains
  −no loops and
  −no branching instructions.
Therefore:
 •A DFG is used for parts of a program.
 •Extend to other models:
  −sequencing graphs,
  −FSMD,
  −CDFG.

8 Sequencing Graph
Sequencing graph:
 •A hierarchical DFG with two different types of nodes:
  1. Operation nodes: the normal "task nodes" of a DFG.
  2. Link (or branching) nodes: point to another sequencing graph at a lower level of the hierarchy.
Properties:
 •Linking nodes evaluate conditional clauses.
 •Placed at the tail of alternative paths, they correspond to possible branches.

9 Sequencing Graph
Example:
 •Depending on the conditions that node BR evaluates, one of the two sub-sequencing graphs (1 or 2) is activated.
 •Loop description:
  −the body of the loop is described in only one sub-sequencing graph: BR evaluates the loop exit condition.

10 Finite State Machine with Datapath (FSMD)
FSMD:
 •An extension of the DFG with an FSM.
 •An FSMD is a 6-tuple <S, I, O, F, H, s_0>:
  −S = {s_0, ..., s_l}: the states,
  −I = {i_0, ..., i_m}: the inputs,
  −O = {o_0, ..., o_n}: the outputs,
  −F: S × I × O → S: a transition function that maps a tuple (s_i, i_j, o_k) to a state,
  −H: S → O: an action function that maps the current state to an output,
  −s_0: the initial state.
FSMD vs. FSM:
 •An FSMD operates on arbitrarily complex data types
  −not just Boolean variables.
 •A transition may include arithmetic operations.
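To make the contrast with a plain FSM concrete, the sketch below (an illustrative accumulator, not an example from the slides) shows an FSMD whose transitions test and update arbitrary data values rather than single bits.

```python
# Illustrative FSMD: a RUN/DONE controller with an accumulator register in the
# datapath. Transitions depend on data comparisons; actions perform arithmetic.
def make_accumulator_fsmd():
    regs = {"acc": 0}                      # datapath register

    def action(state, x):                  # H-like: state-dependent datapath action
        if state == "RUN" and x >= 0:
            regs["acc"] += x               # arithmetic, not just Boolean logic
        return regs["acc"]

    def transition(state, x):              # F-like: next-state function
        return "DONE" if (state == "RUN" and x < 0) else state

    state = "RUN"
    def step(x):
        nonlocal state
        out = action(state, x)
        state = transition(state, x)
        return state, out
    return step

step = make_accumulator_fsmd()
for x in [3, 5, 2, -1, 7]:
    print(step(x))   # accumulates 3, 8, 10, then enters DONE and stops updating
```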

11 FSMD
Modeling with FSMD:
 •A program is transformed into an FSMD by
  −transforming the statements of the program into FSMD states.
 •The statements are first classified into three categories:
  1. assignment statements,
  2. branch statements, and
  3. loop statements.
1. Assignment statement:
 •A single state is created that executes the assignment action.
 •An arc is created connecting this state with the state of the next statement.

12 FSMD: Loop
2. Loop statement:
 •A condition state C and a join state J, both with no action, are created.
 •An arc is added:
  −label: the loop condition,
  −connects C to the state of the first statement in the loop body.
 •Another arc is added:
  −label: the complement of the loop condition,
  −connects C to the state of the first statement after the loop body.
 •An edge is added connecting the state of the last statement in the loop body to the join state.
 •Another edge is added connecting the join state back to the condition state.

13 FSMD: Branch
3. Branch statement:
 •A condition state C and a join state J, both with no action, are created.
 •An arc is added:
  −label: the first branch condition,
  −connects C to the state of the first statement of the first branch.
 •Another arc is added:
  −label: the complement of the first condition ANDed with the second branch condition,
  −connects C to the state of the first statement of the second branch.
 •... and so on for the remaining branches.
 •Each state corresponding to the last statement of a branch is connected to the join state.
 •The join state is connected to the state corresponding to the first statement after the branch.

14 FSMD: Example

15 FSMD: Datapath Synthesis
Datapath synthesis steps:
 1. A register for each variable.
 2. An FU for each arithmetic operation.
 3. Connect FU ports to registers.
 4. Create a unique identifier for each control input/output of the FUs.
 5. (Resource sharing via MUXes.)

16 High-Level Synthesis
High-level synthesis (architectural synthesis):
 •Transforming an abstract model of circuit behavior into a datapath and a control unit.
Steps:
 1. Allocation: defines the resource types required by the design and, for each type, the number of instances.
 2. Binding: maps each operation to an instance of a given resource.
 3. Scheduling: sets the temporal assignment of resources to operators.
  −Decides which operator owns a resource at a given time.

17 Allocation
Allocation (formal definition):
 •For a given specification with a set of operators or tasks T = {t_1, t_2, ..., t_n} to be implemented on a set of resource types R = {r_1, r_2, ..., r_t}, allocation is a function α : R → Z+, where α(r_i) = z_i is the number of available instances of resource type r_i.

18 Allocation
Example:
 •T = {*, *, *, -, -, *, *, *, +, +, <}
 •R = {ALU, MUL} = {1, 2}
  −ALU: add, sub, compare
 •α(1) = 5, α(2) = 6
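A small sketch of this allocation in code (the helper names and the operator-to-resource table are illustrative assumptions):

```python
from collections import Counter

ALU, MUL = 1, 2                      # resource types from the example
executes_on = {"+": ALU, "-": ALU, "<": ALU, "*": MUL}

tasks = ["*", "*", "*", "-", "-", "*", "*", "*", "+", "+", "<"]

alpha = {ALU: 5, MUL: 6}             # alpha(1) = 5, alpha(2) = 6

# Sanity check: this allocation provides one instance per operator of each type,
# i.e. enough hardware even if all operators of a type ran concurrently.
demand = Counter(executes_on[t] for t in tasks)
print(all(demand[r] <= alpha[r] for r in alpha))   # True
```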

19 Allocation
Another definition:
 •α : T → Z+,
  −operators in the specification are mapped to resource types.
 •Example: α(t_1) = 2, α(t_2) = 2, ..., α(t_5) = 1, ..., α(t_11) = 1.
 •More than one resource type may exist for an operator:
  −{CLA, RipAdd, BoothMul, SerMul, ...}
  −⇒ resource selection is needed (α is one-to-many).

20 Binding
Binding (formal definition):
 •For T = {t_1, t_2, ..., t_n} and R = {r_1, r_2, ..., r_t}, binding is a function β : T → R × Z+, where β(t_i) = (r_i, b_i), with 1 ≤ b_i ≤ α(r_i), gives the instance b_i of resource type r_i onto which t_i is mapped.

21 Binding
Example:
 •T = {*, *, *, -, -, *, *, *, +, +, <}
 •R = {ALU, MUL} = {1, 2}
  −ALU: add, sub, compare
 •β(t_1) = (2,1), β(t_2) = (2,2), β(t_3) = (2,3), β(t_4) = (1,1), β(t_5) = (1,2), β(t_6) = (2,4), β(t_7) = (2,5), β(t_8) = (2,6), ..., β(t_11) = (1,5)
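A sketch (helper names are illustrative) that reproduces the binding listed above by assigning each task to the next free instance of its resource type, deliberately without any sharing over time:

```python
ALU, MUL = 1, 2
alpha = {ALU: 5, MUL: 6}
executes_on = {"+": ALU, "-": ALU, "<": ALU, "*": MUL}
tasks = ["*", "*", "*", "-", "-", "*", "*", "*", "+", "+", "<"]

def naive_binding(tasks):
    """Bind each task t_i to (resource type, next free instance of that type)."""
    next_free = {r: 1 for r in alpha}
    beta = {}
    for i, t in enumerate(tasks, start=1):
        r = executes_on[t]
        assert next_free[r] <= alpha[r], "allocation too small"
        beta[f"t{i}"] = (r, next_free[r])
        next_free[r] += 1
    return beta

print(naive_binding(tasks))   # {'t1': (2, 1), 't2': (2, 2), ..., 't11': (1, 5)}
```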

22 Scheduling
Scheduling (formal definition):
 •For T = {t_1, t_2, ..., t_n}, scheduling is a function ς : T → Z+, where ς(t_i) denotes the starting time of task t_i.

23 High-Level Synthesis
Fundamental differences in RCS:
 1. Uniform resources:
  −It is possible to implement any task on a given part of the device (provided that the available resources are sufficient).

24 General vs. RCS High-Level Synthesis
Example: assumptions on a "fixed-resource" device:
 •A multiplication needs 100 basic resource units.
 •The adder and the subtractor need 50 units each.
 •Allocation selects one instance of each resource type.
  −⇒ Two subtractors cannot be used in the first step.
 •The adder cannot be used in the first step
  −due to data dependencies.
 •Minimum execution time: 4 steps.

25 General vs. RCS High-Level Synthesis
Assumptions on a reconfigurable device:
 •A multiplication needs 100 LUTs.
 •An adder/subtractor needs 50 LUTs.
 •Total available amount of resources: 200 LUTs.
 •The two subtractors can be assigned in the first step.
 •Minimum execution time: 3 steps.

26 High-Level Synthesis
Fundamental differences in RCS:
 2. In general HLS:
  •The application is specified using a structure that encapsulates a datapath and a control part.
  •The synthesis process allocates the resources to operators at different times according to a computed schedule.
  •The control part is then synthesized.
 In RCS:
  •Hardware modules, implemented as datapaths, normally compete for execution on the chip.
  •A processor controls the selection of hardware modules by means of reconfiguration.
  •The same processor is also in charge of activating the individual resources in the corresponding hardware accelerators.

27 Temporal Partitioning
 •Resources on the device are not allocated to a single operator, but to a set of operators that must be placed on, and later removed from, the device at the same time.
 •An application must therefore be partitioned into sets of operators.
 •The partitions are then implemented one after the other, at different times, on the device.

28 Configuration
Configuration:
 •Given a reconfigurable processing unit H and
 •a set of tasks T = {t_1, ..., t_n} available as cores C = {c_1, ..., c_n},
 •we define the configuration ζ_i of the RPU at time s_i to be the set of cores ζ_i = {c_i1, ..., c_ik} ⊆ C running on H at time s_i.
A core (module) c_i for each t_i in the library:
 •hard / soft / firm module.

29 Firm vs. Hard Modules
(Figures: placement of firm modules vs. placement of hard modules.)

30 Schedule
Schedule:
 •A schedule is a function ς : V → Z+, where ς(v_i) denotes the starting time of the node v_i that implements a task t_i.
Feasible schedule:
 •ς is feasible if, ∀ e_ij = (v_i, v_j) ∈ E: ς(t_j) ≥ ς(t_i) + T(t_i) + t_ij, where
  −e_ij defines a data dependency between tasks t_i and t_j,
  −t_ij is the latency of the edge e_ij,
  −T(t_i) is the time it takes the node v_i to complete execution.
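The feasibility condition translates directly into a small check; the function below is an illustrative helper (names and data layout are assumptions), not code from the slides:

```python
def is_feasible(start, exec_time, edges):
    """start:     node -> starting time (the function sigma)
       exec_time: node -> T(t_i), time to complete execution
       edges:     list of (v_i, v_j, t_ij) with edge latency t_ij"""
    return all(start[vj] >= start[vi] + exec_time[vi] + tij
               for vi, vj, tij in edges)

# tiny example
edges = [("a", "c", 0), ("b", "c", 1)]
exec_time = {"a": 2, "b": 1, "c": 3}
print(is_feasible({"a": 0, "b": 0, "c": 2}, exec_time, edges))  # True
print(is_feasible({"a": 0, "b": 0, "c": 1}, exec_time, edges))  # False: c starts too early
```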

31 Ordering Relation
Ordering relation ≤:
 •v_i ≤ v_j ⇔ for every schedule ς, ς(v_i) ≤ ς(v_j).
  −≤ is a partial ordering, as it is not defined for all pairs of nodes in G.

32 Partition
Partition:
 •A partition P of the graph G = (V,E) is its division into disjoint subsets P_1, ..., P_m such that ∪_{k=1,...,m} P_k = V.
Feasible partition:
 •A partition is feasible with respect to a reconfigurable device H with area a(H) and pin count p(H) if:
  −∀ P_k ∈ P: a(P_k) = ∑_{v_i ∈ P_k} a_i ≤ a(H), and
  −(1/2) ∑ w_ij ≤ p(H), where the sum is over the crossing edges e_ij.
Crossing edge:
 •an edge that connects a component inside a partition with a component outside that partition.
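Both constraints (area per partition, pins for crossing edges) can be checked mechanically; the sketch below is illustrative (the data layout is an assumption):

```python
def partition_feasible(partitions, area, width, edges, a_H, p_H):
    """partitions: list of sets of nodes (P_1 .. P_m)
       area:       node -> a_i
       width:      (v_i, v_j) -> w_ij
       edges:      list of (v_i, v_j)"""
    # area constraint: every partition must fit on the device
    if any(sum(area[v] for v in Pk) > a_H for Pk in partitions):
        return False
    # pin constraint: half the total width of the crossing edges must fit the pin count
    block_of = {v: k for k, Pk in enumerate(partitions) for v in Pk}
    crossing = [e for e in edges if block_of[e[0]] != block_of[e[1]]]
    return 0.5 * sum(width[e] for e in crossing) <= p_H

# tiny example: two partitions, one crossing edge of width 16
print(partition_feasible([{"a", "b"}, {"c"}],
                         {"a": 60, "b": 40, "c": 80},
                         {("b", "c"): 16},
                         [("a", "b"), ("b", "c")],
                         a_H=120, p_H=16))   # True
```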

33 Run Time
Run time r(P_i) of a partition:
 •the maximum time from the input of the data to the output of the result.

34 Ordering Relation
Ordering relation for partitions:
 •P_i ≤ P_j ⇔ ∀ v_i ∈ P_i, ∀ v_j ∈ P_j:
  −either v_i ≤ v_j,
  −or v_i and v_j are not in relation.
Ordered partitions:
 •A partitioning P is ordered ⇔ an ordering relation ≤ exists on P.
 •If P is ordered, then for any pair of partitions, one can always be implemented after the other with respect to any scheduling relation.

35 Temporal Partitioning
Temporal partitioning:
 •Given a DFG G = (V,E) and a reconfigurable device H, a temporal partitioning of G on H is an ordered partitioning P of G with respect to H.

36 Temporal Partitioning: Cycles
 •Cycles are not allowed in the DFG.
  −Otherwise, the resulting partitioning may not be schedulable on the device.

37 Configuration Graph
Goal:
 •Computation and scheduling of a configuration graph.
A configuration graph is a graph in which:
 •nodes are partitions (or bitstreams), and
 •edges reflect the precedence constraints in the given DFG.
Formal definition:
 •Given a DFG G = (V,E) and a temporal partitioning P = {P_1, ..., P_n} of G, the configuration graph of G relative to P, written Γ(G/P) = (P, E_P), is the graph whose nodes are the partitions in P. An edge e_P = (P_i, P_j) ∈ E_P ⇔ ∃ e = (v_i, v_j) ∈ E with v_i ∈ P_i and v_j ∈ P_j.
Configuration:
 •For a given partitioning P, each node P_i ∈ P has an associated configuration ζ_i, which is the implementation of P_i on the given device H.
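Deriving Γ(G/P) from the DFG and a partitioning is straightforward; the helper below is an illustrative sketch (function and variable names are assumptions):

```python
def configuration_graph(partitions, dfg_edges):
    """partitions: list of sets of DFG nodes; dfg_edges: list of (v_i, v_j).
       Returns the partition indices and the edges of Gamma(G/P)."""
    block_of = {v: k for k, Pk in enumerate(partitions) for v in Pk}
    cfg_edges = {(block_of[u], block_of[v])
                 for (u, v) in dfg_edges
                 if block_of[u] != block_of[v]}
    return list(range(len(partitions))), sorted(cfg_edges)

# tiny example: a and b feed c, which is placed in a later partition
print(configuration_graph([{"a", "b"}, {"c"}], [("a", "c"), ("b", "c")]))
# -> ([0, 1], [(0, 1)])
```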

38 Temporal Partitioning
 •Whenever a new partition is downloaded, the partition that was running is destroyed.
 •Communication is done through inter-configuration registers (or a communication memory), which
  −may sit in main memory, or
  −may sit at the boundary of the device to hold the input and output values.
 •The configuration sequence is controlled by the host processor.
(Figure: the device's registers mapped into the processor's address space via the processor bus.)

39 Temporal Partitioning
Steps (for P_i and P_j with P_i ≤ P_j), as sketched in the code below:
 1. The configuration for P_i is first downloaded into the device.
 2. P_i executes.
 3. P_i copies all the data it needs to send to other partitions into the communication memory.
 4. The device is reconfigured to implement partition P_j.
 5. P_j accesses the communication memory and collects the data.
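A schematic host-side control loop corresponding to these steps; every function here is an illustrative stub (there is no real reconfiguration driver behind it):

```python
comm_mem = {}                          # inter-configuration registers / memory

def download(partition):               # stub: would write the bitstream of P_i
    print("reconfigure device with", partition["name"])

def run(partition):                    # stub: read inputs, "compute", produce outputs
    inputs = [comm_mem[k] for k in partition["reads"]]
    return {k: sum(inputs) for k in partition["writes"]}   # dummy computation

def execute(partitions):
    """Run the partitions in an order compatible with the ordering <=."""
    for P in partitions:
        download(P)                    # step 1 (and step 4 for the next partition)
        outputs = run(P)               # steps 2 and 5: execute, reading comm_mem
        comm_mem.update(outputs)       # step 3: save data for later partitions

comm_mem.update({"x": 2, "y": 3})
execute([{"name": "P1", "reads": ["x", "y"], "writes": ["s"]},
         {"name": "P2", "reads": ["s"],      "writes": ["out"]}])
print(comm_mem)                        # {'x': 2, 'y': 3, 's': 5, 'out': 5}
```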

40 Temporal Partitioning
Objectives for optimization:
 1. Number of interconnections: very important, since it minimizes
  •the amount of exchanged data, and
  •the amount of memory for temporarily storing the data.
 2. Number of produced blocks (partitions):
  •reduces the number of reconfigurations (total time?).
 3. Overall computation delay, which depends on
  •the partition run times,
  •the processor used for reconfiguration, and
  •the speed of data exchange.
 4. Similarity between consecutive partitions (for partial reconfiguration).
 5. Overall amount of wasted resources on the chip:
  •when components with shorter run times are placed in the same partition as components with longer run times, the shorter-running components remain idle for a long period of time.

41 Wasted Resources
Wasted resource wr(v_i) of a node v_i:
 •the unused area occupied by the node v_i during the computation of a partition:
  wr(v_i) = (t(P_i) − T(t_i)) × a_i, where
  −t(P_i) is the run time of partition P_i,
  −T(t_i) is the run time of the component v_i,
  −a_i is the area of v_i.
Wasted resource wr(P_i) of a partition P_i = {v_i1, ..., v_in}:
  wr(P_i) = ∑_{j=1,...,n} wr(v_ij)
Wasted resource of a partitioning P = {P_1, ..., P_k}:
  wr(P) = ∑_{j=1,...,k} wr(P_j)
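These formulas transcribe directly into code; in the sketch below the run time t(P_i) is taken to be the longest component run time in the partition, which is an assumption rather than the slides' definition of partition run time:

```python
def partition_runtime(part, exec_time):
    # assumed here: t(P_i) = longest-running component of the partition
    return max(exec_time[v] for v in part)

def wasted_resources(partitions, exec_time, area):
    total = 0.0
    for Pk in partitions:
        t_Pk = partition_runtime(Pk, exec_time)
        # wr(v_i) = (t(P_i) - T(t_i)) * a_i, summed over the partition
        total += sum((t_Pk - exec_time[v]) * area[v] for v in Pk)
    return total

exec_time = {"a": 4, "b": 1, "c": 2}
area = {"a": 100, "b": 50, "c": 50}
print(wasted_resources([{"a", "b"}, {"c"}], exec_time, area))  # (4-1)*50 = 150.0
```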

42 Communication Overhead
Cost: modelled as graph connectivity.
Connectivity of a graph G = (V,E):
  con(G) = 2·|E| / (|V|² − |V|)
 •|V|² − |V| is the number of all edges that could be built over V.
(Figure example: connectivity = 0.24.)

43 Quality of a Partitioning
Quality of a partitioning P = {P_1, ..., P_n}:
 •the average connectivity over P:
  Q(P) = (1/n) ∑_{i=1,...,n} con(P_i)
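Both measures in code (an illustrative helper; edges inside each partition are counted from a global edge list, which is an assumption about the data layout):

```python
def connectivity(num_nodes, num_edges):
    # con(G) = 2*|E| / (|V|^2 - |V|)
    return 2 * num_edges / (num_nodes ** 2 - num_nodes)

def quality(partitions, edges):
    """partitions: list of sets of nodes; edges: list of (u, v) pairs."""
    block_of = {v: k for k, Pk in enumerate(partitions) for v in Pk}
    internal = [0] * len(partitions)
    for u, v in edges:
        if block_of[u] == block_of[v]:           # only edges inside a partition count
            internal[block_of[u]] += 1
    cons = [connectivity(len(Pk), m) if len(Pk) > 1 else 0.0
            for Pk, m in zip(partitions, internal)]
    return sum(cons) / len(partitions)           # Q(P) = average connectivity

print(connectivity(5, 6))                                        # 0.6
print(quality([{"a", "b", "c"}, {"d", "e"}],
              [("a", "b"), ("b", "c"), ("d", "e")]))             # ~0.83
```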

44 Communication Overhead
Minimizing the communication overhead by
 •minimizing the weighted sum of crossing edges among the partitions:
  −⇒ minimizes the size of the communication memory, and
  −⇒ minimizes the communication time.
Heuristic:
 •Place highly connected components in the same partition (high-quality partitioning).

45 References
 •[Bobda07] Christophe Bobda, "Introduction to Reconfigurable Computing: Architectures, Algorithms and Applications," Springer, 2007.