Synchronous Data Flow. Presenters: Zohair Hyder, Gang Zhou.
Synchronous Data Flow, E. A. Lee and D. G. Messerschmitt, Proc. of the IEEE, September 1987.
Joint Minimization of Code and Data for Synchronous Dataflow Programs, P. K. Murthy, S. S. Bhattacharyya, and E. A. Lee, Memorandum No. UCB/ERL M94/93, Electronics Research Laboratory, University of California at Berkeley.

Motivation
Imperative programs do not often expose the concurrency present in the algorithm.
The data flow model is a programming methodology that naturally breaks a task into subtasks, allowing more efficient use of concurrent resources.

Data flow principle
Any node can fire whenever its input data are available; a node with no input arcs may fire at any time.
Implication: concurrency, data-driven execution.
Nodes are free of side effects (nodes influence each other only through data passed along arcs).

Synchronous data flow
The number of tokens produced or consumed by each node is fixed a priori for each firing.
Special case: homogeneous SDF (one token per arc per firing).
A delay of d is implemented by initializing the arc buffer with d zero samples.

Precedence graph
One cycle of a blocked schedule for three processors.
Observation: pipelining yields multiprocessor schedules with higher throughput.

Large grain data flow
Large grain data flow reduces the overhead associated with each node invocation.
Self-loops are required to store state variables.
Provides hierarchical graphical descriptions of applications.

Implementation architectures
A single sequential processor.
An isomorphic hardware mapping.
Homogeneous parallel processors sharing memory without contention.

Formalism: topology matrix
Buffer-state update: b(n+1) = b(n) + Γ v(n), where b(n) is the vector of tokens buffered on each arc, v(n) is the unit vector selecting the node fired at step n, and each entry of Γ gives the number of tokens produced (positive) or consumed (negative) on an arc by one firing of a node.
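To make the formalism concrete, here is a minimal sketch (not from the paper) that applies the update b(n+1) = b(n) + Γ v(n) to a hypothetical two-actor graph in which A produces 2 tokens per firing and B consumes 3:

import numpy as np

# Topology matrix Gamma for a single arc A -> B (hypothetical rates):
# one row per arc, one column per node; an entry is the number of tokens
# produced (+) or consumed (-) on that arc by one firing of that node.
Gamma = np.array([[2, -3]])          # A produces 2 tokens, B consumes 3

b = np.zeros(1, dtype=int)           # b(n): tokens buffered on each arc
firing_order = [0, 0, 1, 0, 1]       # indices of the nodes fired: A A B A B

for node in firing_order:
    v = np.zeros(2, dtype=int)
    v[node] = 1                      # v(n): unit vector selecting the fired node
    b = b + Gamma @ v                # b(n+1) = b(n) + Gamma v(n)
    assert (b >= 0).all(), "negative buffer: schedule is not admissible"
    print("fired", "AB"[node], "buffer state b =", b)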

Scheduling for a single processor
Theorem 1: For a connected SDF graph with s nodes and topology matrix Γ, rank(Γ) = s-1 is a necessary condition for a PASS (periodic admissible sequential schedule) to exist.
Theorem 2: For a connected SDF graph with s nodes and topology matrix Γ with rank(Γ) = s-1, we can find a positive integer vector q ≠ 0 such that Γq = 0.
Definition: A class S algorithm is any algorithm that schedules a node whenever it is runnable, updates b(n), and stops only when no more nodes are runnable.
Theorem 3: For a connected SDF graph with topology matrix Γ and a positive integer vector q such that Γq = 0, if a PASS of period p = 1ᵀq exists, then any class S algorithm will find such a PASS.
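A small sketch of Theorem 2 in practice: computing the smallest positive integer vector q with Γq = 0 for a chain-structured graph by propagating rational firing rates (the rates below are made up for illustration):

from fractions import Fraction
from math import lcm

# Chain-structured SDF A -> B -> C; (produced, consumed) per arc are hypothetical.
rates = [(2, 3), (1, 2)]             # A->B: A makes 2, B eats 3; B->C: B makes 1, C eats 2

# Balance equations: q[i] * produced_i = q[i+1] * consumed_i on every arc,
# so the relative rate q[i+1] / q[i] = produced_i / consumed_i.
q = [Fraction(1)]
for produced, consumed in rates:
    q.append(q[-1] * Fraction(produced, consumed))

# Scale to the smallest positive integer vector (Theorem 2).
scale = lcm(*(f.denominator for f in q))
q = [int(f * scale) for f in q]
print(q)                             # [3, 2, 1]: 3*2 = 2*3 and 2*1 = 1*2, so Gamma q = 0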

Theorem 4: For a connected SDF graph with topology matrix Γ and a positive integer vector q such that Γq = 0, a PASS of period p = 1ᵀq exists if and only if a PASS of period N·p exists for any positive integer N.
Corollary: For a connected SDF graph with topology matrix Γ and any positive integer vector q such that Γq = 0, a PASS of period p = 1ᵀq exists if and only if a PASS of period r = 1ᵀv exists for any v such that Γv = 0.

Sequential scheduling algorithm
1. Solve for the smallest positive integer vector q such that Γq = 0.
2. Form an arbitrarily ordered list L of all nodes in the system.
3. For each α in L, schedule α if it is runnable, trying each node once.
4. If each node α has been scheduled q_α times, stop and declare a successful sequential schedule.
5. If no node in L can be scheduled, indicate a deadlock.
6. Otherwise, go to step 3 and repeat.
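A runnable sketch of this sequential (class S) scheduling loop; the graph representation and names are my own, not the paper's:

def class_s_schedule(nodes, arcs, q):
    """nodes: list of node names.  arcs: dict (src, dst) -> (produced, consumed, delay).
    q: dict node -> required firings per period.  Returns a PASS or raises on deadlock."""
    buf = {arc: d for arc, (_, _, d) in arcs.items()}        # initial tokens = delays
    fired = {n: 0 for n in nodes}
    schedule = []

    def runnable(n):
        if fired[n] >= q[n]:                                  # already fired q_n times
            return False
        return all(buf[(s, d)] >= c
                   for (s, d), (_, c, _) in arcs.items() if d == n)

    while any(fired[n] < q[n] for n in nodes):
        progress = False
        for n in nodes:                                       # step 3: try each node once
            if runnable(n):
                for (s, d), (p, c, _) in arcs.items():        # fire n: update buffers
                    if d == n:
                        buf[(s, d)] -= c
                    if s == n:
                        buf[(s, d)] += p
                fired[n] += 1
                schedule.append(n)
                progress = True
        if not progress:
            raise RuntimeError("deadlock: no runnable node")  # step 5
    return schedule

# Hypothetical example: arc A -> B producing 2 and consuming 3 tokens, no delay.
print(class_s_schedule(["A", "B"], {("A", "B"): (2, 3, 0)}, {"A": 3, "B": 2}))
# -> ['A', 'A', 'B', 'A', 'B']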

Scheduling for parallel processors
A blocked PAPS (periodic admissible parallel schedule) is a set of lists {Ψᵢ, i = 1, …, M}, where M is the number of processors and Ψᵢ is the periodic schedule for processor i.
A blocked PAPS must invoke each node the number of times given by q = J·p for some positive integer J called the blocking factor (here p is the smallest positive integer vector in the null space of Γ).
Goal of scheduling: avoid deadlock and minimize the iteration period (the run time for one cycle of the blocked schedule divided by J).
Solution: the problem is identical to the assembly line problem in operations research; it is NP-complete, but good heuristic methods exist (e.g. Hu's level-scheduling algorithm).
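As a rough illustration of the kind of heuristic referred to above, here is a simplified list scheduler that orders tasks by Hu level (longest remaining path to a sink) and greedily assigns them to processors; the task graph and run times are invented:

def hu_level_schedule(tasks, preds, runtime, n_procs):
    """Greedy list scheduling by Hu level.  tasks: names; preds: dict task -> set of
    predecessor tasks; runtime: dict task -> execution time.  Toy example only."""
    succs = {t: [u for u in tasks if t in preds[u]] for t in tasks}
    level = {}

    def lv(t):                                   # level = longest path from t to a sink
        if t not in level:
            level[t] = runtime[t] + max((lv(u) for u in succs[t]), default=0)
        return level[t]

    free = [0] * n_procs                         # time at which each processor is free
    finish, placement = {}, []
    for t in sorted(tasks, key=lv, reverse=True):    # decreasing level is a topological order
        proc = min(range(n_procs), key=lambda p: free[p])
        start = max([free[proc]] + [finish[p] for p in preds[t]])
        finish[t] = start + runtime[t]
        free[proc] = finish[t]
        placement.append((t, proc, start))
    return placement

# Hypothetical precedence graph: A precedes B and C, which both precede D.
tasks = ["A", "B", "C", "D"]
preds = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
runtime = {"A": 1, "B": 2, "C": 3, "D": 1}
print(hu_level_schedule(tasks, preds, runtime, n_procs=2))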

Class S algorithm (precedence graph construction)
Initialization: i = 0; q(0) = 0.
Main body:
while nodes are runnable {
    for each α in L {
        if α is runnable then {
            create the j = (q_α(i) + 1)-th instance of node α;
            for each input arc a of α {
                let β be the predecessor node for arc a;
                compute d = ceil((-j * Γ_aα - b_a) / Γ_aβ);
                if d < 0 then let d = 0;
                establish precedence links with the first d instances of β;
            }
            let v(i) be the vector of zeros except a 1 in position α;
            let b(i+1) = b(i) + Γ v(i);
            let q(i+1) = q(i) + v(i);
            let i = i + 1;
        }
    }
}

Parallel scheduling: example (figure comparing blocked schedules for blocking factors J = 1 and J = 2).

Gabriel

JOINT MINIMIZATION OF CODE AND DATA FOR SYNCHRONOUS DATAFLOW PROGRAMS - Praveen K. Murthy, Shuvra S. Bhattacharyya, and Edward A. Lee

Chain-structured SDFs Well-ordered SDFs

Acyclic SDFs Cyclic SDFs

Schedules
Assume no delays in the model.
The q-vector gives the number of firings of each actor {A, B, C, D} in one cycle, e.g. q = [1, 2, 3, 4]ᵀ.
The same q admits different schedules:
ABBCCCDDDD = A(2B)(3C)(4D)
DDCBABCCDD = (2D)CBAB(2C)(2D)
ABDDBDDCCC = A(2B(2D))(3C)
Each appearance of an actor in a schedule is compiled into a copy of that actor's code; loops are compiled once.
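A small sketch (my own notation parser, not part of the papers) that expands a looped schedule such as A(2B)(3C)(4D) into its flat firing sequence, making the distinction between appearances and firings concrete:

import re

def expand(schedule):
    """Expand a looped schedule string such as 'A(2B)(3C)(4D)' into its flat
    firing sequence.  Single capital letters are actors; '(n ...)' repeats the
    enclosed sub-schedule n times."""
    def parse(s, i):
        seq = []
        while i < len(s) and s[i] != ')':
            if s[i] == '(':
                m = re.match(r'\((\d+)', s[i:])
                count = int(m.group(1))
                inner, i = parse(s, i + m.end())
                seq.extend(inner * count)
                i += 1                            # skip the closing ')'
            else:
                seq.append(s[i])
                i += 1
        return seq, i
    return ''.join(parse(schedule, 0)[0])

print(expand("A(2B)(3C)(4D)"))                    # ABBCCCDDDD
print(expand("A(2B(2D))(3C)"))                    # ABDDBDDCCC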

Objectives
Code minimization: use single-appearance schedules.
Data minimization: minimize the buffer sizes needed; assume an individual buffer for each edge.
Example: consider actors A and B with q = [… 2, 2 …]ᵀ; possible schedules include …ABAB…, …AABB…, and …2(AB)…

Buffering Cost
Consider the flat schedule (9A)(12B)(12C)(8D). Buffering cost: 36 + 12 + 24 = 72.
An alternate single-appearance schedule, (3(3A)(4B))(4(3C)(2D)), has buffering cost 12 + 12 + 6 = 30.
Why not use a shared buffer?
Pro: only the maximum buffering requirement over all arcs is needed (36 here).
Cons: (1) implementation is difficult for looped schedules; (2) adding delays to the model changes the implementation considerably.
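These costs can be checked by simulation. The sketch below reuses expand() from the previous snippet and assumes per-edge token rates consistent with q = [9, 12, 12, 8] (my reading of the example):

def buffer_cost(firing_sequence, edges):
    """edges: dict (src, dst) -> (produced, consumed).  Returns the sum over all
    edges of the peak number of tokens queued, i.e. the per-edge buffer sizes."""
    tokens = {e: 0 for e in edges}
    peak = {e: 0 for e in edges}
    for actor in firing_sequence:
        for (s, d), (p, c) in edges.items():
            if d == actor:                        # consume inputs
                tokens[(s, d)] -= c
                assert tokens[(s, d)] >= 0, "schedule is not admissible"
            if s == actor:                        # produce outputs
                tokens[(s, d)] += p
                peak[(s, d)] = max(peak[(s, d)], tokens[(s, d)])
    return sum(peak.values())

# Chain A -> B -> C -> D with rates consistent with q = [9, 12, 12, 8].
edges = {("A", "B"): (4, 3), ("B", "C"): (1, 1), ("C", "D"): (2, 3)}
print(buffer_cost(expand("(9A)(12B)(12C)(8D)"), edges))        # 72
print(buffer_cost(expand("(3(3A)(4B))(4(3C)(2D))"), edges))    # 30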

Delays
Individual per-edge buffers make it simple to add a delay of d tokens to an edge e: buffering cost(e) = original cost + d.
With a shared buffer, adding delays makes the circular-buffer layout inconsistent.
Example for q = [147, 49, 28, 32, 160]ᵀ: shared buffer size = 28 × 8 = 224; A = 1–147, B = …, C = …

A Useful Fact
A factoring transformation applied to any valid schedule yields another valid schedule: (9A)(12B)(12C)(8D) => (3(3A)(4B))(4(3C)(2D)).
The buffer requirement of the factored schedule is less than or equal to that of the original schedule.

R-schedules: Definition
Definition: a schedule is fully reduced if it is completely factored.
Factoring can be applied recursively to subgraphs, factoring out the GCD at each step; the result can differ depending on how the graph is split:
Split between B and C => (3(3A)(4B))(4(3C)(2D))
Split between A and B => (9A)(4(3BC)(2D))
The set of schedules obtained this way is called the set of R-schedules.
For larger graphs there are many R-schedules: the number grows as Ω(4ⁿ/n).

R-schedules: Buffering Cost
For (9A)(12B)(12C)(8D) there are 5 R-schedules in total:
(3(3A)(4B))(4(3C)(2D))   Cost: 30
(3(3A)(4(BC)))(8D)       Cost: 37
(3((3A)(4B))(4C))(8D)    Cost: 40
(9A)(4(3BC)(2D))         Cost: 43
(9A)(4(3B)(3C)(2D))      Cost: 45
Theorem: the set of R-schedules always contains a schedule that achieves the minimum buffer memory requirement over all valid single-appearance schedules.
However, there are too many R-schedules for an exhaustive search.

Dynamic Programming
Finding the optimal R-schedule is an optimal parenthesization problem.
Two-actor subchains are examined first, and the buffer memory requirements for these subchains are recorded. This information is then used to determine the minimum buffer memory requirement, and the location of the split that achieves it, for each three-actor subchain, and so on.
Details of the algorithm are described in the paper. Time complexity: Θ(n³).

Dynamic Programming: Some Details
Each (i, j) pair represents the subgraph between the i-th and j-th nodes.
1. Continue for three-actor subchains, then four-actor subchains, and so on.
2. After finding the location of the split for the n-actor chain, do a top-down traversal to recover the optimal R-schedule.
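A compact sketch of this dynamic program for chain-structured graphs (my own implementation of the idea, assuming one buffer per edge and no delays); it reproduces the cost-30 schedule from the earlier example:

from math import gcd
from functools import lru_cache

def optimal_r_schedule(q, tokens, names):
    """Minimum-buffer single-appearance schedule for a chain-structured SDF graph.
    q[i] = repetitions of actor i; tokens[i] = tokens crossing edge (i, i+1) per
    period (q[i] * produced_i); names[i] = actor name.  Assumes no delays."""
    n = len(q)

    @lru_cache(maxsize=None)
    def cost(i, j):
        if i == j:
            return 0
        g = gcd(*q[i:j + 1])                     # loop factor of subchain i..j
        return min(cost(i, k) + cost(k + 1, j) + tokens[k] // g
                   for k in range(i, j))

    def build(i, j, outer):
        """Build the schedule body, given that an enclosing loop already fires
        this subchain 'outer' times."""
        if i == j:
            reps = q[i] // outer
            return names[i] if reps == 1 else f"({reps}{names[i]})"
        g = gcd(*q[i:j + 1])
        k = min(range(i, j),
                key=lambda k: cost(i, k) + cost(k + 1, j) + tokens[k] // g)
        body = build(i, k, g) + build(k + 1, j, g)
        return body if g == outer else f"({g // outer}{body})"

    return cost(0, n - 1), build(0, n - 1, 1)

print(optimal_r_schedule((9, 12, 12, 8), (36, 12, 24), "ABCD"))
# -> (30, '(3(3A)(4B))(4(3C)(2D))')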

Example: Sample Rate Conversion
Convert 44,100 Hz to 48,000 Hz (CD to DAT); 44100:48000 = 147:160.
q = [147, 147, 98, 28, 32, 160]ᵀ
Optimal nested schedule: (7(7(3AB)(2C))(4D))(32E(5F)), buffering cost: 264.
Flat schedule cost: 1021. Shared buffer cost: 294.

Example: Sample Rate Conversion (2)
Some advantages of nested schedules:
Latency: the last actor in the sequence does not have to wait as long; the nested schedule has 29% less latency.
Less input buffering is needed for nested schedules.

A Heuristic
Work top-down: introduce a split at the edge across which the least data is transferred, determined using the GCD of the q-vector elements, then continue recursively on the resulting subgraphs.
Time complexity: O(n²) worst case, O(n log n) average case.
Performance is only so-so.
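A sketch of one plausible reading of this heuristic, reusing the cost model of the dynamic-programming snippet above:

from math import gcd

def heuristic_schedule(q, tokens, names):
    """Top-down heuristic: split the chain at the edge carrying the least data
    (total tokens divided by the subchain's gcd), then recurse on both halves."""
    def build(i, j, outer):
        if i == j:
            reps = q[i] // outer
            return names[i] if reps == 1 else f"({reps}{names[i]})"
        g = gcd(*q[i:j + 1])
        k = min(range(i, j), key=lambda k: tokens[k] // g)   # cheapest split edge
        body = build(i, k, g) + build(k + 1, j, g)
        return body if g == outer else f"({g // outer}{body})"
    return build(0, len(q) - 1, 1)

print(heuristic_schedule((9, 12, 12, 8), (36, 12, 24), "ABCD"))
# -> '(3(3A)(4B))(4(3C)(2D))' here; in general the result is not always optimal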

Extensions to Dynamic Programming
Delays can easily be handled by modifying the cost function.
Similarly, the algorithm extends easily to well-ordered SDFs.
The algorithm can also be extended to general acyclic SDFs:
Use a topological sort: an ordering in which the source occurs before the sink for every edge.
For example: ABCD, ABDC, BACD, BADC (see the sketch below).
The optimal R-schedule can then be found for a given topological sort.
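For small graphs, the topological sorts can be enumerated by backtracking; the sketch below (with a hypothetical diamond-shaped graph) reproduces the four orderings listed above:

def all_topological_sorts(nodes, preds):
    """Enumerate every topological sort of a small acyclic graph by backtracking
    (exponential in general, as noted above)."""
    def rec(remaining, placed):
        if not remaining:
            yield list(placed)
        for n in sorted(remaining):
            if preds[n] <= set(placed):           # all predecessors already placed
                yield from rec(remaining - {n}, placed + [n])
    yield from rec(set(nodes), [])

# Hypothetical diamond-shaped graph: C and D each depend on both A and B.
preds = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"A", "B"}}
for order in all_topological_sorts("ABCD", preds):
    print("".join(order))                         # ABCD, ABDC, BACD, BADC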

Extensions to Dynamic Programming (2)
Difficulties for acyclic SDFs (example with q = [12, 36, 9, 16]ᵀ):
For sort ABCD => (3(4A)(3(4B)C))(16D), cost: 208
For sort ABDC => (4(3A)(9B)(4D))(9C), cost: 120
The number of topological sorts can grow exponentially with n, and finding the optimal topological sort is NP-complete.

Conclusion
Single-appearance schedules minimize code size.
Optimal single-appearance schedules can be found that also minimize buffer size.
The dynamic programming approach yields results for chain-structured, well-ordered, and acyclic SDFs.
It also works for cyclic SDFs whenever a valid single-appearance schedule exists.