
Static Scheduling of Synchronous Data Flow Programs for Digital Signal Processing
S. M. Farhad, PhD Student
Supervisor: Dr. Bernhard Scholz
Programming Language Group, School of Information Technology, University of Sydney

Abstract Synchronous data flow (SDF) differs from traditional data flow Scheduling of SDF nodes can be done at compile time (statically) Contribution of this paper: develops the theory for static scheduling of SDF programs on single or multiple processors - Synchronous data flow differs from traditional data flow in that the amount of data produced and consumed by a data flow node is specified a priori for each input and output

Introduction Need to depart from the simplicity of the von Neumann computer architecture Programming signal processors using large grain data flow (LGDF) languages [W. B. Ackerman 82] Eases programming Enhances the modularity of code Describes algorithms more naturally Concurrency is immediately evident from the program description - Concurrency is immediately evident from the program description, so parallel hardware can be used more effectively

Data Flow Analysis [W. B. Ackerman 82] P = X + Y Q = P/Y R = X*P S = R – Q T = R*P RESULT = S/T Many of these instructions can run in parallel as long as some constraints are met These constraints can be represented by a graph Nodes represent instructions Arrows between nodes represent constraints So the permissible computation sequences can be, for example, (1, 3, 5, 2, 4, 6), (1, 2, 3, 5, 4, 6), and others - An arrow from one instruction to another means that the second may not be executed until the first has been completed

Sequencing Constraints (1) P = X + Y (3) R = X*P (2) Q = P/Y (4) S = R - Q (5) T = R*P (6) RESULT = S/T
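The sequencing constraints above can be encoded as a small dependency graph and checked mechanically. A minimal Python sketch (the `deps` encoding and the `admissible` helper are illustrative, not from the paper):

```python
# Dependency DAG for the six-statement example: an edge from a to b
# means instruction b may not execute until a has completed.
deps = {
    1: [],        # P = X + Y
    2: [1],       # Q = P/Y      needs P
    3: [1],       # R = X*P      needs P
    4: [3, 2],    # S = R - Q    needs R and Q
    5: [3, 1],    # T = R*P      needs R and P
    6: [4, 5],    # RESULT = S/T needs S and T
}

def admissible(order):
    """An order is admissible if every instruction appears after
    all of its predecessors in the dependency graph."""
    pos = {node: i for i, node in enumerate(order)}
    return all(pos[p] < pos[n] for n, preds in deps.items() for p in preds)

print(admissible((1, 3, 5, 2, 4, 6)))   # one of the permissible sequences
print(admissible((2, 1, 3, 4, 5, 6)))   # not permissible: Q before P
```

Both sequences named on the slide pass this check; any order that runs an instruction before one of its operands is ready fails it.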

The Data Flow Paradigm A program is divided into pieces (nodes or blocks) which can execute whenever input data are available An algorithm can be described as a data flow graph Nodes represent functions Arcs represent data paths Signal processing algorithms can also be described as data flow graphs Nodes are atomic or non-atomic functions Arcs are signal paths

The Data Flow Paradigm Contd. The complexity of the functions (granularity) will determine the amount of parallelism available No attempt to exploit concurrency inside a block The functions within the blocks can be specified using von Neumann programming techniques The blocks can themselves represent another data flow graph (hierarchical) LGDF is ideally suited for signal processing

Synchronous Data Flow Graphs A block is invoked when input is available When invoked, it consumes a fixed number of input samples on each input path and produces a fixed number of output samples A block is synchronous if we can specify a priori its input and output sample counts when it is invoked We assume that the signal processing system repetitively applies an algorithm to an infinite sequence of data

A synchronous data flow graph [figure: three blocks A, B, C connected by arcs with sample-rate labels b–j] An SDF graph requires buffering the data samples passed between blocks and scheduling blocks when data are available (static approach) This could also be done dynamically (runtime supervisor, a costly approach)

A synchronous data flow graph SDF graphs can be scheduled statically (at compile time) regardless of the number of processors No need for dynamic control Communication between nodes and processors is set up by the compiler, so no runtime control is needed Thus the LGDF paradigm gives the programmer a natural way of programming with evident concurrency

Scheduling an SDF graph Schedule blocks onto processors in such a way that data are available when a block is invoked Assumptions The SDF graph is non-terminating (without deadlock) The SDF graph is connected Goal is to find a periodic admissible sequential schedule (PASS) and a periodic admissible parallel schedule (PAPS) Non-termination is natural for signal processing If the SDF graph is not connected, then each separate graph can be scheduled separately using subsets of the processors

Construction of a PASS Topology matrix [figure: example SDF graph and its topology matrix Γ, where entry (i, j) is the number of samples produced (positive) or consumed (negative) on arc i each time node j is invoked] Connections to the outside world are ignored A correctly constructed self-loop has equal amounts of data produced and consumed, so the net difference is zero This matrix need not be square in general

Construction of a PASS Replace each arc with a FIFO queue to pass data from one block to another (queue sizes vary) The vector b(n) contains the queue sizes of all the buffers at time n For a sequential schedule only one block can be invoked at a time v(n) is the vector of blocks invoked at time n

Construction of a PASS [figure: three-node SDF graph with delays D and 2D on two arcs] The change in the buffer sizes caused by invoking a node is b(n+1) = b(n) + Γv(n) A unit delay on an arc from A to B means that the n-th sample consumed by node B is the (n-1)-th sample produced by node A So the first sample consumed by the destination block is not produced by the source (it is part of the initial state of the arc buffer)

Construction of a PASS [figure: same graph with delays D and 2D] Because of this initial condition, block 2 can be invoked once and block 3 can be invoked twice before block 1 is invoked at all Delays therefore affect the way the system starts up
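The buffer-state recursion b(n+1) = b(n) + Γv(n) can be simulated directly. A minimal sketch for a hypothetical three-node, two-arc graph (the sample rates in `GAMMA` and the schedule are illustrative, not the paper's example):

```python
# Topology matrix for a hypothetical SDF graph:
# arc 0: node 1 produces 2 tokens per firing, node 2 consumes 1
# arc 1: node 1 produces 2 tokens per firing, node 3 consumes 2
GAMMA = [
    [2, -1,  0],
    [2,  0, -2],
]

def fire(b, node):
    """Invoke `node` once: b(n+1) = b(n) + GAMMA * v(n), where v(n)
    is the unit vector selecting `node`."""
    nb = [b[arc] + GAMMA[arc][node] for arc in range(len(GAMMA))]
    assert all(x >= 0 for x in nb), "negative buffer: not admissible"
    return nb

b = [0, 0]                       # empty buffers at time 0 (no delays)
for node in (0, 1, 1, 2):        # one period: fire node 1, then 2 twice, then 3
    b = fire(b, node)
print(b)                         # [0, 0] -- back to the initial state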

Construction of a PASS Given this computation model (eqn. 1 - 4) Find necessary and sufficient conditions for the existence of a PASS, and hence a PAPS Find a practical algorithm that provably finds a PASS if one exists Find a practical algorithm that constructs a reasonable (not necessarily optimal) PAPS, if a PASS exists

Necessary condition for the existence of a PASS Here s is the number of nodes or blocks in the graph Definition 1: an admissible sequential schedule φ is a non-empty ordered list of nodes such that if the nodes are executed in the sequence given by φ, the amount of buffered data will remain non-negative and bounded Each node must appear in φ at least once

Quick reminder of the rank of a matrix The rank of a matrix is the maximum number of linearly independent rows The rank can be calculated by the Gaussian elimination algorithm [example matrix: its 2nd column is twice column 1, so its columns are not all independent]

Necessary condition for the existence of a PASS Theorem 1: For a connected SDF graph with s nodes and topology matrix Γ, rank(Γ) = s - 1 is a necessary condition for a PASS to exist For a PASS of period p, (3) => b(p) = b(0) + Γq [figure: example graph and topology matrix] The q vector tells us how many times we should invoke each node in one period of a PASS After a period the buffers end up once again in their initial state

Necessary condition for the existence of a PASS Since the PASS is periodic, we can write b(np) = b(0) + nΓq Since the PASS is admissible, the buffers must remain bounded, by definition 1 The buffers remain bounded if and only if Γq = O, where O is a vector full of zeros For q ≠ O, this implies that rank(Γ) < s, where s is the dimension of q But rank(Γ) can be either s or s - 1, and so it must be s - 1 [Lemma 3]

Necessary condition for the existence of a PASS Theorem 1 indicates that if we have an SDF graph with a topology matrix of rank s, then the graph is somehow defective and no PASS can be found for it [figure: example graph with a rank-s topology matrix] Any schedule for this graph will result either in deadlock or unbounded buffer sizes A rank of s in the topology matrix indicates a sample rate inconsistency in the graph

Necessary condition for the existence of a PASS Theorem 2: For a connected SDF graph with s nodes and topology matrix Γ, and with rank(Γ) = s - 1, we can find a positive integer vector q ≠ O such that Γq = O, where O is the zero vector Definition 2: A predecessor to a node x is a node feeding data to x
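For a connected graph, one simple way to find the smallest positive integer vector q with Γq = O is to propagate production/consumption rate ratios over the graph with exact fractions and then clear denominators. A sketch under assumed inputs (the edge list and rates are illustrative; the encoding `(src, dst, produced, consumed)` is ours, not the paper's):

```python
from fractions import Fraction
from math import lcm

# Hypothetical connected SDF graph: edges as (src, dst, produced, consumed).
edges = [(0, 1, 2, 1), (0, 2, 2, 2)]
n_nodes = 3

# Fix node 0's relative firing rate at 1, then propagate the balance
# equation q_d * consumed = q_s * produced along each arc.
rate = {0: Fraction(1)}
frontier = [0]
while frontier:
    u = frontier.pop()
    for s, d, p, c in edges:
        if s == u and d not in rate:
            rate[d] = rate[u] * p / c
            frontier.append(d)
        elif d == u and s not in rate:
            rate[s] = rate[u] * c / p
            frontier.append(s)

# Clear denominators; since `scale` is the least common denominator of
# reduced fractions, the resulting integer vector is already the smallest.
scale = lcm(*(r.denominator for r in rate.values()))
q = [int(rate[i] * scale) for i in range(n_nodes)]
print(q)   # smallest positive integer vector in the null space of Γ
```

If the rates are inconsistent (rank s rather than s - 1), propagation around a cycle would assign a node two different rates, which is how the defect of Theorem 1 shows up in practice.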

Necessary condition for the existence of a PASS Definition 3: (Class S algorithm) Given a positive integer vector q such that Γq = O and an initial state for the buffers b(0), the ith node is runnable at a given time if it has not been run q_i times and running it will not cause a buffer size to go negative A class S algorithm is any algorithm that schedules a node if it is runnable, updates b(n), and stops only when no more nodes are runnable If a class S algorithm terminates before it has scheduled each node the number of times specified in the q vector, then it is said to be deadlocked

Necessary condition for the existence of a PASS Theorem 3: Given an SDF graph with topology matrix Γ and given a positive integer vector q s.t. Γq = O, if a PASS of period p = 1ᵀq exists, where 1ᵀ is a row vector full of ones, any class S algorithm will find such a PASS

Necessary condition for the existence of a PASS [figure: two SDF graphs (a) and (b) with consistent sample rates but no admissible schedule; each contains a directed loop with insufficient delay] - Networks with insufficient delays in directed loops are not computable

Necessary condition for the existence of a PASS Theorem 4: Given an SDF graph with topology matrix Γ and given a positive integer vector q s.t. Γq = O, a PASS of period p = 1ᵀq exists if and only if a PASS of period Np exists for any integer N Theorem 4 tells us that it does not matter which positive integer vector we use from the null space of the topology matrix, so we can simplify our system by using the smallest such vector, thus obtaining a PASS with minimum period

Class S algorithm given the theorems Solve for the smallest positive integer vector q in the null space of Γ Form an arbitrary ordered list L of all nodes in the system For each node in L, schedule it if it is runnable, trying each node once If each node i has been scheduled q_i times, STOP If no node in L can be scheduled, indicate a deadlock Else go to step 3 and repeat
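The steps above can be sketched as a small class S scheduler. The graph encoding, `(src, dst, produced, consumed, initial_delay)` tuples, and the sample rates are hypothetical:

```python
# Hypothetical SDF graph: node 0 feeds nodes 1 and 2; no initial delays.
edges = [(0, 1, 2, 1, 0), (0, 2, 2, 2, 0)]
q = [1, 2, 1]                         # repetitions vector with Gamma*q = 0

def class_s_schedule(edges, q):
    """Schedule runnable nodes until each node i has run q[i] times,
    or report deadlock (Definition 3)."""
    buf = [e[4] for e in edges]       # buffers start at the initial delays
    runs = [0] * len(q)
    schedule = []
    while True:
        progressed = False
        for node in range(len(q)):    # the arbitrary ordered list L
            if runs[node] >= q[node]:
                continue              # already run q[node] times
            # Runnable iff no input buffer would go negative.
            if all(buf[i] >= c for i, (s, d, p, c, _) in enumerate(edges)
                   if d == node):
                for i, (s, d, p, c, _) in enumerate(edges):
                    if s == node:
                        buf[i] += p   # produce onto output arcs
                    if d == node:
                        buf[i] -= c   # consume from input arcs
                runs[node] += 1
                schedule.append(node)
                progressed = True
        if not progressed:
            break
    if runs != q:
        raise RuntimeError("deadlock: no PASS exists for this q")
    return schedule

print(class_s_schedule(edges, q))
```

By Theorem 3, any scan order over L would do; a different ordering of the inner loop may yield a different but equally admissible PASS.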

Constructing a PAPS If a workable schedule for a single processor can be generated, then a workable schedule for a multiprocessor system can also be generated The first step is to construct an acyclic precedence graph for J periods of the PASS using the class S algorithm

Construct an acyclic precedence graph by example [figure: three-node SDF graph with a delay D on one arc] This graph is neither acyclic nor a precedence graph Possible minimum PASSes are {1, 3, 1, 3}, {3, 1, 1, 2}, and {1, 1, 3, 2}, each with period 4 {2, 1, 3, 1} is not a PASS because node 2 is not immediately runnable

Construct an acyclic precedence graph [figure: acyclic precedence graphs constructed for J=1 and J=2 periods of the PASS]

Next step: constructing a parallel schedule By the critical path method [Adam 74] or by the Hu-level scheduling algorithm [T. C. Hu 61] A level is determined for each node in the acyclic precedence graph, where the level of a given node is the worst case of the total of the runtimes of nodes on a path from the given node to a terminal node of the graph A terminal node is a node with no successor If there is no terminal node then one can be created with zero runtime
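The level computation just described can be written as a short recursion. The precedence graph, node names, and runtimes below are hypothetical:

```python
# Small hypothetical acyclic precedence graph: A and B both precede C.
succ = {"A": ["C"], "B": ["C"], "C": []}     # C is the only terminal node
runtime = {"A": 1, "B": 2, "C": 3}

def level(n):
    """Level of n: its own runtime plus the worst-case (longest) total
    runtime along any path of successors down to a terminal node."""
    return runtime[n] + max((level(s) for s in succ[n]), default=0)

levels = {n: level(n) for n in succ}
print(levels)   # {'A': 4, 'B': 5, 'C': 3}
```

A list scheduler then dispatches ready nodes in order of decreasing level: here B (level 5) would be scheduled before A (level 4) when both are available.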

Hu-level scheduling algorithm [figure: the J=1 and J=2 precedence graphs annotated with the Hu level of each node]

Constructing a parallel schedule The Hu-level scheduling algorithm simply schedules available nodes with the highest level first When there are more available nodes with the same highest level than there are processors, a reasonable heuristic is to schedule the ones with the longest runtimes first

Constructing a parallel schedule [figure: Gantt charts of the two-processor schedules for J=1 and J=2] Two processors; the runtimes of nodes 1, 2, 3 are 1, 2, 3 time units respectively

Limitations of the Model Does not handle conditional control flow the way general-purpose languages do Asynchronous graphs Connecting to the outside world Data-dependent runtimes of blocks

Summary This paper describes the theory necessary to develop a signal processing programming methodology that offers Programmer convenience A natural way to describe signal processing Ready use of the available concurrency

Questions? Thank you