Martino Ruggiero, Michele Lombardi, Michela Milano and Luca Benini

Presentation transcript:

Communication-Aware Stochastic Allocation and Scheduling Framework for Conditional Task Graphs in Multi-Processor Systems-on-Chip
Martino Ruggiero, Michele Lombardi, Michela Milano and Luca Benini
University of Bologna, DEIS - Italy

Outline: Motivations; Our approach; Problem Model; Methodology; Experimental Results; Conclusions.

Many realistic applications can only be specified as conditional task graphs. Allocating and scheduling conditional task graphs on the processors of a distributed real-time system is NP-hard. New tool flows are needed to map multi-task applications efficiently onto hardware platforms.
[Figure: a task graph (T1 to T8) is allocated to processors Proc. 1 ... Proc. N, each with a private memory and connected by an interconnect, and then scheduled over time and resources against a deadline.]

The abstraction gap between high-level optimization tools and standard application programming models can introduce unpredictable and undesired behaviours. Programmers must be aware of the simplifying assumptions made by optimization tools. A new methodology is needed for multi-task application development on MPSoCs.
[Figure: design flow graph going from platform modelling and a starting implementation, through optimization and analysis, to the optimal solution, the final implementation and platform execution; the abstraction gap lies between optimization and development.]

Our approach
Our focus: statically scheduled Conditional Task Graph applications.
Our objectives: (1) a complete approach to allocation and scheduling, with high computational efficiency w.r.t. commercial solvers and high accuracy of the generated solutions; (2) a new methodology for multi-task application development, to quickly develop multi-task applications and to easily apply the optimal solution found by our optimizer.

Target architecture - 1
An architectural template for a message-oriented distributed memory MPSoC: support for message exchange between the computation tiles; single-token communication; availability of local memory devices at the computation tiles and of remote memories for program data.
Several MPSoC platforms available on the market match this template: the Silicon Hive Avispa-CH1 processor; the Cradle CT3600 family of multiprocessors; the Cell processor; the ARM MPCore platform.
The throughput requirement is reflected in the maximum tolerable scheduling period T of each processor.
[Figure: activities Act. A, Act. B, ..., Act. N executed within one period T.]

Target architecture - 2
Homogeneous computation tiles: ARM cores (including instruction and data caches); tightly coupled software-controlled scratch-pad memories (SPM); AMBA AHB; DMA engine; RTEMS OS.
Cores use non-cacheable shared memory to communicate. Semaphore and interrupt facilities are used for synchronization. Private on-chip memory is used to store data.

Target Application: Conditional Task Graph (CTG)
Target applications seldom behave the same way across executions: they contain cycles, conditional jumps or other elements of variability.
A CTG is a triple <T,A,C>, where: T is the set of nodes modelling generic tasks (e.g. elementary operations, subprograms, ...); A is the set of arcs modelling precedence constraints (e.g. due to data communication); C is a set of conditions, each one associated with an arc, modelling what must be true for that branch to be chosen during execution (e.g. the condition of an if-then-else construct).
The CTG extends the generic task graph model with stochastic elements: conditional branches, conditional nodes and branch nodes.
[Figure: example CTG with FORK, AND, BRANCH and OR nodes.]
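
The triple <T,A,C> can be made concrete with a small data structure. The following is a minimal sketch in C, not the encoding used by the authors: the names Arc, Ctg, NO_CONDITION, ctg_is_well_formed and ctg_conditional_out_degree are hypothetical, tasks and conditions are assumed to be identified by integer indices, and an unconditional arc is marked with NO_CONDITION.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical marker for an arc that carries no condition. */
#define NO_CONDITION (-1)

/* One element of A: a precedence arc, optionally guarded by a condition in C. */
typedef struct {
    int src;        /* index of the source task in T */
    int dst;        /* index of the destination task in T */
    int condition;  /* index into C, or NO_CONDITION */
} Arc;

/* The triple <T, A, C>; T and C are represented only by their sizes here. */
typedef struct {
    size_t n_tasks;       /* |T| */
    size_t n_conditions;  /* |C| */
    size_t n_arcs;        /* |A| */
    const Arc *arcs;
} Ctg;

/* Returns 1 if every arc refers to existing tasks and to an existing
 * condition (or to no condition at all), 0 otherwise. */
int ctg_is_well_formed(const Ctg *g) {
    assert(g != NULL);
    for (size_t r = 0; r < g->n_arcs; r++) {
        const Arc *a = &g->arcs[r];
        if (a->src < 0 || (size_t)a->src >= g->n_tasks) return 0;
        if (a->dst < 0 || (size_t)a->dst >= g->n_tasks) return 0;
        if (a->condition != NO_CONDITION &&
            (a->condition < 0 || (size_t)a->condition >= g->n_conditions))
            return 0;
    }
    return 1;
}

/* Counts the conditional arcs leaving a task; in this sketch a branch
 * node is a task for which the count is greater than zero. */
size_t ctg_conditional_out_degree(const Ctg *g, int task) {
    assert(g != NULL);
    size_t count = 0;
    for (size_t r = 0; r < g->n_arcs; r++)
        if (g->arcs[r].src == task && g->arcs[r].condition != NO_CONDITION)
            count++;
    return count;
}
```

With this layout, an if-then-else construct becomes one task with two outgoing arcs, each tagged with a different condition index.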

Task memory requirements
Each task has three kinds of memory requirements: program data; internal state; communication queues.
Program data and internal state can be allocated by the optimizer either on the local SPM or on the remote private memory.
The communicating task might run on the same processor (negligible communication cost) or on a remote processor (costly message exchange procedure).
Optimizer constraint: communication queues are allocated only in SPM → more efficient message passing.
[Figure: two tiles (#1, #2), each with an ARM core, interrupt controller, SPM and semaphores, connected through the system bus to the private memories.]

Logic-Based Benders Decomposition
The problem is decomposed into two sub-problems: allocation, solved with Integer Programming (IP) under the memory constraints, with the communication cost as objective function; and scheduling, solved with Constraint Programming (CP) under the real-time constraint.
The allocation stage passes a valid allocation to the scheduling stage; when no feasible schedule exists, a no-good is returned to the allocation stage as a linear constraint.
The process continues until the master problem and the sub-problem converge, providing the same value. The methodology has been proven to converge to the optimal solution [J.N.Hooker and G.Ottosson].
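
The interaction between the two stages can be illustrated on a toy instance. The sketch below is an assumption-laden stand-in, not the authors' solver: the master problem is solved by brute-force enumeration instead of IP, the sub-problem is a simple per-processor load check instead of a CP scheduler (precedences are ignored), and the no-good simply forbids the rejected allocation. All instance data (durations, arcs, DEADLINE) and all function names are hypothetical.

```c
#include <assert.h>

#define N_TASKS 3
#define N_ARCS 2
#define N_ALLOC (1 << N_TASKS)  /* each task is placed on processor 0 or 1 */
#define DEADLINE 6

/* Hypothetical instance data, chosen only for illustration. */
static const int duration[N_TASKS] = {4, 3, 2};
static const int arc_src[N_ARCS] = {0, 1};
static const int arc_dst[N_ARCS] = {1, 2};
static const int arc_data[N_ARCS] = {5, 1};

/* Bit i of alloc is the processor of task i. */
static int proc_of(int alloc, int task) { return (alloc >> task) & 1; }

/* Master objective: amount of data that has to cross the bus. */
int comm_cost(int alloc) {
    int cost = 0;
    for (int r = 0; r < N_ARCS; r++)
        if (proc_of(alloc, arc_src[r]) != proc_of(alloc, arc_dst[r]))
            cost += arc_data[r];
    return cost;
}

/* Sub-problem stand-in: the allocation is schedulable if the tasks of
 * each processor, run back to back, finish within DEADLINE. */
int schedulable(int alloc) {
    int load[2] = {0, 0};
    for (int i = 0; i < N_TASKS; i++)
        load[proc_of(alloc, i)] += duration[i];
    return load[0] <= DEADLINE && load[1] <= DEADLINE;
}

/* Iterates master and sub-problem; returns the first schedulable
 * allocation in non-decreasing cost order, or -1 if none exists. */
int benders_solve(int *iterations) {
    int cut[N_ALLOC] = {0};  /* no-goods collected so far */
    assert(iterations != 0);
    *iterations = 0;
    for (;;) {
        int best = -1;
        for (int a = 0; a < N_ALLOC; a++)  /* master: cheapest uncut allocation */
            if (!cut[a] && (best < 0 || comm_cost(a) < comm_cost(best)))
                best = a;
        if (best < 0) return -1;           /* every allocation was cut */
        (*iterations)++;
        if (schedulable(best)) return best;
        cut[best] = 1;                     /* no-good: forbid this allocation */
    }
}
```

Because the objective depends only on the allocation, the first allocation accepted by the sub-problem is optimal for this toy instance: the master always proposes the cheapest allocation that has not yet been cut. Stronger no-goods that exclude whole families of allocations would reduce the number of iterations.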

Allocation problem model
Decision variables: Tij = 1 if task i executes on processor j; Mij = 1 if task i allocates its program data on the SPM of PE j; Sij = 1 if task i allocates its internal state on the SPM of PE j; Crj = 1 if arc r is allocated on the SPM of PE j.
Constraints: each process can execute on only one processor. Program data and internal state can be allocated locally on a PE only if the task runs on it. The communication queue of arc r can be allocated locally only if both the source and the destination tasks run on PE j. The sum of the locally allocated structures cannot exceed the SPM capacity.
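
The four constraints stated in words above can be written as linear constraints over the binary variables. The formulation below is one plausible reading, not a transcription of the original slide: the size symbols m_i (program data of task i), s_i (internal state of task i), c_r (queue of arc r) and the capacity SPM_j are assumed names, and arc r is assumed to connect source task i to destination task k.

```latex
\begin{align}
& \sum_{j} T_{ij} = 1 && \forall i \\
& M_{ij} \le T_{ij} && \forall i, j \\
& S_{ij} \le T_{ij} && \forall i, j \\
& C_{rj} \le T_{ij}, \qquad C_{rj} \le T_{kj} && \forall r = (i,k),\ \forall j \\
& \sum_{i} \left( m_i M_{ij} + s_i S_{ij} \right) + \sum_{r} c_r C_{rj} \le \mathit{SPM}_j && \forall j
\end{align}
```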

Allocation problem model
The objective function is the minimization of the amount of data transferred on the bus, expressed over the same decision variables Tij, Mij, Sij and Crj.

Bus traffic modelling
The bus traffic expression combines the following terms: a term equal to 1 if the internal state of task i is remotely allocated; a term equal to 1 if the program data of task i is remotely allocated; an activation function equal to 1 if task i executes; an activation function equal to 1 if tasks i and k both execute; a term equal to 1 if the communication queue is remotely allocated.

Bus traffic modelling
The objective is the minimization of a stochastic function. Given an allocation, two of the terms in the expression are constants. Minimizing a stochastic function is a very complex operation (even more than exponential).

Bus traffic modelling
The expression is rewritten in terms of the existence and coexistence probabilities of tasks and of constant terms. Every stochastic dependence is removed and the expected value is reduced to a deterministic expression. We developed two polynomial-cost algorithms to compute these probabilities.
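
Once the existence and coexistence probabilities are known, the expected traffic is a plain weighted sum. The sketch below assumes one possible form of that sum: each remotely allocated structure contributes its size multiplied by the probability that the owning task (or pair of communicating tasks) executes. The struct and field names and the size parameters are hypothetical, and the probabilities are taken as inputs rather than computed from the CTG.

```c
#include <assert.h>
#include <stddef.h>

/* Per-task contribution; all fields are illustrative assumptions. */
typedef struct {
    double p_exec;      /* existence probability of the task */
    int state_remote;   /* 1 if the internal state is remotely allocated */
    int data_remote;    /* 1 if the program data is remotely allocated */
    double state_size;  /* traffic caused by a remote internal state */
    double data_size;   /* traffic caused by remote program data */
} TaskTraffic;

/* Per-arc contribution. */
typedef struct {
    double p_both;      /* coexistence probability of the two end tasks */
    int queue_remote;   /* 1 if the communication queue is remotely allocated */
    double queue_size;  /* traffic caused by a remote queue */
} ArcTraffic;

/* Expected bus traffic for a fixed allocation: a deterministic sum in
 * which the activation functions are replaced by probabilities. */
double expected_bus_traffic(const TaskTraffic *tasks, size_t n_tasks,
                            const ArcTraffic *arcs, size_t n_arcs) {
    assert(tasks != NULL || n_tasks == 0);
    assert(arcs != NULL || n_arcs == 0);
    double total = 0.0;
    for (size_t i = 0; i < n_tasks; i++)
        total += tasks[i].p_exec * (tasks[i].state_remote * tasks[i].state_size +
                                    tasks[i].data_remote * tasks[i].data_size);
    for (size_t r = 0; r < n_arcs; r++)
        total += arcs[r].p_both * arcs[r].queue_remote * arcs[r].queue_size;
    return total;
}
```

For a fixed allocation the remote-allocation flags are constants, so the function is linear in the probabilities.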

Scheduling problem model
Each task follows a five-phase behaviour: INPUT = input data reading; RS = internal state reading; EXEC = computation activity; WS = internal state writing; OUTPUT = output data writing. Activities are not breakable.
The adopted schema and the precedence relations vary with the type of the corresponding node (or/and, branch/fork).
Since the objective function depends only on the allocation, scheduling is just a feasibility problem. We decided to provide a unique worst-case schedule, forcing each task to execute after all its predecessors in any scenario.
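
The five-phase model and the feasibility view of scheduling can be illustrated for a single core. This is a simplified sketch, not the CP model of the slides: it assumes that phase durations are known integers, that the tasks of one core run back to back in a given order, and it ignores cross-core precedences and bus contention. The names TaskTiming, task_duration and core_meets_deadline are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* The five phases of a task, in execution order. */
enum { PH_INPUT, PH_RS, PH_EXEC, PH_WS, PH_OUTPUT, PH_COUNT };

/* Duration of each phase of one task (time units are arbitrary). */
typedef struct {
    int phase[PH_COUNT];
} TaskTiming;

/* Activities are not breakable, so a task occupies its core for the
 * sum of its five phases. */
int task_duration(const TaskTiming *t) {
    assert(t != NULL);
    int total = 0;
    for (int p = 0; p < PH_COUNT; p++)
        total += t->phase[p];
    return total;
}

/* Runs the tasks listed in order[] back to back on one core, stores
 * each finish time in finish[], and returns 1 if the last task ends
 * within the deadline. */
int core_meets_deadline(const TaskTiming *tasks, const int *order, size_t n,
                        int deadline, int *finish) {
    int now = 0;
    for (size_t k = 0; k < n; k++) {
        now += task_duration(&tasks[order[k]]);
        finish[k] = now;
    }
    return now <= deadline;
}
```

A full feasibility check would also have to enforce that a task starts only after all its predecessors, on any core, have completed.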

Efficient Application Development Support
Optimization tools generally rely on many simplifying assumptions. Neglecting these assumptions in the software implementation can generate unpredictable and undesired system-level interactions and make the overall system error-prone.
We propose a complete framework to help programmers in software implementation: a generic customizable application template → OFFLINE SUPPORT; a set of high-level APIs → ONLINE SUPPORT.
The main goals of our development framework are: exact and reliable execution of the application after the optimization step; guarantees about high performance and constraint satisfaction.

Customizable Application Template
Starting from a high-level task and data flow graph, software developers can easily and quickly build their application infrastructure. Programmers can intuitively translate the high-level representation into C code using our facilities and library.
Users can specify: the number of tasks included in the target application; their nature (e.g. branch, fork, or-node, and-node); their precedence constraints (e.g. due to data communication); thus quickly drawing the CTG schema of the application.
Programmers can focus on the functionality of the tasks: the main effort goes into the more specific and critical sections of the application.

OS-level and Task-level APIs
Users can easily reproduce optimizer solutions, thus indirectly neglecting the optimizer's abstractions (task model; communication model; OS overheads) and obtaining the needed application constraint satisfaction.
Programmers can allocate to the right hardware resources: tasks; program data; queues.
The APIs cover scheduling support and communication issues: shared queues; semaphores; interrupts.

Example
Number of nodes: 12. Number of CPUs: 2.
The template describes the graph of activities, the node type (Normal, Branch, Conditional, Terminator), the node behaviour (Or, And, Fork, Branch), the task allocation, the task scheduling and the arc priorities:

    #define TASK_NUMBER 12
    #define N_CPU 2
    //Node Type: 0 NORMAL; 1 BRANCH ; 2 STOCHASTIC
    uint node_type[TASK_NUMBER] = {1,2,2,1,..};
    //Node Behaviour: 0 AND ; 1 OR; 2 FORK; 3 BRANCH
    uint node_behaviour[TASK_NUMBER] = {2,3,3,..};
    uint queue_consumer [..] [..] = { {0,1,1,0,..}, {0,0,0,1,1,.}, {0,0,0,0,0,1,1..}, {0,0,0,0,..}..};
    uint task_on_core[TASK_NUMBER] = {1,1,2,1};
    int schedule_on_core[N_CPU][TASK_NUMBER] = {{1,2,4,8}..};

[Figure: CTG with 12 nodes connected by arcs a1 to a14, including fork, branch, or and and nodes, together with the resulting schedule of the tasks on two resources over time against the deadline.]
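
The allocation and scheduling tables in the fragment must agree with each other: every task should appear exactly once, in the schedule of the core it is allocated to. The sketch below checks this on a hypothetical four-task instance. It is not part of the authors' library: the function name allocation_matches_schedule, the NO_TASK padding value, the reduced TASK_NUMBER of 4 and the reading of the tables as 1-based core and task identifiers are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical instance size, smaller than the 12-node example. */
#define TASK_NUMBER 4
#define N_CPU 2
#define NO_TASK 0  /* assumed padding value for unused schedule slots */

/* Assumes 1-based identifiers: task_on_core[t - 1] is the core of task t,
 * and schedule_on_core[c - 1] lists the tasks of core c in execution order.
 * Returns 1 if every task appears exactly once, on its allocated core. */
int allocation_matches_schedule(const unsigned task_on_core[TASK_NUMBER],
                                int schedule_on_core[N_CPU][TASK_NUMBER]) {
    int seen[TASK_NUMBER] = {0};
    assert(task_on_core != NULL && schedule_on_core != NULL);
    for (int c = 0; c < N_CPU; c++) {
        for (int s = 0; s < TASK_NUMBER; s++) {
            int task = schedule_on_core[c][s];
            if (task == NO_TASK) continue;                /* empty slot */
            if (task < 1 || task > TASK_NUMBER) return 0; /* unknown task */
            if (task_on_core[task - 1] != (unsigned)(c + 1)) return 0;
            if (seen[task - 1]++) return 0;               /* scheduled twice */
        }
    }
    for (int t = 0; t < TASK_NUMBER; t++)
        if (!seen[t]) return 0;                           /* never scheduled */
    return 1;
}
```

Running such a check before deployment would catch a transcription error when the optimizer solution is copied into the template by hand.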

Queue ordering optimization
Communication ordering affects system performance.
[Figure: tasks T1 to T6 on CPU1 and CPU2 communicating through queues C1 to C5; a consumer task is marked "Wait!" or "RUN!" depending on the queue ordering.]

Synchronization among tasks
Non-blocking semaphores.
[Figure: tasks T1 to T4 on Proc. 1 and Proc. 2 communicating through queues C1 to C3; T4 is suspended and later re-activated.]

Application Development Methodology
Characterization phase: the simulator produces the application profiles used to characterize the CTG. Optimization phase: the optimizer computes allocation and scheduling. The application development support then leads to the optimal SW application implementation, followed by platform execution.

Computational Efficiency
Two groups of instances: slightly structured (very short tracks, quite often containing singleton nodes); completely structured (one head, one tail, long tracks).
The solution times are of the same order as in the deterministic case.

Validation of optimizer solutions
The optimal allocation and schedule are validated on the virtual platform: MAX error lower than 10%; AVG error equal to 4.8%, with a standard deviation of 2.41.

Validation of optimizer solutions
Differences are marginal. All the deadline constraints are satisfied.

Conclusions
Cooperative framework to solve the allocation and scheduling problem to optimality for conditional task graphs on MPSoCs; Logic-Based Benders Decomposition; new development methodology; solutions validated by means of a complete MPSoC virtual platform; experimental results proved the accuracy of the problem model.