1 Scheduling CEG 4131 Computer Architecture III Miodrag Bolic Slides developed by Dr. Hesham El-Rewini Copyright Hesham El-Rewini.



2 Outline
- Scheduling models
- Scheduling without considering communication
- Including communication in scheduling
- Heuristic algorithms

3 Scheduling Parallel Tasks
[Diagram: a sequential program passes through a dependence analyzer (implicit approach), or program tasks are specified directly (explicit approach); a partitioner turns grains of sequential code into program tasks ("ideal parallelism"), and a scheduler maps the tasks onto the processors of a parallel/distributed system over time, producing the schedule.]

4 Program Tasks
Task notation: (T, <, D, A)
- T: set of tasks
- <: partial order on T
- D: communication data
- A: amount of computation
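The four-tuple can be sketched as a small data structure; everything below (field names, the two-task example, the numbers) is illustrative, not from the slides.

```python
# Minimal sketch of the task-system notation (T, <, D, A).
# All names and values here are illustrative.
from dataclasses import dataclass

@dataclass
class TaskSystem:
    tasks: set        # T: the set of tasks
    precedes: set     # <: partial order on T, as a set of (u, v) pairs
    data: dict        # D: data[(u, v)] = communication data sent from u to v
    work: dict        # A: work[v] = amount of computation of task v

ts = TaskSystem(
    tasks={"A", "B"},
    precedes={("A", "B")},      # A must complete before B starts
    data={("A", "B"): 10},      # A sends 10 units of data to B
    work={"A": 5, "B": 15},
)
```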

5 Task Graph
[Figure: an example task graph with tasks A-I. Each node carries its amount of computation (A: 5, B: 15, C: 10, D: 15, E: 10, F: 20, G: 15, ...) and each edge carries the communication data of the dependency.]

6 Machine
- m heterogeneous processors, connected via an arbitrary interconnection network (network graph)
- Associated with each processor Pi is its speed Si
- Associated with each edge (i, j) is the transfer rate Rij

7 Task Schedule
- Gantt chart
- Mapping f of tasks to a processing element and a starting time
- Formally: f(v) = (i, t) means that task v is scheduled to be processed by processor i starting at time t
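One concrete (and purely illustrative) way to hold the mapping f is a dictionary from task to (processor, start time):

```python
# Sketch of the schedule mapping f(v) = (i, t); the tasks, processors,
# and times here are made up.
f = {"A": (1, 0), "B": (2, 0), "C": (1, 1)}

proc, start = f["C"]   # task C runs on processor 1 starting at time 1
```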

8 Gantt Chart

9 Gantt Chart with Communication

10 Execution and Communication Times
- If task ti is executed on pj, its execution time is Ai/Sj.
- The communication delay between ti and tj, when executed on adjacent processing elements pk and pl, is Dij/Rkl.
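A quick sketch of the two formulas with made-up computation amounts, speeds, data sizes, and transfer rates:

```python
# Execution time A_i / S_j and communication delay D_ij / R_kl,
# using illustrative values.
A = {"t1": 10}                  # computation amount of task t1
S = {"p1": 2.0}                 # speed of processor p1
D = {("t1", "t2"): 8}           # data sent from t1 to t2
R = {("p1", "p2"): 4.0}         # transfer rate of the link between p1 and p2

exec_time = A["t1"] / S["p1"]                    # 10 / 2 = 5.0
comm_delay = D[("t1", "t2")] / R[("p1", "p2")]   # 8 / 4 = 2.0
```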

11 Complexity
- Computationally intractable in general
- A small number of polynomial-time optimal algorithms in restricted cases
- A large number of heuristics in more general cases
- Quality of the schedule vs. quality of the scheduler

12 Scheduling Task Graphs Without Considering Communication
Polynomial-time optimal algorithms exist in the following cases:
1. The task graph is an in-forest (each node has at most one immediate successor) or an out-forest (each node has at most one immediate predecessor)
2. The task graph is an interval order

13 In-Forest vs. Out-Forest Structure
[Figure: an example in-forest and an example out-forest.]

14 Assumptions
- A task graph consisting of n tasks
- A distributed system made up of m processors
- The execution time of each task is one unit of time
- Communication between any pair of tasks is zero
- The goal is to find an optimal schedule, one that minimizes the completion time

15 List Scheduling
- All of the algorithms considered here belong to the list-scheduling class.
- Each task is assigned a priority, and a list of tasks is constructed in decreasing priority order.
- A task becomes ready for execution when all of its immediate predecessors in the task graph have been executed, or immediately if it has no predecessors.

16 Scheduling In-Forest/Out-Forest Task Graphs
1. The level of each node in the task graph (the number of nodes on the path from that node to the root of its tree) is calculated and used as the node's priority
2. Whenever a processor becomes available, assign it the unexecuted ready task with the highest priority
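The two rules can be sketched as a highest-level-first list scheduler for an in-forest of unit-time tasks with zero communication (the assumptions of slide 14); the five-task graph below is illustrative, not the slides' example.

```python
# Highest-level-first list scheduling of an in-forest on m processors.
# Each task takes one unit of time and communication costs nothing.
succ = {"A": "C", "B": "C", "C": "E", "D": "E", "E": None}  # in-forest: <=1 successor

def level(v):
    # Priority: number of nodes on the path from v to its root.
    return 1 if succ[v] is None else 1 + level(succ[v])

def hlf_schedule(tasks, m):
    preds = {v: {u for u in tasks if succ[u] == v} for v in tasks}
    done, gantt = set(), []
    while len(done) < len(tasks):
        # Ready = not yet executed and all immediate predecessors executed.
        ready = sorted((v for v in tasks if v not in done and preds[v] <= done),
                       key=level, reverse=True)
        step = ready[:m]   # each available processor takes the highest-priority ready task
        gantt.append(step)
        done |= set(step)
    return gantt

print(hlf_schedule(list(succ), 2))   # [['A', 'B'], ['C', 'D'], ['E']]
```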

17 Example 1: Simple List Scheduling
[Figure: a task graph, its priority assignment, and the resulting schedule.]

18 Example 2: Simple List Scheduling
Priority assignment:
Task:     A B C D E F G H I J K L M
Priority: 5 5 5 4 4 4 4 3 3 3 2 2 1

Schedule (4 processors):
t | P1 P2 P3 P4
1 | A  B  C  E
2 | D  F  G  H
3 | I  J  L
4 | K
5 | M

[Figure: the corresponding task graph with tasks A-M.]

19 Example 3: Simple List Scheduling
[Figure: a task graph with tasks A-M, its priority assignment, and the resulting schedule.]

20 Interval Orders
- A task graph is an interval order when its nodes can be mapped to intervals on the real line such that two elements are related iff the corresponding intervals do not overlap.
- For any interval-ordered pair of nodes u and v, either the successors of u are also successors of v, or the successors of v are also successors of u.

21 Scheduling Interval-Ordered Tasks
1. The number of successors of each node is used as the node's priority
2. Whenever a processor becomes available, assign it the unexecuted ready task with the highest priority
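A sketch of the priority rule on an illustrative graph; here "number of successors" is computed transitively, which is one common reading of the rule.

```python
# Priority of each node = its number of successors, counted transitively.
# The graph below is illustrative.
succs = {"A": {"C", "D"}, "B": {"D"}, "C": set(), "D": set()}

def all_successors(v):
    out = set(succs[v])
    for w in succs[v]:
        out |= all_successors(w)
    return out

priority = {v: len(all_successors(v)) for v in succs}
print(priority)   # {'A': 2, 'B': 1, 'C': 0, 'D': 0}
```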

22 Example 1: Scheduling Interval-Ordered Tasks
[Figure: a task graph, its priority assignment, and the resulting schedule.]

23 Example 2: Scheduling Interval-Ordered Tasks
Priority assignment:
Task:     A B C D E F G H I J
Priority: 8 6 5 5 4 1 3 0 0 0

Schedule (3 processors):
t | P1 P2 P3
1 | A  B
2 | C  D  E
3 | G  F
4 | H  I  J

[Figure: the corresponding task graph with tasks A-J.]

24 Example 3: Scheduling Interval-Ordered Tasks
[Figure: a task graph, its priority assignment, and the resulting schedule.]

25 Communication Models
- Completion time
  - Execution time
  - Communication time
- Completion time as two components
- Completion time from the Gantt chart

26 Completion Time as Two Components
- Completion time = execution time + total communication delay
- Total communication delay = number of communication messages × delay per message
- Execution time: the maximum finishing time of any task
- Number of communication messages: counted according to Model A or Model B
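With made-up values, the two-component formula reads:

```python
# Completion time = execution time + (number of messages * delay per message).
# All values here are illustrative.
execution_time = 3        # maximum finishing time of any task
num_messages = 4          # counted according to Model A or Model B
delay_per_message = 2

completion_time = execution_time + num_messages * delay_per_message
print(completion_time)    # 11
```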

27 Completion Time from the Gantt Chart (Model C)
- Completion time = schedule length
- This model assumes that an I/O processor accompanies every processor in the system
- The communication delay between two tasks allocated to the same processor is negligible
- Communication delay is counted only between two tasks assigned to different processors

28 Example
[Figure: a task graph with tasks A-E, each with execution time 1.]
Assume a system with 2 processors.

29 Models A and B
Assume tasks A, B, and D are assigned to P1 and tasks C and E are assigned to P2.
[Figure: the task graph (tasks A-E, each with execution time 1) and the assignment.]
- Model A: number of messages = 2; completion time =
- Model B: number of messages = 1; completion time =

30 Model C
[Figure: Gantt chart of tasks A-E on processors P1 and P2, with the communication delay shown between the processors.]

31 Models A, B, C Example
[Figure: a task graph with tasks A-M, and two task-assignment tables mapping the tasks onto processors P1-P3: one assignment for Models A and B, one for Model C. Communication delay is displayed in the graph for Models A and B; assume the execution time of each task is 1.]
- Model A: completion time = 3 + (2 × 4 + 2 × 3) = 17
- Model B: number of messages = 3; completion time = 3 + (2 × 4 + 1 × 3) = 14
- Model C: completion time = 8

32 Heuristics
- A heuristic produces an answer in less than exponential time but does not guarantee an optimal solution.
- Communication delay versus parallelism
- Clustering
- Duplication

33 Communication Delay versus Parallelism

34 Clustering

35 Clustering Example 1, Part 1
[Figure: a task graph with tasks A-G; each communication delay is 1, and idle slots in the schedule are marked NOP.]
Time | P1 | P2
  1  | A  |
  2  |    | B
  3  | C  |
  4  | D  |
  5  |    |
  6  | E  |
  7  |    |
  8  | F  |
  9  |    |
 10  | G  |

36 Clustering Example 1, Part 2
[Figure: the same task graph; clustering shortens the schedule to 9 time units.]
Time | P1 | P2
  1  | A  |
  2  |    | B
  3  | C  |
  4  | D  |
  5  |    |
  6  |    | E
  7  |    |
  8  | F  |
  9  | G  |

37 Clustering Example 2
[Figure: a task graph with tasks A-H (communication delay 1) and its two-processor schedule; tasks here may run for more than one time unit, and the schedule finishes at time 13 with NOP slots during communication.]

38 Duplications

39 Duplication Example (Using Clustering Example 1, Part 2)
[Figure: the same task graph; duplicating task A on both processors removes a communication delay and shortens the schedule from 9 to 8 time units.]

40 Scheduling and Grain Packing
Four major steps are involved in grain determination and the process of scheduling optimization:
1. Construct a fine-grain program graph.
2. Schedule the fine-grain computation.
3. Perform grain packing to produce the coarse grains.
4. Generate a parallel schedule based on the packed graph.

41 Program Decomposition for Static Multiprocessor Scheduling
Two 2 × 2 matrices A and B are multiplied to compute the sum of the four elements of the resulting product matrix C = A × B. There are eight multiplications and seven additions to be performed in this program, as written below:

42 Example 2.5 (Continued)
C11 = A11 × B11 + A12 × B21
C12 = A11 × B12 + A12 × B22
C21 = A21 × B11 + A22 × B21
C22 = A21 × B12 + A22 × B22
Sum = C11 + C12 + C21 + C22
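The expressions can be checked against a direct 2 × 2 product; the numeric matrices below are arbitrary test values, not part of the example.

```python
# Verify the eight multiplications and seven additions of Example 2.5:
# four additions form C11..C22, three more form the final Sum.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

C11 = A[0][0]*B[0][0] + A[0][1]*B[1][0]   # 1*5 + 2*7 = 19
C12 = A[0][0]*B[0][1] + A[0][1]*B[1][1]   # 1*6 + 2*8 = 22
C21 = A[1][0]*B[0][0] + A[1][1]*B[1][0]   # 3*5 + 4*7 = 43
C22 = A[1][0]*B[0][1] + A[1][1]*B[1][1]   # 3*6 + 4*8 = 50
Sum = C11 + C12 + C21 + C22               # 19 + 22 + 43 + 50 = 134
```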

[Slides 43-45: figures only; no transcript text.]