Self-Adapting Scheduling for Tasks with Dependencies in Stochastic Environments Ioannis Riakiotakis, Florina M. Ciorba, Theodore Andronikos and George Papakonstantinou

Presentation transcript:

Self-Adapting Scheduling for Tasks with Dependencies in Stochastic Environments Ioannis Riakiotakis, Florina M. Ciorba, Theodore Andronikos and George Papakonstantinou National Technical University of Athens, Greece 5th International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Networks (HeteroPar'06)

2 Talk outline
- The problem
- Inter-slave synchronization mechanism
- Overview of the Distributed Trapezoid Self-Scheduling algorithm
- Self-Adapting Scheduling
- Stochastic environment modeling
- Results
- Conclusions & future work

3 The problem we study
- Scheduling problems with task dependencies
- Target systems: stochastic systems, i.e., real non-dedicated heterogeneous systems with fluctuating load
- Approach: adaptive dynamic load balancing

4 Algorithmic model & notations

for (i1 = l1; i1 <= u1; i1++) {
  ...
  for (in = ln; in <= un; in++) {
    S1(I);
    ...
    Sk(I);   /* loop body */
  }
  ...
}

- Perfectly nested loops
- Constant flow data dependencies
- General program statements within the loop body
- J – index space of an n-dimensional uniform dependence loop
- The set of dependence vectors of the loop

5 More notations
- P_1, ..., P_m – slaves
- VP_k – virtual computing power of slave P_k
- Σ_{k=1..m} VP_k – total virtual computing power of the system
- q_k – number of processes/jobs in the run-queue of slave P_k, reflecting its total load
- A_k – available computing power of slave P_k
- Σ_{k=1..m} A_k – total available computing power of the system

6 Synchronization mechanism (1)
- C_i – chunk size at the i-th scheduling step
- V_i – projection of C_i along the scheduling dimension u_c
- u_1 is called the synchronization dimension, denoted u_s; synchronization points are introduced along u_s
- u_2 is called the scheduling dimension, denoted u_c; chunks are formed along u_c
(IPDPS'06)

7 Synchronization mechanism (2)
- SP_j – synchronization point
- M – the number of SPs along the synchronization dimension u_s
- H – the interval between two SPs; H is the same for every chunk
- SC_{i,j} – set of iterations of chunk C_i between SP_{j-1} and SP_j
- Current slave – the slave assigned chunk C_i
- Previous slave – the slave assigned chunk C_{i-1}

8 Synchronization mechanism (3)
- C_{i-1} is assigned to P_{k-1}, C_i to P_k and C_{i+1} to P_{k+1}
- When P_k reaches SP_{j+1}, it sends to P_{k+1} only the data P_{k+1} requires (i.e., those iterations imposed by the existing dependence vectors)
- Afterwards, P_k receives from P_{k-1} the data required for the current computation
[Figure: pipelined execution of chunks C_{i-1}, C_i, C_{i+1} on slaves P_{k-1}, P_k, P_{k+1}, showing the communication sets and the points computed at moments t and t+1 around SP_j, SP_{j+1}, SP_{j+2}]
- Slaves do not reach an SP at the same time -> a wavefront execution fashion
- H should be chosen so as to maintain the communication/computation ratio < 1

9 Overview of the Distributed Trapezoid Self-Scheduling (DTSS) Algorithm
- Divides the scheduling dimension into decreasing chunks
- First chunk is F = |u_c| / (2 × Σ_{k=1..m} A_k), where:
  - |u_c| – the size of the scheduling dimension
  - Σ_{k=1..m} A_k – the total available computational power of the system
- Last chunk is L = 1
- N = 2|u_c| / (F + L) – the number of scheduling steps
- D = (F - L) / (N - 1) – chunk decrement
- C_i = A_k × [F - D × (S_{k-1} + (A_k - 1)/2)], where S_{k-1} = A_1 + ... + A_{k-1}
- DTSS selects chunk sizes based on:
  - the virtual computational power of a processor, VP_k
  - the number of processes in the run-queue of each processor, q_k
(IPDPS'06)

10 Self-Adapting Scheduling – SAS (1)
- SAS is a self-scheduling scheme
- It is NOT a decrease-chunk algorithm
- Built upon the master-slave model
- Each chunk size is computed based on:
  - the history of computation times of previous chunks on the particular slave
  - the history of jobs in the run-queue of the particular slave
  - the current number of jobs in the run-queue of the particular slave
- Targeted for stochastic systems

11 Self-Adapting Scheduling – SAS (2)
Terminology used with SAS:
- V_k^1, ..., V_k^j – the sizes of the first j chunks assigned to P_k, and t_k^1, ..., t_k^j – their computation times
- V_k^1 > ... > V_k^j, and V_k^1 = 1
- q_k^1, ..., q_k^j – number of jobs in the run-queue of P_k when assigned its first j chunks
- the average time/iteration for the first j chunks of P_k
- the estimated computation time for executing its (j+1)-th chunk
- t_Ref – the execution time of the first chunk of the problem (reference time); all processors are expected to compute their chunks within t_Ref

12 SAS – description (1)
Master side:
Initialization:
(M.a) Register slaves; store each reported VP_k and initial load q_k^1.
(M.b) Sort slaves in decreasing order of their VP_k, considering VP_1 = 1; assign the first chunk to each slave, i.e., V_k^1.
While there are unassigned iterations do:
(M.1) Receive a request from P_k and store its reported q_k^{j+1} and t_k^j.
(M.2) Determine the size of the (j+1)-th chunk for P_k. (*) If P_k took longer than t_Ref to compute its j-th chunk, its (j+1)-th chunk is decreased, and vice versa.

13 SAS – description (2)
Slave side:
Initialization: Register with the master; report VP_k and initial load q_k^1.
(S.1) Send a request for work to the master; report the current load (q_k^{j+1}) and the time (t_k^j) spent completing the previous chunk.
(S.2) Wait for reply/work. If there is no more work to do, terminate. Else receive the size of the next chunk (V_k^{j+1}) and compute it.
(S.3) Exchange data at SPs as described in slide 7.
(S.4) Measure the completion time t_k^{j+1} for chunk V_k^{j+1}.
(S.5) Go to (S.1).

14 Stochastic Environment Modeling (1)
- Existing real non-dedicated systems have fluctuating load
- Fluctuating load is non-deterministic -> stochastic process modeling
- The inter-arrival time of incoming foreign jobs is considered exponentially distributed (λ – arrival rate)
- The lifetime of incoming foreign jobs is considered exponentially distributed (μ – service rate)

15 Stochastic Environment Modeling (2)
[Figure: system load fluctuation over time, showing foreign jobs arriving alongside the parallel job and the resulting run-queue lengths q_k^j (2+1, 1+1, 0+1, 1+1); three load cases are distinguished: fast (inter-arrival time < t_Ref), medium (inter-arrival time ~ t_Ref, λ ~ μ, inter-arrival time ~ service time) and slow (inter-arrival time > t_Ref)]

16 Implementation and testing setup
- The algorithms are implemented in C and C++
- MPI is used for master-slave and inter-slave communication
- The heterogeneous system consists of 7 dual-node machines (12+2 processors):
  - 3 Intel Pentium III machines, 1266 MHz with 1GB RAM (called zealots), assumed to have VP_k = 1
  - 4 Intel Pentium III machines, 800 MHz with 256MB RAM (called kids), assumed to have VP_k = 0.5 (one of them is the master)
- The interconnection network is Fast Ethernet, at 100 Mbit/s
- One real-life application: Floyd-Steinberg error dithering computation
- The synchronization interval is H = 100

17 Results with fast, medium and slow load fluctuations

18 Conclusions
- Loops with dependencies can now be dynamically scheduled on stochastic systems
- Adaptive load balancing algorithms efficiently compensate for the system's heterogeneity and foreign load fluctuations for loops with dependencies

19 Future work
- Establish a model for predicting the optimal synchronization interval H and minimizing the communication
- Model the foreign load with other probabilistic distributions and analyze the relationship between distribution type and performance

20 Thank you! Questions?