Dynamic Multi Phase Scheduling for Heterogeneous Clusters Florina M. Ciorba †, Theodore Andronikos †, Ioannis Riakiotakis †, Anthony T. Chronopoulos ‡

Dynamic Multi Phase Scheduling for Heterogeneous Clusters

Florina M. Ciorba†, Theodore Andronikos†, Ioannis Riakiotakis†, Anthony T. Chronopoulos‡ and George Papakonstantinou†

† National Technical University of Athens, Computing Systems Laboratory
‡ University of Texas at San Antonio

20th International Parallel and Distributed Processing Symposium (IPDPS 2006), April 2006

Outline

- Introduction
- Notation
- Some existing self-scheduling algorithms
- Dynamic self-scheduling for dependence loops
- Implementation and test results
- Conclusions
- Future work

Introduction

Motivation for dynamically scheduling loops with dependencies:
- Existing dynamic algorithms cannot cope with dependencies, because they lack inter-slave communication
- Static algorithms are not always efficient
- Applied in their original form to loops with dependencies, dynamic algorithms yield a serial or invalid execution

Notation

Algorithmic model:

FOR (i_1 = l_1; i_1 <= u_1; i_1++)
  FOR (i_2 = l_2; i_2 <= u_2; i_2++)
    ...
      FOR (i_n = l_n; i_n <= u_n; i_n++)
        Loop Body
      ENDFOR
    ...
  ENDFOR
ENDFOR

- Perfectly nested loops
- Constant flow data dependencies
- General program statements within the loop body
- J – index space of an n-dimensional uniform dependence loop
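
As a concrete illustration of the model, here is a minimal C sketch of a 2-dimensional loop of this form with constant flow dependencies (1,0) and (0,1); the bounds, the array and the coefficients are illustrative and are not the paper's benchmark codes.

```c
/* A 2-D instance of the algorithmic model above: a perfectly nested loop
 * with constant flow dependencies (1,0) and (0,1), similar in structure to
 * the heat-equation kernel used later in the experiments.
 * Bounds and coefficients are illustrative only. */
#include <stdio.h>

#define U1 1000   /* upper bound of i1 (synchronization dimension) */
#define U2 1000   /* upper bound of i2 (scheduling dimension)      */

static double A[U1 + 1][U2 + 1];

int main(void)
{
    for (int j = 0; j <= U2; j++) A[0][j] = 1.0;   /* boundary values */
    for (int i = 0; i <= U1; i++) A[i][0] = 1.0;

    for (int i1 = 1; i1 <= U1; i1++)        /* FOR (i1 = l1; i1 <= u1; i1++) */
        for (int i2 = 1; i2 <= U2; i2++)    /* FOR (i2 = l2; i2 <= u2; i2++) */
            /* loop body: iteration (i1,i2) depends on (i1-1,i2) and (i1,i2-1) */
            A[i1][i2] = 0.5 * (A[i1 - 1][i2] + A[i1][i2 - 1]);

    printf("A[%d][%d] = %f\n", U1, U2, A[U1][U2]);
    return 0;
}
```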

Notation (continued)

- u_1 – synchronization dimension; u_n – scheduling dimension
- Set of dependence vectors of the loop
- PE – processing element
- P_1, ..., P_m – slaves
- N – number of scheduling steps
- C_i – chunk size at the i-th scheduling step
- V_i – size (iteration-wise) of C_i along the scheduling dimension u_n
- VP_k – virtual computing power of slave P_k
- Q_k – number of processes in the run-queue of slave P_k
- A_k – available computing power of slave P_k
- A – total available computing power of the cluster
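
The slide does not spell out how A_k is obtained from VP_k and Q_k; the small C sketch below uses the weighting A_k = VP_k / Q_k of the usual DTSS formulation as an assumption, with illustrative VP and Q values, and sums the A_k to get the cluster total A.

```c
/* Sketch: per-slave available power A_k from virtual power VP_k and run-queue
 * length Q_k, plus the cluster total A. The weighting A_k = VP_k / Q_k is an
 * assumption (the usual DTSS definition); the slide only says that chunk
 * sizes are adjusted using VP_k and Q_k. */
#include <stdio.h>

#define M 4  /* number of slaves (illustrative) */

int main(void)
{
    double VP[M] = {1.5, 1.5, 0.5, 0.5};  /* virtual powers (zealots/kids) */
    int    Q[M]  = {1, 2, 1, 1};          /* processes in each run-queue  */
    double A[M], A_total = 0.0;

    for (int k = 0; k < M; k++) {
        A[k] = VP[k] / (double)Q[k];      /* available power of slave P_k */
        A_total += A[k];
        printf("P_%d: VP = %.2f, Q = %d, A = %.2f\n", k + 1, VP[k], Q[k], A[k]);
    }
    printf("Total available power A = %.2f\n", A_total);
    return 0;
}
```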

Some existing self-scheduling algorithms

- CSS and TSS are devised for homogeneous systems
- DTSS improves on TSS for heterogeneous systems by selecting the chunk sizes according to:
  - the virtual computing power of the slaves, VP_k
  - the number of processes in the run-queue of each PE, Q_k

Three self-scheduling algorithms (a chunk-size sketch follows this list):
- CSS – Chunk Self-Scheduling: C_i = constant
- TSS – Trapezoid Self-Scheduling: C_i = C_{i-1} − D, where D is the decrement, the first chunk is F = |J| / (2×m) and the last chunk is L = 1
- DTSS – Distributed TSS: C_i = C_{i-1} − D, where D is the decrement, the first chunk is F = |J| / (2×A) and the last chunk is L = 1

[Figure: chunks C_1, ..., C_N of sizes V_1, ..., V_N laid out along the scheduling dimension u_2, with u_1 the synchronization dimension, for CSS, TSS and DTSS.]
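
The decrement D and the number of scheduling steps N are not given on the slide; the sketch below assumes the standard TSS choices N = ceil(2|J| / (F + L)) and D = (F − L) / (N − 1), and treats DTSS simply as TSS with the first chunk F = |J| / (2×A), as stated above. It only illustrates the shape of the chunk sequences.

```c
/* Chunk-size recurrences of TSS and (simplified) DTSS. N and D follow the
 * usual TSS definitions, which are an assumption here; |J| and m match the
 * example on the next slide, and A is an illustrative total available power. */
#include <math.h>
#include <stdio.h>

/* Print the chunk sequence C_1, C_2, ... of a trapezoid schedule with first
 * chunk F and last chunk L over an index space of J iterations. */
static void trapezoid_chunks(const char *name, double J, double F, double L)
{
    int    N = (int)ceil(2.0 * J / (F + L));   /* number of scheduling steps */
    double D = (F - L) / (N - 1);              /* decrement between chunks   */
    double assigned = 0.0, C = F;

    printf("%s: F = %.0f, L = %.0f, N = %d, D = %.2f\n", name, F, L, N, D);
    for (int i = 1; i <= N && assigned < J; i++) {
        double chunk = (C > L) ? C : L;                  /* C_i = C_{i-1} - D, floored at L */
        if (assigned + chunk > J) chunk = J - assigned;  /* trim the final chunk */
        printf("  C_%d = %.0f\n", i, chunk);
        assigned += chunk;
        C -= D;
    }
}

int main(void)
{
    double J = 5000.0 * 10000.0;  /* |J| from the example on the next slide */
    double m = 10.0;              /* number of slaves                       */
    double A = 10.0;              /* total available power (illustrative)   */

    trapezoid_chunks("TSS ", J, J / (2.0 * m), 1.0);
    trapezoid_chunks("DTSS", J, J / (2.0 * A), 1.0);
    return 0;
}
```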

Some existing self-scheduling algorithms (chunk-size example)

[Table: chunk sizes produced by CSS, TSS, DTSS (dedicated) and DTSS (non-dedicated) for |J| = 5000×10000 and m = 10 slaves.]

- CSS and TSS give the same chunk sizes in dedicated and in non-dedicated systems, respectively
- DTSS adjusts the chunk sizes to match the different A_k of the slaves

More notation

- SP – synchronization point
- M – number of SPs inserted along the synchronization dimension u_1
- H – interval (iteration-wise) between two SPs along u_1; H is the same for every chunk
- SC_{i,j} – the set of iterations of C_i between SP_{j-1} and SP_j
- C_i = V_i × M × H (a small worked example follows)
- Current slave – the slave assigned chunk C_i
- Previous slave – the slave assigned chunk C_{i-1}
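
A short sketch of this decomposition, with illustrative values for u_1, H and V_i: the chunk C_i spans V_i iterations along the scheduling dimension and the whole synchronization dimension, which is cut into M = u_1 / H intervals of H iterations each, giving the sub-chunks SC_{i,j}.

```c
/* Decomposing a chunk C_i into sub-chunks SC_{i,j}. Values are illustrative. */
#include <stdio.h>

int main(void)
{
    int u1 = 100;        /* extent of the synchronization dimension            */
    int H  = 10;         /* iterations between two consecutive SPs along u_1   */
    int M  = u1 / H;     /* number of synchronization points                   */
    int Vi = 50;         /* extent of chunk C_i along the scheduling dimension */

    printf("|C_i| = V_i * M * H = %d iterations\n", Vi * M * H);
    for (int j = 1; j <= M; j++)
        /* SC_{i,j}: iterations of C_i whose u_1 coordinate lies between SP_{j-1} and SP_j */
        printf("SC_{i,%d}: u_1 in [%d, %d], %d x %d = %d iterations\n",
               j, (j - 1) * H + 1, j * H, Vi, H, Vi * H);
    return 0;
}
```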

Self-scheduling with synchronization

- Chunks are formed along the scheduling dimension, here say u_2
- SPs are inserted along the synchronization dimension, u_1
- Phase 1: apply a self-scheduling algorithm to the scheduling dimension
- Phase 2: insert synchronization points along the synchronization dimension

The inter-slave communication scheme

- C_{i-1} is assigned to P_{k-1}, C_i to P_k and C_{i+1} to P_{k+1}
- When P_k reaches SP_{j+1}, it sends to P_{k+1} only the data P_{k+1} requires (i.e., those iterations imposed by the existing dependence vectors)
- Afterwards, P_k receives from P_{k-1} the data required for its current computation
- Slaves do not reach an SP at the same time, which leads to a wavefront execution fashion

[Figure: chunks C_{i-1}, C_i, C_{i+1} on slaves P_{k-1}, P_k, P_{k+1}, with synchronization points SP_j, SP_{j+1}, SP_{j+2} and sub-chunks SC_{i-1,j+1}, SC_{i,j+1}; the legend marks the communication set, the points computed at moments t and t+1, and the inter-slave communication.]
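
A minimal MPI sketch of this exchange, assuming a single dependence vector crossing the chunk boundary and one fixed-size chunk per slave; the per-interval receive/compute/send order is rotated by one SP interval relative to the description above, which produces the same wavefront pipeline. The file name, sizes and computed expression are illustrative.

```c
/* Synchronization-point exchange: each rank owns V rows of a 2-D array (its
 * chunk along the scheduling dimension u_2); after computing each sub-chunk
 * of H columns (one SP interval along u_1) it sends its last row to the next
 * rank, which uses it as a halo.
 * Build with:  mpicc sp_exchange.c -o sp_exchange && mpirun -np 4 ./sp_exchange */
#include <mpi.h>
#include <stdio.h>

#define U1 64   /* synchronization dimension */
#define H   8   /* SP interval               */
#define V  16   /* rows per chunk            */

int main(int argc, char **argv)
{
    int rank, size;
    double chunk[V + 1][U1 + 1] = {{0}};   /* row 0 holds the halo from the previous slave */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int sp = 0; sp < U1 / H; sp++) {              /* one SP interval at a time */
        int j0 = sp * H + 1, j1 = (sp + 1) * H;

        /* receive from P_{k-1} the data required for the current computation */
        if (rank > 0)
            MPI_Recv(&chunk[0][j0], H, MPI_DOUBLE, rank - 1, sp,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        for (int i = 1; i <= V; i++)                   /* compute SC_{i,j} */
            for (int j = j0; j <= j1; j++)
                chunk[i][j] = 0.5 * (chunk[i - 1][j] + (j > 1 ? chunk[i][j - 1] : 1.0));

        /* send to P_{k+1} only the boundary row it requires */
        if (rank < size - 1)
            MPI_Send(&chunk[V][j0], H, MPI_DOUBLE, rank + 1, sp, MPI_COMM_WORLD);
    }

    printf("rank %d: chunk[%d][%d] = %f\n", rank, V, U1, chunk[V][U1]);
    MPI_Finalize();
    return 0;
}
```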

Dynamic Multi-Phase Scheduling DMPS(x)

INPUT:
(a) An n-dimensional dependence nested loop.
(b) The choice of the algorithm: CSS, TSS or DTSS.
(c) If CSS is chosen, the chunk size C_i.
(d) The synchronization interval H.
(e) The number of slaves m; in the case of DTSS, the virtual power VP_k of every slave.

Master:
Initialization:
(M.a) Register slaves. In the case of DTSS, slaves report their A_k.
(M.b) Calculate F, L, N, D for TSS and DTSS. For CSS use the given C_i.
While there are unassigned iterations do:
(M.1) If a request arrives, put it in the queue.
(M.2) Pick a request from the queue and compute the next chunk size using CSS, TSS or DTSS.
(M.3) Update the current and previous slave ids.
(M.4) Send the id of the current slave to the previous one.

Dynamic Multi-Phase Scheduling DMPS(x) (continued)

Slave P_k:
Initialization:
(S.a) Register with the master. In the case of DTSS, report A_k.
(S.b) Compute M according to the given H.
(S.1) Send a request to the master.
(S.2) Wait for the reply; if a chunk was received from the master, go to step (S.3), else go to OUTPUT.
(S.3) While the next SP is not reached, compute chunk i.
(S.4) If the id of the send-to slave is known, go to step (S.5), else go to step (S.6).
(S.5) Send the computed data to the send-to slave.
(S.6) Receive data from the receive-from slave and go to step (S.3).

OUTPUT:
Master: if there are no more chunks to be assigned, terminate.
Slave P_k: if no more tasks come from the master, terminate.
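
A stripped-down MPI skeleton of the request-reply protocol above, covering only the chunk-assignment loop (steps M.1-M.2, S.1-S.2 and the OUTPUT conditions). The SP-level inter-slave exchange (S.3-S.6) and the current/previous-slave bookkeeping (M.3-M.4) are omitted; a separate sketch of that exchange follows the communication-scheme slide earlier. Message tags and the chunk encoding are assumptions, not the paper's actual protocol.

```c
/* Master/slave request-reply skeleton with TSS-style chunking.
 * Build with:  mpicc dmps_skeleton.c -lm -o dmps && mpirun -np 4 ./dmps */
#include <math.h>
#include <mpi.h>
#include <stdio.h>

#define TAG_REQUEST 1
#define TAG_CHUNK   2

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long total = 10000;                    /* iterations along the scheduling dimension */

    if (rank == 0) {                       /* ----- master ----- */
        int    m = size - 1;               /* number of slaves */
        double F = total / (2.0 * m), L = 1.0;
        int    N = (int)ceil(2.0 * total / (F + L));
        double D = (F - L) / (N - 1), C = F;   /* (M.b) TSS parameters */
        long   next = 0;
        int    active = m;

        while (active > 0) {
            MPI_Status st;
            long req;
            /* (M.1) wait for a slave's request */
            MPI_Recv(&req, 1, MPI_LONG, MPI_ANY_SOURCE, TAG_REQUEST,
                     MPI_COMM_WORLD, &st);
            /* (M.2) compute the next chunk (start, size) and assign it */
            long chunk[2] = {next, (long)((C > L) ? C : L)};
            if (chunk[0] + chunk[1] > total) chunk[1] = total - chunk[0];
            if (chunk[1] <= 0) { chunk[1] = 0; active--; }   /* no more work for this slave */
            MPI_Send(chunk, 2, MPI_LONG, st.MPI_SOURCE, TAG_CHUNK, MPI_COMM_WORLD);
            next += chunk[1];
            C -= D;
        }
    } else {                               /* ----- slave P_k ----- */
        while (1) {
            long req = 0, chunk[2];
            /* (S.1) request work and (S.2) wait for the reply */
            MPI_Send(&req, 1, MPI_LONG, 0, TAG_REQUEST, MPI_COMM_WORLD);
            MPI_Recv(chunk, 2, MPI_LONG, 0, TAG_CHUNK, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (chunk[1] == 0) break;      /* OUTPUT: no more tasks, terminate */
            /* (S.3) the chunk [chunk[0], chunk[0]+chunk[1]) would be computed here */
            printf("slave %d got chunk: start = %ld, size = %ld\n",
                   rank, chunk[0], chunk[1]);
        }
    }

    MPI_Finalize();
    return 0;
}
```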

Dynamic Multi-Phase Scheduling DMPS(x): advantages

- Can take as input any self-scheduling algorithm, without any modifications
- Phase 2 is independent of Phase 1
- Phase 1 deals with the heterogeneity and load variation in the system
- Phase 2 deals with minimizing the inter-slave communication cost
- Suitable for any type of heterogeneous system

Implementation and testing setup

- The algorithms are implemented in C and C++
- MPI is used for master-slave and inter-slave communication
- The heterogeneous system consists of 10 machines:
  - 4 Intel Pentium III, 1266 MHz, 1 GB RAM (called zealots), assumed to have VP_k = 1.5 (one of them is the master)
  - 6 Intel Pentium III, 500 MHz, 512 MB RAM (called kids), assumed to have VP_k = 0.5
- Interconnection network: Fast Ethernet at 100 Mbit/s
- Dedicated system: all machines are dedicated to running the program and no other loads are interposed during the execution
- Non-dedicated system: at the beginning of the program's execution, a resource-expensive process is started on some of the slaves, halving their A_k

Implementation and testing setup (continued)

- System configuration: zealot1 (master), zealot2, kid1, zealot3, kid2, zealot4, kid3, kid4, kid5, kid6
- Three series of experiments on both dedicated and non-dedicated systems, for m = 3, 4, 5, 6, 7, 8, 9 slaves:
  1) DMPS(CSS)
  2) DMPS(TSS)
  3) DMPS(DTSS)
- Two real-life applications: the heat equation and the Floyd-Steinberg computation
- Speedup S_p is computed from T_{P_i}, the serial execution time on slave P_i, 1 ≤ i ≤ m, and T_PAR, the parallel execution time on m slaves
- In the plots of S_p, VP is used instead of m on the x-axis
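
A small helper showing one way to compute such a speedup; the min-based definition (best serial time over all slaves divided by the parallel time) is an assumption here, since the slide's formula is not reproduced above, and the timings are illustrative.

```c
/* Heterogeneous speedup: fastest serial time over all slaves divided by the
 * parallel time. The min-based definition is an assumption; timings are
 * illustrative values only. */
#include <stdio.h>

static double speedup(const double *t_serial, int m, double t_par)
{
    double t_min = t_serial[0];
    for (int i = 1; i < m; i++)
        if (t_serial[i] < t_min) t_min = t_serial[i];   /* fastest slave */
    return t_min / t_par;
}

int main(void)
{
    double t_serial[4] = {120.0, 118.5, 310.0, 305.2};  /* zealots vs kids */
    printf("S_p = %.2f\n", speedup(t_serial, 4, 41.3));
    return 0;
}
```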

Performance results – Heat equation and Floyd-Steinberg

[Tables: speedup of each series DMPS(CSS), DMPS(TSS) and DMPS(DTSS) as a function of the synchronization interval H and the number of slaves m, on the dedicated and on the non-dedicated system, for the heat equation and for the Floyd-Steinberg computation.]

Interpretation of the results

Dedicated system:
- As expected, all algorithms perform better on a dedicated system than on a non-dedicated one
- DMPS(TSS) slightly outperforms DMPS(CSS) for parallel loops, because it provides better load balancing
- DMPS(DTSS) outperforms both other algorithms because it explicitly accounts for the system's heterogeneity

Non-dedicated system:
- DMPS(DTSS) stands out even more, since the other algorithms cannot handle extra load variations
- The speedup for DMPS(DTSS) increases in all cases

Choice of H:
- H must be chosen so as to keep the communication/computation ratio below 1 for every test case
- Even then, small variations of the value of H do not significantly affect the overall performance

Conclusions

- Loops with dependencies can now be dynamically scheduled on heterogeneous dedicated and non-dedicated systems
- Distributed algorithms efficiently compensate for the system's heterogeneity for loops with dependencies, especially in non-dedicated systems

Future work

- Establish a model for predicting the optimal synchronization interval H and minimizing the communication
- Extend all other self-scheduling algorithms so that they can handle loops with dependencies and account for the system's heterogeneity

Thank you. Questions?