Dynamic Multi Phase Scheduling for Heterogeneous Clusters

Florina M. Ciorba †, Theodore Andronikos †, Ioannis Riakiotakis †, Anthony T. Chronopoulos ‡ and George Papakonstantinou †
† National Technical University of Athens, Computing Systems Laboratory
‡ University of Texas at San Antonio

20th International Parallel and Distributed Processing Symposium, April 2006

April 27, 2006, IPDPS 2006

Outline
- Introduction
- Notation
- Some existing self-scheduling algorithms
- Dynamic self-scheduling for dependence loops
- Implementation and test results
- Conclusions
- Future work

Introduction

Motivation for dynamically scheduling loops with dependencies:
- Existing dynamic algorithms cannot cope with dependencies, because they lack inter-slave communication.
- Static algorithms are not always efficient.
- If dynamic algorithms are applied in their original form to loops with dependencies, they yield a serial or invalid execution.

Notation

Algorithmic model:

    FOR (i_1 = l_1; i_1 <= u_1; i_1++)
      FOR (i_2 = l_2; i_2 <= u_2; i_2++)
        ...
          FOR (i_n = l_n; i_n <= u_n; i_n++)
            Loop Body
          ENDFOR
        ...
      ENDFOR
    ENDFOR

- Perfectly nested loops
- Constant flow data dependencies
- General program statements within the loop body
- J: index space of an n-dimensional uniform dependence loop
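To make the algorithmic model concrete, here is a minimal sketch of a perfectly nested loop with constant flow dependencies. The loop body (each point adds up the values it depends on) and the function name `run_uniform_loop` are hypothetical, chosen only for illustration.

```python
from itertools import product


def run_uniform_loop(lower, upper, deps):
    """Run an n-dimensional loop nest sequentially.

    lower, upper: inclusive bounds l_1..l_n and u_1..u_n.
    deps: constant dependence vectors; point j reads the value at j - d.
    The body is a made-up example: value(j) = 1 + sum of value(j - d).
    """
    value = {}
    ranges = [range(l, u + 1) for l, u in zip(lower, upper)]
    # product() visits the index space J in lexicographic order,
    # which is the order of the sequential loop nest.
    for j in product(*ranges):
        total = 1
        for d in deps:
            src = tuple(a - b for a, b in zip(j, d))
            total += value.get(src, 0)  # points outside J contribute nothing
        value[j] = total
    return value


# A 3x3 index space with dependence vectors (1,0) and (0,1).
result = run_uniform_loop((0, 0), (2, 2), [(1, 0), (0, 1)])
```

Because every point reads values produced at earlier points, the iterations cannot be handed out independently, which is the problem the rest of the talk addresses.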

Notation
- u_1: synchronization dimension; u_n: scheduling dimension
- the set of dependence vectors
- PE: processing element
- P_1, ..., P_m: slaves
- N: number of scheduling steps
- C_i: chunk size at the i-th scheduling step
- V_i: size (iteration-wise) of C_i along the scheduling dimension u_n
- VP_k: virtual computing power of slave P_k
- Q_k: number of processes in the run-queue of slave P_k
- A_k: available computing power of slave P_k
- A: total available computing power of the cluster
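The relation between these quantities is not spelled out on the slide. The sketch below assumes the usual DTSS definition, A_k = VP_k / Q_k, with A as the sum over all slaves; both the formula and the function names are assumptions made for illustration.

```python
def available_power(vp, q):
    """A_k for one slave, assuming A_k = VP_k / Q_k."""
    if q < 1:
        raise ValueError("the run-queue holds at least the scheduled process")
    return vp / q


def total_power(vps, qs):
    """A for the cluster, assumed to be the sum of all A_k."""
    return sum(available_power(vp, q) for vp, q in zip(vps, qs))


# One fast slave running alone and one slow slave sharing its CPU
# with a second process.
a_total = total_power([1.5, 0.5], [1, 2])
```

Under this assumption, a second process in the run-queue halves a slave's available power.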

Some existing self-scheduling algorithms

Three self-scheduling algorithms:
- CSS (Chunk Self-Scheduling): C_i = constant.
- TSS (Trapezoid Self-Scheduling): C_i = C_{i-1} - D, where D is the decrement; the first chunk is F = |J|/(2×m) and the last chunk is L = 1.
- DTSS (Distributed TSS): C_i = C_{i-1} - D, where D is the decrement; the first chunk is F = |J|/(2×A) and the last chunk is L = 1.

CSS and TSS are devised for homogeneous systems. DTSS improves on TSS for heterogeneous systems by selecting the chunk sizes according to:
- the virtual computing power of the slaves, VP_k
- the number of processes in the run-queue of each PE, Q_k

[Figure: the (u_1, u_2) index space cut into chunks C_{i-1}, C_i, C_{i+1} of widths V_1, ..., V_{i-1}, V_i, V_{i+1}, ..., V_N along u_2, shown for CSS, TSS and DTSS.]
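The CSS and TSS rules can be sketched as follows. The slide gives only F and L; the number of steps N = ceil(2|J|/(F+L)) and the decrement D = floor((F-L)/(N-1)) used here are the standard TSS definitions and are an assumption with respect to this talk. The function names are hypothetical. Passing the total available power A instead of m yields the DTSS first chunk F = |J|/(2×A), but the per-slave scaling of DTSS is not modelled.

```python
import math


def css_chunks(total, chunk):
    """CSS: every chunk has the same size (the final one may be smaller)."""
    if chunk < 1:
        raise ValueError("chunk size must be positive")
    chunks, remaining = [], total
    while remaining > 0:
        c = min(chunk, remaining)
        chunks.append(c)
        remaining -= c
    return chunks


def tss_chunks(total, power, last=1):
    """TSS: chunk sizes decrease linearly from F towards L.

    power is m for TSS; N and D follow the standard TSS formulas (assumed).
    """
    first = max(int(total / (2 * power)), last)               # F
    steps = math.ceil(2 * total / (first + last))             # N
    dec = (first - last) // (steps - 1) if steps > 1 else 0   # D
    chunks, remaining, size = [], total, first
    while remaining > 0:
        c = min(max(size, last), remaining)
        chunks.append(c)
        remaining -= c
        size -= dec
    return chunks


css = css_chunks(10, 4)
tss = tss_chunks(1000, 4)
```

For 1000 iterations and m = 4 this gives F = 125 and D = 8, so the first chunks are 125, 117, 109 and so on until the iterations run out.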

Some existing self-scheduling algorithms

Example: |J| = 5000×10000, m = 10 slaves.

[Table: chunk sizes per algorithm, with rows CSS, TSS, DTSS (dedicated) and DTSS (non-dedicated); the values are not preserved in this transcript.]

- CSS and TSS give the same chunk sizes in both dedicated and non-dedicated systems.
- DTSS adjusts the chunk sizes to match the different A_k of the slaves.

More notation
- SP: synchronization point
- M: number of SPs inserted along the synchronization dimension u_1
- H: interval (iteration-wise) between two SPs along u_1; H is the same for every chunk
- SC_{i,j}: the set of iterations of C_i between SP_{j-1} and SP_j
- C_i = V_i × M × H
- Current slave: the slave assigned chunk C_i
- Previous slave: the slave assigned chunk C_{i-1}

Self-scheduling with synchronization
- Chunks are formed along the scheduling dimension, here say u_2.
- SPs are inserted along the synchronization dimension, u_1.
- Phase 1: apply self-scheduling algorithms to the scheduling dimension.
- Phase 2: insert synchronization points along the synchronization dimension.
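The two phases can be sketched on a 2D index space as below. Phase 1 is represented by a list of chunk widths V_i that some self-scheduling rule has already produced; phase 2 cuts every chunk at the same synchronization points. The function names and the half-open rectangle representation are illustrative assumptions.

```python
import math


def partition(u1_size, chunk_widths, h):
    """Split a u1_size-by-sum(chunk_widths) index space into sub-chunks.

    chunk_widths: V_i along the scheduling dimension u_2 (phase 1 output).
    h: synchronization interval H along u_1 (phase 2 input).
    Returns, per chunk, a list of rectangles (u1_lo, u1_hi, u2_lo, u2_hi)
    with half-open bounds, one per interval between consecutive SPs.
    """
    m_sps = math.ceil(u1_size / h)  # M
    chunks, u2_lo = [], 0
    for v in chunk_widths:
        subs = []
        for j in range(m_sps):
            u1_lo = j * h
            u1_hi = min(u1_lo + h, u1_size)  # last interval may be shorter
            subs.append((u1_lo, u1_hi, u2_lo, u2_lo + v))
        chunks.append(subs)
        u2_lo += v
    return chunks


def sub_chunk_size(rect):
    u1_lo, u1_hi, u2_lo, u2_hi = rect
    return (u1_hi - u1_lo) * (u2_hi - u2_lo)


parts = partition(6, [3, 2], 2)
```

With u_1 of extent 6, H = 2 and widths 3 and 2, each chunk has M = 3 sub-chunks, and the chunk sizes are 3×3×2 = 18 and 2×3×2 = 12, matching C_i = V_i × M × H.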

The inter-slave communication scheme
- C_{i-1} is assigned to P_{k-1}, C_i to P_k and C_{i+1} to P_{k+1}.
- When P_k reaches SP_{j+1}, it sends to P_{k+1} only the data P_{k+1} requires (i.e., the iterations imposed by the existing dependence vectors).
- Afterwards, P_k receives from P_{k-1} the data required for the current computation.
- Slaves do not reach an SP at the same time, which leads to a wavefront-style execution.

[Figure: chunks C_{i-1}, C_i, C_{i+1} assigned to P_{k-1}, P_k, P_{k+1}, with synchronization points SP_j, SP_{j+1}, SP_{j+2} and sub-chunks SC_{i-1,j+1} and SC_{i,j+1}. The legend marks the communication set, the set of points computed at moment t, the set of points computed at moment t+1, arrows indicating communication, and auxiliary explanations.]
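The wavefront can be illustrated with a small timing sketch. It assumes that every sub-chunk takes one time unit, that communication is free, and that each chunk runs on its own slave; none of these assumptions come from the talk, and the function name is hypothetical.

```python
def wavefront_schedule(num_chunks, num_sps):
    """Earliest start time of each sub-chunk SC_{i,j}.

    SC_{i,j} can start once the same slave has finished SC_{i,j-1} and
    the previous slave has finished (and sent the data of) SC_{i-1,j}.
    """
    start = [[0] * num_sps for _ in range(num_chunks)]
    for i in range(num_chunks):
        for j in range(num_sps):
            after_own = start[i][j - 1] + 1 if j > 0 else 0
            after_prev = start[i - 1][j] + 1 if i > 0 else 0
            start[i][j] = max(after_own, after_prev)
    return start


times = wavefront_schedule(3, 4)
```

Under these assumptions SC_{i,j} starts at time i + j (counting from zero), so neighbouring slaves always work one synchronization interval apart.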

Dynamic Multi-Phase Scheduling DMPS(x)

INPUT:
(a) An n-dimensional dependence nested loop.
(b) The choice of the algorithm: CSS, TSS or DTSS.
(c) If CSS is chosen, the chunk size C_i.
(d) The synchronization interval H.
(e) The number of slaves m; in case of DTSS, the virtual power VP_k of every slave.

Master:
Initialization:
(M.a) Register slaves. In case of DTSS, slaves report their A_k.
(M.b) Calculate F, L, N, D for TSS and DTSS. For CSS use the given C_i.
While there are unassigned iterations do:
(M.1) If a request arrives, put it in the queue.
(M.2) Pick a request from the queue, and compute the next chunk size using CSS, TSS or DTSS.
(M.3) Update the current and previous slave ids.
(M.4) Send the id of the current slave to the previous one.
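The master loop (M.1) to (M.4) can be sketched without any messaging as a plain function. In this simplified sketch the requests are given up front in arrival order and the chunk sizes are precomputed, whereas the real master computes each size when it serves a request; the function name and tuple layout are hypothetical.

```python
from collections import deque


def master_assign(requests, chunk_sizes):
    """Assign chunks to requesting slaves in arrival order.

    Returns (assignments, notices):
      assignments: (chunk index, current slave, chunk size, previous slave)
      notices: (previous slave, current slave) pairs, i.e. the ids the
               master sends so each slave learns its send-to neighbour.
    """
    queue = deque(requests)                       # (M.1)
    assignments, notices = [], []
    previous = None
    for i, size in enumerate(chunk_sizes):
        if not queue:
            break                                 # no pending request
        current = queue.popleft()                 # (M.2)
        assignments.append((i, current, size, previous))
        if previous is not None:
            notices.append((previous, current))   # (M.4)
        previous = current                        # (M.3)
    return assignments, notices


assignments, notices = master_assign(["P1", "P2", "P3"], [5, 4, 3])
```

The notices are what let a slave start computing before it knows who will receive its boundary data.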

Dynamic Multi-Phase Scheduling DMPS(x)

Slave P_k:
Initialization:
(S.a) Register with the master. In case of DTSS, report A_k.
(S.b) Compute M according to the given H.
(S.1) Send a request to the master.
(S.2) Wait for a reply; if a chunk is received from the master, go to (S.3), else go to OUTPUT.
(S.3) While the next SP is not reached, compute chunk i.
(S.4) If the id of the send-to slave is known, go to (S.5), else go to (S.6).
(S.5) Send the computed data to the send-to slave.
(S.6) Receive data from the receive-from slave and go to (S.3).

OUTPUT:
Master: if there are no more chunks to be assigned, terminate.
Slave P_k: if no more tasks come from the master, terminate.
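For one chunk, steps (S.3) to (S.6) amount to a compute, send, receive cycle per synchronization point. The sketch below only records that cycle as a list of actions, in the order the slide gives them; it does no real communication, treats a missing neighbour as `None` (a case the slide does not spell out), and uses hypothetical names throughout.

```python
def slave_trace(chunk, num_sps, receive_from=None, send_to=None):
    """List the actions a slave performs on one chunk, SP by SP."""
    actions = []
    for j in range(1, num_sps + 1):
        actions.append(("compute", chunk, j))         # (S.3) up to SP_j
        if send_to is not None:                       # (S.4)
            actions.append(("send", send_to, j))      # (S.5)
        if receive_from is not None:
            actions.append(("receive", receive_from, j))  # (S.6)
    return actions


# A slave in the middle of the pipeline, with two SPs in its chunk.
trace = slave_trace(1, 2, receive_from="P1", send_to="P3")
```

A slave holding the first chunk has no receive-from neighbour, and a slave that has not yet been told its send-to id skips the send, so its trace contains only the remaining actions.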

Advantages of DMPS(x)
- Can take as input any self-scheduling algorithm, without any modifications.
- Phase 2 is independent of Phase 1.
- Phase 1 deals with the heterogeneity and load variation in the system.
- Phase 2 deals with minimizing the inter-slave communication cost.
- Suitable for any type of heterogeneous system.

Implementation and testing setup
- The algorithms are implemented in C and C++.
- MPI is used for master-slave and inter-slave communication.
- The heterogeneous system consists of 10 machines:
  - 4 Intel Pentium III machines, 1266 MHz with 1 GB RAM (called zealots), assumed to have VP_k = 1.5 (one of them is the master)
  - 6 Intel Pentium III machines, 500 MHz with 512 MB RAM (called kids), assumed to have VP_k = 0.5
- The interconnection network is Fast Ethernet, at 100 Mbit/s.
- Dedicated system: all machines are dedicated to running the program and no other loads are interposed during the execution.
- Non-dedicated system: at the beginning of the program's execution, a resource-expensive process is started on some of the slaves, halving their A_k.

Implementation and testing setup
- System configuration: zealot1 (master), zealot2, kid1, zealot3, kid2, zealot4, kid3, kid4, kid5, kid6.
- Three series of experiments for both dedicated and non-dedicated systems, for m = 3, 4, 5, 6, 7, 8, 9 slaves:
  1) DMPS(CSS)
  2) DMPS(TSS)
  3) DMPS(DTSS)
- Two real-life applications: heat equation and Floyd-Steinberg computation.
- Speedup S_p is computed with S_p = min{T_P1, T_P2, ..., T_Pm} / T_PAR, where T_Pi is the serial execution time on slave P_i, 1 ≤ i ≤ m, and T_PAR is the parallel execution time (on m slaves).
- In the plots of S_p, VP is used instead of m on the x-axis.
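A small sketch of the two quantities used in the plots, assuming that speedup is measured against the fastest slave's serial time and that an experiment with m slaves uses the first m slaves of the listed configuration. Both assumptions, the function names and the timing values in the example are illustrative and are not measurements from the talk.

```python
# Virtual powers assumed on the setup slide.
VP = {"zealot": 1.5, "kid": 0.5}

# Slaves in the order of the system configuration (zealot1 is the master).
SLAVE_ORDER = ["zealot2", "kid1", "zealot3", "kid2", "zealot4",
               "kid3", "kid4", "kid5", "kid6"]


def cluster_vp(m):
    """Total virtual power of the first m slaves (the x-axis value)."""
    return sum(VP[name.rstrip("0123456789")] for name in SLAVE_ORDER[:m])


def speedup(serial_times, t_par):
    """S_p relative to the fastest serial run among the m slaves (assumed)."""
    return min(serial_times) / t_par


x_for_three_slaves = cluster_vp(3)           # zealot2 + kid1 + zealot3
example = speedup([10.0, 30.0, 10.0], 5.0)   # made-up timings
```

Plotting against total virtual power rather than m reflects that adding a kid contributes a third of what adding a zealot does.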

Performance results: heat equation, dedicated system
[Table: results for each synchronization interval H, by series tested (1) DMPS(CSS), 2) DMPS(TSS), 3) DMPS(DTSS)) and number of slaves m; the values are not preserved in this transcript.]

Performance results: heat equation, non-dedicated system
[Table: results for each synchronization interval H, by series tested (1) DMPS(CSS), 2) DMPS(TSS), 3) DMPS(DTSS)) and number of slaves m; the values are not preserved in this transcript.]

Performance results: Floyd-Steinberg, dedicated system
[Table: results for each synchronization interval H, by series tested (1) DMPS(CSS), 2) DMPS(TSS), 3) DMPS(DTSS)) and number of slaves m; the values are not preserved in this transcript.]

Performance results: Floyd-Steinberg, non-dedicated system
[Table: results for each synchronization interval H, by series tested (1) DMPS(CSS), 2) DMPS(TSS), 3) DMPS(DTSS)) and number of slaves m; the values are not preserved in this transcript.]

Interpretation of the results

Dedicated system:
- As expected, all algorithms perform better on a dedicated system than on a non-dedicated one.
- DMPS(TSS) slightly outperforms DMPS(CSS) for parallel loops, because it provides better load balancing.
- DMPS(DTSS) outperforms both other algorithms because it explicitly accounts for the system's heterogeneity.

Non-dedicated system:
- DMPS(DTSS) stands out even more, since the other algorithms cannot handle extra load variations.
- The speedup for DMPS(DTSS) increases in all cases.

Choice of H:
- H must be chosen so as to keep the communication-to-computation ratio below 1 for every test case.
- Even then, small variations in the value of H do not significantly affect the overall performance.

Conclusions
- Loops with dependencies can now be dynamically scheduled on heterogeneous dedicated and non-dedicated systems.
- Distributed algorithms efficiently compensate for the system's heterogeneity for loops with dependencies, especially in non-dedicated systems.

Future work
- Establish a model for predicting the optimal synchronization interval H and minimizing the communication.
- Extend all other self-scheduling algorithms so that they can handle loops with dependencies and account for the system's heterogeneity.

Thank you

Questions?