1
Dynamic Multi Phase Scheduling for Heterogeneous Clusters
Florina M. Ciorba†, Theodore Andronikos†, Ioannis Riakiotakis†, Anthony T. Chronopoulos‡ and George Papakonstantinou†
† National Technical University of Athens, Computing Systems Laboratory
‡ University of Texas at San Antonio
cflorina@cslab.ece.ntua.gr
www.cslab.ece.ntua.gr
20th International Parallel and Distributed Processing Symposium, 25-29 April 2006
2
April 27, 2006, IPDPS 2006
Outline
- Introduction
- Notation
- Some existing self-scheduling algorithms
- Dynamic self-scheduling for dependence loops
- Implementation and test results
- Conclusions
- Future work
3
Introduction
Motivation for dynamically scheduling loops with dependencies:
- Existing dynamic algorithms cannot cope with dependencies, because they lack inter-slave communication.
- Static algorithms are not always efficient.
- Applied in their original form to loops with dependencies, dynamic algorithms yield a serial or invalid execution.
5
Notation
Algorithmic model:
  FOR (i_1 = l_1; i_1 <= u_1; i_1++)
    FOR (i_2 = l_2; i_2 <= u_2; i_2++)
      …
        FOR (i_n = l_n; i_n <= u_n; i_n++)
          Loop Body
        ENDFOR
      …
  ENDFOR
- Perfectly nested loops
- Constant flow data dependencies
- General program statements within the loop body
- J: index space of an n-dimensional uniform dependence loop
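To make the model concrete, here is a minimal Python sketch of a uniform dependence loop. The two dependence vectors (1, 0) and (0, 1), the 4×4 index space and the helper name `predecessors` are illustrative assumptions, not taken from the slides. The point is only that every iteration depends on iterations at a constant offset inside the index space J.

```python
def predecessors(point, dep_vectors, lower, upper):
    """Return the iterations that `point` depends on, restricted to the index space."""
    result = []
    for d in dep_vectors:
        # In a uniform dependence loop, iteration j depends on j - d for each vector d.
        source = tuple(p - v for p, v in zip(point, d))
        inside = all(lo <= s <= hi for s, lo, hi in zip(source, lower, upper))
        if inside:
            result.append(source)
    return result


# Hypothetical 2-D example: index space [1..4] x [1..4], dependencies (1, 0) and (0, 1).
DEPS = [(1, 0), (0, 1)]
LOWER, UPPER = (1, 1), (4, 4)

print(predecessors((2, 3), DEPS, LOWER, UPPER))  # interior point: two predecessors
print(predecessors((1, 1), DEPS, LOWER, UPPER))  # corner point: none inside J
```

Iterations with no predecessor inside J can start immediately; all others must wait, which is why a plain self-scheduling scheme cannot hand out chunks of such a loop independently.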
6
Notation
- u_1: synchronization dimension; u_n: scheduling dimension
- (symbol not preserved in this transcript): set of dependence vectors
- PE: processing element
- P_1, ..., P_m: slaves
- N: number of scheduling steps
- C_i: chunk size at the i-th scheduling step
- V_i: size (iteration-wise) of C_i along scheduling dimension u_n
- VP_k: virtual computing power of slave P_k
- Q_k: number of processes in the run-queue of slave P_k
- A_k: available computing power of slave P_k
- A: total available computing power of the cluster
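The defining formulas for the available computing power of a slave (written A_k later in the slides) and of the whole cluster (A) did not survive in this transcript. The sketch below assumes A_k = VP_k / Q_k and A = sum of all A_k. This is an assumption, although it agrees with the later remark that starting one extra process on a slave halves its A_k. The function names are hypothetical.

```python
def available_power(vp, q):
    """A_k under the ASSUMED definition A_k = VP_k / Q_k (q = run-queue length >= 1)."""
    if q < 1:
        raise ValueError("run-queue length must be at least 1")
    return vp / q


def total_available_power(slaves):
    """A, assumed to be the sum of A_k over (VP_k, Q_k) pairs."""
    return sum(available_power(vp, q) for vp, q in slaves)


# Hypothetical cluster: two unloaded slaves and one fast slave running an extra process.
cluster = [(1.5, 1), (0.5, 1), (1.5, 2)]
print(total_available_power(cluster))
```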
8
Some existing self-scheduling algorithms
- CSS and TSS are devised for homogeneous systems.
- DTSS improves on TSS for heterogeneous systems by selecting the chunk sizes according to:
  - the virtual computational power of the slaves, VP_k
  - the number of processes in the run-queue of each PE, Q_k
Three self-scheduling algorithms:
- CSS (Chunk Self-Scheduling): C_i = constant
- TSS (Trapezoid Self-Scheduling): C_i = C_{i-1} - D, where D is the decrement; the first chunk is F = |J|/(2×m) and the last chunk is L = 1.
- DTSS (Distributed TSS): C_i = C_{i-1} - D, where D is the decrement; the first chunk is F = |J|/(2×A) and the last chunk is L = 1.
[Figure: chunks C_{i-1}, C_i, C_{i+1} with sizes V_{i-1}, V_i, V_{i+1} (from V_1 to V_N) along u_2, shown for CSS, TSS and DTSS; axes u_1 and u_2.]
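The CSS and TSS rules can be sketched in a few lines of Python. The slide gives only F and L, so the formulas for the number of steps N = ceil(2|J| / (F + L)) and the decrement D = floor((F - L) / (N - 1)) are the usual TSS definitions and are an assumption here, as are the rounding choices and the function names. With 5000 iterations along the scheduling dimension and 9 working slaves (also an assumption), the sketch yields a first chunk of 277, a decrement of 7 and a final remainder chunk of 73.

```python
import math


def css_chunks(total, chunk):
    """CSS: constant chunk size; the final chunk is whatever remains."""
    sizes, remaining = [], total
    while remaining > 0:
        size = min(chunk, remaining)
        sizes.append(size)
        remaining -= size
    return sizes


def tss_chunks(total, m, last=1):
    """TSS: linearly decreasing chunks, C_i = C_{i-1} - D."""
    first = max(total // (2 * m), last)               # F = |J| / (2m), rounded down
    n_steps = math.ceil(2 * total / (first + last))   # N (assumed standard TSS formula)
    decrement = (first - last) // (n_steps - 1) if n_steps > 1 else 0  # D, rounded down
    sizes, remaining, size = [], total, first
    while remaining > 0:
        current = min(max(size, last), remaining)     # never below L, never past the end
        sizes.append(current)
        remaining -= current
        size -= decrement
    return sizes


print(css_chunks(5000, 300))
print(tss_chunks(5000, 9))
```

DTSS is not sketched because the slides do not state how a chunk is scaled by the requesting slave's A_k.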
9
Some existing self-scheduling algorithms
|J| = 5000×10000, m = 10 slaves
Algorithm: chunk sizes
CSS: 300 300 300 300 300 300 300 300 300 300 300 300 300 300 300 300 200
TSS: 277 270 263 256 249 242 235 228 221 214 207 200 193 186 179 172 165 158 151 144 137 130 123 116 109 102 73
DTSS (dedicated): 392 253 368 237 344 221 108 211 103 300 192 276 176 176 252 160 77 149 72 207 130 183 114 159 98 46 87 41 44
DTSS (non-dedicated): 263 383 369 355 229 112 219 107 209 203 293 279 265 169 33 96 46 89 86 83 80 77 74 24 69 66 31 59 56 53 50 47 44 20 39 20 33 30 27 24 21 20 20 20 20 20 20 20 20 8
- CSS and TSS give the same chunk sizes in both dedicated and non-dedicated systems.
- DTSS adjusts the chunk sizes to match the different A_k of the slaves.
11
More notation
- SP: synchronization point
- M: number of SPs inserted along synchronization dimension u_1
- H: interval (iteration-wise) between two SPs along u_1; H is the same for every chunk
- SC_{i,j}: the set of iterations of C_i between SP_{j-1} and SP_j
- C_i = V_i × M × H
- Current slave: the slave assigned chunk C_i
- Previous slave: the slave assigned chunk C_{i-1}
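A small sketch of how H splits a chunk, assuming the extent of u_1 is an exact multiple of H (which the relation C_i = V_i × M × H implies) and that the last SP coincides with the end of the dimension. The function names and the sample numbers are hypothetical.

```python
def sync_points(u1_extent, h):
    """Positions of the M synchronization points along u_1, spaced H iterations apart."""
    if h <= 0 or u1_extent % h != 0:
        raise ValueError("this sketch assumes the u_1 extent is a positive multiple of H")
    m = u1_extent // h                       # M = number of SPs
    return [j * h for j in range(1, m + 1)]


def subchunk_sizes(v_i, u1_extent, h):
    """Sizes of SC_{i,1} .. SC_{i,M}: each covers V_i x H iterations."""
    return [v_i * h for _ in sync_points(u1_extent, h)]


sizes = subchunk_sizes(277, 10000, 2500)
print(sync_points(10000, 2500))   # [2500, 5000, 7500, 10000]
print(sum(sizes) == 277 * 4 * 2500)  # total equals C_i = V_i x M x H
```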
12
Self-scheduling with synchronization
- Chunks are formed along the scheduling dimension, here, say, u_2.
- SPs are inserted along the synchronization dimension, u_1.
- Phase 1: Apply self-scheduling algorithms to the scheduling dimension.
- Phase 2: Insert synchronization points along the synchronization dimension.
13
The inter-slave communication scheme
- C_{i-1} is assigned to P_{k-1}, C_i to P_k and C_{i+1} to P_{k+1}.
- When P_k reaches SP_{j+1}, it sends to P_{k+1} only the data P_{k+1} requires (i.e., those iterations imposed by the existing dependence vectors).
- Afterwards, P_k receives from P_{k-1} the data required for the current computation.
- Slaves do not reach an SP at the same time, which leads to a wavefront-style execution.
[Figure: chunks C_{i-1}, C_i, C_{i+1} on slaves P_{k-1}, P_k, P_{k+1}, with synchronization points SP_j, SP_{j+1}, SP_{j+2} and sub-chunks SC_{i-1,j+1} and SC_{i,j+1}. Legend: communication set; set of points computed at moment t; set of points computed at moment t+1; arrows indicate communication.]
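The wavefront pattern can be illustrated with a toy timing model. It assumes, purely for illustration, that every chunk runs on its own slave, that every sub-chunk takes one time unit, and that SC_{i,j} can start only after SC_{i,j-1} (same slave) and SC_{i-1,j} (previous slave, whose data must arrive) have finished. The function name is hypothetical and communication time is ignored.

```python
def wavefront_times(num_chunks, num_sps):
    """finish[i][j] = time step at which sub-chunk SC_{i,j} completes (unit cost each)."""
    finish = [[0] * num_sps for _ in range(num_chunks)]
    for i in range(num_chunks):
        for j in range(num_sps):
            from_previous_slave = finish[i - 1][j] if i > 0 else 0
            from_own_chunk = finish[i][j - 1] if j > 0 else 0
            # A sub-chunk starts once both of its inputs are available.
            finish[i][j] = max(from_previous_slave, from_own_chunk) + 1
    return finish


for row in wavefront_times(3, 4):
    print(row)
```

In this model sub-chunks with the same i + j finish at the same moment, so the active computation moves diagonally across chunks and synchronization points.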
14
Dynamic Multi-Phase Scheduling DMPS(x)
INPUT:
(a) An n-dimensional dependence nested loop.
(b) The choice of the algorithm: CSS, TSS or DTSS.
(c) If CSS is chosen, the chunk size C_i.
(d) The synchronization interval H.
(e) The number of slaves m; in case of DTSS, the virtual power VP_k of every slave.
Master:
Initialization:
(M.a) Register slaves. In case of DTSS, slaves report their A_k.
(M.b) Calculate F, L, N, D for TSS and DTSS. For CSS use the given C_i.
While there are unassigned iterations do:
(M.1) If a request arrives, put it in the queue.
(M.2) Pick a request from the queue, and compute the next chunk size using CSS, TSS or DTSS.
(M.3) Update the current and previous slave ids.
(M.4) Send the id of the current slave to the previous one.
15
Dynamic Multi-Phase Scheduling DMPS(x)
Slave P_k:
Initialization:
(S.a) Register with the master. In case of DTSS, report A_k.
(S.b) Compute M according to the given H.
(S.1) Send a request to the master.
(S.2) Wait for the reply; if a chunk is received from the master, go to (S.3), else go to OUTPUT.
(S.3) While the next SP is not reached, compute chunk i.
(S.4) If the id of the send-to slave is known, go to (S.5), else go to (S.6).
(S.5) Send the computed data to the send-to slave.
(S.6) Receive data from the receive-from slave and go to (S.3).
OUTPUT
Master: If there are no more chunks to be assigned, terminate.
Slave P_k: If no more tasks come from the master, terminate.
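The master's bookkeeping in steps (M.1) to (M.4) can be mimicked in a single process. This is a simplified sketch, not the authors' MPI implementation: chunk sizes are passed in precomputed, requests are assumed to arrive in round-robin order, and each slave is assumed to ask again right after being served. All names are hypothetical.

```python
from collections import deque


def simulate_master(chunk_sizes, slave_ids):
    """Return (assignments, send_to) for a list of precomputed chunk sizes."""
    requests = deque(slave_ids)   # (M.1) queued requests; round-robin is an assumption
    assignments = []              # (chunk index, current slave, chunk size, receive-from slave)
    send_to = {}                  # chunk index -> slave that needs this chunk's boundary data
    previous = None
    for i, size in enumerate(chunk_sizes):
        current = requests.popleft()                  # (M.2) pick a request
        assignments.append((i, current, size, previous))
        if previous is not None:
            send_to[i - 1] = current                  # (M.4) previous slave learns its send-to id
        previous = current                            # (M.3) update current/previous ids
        requests.append(current)                      # simplification: slave requests again
    return assignments, send_to


assignments, send_to = simulate_master([3, 2, 1], ["P1", "P2"])
print(assignments)
print(send_to)
```

The owner of the last chunk never receives a send-to id, which corresponds to a slave in step (S.4) skipping (S.5) and only receiving.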
16
Dynamic Multi-Phase Scheduling DMPS(x)
Advantages of DMPS(x):
- Can take as input any self-scheduling algorithm, without any modifications
- Phase 2 is independent of Phase 1
- Phase 1 deals with the heterogeneity and load variation in the system
- Phase 2 deals with minimizing the inter-slave communication cost
- Suitable for any type of heterogeneous system
18
Implementation and testing setup
- The algorithms are implemented in C and C++.
- MPI is used for master-slave and inter-slave communication.
- The heterogeneous system consists of 10 machines:
  - 4 Intel Pentium III, 1266 MHz with 1 GB RAM (called zealots), assumed to have VP_k = 1.5 (one of them is the master)
  - 6 Intel Pentium III, 500 MHz with 512 MB RAM (called kids), assumed to have VP_k = 0.5
- The interconnection network is Fast Ethernet, at 100 Mbit/s.
- Dedicated system: all machines are dedicated to running the program and no other loads are interposed during the execution.
- Non-dedicated system: at the beginning of the program’s execution, a resource-expensive process is started on some of the slaves, halving their A_k.
19
Implementation and testing setup
- System configuration: zealot1 (master), zealot2, kid1, zealot3, kid2, zealot4, kid3, kid4, kid5, kid6.
- Three series of experiments for both dedicated and non-dedicated systems, for m = 3, 4, 5, 6, 7, 8, 9 slaves:
  1) DMPS(CSS)
  2) DMPS(TSS)
  3) DMPS(DTSS)
- Two real-life applications: heat equation, Floyd-Steinberg computation.
- Speedup S_p is computed with a formula that is not preserved in this transcript, where T_{P_i} is the serial execution time on slave P_i, 1 ≤ i ≤ m, and T_PAR is the parallel execution time (on m slaves).
- In the plots of S_p, VP is used instead of m on the x-axis.
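Because the speedup formula itself is missing from the transcript, the sketch below uses one common definition for heterogeneous clusters, the best serial time divided by the parallel time. That choice is an assumption, not a statement of what the slide showed. The second helper sums the VP values of the first m slaves in the listed configuration order (also an assumption) to obtain the x-axis value; the timing numbers are made up.

```python
VP = {"zealot": 1.5, "kid": 0.5}
SLAVES = ["zealot2", "kid1", "zealot3", "kid2", "zealot4",
          "kid3", "kid4", "kid5", "kid6"]  # zealot1 is the master


def speedup(serial_times, parallel_time):
    """ASSUMED definition: S_p = min(T_P1, ..., T_Pm) / T_PAR."""
    return min(serial_times) / parallel_time


def total_vp(m):
    """Total virtual power of the first m slaves, taken in configuration order."""
    return sum(VP[name.rstrip("0123456789")] for name in SLAVES[:m])


print(speedup([10.0, 30.0], 5.0))   # hypothetical timings
print([total_vp(m) for m in range(3, 10)])
```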
20
Performance results: Heat equation, dedicated system
H      Series tested     m=3    m=4    m=5    m=6    m=7    m=8    m=9
100    1) DMPS(CSS)      2.32   1.75   1.73   1.23   1.21   [gap]  1.18
100    2) DMPS(TSS)      2.20   1.73   1.56   1.38   1.25   1.14   1.02
100    3) DMPS(DTSS)     1.42   1.14   1.00   0.95   0.91   0.85   0.78
150    1) DMPS(CSS)      2.31   1.74   1.71   1.21   1.22   1.21   1.18
150    2) DMPS(TSS)      2.18   1.72   1.54   1.38   1.25   1.14   1.02
150    3) DMPS(DTSS)     1.42   1.13   0.99   0.93   0.90   0.84   0.78
200    1) DMPS(CSS)      2.30   1.74   1.73   1.22   1.23   1.22   1.19
200    2) DMPS(TSS)      2.21   1.74   1.55   1.38   1.25   1.14   1.02
200    3) DMPS(DTSS)     1.42   1.13   0.99   0.94   0.90   0.83   0.78
H is the synchronization interval; m is the number of slaves. [gap] marks a value missing from the extracted slide text.
21
Performance results: Heat equation, non-dedicated system
H      Series tested     m=3    m=4    m=5    m=6    m=7    m=8    m=9
100    1) DMPS(CSS)      2.33   1.76   1.73   2.46   2.45   2.38   2.06
100    2) DMPS(TSS)      2.20   1.74   1.56   2.52   2.56   2.18   2.10
100    3) DMPS(DTSS)     1.95   1.45   1.30   1.31   1.33   1.38   1.25
150    1) DMPS(CSS)      2.33   1.74   1.72   2.46   2.49   2.43   2.05
150    2) DMPS(TSS)      2.19   1.72   1.54   2.42   2.23   2.31   2.06
150    3) DMPS(DTSS)     1.94   1.47   1.30   [gap]  1.28   1.36   1.23
200    1) DMPS(CSS)      2.30   1.74   1.73   2.39   2.36   2.38   2.10
200    2) DMPS(TSS)      2.22   1.75   1.56   1.79   2.32   2.10   2.02
200    3) DMPS(DTSS)     1.96   1.44   1.29   [gap]  1.27   1.32   1.21
H is the synchronization interval; m is the number of slaves. [gap] marks a value missing from the extracted slide text.
22
Performance results: Floyd-Steinberg, dedicated system
H      Series tested     m=3    m=4    m=5    m=6    m=7    m=8    m=9
50     1) DMPS(CSS)      27.79  22.14  16.78  16.69  16.53  11.38  11.36
50     2) DMPS(TSS)      25.32  19.77  17.30  15.41  13.80  12.43  11.40
50     3) DMPS(DTSS)     19.63  14.87  13.28  12.72  11.57  11.45  10.73
100    1) DMPS(CSS)      27.52  22.01  16.70  16.65  16.43  11.34  11.33
100    2) DMPS(TSS)      25.22  19.70  17.24  15.35  13.75  12.38  11.38
100    3) DMPS(DTSS)     19.63  14.80  13.21  12.66  11.52  11.34  10.64
150    1) DMPS(CSS)      27.58  22.03  16.75  16.70  16.44  11.43  [gap]
150    2) DMPS(TSS)      25.22  19.70  17.22  15.34  13.75  12.39  11.38
150    3) DMPS(DTSS)     19.62  14.82  13.24  12.67  11.53  11.34  10.65
H is the synchronization interval; m is the number of slaves. [gap] marks a value missing from the extracted slide text; in the H = 150 DMPS(CSS) row only six values survive and the missing one may belong to m = 8 rather than m = 9.
23
Performance results: Floyd-Steinberg, non-dedicated system
H      Series tested     m=3    m=4    m=5    m=6    m=7    m=8    m=9
50     1) DMPS(CSS)      27.72  22.13  16.76  23.81  22.32  22.47  22.44
50     2) DMPS(TSS)      25.18  19.72  17.24  22.34  24.14  22.26  20.95
50     3) DMPS(DTSS)     21.88  16.06  14.38  13.74  13.26  13.02  11.71
100    1) DMPS(CSS)      27.49  21.99  16.67  22.61  22.42  22.59  22.35
100    2) DMPS(TSS)      25.18  19.66  17.17  19.23  24.15  22.24  20.88
100    3) DMPS(DTSS)     21.85  15.96  14.32  13.65  13.11  12.80  11.58
150    1) DMPS(CSS)      27.57  22.01  16.74  22.49  22.48  22.32  22.46
150    2) DMPS(TSS)      25.17  19.65  17.20  26.20  24.14  22.02  20.82
150    3) DMPS(DTSS)     21.86  15.96  14.31  13.58  13.18  12.80  11.59
H is the synchronization interval; m is the number of slaves.
24
Interpretation of the results
Dedicated system:
- As expected, all algorithms perform better on a dedicated system than on a non-dedicated one.
- DMPS(TSS) slightly outperforms DMPS(CSS) for parallel loops, because it provides better load balancing.
- DMPS(DTSS) outperforms both other algorithms because it explicitly accounts for the system’s heterogeneity.
Non-dedicated system:
- DMPS(DTSS) stands out even more, since the other algorithms cannot handle extra load variations.
- The speedup for DMPS(DTSS) increases in all cases.
H must be chosen so as to maintain the comm/comp ratio < 1 for every test case. Even then, small variations in the value of H do not significantly affect the overall performance.
26
Conclusions
- Loops with dependencies can now be dynamically scheduled on heterogeneous dedicated and non-dedicated systems.
- Distributed algorithms efficiently compensate for the system’s heterogeneity for loops with dependencies, especially in non-dedicated systems.
28
Future work
- Establish a model for predicting the optimal synchronization interval H and minimizing the communication.
- Extend all other self-scheduling algorithms so that they can handle loops with dependencies and account for the system’s heterogeneity.
29
Thank you
Questions?