CGrid 2005, slide 1
Empirical Evaluation of Shared Parallel Execution on Independently Scheduled Clusters
Mala Ghanesh, Satish Kumar, Jaspal Subhlok
University of Houston
CCGrid, May 2005
CGrid 2005, slide 2
Scheduling Parallel Threads
– Space sharing / gang scheduling: all parallel threads of an application are scheduled together by a global scheduler.
– Independent scheduling: threads are scheduled independently on each node of a parallel system by the local scheduler.
CGrid 2005, slide 3
Space Sharing and Gang Scheduling
[Figure: schedules over time slices T1–T6 on nodes N1–N4, comparing space sharing with gang scheduling. Threads of application A are a1, a2, a3, a4; threads of application B are b1, b2, b3, b4.]
CGrid 2005, slide 4
Independent Scheduling and Gang Scheduling
[Figure: schedules over time slices T1–T6 on nodes N1–N4, comparing independent scheduling with gang scheduling. Under gang scheduling every time slice runs all four threads of one application; under independent scheduling each node's local scheduler interleaves threads of applications A and B independently, so the mix differs from node to node.]
CGrid 2005, slide 5
Gang versus Independent Scheduling
Gang scheduling is the de facto standard for parallel computation clusters. How does independent scheduling compare?
+ More flexible: no central scheduler required
+ Potentially uses resources more efficiently
- Potentially increases synchronization overhead
CGrid 2005, slide 6
Synchronization/Communication with Independent Scheduling
[Figure: time slices T1–T6 on nodes N1–N4 in which the local schedules are offset so that the threads of one application are never all running at the same time.]
With strict independent round-robin scheduling, parallel threads may never be able to communicate! Fortunately, scheduling is never strictly round robin, but this is a significant performance issue (a toy simulation follows).
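The pathological case on this slide can be reproduced with a toy simulation. The following is a minimal sketch, not from the paper: it assumes two applications strictly alternating on each node, with the two nodes' schedules offset by one time slice, and counts how often both nodes run the same application.

#include <stdio.h>

int main(void) {
    const char apps[2] = {'a', 'b'};  /* two applications time-sharing each node */
    int coscheduled = 0;

    for (int slice = 0; slice < 100; slice++) {
        char node1 = apps[slice % 2];        /* node 1: a, b, a, b, ... */
        char node2 = apps[(slice + 1) % 2];  /* node 2: b, a, b, a, ... (offset by one slice) */
        if (node1 == node2)
            coscheduled++;                   /* only then could the peers communicate */
    }

    /* With strictly alternating, offset schedules this prints 0: threads of
     * the same application are never runnable at the same time. */
    printf("co-scheduled slices: %d of 100\n", coscheduled);
    return 0;
}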
CGrid 2005, slide 7
Research in This Paper
How does node sharing with independent scheduling perform in practice?
– Improved resource utilization versus higher synchronization overhead?
– Dependence on application characteristics?
– Dependence on CPU time slice values?
CGrid 2005, slide 8
Experiments
All experiments use the NAS benchmarks on 2 clusters. The benchmark programs were executed:
1. In dedicated mode on a cluster
2. With node sharing alongside competing applications
3. The slowdown due to sharing was then analyzed (the metric is sketched below)
The above experiments were conducted with
– Various node and thread counts
– Various CPU time slice values
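The slowdown metric referred to above is simply the percentage increase in execution time under sharing relative to dedicated execution. A minimal sketch in C, using hypothetical run times rather than measured ones:

#include <stdio.h>

/* Percentage increase in execution time relative to the dedicated run. */
static double slowdown_pct(double t_dedicated, double t_shared) {
    return 100.0 * (t_shared - t_dedicated) / t_dedicated;
}

int main(void) {
    double t_dedicated = 100.0;  /* hypothetical dedicated-mode run time (s) */
    double t_shared = 152.0;     /* hypothetical run time with competing load (s) */
    printf("slowdown: %.1f%%\n", slowdown_pct(t_dedicated, t_shared));  /* prints 52.0% */
    return 0;
}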
CGrid 2005, slide 9
Experimental Setup
Two clusters were used:
1. 10 nodes, 1 GB RAM, dual Pentium Xeon processors, RedHat Linux 7.2, GigE interconnect
2. 18 nodes, 1 GB RAM, dual AMD Athlon processors, RedHat Linux 7.3, GigE interconnect
NAS Parallel Benchmarks 2.3, Class B, MPI versions:
– CG, EP, IS, LU, MG compiled for 4, 8, 16, 32 threads
– SP and BT compiled for 4, 9, 16, 36 threads
IS (Integer Sort) and CG (Conjugate Gradient) are the most communication-intensive benchmarks; EP (Embarrassingly Parallel) has no communication.
CGrid 2005, slide 10
Experiment #1
NAS benchmarks were compiled for 4, 8/9, and 16 threads.
1. The benchmarks were first executed in dedicated mode with one thread per node.
2. They were then executed with 2 additional competing threads on each node.
– Each node has 2 CPUs, so a minimum of 3 total threads is needed to cause contention.
– Competing load threads are simple compute loops with no communication (a sketch follows this slide).
3. The slowdown (percentage increase in execution time) was plotted.
The nominal slowdown is 50%, since 3 compute-bound threads sharing 2 CPUs under gang scheduling would stretch execution by a factor of 3/2; this value is used for comparison as the gang scheduling slowdown.
CGrid 2005, slide 11
Results: 10 node cluster
[Bar chart: percentage slowdown (0–80%) for CG, EP, IS, LU, MG, SP, BT, and their average, on 4 nodes and on 8/9 nodes, against the expected slowdown with gang scheduling.]
Slowdown ranges around 50%, with some increase in slowdown going from 4 to 8 nodes.
CGrid 2005, slide 12
Results: 18 node cluster
[Bar chart: percentage slowdown per benchmark on the 18 node cluster.]
Results are broadly similar, with a slow increase in slowdown from 4 to 16 nodes.
CGrid 2005, slide 13
Remarks
Why is the slowdown not much higher?
– Scheduling is not strict round robin: a blocked application thread gets scheduled again on message arrival. This leads to self-synchronization, in which threads of the same application across nodes get scheduled together (see the sketch below).
– Applications often have significant wait times that are used by the competing applications under sharing.
– An increase in slowdown with more nodes is expected, as communication operations become more complex; the rate of increase is modest.
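A minimal sketch of the OS mechanism behind this self-synchronization, using plain UNIX sockets rather than the MPI library the benchmarks actually use: a process blocked in a read consumes no CPU, so competing threads can run, and it is made runnable again as soon as the peer's message arrives.

#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        perror("socketpair");
        return 1;
    }

    if (fork() == 0) {                      /* child plays the remote peer */
        sleep(1);                           /* ...does some work first... */
        if (write(sv[1], "go", 2) != 2)     /* then sends its message */
            return 1;
        return 0;
    }

    char buf[2];
    /* The parent blocks here without using the CPU; the kernel puts it back
     * on the run queue the moment the peer's message arrives. */
    if (read(sv[0], buf, sizeof buf) <= 0) {
        perror("read");
        return 1;
    }
    printf("woken by peer message\n");
    return 0;
}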
CGrid 2005, slide 14
Experiment #2
Similar to the previous batch of experiments, except:
– 2 application threads per node
– 1 load thread per node
The nominal slowdown is still 50%.
CGrid 2005, slide 15
Performance: 1 and 2 app threads/node
[Bar chart: percentage slowdown (0–80%) for CG, EP, IS, LU, MG, SP, BT, and their average, with 1 app thread per node (4 and 8/9 nodes) and 2 app threads per node (4/5 and 8 nodes), against the expected slowdown with gang scheduling.]
Slowdown is lower for 2 threads/node.
CGrid 2005, slide 16
Performance: 1 and 2 app threads/node
[Same chart as the previous slide.]
Slowdown is lower for 2 threads/node because:
– Each application competes with one 100% compute thread, not 2.
– Scaling a fixed size problem to more threads means each thread uses the CPU less efficiently, hence more free cycles are available.
CGrid 2005, slide 17
Experiment #3
Similar to the previous batch of experiments, except:
– The CPU time slice quantum was varied from 30 to 200 ms (the default was 50 ms).
The CPU time slice quantum is the amount of time a process gets when others are waiting in the ready queue (an inspection sketch follows this slide). Intuitively, a longer time slice quantum means:
+ A communication operation between nodes is less likely to be interrupted due to swapping (good).
- A node may have to wait longer for a peer to be scheduled before communicating (bad).
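The slides do not say how the 30–200 ms quanta were configured on these RedHat kernels, so that step is not reproduced here. As a hedged illustration of working with time slice values at all, the standard POSIX call below only inspects the scheduling interval the kernel reports for the calling process; it is not the mechanism used in the experiments.

#include <sched.h>
#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec ts;
    /* pid 0 refers to the calling process. */
    if (sched_rr_get_interval(0, &ts) != 0) {
        perror("sched_rr_get_interval");
        return 1;
    }
    printf("scheduling interval: %ld.%09ld s\n", (long)ts.tv_sec, (long)ts.tv_nsec);
    return 0;
}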
CGrid 2005, slide 18
Performance with different CPU time slice quanta
[Bar chart: percentage slowdown (0–100%) for CG, EP, IS, LU, MG, SP, and BT with CPU time slices of 30, 50, 100, and 200 ms.]
– Small time slices are uniformly bad.
– Medium time slices (50 ms and 100 ms) are generally good.
– A longer time slice is good for communication intensive codes.
CGrid 2005, slide 19
Conclusions
– Performance with independent scheduling is competitive with gang scheduling for small clusters. The key is passive self-synchronization of application threads across the cluster.
– There is a steady but slow increase in slowdown with a larger number of nodes.
– Given the flexibility of independent scheduling, it may be a good choice for some scenarios.
CGrid 2005, slide 20
Broader Picture: Distributed Applications on Networks: Resource Selection, Mapping, Adapting
[Figure: an application graph (Data, Pre, Sim 1, Sim 2, Model, Stream, Vis) mapped onto a network, asking which nodes offer the best performance.]
CGrid 2005, slide 21
End of Talk!
FOR MORE INFORMATION: www.cs.uh.edu/~jaspal, jaspal@uh.edu
CGrid 2005, slide 22
Mapping Distributed Applications on Networks: "state of the art"
[Figure: the application graph (Data, Pre, Sim 1, Sim 2, Model, Stream, Vis) mapped onto a network for best performance.]
1. Measure and model network properties, such as available bandwidth and CPU loads (with tools like NWS, Remos).
2. Find the "best" nodes for execution based on network status (a selection sketch follows this slide).
But the approach has significant limitations:
– Knowing network status is not the same as knowing how an application will perform.
– Frequent measurements are expensive; less frequent measurements mean stale data.
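A minimal sketch of step 2 above under stated assumptions: given per-node measurements of the kind a tool like NWS or Remos would supply (available bandwidth, CPU load), rank the nodes by a simple score and pick the best k. The score, the weighting, and the sample numbers are all illustrative, not taken from the paper.

#include <stdio.h>
#include <stdlib.h>

struct node_info {
    const char *name;
    double bandwidth_mbps;  /* measured available bandwidth to this node */
    double cpu_load;        /* measured load average on this node */
};

/* Favor high bandwidth and low load; the weighting is arbitrary. */
static double score(const struct node_info *n) {
    return n->bandwidth_mbps / (1.0 + n->cpu_load);
}

/* qsort comparator: highest score first. */
static int by_score_desc(const void *a, const void *b) {
    double diff = score((const struct node_info *)b) - score((const struct node_info *)a);
    return (diff > 0) - (diff < 0);
}

int main(void) {
    struct node_info nodes[] = {
        {"n1", 940.0, 0.1}, {"n2", 420.0, 0.0},
        {"n3", 930.0, 1.8}, {"n4", 880.0, 0.4},
    };
    size_t count = sizeof nodes / sizeof nodes[0];
    size_t k = 2;  /* number of nodes the application needs */

    qsort(nodes, count, sizeof nodes[0], by_score_desc);
    for (size_t i = 0; i < k && i < count; i++)
        printf("selected %s (score %.1f)\n", nodes[i].name, score(&nodes[i]));
    return 0;
}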
CGrid 2005, slide 23
Discovered Communication Structure of NAS Benchmarks
[Figure: communication graphs over four threads (0–3) for BT, CG, IS, EP, LU, MG, and SP.]
CGrid 2005, slide 24 CPU Behavior of NAS Benchmarks