Published by Melvyn Perry. Modified over 9 years ago.
1
JSSPP-11, Boston, MA, June 19, 2005
Pitfalls in Parallel Job Scheduling Evaluation
Designing Parallel Operating Systems using Modern Interconnects
Eitan Frachtenberg and Dror Feitelson
Computer and Computational Sciences Division, Los Alamos National Laboratory
Ideas that change the world
2
Scope
Numerous methodological issues occur with the evaluation of parallel job schedulers:
  Experiment theory and design
  Workloads and applications
  Implementation issues and assumptions
  Metrics and statistics
Paper covers 32 recurring pitfalls, organized into topics and sorted by severity
Talk will describe a real case study, and the heroic attempts to avoid most such pitfalls
  …as well as the less-heroic oversight of several others
3
Evaluation Paths
Theoretical analysis (queuing theory):
  Pro: Reproducible, rigorous, and resource-friendly
  Con: Hard for time slicing due to unknown parameters, application structure, and feedbacks
Simulation:
  Pro: Relatively simple and flexible
  Con: Many assumptions, not all known/reported; hard to reproduce; rarely factors in application characteristics
Experiments with real sites and workloads:
  Pro: Most representative (at least locally)
  Con: Largely impractical and irreproducible
Emulation
4
Emulation Environment
Experimental platform consisting of three clusters with a high-end network
Software: several job scheduling algorithms implemented on top of STORM:
  Batch / space sharing, with optional EASY backfilling
  Gang Scheduling, Implicit Coscheduling (SB), Flexible Coscheduling
Results described in [JSSPP’03] and [TPDS’05]
5
Step One: Choosing Workload
Static vs. dynamic
Size of workload
How many different workloads are needed?
Use trace data?
  Different sites have different workload characteristics
  Inconvenient sizes may require imprecise scaling
  “Polluted” data, flurries
Use model-generated data?
  Several models exist, with different strengths
  By trying to capture everything, may capture nothing
6
Static Workloads
We start with a synthetic application and static workloads
Simple enough to model, debug, and calibrate
Bulk-synchronous application
Can control: granularity, variability, and communication pattern
7
Synthetic Scenarios
Balanced, Complementing, Imbalanced, Mixed
8
Example: Turnaround Time
9
Dynamic Workloads
We chose Lublin’s model [JPDC’03]
1000 jobs per workload
Multiplying run times AND arrival times by a constant to “shrink” the run time (2-4 hours)
  Shrinking too much is problematic (system constants)
Multiplying arrival times by a range of factors to modify load
  Unrepresentative, since it deviates from “real” correlations with run times and job sizes
  A better solution is to use different workloads
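The two scaling operations on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' tooling: the `Job` record, its field names, and the `offered_load` estimate (total processor-seconds requested divided by processors times the arrival span) are hypothetical choices made for the example. Shrinking multiplies both arrival times and run times by one constant, which leaves this load estimate unchanged; load scaling multiplies only arrival times, which changes the load but also breaks any correlation between arrival times, run times, and job sizes.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Job:
    """Hypothetical job record; field names are assumptions for this sketch."""
    arrival: float  # seconds since the start of the workload
    runtime: float  # seconds of execution on a dedicated partition
    size: int       # number of processors requested


def shrink(jobs: List[Job], factor: float) -> List[Job]:
    """Scale arrival AND run times by the same constant (shortens the experiment)."""
    return [replace(j, arrival=j.arrival * factor, runtime=j.runtime * factor)
            for j in jobs]


def scale_load(jobs: List[Job], factor: float) -> List[Job]:
    """Scale arrival times only; factor < 1 compresses arrivals and raises load."""
    return [replace(j, arrival=j.arrival * factor) for j in jobs]


def offered_load(jobs: List[Job], processors: int) -> float:
    """Rough load estimate: requested processor-seconds / available processor-seconds."""
    span = max(j.arrival for j in jobs) - min(j.arrival for j in jobs)
    demand = sum(j.runtime * j.size for j in jobs)
    return demand / (processors * span)


if __name__ == "__main__":
    jobs = [Job(0.0, 100.0, 4), Job(100.0, 200.0, 8), Job(200.0, 50.0, 2)]
    print(offered_load(jobs, 16))                   # baseline
    print(offered_load(shrink(jobs, 0.1), 16))      # same load, shorter experiment
    print(offered_load(scale_load(jobs, 0.5), 16))  # arrivals compressed, load doubles
```

Note that `scale_load` is exactly the operation the slide calls unrepresentative; it is shown only to make the contrast with `shrink` concrete.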
10
Step Two: Choosing Applications
Synthetic applications are easy to control, but:
  Some characteristics are ignored (e.g., I/O, memory)
  Others may not be representative, in particular communication, which is a salient feature of parallel apps
    Granularity, pattern, network performance
    If not sure, conduct sensitivity analysis
  Might be assumed malleable, moldable, or with linear speedup, which many MPI applications are not
Real applications have no hidden assumptions
  But may also have limited generality
11
Example: Sensitivity Analysis
12
Application Choices
Synthetic applications on first set
  Allows control over more parameters
  Allows testing unrealistic but interesting conditions (e.g., high multiprogramming level)
LANL applications on second set (Sweep3D, Sage)
  Real memory and communication use (MPL=2)
  Important applications for LANL’s evaluations
  But probably only for LANL…
Runtime estimate: f-model on batch, MPL on others
13
Step Three: Choosing Parameters
What are reasonable input parameters to use in the evaluation?
  Maximum multiprogramming level (MPL)
  Timeslice quantum
  Input load
  Backfilling method and effect on multiprogramming
  Run time estimate factor (not tested)
  Algorithm constants, tuning, etc.
14
Example 1: MPL
Verified with different offered loads
15
Example 2: Timeslice
Dividing the jobs into quantiles allows analysis of the effect on different job types
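One way to divide results into quantiles is sketched below. The slide does not say which attribute the jobs were divided by or how many groups were used; this sketch assumes jobs are ordered by run time and split into four equal-sized groups, and the record layout (run time, response time) is hypothetical.

```python
from statistics import mean
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def split_into_quantiles(items: Sequence[T], key: Callable[[T], float],
                         q: int = 4) -> List[List[T]]:
    """Sort items by `key` and cut them into q groups of (nearly) equal size."""
    ordered = sorted(items, key=key)
    n = len(ordered)
    return [ordered[i * n // q:(i + 1) * n // q] for i in range(q)]


def mean_per_quantile(items: Sequence[T], key: Callable[[T], float],
                      metric: Callable[[T], float], q: int = 4) -> List[float]:
    """Mean of `metric` within each quantile group (groups must be non-empty)."""
    return [mean(metric(item) for item in group)
            for group in split_into_quantiles(items, key, q)]


if __name__ == "__main__":
    # Hypothetical (run_time, response_time) pairs in seconds.
    jobs = [(10, 12), (20, 30), (300, 900), (5, 6),
            (1000, 1500), (60, 100), (600, 700), (40, 45)]
    per_group = mean_per_quantile(jobs, key=lambda j: j[0], metric=lambda j: j[1])
    for i, value in enumerate(per_group, start=1):
        print(f"quartile {i} (by run time): mean response time = {value}")
```

Reporting a metric per group in this way shows whether a parameter such as the timeslice helps short jobs at the expense of long ones, which a single overall average would hide.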
16
Considerations for Parameters
Realistic MPLs
Scaling traces to different machine sizes
Scaling offered load
Artificial user estimates and multiprogramming estimates
17
Step Four: Choosing Metrics
Not all metrics are easily comparable:
  Absolute times, slowdown with time slicing, etc.
  Metrics may need to be limited to a relevant context
  Use multiple metrics to understand characteristics
Measuring utilization for an open model
  Direct measure of offered load till saturation
  Same goes for throughput and makespan
  Better metrics: slowdown, response time, wait time
Using mean with asymmetric distributions
Inferring scalability from O(1) nodes
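The per-job metrics named on this slide can be sketched as follows, using the definitions common in the job scheduling literature: response time is wait time plus run time, slowdown is response time divided by run time, and bounded slowdown replaces very short run times with a threshold so that they do not dominate. The slide does not give the threshold; the 10-second default below is a commonly used value and is an assumption here, as are the sample numbers.

```python
from statistics import mean, median


def response_time(wait: float, run: float) -> float:
    """Time from submission to completion."""
    return wait + run


def slowdown(wait: float, run: float) -> float:
    """Response time normalized by run time."""
    return (wait + run) / run


def bounded_slowdown(wait: float, run: float, tau: float = 10.0) -> float:
    """Slowdown with run times shorter than tau treated as tau; never below 1."""
    return max(1.0, (wait + run) / max(run, tau))


if __name__ == "__main__":
    # A 1-second job that waited a minute: raw slowdown is inflated.
    print(slowdown(60.0, 1.0))          # 61.0
    print(bounded_slowdown(60.0, 1.0))  # 6.1

    # Asymmetric distribution: one outlier pulls the mean far from the median.
    slowdowns = [1.0, 1.0, 1.0, 1.0, 61.0]
    print(mean(slowdowns), median(slowdowns))  # 13.0 1.0
```

The last two lines illustrate the "mean with asymmetric distributions" pitfall: four of the five jobs saw no slowdown at all, yet the mean reports 13.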
18
Example: Bounded Slowdown
19
Example (continued)
20
Response Time
21
Bounded Slowdown
22
Step Five: Measurement
Never measure saturated workloads
  When arrival rate is higher than service rate, queues grow to infinity; all metrics become meaningless
  …but finding saturation point can be tricky
Discard warm-up and cool-down results
May need to measure subgroups separately (long/short, day/night, weekday/weekend,…)
Measurement should still have enough data points for statistical meaning, especially workload length
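Two of the measurement steps above can be sketched in code. Everything here is an assumption made for illustration rather than the authors' procedure: the record layout (dicts with `arrival` and `wait` in seconds), the 10% trimming fractions, and the saturation check, which is only a crude heuristic that flags a run when wait times in the second half are much larger than in the first half (a sign that the queue keeps growing).

```python
from statistics import mean
from typing import Dict, List

Record = Dict[str, float]  # hypothetical layout: {"arrival": ..., "wait": ...}


def trim_warmup_cooldown(records: List[Record], warmup: float = 0.1,
                         cooldown: float = 0.1) -> List[Record]:
    """Drop the first and last fractions of jobs, ordered by arrival time."""
    ordered = sorted(records, key=lambda r: r["arrival"])
    n = len(ordered)
    lo = int(n * warmup)
    hi = n - int(n * cooldown)
    return ordered[lo:hi]


def looks_saturated(records: List[Record], ratio: float = 2.0) -> bool:
    """Crude heuristic: mean wait in the second half exceeds `ratio` times the first.

    Needs at least two records. A one-second floor on the first-half mean
    avoids flagging runs in which waits are negligible throughout.
    """
    ordered = sorted(records, key=lambda r: r["arrival"])
    half = len(ordered) // 2
    first = mean(r["wait"] for r in ordered[:half])
    second = mean(r["wait"] for r in ordered[half:])
    return second > ratio * max(first, 1.0)


if __name__ == "__main__":
    stable = [{"arrival": float(i), "wait": 5.0} for i in range(10)]
    growing = [{"arrival": float(i), "wait": 10.0 * i} for i in range(10)]
    print(len(trim_warmup_cooldown(stable)))  # 8 of 10 jobs kept
    print(looks_saturated(stable))            # False
    print(looks_saturated(growing))           # True
```

As the slide warns, finding the saturation point can be tricky; a check like this can flag an obviously overloaded run but is no substitute for inspecting queue length and wait time over the whole run.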
23
Example: Saturation Point
24
Example: Shortest jobs CDF
25
Example: Longest jobs CDF
26
Conclusion
Parallel job scheduling evaluation is complex
  …but we can avoid past mistakes
Paper can be used as a checklist to work with when designing and executing evaluations
Additional information in paper:
  Pitfalls, examples, and scenarios
  Suggestions on how to avoid pitfalls
  Open research questions (for next JSSPP?)
  Many references to positive examples
Be cognizant when choosing your compromises
27
References
Workload archive: http://www.cs.huji.ac.il/~feit/worklad
  Contains several workload traces and models
Dror’s publication page: http://www.cs.huji.ac.il/~feit/pub.html
Eitan’s publication page: http://www.cs.huji.ac.il/~etcs/pubs
Email: eitanf@lanl.gov