Jim Basney Computer Sciences Department University of Wisconsin-Madison Managing Network Resources in Condor
Outline › Introduction: Improving Goodput › Research Overview: Making Network a Condor-Managed Resource › Improving Goodput in Condor 6.2 › Conclusion
› Goodput = Allocation - Network Overhead › Network overhead: placement, remote I/O, periodic checkpoint, preemption checkpoint
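As a worked illustration of the goodput definition on this slide, the following Python sketch subtracts the four network overhead components from an allocation. The function name and the example numbers are hypothetical, chosen only to show the arithmetic, and the sketch assumes that all of the listed network time is synchronous (none of it overlaps with computation).

```python
def goodput_seconds(allocation, placement, remote_io, periodic_ckpt, preemption_ckpt):
    """Goodput = allocation - network overhead (all values in seconds)."""
    overhead = placement + remote_io + periodic_ckpt + preemption_ckpt
    return allocation - overhead


# Hypothetical one-hour allocation with made-up overhead values.
alloc = 3600
good = goodput_seconds(alloc, placement=120, remote_io=200,
                       periodic_ckpt=90, preemption_ckpt=90)
goodput_fraction = good / alloc  # share of the allocation spent on useful work
```

Any overhead that can be overlapped with computation would drop out of the subtraction, which is the motivation for the techniques on the next slide.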
Improving Goodput › Overlap network I/O with computation when possible › Complete synchronous network I/O operations as quickly as possible: make network capacity an allocated resource
Matchmaking Framework: Advertisement › [Figure: Network Manager, Customer Agent, Compute Server, and Matchmaker; labels: Resource Requests, Resource Offers, Resource Offers]
Matchmaking Framework: Match Notification › [Figure: Network Manager, Customer Agent, Compute Server, and Matchmaker]
Admission Control in the Matchmaker › Only schedule jobs for which network capacity is available: transfer executable, input, and checkpoint files; transfer files for preempted job
Admission Control in the Matchmaker (cont.) › Some capacity reserved for system goodput: schedule jobs with small network requirements on CPUs that would otherwise go idle because of limited network capacity
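The two admission-control slides can be sketched as a two-pass procedure. This is only one possible reading, not the matchmaker's actual algorithm: the function name, the per-cycle capacity in MB, the `reserved_mb` parameter, and the "smallest transfer first" rule for the second pass are all assumptions made for illustration.

```python
def admit(jobs, idle_cpus, capacity_mb, reserved_mb):
    """Pick jobs to start this cycle.

    jobs: list of (name, transfer_mb) in priority order (hypothetical layout).
    Returns the names of admitted jobs.
    """
    available = capacity_mb - reserved_mb
    admitted, deferred = [], []

    # Pass 1: walk jobs in priority order, using only unreserved capacity.
    for name, mb in jobs:
        if len(admitted) < idle_cpus and mb <= available:
            admitted.append(name)
            available -= mb
        else:
            deferred.append((name, mb))

    # Pass 2: CPUs still idle are filled with the smallest transfers,
    # drawing on the leftover capacity plus the reserved portion.
    pool = available + reserved_mb
    for name, mb in sorted(deferred, key=lambda job: job[1]):
        if len(admitted) >= idle_cpus:
            break
        if mb <= pool:
            admitted.append(name)
            pool -= mb
    return admitted


# Hypothetical cycle: 4 idle CPUs, 1000 MB of capacity, 200 MB reserved.
queue = [("a", 500), ("b", 400), ("c", 250), ("d", 50), ("e", 100)]
started = admit(queue, idle_cpus=4, capacity_mb=1000, reserved_mb=200)
```

In this made-up cycle, job "b" is skipped because its transfer does not fit, and the fourth CPU is given to the small job "e" instead of sitting idle.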
Matchmaking Framework: Claiming › [Figure: Network Manager, Customer Agent, Compute Server, and Matchmaker]
Network Manager › Accepts claims for network resources: schedules placement & preemption transfers; allocates supplemental requests; supports advance reservations › Incorporates feedback into future allocation decisions
Network Scheduling › Start time, end time, min. rate, max. rate › Search forward or backward › First fit, best rate, earliest completion / latest start › Example: scheduling checkpoints before shutdown event
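A minimal sketch of how a request with a start time, end time, minimum rate, and maximum rate might be placed by a first-fit search, forward or backward. The slotted time model, the `free` list of unallocated bandwidth per slot, and the function name are assumptions for illustration; the slides do not describe the network manager's data structures.

```python
def schedule_transfer(free, size, start, end, min_rate, max_rate, backward=False):
    """Return (first_slot, rates_per_slot) for the first feasible window, or None.

    free[t] is the unallocated bandwidth in slot t (data units per slot).
    The transfer must run in contiguous slots within [start, end).
    """
    candidates = range(start, end)
    if backward:
        # Searching backward yields the latest feasible start.
        candidates = reversed(candidates)

    for first in candidates:
        remaining, rates = size, []
        for t in range(first, end):
            rate = min(max_rate, free[t])
            if rate < min_rate:
                break  # this slot cannot sustain the minimum rate
            sent = min(rate, remaining)
            rates.append(sent)
            remaining -= sent
            if remaining == 0:
                return first, rates
    return None


# Hypothetical link with 8 slots of free bandwidth.
free = [0, 2, 10, 10, 4, 10, 10, 10]
earliest = schedule_transfer(free, 20, 0, 8, min_rate=4, max_rate=8)
latest = schedule_transfer(free, 20, 0, 8, min_rate=4, max_rate=8, backward=True)
```

The backward search corresponds to the checkpoint-before-shutdown example: with the shutdown at slot 8, the latest start that still completes the transfer lets the job compute for as long as possible.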
Bandwidth Control › Streams register with network manager and send bandwidth requests › Network manager allocates available bandwidth according to max-min fairness
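Max-min fairness can be illustrated with the standard progressive-filling computation: streams are served in order of increasing demand, and each receives the smaller of its demand and an equal share of what remains. The function name and the dictionary layout are hypothetical; this is a sketch of the general technique, not the network manager's code.

```python
def max_min_fair(capacity, demands):
    """Allocate capacity among streams according to max-min fairness.

    demands: dict mapping stream name -> requested bandwidth.
    """
    alloc = {}
    remaining = capacity
    pending = sorted(demands.items(), key=lambda item: item[1])
    for i, (stream, demand) in enumerate(pending):
        # Equal share of the remaining capacity among unserved streams.
        fair_share = remaining / (len(pending) - i)
        alloc[stream] = min(demand, fair_share)
        remaining -= alloc[stream]
    return alloc


# Hypothetical requests against 12 units of available bandwidth.
shares = max_min_fair(12, {"a": 2, "b": 4, "c": 10})
```

Streams "a" and "b" ask for less than an equal share and are fully satisfied; the unused portion goes to "c", which still receives less than it requested.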
Job Goodput Statistics: condor_q
› % condor_q -goodput
-- Submitter: corduroy.ncsa.uiuc.edu :
ID OWNER SUBMITTED RUN_TIME GOODPUT CPU_UTIL Mb/s
jbasney 3/2 12: :07: % 90.7%
jbasney 3/2 12: :48: % 90.5%
jbasney 3/2 12: :09: % 91.9%
jbasney 3/2 12: :24: % 94.2% 0.15
› % condor_q -io
-- Submitter: corduroy.ncsa.uiuc.edu :
ID OWNER READ WRITE SEEK XPUT BUFSIZE BLKSIZE
jbasney 96.0 B KB B /s KB 32.0 KB
jbasney 96.0 B 91.4 KB B /s KB 32.0 KB
jbasney 96.0 B KB B /s KB 32.0 KB
jbasney 96.0 B KB B /s KB 32.0 KB
Job Goodput Statistics: condor_userlog
› % condor_userlog -j tot
Host/Job Wall Time Good Time CPU Usage Avg Alloc Avg Lost Goodput Util
: : : : : % 98.3%
: : : : : % 90.1%
: : : : :15 0.0% 0.0%
: : : : : % 93.4%
: : : : :40 0.0% 0.0%
: : : : : % 92.5%
: : : : : % 93.6%
: : : : : % 92.8%
: : : : : % 88.6%
: : : : : % 95.2%
Total 2+16: : : : : % 95.2%
Multiple Checkpoint Servers › Checkpoint faster by writing to a local checkpoint server › CKPT_SERVER_HOST defines the checkpoint server for each machine › Jobs send checkpoints to the server configured at the execution site › Works with flocking
Checkpoint Server Domains › Condor sets LastCkptServer attribute in the Job ClassAd › Job ClassAds use LastCkptServer to request a machine close to their checkpoint
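The preference described on this slide can be sketched in Python as a ranking over machine ads. This is not ClassAd syntax; in Condor the preference would be written in the job's own expressions. `LastCkptServer` comes from the slide, while the machine-side `CkptServer` attribute, the host names, and the function name are assumptions for illustration.

```python
def rank_machines(job_ad, machine_ads):
    """Order machines so those sharing the job's last checkpoint server come first."""
    last = job_ad.get("LastCkptServer")
    # False sorts before True, and sorted() is stable, so matching machines
    # move to the front while keeping their original relative order.
    return sorted(machine_ads, key=lambda m: m.get("CkptServer") != last)


# Hypothetical ads; "CkptServer" is an assumed machine attribute.
job = {"Owner": "jbasney", "LastCkptServer": "ckpt.a.example"}
machines = [
    {"Name": "m1", "CkptServer": "ckpt.b.example"},
    {"Name": "m2", "CkptServer": "ckpt.a.example"},
    {"Name": "m3", "CkptServer": "ckpt.a.example"},
]
preferred = [m["Name"] for m in rank_machines(job, machines)]
```

A job that has not yet checkpointed has no `LastCkptServer`, so no machine matches and the original order is kept.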
Conclusion › Network overheads reduce the efficiency of CPU allocations › Overlap I/O with computation › Allocate network resources in Condor › Improving goodput in Condor 6.2: condor_q & condor_userlog; multiple checkpoint servers