Published by Felix Martin. Modified over 9 years ago.
1
Managing Network Resources in Condor
Jim Basney
Computer Sciences Department
University of Wisconsin-Madison
jbasney@cs.wisc.edu
http://www.cs.wisc.edu/condor
2
Outline
› Introduction: Improving Goodput
› Research Overview: Making Network a Condor-Managed Resource
› Improving Goodput in Condor 6.2
› Conclusion
3
› Goodput = Allocation - Network Overhead
[Figure: timeline of an allocation. Labels: Placement, Remote I/O, Periodic Checkpoint, Preemption Checkpoint, X]
4
Improving Goodput
› Overlap network I/O with computation when possible
› Complete synchronous network I/O operations as quickly as possible
    Make network capacity an allocated resource
5
Outline
› Introduction: Improving Goodput
› Research Overview: Making Network a Condor-Managed Resource
› Improving Goodput in Condor 6.2
› Conclusion
6
Matchmaking Framework: Advertisement
[Diagram: Network Manager, Customer Agent, Compute Server, Matchmaker. Arrow labels: Resource Requests, Resource Offers, Resource Offers]
7
Matchmaking Framework: Match Notification
[Diagram: Network Manager, Customer Agent, Compute Server, Matchmaker]
8
Admission Control in the Matchmaker
› Only schedule jobs for which network capacity is available
    Transfer executable, input, and checkpoint files
    Transfer files for preempted job
9
Admission Control in the Matchmaker (cont.)
› Some capacity reserved for system goodput
    Schedule jobs with small network requirements on CPUs that would otherwise go idle because of limited network capacity
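The admission rule on the two slides above can be sketched as follows. This is only an illustration under stated assumptions, not the Condor matchmaker's code: the function name `admit`, its parameters, and the idea of a single `small_threshold` that defines a "small" network requirement are all hypothetical.

```python
def admit(job_transfer, committed, capacity, reserve, small_threshold):
    """Decide whether a job's network transfer may be scheduled.

    job_transfer    -- network capacity the job's transfers would need
    committed       -- capacity already promised to other jobs
    capacity        -- total network capacity being managed
    reserve         -- slice of capacity held back for system goodput
    small_threshold -- largest request that counts as "small" (assumed knob)
    """
    # Ordinary jobs may only use the capacity outside the reserve.
    if committed + job_transfer <= capacity - reserve:
        return True
    # The reserve is open only to jobs with small network requirements,
    # so CPUs do not sit idle just because the general pool is full.
    return job_transfer <= small_threshold and committed + job_transfer <= capacity
```

With a capacity of 100, a reserve of 20, and a threshold of 5, a request of 10 is refused once 78 units are committed, while a request of 4 is still admitted from the reserve.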
10
Matchmaking Framework: Claiming
[Diagram: Network Manager, Customer Agent, Compute Server, Matchmaker]
11
Network Manager
› Accepts claims for network resources
    Schedules placement and preemption transfers
    Allocates supplemental requests
    Supports advance reservations
› Incorporates feedback into future allocation decisions
12
Network Scheduling
› Start time, end time, min. rate, max. rate
› Search forward or backward
› First fit, best rate, earliest completion / latest start
› Example: scheduling checkpoints before shutdown event
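A first-fit search over these parameters might look like the sketch below. It is a simplified model, not the network manager's algorithm: time is divided into equal slots, `free[t]` is the assumed unreserved bandwidth in slot `t`, and `schedule_transfer` and its arguments are hypothetical names.

```python
def schedule_transfer(free, size, earliest, deadline,
                      min_rate, max_rate, backward=False):
    """Find the first window in which a transfer of `size` units fits.

    free     -- free[t] is the unreserved bandwidth in time slot t
    earliest -- first slot in which the transfer may start
    deadline -- slot by which the transfer must have finished (exclusive)
    backward -- if True, try later start slots first (latest start)
    Returns (start_slot, end_slot) or None if no window fits.
    """
    starts = range(earliest, deadline)
    if backward:
        starts = reversed(starts)
    for start in starts:
        sent = 0.0
        for t in range(start, deadline):
            rate = min(max_rate, free[t])
            if rate < min_rate:
                break  # the stream would drop below its minimum rate
            sent += rate
            if sent >= size:
                return (start, t + 1)
    return None
```

Searching forward returns the earliest feasible start. Searching backward returns the latest feasible start, which suits the slide's example of fitting checkpoints in just before a shutdown event. The sketch only finds a window; it does not record the reservation in `free`.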
13
Bandwidth Control
› Streams register with network manager and send bandwidth requests
› Network manager allocates available bandwidth according to max-min fairness
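Max-min fairness can be illustrated with the standard progressive-filling procedure. The sketch below is a generic version of that procedure, not code from Condor; `max_min_fair` and the stream names in the usage note are made up for the example.

```python
def max_min_fair(capacity, requests):
    """Split `capacity` among streams according to max-min fairness.

    requests -- dict mapping stream id to requested bandwidth
    Returns a dict mapping stream id to granted bandwidth.
    """
    allocation = {}
    remaining = float(capacity)
    # Serve the smallest requests first. Each stream gets its full request
    # or an equal share of what is left, whichever is smaller.
    pending = sorted(requests.items(), key=lambda item: item[1])
    while pending:
        fair_share = remaining / len(pending)
        stream, demand = pending.pop(0)
        granted = min(demand, fair_share)
        allocation[stream] = granted
        remaining -= granted
    return allocation
```

For example, with a capacity of 10 and requests of 2, 4, and 10 from streams `a`, `b`, and `c`, streams `a` and `b` are fully satisfied and `c` receives the remaining 4. No stream can get more without taking bandwidth from a stream that has an equal or smaller allocation.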
14
Outline
› Introduction: Improving Goodput
› Research Overview: Making Network a Condor-Managed Resource
› Improving Goodput in Condor 6.2
› Conclusion
15
Job Goodput Statistics: condor_q
› % condor_q -goodput
    -- Submitter: corduroy.ncsa.uiuc.edu :
    ID      OWNER    SUBMITTED    RUN_TIME    GOODPUT  CPU_UTIL  Mb/s
    870.0   jbasney  3/2  12:22   2+18:07:11  79.1%    90.7%     0.20
    870.3   jbasney  3/2  12:22   2+23:48:58  98.9%    90.5%     0.11
    870.4   jbasney  3/2  12:22   2+22:09:25  91.1%    91.9%     0.19
    870.5   jbasney  3/2  12:22   2+23:24:56  86.3%    94.2%     0.15
› % condor_q -io
    -- Submitter: corduroy.ncsa.uiuc.edu :
    ID      OWNER    READ    WRITE     SEEK  XPUT      BUFSIZE   BLKSIZE
    870.0   jbasney  96.0 B  160.0 KB  0     0.0 B /s  512.0 KB  32.0 KB
    870.3   jbasney  96.0 B  91.4 KB   0     0.0 B /s  512.0 KB  32.0 KB
    870.4   jbasney  96.0 B  130.2 KB  0     0.0 B /s  512.0 KB  32.0 KB
    870.5   jbasney  96.0 B  130.6 KB  0     0.0 B /s  512.0 KB  32.0 KB
16
Job Goodput Statistics: condor_userlog
› % condor_userlog -j 870.21 -tot
    Host/Job         Wall Time  Good Time  CPU Usage  Avg Alloc  Avg Lost  Goodput  Util.
    141.142.220.8    0+23:52    0+23:52    0+23:27    0+03:58    0+00:00   100.0%   98.3%
    141.142.220.107  0+00:29    0+00:29    0+00:26    0+00:29    0+00:00   100.0%   90.1%
    141.142.221.225  0+01:02    0+00:00    0+00:00    0+00:15    0+00:15   0.0%     0.0%
    141.142.220.109  0+11:32    0+10:00    0+09:20    0+05:46    0+00:46   86.6%    93.4%
    141.142.7.7      0+00:40    0+00:00    0+00:00    0+00:40    0+00:40   0.0%     0.0%
    141.142.220.104  0+06:58    0+06:00    0+05:33    0+06:58    0+00:57   86.2%    92.5%
    141.142.220.18   0+14:33    0+14:02    0+13:08    0+14:33    0+00:31   96.4%    93.6%
    141.142.220.99   0+03:35    0+03:00    0+02:47    0+03:35    0+00:34   83.8%    92.8%
    141.142.220.13   0+01:29    0+01:01    0+00:54    0+01:29    0+00:27   68.7%    88.6%
    870.21           2+16:59    2+10:26    2+07:38    0+03:25    0+00:32   89.9%    95.2%
    Total            2+16:59    2+10:26    2+07:38    0+03:25    0+00:32   89.9%    95.2%
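The percentage columns in the condor_userlog output above appear consistent with Goodput = Good Time / Wall Time and Util. = CPU Usage / Good Time. That relationship is inferred from the numbers shown, not stated in the slides, and the helper names below are hypothetical. A minimal sketch of the arithmetic:

```python
def parse_duration(text):
    """Convert a 'D+HH:MM' duration, as printed above, into minutes."""
    days, clock = text.split("+")
    hours, minutes = clock.split(":")
    return (int(days) * 24 + int(hours)) * 60 + int(minutes)


def goodput_and_util(wall, good, cpu):
    """Return (goodput %, utilization %) for one row of the table.

    Assumes goodput = good time / wall time and
    utilization = CPU usage / good time (inferred, not documented here).
    """
    wall_m = parse_duration(wall)
    good_m = parse_duration(good)
    cpu_m = parse_duration(cpu)
    goodput = 100.0 * good_m / wall_m if wall_m else 0.0
    util = 100.0 * cpu_m / good_m if good_m else 0.0
    return goodput, util


# Total row: wall 2+16:59, good 2+10:26, CPU 2+07:38
goodput, util = goodput_and_util("2+16:59", "2+10:26", "2+07:38")
print(f"{goodput:.1f}% {util:.1f}%")
```

For the Total row this reproduces the 89.9% and 95.2% shown. Individual host rows can differ by a tenth or two, presumably because the tool works from seconds that the table does not display.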
17
Multiple Checkpoint Servers
› Checkpoint faster by writing to a local checkpoint server
› CKPT_SERVER_HOST defines the checkpoint server for each machine
› Jobs send checkpoints to the server configured at the execution site
› Works with flocking
18
Checkpoint Server Domains
› Condor sets LastCkptServer attribute in the Job ClassAd
› Job ClassAds use LastCkptServer to request a machine close to their checkpoint
19
Conclusion
› Network overheads reduce the efficiency of CPU allocations
› Overlap I/O with computation
› Allocate network resources in Condor
› Improving goodput in Condor 6.2
    condor_q & condor_userlog
    Multiple checkpoint servers