Experimental Comparative Study of Job Management Systems
George Washington University, George Mason University
Outline:
1. Review of experiments
2. Results
3. Encountered problems
4. Functional comparison
5. Extension to reconfigurable hardware
Review of Experiments
Our Testbed
Domains: science.gmu.edu, gmu.edu
m1-m4: 4 x Linux RH6.2, 2 x PIII 500 MHz, 128 MB RAM
m5-m7: 3 x Linux RH6.2, 2 x PIII 450 MHz, 128 MB RAM
pallj / m0: Linux RH7.0, PIII 450 MHz, 512 MB RAM
palpc2: Linux, PII 400 MHz, 128 MB RAM
alicja: Solaris 8, UltraSparcIIi 360 MHz, 512 MB RAM
anna, magdalena: Solaris 8, UltraSparcIIi 440 MHz (one with 512 MB RAM, one with 128 MB RAM)
redfox: Solaris 8, UltraSparcIIi 330 MHz, 128 MB RAM
SHORT JOBS (execution time between 1 s and 2 minutes)
* benchmarks used to determine the relative CPU factors of execution hosts
CPU factors for the medium benchmark list, based on the execution time for bt.W and Sobel1024i

Machine names   Host Type   Host Model      CPU Factor
m1-m4           Linux       PIII_2_500_
m5-m7           Linux       PIII_2_450_
pallj           Linux       PIII_1_450_
palpc2          Linux       P2_1_400_
alicja          Solaris64   USIIi_1_360_
anna            Solaris64   USIIi_1_440_
magdalena       Solaris64   USIIi_1_440_
redfox          Solaris64   USIIi_1_330_
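The slide states only that CPU factors are based on benchmark execution times; it does not give the formula. A minimal sketch of one plausible derivation follows, assuming the factor of a host model is the execution time on the slowest model divided by the execution time on that model, so the slowest model gets 1.0. The function name, the model names, and the times are hypothetical and are not measured values from the study.

```python
def relative_cpu_factors(exec_times):
    """Return {host model: CPU factor}, normalised so the slowest model is 1.0.

    exec_times maps a host model name to the execution time (in seconds)
    of the same benchmark on that model. The normalisation is an assumption.
    """
    if not exec_times:
        raise ValueError("no execution times given")
    if any(t <= 0 for t in exec_times.values()):
        raise ValueError("execution times must be positive")
    slowest = max(exec_times.values())
    # A host that finishes the benchmark twice as fast gets twice the factor.
    return {model: slowest / t for model, t in exec_times.items()}


# Hypothetical execution times in seconds, for illustration only.
example_times = {"model_a": 120.0, "model_b": 60.0, "model_c": 80.0}
factors = relative_cpu_factors(example_times)
```

How the results of the two benchmarks (bt.W and Sobel1024i) were combined is not stated on the slide, so the sketch handles a single benchmark.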
MEDIUM JOBS (execution time between 2 minutes and 10 minutes) * benchmarks used to determine the relative CPU factors of execution hosts
LONG JOBS (execution time between 10 minutes and 30 minutes) * benchmarks used to determine the relative CPU factors of execution hosts
INPUT/OUTPUT JOBS (execution time between 1 second and 10 minutes)
Typical experiment
Timeline: job submissions 1 ... N at times i1 ... iN, starting at time = 0, followed by jobs finishing execution
Total time of an experiment: 2 hours
N = 150 for medium and short jobs, 75 for long jobs
Pseudorandom delays between consecutive job submissions
Poisson distribution of the job submission rate
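A Poisson-distributed submission rate corresponds to exponentially distributed delays between consecutive submissions. The sketch below generates such a schedule; the function name, the seed parameter, and the choice to submit the first job at time = 0 are assumptions made for illustration, not details taken from the study.

```python
import random


def submission_times(n_jobs, rate_per_min, seed=None):
    """Return n_jobs submission times in seconds, first job at time = 0.

    Delays between consecutive submissions are drawn from an exponential
    distribution whose mean is 60 / rate_per_min seconds.
    """
    if n_jobs < 0 or rate_per_min <= 0:
        raise ValueError("n_jobs must be >= 0 and rate_per_min must be > 0")
    rng = random.Random(seed)          # seeded generator: reproducible delays
    rate_per_s = rate_per_min / 60.0
    t = 0.0
    times = []
    for _ in range(n_jobs):
        times.append(t)
        t += rng.expovariate(rate_per_s)   # pseudorandom delay to next job
    return times


# Example: 150 medium jobs at an average rate of 4 jobs/min.
schedule = submission_times(150, 4.0, seed=1)
```

Fixing the seed makes it possible to replay the same sequence of delays against each job management system.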
List of experiments
Definition of timing parameters
Time instants: t_s submission time; t_b begin of execution; t_e end of execution; t_d delivery time
Intervals: T_R response time; T_TA turn-around time; T_EXE execution time; T_D delivery time

Typical scenario
Time instants: t_s submission time; t_b begin of execution; t_e end of execution
Intervals: T_R response time; T_TA turn-around time; T_EXE execution time; T_D = 0 (delivery time = 0)
Times determined using the gettimeofday() function
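The slides name the timing parameters but the extracted text does not preserve how each interval is bounded. The sketch below assumes T_R = t_b - t_s, T_EXE = t_e - t_b, T_D = t_d - t_e, and T_TA = t_d - t_s; these boundaries are assumptions. In the typical scenario t_d is not recorded, so it is taken equal to t_e, which gives T_D = 0. In Python, time.time() would play the role of gettimeofday() for collecting the timestamps.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobTimestamps:
    t_s: float                   # submission time
    t_b: float                   # begin of execution
    t_e: float                   # end of execution
    t_d: Optional[float] = None  # delivery time; None means "same as t_e"


def timing_parameters(job):
    """Return the four intervals under the assumed definitions."""
    t_d = job.t_e if job.t_d is None else job.t_d
    return {
        "T_R": job.t_b - job.t_s,    # response time
        "T_EXE": job.t_e - job.t_b,  # execution time
        "T_D": t_d - job.t_e,        # delivery time
        "T_TA": t_d - job.t_s,       # turn-around time
    }


# Typical scenario: no separate delivery phase.
params = timing_parameters(JobTimestamps(t_s=10.0, t_b=12.5, t_e=72.5))
```

With these definitions the turn-around time is always the sum of the response, execution, and delivery times.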
Total Throughput
Timeline: job submissions 1 ... N at times i1 ... iN, starting at time = 0, followed by jobs finishing execution
T_N: time necessary to execute N jobs
$$\text{Total Throughput} = \frac{N}{T_N}$$
Partial Throughput
Timeline: job submissions 1 ... N at times i1 ... iN, starting at time = 0, followed by jobs finishing execution
T_k: time necessary to execute k jobs
$$\text{Throughput}(k) = \frac{k}{T_k}$$
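Both throughput definitions can be evaluated from the list of job completion times. The sketch below assumes that T_k is measured from time = 0 to the moment the k-th job (in order of completion) finishes; the slides do not spell out this detail, and the function names are illustrative.

```python
def partial_throughput(finish_times, k):
    """Return k / T_k, in jobs per time unit of finish_times.

    T_k is taken as the completion time of the k-th job to finish,
    measured from time = 0 (an assumption about the slide's definition).
    """
    if not 1 <= k <= len(finish_times):
        raise ValueError("k must be between 1 and the number of jobs")
    t_k = sorted(finish_times)[k - 1]
    if t_k <= 0:
        raise ValueError("completion times must be positive")
    return k / t_k


def total_throughput(finish_times):
    """Return N / T_N, the partial throughput evaluated at k = N."""
    return partial_throughput(finish_times, len(finish_times))


# Completion times in seconds for four hypothetical jobs.
finish = [30.0, 60.0, 120.0, 90.0]
jobs_per_hour = 3600.0 * total_throughput(finish)
```

Multiplying by 3600 converts jobs per second to the jobs/hour unit used in the result charts.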
Utilization
Results
Medium jobs – Total Throughput. Figure: Throughput [jobs/hour] versus average job submission rate (…, 4, 12 jobs/min) for LSF, PBS, Codine, Condor.
Medium jobs – Turn-around Time. Figure: Turn-around Time [s] versus average job submission rate (…, 4, 12 jobs/min) for LSF, PBS, Codine, Condor.
Medium jobs – Response Time. Figure: Response Time [s] versus average job submission rate (…, 4, 12 jobs/min) for LSF, PBS, Codine, Condor.
Medium jobs – Utilization. Figure: Utilization [%] versus average job submission rate (…, 4, 12 jobs/min) for LSF, PBS, Codine, Condor.
Long jobs – Total Throughput. Figure: Throughput [jobs/hour] versus average job submission rate (…, 2 jobs/min) for LSF, PBS, Codine, Condor.
Long jobs – Turn-around Time. Figure: Turn-around Time [s] versus average job submission rate (…, 2 jobs/min) for LSF, PBS, Codine, Condor.
Long jobs – Response Time. Figure: Response Time [s] versus average job submission rate (…, 2 jobs/min) for LSF, PBS, Codine, Condor.
Long jobs – Utilization. Figure: Utilization [%] versus average job submission rate (…, 2 jobs/min) for LSF, PBS, Codine, Condor.
Short jobs – Total Throughput. Figure: Throughput [jobs/hour] versus average job submission rate (…, 6, 12, 30, 60 jobs/min) for LSF, PBS, Codine, Condor.
Short jobs – Turn-around Time. Figure: Turn-around Time [s] versus average job submission rate (…, 6, 12, 30, 60 jobs/min) for LSF, PBS, Codine, Condor.
Short jobs – Response Time. Figure: Response Time [s] versus average job submission rate (…, 6, 12, 30, 60 jobs/min) for LSF, PBS, Codine, Condor.
Short jobs – Utilization. Figure: Utilization [%] versus average job submission rate (…, 6, 12, 30, 60 jobs/min) for LSF, PBS, Codine, Condor.
Medium jobs – Total Throughput. Figure: Throughput [jobs/hour] versus maximum number of jobs per CPU (… job/CPU and 2 jobs/CPU, both at 4 jobs/min) for LSF, PBS, Codine, Condor.
Medium jobs – Turn-around Time. Figure: Turn-around Time [s] versus maximum number of jobs per CPU (… job/CPU and 2 jobs/CPU, both at 4 jobs/min) for LSF, PBS, Codine, Condor.
Medium jobs – Response Time. Figure: Response Time [s] versus maximum number of jobs per CPU (… job/CPU and 2 jobs/CPU, both at 4 jobs/min) for LSF, PBS, Codine, Condor.
Medium jobs – Utilization. Figure: Utilization [%] versus maximum number of jobs per CPU (… job/CPU and 2 jobs/CPU, both at 4 jobs/min) for LSF, PBS, Codine, Condor.
Encountered problems
1. Jobs with high requirements on the stack size
Indication: Certain jobs do not finish execution when run under LSF. The same jobs run correctly outside of any JMS and under the other job management systems.
Source: Variable STACKLIMIT in $LSB_CONFDIR/ /configdir/lsb.queues
Remaining problem: Documentation of default limits.
2. Frequently submitted small jobs
Indication: Unexpectedly high response time and turn-around time for a medium job submission rate
Possible solution: Define the variable CHUNK_JOB_SIZE (e.g., =5) in lsb.queues and the variable LSB_CHUNK_NORUSAGE=y in lsf.conf
3. Ordering of machines fulfilling resource requirements
Question: How many machines are dropped from the list based on the first ordering?
Default: r1m:pg

4. Random behavior from iteration to iteration
Question: Why is r1m different each time?
Indication: The assignment of jobs to particular machines is different in each iteration of the experiment

5. Boundary effects in the calculation of the throughput
Question: How to define the steady-state throughput?
Indication: The steady-state partial throughput differs from the total throughput

6. Throughput vs. turn-around time
Question: How to explain the lack of correlation?
Indication: No correlation between the ranking of JMSes in terms of throughput and their ranking in terms of turn-around time
Functional comparison
Operating system, flexibility, user interface
Systems compared: LSF, Codine, PBS, CONDOR, RES
Distribution: LSF com; Codine pub; PBS pub/com; CONDOR pub; RES gov
Source code
OS Support: Solaris, Linux, Tru64, NT
User Interface: GUI & CLI (listed for four systems)
Scheduling and Resource Management
Systems compared: LSF, Codine, PBS, CONDOR, RES
Features compared: Batch jobs, Interactive jobs, Parallel jobs, Accounting

Efficiency and Utilization
Systems compared: LSF, Codine, PBS, CONDOR, RES
Features compared: Stage-in and stage-out, Timesharing, Process migration, Dynamic load balancing, Scalability

Fault Tolerance and Security
Systems compared: LSF, Codine, PBS, CONDOR, RES
Features compared: Checkpointing, Daemon fault recovery, Authentication, Authorization

Documentation and Technical Support
Systems compared: LSF, Codine, PBS, CONDOR, RES
Features compared: Documentation, Technical support
JMS features supporting extension to reconfigurable hardware
- capability to define new dynamic resources
- strong support for stage-in and stage-out
  - configuration bitstreams
  - executable code
  - input/output data
- support for Windows NT and Linux
Ranking of Centralized Job Management Systems (1)
Capability to define new dynamic resources:
  Excellent: LSF, PBS, CODINE
  More difficult: CONDOR, RES
Stage-in and stage-out:
  Excellent: LSF, PBS
  Limited: CONDOR
  No: CODINE, RES
Ranking of Centralized Job Management Systems (2)
Overall suitability to extend to reconfigurable hardware:
1. LSF
2. CODINE
3. PBS
4. CONDOR
5. RES
The higher-ranked systems can be extended without changing the JMS source code; the lower-ranked systems require changes to the JMS source code.
Extension to reconfigurable hardware
Extension of LSF to reconfigurable hardware (1)
Operation of LSF
Submission host: bsub app, Batch API, LIM
Master host: MLIM, MBD, queue
Execution host: SBD, Child SBD, LIM, RES, User job
Load information: from other hosts
LIM – Load Information Manager
MLIM – Master LIM
MBD – Master Batch Daemon
SBD – Slave Batch Daemon
RES – Remote Execution Server
Extension of LSF to reconfigurable hardware (2)
Submission host: bsub app, Batch API, LIM
Master host: MLIM, MBD, queue
Execution host: SBD, Child SBD, LIM, ELIM, RES, User job, ACS API, FPGA board
Load information: from other hosts
Status of the board: obtained from the FPGA board through the ACS API and reported by the ELIM
ELIM – External Load Information Manager
ACS API – Adaptive Computing Systems API
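The ELIM in the diagram can be pictured as a small program that periodically reports the status of the FPGA board as a dynamic resource. The sketch below assumes the reporting convention described in the LSF documentation for external load indices, where each report is one line on standard output of the form "count name1 value1 name2 value2 ...". The resource name fpga_free and the function read_fpga_status() are hypothetical; a real ELIM would query the board through the ACS API, whose calls are not shown on the slides, and the resource would also have to be declared in the LSF configuration.

```python
import sys
import time


def read_fpga_status():
    """Hypothetical stand-in for a board status query through the ACS API."""
    return {"fpga_free": 1}   # 1 = board available, 0 = board busy (assumed)


def format_elim_report(resources):
    """Format one report line: '<count> <name> <value> ...'."""
    parts = [str(len(resources))]
    for name, value in resources.items():
        parts.extend([name, str(value)])
    return " ".join(parts)


def run_elim(interval_s=15.0, max_reports=None, out=sys.stdout):
    """Write a report every interval_s seconds; run forever if max_reports is None."""
    sent = 0
    while max_reports is None or sent < max_reports:
        out.write(format_elim_report(read_fpga_status()) + "\n")
        out.flush()           # the reader must see each report immediately
        sent += 1
        if max_reports is None or sent < max_reports:
            time.sleep(interval_s)
```

Jobs that need the board could then request the resource at submission time, which keeps the scheduling decision inside LSF rather than in the JMS source code.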