Greg Thain Computer Sciences Department University of Wisconsin-Madison Condor Parallel Universe
Overview › Task vs. Job Parallelism › New Condor support for Task-Parallelism › Other goodies
The Talk in One Slide › Parallel Universe can run any* task-parallel job › Not just MPICH › Not just MPI…
Job vs. Task Parallelism › Condor has historically focused on job parallelism › Job parallelism is done either manually or via DAGMan › Rest of the talk is on task parallelism › Can also get task parallelism via PVM or MW
Parallel Universe › Adaptation of MPI universe › Modifications based on experience with MPI › User feedback › But, more than just MPI
MPI lifecycle without Condor › LAM version › 1. lamboot: lamboot -ssi boot ssh machine_file › 2. mpirun: mpirun -np 8 exe arg1 arg › 3. lamhalt: lamhalt
Scheduling › Need a "Dedicated Scheduler" › "Dedicated" has a specific Condor meaning › Nodes running MPI require a dedicated scheduler › A given machine can have many opportunistic schedulers ... but only 1 dedicated scheduler
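As a rough illustration of tying a machine to one dedicated scheduler, the sketch below writes a small configuration fragment in the style of the Condor manual's dedicated-scheduling setup. The function name write_dedicated_config, the output file, and the host name submit.example.edu are hypothetical; a real pool would likely need further policy settings that are not shown here.

```shell
#!/bin/sh
# Sketch only: emit a config fragment that names a single dedicated scheduler.
# write_dedicated_config, its arguments, and the host name used below are
# hypothetical names chosen for this example.
write_dedicated_config() {
    out_file=$1      # where to write the fragment
    sched_host=$2    # full host name of the machine running the schedd
    cat > "$out_file" <<EOF
# Advertise which schedd acts as this machine's dedicated scheduler.
DedicatedScheduler = "DedicatedScheduler@${sched_host}"
STARTD_ATTRS = \$(STARTD_ATTRS), DedicatedScheduler
EOF
}
```

For example, `write_dedicated_config dedicated.sample submit.example.edu` would produce a two-setting fragment that an administrator could review before merging it into the execute machine's local configuration.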
DedicatedScheduler surprises › DedicatedScheduler co-opts the normal negotiation cycle › Preemption and scheduling work differently than in opportunistic scheduling › DedicatedScheduler schedules First-Fit, sorted by UserJobPrio › condor_q -analyze mystery!
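To make "First-Fit, sorted by priority" concrete, here is a toy sketch of one plausible reading of that policy. It is not the DedicatedScheduler's actual code. The function first_fit, the three-column input format, and the assumption that a larger priority number is served first are all choices made for this example.

```shell
#!/bin/sh
# Toy first-fit pass over a job list (illustration only).
# Usage: first_fit FREE_MACHINES < jobs
# Each input line is "name priority machines_needed" (hypothetical format).
# Prints the names of the jobs that receive machines, in scheduling order.
first_fit() {
    free=$1
    # Highest priority first; -s keeps submission order among equal priorities.
    sort -s -k2,2nr | {
        while read -r name prio need; do
            # Take the job only if enough machines remain; otherwise skip it
            # and keep scanning, so a smaller job further down may still fit.
            if [ "$need" -le "$free" ]; then
                free=$((free - need))
                echo "$name"
            fi
        done
    }
}
```

With 8 free machines and the jobs `a 0 8`, `b 5 6`, `c 5 4`, `d 1 2`, this pass schedules b and then d. Job c is skipped despite its higher priority because only 2 machines remain after b, which is the kind of outcome that can look surprising from the queue's point of view.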
Job startup › Same file transfer, etc. as Vanilla › One shadow, many starters › Starter runs sshd on all machines, does key exchange › Starter runs the exe on first machine (head node, Rank0)
Your Script Here › The script on the head node has the contact file › We provide samples for LAM, MPICH › We try to mimic "by hand" startup › Use condor_ssh to start remote jobs › When the script exits, Condor cleans up
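The "by hand" pattern above can be sketched as a small head-node helper: read the list of machines, start one command per machine through a remote launcher, then wait. The function start_workers and the REMOTE_SHELL variable are hypothetical names, and the sketch assumes the launcher accepts a host name followed by a command, in the style of ssh. Whether condor_ssh is invoked exactly this way is an assumption, so the default launcher here is echo, which only prints what would be run.

```shell
#!/bin/sh
# Sketch of a head-node script: start one worker per machine listed in a
# machine file, then wait for all of them.
# Assumptions (not from the slides): the machine file holds one host name per
# line; REMOTE_SHELL names the launcher and defaults to "echo" for a dry run.
start_workers() {
    machine_file=$1
    shift
    while read -r host; do
        # Skip blank lines in the machine file.
        [ -n "$host" ] || continue
        # Launch the command for this host in the background. stdin is
        # redirected so the launcher cannot swallow the rest of the file.
        ${REMOTE_SHELL:-echo} "$host" "$@" </dev/null &
    done < "$machine_file"
    # Block until every background launch has finished.
    wait
}
```

A dry run such as `start_workers "$MACHINE_FILE" client_app` prints one "host client_app" line per machine; setting REMOTE_SHELL to a real launcher would start the commands instead. Because the launches run in the background, the order of the printed lines is not guaranteed.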
Parallel Example › [Diagram: the submit machine (schedd, shadow) and the execute machines (startd, starter, sshd, script, job)]
Example submit file
    Universe = Parallel
    # executable is a script
    executable = script
    # the real binary
    transfer_input_files = executable
    arguments = arg1 arg2 arg3
    machine_count = 8
    output = out.$(Cluster).$(NODE)
    queue
Example Script
    chmod 755 simple
    lamboot -ssi boot rsh $MACHINE_FILE
    mpirun -np $NO_MACHINES simple
    lamhalt
Example submit file 2
    Universe = Parallel
    Requirements = (Hostname == "somemachine")
    queue
    Requirements = (Hostname != "somemachine")
    queue 7
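Example Script 2
    # Pick the first two hosts from the machine file
    mach1=`sed -n 1p "$MACHINE_FILE"`
    mach2=`sed -n 2p "$MACHINE_FILE"`
    # Start the server locally in the background
    ./server &
    ssh $mach1 client_app
    ssh $mach2 client_app
    # Wait for the background server to exit
    wait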
Summary › With the Parallel Universe in Condor 6.8 comes: › Support for most MPI implementations (some scripting required) › Somewhat better MPI scheduling › Better node placement via Condor matchmaking
Questions? › Thank you!