Interactive MPI on Demand
Greg Thain
Computer Sciences Department, University of Wisconsin-Madison
cs.wisc.edu
Unix Tool Philosophy › 1) Individual tools do one thing well › 2) Communicate via ASCII streams › 3) Are composable
The Paradox › Universal assent that it's good › No one uses it (except for shell one-liners): grep ^abc | sort | uniq -c | sort -n
More than just shell scripts › Dividing work across Unix processes provides: restartability, better security, and scalability across multiple cores
For example… › qmail: secure and stable, implemented as roughly a dozen cooperating processes
Getting back to Condor… › Condor uses this approach in some places: the *-GAHPs, condor_master, replaceable shadow/starter pairs, multi_shadow vs. many shadows › But not everywhere: the schedd
Condor Daemons as Components › A very successful strategy: glide-in, personal Condor, "Hoffman" and schedds as jobs, Condor-C
Case Study: MPI on Demand › The problem: a pool with lots of machines running very long (weeks) vanilla jobs, plus a need to run big but short MPI jobs; can't reboot the startds › MPI needs a dedicated scheduler, which in turn requires dedicated machines
Possible Solutions › Add a "suspension slot": requires a reboot › Submit the MPI job normally: preempts the vanilla job
COD refresher › COD: Computing On Demand › No scheduling, no file transfer › When a COD claim runs, the vanilla job suspends ("checkpoint to swap") › Needs security turned on to work: COD users must be explicitly allowed (see the sketch below)
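A minimal sketch of what "explicitly allowed" involves: the base pool's startds only honor COD requests from users listed in their VALID_COD_USERS knob. The user name and config-file path below are placeholders, not from the talk.
# Sketch only: on each execute node, allow the requesting user and reconfig
cat >> /etc/condor/condor_config.local <<'EOF'
# Users explicitly allowed to create COD claims on this startd
VALID_COD_USERS = gthain
EOF
condor_reconfig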
Startd as COD job › Overview: › Launch a personal Condor › Run startds as COD jobs on the base pool; they report to the personal Condor and the base jobs suspend › Submit the parallel job to the personal Condor (see the submit sketch below) › Remove the COD startds when done
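For the "submit parallel job" step, a rough sketch of a submit file sent to the personal Condor's dedicated scheduler; the executable path and node count are placeholders, not from the talk.
# Sketch: parallel-universe submit file for the short MPI run
cat > mpi.sub <<'EOF'
universe      = parallel
executable    = /nfs/path/my_mpi_job
machine_count = 8
queue
EOF
condor_submit mpi.sub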
Startd under COD: Details › Two condor_config files in play: be careful! › COD provides no file transfer: the existing startd binary can be re-used, but the config file must be pre-staged or on NFS (see the sketch below) › Don't lose the claim ID!
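A sketch of what the second, pre-staged config (/nfs/new_config) might contain, assuming the personal Condor runs on a host called personal.example.edu here; host names and knob values are illustrative, not from the talk.
# Sketch: write the pre-staged config the borrowed startd will use
cat > /nfs/new_config <<'EOF'
# Report to the personal Condor, not the base pool
CONDOR_HOST = personal.example.edu
# Under the COD claim, only the master and startd run
DAEMON_LIST = MASTER, STARTD
# Keep spool/log/execute where run-startd.sh creates them
LOCAL_DIR   = /tmp/p-condor
# Advertise willingness to serve the personal Condor's dedicated scheduler
DedicatedScheduler = "DedicatedScheduler@personal.example.edu"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
START = Scheduler =?= $(DedicatedScheduler)
EOF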
Example code
HOSTS="a b c"
# Request a COD claim on each host, saving the claim ID
for h in $HOSTS; do
  condor_cod request -name $h > claimid.$h
done
# Activate each claim with the job ad file "ja" (next slide)
for n in claimid.*; do
  condor_cod activate -id `cat $n` -jobad ja
done
COD job ad (the file "ja")
CMD = "/nfs/path/run-startd.sh"
IWD = "/tmp"
Out = "startd.out"
Err = "startd.err"
Universe = 5        (5 = the vanilla universe)
run-startd.sh
#!/bin/bash
# Create a scratch layout for the borrowed startd under the job's IWD (/tmp)
mkdir -p p-condor/{spool,log,execute}
# Use the pre-staged personal-condor configuration
export CONDOR_CONFIG=/nfs/new_config
# Run the master in the foreground (-f), logging to the terminal (-t)
exec /usr/sbin/condor_master -f -t
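And for the final "remove COD startds" step, a sketch of the cleanup, reusing the claim-ID files saved earlier; releasing each claim shuts down the borrowed startd and lets the suspended vanilla job resume.
# Release each COD claim once the MPI run is finished
for n in claimid.*; do
  condor_cod release -id `cat $n`
done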
Summary › Use Condor daemons as components › Mix and match as needed
Questions? › Thank You!