Condor Project, Computer Sciences Department, University of Wisconsin-Madison: Master/Worker and Condor (Barcelona, 2006)
2 Agenda: Extended user's tutorial; Advanced Uses of Condor; Java programs; DAGMan; Stork; MW; Grid Computing; Case studies, and a discussion of your application's needs
3 Why Master Worker? MW addresses a weakness in Condor: short jobs. It is excellent for dynamic, parallel workflows.
4 A Workflow Problem: A problem requires that we do A 60,000 times and B 100,000 times. A takes 1 second; B takes 3 seconds. Computation time for the problem is (60,000 x 1) + (100,000 x 3) = 360,000 seconds, or 100 hours.
5 Condor Runs the Workflow: Assume that the overhead Condor adds to running each instance of A or B is 20 seconds (this estimate is much too small; the real overhead is larger). Time for Condor to do the problem is (60,000 x 21) + (100,000 x 23) = 3,560,000 seconds, or about 989 hours.
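The arithmetic on slides 4 and 5 can be checked with a few lines of C++. This is only an illustrative sketch: the function names `total_seconds` and `to_hours_rounded` are invented here and are not part of Condor or MW.

```cpp
#include <cassert>
#include <cstdint>

// Total seconds when every instance of A and B runs as its own job.
// overhead_s is the assumed per-job overhead (20 s on the slide);
// passing 0 gives the pure computation time.
std::int64_t total_seconds(std::int64_t runs_a, std::int64_t runs_b,
                           std::int64_t secs_a, std::int64_t secs_b,
                           std::int64_t overhead_s) {
    assert(runs_a >= 0 && runs_b >= 0 && overhead_s >= 0);
    return runs_a * (secs_a + overhead_s) + runs_b * (secs_b + overhead_s);
}

// Convert seconds to whole hours, rounded to the nearest hour.
std::int64_t to_hours_rounded(std::int64_t seconds) {
    return (seconds + 1800) / 3600;
}
```

With the slide's numbers, `total_seconds(60000, 100000, 1, 3, 0)` gives the pure computation time and `total_seconds(60000, 100000, 1, 3, 20)` gives the time with one Condor job per instance.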
6 A Condor Job…
7 An Often Considered Solution: Bundle several As or Bs into a single Condor job. This must address further issues: partial failures, load balancing, and dynamic creation of work. [Figure: several As inside one Condor job]
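To see why bundling is tempting, the per-job overhead can be amortized over a bundle. The sketch below reuses the 20-second overhead assumed on slide 5; the bundle size and the name `bundled_seconds` are illustrative assumptions, and the model ignores the partial-failure and load-balancing issues the slide lists.

```cpp
#include <cassert>
#include <cstdint>

// Seconds needed when `runs` instances are grouped into Condor jobs of
// `bundle_size` instances each, paying `overhead_s` once per job.
std::int64_t bundled_seconds(std::int64_t runs, std::int64_t secs_per_run,
                             std::int64_t bundle_size, std::int64_t overhead_s) {
    assert(runs >= 0 && bundle_size > 0 && overhead_s >= 0);
    std::int64_t jobs = (runs + bundle_size - 1) / bundle_size;  // ceiling division
    return runs * secs_per_run + jobs * overhead_s;
}
```

For example, bundling 100 instances per job turns the 60,000 runs of A into 600 jobs, so the overhead is paid 600 times instead of 60,000 times.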
8 Basics of MW The master gives tasks to the workers.
9 Workers and Tasks: Each worker serially takes on tasks, as assigned by the master. [Figure: one worker with the tasks "feed me", "change diaper", "bathe me"]
10 Relating MW to Condor: There is 1 master. The master determines the number of workers. Each worker is a Condor job. Each worker receives tasks serially. Many workers do tasks at the same time (in parallel). Workers communicate only with the master.
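The rules above (one master, several workers, each worker taking tasks serially) can be illustrated with a small simulation. This is a sketch of the scheduling idea only, not MW code: the function name is invented, and it assumes the master always hands the next task to whichever worker becomes idle first.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Simulate one master handing tasks, in order, to whichever worker is
// idle first. Returns the time at which the last task finishes.
long simulate_master_worker(const std::vector<long>& task_secs, int num_workers) {
    assert(num_workers > 0);
    // Min-heap of the times at which each worker next becomes idle.
    std::priority_queue<long, std::vector<long>, std::greater<long>> idle_at;
    for (int w = 0; w < num_workers; ++w) idle_at.push(0);

    long makespan = 0;
    for (long secs : task_secs) {
        long start = idle_at.top();   // earliest available worker
        idle_at.pop();
        long finish = start + secs;   // that worker runs the task serially
        idle_at.push(finish);
        if (finish > makespan) makespan = finish;
    }
    return makespan;
}
```

Because tasks are assigned as workers free up, a worker stuck on a long task does not hold back the short ones, which is the load-balancing behavior that fixed bundles lack.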
11 Solution: Lightweight Tasks Multiplexed on Top of Jobs. The analogy: a process is to a thread as a Condor job is to an MW task. A Condor job may take minutes to create and dispatch; an MWTask dispatch takes milliseconds.
12 MW is a C++ Framework: A way to re-use Condor worker jobs. Each worker may run many tasks. Results in a very parallel application.
13 MW is not: MPI (Message Passing Interface); a general parallel programming scheme
14 MW in action [Figure labels: condor_submit, Submit machine, Master exe, Worker, tasks (T)]
15 You Must Write 3 Classes, the Subclasses of: MWDriver, MWTask, MWWorker [Figure labels: Master exe, Worker exe]
16 An MWTask Subclass: Data members for inputs; a data member for results; serialization of inputs and results; distinct instances on each side (master and worker)
17 The Four Task Methods
void MyTask::pack_work(void);
void MyTask::unpack_work(void);
void MyTask::pack_results(void);
void MyTask::unpack_results(void);
Also constructors and destructors!
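A minimal sketch of how the four methods pair up is shown below. It does not use the real MW headers: `SketchComm` and `sketch_rmc` are hypothetical in-memory stand-ins for the RMC object, and `SumTask` is an invented example task that does not derive from the real MWTask.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Hypothetical stand-in for RMC: an in-memory FIFO of ints. The real RMC
// moves data between the master and worker processes.
struct SketchComm {
    std::deque<int> wire;
    void pack(int *array, int length) {
        for (int i = 0; i < length; ++i) wire.push_back(array[i]);
    }
    void unpack(int *array, int length) {
        assert(static_cast<int>(wire.size()) >= length);
        for (int i = 0; i < length; ++i) {
            array[i] = wire.front();
            wire.pop_front();
        }
    }
};

SketchComm sketch_rmc;  // plays the role of the global RMC pointer

// Hypothetical task: sum a list of integers.
class SumTask {
public:
    std::vector<int> inputs;  // data members for inputs
    int result = 0;           // data member for the result

    // Master side: send the length first so the worker knows how much to read.
    void pack_work() {
        int n = static_cast<int>(inputs.size());
        sketch_rmc.pack(&n, 1);
        sketch_rmc.pack(inputs.data(), n);
    }
    // Worker side: read back exactly what pack_work wrote, in the same order.
    void unpack_work() {
        int n = 0;
        sketch_rmc.unpack(&n, 1);
        inputs.resize(n);
        sketch_rmc.unpack(inputs.data(), n);
    }
    // Worker side: send the result back.
    void pack_results() { sketch_rmc.pack(&result, 1); }
    // Master side: receive the result into the master's own instance.
    void unpack_results() { sketch_rmc.unpack(&result, 1); }
};
```

The point of the sketch is the symmetry: each unpack method must read fields in the order the matching pack method wrote them, because the master and the worker hold distinct instances of the task.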
18 RMC: Resource Management and Communication
An abstraction to set up communication, to specify resource requirements, etc.
RMC->pack(int *array, int length);
RMC->unpack(int *array, int length);
19 MWWorker: Just one method, executeTask(MWTask *t). Also constructor and destructor!
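The shape of a worker's single method might look like the sketch below. The base class `BaseTaskSketch` is a stand-in for MWTask, and `MaxTask` and `MaxWorker` are invented names; the downcast from the base-class pointer to the application's own task type is an assumption about how a worker would typically reach its data.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Stand-in for the framework's task base class (hypothetical).
struct BaseTaskSketch {
    virtual ~BaseTaskSketch() = default;
};

// Hypothetical application task: find the largest input.
struct MaxTask : BaseTaskSketch {
    std::vector<int> inputs;
    int result = 0;
};

class MaxWorker {
public:
    // Mirrors executeTask(MWTask *t): receive a base-class pointer,
    // downcast to the concrete task, compute, and store the result in the task.
    void executeTask(BaseTaskSketch *t) {
        MaxTask *task = dynamic_cast<MaxTask *>(t);
        assert(task != nullptr && !task->inputs.empty());
        task->result = *std::max_element(task->inputs.begin(), task->inputs.end());
    }
};
```

The worker writes its answer into the task object itself, so the result travels back to the master through the task's own result serialization.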
20 MWDriver (the master)
get_userinfo(int argc, char **argv)
RMC->add_executable(char *exe, char *requirements);
setup_initial_tasks(int num_tasks, MWTask ***init_tasks)
act_on_completed_task(MWTask *t)
RMC->add_task(MWTask *t)
Also constructor and destructor
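How these driver methods cooperate can be sketched as a single-process loop. Everything here is a simplified assumption: `SketchDriver` and `NodeTask` are invented, the "worker" is a single line inside `run`, and the real framework dispatches tasks to remote workers rather than looping locally. The sketch shows dynamic creation of work: each completed task may cause the master to add new tasks.

```cpp
#include <cassert>
#include <queue>

// Hypothetical task: one node of a binary search tree at a given depth.
struct NodeTask {
    int depth;
    bool done;
    explicit NodeTask(int d) : depth(d), done(false) {}
};

class SketchDriver {
public:
    int completed = 0;

    explicit SketchDriver(int limit) : max_depth(limit) {}

    void add_task(const NodeTask &t) { pending.push(t); }

    // Seed the queue with the root task.
    void setup_initial_tasks() { add_task(NodeTask(0)); }

    // Called once per finished task; may create more work.
    void act_on_completed_task(const NodeTask &t) {
        ++completed;
        if (t.depth < max_depth) {
            add_task(NodeTask(t.depth + 1));
            add_task(NodeTask(t.depth + 1));
        }
    }

    void run() {
        setup_initial_tasks();
        while (!pending.empty()) {
            NodeTask t = pending.front();
            pending.pop();
            t.done = true;             // stand-in for a worker executing the task
            act_on_completed_task(t);
        }
    }

private:
    std::queue<NodeTask> pending;
    int max_depth;
};
```

With a depth limit of 3 the driver completes 1 + 2 + 4 + 8 = 15 tasks, although only one task existed when the run started.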
21 MWTask ***init_tasks: a pointer to an array of pointers to tasks. [Figure labels: task; array of pointers to tasks; pointer to the array]
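The three levels of indirection are easier to follow in code. In this sketch `InitTaskSketch` stands in for MWTask and both function names are invented; passing the task count through a pointer is a choice made for this sketch so that the caller can read it back.

```cpp
#include <cassert>

// Stand-in for a task class (hypothetical).
struct InitTaskSketch {
    int id;
    explicit InitTaskSketch(int i) : id(i) {}
};

// The caller owns an InitTaskSketch** variable and passes its address,
// which is why the parameter has three stars.
void setup_initial_tasks_sketch(int *num_tasks, InitTaskSketch ***init_tasks) {
    *num_tasks = 3;
    *init_tasks = new InitTaskSketch *[*num_tasks];   // the array of pointers
    for (int i = 0; i < *num_tasks; ++i) {
        (*init_tasks)[i] = new InitTaskSketch(i);     // each slot points to one task
    }
}

// Release the tasks and then the array that held them.
void free_tasks_sketch(int num_tasks, InitTaskSketch **tasks) {
    for (int i = 0; i < num_tasks; ++i) delete tasks[i];
    delete[] tasks;
}
```

A caller would declare `InitTaskSketch **tasks = nullptr;` and pass `&tasks`, after which `tasks[i]` is a pointer to the i-th task.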
23 Putting it all together: examples/new_skel
./new_app MY_PROJECT
A Perl script to create appropriately named files containing skeleton code
Use configure --help for options
make
24 Running an application: Just launch the appropriate master. Use condor_q to see it in action.
25 Real MW Applications
MWFATCOP (Chen, Ferris, Linderoth): A branch and cut code for linear integer programming
MWMINLP (Goux, Leyffer, Nocedal): A branch and bound code for nonlinear integer programming
MWQPBB (Linderoth): A (simplicial) branch and bound code for solving quadratically constrained quadratic programs
MWAND (Linderoth, Shen): A nested decomposition based solver for multistage stochastic linear programming
MWATR (Linderoth, Shapiro, Wright): A trust-region-enhanced cutting plane code for linear stochastic programming and statistical verification of solution quality
MWQAP (Anstreicher, Brixius, Goux, Linderoth): A branch and bound code for solving the quadratic assignment problem
26 Other resources: Online manual; MW-users mailing list
27 Extra Slides
28 Advice for Large Runs: Use Personal Condor: flock, glidein, schedd-on-side, hobblein. Use checkpoints! Set worker_increment high.
29 Debugging with Independent Mode: Special RMComm for debugging. Single process, can run under gdb.
30 MW Philosophy: Reuse either code or concept. Key idea: late binding.
31 User-level Checkpoints
MWTask::write_chkpt_info(FILE *)
MWTask::read_chkpt_info(FILE *)
MWDriver::read_master_state(FILE *)
MWDriver::write_master_state(FILE *)
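A checkpoint pair of this kind is a write method and a read method that agree on a format. The sketch below assumes a plain-text format and an invented class `CheckpointedTask` with made-up fields; it does not derive from the real MWTask, and the actual return types and file handling in MW may differ.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical task state that should survive a master restart.
class CheckpointedTask {
public:
    int lower = 0;
    int upper = 0;
    long best = 0;

    // Write every field needed to rebuild the task, in a fixed order.
    void write_chkpt_info(std::FILE *f) const {
        std::fprintf(f, "%d %d %ld\n", lower, upper, best);
    }

    // Read the fields back in exactly the order they were written.
    void read_chkpt_info(std::FILE *f) {
        int got = std::fscanf(f, "%d %d %ld", &lower, &upper, &best);
        assert(got == 3);
        (void)got;  // keeps builds with NDEBUG free of unused-variable warnings
    }
};
```

As with the pack and unpack methods, the read side must mirror the write side field for field; a task restored from the file should be indistinguishable from the one that was saved.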
32 Example codes with MW: Matmul, Blackbox, knapsack
33 More on MW: Version 0.2 is the latest. It is more stable than the version number suggests! A mailing list is available for discussion. Active development by the Condor team.