Condor: A Project and a System
Condor Project, Computer Sciences Department, University of Wisconsin-Madison
Scientific Data Intensive Computing Workshop, Microsoft Research, May 2004
2 Outline What is the Condor Project? What is the Condor HTC Software? Recipe for using desktops for science Data!
4 The Condor Project (Established 1985) Distributed High Throughput Computing research performed by a team of ~35 faculty, full-time staff, and students who: face software engineering challenges in a heterogeneous distributed environment; are involved in national and international grid collaborations; actively interact with academic and commercial users; maintain and support large distributed production environments; and educate and train students. Funding – US Govt. (DoD, DoE, NASA, NSF, NIH), AT&T, IBM, INTEL, Microsoft, UW-Madison, …
5 A Multifaceted Project Harnessing the power of clusters - opportunistic and/or dedicated (Condor) Job management services for Grid applications (Condor-G, Stork) Fabric management services for Grid resources (Condor, GlideIns, NeST) Distributed I/O technology (Parrot, Kangaroo, NeST) Job-flow management (DAGMan, Condor, Hawk) Distributed monitoring and management (HawkEye) Technology for Distributed Systems (ClassAD, MW) Packaging and Integration (NMI, VDT)
6 Outline What is the Condor Project? What is the Condor HTC Software? Recipe for using desktops for science Data!
7 What is Condor? Condor converts collections of distributively owned workstations and dedicated clusters into a distributed, fault-tolerant, high-throughput computing (HTC) facility. Distributed ownership: the falling cost-performance ratio caused a huge increase in an organization's aggregate computing capacity, but a much smaller increase in the capacity accessible by any single person. HTC: large amounts of processing capacity sustained over very long time periods.
8 Condor can manage a large number of jobs You specify the jobs in a file and submit them to Condor, which runs them all and keeps you notified of their progress. Mechanisms help you manage huge numbers of jobs (1000s), their data, etc. Condor can handle workflow / inter-job dependencies (DAGMan). Condor users can set job priorities; Condor administrators can set user priorities.
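As a hedged sketch (the executable and file names here are hypothetical), a submit description file that queues many jobs, and a DAGMan input file expressing inter-job dependencies, might look like:

```
# sim.sub -- hypothetical submit description file
executable = run_sim
arguments  = -n $(Process)
log        = sim.log
output     = sim.$(Process).out
error      = sim.$(Process).err
queue 100
```

```
# pipeline.dag -- hypothetical DAGMan input file
JOB Prepare  prepare.sub
JOB Simulate sim.sub
JOB Collect  collect.sub
PARENT Prepare  CHILD Simulate
PARENT Simulate CHILD Collect
```

The DAG would be submitted with condor_submit_dag; DAGMan then runs each node only after its parents complete successfully.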
9 Condor can manage Dedicated Resources… Dedicated resources: compute clusters. Condor manages node monitoring, scheduling, job launch, monitoring, and cleanup.
10 …and Condor can manage non-dedicated resources Examples of non-dedicated resources: desktop workstations in offices, workstations in student labs. Non-dedicated resources are often idle (roughly 70% of the time!). Condor can effectively harness the otherwise wasted compute cycles from non-dedicated resources.
11 Some HTC Challenges Condor does whatever it takes to run your jobs, even if some machines… crash (or are disconnected), run out of disk space, don't have your software installed, are frequently needed by others, or are far away & managed by someone else.
12 The Condor System Unix and Win2k/XP Operational since 1986 Just at UW: more than 1800 CPUs in 10 pools on our campus Software available free on the web Open license Adopted by the real world (Galileo, Maxtor, Micron, Oracle, Tigr, Xerox, NASA, Texas Instruments, … )
13 Downloads and Deployments
15 Outline What is the Condor Project? What is the Condor HTC Software? Recipe for using desktops for science Data!
16 Recipe Tip: Useful Distributed Ownership mechanisms in Condor Checkpoint / Migration Checkpoint == picture of process state Enables preempt/resume scheduling and migration, ensures forward progress Remote System Calls Redirect I/O and other system calls back to the submit machine. Matchmaking with ClassAds
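A minimal sketch of how a job enables checkpointing and remote system calls, assuming it is relinked with condor_compile and submitted to Condor's standard universe (program and file names hypothetical):

```
% condor_compile cc -o run_sim run_sim.c
```

```
# sim.sub -- hypothetical standard-universe submit file
universe   = standard
executable = run_sim
log        = sim.log
queue
```

Relinking against Condor's libraries is what lets the job write checkpoints and redirect its I/O and system calls back to the submit machine.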
17 ClassAds Set of bindings of attribute names to expressions. Self-describing (no separate schema). Combine query and data. Arbitrarily composed and nested. Bilateral. Resource owners are generous if it doesn't cost them anything!
18 Examples
[
  Type = "Job";
  Owner = "raman";
  Cmd = "run_sim";
  Args = "-Q ";
  Cwd = "/u/raman";
  Memory = 31;
  Qdate = ...;
  ...
  Rank = other.Kflops ...
  Requirements = other.Type = ...
]
[
  Type = "Machine";
  Name = "xxy.cs....";
  Arch = "iX86";
  OpSys = "Solaris";
  Mips = 104;
  Kflops = 21893;
  State = "Unclaimed";
  LoadAvg = ...;
  ...
  Rank = ...;
  Requirements = ...;
]
19 Attribute Expressions
Constants: 104, "iX86", ...
References: attr, self.attr, other.attr, expr.attr
Operators: +, *, >>, =, &&, ...
Functions: strcat, substr, floor, member, ...
Lists: { expr, expr, ... }
ClassAds: [ name=expr; name=expr; ... ]
20 Examples: Descriptive attributes
Type = "Job";
Owner = "raman";
Arch = "iX86";
OpSys = "Solaris";
Memory = 64;   // megabytes
Disk = ...;    // k bytes
21 Examples: Current state
DayTime = 36017;       // secs past midnight
KeyboardIdle = 1432;   // seconds
State = "Unclaimed";
LoadAvg = ...;
22 Examples: Parameters
ResearchGrp = { "raman", "miron", "solomon", "jbasney" };
Friends = { "tannenba", "wright" };
Untrusted = { "rival", "riffraff" };
WantCheckpoint = 1;
23 Examples: Derived data
Rank =   // machine's rank for job
  10 * member(other.Owner, ResearchGrp) + member(other.Owner, Friends);
Rank =   // job's rank for machine
  Kflops/1E3 + other.Memory/32;
24 Examples: Job constraint
Requirements = other.Type == "Machine" &&
               Arch == "iX86" && OpSys == "Solaris" &&
               Disk > ... && other.Memory >= self.Memory;
25 Examples: Machine constraint
Requirements = !member(other.Owner, Untrusted) &&
               (Rank >= 10 ? true :
                Rank > 0 ? (LoadAvg < 0.3 && KeyboardIdle > 15*60) :
                DayTime < 6*60*60 || DayTime > 18*60*60);
26 Matching Algorithm To match two ads A and B: set up the environment so that in A, self evaluates to A and other evaluates to B, and attribute references are searched for first in A and then in B (and vice versa, with A and B interchanged). Check whether A.Requirements and B.Requirements both evaluate to true. Use A.Rank and B.Rank to express preferences among candidate matches.
27 Three-valued Logic other.Memory > 32, other.Memory == 32, other.Memory != 32, and !(other.Memory == 32) all evaluate to UNDEFINED if other has no "Memory" attribute. other.Mips >= 10 || other.Kflops >= 1000 evaluates to TRUE if either attribute exists and satisfies the given condition.
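The matching rules above can be illustrated with a toy Python sketch of symmetric, ClassAd-style matchmaking. The evaluator, attribute names, and ads are illustrative only; the real ClassAd language parses expressions rather than using lambdas.

```python
# Toy sketch of ClassAd-style symmetric matchmaking with a simplified
# three-valued logic (comparisons against UNDEFINED never succeed).

UNDEFINED = object()  # stands in for the ClassAd UNDEFINED value

def lookup(attr, primary, secondary):
    """Search for an attribute first in one ad, then in the other."""
    if attr in primary:
        return primary[attr]
    return secondary.get(attr, UNDEFINED)

def ge(x, y):
    """x >= y, except comparisons involving UNDEFINED never succeed."""
    if x is UNDEFINED or y is UNDEFINED:
        return False
    return x >= y

def matches(a, b):
    """A and B match if both Requirements evaluate to true, with 'self'
    bound to the ad itself and 'other' bound to the candidate ad."""
    return bool(a["Requirements"](a, b)) and bool(b["Requirements"](b, a))

job = {
    "Type": "Job", "Memory": 31,
    "Requirements": lambda self, other:
        lookup("Type", other, self) == "Machine"
        and lookup("Arch", other, self) == "iX86"
        and lookup("OpSys", other, self) == "Solaris"
        and ge(lookup("Memory", other, self), self["Memory"]),
}

machine = {
    "Type": "Machine", "Arch": "iX86", "OpSys": "Solaris", "Memory": 64,
    "Requirements": lambda self, other:
        lookup("Type", other, self) == "Job",
}

# A machine ad with no Arch attribute: the job's Arch test evaluates
# against UNDEFINED and fails, so there is no match.
bare_machine = {
    "Type": "Machine",
    "Requirements": lambda self, other:
        lookup("Type", other, self) == "Job",
}

print(matches(job, machine))       # prints True
print(matches(job, bare_machine))  # prints False
```

The sketch collapses UNDEFINED to "not satisfied" inside Requirements, which is how a failed attribute lookup prevents a match in the examples above.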
28 Recipe Tip: Build from the Bottom up! Start with a service for a single user, on a single machine: a Personal Condor. Condor on your own workstation, no local system/root access required, no system administrator intervention needed.
29 [Diagram: a personal Condor on your workstation managing 600 Condor jobs]
30 Personal Condor?! What's the benefit of a Condor pool with just one user and one machine?
31 Your Personal Condor will... … keep an eye on your jobs and will keep you posted on their progress … implement your policy on the execution order of the jobs … keep a log of your job activities … add fault tolerance to your jobs … implement your policy on when the jobs can run on your workstation
32 Expand from your desktop… Build a Condor pool inside your organization: install Condor on multiple machines, pointing them to your initial machine as the manager. Utilize Condor resources at remote organizations (build a grid): take advantage of your Condor-using friends. Get permission to access their resources, then configure your Condor pool to "flock" to their pools. The accounting system is flocking-aware.
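As a hedged sketch (host names hypothetical), flocking is configured with the FLOCK_TO / FLOCK_FROM settings in condor_config:

```
# condor_config fragment on your pool's submit machines
FLOCK_TO = condor.friendly-pool.example.edu

# condor_config fragment on the friendly pool's central manager
FLOCK_FROM = submit.your-pool.example.edu
```

With these in place, jobs that cannot be matched in your own pool are offered to the friendly pool's matchmaker.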
33 [Diagram: your personal Condor on your workstation, with 600 Condor jobs flocking to your Condor pool and a friendly Condor pool]
34 Condor-G What about resources at remote organizations that are NOT managed via Condor? (Perhaps they are managed via PBS, SGE, LSF, …) Condor-G is a job broker for grid middleware: submit jobs to resources managed via grid middleware such as Globus (GT2 & GT3), NorduGrid, Unicore, or Oracle (or Condor). Oracle: run PL/SQL programs on Oracle just like a normal job, via transactions, put them in DAGs, etc.
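A hedged sketch of a Condor-G submit file targeting a Globus GT2 gatekeeper in front of a PBS cluster (host name and executable hypothetical):

```
# condorg.sub -- hypothetical Condor-G submit file
universe      = grid
grid_resource = gt2 gatekeeper.example.edu/jobmanager-pbs
executable    = run_sim
output        = sim.out
log           = sim.log
queue
```

The job is submitted with condor_submit as usual; Condor-G handles authentication, staging, and monitoring against the remote gatekeeper.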
35 Condor GlideIn Problems: What if the grid middleware or remote scheduler doesn't provide services I want? What about end-to-end semantic guarantees? Solution: submit the Condor daemons to remote schedulers instead of the job. When the resources run these GlideIn jobs, they temporarily join your Condor pool and run the job as usual.
36 [Diagram: glide-in jobs extend your personal Condor across your Condor pool, the friendly Condor pool, and Globus grid resources managed by PBS, LSF, and Condor]
37 Outline What is the Condor Project? What is the Condor HTC Software? Recipe for using desktops for science Data! Harmonize computation w/ data storage and data movement.
38 Data Movement: Stork Scheduler for wide-area data transfer. Condor historically focused on CPU allocation; data movement was an implicit side effect. Stork elevates data movement to a first-class citizen: data movement is another type of node within a job dependency graph, and is now queued, scheduled, monitored, managed, and checkpointed.
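As a hedged sketch (the URLs, node names, and exact Stork request syntax are assumptions and may differ by version), a Stork transfer request and its place in a DAG might look like:

```
# stage_in.stork -- hypothetical Stork transfer request (ClassAd syntax)
[
  dap_type = "transfer";
  src_url  = "gsiftp://remote.example.edu/data/input.dat";
  dest_url = "file:///scratch/input.dat";
]
```

```
# pipeline.dag -- data movement as a first-class node
DATA   StageIn  stage_in.stork
JOB    Simulate sim.sub
PARENT StageIn CHILD Simulate
```

DAGMan would hand the DATA node to Stork and only start the compute job once the transfer succeeds.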
39 Data Access: Parrot Useful in distributed batch systems where one has access to many CPUs, but no consistent distributed filesystem (BYOFS!). Works with legacy programs:
% gv /gsiftp/...
% grep Yahoo /http/...
40 Data Storage: NeST Storage management software; a complementary piece of Condor software that adds storage management to the traditional CPU management. Key features: user level; guaranteed storage reservations that allow higher-level scheduling and planning (e.g. Stork); a flexible, extensible protocol layer that allows easy integration with existing middleware and applications; easily deployable via glide-in.
41 Practical and easily deployable User-level; requires no privilege. Package NeST as standard batch jobs. Result: managed storage. General; glide-in works everywhere. [Diagram: gliding in storage management over the Internet to an SGE cluster, with a NeST server backed by the home store]
42 BirdBath SOAP Interfaces to Condor Services. LBNL: workflow, ZSI (soon: LIGO, the Laser Interferometer Gravitational-Wave Observatory). IU: portals. UK (University College London, Cambridge): .NET
43 The Idea Computing power is everywhere; we try to make it usable by anyone.
44 Thank you! Condor Project on the Web: