Condor and the Grid, by D. Thain, T. Tannenbaum, M. Livny. Christopher M. Moretti, 23 February 2007
2 Problem & Opportunity Users need CPUs: scientific computing, mathematical modeling, data mining. Many CPU cycles are unused: personal workstations, general-use laboratories, research machines.
3 Solution: Condor, “A hunter of idle workstations”. Keeps track of resources needed and available; determines and assigns matches; monitors progress; cleans up and reports results.
4 Architecture Three principals: Agent (machine needing resources), Matchmaker, Resource (machine lending resources). Three phases: Advertising, Matching/Claiming, Deploying/Executing.
5 Advertising Agent (needy.cse.nd.edu) to Matchmaker: “I need X.” Lender (idle.cse.nd.edu) to Matchmaker: “I have Y.” Matchmaker: “Does Y satisfy X?”
6 Matching & Claiming Matchmaker to Agent (needy.cse.nd.edu): “Use idle.cse.nd.edu.” Matchmaker to Lender (idle.cse.nd.edu): “Listen for needy.cse.nd.edu.” Agent to Lender: “Are you still available?” Lender to Agent: “Yes.”
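The advertise, match, and claim exchange on the two slides above can be sketched in a few lines of Python. This is a toy in-process model, not Condor's implementation: the class names `Lender` and `Matchmaker`, their methods, and the use of plain sets for "X" and "Y" are all assumptions made for illustration. The point it shows is that the matchmaker only introduces the two parties; the agent must still claim the lender directly, because the matchmaker's information may be stale.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Lender:
    """A machine lending resources (hypothetical model)."""
    host: str
    offers: set
    available: bool = True
    expected_agent: Optional[str] = None

    def claim(self, agent_host):
        # "Are you still available?" The lender checks its own current
        # state rather than trusting what it advertised earlier.
        if self.available and self.expected_agent == agent_host:
            self.available = False
            return True
        return False


@dataclass
class Matchmaker:
    """Collects ads and proposes matches; it does not run jobs."""
    needs_by_agent: dict = field(default_factory=dict)  # agent host -> set of needs
    lenders: dict = field(default_factory=dict)         # lender host -> Lender

    def advertise_need(self, agent_host, needs):
        self.needs_by_agent[agent_host] = set(needs)    # "I need X"

    def advertise_offer(self, lender):
        self.lenders[lender.host] = lender              # "I have Y"

    def match(self):
        """Return (agent_host, lender_host) pairs where Y satisfies X."""
        pairs, taken = [], set()
        for agent_host, needs in self.needs_by_agent.items():
            for host, lender in self.lenders.items():
                if host not in taken and needs <= lender.offers:
                    lender.expected_agent = agent_host  # "Listen for <agent>"
                    taken.add(host)
                    pairs.append((agent_host, host))    # "Use <lender>"
                    break
        return pairs


mm = Matchmaker()
idle = Lender("idle.cse.nd.edu", {"INTEL", "LINUX"})
mm.advertise_offer(idle)
mm.advertise_need("needy.cse.nd.edu", {"LINUX"})
proposed = mm.match()
claimed = idle.claim("needy.cse.nd.edu")
```

Note that `match` never looks at `available`: in this sketch the lender alone decides at claim time, which is why a match can be proposed and still be refused.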
7 Deploying / Executing: Split Execution. Diagram labels: Shadow (on Agent, needy.cse.nd.edu); Sandbox (on Lender, idle.cse.nd.edu); “Fork!”; Shadow to Sandbox: “Run job J.”; job J, inside the Sandbox: “I need file /tmp/foo.”
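A minimal sketch of split execution, assuming the simplest possible model: the shadow stays on the submitting machine and answers the job's file requests, while the sandbox on the lending machine runs the job and forwards its I/O. The classes `Shadow` and `Sandbox`, the `open` method, and the in-memory dictionary standing in for the agent's file system are all hypothetical; a real deployment would carry these requests over the network rather than through direct method calls.

```python
class Shadow:
    """Lives on the agent's machine; serves the job's I/O requests."""

    def __init__(self, home_files):
        # path -> bytes; stands in for the submitting machine's file system
        self.home_files = home_files

    def read_file(self, path):
        return self.home_files[path]


class Sandbox:
    """Lives on the lender's machine; confines the job and forwards I/O."""

    def __init__(self, shadow):
        self.shadow = shadow
        self.requests = []  # record of what the job asked for

    def open(self, path):
        # "I need file /tmp/foo": the request goes back to the shadow
        # instead of touching the lender's own disk.
        self.requests.append(path)
        return self.shadow.read_file(path)

    def run(self, job):
        # "Run job J": the job only sees the sandbox's interface.
        return job(self)


def job_j(sandbox):
    """A toy job that reads one input file and reports its size."""
    data = sandbox.open("/tmp/foo")
    return len(data)


shadow = Shadow({"/tmp/foo": b"hello"})
sandbox = Sandbox(shadow)
result = sandbox.run(job_j)
```

The design choice illustrated here is that the job never needs the file to exist on the lending machine; everything it asks for is resolved at home.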
8 Matching How are matches determined? Policy; ClassAds. Why independently claim a match? What if the Matchmaker dies?
9 ClassAds
Job ad:
  MyType = "Job"
  TargetType = "Machine"
  Requirements = (( other.Arch == "INTEL" && other.OpSys == "LINUX" && KeyboardIdle > 600 ))
  Cmd = "/tmp/a.out"
  Owner = "cmoretti"
Machine ad:
  MyType = "Machine"
  TargetType = "Job"
  Machine = "dustpuppy.cse.nd.edu"
  Requirements = (( KeyboardIdle > 600 ))
  Arch = "INTEL"
  OpSys = "LINUX"
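The two ads above match only if each side's Requirements holds against the other. The following sketch mimics that two-way check with Python dictionaries and lambdas; it is not the ClassAd language or its evaluator. The function name `ads_match`, the `(my, other)` calling convention, and the `KeyboardIdle` value of 900 on the machine are assumptions added for illustration (the slide does not show the machine's current idle time), and missing attributes are simply treated as failing rather than as an undefined value.

```python
job_ad = {
    "MyType": "Job",
    "TargetType": "Machine",
    "Cmd": "/tmp/a.out",
    "Owner": "cmoretti",
    # The job's policy: which machines it is willing to run on.
    "Requirements": lambda my, other: (
        other.get("Arch") == "INTEL"
        and other.get("OpSys") == "LINUX"
        and other.get("KeyboardIdle", 0) > 600
    ),
}

machine_ad = {
    "MyType": "Machine",
    "TargetType": "Job",
    "Machine": "dustpuppy.cse.nd.edu",
    "Arch": "INTEL",
    "OpSys": "LINUX",
    "KeyboardIdle": 900,  # assumed value, in seconds; not given on the slide
    # The owner's policy: only lend the machine when nobody is typing.
    "Requirements": lambda my, other: my.get("KeyboardIdle", 0) > 600,
}


def ads_match(a, b):
    """True only if the types line up and BOTH Requirements are satisfied."""
    types_ok = a["TargetType"] == b["MyType"] and b["TargetType"] == a["MyType"]
    return types_ok and a["Requirements"](a, b) and b["Requirements"](b, a)


matched = ads_match(job_ad, machine_ad)
```

Because the check is symmetric, either party can veto: a machine whose owner touched the keyboard a minute ago fails both the job's and its own requirement.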
10 Flocking Using another pool’s resources: utilize more total resources; find resources that match needs. Two methods: gateway flocking, direct flocking.
11 Gateway Flocking Each pool has a known “gateway”. Gateways negotiate sharing: advertise resources and needs; transmit requests to the local matchmaker. Pool-level granularity: accounting, policy. Now obsolete.
12 Gateway Flocking [Diagram: Gateway, matchmaker (MM), resources (R), agent (A)]
13 Direct Flocking Agents report to other matchmakers. No gateways. Equivalent to being in multiple pools? Now the preferred (only) method.
14 Direct Flocking [Diagram: matchmaker (MM), resources (R), agent (A); steps numbered 1 to 3]
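Direct flocking, where an agent reports to matchmakers other than its own, can be sketched as a loop over pools. This is a toy model under stated assumptions: the function `flock`, the pool names, the host `busy.cse.nd.edu`, and the representation of each pool as a dictionary of resource offers are all hypothetical, and trying the pools in a fixed order with the home pool first is an assumption of this sketch rather than something the slides specify.

```python
def flock(needs, pools):
    """Ask each pool's matchmaker in turn.

    needs: set of required attributes.
    pools: list of (pool_name, {resource_host: set_of_offers}) pairs,
           in the order the agent contacts them.
    Returns (pool_name, resource_host) for the first match, else None.
    """
    for pool_name, resources in pools:
        for host, offers in resources.items():
            if needs <= offers:  # this pool has a resource that fits
                return pool_name, host
    return None


pools = [
    # Home pool: its only machine does not satisfy the request below.
    ("home-pool", {"busy.cse.nd.edu": {"SOLARIS"}}),
    # A second pool the agent also reports to; no gateway is involved.
    ("other-pool", {"idle.cse.nd.edu": {"INTEL", "LINUX"}}),
]

placement = flock({"LINUX"}, pools)
```

From the agent's point of view this behaves much like belonging to several pools at once, which is the question slide 13 raises.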
15 Flocking Comparison
Gateway Flocking: + Transparency; + Fosters organization-level sharing; - Poor accounting; - Complicated
Direct Flocking: + No gateways; + Individual relationships supported; - Non-transparent; - Fewer organization-level agreements
16 Things Aren’t Perfect What happens if (when) … the Matchmaker goes down? The network or Agent fails during deploy? The Resource or App fails during compute? Non-dedicated machines: How do we keep owners happy? What happens when an owner reclaims a resource?
17 Total Consumption in 2006 (Condor at Notre Dame) CPU-Hours Harnessed by Condor (48%); CPU-Hours Totally Unused (39%); CPU-Hours Consumed by Owner at Keyboard (11%); CPU-Hours Total (100%). Source: “Harnessing Idle Computers with Condor at Notre Dame: Impact on Research in 2006”, Douglas Thain
18 Current Donors Feb 2007 [Table: columns Owner, Nodes, CPUs, Storage (TB); rows CRC/OIT, CSE, Prof. Thain, Prof. Flynn, Prof. Striegel, Misc717, Total TB] Source: “Harnessing Idle Computers with Condor at Notre Dame: Impact on Research in 2006”, Douglas Thain
19 CPU History [Figure] Source: “Harnessing Idle Computers with Condor at Notre Dame: Impact on Research in 2006”, Douglas Thain
20 Recap Condor facilitates distributed computation on dedicated or scavenged CPUs arranged by a matchmaker using ClassAds. Split Execution is necessary to fit the job’s needs to the environment. An agent can advertise to multiple matchmakers to examine more potential matches.