JRA 1 Progress Report
ETICS 2 All-Hands Meeting
Alain Roy and Becky Gietzel, University of Wisconsin-Madison
Palermo, October 2008
Personnel Change
- Peter Couvares has left the Condor Project and ETICS
- Becky Gietzel now manages the UW build and test facility
- Todd Miller now manages the Metronome software
- Alain Roy is the ETICS JRA 1 Work Package Manager
- Nate Griswold is system administrator
- Peter is now at: visiblecertainty.com
Major focuses of activity right now
- Focus 1: Remote job submission
- Focus 2: Submission to other batch systems
Focus 1: Remote Job Submission
- Goal: Ability to submit from one build and test facility to another.
- Approach: When a job cannot be run locally, run the job with Condor-C on a remote pool.
- Questions you might ask:
  - Why can’t a job run locally?
  - What is this Condor-C stuff?
Question: Why couldn’t a job run locally?
- When you submit the job, even if you allow job migration, Condor will run the job locally if a computer is available.
  - You might have computers available locally, but they’re busy.
  - You might not have computers available locally: perhaps you are requesting a platform that exists only at a remote site.
- Metronome will try to run the job remotely when:
  - 5 minutes have passed without a match (configurable),
  - … and the Metronome administrator allows remote job submission,
  - … and the job owner allows remote job submission.
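The three conditions above can be read as a single boolean test. The following is a minimal sketch under that reading, not Metronome's actual code: the names `RemotePolicy`, `match_timeout_s`, `admin_allows_remote`, and `should_submit_remotely` are hypothetical and do not come from Metronome or Condor.

```python
from dataclasses import dataclass


@dataclass
class RemotePolicy:
    """Site-wide settings (hypothetical names, for illustration only)."""
    match_timeout_s: int = 300          # "5 minutes ... (configurable)"
    admin_allows_remote: bool = False   # set by the Metronome administrator


def should_submit_remotely(unmatched_s: float,
                           policy: RemotePolicy,
                           owner_allows_remote: bool) -> bool:
    """Return True only when all three conditions from the slide hold."""
    waited_long_enough = unmatched_s >= policy.match_timeout_s
    return (waited_long_enough
            and policy.admin_allows_remote
            and owner_allows_remote)


if __name__ == "__main__":
    policy = RemotePolicy(admin_allows_remote=True)
    print(should_submit_remotely(120, policy, owner_allows_remote=True))
    print(should_submit_remotely(301, policy, owner_allows_remote=True))
```

Because all three conditions are joined with "and", either the administrator or the job owner can keep a job local simply by not opting in.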
Question: How do you run the job remotely? What is this Condor-C stuff?
- There are two components:
  - Job Router:
    - Watches for a job that can migrate
    - Rewrites the job very slightly:
      - No longer a “vanilla” Condor job
      - A Condor-C job
  - Condor-C:
    - Instead of matching a job to a computer, runs a job at a remote Condor site
    - Instead of submitting a job to a Condor startd (execution computer), submits to a Condor schedd (submit computer)
    - Implication: matching will happen again at the remote site
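To make the "rewrites the job very slightly" step concrete, here is a small sketch that models a job as a dictionary of submit-description settings. It assumes the usual Condor-C submit form (`universe = grid` with `grid_resource = condor <schedd> <collector>`); the real Job Router works on job ClassAds, and its exact rewrite is not shown in these slides. The function name and the host names are hypothetical.

```python
def route_to_condor_c(job: dict, remote_schedd: str, remote_collector: str) -> dict:
    """Return a copy of a vanilla job rewritten as a Condor-C (grid) job.

    Illustrative only: this is not the Job Router's implementation.
    """
    if job.get("universe", "vanilla") != "vanilla":
        raise ValueError("only vanilla jobs are candidates for routing")

    routed = dict(job)  # leave the original description untouched
    routed["universe"] = "grid"
    # Condor-C targets a remote schedd (submit computer), not a startd.
    routed["grid_resource"] = f"condor {remote_schedd} {remote_collector}"
    return routed


if __name__ == "__main__":
    local_job = {"universe": "vanilla", "executable": "build.sh"}
    # Host names below are placeholders, not real ETICS or UW machines.
    print(route_to_condor_c(local_job, "schedd.remote.example", "cm.remote.example"))
```

Everything else in the job stays the same, which is why matching has to happen a second time once the job reaches the remote schedd.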
Diagram of Remote Job Submission
[Figure: a Local Site and a Remote Site, each showing a Condor Matchmaker (for computers), a Condor Submitter (schedd), and Condor Worker Nodes (startd), with paths numbered 1 and 2.]
State of Remote Job Submission
- Tested in testbed: it works well!
  - Running 24 jobs per day (1 per hour)
  - Working 100%
- Currently moving to pre-production
  - We hope to demonstrate in pre-production very soon
- Requires software upgrades:
  - Metronome upgrade to 2.5.x
  - Condor upgrade to 7.1.x
Focus 2: Submission to Other Batch Systems
- We are currently prototyping submission to other batch systems.
- Approach: Use Condor-G
- Conceptually similar to Condor-C, but instead of submitting to Condor, we can submit to:
  - Unicore
  - CREAM
  - NorduGrid
  - GRAM 2 (pre-web services GRAM)
  - GRAM 4 (web-services GRAM)
  - PBS
  - LSF
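With Condor-G, the target system is selected by the first word of the job's `grid_resource` setting. The lookup table below is a sketch of how the systems listed above might map to those grid-type keywords. The keywords are assumed to match the Condor manual and should be checked against the documentation for the Condor version in use; the table and function names are hypothetical.

```python
# Assumed Condor-G grid-type keywords; verify against the Condor manual.
GRID_TYPES = {
    "Unicore": "unicore",
    "CREAM": "cream",
    "NorduGrid": "nordugrid",
    "GRAM 2": "gt2",   # pre-web services GRAM
    "GRAM 4": "gt4",   # web-services GRAM
    "PBS": "pbs",
    "LSF": "lsf",
}


def grid_type_for(batch_system: str) -> str:
    """Look up the grid-type keyword for a batch system named on the slide."""
    try:
        return GRID_TYPES[batch_system]
    except KeyError:
        raise ValueError(f"no Condor-G grid type known for {batch_system!r}") from None


if __name__ == "__main__":
    for name in GRID_TYPES:
        print(f"{name:10s} -> grid_resource = {grid_type_for(name)} ...")
```

The rest of each `grid_resource` line (host names, URLs, queue names) differs per grid type and is deliberately left out here.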
Tradeoffs
- When we don’t use plain old Condor or Condor-C, there are tradeoffs. Some apply when you use Condor-G, some when you use another, non-Condor solution.
- Metronome uses Condor streaming I/O for real-time updates.
- Metronome uses Condor DAGMan to control the set of jobs that makes up a build/test.
  - Works great with Condor-G and Condor-C
- Condor has mechanisms to recover and/or restart failed jobs.
  - Some work with Condor-G
- Hawkeye for computer information (used for matching)
- Co-scheduling (parallel jobs)
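For readers unfamiliar with DAGMan, the sketch below generates a small DAG input file that runs a chain of jobs in order, using DAGMan's `JOB` and `PARENT ... CHILD` statements. The stage names and submit-file names are hypothetical; the DAG that Metronome actually builds for a build/test is not described in these slides.

```python
def build_dag(stages: list[tuple[str, str]]) -> str:
    """Build DAGMan input text for stages that must run one after another.

    `stages` is an ordered list of (node_name, submit_file) pairs.
    """
    lines = [f"JOB {name} {submit_file}" for name, submit_file in stages]
    # Chain each stage to the next so DAGMan runs them in sequence.
    for (parent, _), (child, _) in zip(stages, stages[1:]):
        lines.append(f"PARENT {parent} CHILD {child}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical stages of a build/test run.
    print(build_dag([
        ("checkout", "checkout.sub"),
        ("build", "build.sub"),
        ("test", "test.sub"),
    ]), end="")
```

Each node points at an ordinary submit file, so the same DAG structure can be kept whether the individual jobs run as plain Condor, Condor-C, or Condor-G jobs.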
Questions?