Condor at Brookhaven Xin Zhao, Antonio Chan Brookhaven National Lab CondorWeek 2009 Tuesday, April 21
Outline
RACF background
RACF Condor batch system
USATLAS grid job submission using Condor-G
RACF
Brookhaven (BNL) is a multi-disciplinary DOE lab.
RHIC and ATLAS Computing Facility (RACF) provides computing support for BNL activities in HEP, NP, Astrophysics, etc.
–RHIC Tier0
–USATLAS Tier1
Large installation
–7000+ CPUs, 5+ PB of storage, 6 robotic silos with capacity of 49,000+ tapes
Storage and computing to grow by a factor of ~5 by 2012.
New Data Center rising New data center will increase floor space by a factor ~2 in summer of 2009.
BNL Condor Batch System Introduced in 2003 to replace LSF. Steep learning curve – much help from Condor staff. Extremely successful implementation. Complex use of job slots (formerly VM’s) to determine job priority (queues), eviction, suspension and back-filling policies.
Condor Queues
Originally designed with vertical scalability
–Complex queue priority configuration per core
–Maintainable on older hardware with fewer cores
Changed to horizontal scalability in 2008
–More and more multi-core hardware now
–Simplified queue priority configuration per core
–Reduced administrative overhead
Condor Policy for ATLAS (old)
ATLAS Condor configuration (old)
Condor BNL
ATLAS Condor configuration (new)
Condor Queue Usage
Job Slot Occupancy (RACF) Left-hand plot is for 01/2007 to 06/2007. Right-hand plot is for 06/2007 to 05/2008. Occupancy remained at 94% between the two periods.
Job Statistics (2008) Condor usage by RHIC experiments increased by 50% (in terms of number of jobs) and by 41% (in terms of cpu time) since PHENIX executed ~50% of its jobs in the general queue. General queue jobs amounted to 37% of all RHIC Condor jobs during this period. General queue efficiency increased from 87% to 94% since 2007.
Near-Term Plans Continue integration of Condor with Xen virtual systems. OS upgrade to 64-bit SL5.x – any issues with Condor? Condor upgrade from to stable series 7.2.x Short on manpower – open Condor admin position at BNL. If interested, please talk to Tony Chan.
Condor-G Grid job submission BNL, as USATLAS Tier1, provides support to the ATLAS PanDA production system. PanDA Job Flow
One critical service is to maintain PanDA autopilot submission using Condor-G
–Very large number (~15000) of current pilot jobs as a single user
–Need to maintain a very high submission rate
  Autopilot attempts to always keep a set number of pending jobs in every queue of every remote USATLAS production site
–Three Condor-G submit hosts in production
  Quad-core Intel Xeon 2.66GHz, 16 GB memory and two 750GB SATA drives (mirrored disks)
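The "keep a set number of pending jobs in every queue" rule can be sketched as a small top-up calculation. This is only an illustration under assumed names: `pilots_to_submit`, `target_pending`, `max_per_cycle` and the site labels are hypothetical and are not taken from the PanDA autopilot code.

```python
def pilots_to_submit(pending_by_queue, target_pending, max_per_cycle=50):
    """Return how many new pilots to submit to each queue in one cycle.

    pending_by_queue: dict mapping a queue name to its current pending count.
    target_pending:   pending pilots we want to see in every queue (assumed).
    max_per_cycle:    hypothetical cap so one queue cannot flood the submit host.
    """
    plan = {}
    for queue, pending in pending_by_queue.items():
        deficit = max(target_pending - pending, 0)  # never a negative submission
        plan[queue] = min(deficit, max_per_cycle)
    return plan


if __name__ == "__main__":
    # Hypothetical snapshot of pending pilots at three remote queues.
    snapshot = {"site_a": 3, "site_b": 20, "site_c": 0}
    print(pilots_to_submit(snapshot, target_pending=10, max_per_cycle=5))
```

Capping the per-cycle submission is a design choice made for this sketch; it spreads load on the submit host across cycles instead of filling a large deficit at once.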
We work closely with the Condor team to tune Condor-G for better performance. Many improvements have been suggested and implemented by the Condor team.
Weekly OSG Gratia Job Count Report for USATLAS VO
New Features and Tuning of Condor-G submission (not a complete list)
Gridmanager publishes resource ClassAds to the collector, so users can easily query the grid job submission status for all remote resources.
$> condor_status -grid
Name Job Limit Running Submit Limit In Progress
gt2
gt2
gt2
gt2 heroatlas.fas.ha
gt2 osgserv01.slac.s
gt2 osgx0.hep.uiuc.e
gt2 tier2-01.ochep.o
gt2 uct2-grid6.mwt
gt2 uct3-edge7.uchic
Nonessential jobs
–Condor assumes every job is important, so it carefully holds and retries
  A pile-up of held jobs often clogs Condor-G and prevents it from submitting new jobs
–A new job attribute, Nonessential, is introduced. Nonessential jobs will be aborted instead of being put on hold.
–Suited for “pilot” jobs
  Pilots are a job sandbox, not the real job payload.
  Pilots themselves are not as essential as real jobs.
  The job payload connects to the PanDA server through its own channel.
  The PanDA server knows their status and can abort them directly if needed.
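As a rough illustration of how a pilot could be marked with the Nonessential attribute, the sketch below renders a grid-universe submit description as text. The attribute name comes from the slide; the helper name, the gatekeeper address and the executable are placeholders, and setting the attribute through the `+Name = value` submit-file syntax is an assumption about how a site would apply it.

```python
def pilot_submit_description(gatekeeper, executable, count=1):
    """Render a Condor-G submit description for pilot jobs.

    gatekeeper: GT2 contact string, e.g. "host/jobmanager-condor" (placeholder).
    executable: path to the pilot wrapper script (placeholder).
    count:      number of pilots queued by this description.
    """
    lines = [
        "universe = grid",
        f"grid_resource = gt2 {gatekeeper}",
        f"executable = {executable}",
        # Ask Condor-G to abort a problematic pilot instead of holding it.
        "+Nonessential = True",
        f"queue {count}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical gatekeeper and pilot script names.
    print(pilot_submit_description("gatekeeper.example.org/jobmanager-condor",
                                   "pilot.sh", count=10))
```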
GRID_MONITOR_DISABLE_TIME
–New configurable Condor-G parameter
  Controls how long Condor-G waits, after a grid monitor failure, before submitting a new grid monitor job
–Old default value of 60 minutes is too long
  New job submission quite often pauses during the wait time, so job submission cannot be sustained at a high rate
–New value is 5 minutes
  Much better submission rate seen in production.
–Condor-G developers plan to trace the underlying grid monitor failures in the Globus context
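To make the 60-minute versus 5-minute comparison concrete, here is a back-of-the-envelope sketch. It assumes the parameter is expressed in seconds in the Condor configuration file and that submission is fully paused for the whole wait after each failure; the failure rate of four per day is invented purely for illustration.

```python
def grid_monitor_disable_setting(minutes):
    """Format the config line, assuming the value is given in seconds."""
    return f"GRID_MONITOR_DISABLE_TIME = {int(minutes * 60)}"


def stalled_fraction(failures_per_day, disable_minutes):
    """Upper-bound fraction of a day spent waiting after grid monitor failures.

    Assumes each failure pauses submission for the full disable time and that
    the waits do not overlap; capped at 1.0.
    """
    stalled_minutes = failures_per_day * disable_minutes
    return min(stalled_minutes / (24 * 60), 1.0)


if __name__ == "__main__":
    print(grid_monitor_disable_setting(5))
    # Hypothetical rate of 4 grid monitor failures per day.
    for minutes in (60, 5):
        print(minutes, "min wait:", round(stalled_fraction(4, minutes), 4))
```

Under these assumptions the stalled time scales linearly with the disable time, so going from 60 to 5 minutes cuts it by a factor of 12.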
Separate throttle for limiting jobmanagers based on their role
–Job submission won’t compete with job stage_out/removal
Globus bug fix
–GRAM client (inside GAHP) stops receiving connections from remote jobmanager for job status updates.
–We ran a cron job to periodically kill the GAHP server to clear up the connection issue.
  This results in a slower job submission rate.
–New Condor-G binary is compiled against newer Globus libraries; so far so good. Need more time to verify.
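The idea behind the separate throttle per jobmanager role can be sketched with independent counters, so that a burst of stage-out or removal work cannot use up the slots reserved for submission. This is not Condor-G's implementation; the class, the role names and the limits are hypothetical.

```python
class RoleThrottle:
    """Track active jobmanagers per role, each role with its own limit."""

    def __init__(self, limits):
        self.limits = dict(limits)                  # role -> max concurrent
        self.active = {role: 0 for role in limits}  # role -> currently running

    def try_acquire(self, role):
        """Take a slot for `role` if one is free; return True on success."""
        if self.active[role] < self.limits[role]:
            self.active[role] += 1
            return True
        return False

    def release(self, role):
        """Give back a slot previously taken for `role`."""
        if self.active[role] > 0:
            self.active[role] -= 1


if __name__ == "__main__":
    # Hypothetical limits: 2 submission jobmanagers, 1 for stage-out/removal.
    throttle = RoleThrottle({"submit": 2, "cleanup": 1})
    throttle.try_acquire("cleanup")
    print(throttle.try_acquire("cleanup"))  # cleanup pool is full
    print(throttle.try_acquire("submit"))   # submission is unaffected
```

With a single shared limit, the second cleanup request and the submission request would compete for the same slot; splitting the pools is what keeps submission moving.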
Some best practices in Condor-G submission
–Reduce frequency of voms-proxy renewal on the submit host
  Condor-G aggressively pushes out new proxies to all jobs
  Frequent renewal of voms-proxy on the submit hosts slows down job submission
–Avoid hard-killing jobs (-forcex) from the client side
  Reduces job debris on the remote gatekeepers
  On the other hand, on the remote gatekeepers, we need to clean up debris more aggressively
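One way to reduce proxy renewal frequency is to renew only when the remaining lifetime falls below a threshold, instead of on a short fixed timer. The sketch below shows that decision only; the function name and the 6-hour threshold are assumptions, and obtaining the remaining lifetime from the proxy is left to the caller.

```python
def should_renew_proxy(seconds_left, min_lifetime=6 * 3600):
    """Return True only when the proxy is close enough to expiry.

    seconds_left: remaining proxy lifetime in seconds (supplied by the caller).
    min_lifetime: hypothetical threshold below which a renewal is triggered.
    """
    return seconds_left < min_lifetime


if __name__ == "__main__":
    # A proxy with 40 hours left is not renewed; one with 2 hours left is.
    print(should_renew_proxy(40 * 3600))
    print(should_renew_proxy(2 * 3600))
```

Each renewal that is skipped also avoids Condor-G pushing a new proxy to every job, which is the cost the slide warns about.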
Near-Term Plans Continue the good collaboration with the Condor team for better performance of Condor/Condor-G in our production environment.