Condor at Brookhaven
Xin Zhao, Antonio Chan, Brookhaven National Lab
CondorWeek 2009, Tuesday, April 21


Outline
– RACF background
– RACF Condor batch system
– USATLAS grid job submission using Condor-G

RACF
Brookhaven (BNL) is a multi-disciplinary DOE lab. The RHIC and ATLAS Computing Facility (RACF) provides computing support for BNL activities in HEP, NP, Astrophysics, etc.
– RHIC Tier0
– USATLAS Tier1
Large installation:
– 7000+ CPUs, 5+ PB of storage, 6 robotic silos with a capacity of 49,000+ tapes
Storage and computing to grow by a factor of ~5 by 2012.

New Data Center Rising
The new data center will increase floor space by a factor of ~2 in the summer of 2009.

BNL Condor Batch System
– Introduced in 2003 to replace LSF.
– Steep learning curve – much help from Condor staff.
– Extremely successful implementation.
– Complex use of job slots (formerly VMs) to determine job priority (queues), eviction, suspension and back-filling policies, as sketched below.
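A minimal condor_config sketch of this idea (an illustration only, not the actual RACF policy; the Experiment and IsBackfill job attributes are hypothetical stand-ins for the real queue flags):

# Illustration only, not the actual RACF policy
START   = ( TARGET.Experiment =?= "atlas" ) || ( TARGET.IsBackfill =?= True )
RANK    = ( TARGET.Experiment =?= "atlas" )    # prefer ATLAS jobs on this slot
SUSPEND = False
# Evict back-fill work once it has held the slot for more than an hour
PREEMPT = ( TARGET.IsBackfill =?= True ) && ( CurrentTime - EnteredCurrentState > 3600 )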

Condor Queues
Originally designed with vertical scalability:
– Complex queue priority configuration per core
– Maintainable on older hardware with fewer cores
Changed to horizontal scalability in 2008:
– More and more multi-core hardware now
– Simplified queue priority configuration per core
– Reduced administrative overhead

Condor Policy for ATLAS (old)

ATLAS Condor configuration (old)

Condor BNL

ATLAS Condor configuration (new)

Condor Queue Usage

Job Slot Occupancy (RACF) Left-hand plot is for 01/2007 to 06/2007. Right-hand plot is for 06/2007 to 05/2008. Occupancy remained at 94% between the two periods.

Job Statistics (2008)
– Condor usage by RHIC experiments increased by 50% (in terms of number of jobs) and by 41% (in terms of CPU time) since 2007.
– PHENIX executed ~50% of its jobs in the general queue.
– General queue jobs amounted to 37% of all RHIC Condor jobs during this period.
– General queue efficiency increased from 87% to 94% since 2007.

Near-Term Plans
– Continue integration of Condor with Xen virtual systems.
– OS upgrade to 64-bit SL5.x – any issues with Condor?
– Condor upgrade to the stable 7.2.x series.
– Short on manpower – open Condor admin position at BNL. If interested, please talk to Tony Chan.

Condor-G Grid Job Submission
BNL, as the USATLAS Tier1, provides support for the ATLAS PanDA production system.
PanDA Job Flow

One critical service is maintaining PanDA autopilot submission using Condor-G:
– Very large number (~15,000) of concurrent pilot jobs as a single user
– Need to maintain a very high submission rate
The autopilot attempts to always keep a set number of pending jobs in every queue at every remote USATLAS production site (see the sketch below).
– Three Condor-G submit hosts in production: quad-core Intel Xeon 2.66GHz, 16GB memory and two 750GB SATA drives (mirrored disks)
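As a rough illustration of the bookkeeping involved (not the actual autopilot code; the attributes used are standard Condor-G job attributes), the number of pending pilots per remote gatekeeper can be checked on a submit host with a query such as:

# Count idle (pending) Condor-G jobs per remote grid resource
$> condor_q -constraint 'JobStatus == 1' -format "%s\n" GridResource | sort | uniq -c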

We work closely with the Condor team to tune Condor-G for better performance. Many improvements have been suggested and implemented by the Condor team.
Weekly OSG Gratia Job Count Report for the USATLAS VO

New Features and Tuning of Condor-G submission (not a complete list)

The gridmanager publishes resource ClassAds to the collector, so users can easily query the grid job submission status for all remote resources:

$> condor_status -grid
Name                  Job Limit  Running  Submit Limit  In Progress
gt2 atlas.bu.edu:
gt2 gridgk04.racf.bn
gt2 heroatlas.fas.ha
gt2 osgserv01.slac.s
gt2 osgx0.hep.uiuc.e
gt2 tier2-01.ochep.o
gt2 uct2-grid6.mwt
gt2 uct3-edge7.uchic

Nonessential jobs
– Condor assumes every job is important: it carefully holds and retries them. The pile-up of held jobs often clogs Condor-G and prevents it from submitting new jobs.
– A new job attribute, Nonessential, is introduced. Nonessential jobs are aborted instead of being put on hold.
– Suited for "pilot" jobs: pilots are a job sandbox, not the real job payload, so they are not as essential as real jobs. The job payload connects to the PanDA server through its own channel, so the PanDA server knows its status and can abort it directly if needed.
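A minimal Condor-G submit description sketch for a nonessential pilot (the executable, file names and gatekeeper below are placeholders, not taken from the slides):

universe       = grid
grid_resource  = gt2 gatekeeper.example.org/jobmanager-condor
executable     = pilot_wrapper.sh
output         = pilot.$(Cluster).out
error          = pilot.$(Cluster).err
log            = pilot.$(Cluster).log
# Mark the pilot as nonessential: on failure it is aborted instead of held
+Nonessential  = True
queue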

GRID_MONITOR_DISABLE_TIME
– New configurable Condor-G parameter: controls how long Condor-G waits, after a grid monitor failure, before submitting a new grid monitor job.
– The old default value of 60 minutes is too long: new job submission quite often pauses during the wait time, and the submission rate cannot be sustained at a high level.
– The new value is 5 minutes: a much better submission rate is seen in production.
– The Condor-G developers plan to trace the underlying grid monitor failures in the Globus context.
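A one-line condor_config sketch of the change on the submit host (the parameter value is in seconds):

# Wait only 5 minutes after a grid monitor failure before retrying,
# instead of the old 60-minute default
GRID_MONITOR_DISABLE_TIME = 300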

Separate throttle for limiting jobmanagers based on their role
– Job submission won't compete with job stage-out/removal.
Globus bug fix
– The GRAM client (inside the GAHP) stops receiving connections from remote jobmanagers for job status updates.
– We ran a cron job to periodically kill the GAHP server to clear up the connection issue, at the cost of a slower job submission rate.
– The new Condor-G binary is compiled against newer Globus libraries; so far so good, but we need more time to verify.
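A submit-host condor_config sketch of the jobmanager throttle (the knob shown is the standard Condor-G per-resource limit and the value is only an example; the per-role split described above is internal to the newer Condor-G):

# Cap concurrent GT2 jobmanager processes per remote resource; with the
# per-role throttle, submissions no longer compete with stage-out/removal
GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE = 10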

Some best practices in Condor-G submission
– Reduce the frequency of voms-proxy renewal on the submit host: Condor-G aggressively pushes out new proxies to all jobs, so frequent renewal of the voms-proxy on the submit hosts slows down job submission.
– Avoid hard-killing jobs (-forcex) from the client side: this reduces job debris on the remote gatekeepers. On the other hand, on the remote gatekeepers we need to clean up debris more aggressively.
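A shell sketch of the two practices (the VO name, proxy lifetime and job id are placeholders):

# Create a longer-lived VOMS proxy so renewals (and the resulting proxy
# pushes to all jobs) happen less often
$> voms-proxy-init -voms atlas -valid 96:00
# Remove a pilot gracefully; avoid the client-side hard kill
$> condor_rm 1234.0
# (avoid: condor_rm -forcex 1234.0)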

Near-Term Plans
Continue the good collaboration with the Condor team for better performance of Condor/Condor-G in our production environment.