Condor Usage at Brookhaven National Lab Alexander Withers (talk given by Tony Chan) RHIC Computing Facility Condor Week - March 15, 2005

About Brookhaven National Lab ● One of a handful of laboratories supported and managed by the U.S. government through DOE. ● Multi-disciplinary lab with 2,700+ employees; Physics is the largest department. ● The Physics Dept. has its own computing division (30+ FTEs) to support physics (HEP) projects. ● RHIC (nuclear) and ATLAS (HEP) are the largest projects currently supported.

Computing Facility Resources ● Full-service facility: central/distributed storage capacity, large Linux Farm, robotic system for data storage, data backup, etc. ● 6+ PB of permanent tape storage capacity. ● 500+ TB of central/distributed disk storage capacity. ● 1.4 million SPECint2000 of aggregate computing power in the Linux Farm.

History of Condor at Brookhaven ● First looked at Condor in 2003 as a replacement for LSF and in-house batch software. ● Installed in August ● Upgraded to in February ● Upgraded to (with startd binary) in August ● User base grew from 12 (April 2004) to 50+ (March 2005).

The Rise in Condor Usage

Condor Cluster Usage

BNL’s modified Condorview

Overview of Computing Resources ● Total of 2,750 CPUs (growing to in 2005). ● Two central managers, with one acting as a backup. ● Three specialized submit machines, each handling ~600 simultaneous jobs on average. ● 131 of the execute nodes can also act as submission nodes. ● One monitoring/Condorview server.
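
The roles above map onto which Condor daemons run where. A hedged sketch of the kind of DAEMON_LIST settings these roles typically imply (illustrative only, not BNL's actual configuration files):

    # Central manager (one of the two machines; the backup runs the same list):
    DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR

    # Dedicated submit machine (no jobs executed locally):
    DAEMON_LIST = MASTER, SCHEDD

    # Execute node that can also submit jobs:
    DAEMON_LIST = MASTER, STARTD, SCHEDD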

Overview of Computing Resources, cont. ● Six Globus gateway machines for remote job submission. ● Most machines run SL on the x86 platform; some still run RH 7.3. ● Running with the startd binary to take advantage of the multiple-VM feature.
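
The multiple-VM feature is a startd configuration setting. A minimal sketch, assuming a dual-CPU node advertised as two VMs (the value is illustrative; the knob name is from the 6.x configuration, renamed NUM_SLOTS in later releases):

    # Advertise the machine as two virtual machines, roughly one per CPU (illustrative).
    NUM_VIRTUAL_MACHINES = 2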

Overview of Configuration ● Computing resources divided into 6 pools. ● Two configuration models: – Split a pool's resources into two parts and restrict which jobs can run in each part. – A more complex version of the Bologna Batch System. ● A pool uses one or both of these models. ● Some pools employ user-priority preemption. ● Use a “drop queue” method to fill fast machines first. ● Have tools to easily reconfigure nodes. ● All jobs use the vanilla universe (no checkpointing).
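
Two of the points above can be shown as a short submit-description sketch (illustrative; BNL's actual “drop queue” method may be implemented differently, e.g. on the negotiator side):

    # Illustrative submit-description fragment, not BNL's actual files.
    universe = vanilla     # all jobs use the vanilla universe (no checkpointing)
    rank     = Mips        # prefer faster machines, so fast nodes tend to fill first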

Two Part Model ● Nodes are assigned one of two tasks irrespective of Condor: analysis or reconstruction. ● Within Condor, a node advertises itself as either an analysis node or a reconstruction node. ● A job must advertise itself in the same manner to match with an appropriate node. ● Only certain users may run reconstruction jobs, but anyone can run an analysis job.
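
A hedged sketch of how this advertising and matching can be expressed; the attribute names NodeType and JobType are hypothetical, not BNL's actual configuration:

    # startd configuration on an analysis node (illustrative):
    NodeType     = "analysis"
    STARTD_EXPRS = NodeType                          # publish NodeType in the machine ClassAd
    START        = (TARGET.JobType =?= MY.NodeType)  # only accept jobs of the matching type

    # submit-description fragment for an analysis job (illustrative):
    +JobType     = "analysis"
    requirements = (TARGET.NodeType =?= "analysis")

Restricting reconstruction jobs to certain users could then be layered on with an additional clause on the job's Owner in the START expression.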

[Diagram: Analysis/Reconstruction pool, with nodes arranged into machine groups (Group 1 through Group 5) ordered by speed, each node providing vm1 and vm2. Policy: no suspension, no preemption; a job starts whenever a CPU is free. Example: a reconstruction job that wants group <= 2.]
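
The “wants group <= 2” example corresponds to a job requirement against a machine-advertised group number; a sketch, assuming the nodes publish a custom integer attribute named Group (the attribute name is an assumption):

    # Illustrative submit-description fragment for the reconstruction job in the diagram:
    requirements = (TARGET.Group <= 2)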

A More Complex Version of the Bologna Model ● Two-CPU nodes, each with 8 VMs. ● 2 VMs per CPU. ● Only two jobs running at a time. ● Four job categories, each with its own priority. ● A high-priority VM will suspend a random VM of lower priority. ● The randomness prevents the same VM from always getting suspended.

[Diagram: the same Analysis/Reconstruction machine groups (Group 1 through Group 5, ordered by speed), with each node's VMs split into priority categories: MC (vm1/vm2), Low (vm3/vm4), Med (vm5/vm6), High (vm7/vm8). Policy: low-priority VMs are suspended rather than preempted; a job starts if a CPU is free or the job is of higher priority. Example: a medium-priority (vm5/vm6) reconstruction job that wants group == 3.]
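
One way to pin the four categories to specific VMs is a START expression keyed on VirtualMachineID plus a custom job attribute; a minimal sketch under those assumptions (the JobCategory attribute is hypothetical, and the suspend-a-random-lower-priority-VM logic is site-specific and not shown):

    # startd configuration (illustrative): which job category may start on which VMs.
    START = ( (VirtualMachineID <= 2 && TARGET.JobCategory =?= "MC") || \
              (VirtualMachineID >= 3 && VirtualMachineID <= 4 && TARGET.JobCategory =?= "low") || \
              (VirtualMachineID >= 5 && VirtualMachineID <= 6 && TARGET.JobCategory =?= "medium") || \
              (VirtualMachineID >= 7 && TARGET.JobCategory =?= "high") )

    # Illustrative submit-description fragment for a medium-priority job:
    +JobCategory = "medium"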

Issues We've Had to Deal With ● Tuned parameters to alleviate scalability problems: – MATCH_TIMEOUT – MAX_CLAIM_ALIVES_MISSED ● Panasas (a proprietary file system) creates kernel threads with whitespace in the process name, which broke an fscanf in procapi.C --> Panasas fixed the bug. ● High-volume users can dominate the pool; partially solved with PREEMPTION_REQUIREMENTS.
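
The knobs named above live in the daemon and negotiator configuration; a hedged sketch of the kind of values involved (the numbers are illustrative, not BNL's production settings):

    # Give matchmaking and claim keep-alives more slack on a busy pool (illustrative values).
    MATCH_TIMEOUT           = 300
    MAX_CLAIM_ALIVES_MISSED = 10

    # Only let a better-priority user preempt when the priority gap is large enough,
    # so a single high-volume user cannot hold the whole pool indefinitely (illustrative).
    PREEMPTION_REQUIREMENTS = RemoteUserPrio > SubmittorPrio * 1.2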

Issues We’ve Had to Deal With, cont. ● DAGMan problems (latency, termination) --> switched from DAGMan to plain Condor. ● Created our own ClassAds and JobAds to build batch queues and handy management tools (i.e., our own version of condor_off). ● Modified Condorview to meet our accounting and monitoring requirements.

Issues Not Yet Resolved ● Need a job ClassAd attribute that gives the user's primary group --> better control over cluster usage. ● Transfer output files for debugging when a job is evicted. ● Need an option to force the schedd to release its claim after each job. ● Allow the schedd to set a mandatory periodic_remove policy --> avoids manual cleanup.
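
Until a schedd-enforced policy exists, a per-job periodic_remove expression in the submit description gives a similar cleanup effect; a minimal sketch (the three-day limit is illustrative):

    # Illustrative submit-description fragment: remove any job that has accumulated
    # more than three days of wall-clock time.
    periodic_remove = (RemoteWallClockTime > 3 * 24 * 3600)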

Issues Not Yet Resolved, cont. ● The shadow seems to make a large number of NIS calls; possibly a caching problem --> address shadows in the vanilla universe? ● Need Kerberos support to comply with security mandates. ● Interested in Computing on Demand (COD), but lack of functionality prevents more usage. ● Need more (and effective) cluster management tools --> does condor_off work?

Near-Term Plans & Summary ● Waiting for the 6.8.x series (late 2005?) before upgrading. ● Scalability concerns as usage rises. ● High availability becomes more critical as usage rises. ● Integration of BNL Condor pools with external pools, but concerned about security. ● Need some of the functionality listed above for a meaningful upgrade and to improve cluster management capability.