CREAM-CE/HTCondor site

GRIF site report: a CREAM-CE/HTCondor site

What is GRIF?
- A distributed grid site, mainly serving LHC experiments (started in 2005)
- 6 labs around Paris, ~7 PB of storage, ~15k logical cores
- Goals:
  - Be seen and used as a single site
  - Build one technical team distributed across the labs

HTCondor @ GRIF
- 3 sub-sites led the migration to HTCondor, motivated by:
  - Scalability issues at some sub-sites
  - Better multicore support
  - Lack of support for some historical components (Maui)
- Migration started in July 2014 with one ARC-CE/HTCondor instance for multicore support
- The 2 other sub-sites chose to keep CREAM-CE with HTCondor as the batch system (January 2015)

Why CREAM-CE?
- We support many different VOs and did not want to redo validation work on the VO front-ends
- We know CREAM-CE well, so it is easier for us to understand where a problem lies
- We were happy with CREAM-CE

CREAM-CE/HTCondor
- HTCondor support has been present in CREAM-CE for a long time: ~95% of the work was done
- But until now no site was actually using it:
  - Some parts were broken
  - Some parts were missing
- However, a site can plug a shell script in between CREAM-CE and the HTCondor schedd: /usr/libexec/condor_local_submit_attributes.sh (see the sketch below)
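The hook works by printing extra lines that get appended to the HTCondor submit description generated for the CREAM job. A minimal sketch follows; the $QUEUE environment variable used to obtain the CREAM queue name is an assumption and must be adapted to how your BLAH/CREAM setup actually exposes it.

```bash
#!/bin/bash
# /usr/libexec/condor_local_submit_attributes.sh (minimal sketch)
# Whatever this script prints on stdout is appended to the HTCondor
# submit description that CREAM/BLAH generates for the job.

# Assumption: the CREAM queue name is available as $QUEUE
if [ -n "$QUEUE" ]; then
    # Record which CREAM queue was used as a custom job ClassAd attribute
    echo "+CreamQueue = \"$QUEUE\""
fi

# Any other site-specific attribute can be injected the same way
echo '+SubmitHookVersion = "1.0"'
```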

Queue vs ClassAd
- CREAM-CE is based on queues, but it only uses a queue as an endpoint: there is no need to link it to anything behind it
- What we want to do with queues:
  - Mainly apply a policy per queue (multicore / long jobs / short jobs / …)
  - Add a ClassAd attribute recording which queue was used

Scheduling policy
- The main idea at our grid site:
  - Each group has priority access to X% of the cluster
  - But it can get more when resources are available
  - And we guarantee that a job can run for up to 72 hours
- This is what you can do with:
  - Dynamic quotas (based on accounting_group)
  - accept_surplus and autoregroup
  - Preemption must not be enabled
- A configuration sketch is shown below
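A hedged sketch of the corresponding HTCondor configuration; the group names and quota fractions are illustrative, not GRIF's actual values.

```
# Negotiator side: dynamic group quotas with surplus sharing, no preemption
GROUP_NAMES = group_atlas, group_cms, group_other

# Fraction of the pool each group has priority on
GROUP_QUOTA_DYNAMIC_group_atlas = 0.40
GROUP_QUOTA_DYNAMIC_group_cms   = 0.40
GROUP_QUOTA_DYNAMIC_group_other = 0.20

# Let groups go above their quota when resources are idle
GROUP_ACCEPT_SURPLUS = True
GROUP_AUTOREGROUP    = True

# Never preempt running jobs
NEGOTIATOR_CONSIDER_PREEMPTION = False
PREEMPTION_REQUIREMENTS = False

# Startd side: when a node is being drained, let jobs run up to 72 hours
MAXJOBRETIREMENTTIME = 72 * 3600
```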

Hierarchical fair share
- A long-standing VO request: "could 50% of my fair share be dedicated to analysis and 50% to production?"
- On the CREAM-CE side we create one queue per activity
- And set the right accounting_group information at submission time (see the sketch below)
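A sketch of how such a split can be expressed with nested group quotas; the VO and subgroup names are illustrative.

```
# Subgroup quotas are fractions of the parent group's share
GROUP_NAMES = group_cms, group_cms.analysis, group_cms.production

GROUP_QUOTA_DYNAMIC_group_cms            = 0.40
GROUP_QUOTA_DYNAMIC_group_cms.analysis   = 0.50
GROUP_QUOTA_DYNAMIC_group_cms.production = 0.50
```

Jobs then land in the right subgroup because the submit hook sets an attribute such as +AccountingGroup = "group_cms.analysis.<user>" for the corresponding CREAM queue.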

Group assignment
- The key point of the scheduling policy, and CREAM-CE provides no way to do it out of the box
- But the shell script has access to the user proxy, so we can extract the VO name and role and map them to a group and/or subgroup
- The shell script also has access to the CREAM queue name, so we can use the queue for the subgroup instead of the role
- A hedged sketch of such a mapping follows
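A sketch of the kind of mapping the hook can do; it assumes the proxy path is exported as X509_USER_PROXY and that voms-proxy-info is available, and the group names are illustrative.

```bash
#!/bin/bash
# Sketch: map the VO / VOMS role from the user proxy to an accounting group
# inside /usr/libexec/condor_local_submit_attributes.sh.

FQAN=$(voms-proxy-info -file "$X509_USER_PROXY" -fqan 2>/dev/null | head -n 1)
LOCAL_USER=$(id -un)

case "$FQAN" in
    /cms/*Role=production*) GROUP="group_cms.production" ;;
    /cms*)                  GROUP="group_cms.analysis"   ;;
    /atlas*)                GROUP="group_atlas"          ;;
    *)                      GROUP="group_other"          ;;
esac

# Appended to the generated HTCondor submit description
echo "+AccountingGroup = \"$GROUP.$LOCAL_USER\""
```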

Multicore support
- Some VOs want to run multicore jobs, under the agreement that one multicore job takes 8 cores
- We simply made all our slots partitionable (see below)
- But it is harder to monitor the effective usage of the cluster, since HTCondor provides a job-based view
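A minimal worker-node configuration sketch for a single partitionable slot owning the whole machine:

```
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%, ram=100%, disk=100%, swap=100%
SLOT_TYPE_1_PARTITIONABLE = True
```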

Single / multicore
- For fair share we created a new 'multicore' subgroup with its own quota
- To avoid multicore jobs staying idle forever we enabled the condor_defrag daemon
- Draining is configured to stop as soon as 8 cores are free on a node (thanks, Andrew from RAL)
- But no dynamic configuration modification (no DrainBoss)
- A defrag configuration sketch is shown below
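A hedged sketch of the condor_defrag settings; the draining rates are illustrative.

```
# Run the defrag daemon on the central manager
DAEMON_LIST = $(DAEMON_LIST) DEFRAG

# Only consider partitionable nodes with at least 8 cores
DEFRAG_REQUIREMENTS = PartitionableSlot && TotalCpus >= 8

# Consider a node "whole" (and stop draining it) once 8 cores are free,
# instead of waiting for the entire machine to drain
DEFRAG_WHOLE_MACHINE_EXPR = Cpus >= 8

# Keep the draining rate gentle
DEFRAG_DRAINING_MACHINES_PER_HOUR = 1.0
DEFRAG_MAX_CONCURRENT_DRAINING = 4
DEFRAG_MAX_WHOLE_MACHINES = 4
```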

Partitionable slots and the BDII
- The BDII is the grid information system database: it gathers all the information about your site, based on a script that runs on your CREAM-CE
- But the CREAM-CE/HTCondor BDII plugin did not support partitionable slots
- So we rewrote it in Python using the HTCondor Python bindings (see the sketch below)
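A sketch of the kind of query such an info provider can do with the Python bindings to count total and free cores on partitionable slots (the GLUE attributes it would publish are not shown):

```python
import htcondor

# Ask the collector for all partitionable startd ads
coll = htcondor.Collector()
ads = coll.query(htcondor.AdTypes.Startd,
                 constraint='PartitionableSlot =?= True',
                 projection=['Cpus', 'TotalSlotCpus'])

# On a partitionable slot, Cpus is the number of still-unclaimed cores
total_cores = sum(int(ad['TotalSlotCpus']) for ad in ads)
free_cores = sum(int(ad['Cpus']) for ad in ads)
print("total=%d free=%d used=%d" % (total_cores, free_cores,
                                    total_cores - free_cores))
```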

Accounting in HTCondor
- Based on work done at RAL
- We convert the HTCondor history into PBS accounting logs
- And publish them to the national accounting system as if we were a classic CREAM-CE/Torque cluster
- A rough sketch of the conversion idea follows
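A very rough sketch of the conversion idea: read finished jobs from the schedd history and write PBS-style "E" (end) records. The exact field names and formats required by the accounting parser are an assumption here and must be checked against your APEL/PBS setup.

```python
import time
import htcondor

def hms(seconds):
    """Format seconds as HH:MM:SS, as in PBS accounting logs."""
    seconds = int(seconds)
    return "%02d:%02d:%02d" % (seconds // 3600, (seconds % 3600) // 60,
                               seconds % 60)

schedd = htcondor.Schedd()
attrs = ['ClusterId', 'ProcId', 'Owner', 'QDate', 'JobStartDate',
         'CompletionDate', 'RemoteWallClockTime', 'RemoteUserCpu',
         'RemoteSysCpu']

# JobStatus == 4 means "Completed"; only the last 100 jobs in this sketch
with open('/tmp/condor-pbs.acct', 'a') as out:
    for ad in schedd.history('JobStatus == 4', attrs, 100):
        end = int(ad.get('CompletionDate', 0))
        stamp = time.strftime('%m/%d/%Y %H:%M:%S', time.localtime(end))
        jobid = '%d.%d' % (ad['ClusterId'], ad['ProcId'])
        cput = ad.get('RemoteUserCpu', 0) + ad.get('RemoteSysCpu', 0)
        out.write('%s;E;%s;user=%s qtime=%d start=%d end=%d '
                  'resources_used.walltime=%s resources_used.cput=%s\n' % (
                      stamp, jobid, ad.get('Owner', 'unknown'),
                      int(ad.get('QDate', 0)),
                      int(ad.get('JobStartDate', 0)), end,
                      hms(ad.get('RemoteWallClockTime', 0)), hms(cput)))
```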

One year in production
- No big issues
- No regrets
- But…

Held jobs
- Some jobs went into the HELD state: CREAM-CE submitted the job but the user proxy was not available in the CREAM sandbox
- It was a pure CREAM-CE issue
- Very easy to debug: the hold message was clear (file missing)

Accountantnew.log corruption
- Triggered just last week
- When the CREAM-CE is overloaded, the shell script gets a "Java out of memory…" exception
- The error message contains lots of whitespace, and our script did not catch it but passed the message on as a group name
- On restart, the negotiator complained that Accountantnew.log was corrupt

Future plans
- Move to HTCondor-CE:
  - CREAM-CE is now our bottleneck
  - A single middleware stack
- Create one GRIF-wide HTCondor pool:
  - One CE (submitter) per sub-site
  - Highly available central managers
  - Every worker node available to every CE
  - Probably add some ClassAd attributes (e.g. WorkerLocation) to allow ranking