1
Moving from CREAM CE to ARC CE
Andrew Lahiff
2
The short version
- Install ARC CE
- Test ARC CE
- Move ARC CE(s) into production
- Drain CREAM CE(s)
- Switch off CREAM CE(s)
3
Migration at RAL
In 2013 we combined:
- migration from Torque to HTCondor
- migration from CREAM CE to ARC CE
Initial reasons for the choice of ARC CE:
- we didn’t like CREAM
- HTCondor-CE was still very new, even in OSG
- had heard good things about ARC: Glasgow & Imperial College in the UK had already tried it
- looked much simpler than CREAM: YAIM not required
- ATLAS use it a lot
4
Migration at RAL
Initially had CREAM CEs + Torque worker nodes
[Diagram: CREAM CEs, Torque server / Maui, worker nodes (Torque), APEL, glite-CLUSTER]
5
Migration at RAL
Added HTCondor pool + ARC & CREAM CEs
[Diagram: CREAM CEs + Torque server + worker nodes (Torque) + APEL + glite-CLUSTER alongside ARC CEs (condor_schedd), CREAM CEs (condor_schedd), HA HTCondor central managers, worker nodes (condor_startd)]
6
Migration at RAL
Torque batch system decommissioned
[Diagram: APEL, glite-CLUSTER, ARC CEs (condor_schedd), CREAM CEs (condor_schedd), HA HTCondor central managers, worker nodes (condor_startd)]
7
Migration at RAL
CREAM CEs & APEL publisher decommissioned, once all LHC & non-LHC VOs could submit to ARC
[Diagram: glite-CLUSTER, ARC CEs (condor_schedd), HA HTCondor central managers, worker nodes (condor_startd)]
8
Migration at RAL
glite-CLUSTER decommissioned
[Diagram: ARC CEs (condor_schedd), HA HTCondor central managers, worker nodes (condor_startd)]
9
ARC CEs at RAL
4 ARC CEs, each a VM with:
- 4 CPUs
- 32 GB memory
  - most memory usage comes from the condor shadows; we use 32-bit HTCondor rpms and will move to static shadows soon
  - have seen slapd using up to ~1 GB
  - we wanted to have lots of headroom! (we were new to both ARC and HTCondor)
Using multiple ARC CEs for redundancy & scalability
10
ARC CEs at RAL
Example from today – 5.5K running jobs on a single CE
11
Usage since Oct 2013
Generally have 2-3K running jobs per CE
[Plot: running jobs per ARC CE; one dip is a monitoring glitch]
12
Things you need to know
13
glite-WMS support
Some non-LHC VOs still use glite-WMS
- getting less & less important
For the WMS job wrapper to work with ARC CEs, an empty file /usr/etc/globus-user-env.sh is needed on all worker nodes (see the sketch below)
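A minimal sketch of creating that file, run on each worker node by hand or via a configuration management tool:

  # Create the empty file expected by the WMS job wrapper
  mkdir -p /usr/etc
  touch /usr/etc/globus-user-env.sh
  chmod 644 /usr/etc/globus-user-env.sh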
14
Software tags
Software tags (almost) no longer needed due to CVMFS
- some non-LHC VOs may need them however; again, probably getting less & less important
ARC runtime environments appear in the BDII in the same way as software tags
- unless you have a shared filesystem (worker nodes, CEs), there is no way for VOs to update tags themselves
- our configuration management system manages the runtime environments; mostly just empty files (see the sketch below)
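A sketch of how such mostly-empty runtime environments could be laid down; the runtime directory path and the tag names are placeholders and should match the runtimedir configured for A-REX:

  # Publish software tags as (mostly empty) ARC runtime environment files
  RUNTIMEDIR=/etc/arc/runtime                        # placeholder path
  mkdir -p "$RUNTIMEDIR/APPS/HEP"
  touch "$RUNTIMEDIR/APPS/HEP/VO-cms-EXAMPLE-TAG"    # hypothetical tag name
  touch "$RUNTIMEDIR/APPS/HEP/VO-atlas-EXAMPLE-TAG"  # hypothetical tag name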
15
Information system
Max CPU & wall time not published correctly
- only a problem for the HTCondor backend: no way for ARC to determine this from HTCondor
- could try to extract it from SYSTEM_PERIODIC_REMOVE? but what if the limit is enforced on the worker nodes instead, e.g. via WANT_HOLD?
- we modified /usr/share/arc/glue-generator.pl
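Purely as an illustration, a wall-clock limit could be pulled out of the schedd configuration along these lines, assuming SYSTEM_PERIODIC_REMOVE contains an expression such as RemoteWallClockTime > <seconds> (attribute names and thresholds will differ per site):

  # Extract a wall-clock limit (in seconds) from SYSTEM_PERIODIC_REMOVE
  condor_config_val SYSTEM_PERIODIC_REMOVE \
      | grep -Eo 'RemoteWallClockTime *> *[0-9]+' \
      | grep -Eo '[0-9]+'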
16
Information system - VO views
ARC reports the same number of running & idle jobs for all VOs
We modified /usr/share/arc/glue-generator.pl:
- a cron job running every 10 mins queries HTCondor & creates files listing numbers of jobs by VO (a sketch of such a query follows)
- glue-generator.pl modified to read these files
Some VOs still need this information (incl. LHC VOs); hopefully the need for this will slowly go away
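A hedged sketch of the kind of HTCondor query the cron job might run; x509UserProxyVOName is the job attribute usually set for jobs submitted with a VOMS proxy, and the output format is only illustrative:

  # Count idle (JobStatus 1) and running (JobStatus 2) jobs per VO
  # Output lines look like: "  120 2 atlas" (count, status, VO)
  condor_q -allusers -constraint 'JobStatus == 1 || JobStatus == 2' \
      -af JobStatus x509UserProxyVOName | sort | uniq -c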
17
Information system – VO shares
VO shares not published
Added some lines into /usr/share/arc/glue-generator.pl:
  GlueCECapability: Share=cms:20
  GlueCECapability: Share=lhcb:27
  GlueCECapability: Share=atlas:49
  GlueCECapability: Share=alice:2
  GlueCECapability: Share=other:2
Not sure why this information is needed anyway
18
LHCb
DIRAC can’t specify runtime environments
- we use an auth plugin to specify a default runtime environment
- we put all the essential things in here (grid-related env variables etc.)
The default runtime environment needs to set NORDUGRID_ARC_QUEUE=<queue name> (see the sketch below)
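A minimal sketch of what such a default runtime environment script might contain; the queue name, the variables and the stage at which they are exported are placeholders to adapt:

  # Default runtime environment sketch; ARC calls RTE scripts with
  # $1 = 0 (on the CE at submission time), 1 (on the worker node before
  # the job runs) and 2 (after the job)
  if [ "x$1" = "x1" ]; then
      export NORDUGRID_ARC_QUEUE=grid            # placeholder queue name
      export VO_CMS_SW_DIR=/cvmfs/cms.cern.ch    # example grid-related variable
  fi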
19
Multi-core jobs
In order for stdout/err to be available to the VO, need to set RUNTIME_ENABLE_MULTICORE_SCRATCH=1 in a runtime environment
In ours we have (amongst other things):
  if [ "x$1" = "x0" ]; then
      export RUNTIME_ENABLE_MULTICORE_SCRATCH=1
  fi
20
Auth plugins
Can configure an external executable to run every time a job is about to switch to a different state
- ACCEPTED, PREPARING, SUBMIT, FINISHING, FINISHED, DELETED
Very useful! Our standard uses:
- setting the default runtime environment for all jobs
- scaling CPU & wall time for completed jobs
- occasionally for debugging, e.g. keeping all stdout/err files for completed jobs of a particular VO
(a minimal plugin sketch follows)
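As an illustration only, an auth plugin can be a small script that inspects its arguments and exits 0 to let the job carry on; the exact authplugin line in arc.conf, the arguments passed, and the log path used here are assumptions to check against the ARC documentation:

  #!/bin/bash
  # Placeholder auth plugin: log state transitions for debugging
  STATE="$1"   # e.g. ACCEPTED, PREPARING, SUBMIT, FINISHING, FINISHED, DELETED
  JOBID="$2"   # assumed to be passed via an arc.conf substitution
  echo "$(date -u +%FT%TZ) job=$JOBID state=$STATE" >> /var/log/arc/authplugin.log
  exit 0       # a non-zero exit refuses the state change (check ARC docs for exact semantics)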
21
User mapping
Argus for mapping to local pool accounts (via lcmaps)
In /etc/arc.conf:
  [gridftpd]
  ...
  unixmap="* lcmaps liblcmaps.so /usr/lib64 /usr/etc/lcmaps/lcmaps.db arc"
  unixmap="banned:banned all"
Set up Argus policies to allow all supported VOs to submit jobs
22
Monitoring
23
Monitoring - alerts
ARC Nagios tests:
- Check proc a-rex
- Check proc gridftp
- Check proc nordugrid-arc-bdii
- Check proc nordugrid-arc-slapd
- Check ARC APEL consistency: check that an SSM message was sent successfully to APEL < 24 hours ago
- Check HTCondor-ARC consistency: check that HTCondor & ARC agree on the number of running + idle jobs
(a process-check sketch follows)
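The process checks can be plain check_procs probes; the process names passed to -C below are assumptions and should be matched to what actually runs on the CE:

  # Nagios-style process checks (process names are assumptions)
  /usr/lib64/nagios/plugins/check_procs -c 1: -C arched      # A-REX
  /usr/lib64/nagios/plugins/check_procs -c 1: -C gridftpd    # gridftp server
  /usr/lib64/nagios/plugins/check_procs -c 1: -C slapd       # nordugrid-arc-slapd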
24
Monitoring - alerts
HTCondor Nagios tests:
- Check HTCondor CE Schedd: check that the schedd ClassAd is available
  - we found that a check for condor_master is not enough, e.g. if you have a corrupt HTCondor config file
- Check job submission HTCondor: check that Nagios can successfully submit a job to HTCondor
(a schedd-probe sketch follows)
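A minimal sketch of the schedd check, assuming the probe runs on the CE itself and the schedd ad is named after the host; exit codes follow the usual Nagios convention:

  #!/bin/bash
  # Verify that the schedd ClassAd can be retrieved from the collector
  if condor_status -schedd "$(hostname -f)" -af Name | grep -q .; then
      echo "OK: schedd ClassAd available"
      exit 0
  else
      echo "CRITICAL: schedd ClassAd not available"
      exit 2
  fi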
25
Monitoring - Ganglia
- Ganglia: standard host metrics
- Gangliarc: ARC-specific metrics
- condor_gangliad: HTCondor-specific metrics (enabling sketch below)
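condor_gangliad is enabled by adding GANGLIAD to the HTCondor daemon list; the config.d file name below is arbitrary:

  # Enable condor_gangliad on the CE
  cat > /etc/condor/config.d/99-gangliad.conf <<'EOF'
  DAEMON_LIST = $(DAEMON_LIST) GANGLIAD
  EOF
  condor_reconfig    # the master should pick up the new daemon (or restart it)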
26
Monitoring - Ganglia
[Further Ganglia metric graphs]
27
Monitoring - InfluxDB
1-min time resolution
- ARC CE metrics: job states, time since last A-REX heartbeat
- HTCondor metrics include: shadow exit codes, numbers of jobs run more than once (see the sketch below)
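For example, the “jobs run more than once” number can be derived from the NumJobStarts job attribute; this is just one way to collect the value before pushing it to InfluxDB:

  # Count queued jobs that have started more than once
  condor_q -allusers -constraint 'NumJobStarts > 1' -af ClusterId ProcId | wc -l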
28
Monitoring - InfluxDB
29
Problems we’ve had
- APEL central message broker hardwired in config
  - when the hostname of the message broker changed once, APEL publishing stopped
  - we now have a Nagios check for APEL publishing
- ARC-HTCondor running+idle jobs consistency
  - before scan-condor-job was optimized, we had ~2 incidents in the past couple of years where ARC lost track of jobs
  - best to use ARC version > 5.0.0