Lessons learned administering a larger setup for LHCb
Dirac User Workshop – Joel Closier – 23 May 2016

LHCbDirac in numbers
- Sites
  - 1 T0
  - 8 T1 (IN2P3, RAL, RRCKI, CNAF, GRIDKA, PIC, SARA, CERN)
  - 66 LCG, 7 DIRAC, 6 VAC, 14 CLOUD, 3 BOINC
- 620 users registered in VOMS
- 100 M pilots run until 1st January 2010
- 120 M jobs run until 1st January 2010
  - 71% Simulation
  - 35% User
  - 7% Stripping
  - 3% Merge
  - 3% Swimming
  - 2% Reconstruction
  - 1.5% Reprocessing

Storage in LHCb
- More than 130 SE (25 PB)
- Most common operations
  - 48.5 PB Replicate and register
  - 44 PB putAndRegister
  - 32 PB stage
  - 28 PB RemovePhysicalReplica

LHCbDirac services: evolution
- 2010 – 2014: use of physical machines (up to 27)
  - Between 16 and 32 CPUs each
  - Managed by Quattor
- 2016 – today: use of virtual machines
  - 51 instances (3 for tests/certifications, 3 for Jenkins tests)
  - 150 VCPUs
  - 305 GB RAM
  - 10 CEPH volumes
  - Managed by Puppet
  - 4 templates (standard, webportal, rabbitmq and lhcbci)
    - Same software installed
    - Same rules for firewall and iptables
    - Same local user

First configuration of VM for Dirac services
- 39 virtual machines (Puppet managed)
  - 2 (8 CPUs, 16 GB memory)
  - 11 (4 CPUs, 8 GB memory)
  - 26 (2 CPUs, 4 GB memory)
- 7 CEPH volumes (2.9 TB)
  - BOINC
  - Sandboxes
  - Monitoring
  - Transformation
  - Log
  - Swap
  - Failover
- /opt/dirac on the ephemeral disk of each VM

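As a quick cross-check of this inventory, the totals can be tallied in a few lines of Python. The figures are the ones listed on the slide; the tuple layout and variable names are only illustrative.

```python
# (number of VMs, CPUs per VM, memory per VM in GB), as listed on the slide
vm_flavours = [
    (2, 8, 16),
    (11, 4, 8),
    (26, 2, 4),
]

# Sum each quantity over the three VM sizes
total_vms = sum(count for count, _, _ in vm_flavours)
total_cpus = sum(count * cpus for count, cpus, _ in vm_flavours)
total_mem_gb = sum(count * mem for count, _, mem in vm_flavours)

print(total_vms, total_cpus, total_mem_gb)  # prints: 39 112 224
```

The three sizes add up to the 39 machines quoted above, for 112 CPUs and 224 GB of memory in total.
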
Evaluation of the first configuration
- Many machines to manage
- Small VM
  - With high load => processes killed
  - Small swap
- I/O not efficient with ephemeral disk
- Update of DIRAC software painful
  - No way to do it locally
  - Through the Web Portal: inefficient
  - Through the CLI: too long to do it in one single thread

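The last point suggests running the per-host update concurrently instead of in a single thread. The sketch below shows one way this could look with the Python standard library. It is not the DIRAC tooling: `update_host` is a hypothetical placeholder for whatever actually performs the update on one machine, and the host names and version string in the usage lines are made up.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def update_host(host, version):
    """Hypothetical placeholder for the real per-host update step."""
    # A real implementation would trigger the software update on `host` here.
    return f"{host}: updated to {version}"


def update_all(hosts, version, max_workers=8, update=update_host):
    """Run `update` on every host concurrently; collect results and failures."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(update, host, version): host for host in hosts}
        for future in as_completed(futures):
            host = futures[future]
            try:
                results[host] = future.result()
            except Exception as exc:  # keep going if one host fails
                failures[host] = exc
    return results, failures


# Usage with made-up host names and version label
done, failed = update_all(["vm01", "vm02", "vm03"], "vX")
print(sorted(done), sorted(failed))
```

Collecting failures per host, rather than stopping at the first error, makes it possible to retry only the machines that did not update.
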
Second iteration for the configuration
- Bigger VM
  - 16 CPUs
  - 32 GB RAM
- /opt/dirac on CEPH volume
- Management much easier
- Installation of DIRAC faster

Why this evolution?
- Some services need to have several instances
  - BookkeepingManager
  - JobStateUpdate
  - ResourceStatus
  - Optimizers
  - TransformationManager
- Some services need load balancing
  - Configuration server (hammered by all the pilots) (to be tested)
- Some services are busy for a given period
  - Better usage of the VM with a big machine

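To illustrate why several instances of a service help, here is a minimal client-side sketch that spreads calls over the instances at random and falls back to another one when an instance is unreachable. This is an assumption-laden illustration, not how the DIRAC client is implemented: `call_with_failover`, the `call` callback and the instance names are all hypothetical.

```python
import random


def call_with_failover(urls, call, rng=random):
    """Try the instances in random order; return the first successful reply."""
    candidates = list(urls)
    rng.shuffle(candidates)  # random order spreads the load over instances
    last_error = None
    for url in candidates:
        try:
            return call(url)
        except ConnectionError as exc:
            last_error = exc  # this instance is down, try the next one
    raise ConnectionError(f"all {len(candidates)} instances failed") from last_error


# Usage with made-up instance names: "instance-a" is simulated as down
def fake_call(url):
    if url == "instance-a":
        raise ConnectionError("unreachable")
    return f"reply from {url}"


print(call_with_failover(["instance-a", "instance-b"], fake_call))
```
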
VOBOXes - Monitoring
- The main entry point is: https://lhcb-portal-dirac.cern.ch/DIRAC/
  - Activity Monitor
  - Dashboard
  - System Administration

Monitoring of the machine

VOBOXes - Alarms
- Alarms defined so far
  - Filesystem full
  - /opt full
  - Swap space
  - High load
- Each alarm opens a ticket

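The checks behind such alarms can be sketched with the Python standard library. This is only an illustration of the four conditions listed above, not the monitoring actually deployed: the thresholds and function names are assumptions, `os.getloadavg` is available on Unix only, and the swap check assumes a Linux `/proc/meminfo`.

```python
import os
import shutil

# Thresholds are illustrative assumptions, not the values used by LHCb.
DISK_FULL_FRACTION = 0.90
LOAD_PER_CPU_LIMIT = 2.0


def disk_alarm(path, limit=DISK_FULL_FRACTION, usage=shutil.disk_usage):
    """Return True when the filesystem holding `path` is fuller than `limit`."""
    stats = usage(path)
    return stats.used / stats.total > limit


def load_alarm(limit_per_cpu=LOAD_PER_CPU_LIMIT):
    """Return True when the 1-minute load average per CPU exceeds the limit."""
    one_minute, _, _ = os.getloadavg()
    return one_minute / (os.cpu_count() or 1) > limit_per_cpu


def swap_used_fraction(meminfo_text):
    """Parse the content of /proc/meminfo and return the fraction of swap in use."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])  # sizes are reported in kB
    total = fields.get("SwapTotal", 0)
    free = fields.get("SwapFree", 0)
    return 0.0 if total == 0 else (total - free) / total


# Usage: check the root filesystem and /opt if it exists on this machine
for path in ("/", "/opt"):
    if os.path.exists(path):
        print(path, "full" if disk_alarm(path) else "ok")
```
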
LHCbDirac Setups
- 2 setups (previously 3)
  - Production
  - Certification
- All new versions of DIRAC can be tested with the Certification setup, except a few of them because of the Configuration Server
- Testing of this setup is associated with Jenkins to automate most of the steps:
  - Consistency of code
  - Installation
  - Production jobs

Databases
- 2 types of databases used in Production
  - ORACLE (Bookkeeping)
  - DBOD (DataBase On Demand)
    - Lbacc
    - Lbprod
    - Lbwms
    - Lbwmsacc
    - dfc
- 2 types of databases used in Certification
  - DBOD (DataBase On Demand): lbcertif, lbprdev

Services outside CERN
- Most of the services are located at CERN and are duplicated on several instances at CERN
- 6 machines outside CERN, located in the T1 sites used by LHCb
  - RAL
  - GRIDKA
  - IN2P3
  - CNAF
  - PIC
  - SARA
- Machines used for duplication of services
  - Configuration Server (slave instance)
  - ReqProxy

Main issues
With such a configuration and with the tools that we have in place:
- Difficult to spot services/agents/optimizers which are stuck
- Installation of a new version of Dirac is delicate
- Recovering a VM that is down is not trivial (no live migration)
- Web interface for system administration needs improvement

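One possible way to spot stuck components is to compare the time of their last sign of life with a maximum allowed silence. The sketch below assumes that such a timestamp is available for each component (for example from its last log line); the function name, the component names and the 30-minute threshold are all hypothetical.

```python
from datetime import datetime, timedelta


def find_stuck(last_seen, now, max_silence=timedelta(minutes=30)):
    """Return the components whose last sign of life is older than `max_silence`."""
    return sorted(
        name for name, seen in last_seen.items() if now - seen > max_silence
    )


# Usage with made-up component names and timestamps
now = datetime(2016, 5, 23, 12, 0)
last_seen = {
    "SomeAgent": now - timedelta(minutes=5),
    "SomeOptimizer": now - timedelta(hours=2),
}
print(find_stuck(last_seen, now))  # prints: ['SomeOptimizer']
```
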
Conclusions
- Web portal useful, with its system administration console, to manage a large set of machines, but missing functionalities
  - Dirac update not very friendly
  - Extension version number for the VO not displayed
  - Lot of clicks to get a meaningful error
- Duplication of services helps a lot with the load of the machines
- Single point of failure ??