AliEn central services (structure and operation)
Costin.Grigoras@cern.ch
11.07.2008 ALICE Offline Week - July 2008

Central machines
   5  32bit
  15  64bit
   6  Macs
  -----------
  26  machines on 2 * 1Gbit uplinks
Different roles. The MonALISA services running on the machines report machine monitoring and each service's status at:
  http://pcalimonitor.cern.ch/stats?page=machines/machines
19.2 kVA UPSs (15m..50m backup):
  http://pcalimonitor.cern.ch/stats?page=ups/ups
AliEn services – User interaction
  3 Authen
  3 Proxy
  2 User API Services
  4 Jobs API Services
AliEn services – Internal services
  PackMan, IS, Logger, TransferMgr
  MonALISA repository
  PackMan, Optimizers (Transfer, Catalogue, Jobs)
  MySQL – Catalogue, LDAP master
  MySQL – Task Queue, LDAP slave
  (currently there are >44M entries in the catalogue, ~100x more than what you have on a PC)
  Alice::CERN::SE xrootd redirector
AliEn services – backup
pcalienstorage:
  9TB raw / 6TB available for backup
  MySQL slave for both catalogue and task queue DBs (weekly stop / take snapshot / restart)
  /backup on all machines is mounted over NFS from this machine
  /opt/alien on all central machines is also mounted from this machine over NFS, with different base paths for each architecture
Build servers
  32bit SLC4
  64bit SLC4
  32bit OSX 10.5
  64bit OSX 10.5
  (+Itanium build server in CC)
DNS load balancing
Each machine reports its full status through ML (MonALISA) to the central repository, including:
  Operational status of each service (tested every 15m)
  Load on the machine, CPU, memory and swap utilisation
  No. of connected sockets
A weighted score is generated from the parameters above. Every minute the CERN DNS aliases are updated with the IP addresses of the machines that are not overloaded.
Users and site services query these DNS aliases when connecting to the central services; using them distributes the load evenly between the active machines and limits the damage that can be caused to the central services.
TODO: faster reaction times to services not working / overloaded
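The selection step described above can be sketched roughly as follows. This is only an illustration: the slide does not give the actual weights, thresholds or data layout, so `WEIGHTS`, `MAX_SOCKETS`, `OVERLOAD_THRESHOLD` and the `MachineStatus` fields are all hypothetical, and the real publication of the result to the CERN DNS is not shown.

```python
from dataclasses import dataclass


@dataclass
class MachineStatus:
    ip: str
    service_ok: bool   # result of the periodic service test
    load: float        # load average divided by number of cores
    cpu: float         # CPU utilisation, 0..1
    memory: float      # memory utilisation, 0..1
    swap: float        # swap utilisation, 0..1
    sockets: int       # number of connected sockets


# Hypothetical weights (they sum to 1.0) and limits, chosen for illustration only.
WEIGHTS = {"load": 0.4, "cpu": 0.2, "memory": 0.15, "swap": 0.15, "sockets": 0.1}
MAX_SOCKETS = 1000
OVERLOAD_THRESHOLD = 0.8


def score(m: MachineStatus) -> float:
    """Return a weighted load score; lower is better, inf means unusable."""
    if not m.service_ok:
        return float("inf")
    parts = {
        "load": min(m.load, 1.0),
        "cpu": min(m.cpu, 1.0),
        "memory": min(m.memory, 1.0),
        "swap": min(m.swap, 1.0),
        "sockets": min(m.sockets / MAX_SOCKETS, 1.0),
    }
    return sum(WEIGHTS[name] * value for name, value in parts.items())


def select_alias_members(machines):
    """Return the IPs that would be published behind the DNS alias."""
    return sorted(m.ip for m in machines if score(m) < OVERLOAD_THRESHOLD)
```

A machine whose service test fails is excluded regardless of its load, which mirrors the idea that only working, non-overloaded machines stay behind the alias.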
DNS load balancing in action
Wed Jul 9 07:23:24 CEST 2008 : alice-proxy 137.138.99.136 137.138.99.137 137.138.99.141
Thu Jul 10 13:40:38 CEST 2008 : alice-proxy 137.138.99.137 137.138.99.141
Thu Jul 10 13:44:52 CEST 2008 : alice-proxy 137.138.99.136 137.138.99.137 137.138.99.141
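To read such a log, one can compare consecutive entries for the same alias and list which addresses were added or removed. The sketch below assumes one entry per line in the form `<timestamp> : <alias> <ip> ...`, as in the three entries on this slide; the function names are hypothetical.

```python
def parse_entry(line):
    """Split '<timestamp> : <alias> <ip> ...' into its three parts."""
    timestamp, _, rest = line.partition(" : ")
    alias, *ips = rest.split()
    return timestamp, alias, set(ips)


def alias_changes(lines):
    """Return (timestamp, alias, added, removed) for every change in an alias."""
    changes = []
    previous = {}
    for line in lines:
        timestamp, alias, ips = parse_entry(line)
        if alias in previous and ips != previous[alias]:
            added = sorted(ips - previous[alias])
            removed = sorted(previous[alias] - ips)
            changes.append((timestamp, alias, added, removed))
        previous[alias] = ips
    return changes


LOG = [
    "Wed Jul 9 07:23:24 CEST 2008 : alice-proxy 137.138.99.136 137.138.99.137 137.138.99.141",
    "Thu Jul 10 13:40:38 CEST 2008 : alice-proxy 137.138.99.137 137.138.99.141",
    "Thu Jul 10 13:44:52 CEST 2008 : alice-proxy 137.138.99.136 137.138.99.137 137.138.99.141",
]

for change in alias_changes(LOG):
    print(change)
```

For the entries above this reports that 137.138.99.136 was removed from alice-proxy at 13:40:38 and added back at 13:44:52 on Jul 10.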
Making use of the Macs
6 8-core machines with 8GB of RAM... sounds very tempting!
Pablo managed to start both Authen and Proxy on alimacx01 in almost no time, BUT...
The services kept crashing very quickly:
  Default ulimit -u : 266
  Max ulimit -u : 2500
With these constraints, we cannot use the machines for anything that spawns many processes (e.g. Proxy). Authen runs fine though, as probably would several other central services.
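A service could check this limit before it starts forking. The sketch below uses Python's standard `resource` module, whose `RLIMIT_NPROC` corresponds to `ulimit -u`; it assumes that the "Max" value on the slide is the hard limit, and the function names and the `needed` parameter are hypothetical.

```python
import resource


def plan_soft_limit(soft, hard, needed):
    """Return the soft limit to request, or None if `needed` cannot be reached."""
    unlimited = resource.RLIM_INFINITY
    if soft == unlimited or soft >= needed:
        return soft            # current limit is already sufficient
    if hard == unlimited or hard >= needed:
        return needed          # can be raised without extra privileges
    return None                # hard limit is too low


def ensure_process_limit(needed):
    """Try to raise the per-user process limit to at least `needed`."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    target = plan_soft_limit(soft, hard, needed)
    if target is None:
        return False
    if target != soft:
        resource.setrlimit(resource.RLIMIT_NPROC, (target, hard))
    return True
```

With the values from the slide, `plan_soft_limit(266, 2500, 1000)` returns 1000, while a service needing 5000 processes gets `None`, which is the situation that rules out process-heavy services on these machines.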
Running jobs profile
Load comparison
"The more jobs, the less problems"?
(not quite: the load is higher when many jobs start / finish, or worse, when an SE is not available and causes an avalanche of failing jobs)
Load at >10k jobs
Running jobs vs. Load (last 6 months, 2-hour averages)
Future plans
  Upgrade old central machines (2+ years) with more modern hardware (8 cores, 16-32GB RAM, fast SAS drives)
  Use all available resources (especially the Macs) to be prepared to run at least 2x more jobs
  Install two additional power lines (16A) to accommodate the greedy hardware
  Maybe install some additional AC unit...
Last slide :)