Service planning and monitoring in T2 - Prague
Milos Lokajicek
GDB meeting, BNL, 5 Sept 2006
Overview
Introduction
Service planning and current status
–Capacities
–Networking
–Personnel
Monitoring
–HW and SW
–Middleware
–Service
Remarks
Introduction
Czech Republic's LHC activities
–ATLAS, target 3% of authors -> activities
–ALICE, target 1%
–TOTEM, a much smaller experiment, relative target higher
–(non-LHC: HERA/H1, TEVATRON/D0, AUGER)
Institutions (only the big groups mentioned)
–Academy of Sciences of the Czech Republic: Institute of Physics, Nuclear Physics Institute
–Charles University in Prague: Faculty of Mathematics and Physics
–Czech Technical University in Prague: Faculty of Nuclear Sciences and Physical Engineering
HEP manpower (2005)
–145 people: 59 physicists, 22 engineers, 21 technicians, 43 undergraduate and PhD students
Service planning
ATLAS + ALICE capacities: CPU (MSI2000), Disk (TB), MSS (TB)
[capacity table not reproduced here; based on the LCG MoU for ATLAS and ALICE and our anticipated share]
Project proposals to various grant systems in the Czech Republic
Prepare a bigger project proposal for CZ GRID together with CESNET
–For the LHC needs
–In 2010 add 3x more capacity for Czech non-HEP scientists, financed from state resources and EU structural funds
All proposals include new personnel (up to 10 new persons)
Today, regular financing, sufficient for D0
–today 250 cores, 150 kSI2k, 40 TB disk space, no tapes
Networking
Local connection of institutes in Prague
–Optical 1 Gbps E2E lines
WAN
–Optical E2E lines to Fermilab, Taipei, and newly FZK (from 1 Sept 06)
–Connection Prague – Amsterdam now through GN2
–Planning further lines to other T1s
CEF Networks workshop, Prague, May 30th, 2006
Personnel
Now 4 persons to run the T2
–Jiri Kosina – middleware (leaving, looking for replacement), storage (FTS), monitoring
–Tomas Kouba – middleware, monitoring
–Jan Svec – basic HW, OS, storage, networking, monitoring
–Lukas Fiala – basic HW, networking, web services
–Jiri Chudoba – liaison to ATLAS and ALICE, running the jobs and reporting errors, service monitoring
Further information is based on their experience
Monitoring
HW and basic SW
–Installation and test of new hardware
  Normally choose proven HW
  HW installation by the delivery firm
  Install the operating system and solve problems with the delivery firm
  Install middleware
  Test it for some time outside the production service
–Nagios
  Worker node access via ping
  Disks – how full the partitions are
  Load average
  Whether the pbs_mom process is running
  Number of running processes
  Whether the ssh daemon is running
  How full the swap is
  ...
  Limits for warning and error
  Distribution of mails or SMS to admins – fixing problems remotely
  Regular check of the Nagios web page for red dots
–Regular automatic (cron) checks and restarts for some daemons
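As an illustration only, a custom check of the kind listed above could follow the usual Nagios plugin convention (exit 0 = OK, 1 = WARNING, 2 = CRITICAL). This is a minimal sketch, not our production check; the swap thresholds are assumed values, while pbs_mom is the process named above.

    #!/usr/bin/env python
    # Sketch of a Nagios-style check: verify pbs_mom is running and
    # warn/alarm when swap usage is high (thresholds are examples).
    import os
    import subprocess
    import sys

    WARN_PCT = 50.0   # assumed warning threshold
    CRIT_PCT = 80.0   # assumed critical threshold

    def swap_usage_percent():
        # Parse /proc/meminfo for SwapTotal / SwapFree (Linux).
        info = {}
        for line in open("/proc/meminfo"):
            key, value = line.split(":", 1)
            info[key] = float(value.split()[0])
        total = info.get("SwapTotal", 0.0)
        if total == 0:
            return 0.0
        return 100.0 * (total - info["SwapFree"]) / total

    def pbs_mom_running():
        # pgrep exits with 0 if at least one matching process exists.
        devnull = open(os.devnull, "w")
        return subprocess.call(["pgrep", "-x", "pbs_mom"], stdout=devnull) == 0

    if __name__ == "__main__":
        swap = swap_usage_percent()
        if not pbs_mom_running():
            print("CRITICAL: pbs_mom not running")
            sys.exit(2)
        if swap >= CRIT_PCT:
            print("CRITICAL: swap %.0f%% full" % swap)
            sys.exit(2)
        if swap >= WARN_PCT:
            print("WARNING: swap %.0f%% full" % swap)
            sys.exit(1)
        print("OK: pbs_mom running, swap %.0f%% full" % swap)
        sys.exit(0)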
Monitoring
PBS – job count (via RRD and MRTG)
–Local tools for monitoring the number of jobs per machine per chosen period
APEL – not very useful so far, might be set up to give more useful info
GridICE
ATLAS
–Checks and statistics from the ATLAS database
ALICE – MonALISA – very useful
Monitoring of pool accounts and the actual user certificates
Networking
–Network traffic to FZK, SARA, CERN in a certain IP range
–With the help of IP accounting (utility ipac-ng)
SFT – Site Functional Tests – very useful
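A minimal sketch of the kind of local job-count tool mentioned above: count running and queued PBS jobs from plain qstat output and push the values into an RRD file. The RRD file path and its two data sources are assumptions, not our actual setup.

    #!/usr/bin/env python
    # Sketch: count running (R) and queued (Q) PBS jobs via 'qstat'
    # and feed the counts to rrdtool (file name is an example).
    import subprocess

    RRD_FILE = "/var/lib/rrd/pbs_jobs.rrd"   # assumed path

    def job_counts():
        out = subprocess.check_output(["qstat"]).decode()
        running = queued = 0
        for line in out.splitlines()[2:]:     # skip the two header lines
            fields = line.split()
            if len(fields) < 5:
                continue
            state = fields[4]                 # job state column in plain qstat
            if state == "R":
                running += 1
            elif state == "Q":
                queued += 1
        return running, queued

    if __name__ == "__main__":
        r, q = job_counts()
        # Assumes the RRD was created with two data sources: running, queued.
        subprocess.call(["rrdtool", "update", RRD_FILE, "N:%d:%d" % (r, q)])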
Network traffic graph summary:
–outgoing to fzk1: Max 37M, Average 6M, Total 129G
–outgoing to internet: Max 61M, Average 8M, Total 164G
Updates and patches
YAIM + automated updates on all farm nodes using the simple BEX script toolkit (takes care of upgrading a node that was switched off at the deployment/upgrade phase; keeps all nodes in sync automatically)
ftp://atrey.karlin.mff.cuni.cz/pub/local/mj/linux/bex-2.0.tar.gz, info in the README file
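To illustrate the "keep all nodes in sync" idea only (this is not BEX and not its interface): run an update command on every farm node over ssh and remember nodes that were unreachable so they can be retried later. The node list, retry file and update command are all assumed examples.

    #!/usr/bin/env python
    # Sketch (NOT BEX): push an update to all farm nodes and record
    # the ones that were down so a later run can bring them in sync.
    import subprocess

    NODES_FILE = "/etc/farm/nodes.txt"          # assumed: one host per line
    PENDING_FILE = "/var/lib/farm/pending.txt"  # assumed retry list
    UPDATE_CMD = "yum -y update"                # example update command

    def run_update(node):
        return subprocess.call(["ssh", "-o", "ConnectTimeout=10",
                                node, UPDATE_CMD]) == 0

    if __name__ == "__main__":
        failed = []
        for node in open(NODES_FILE).read().split():
            if not run_update(node):
                failed.append(node)
        # Nodes that were switched off get queued for the next run.
        with open(PENDING_FILE, "w") as f:
            f.write("\n".join(failed) + "\n")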
SAM graphs
SAM at a glance – regular check of our SAM station accessibility
SAM At A Glance: d0_fzu_prague (D0 Production Environment), page generated on 05 Sep :31:28
Server           Host:Port                    Version     Up Since
Master:Station   sam.farm.particle.cz:45274   v4_2_1_77   31 Jul :08:18
FSS:Server       sam.farm.particle.cz:45278   v4_2_1_77   31 Jul :08:20
NGFSS:Server     sam.farm.particle.cz:45281   v4_2_1_77   31 Jul :08:22
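A crude sketch of an accessibility check in this spirit: test whether the station ports listed in the table above still accept TCP connections. The real "SAM at a glance" page reports far more; only the host:port pairs are taken from the table, the rest is assumed.

    #!/usr/bin/env python
    # Sketch: TCP reachability check for the SAM station ports above.
    import socket

    SERVERS = {                      # host:port pairs from the table above
        "Master:Station": ("sam.farm.particle.cz", 45274),
        "FSS":            ("sam.farm.particle.cz", 45278),
        "NGFSS":          ("sam.farm.particle.cz", 45281),
    }

    def reachable(host, port, timeout=5.0):
        try:
            s = socket.create_connection((host, port), timeout)
            s.close()
            return True
        except socket.error:
            return False

    if __name__ == "__main__":
        for name, (host, port) in SERVERS.items():
            status = "up" if reachable(host, port) else "DOWN"
            print("%-16s %s:%d %s" % (name, host, port, status))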
Service monitoring
Using the checks described above and their combinations
Rely on useful monitors supported centrally or by the experiments
We would appreciate an early warning if jobs on some site/worker nodes start failing quickly after submission
Service requirements for T2s in "extended" working hours
–No special plan today
–Try to provide an architecture such that the responsible people can travel and still do as much as possible remotely (e.g. network console access)
–Future computing capacities will probably require new arrangements
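A sketch of the "early warning" idea mentioned above: raise an alarm when a large fraction of recently finished jobs died shortly after submission. The job record format (CSV with exit_code and runtime_seconds columns), the window size and the thresholds are all hypothetical.

    #!/usr/bin/env python
    # Sketch: alert when too many recent jobs fail quickly after start.
    import csv
    import sys

    WINDOW = 50             # look at the last 50 finished jobs (assumed)
    MAX_FAIL_FRACTION = 0.5 # assumed alarm threshold
    SHORT_RUNTIME = 300     # "failed quickly" = died within 5 min (assumed)

    def quick_failures(records):
        recent = records[-WINDOW:]
        bad = [r for r in recent
               if int(r["exit_code"]) != 0
               and int(r["runtime_seconds"]) < SHORT_RUNTIME]
        return len(bad), len(recent)

    if __name__ == "__main__":
        with open(sys.argv[1]) as f:     # e.g. jobs.csv with a header line
            records = list(csv.DictReader(f))
        bad, total = quick_failures(records)
        if total and float(bad) / total > MAX_FAIL_FRACTION:
            print("ALERT: %d of the last %d jobs failed quickly" % (bad, total))
            sys.exit(2)
        print("OK: %d of the last %d jobs failed quickly" % (bad, total))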
Remarks
A sufficient set of monitors for HW, basic SW and middleware is indispensable
Especially for service monitoring we rely on centrally distributed tools
–Big space for additions and improvements
–Or just more useful setups
–E.g. SFT – Site Functional Tests – very useful
Service level – no special arrangement; outside working hours we rather rely on remote repairs