Operation team at CCIN2P3
Suzanne Poulat
Overview
- Operation team
- Organisation
- Operation's role
- Services outside working hours
- Tools
- Monitored services
- Examples
Operation team
Two groups: Support and Operation
Support (9 persons):
- general user support
- dedicated persons for LHC experiments
- help desk (Xhelp)
- opening the CC to collaborations and other sciences
Operation: details follow
Organisation
Ten persons in the group:
- two for Grid coordination
- four for Operation
- four operators in shifts covering 08:00 AM to 09:00 PM, 7/7
On a weekly basis:
- one person for operation (often 1.5)
- the others have tasks such as development, monitoring or administration
Operation's role
- Check the availability of all services (storage, CPU, …)
- Optimize service usage
- Ensure that the commitments of CCIN2P3 to the experiments and Grid VOs are respected
- Organize the scheduled shutdowns
- Coordinate actions during unscheduled downtimes
- Monitor and manage the tape libraries
- Create and manage accounts and AFS space
- Organize the "on duty" service
Services outside working hours
On-site night security guard from 6 PM to 8 AM and on weekends:
- no computing actions: alerting and messaging only
1 on-duty engineer (evenings, weekends):
- corrective actions if possible (documentation, training)
- else call an expert … if available
Weekend: 1 operator on site (10 AM to 5 PM):
- first low-level action
- else call the on-duty engineer
The result is "best effort" coverage
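The escalation path on this slide can be sketched as a short routine. This is only an illustration under stated assumptions: the role names, the `build_chain` and `escalate` functions and the `can_fix` mapping are hypothetical and do not describe any actual CCIN2P3 tool. The security guard is left out of the chain because the slide says the guard takes no computing actions and only alerts.

```python
def build_chain(operator_on_site, expert_available):
    """Return the ordered list of roles to contact (ordering assumed from the slide)."""
    chain = []
    if operator_on_site:  # weekend, 10 AM to 5 PM: first low-level action
        chain.append("operator")
    chain.append("on-duty engineer")  # evenings and weekends
    if expert_available:  # the expert is called only if available
        chain.append("expert")
    return chain


def escalate(chain, can_fix):
    """Try each role in order; return (resolver or None, roles contacted).

    can_fix is a hypothetical mapping role -> bool telling whether that
    role is able to resolve the incident.
    """
    contacted = []
    for role in chain:
        contacted.append(role)
        if can_fix.get(role, False):
            return role, contacted
    # Best effort: nobody reachable out of hours could fix the problem.
    return None, contacted


if __name__ == "__main__":
    chain = build_chain(operator_on_site=True, expert_available=False)
    resolver, contacted = escalate(chain, {"on-duty engineer": True})
    print(resolver, contacted)
```

A `None` resolver corresponds to the "best effort" case: the incident stays open until working hours resume.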
Tools
- Monitoring tool: NGOP -> Nagios
- Remote Logging Service: RLS
- Mail
- Tickets from local and grid users: Xhelp, interfaced with GGUS at the CC
- Web pages on the current state of services
- Wiki for documentation, recipes, shutdowns, post-mortem analysis
- Log of the daily production: ELog
- Ticket web page for tape and drive incidents (~50 incidents per month: 10 drives, 40 tapes, with 2 losses of data)
- Scripts to analyse faulty tapes
Monitored services
- BQS
- Storage: HPSS, dCache, AFS
- Grid: CE, SRM, top BDII
- Databases
- Others: tape libraries, Saphir (privileges and location of services)
- Workers and all servers
Nagios
SMURF
Anastasie – Running jobs
Xhelp
Xhelp (2)
~320 tickets per month = 10 to 20 tickets per day
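As a rough check of the figure above, the monthly ticket count can be converted to a daily rate. The sketch below assumes a 30-day calendar month and about 22 working days per month; both values are assumptions for illustration, and the `daily_rate` helper is hypothetical.

```python
TICKETS_PER_MONTH = 320  # figure quoted on the slide


def daily_rate(tickets_per_month, days_in_month):
    """Average number of tickets per day over the given number of days."""
    return tickets_per_month / days_in_month


# Assumed month lengths, not taken from the slide.
calendar_rate = daily_rate(TICKETS_PER_MONTH, 30)  # every day counted
working_rate = daily_rate(TICKETS_PER_MONTH, 22)   # working days only

if __name__ == "__main__":
    print(f"{calendar_rate:.1f} per calendar day, {working_rate:.1f} per working day")
```

Both averages fall inside the quoted range of 10 to 20 tickets per day; the upper end of the range presumably reflects busier days.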
Xhelp (3)
Implementations
- Operation wiki
- Nagios monitoring
- Ovax
- Users database interface
- Robotics incidents (Incidents robotique)
- On-duty tools
QUESTIONS?