CCRC’08 Feb Run Preliminary WLCG Report, 4 March 2008


Slide 2: Average daily transfer rates

Slide 3: Best short-term period: all 11 sites participating at close to or exceeding current nominal rates

Slide 4: Electronic log analysis, per VO
Elog entries were made for service degradations or failures and usually had a GGUS or site ticket associated with them.
Over the 26 days there were 171 entries in 81 threads.
Breakdown per VO and urgency (columns: Less Urgent, Urgent, Very Urgent, Top Priority):
4 Common (CERN WMS) 10 ALICE 1 9 (late start) 17 ATLAS CMS LHCb (disk space)
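As a quick sanity check on the totals quoted above, the following minimal sketch derives average rates from the three figures on the slide (26 days, 171 entries, 81 threads). The variable names are illustrative and not taken from the report; the results come out at roughly 6.6 entries per day, 3.1 new threads per day and 2.1 entries per thread.

```python
# Figures quoted on the slide: 171 elog entries in 81 threads over 26 days.
RUN_DAYS = 26
ELOG_ENTRIES = 171
ELOG_THREADS = 81

# Simple averages over the whole run (illustrative names, not from the report).
entries_per_day = ELOG_ENTRIES / RUN_DAYS
threads_per_day = ELOG_THREADS / RUN_DAYS
entries_per_thread = ELOG_ENTRIES / ELOG_THREADS

print(f"{entries_per_day:.1f} entries per day")
print(f"{threads_per_day:.1f} new threads per day")
print(f"{entries_per_thread:.1f} entries per thread")
```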

Slide 5: Elog analysis, per site

Site      Number of threads
CERN      34
IN2P3     10
NL-T1     9
CNAF      8
RAL       6
PIC       3
Tier 2    3
FZK       2
BNL       2
FNAL      2
ASGC      0
TRIUMF    0

Did the Tier 1s and Tier 2s look at the elog? There were only 3 entries from GRIF and 2 from FNAL/CMS.
How did the sites discover and follow reported problems?
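The per-site thread counts above can be tallied with a short sketch like the one below, which ranks the sites and computes each site's share. The dictionary simply transcribes the slide; the variable names are illustrative. The listed counts sum to 79, slightly below the 81 threads quoted for the whole run, and the slides do not explain the difference.

```python
# Thread counts per site, transcribed from the slide.
threads_per_site = {
    "CERN": 34, "IN2P3": 10, "NL-T1": 9, "CNAF": 8, "RAL": 6, "PIC": 3,
    "Tier 2": 3, "FZK": 2, "BNL": 2, "FNAL": 2, "ASGC": 0, "TRIUMF": 0,
}

total_threads = sum(threads_per_site.values())

# Rank sites by thread count, highest first (ties keep the slide's order).
ranked_sites = sorted(threads_per_site.items(), key=lambda item: item[1], reverse=True)

# Share of all listed threads attributed to each site, as a percentage.
share_percent = {
    site: 100 * count / total_threads for site, count in threads_per_site.items()
}

for site, count in ranked_sites:
    print(f"{site:8s} {count:3d} {share_percent[site]:5.1f}%")
```

On these numbers CERN accounts for about 43% of the listed threads.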

Slide 6: Elog analysis, per subject (fuzzy)

Subject              Number of threads   Notes
Site config          13                  most during first 2 weeks
SRM                  12
FTS                  11
GridFTP              7
dCache               7
Hardware             5                   disks, tape robot, power
Proxy (corruption)   5                   workaround propagation took several days
Certificate map      4                   wrong user cert used
PhEDEx (CMS)         4                   all from a single bug
CASTOR               2
lcg-utils            2
Other (1 each)       5                   vobox, afs, ftd, wms, lfc

See "Storage-ware Review: Problems Encountered & Roadmap" by F. Donno this afternoon.

Slide 7: One detailed post-mortem needed during run
Elog 115, Monday Feb 18 at 15:00, ATLAS (S. Campana):
– "I would like to have this tracked as 'Notification problem in CCRC08'. I submitted an ELOG, a GGUS and a Ticket to (all of them with maximum priority) on friday and I got no notification from any of the three before monday."
This referred to many failures of ATLAS data export from 14 to 18 Feb. The above tickets were submitted at on Friday, but there is no weekend ticket cover.
Detailed timeline analysis showed 4 separate problems, illustrating the type of complex operational problems seen, prepared by the CASTOR team at:
–
– A malformed stager query on 14 Feb built up excessive processes on the CERN SRM v1 server. An operator alarm triggered 12 hours later was only partially understood; the service manager then saw the machine recovering.
– Connection timeouts to srm.cern.ch triggered the Friday tickets. A disk server had failed but had not dropped out of CASTOR. This was found and fixed by the regular service check on Saturday. The tickets were followed up on Monday.
– An ATLAS stager failure late on Saturday sent an SMS to an expert, who fixed it within an hour.
– Starting on Sunday morning there were two long export interruptions due to the corrupted FTS proxy bug. This was understood on Monday and hand-fixed until workarounds were put in place on Tuesday/Wednesday. Discussed in daily meetings.

Slide 8: Communications Issues
ATLAS were unaware that GGUS and CERN tickets are not formally looked at outside normal working hours.
A mechanism is in place for each experiment to raise out-of-hours alarms to the 24-hour operator, using a restricted list of trusted users, but the ATLAS user concerned (who was on the list) was not aware of this mechanism. It triggers escalation to the permanent system administrator rota, who in turn can escalate to the permanent FIO group rota covering CASTOR, FTS and LFC operations problems.
We also publicise the CERN operator phone number, but not as a 24-hour possibility.
Operator alarms do not extend to high-level functionality, e.g. ATLAS data export should be happening but is not. The alarms the operators got over that weekend did not imply escalation.
We should look at using the higher-level monitoring (SLS and experiment dashboards); see Julia Andreeva's "Monitoring for CCRC’08" talk.
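The out-of-hours alarm path described above can be pictured with a small sketch. This is only an illustrative model under stated assumptions, not the actual CERN alarm system: the function name, the user names and the rule that the final hop applies only to CASTOR, FTS and LFC problems are hypothetical readings of the slide, and the slide says the system administrator rota "can" escalate, so the last step is optional in practice.

```python
# Hypothetical model of the escalation chain described on the slide.
# All names below are illustrative; none comes from a real CERN tool.
OPERATOR = "24-hour operator"
SYSADMIN_ROTA = "system administrator rota"
FIO_ROTA = "FIO group rota"

# Services the slide says the FIO group rota covers.
FIO_SERVICES = {"CASTOR", "FTS", "LFC"}


def alarm_path(user, service, trusted_users):
    """Return the responders an out-of-hours alarm may pass through.

    An empty list means the alarm is not accepted, because the sender
    is not on the experiment's restricted list of trusted users.
    """
    if user not in trusted_users:
        return []
    path = [OPERATOR, SYSADMIN_ROTA]
    # Assumption: the sysadmin rota escalates further only for FIO services.
    if service in FIO_SERVICES:
        path.append(FIO_ROTA)
    return path


# Example with made-up user names.
atlas_trusted = {"atlas_shifter"}
print(alarm_path("atlas_shifter", "FTS", atlas_trusted))
print(alarm_path("someone_else", "FTS", atlas_trusted))
```

The sketch also makes the gap noted on the slide visible: a trusted user who submits only a GGUS or CERN ticket never enters this path at all.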

Slide 9: WLCG Twiki Contacts page

Slide 10: Some observations 1/2
We must standardise and clarify the operator/experiment communication lines at Tier 0 and Tier 1.
The management board milestones of providing 24 by 7 support and implementing agreed experiment VO-box Service Level Agreements need to be completed as soon as possible.
As expected, there were many teething problems in the first two weeks as SRMv2 endpoints were set up (over 160) and early bugs were found, after which the SRMv2 deployment generally worked well.
Missing functionalities in the data management layers have been exposed (the storage solutions working group was closely linked to the February activities) and follow-up planning is in place.
The Tier 1s proved fairly reliable, and we must follow up with all of them on the ATLAS initiative asking them to report on how their tape operations were organised and performed.

Slide 11: Some observations 2/2
Some particular experiment problems were seen at the WLCG level:
– ALICE: Only one Tier 1 (FZK) was fully ready; NL-T1 followed after several more days, then the last 3 only on the last day.
– ATLAS: Creation of the physics mix data sample took much longer than expected and a reduced sample had to be used.
– CMS: Inter-Tier 1 performance was not as good as expected.
– LHCb: The new version of Dirac had teething problems, causing a 1 week delay.
– Only two inter-experiment interferences were logged: FTS congestion at GRIF caused by competing ATLAS and CMS SEs (solved by implementing sub-site channels), and degradation of CMS exports to PIC caused by ATLAS filling the FTS request queue with retries.
Detailed experiment reports follow this afternoon.
We must collect and analyse the various metrics measurements.
The electronic log and daily operations meetings proved very useful and will continue. Not many Tier 1s attend the daily phone conference and we need to find out how to make it more useful.
Overall, a good learning experience and a positive result. Activities will continue from now on, with the May run acting as a focus point.