
WLCG Service Report ~~~ WLCG Management Board, 23rd March 2010

Introduction
- Period covered: since the beginning of March
- Only one Service Incident Report: replication of LHCb conditions Tier0->Tier1 and Tier0->online partially down (18 hours)
- An important "Change Assessment" for the upgrade of ATLAS SRM to 2.9
- Site availability plots show good service availability throughout this period
- Alarm tests: generally OK but…
- Update from WLCG Tier1 Service Coordination meeting(s)

Service Incidents & Change Assessments
Full Change Assessment here. A summary from the assessment (IT-DSS) & the ATLAS + IT-ES viewpoints follows.

Site | When  | Issue (as reported in Service Incident Report)
CERN | 3 Mar | Replication of LHCb conditions Tier0->Tier1, Tier0->online partially down (18 hours). Details in next slides.

Site | When  | Change Assessment & Results
CERN | 9 Mar | Update SRM-ATLAS to 2.9, likely rollback

Streams SIR

Description
The main schema containing conditions data of the LHCb experiment had to be restored to a point in time in the past, as a result of a logical corruption that happened on Tuesday morning. The problem was reported to the PhyDB team around 4 pm on Wednesday. The restore completed successfully at around 6 pm, but unfortunately it disturbed Streams replication to LHCb online and 5 Tier1 sites (GridKa, IN2P3, PIC, RAL and SARA). Only replication to CNAF kept working.

Impact
Conditions data of the LHCb experiment was effectively not available:
- at LHCb online, GridKa and SARA between 4 pm and 10 pm on Wednesday 3rd March
- at IN2P3, PIC and RAL between 4 pm on 3rd March and 10 am on Thursday 4th March

Timeline of the incident
- 4 pm, Wednesday 3rd March: request to restore the central conditions data schema submitted to the PhyDB service
- 6 pm, 3rd March: restore successfully completed
- 6 pm - 8 pm, 3rd March: several Streams apply processes aborted; investigation started
- 8 pm - 10 pm, 3rd March: replication to LHCb online, GridKa and SARA fixed
- 9 pm - 10 am, 4th March: replication to IN2P3, PIC and RAL re-instantiated

Analysis
The problem was traced down to the default settings of the apply processes on the affected sites. In the past, we had to define apply rules on destinations running two apply processes, due to an Oracle bug. These rules, by default, ignore all LCRs (changes) which are tagged. Unfortunately, a Data Pump import session sets a Streams tag, so the data import operation (part of the schema restore procedure) was filtered out by the Streams apply processes.

Follow-up
The configuration of the apply processes is being reviewed and will be made uniform (the apply rules will be removed at all destination sites).
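To make the mechanism concrete: system-created Streams apply rules carry a condition of the form `:dml.is_null_tag() = 'Y'`, which is why the tagged Data Pump LCRs were silently discarded. Below is a minimal, hypothetical sketch of the follow-up action (detaching the rule sets at the destinations) using Python and cx_Oracle; the DSN, credentials and the stop/alter/start sequence are assumptions, not the actual PhyDB procedure.

```python
# Hypothetical sketch of "remove apply rules at all destination sites".
# Assumptions: connection details, and that detaching the positive rule set
# is the chosen fix so that tagged LCRs (e.g. from a Data Pump import during
# a schema restore) are no longer filtered out.
import cx_Oracle

def apply_all_lcrs(dsn: str, user: str, password: str) -> None:
    """Detach the rule set from every Streams apply process on this database."""
    with cx_Oracle.connect(user, password, dsn) as conn:
        cur = conn.cursor()
        cur.execute("SELECT apply_name FROM dba_apply")
        for (name,) in cur.fetchall():
            cur.callproc("DBMS_APPLY_ADM.STOP_APPLY", [name])
            cur.execute("""
                BEGIN
                  DBMS_APPLY_ADM.ALTER_APPLY(
                      apply_name      => :n,
                      remove_rule_set => TRUE);  -- no rule set: apply everything
                END;""", n=name)
            cur.callproc("DBMS_APPLY_ADM.START_APPLY", [name])
        conn.commit()
```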

ATLAS SRM 2.9 Upgrade
- A full Change Assessment was prepared by IT-DSS and discussed with ATLAS
- This follows on from discussions at the January GDB, prompted by ATLAS' experience with interventions on CASTOR and related services in 2009
- Highlights: an upgrade schedule was agreed, with the possibility to downgrade if the change did not work satisfactorily; the downgrade was tested prior to giving the green light to upgrade the production instance (see the sketch below)
- Details from the change assessment are in the backup slides
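As an illustration of the agreed pattern (not the actual IT-DSS tooling): upgrade, observe for a defined window while validation probes run, and fall back to the pre-tested downgrade path unless the experiment signs off. The command names and the validation hook below are hypothetical placeholders.

```python
# Hypothetical sketch of "upgrade for a defined period, roll back unless agreed".
# upgrade_cmd / rollback_cmd and the validate() probe are placeholders.
import subprocess
import time

def upgrade_with_tested_rollback(upgrade_cmd, rollback_cmd, validate,
                                 window_s=8 * 3600, probe_interval_s=600):
    """Apply the change, keep probing during the observation window, and roll
    back (via the pre-tested procedure) on the first failed validation."""
    subprocess.run(upgrade_cmd, check=True)           # short downtime starts here
    deadline = time.monotonic() + window_s
    while time.monotonic() < deadline:
        if not validate():                            # functional/stress probes
            subprocess.run(rollback_cmd, check=True)  # downgrade was pre-tested
            return "rolled back"
        time.sleep(probe_interval_s)
    return "kept new version"                         # subject to experiment sign-off
```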

CASTOR SRM 2.9 upgrade (IT-ES view)
IT-ES contributed to the upgrade by:
- Testing the new functionality in Pre-Production and validating it against the ATLAS stack
- Organizing the stress testing right after the upgrade
- Monitoring performance/availability in the subsequent days

From the IT-ES perspective, the procedure is a good compromise between an aggressive and a conservative approach:
- It allows the new functionality to be fully validated, thanks to the Pre-Production SRM endpoint pointing to the CASTOR production instance
- It allows scale testing: not at the same scale as CCRC'08 or STEP'09, but on the other hand in more realistic conditions (recent problems were caused by chaotic activity rather than high transfer rates)
- The possibility of a quick rollback guarantees a safety margin

SRM 2.9 – Post Change Assessment
- After the upgrade and testing by ATLAS, the change assessment was updated with the experience: details of what happened during the upgrade and during the testing period
- Not everything went according to plan – one of the main motivations for producing and updating such assessments!
- Globally successful: Tue 16th, agreement that ATLAS remains on SRM 2.9 as their production version

STANDARD REPORTS

Meeting Attendance Summary (daily operations meeting, Mon-Fri; each cell shows attendance per week, week 1/week 2/week 3)

Site   | Attendance
CERN   | Y/Y/Y
ASGC   | Y/Y/Y, Y/Y/N, Y/Y/Y
BNL    | Y/Y/Y, Y/Y/N
CNAF   | Y/N/N, Y/N/Y, N/Y/Y, N/N/N
FNAL   | Y/Y/Y
KIT    | Y/Y/Y
IN2P3  | Y/Y/Y
NDGF   | Y/Y/Y, Y/N/Y, Y/Y/Y
NL-T1  | Y/Y/Y
PIC    | Y/Y/Y, N/Y/Y, Y/Y/Y, N/Y/Y, Y/Y/Y
RAL    | Y/Y/Y
TRIUMF |

[Site availability plots for ATLAS, ALICE, LHCb and CMS - week of 1 March]

Analysis of the availability plots - week of 1 March

Common for all the experiments:
0.1 IN2P3: Planned outage for maintenance of batch and mass storage
0.2 TAIWAN: Scheduled downtime Wednesday morning. Most services recovered quickly, except for LFC and FTS; due to an Oracle block error, a 2-hour unscheduled downtime was declared for these. The 2 services were recovered at 14:
0.3 IN2P3: SAM SRM test failures, which disappeared after ~4 hours (problems with the BDII; a known performance problem of the SL5 BDII)

ATLAS:
1.1 RAL: SRM overload (tests hitting 10-minute timeouts). Two ATLASDATADISK servers out with independent problems
1.2 NIKHEF: Problem with one disk server (seems to be due to an Infiniband driver; kernel timeout values need to be increased)

ALICE: Nothing to report

CMS:
3.1 RAL: Temporary test failures due to the deployment of a new version of the File Catalog

LHCb:
4.1 GRIDKA: SQLite problems due to the usual nfslock mechanism getting stuck. The NFS server was restarted
4.2 CNAF: Problems with the local batch system, under investigation
4.3 NIKHEF: The failure of the critical File Access SAM test has been understood by the core application developers: some libraries (libgsitunnel) for SLC5 platforms were not properly deployed in the AA
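For context on how such plots can be read: an experiment's daily availability at a site is, in essence, the fraction of its critical SAM tests that pass that day. A simplified sketch follows; the production SAM/GridView algorithm also integrates test status over time and treats scheduled downtime separately, which this deliberately ignores.

```python
# Simplified sketch: availability as the pass fraction of critical SAM tests.
# Only meant to make the plots' meaning concrete, not the real SAM algorithm.
from collections import defaultdict

def daily_availability(results):
    """results: iterable of (site, day, passed) tuples for critical tests."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for site, day, ok in results:
        total[(site, day)] += 1
        passed[(site, day)] += bool(ok)
    return {key: passed[key] / total[key] for key in total}

# Example: a site failing SRM tests for a few hours shows a dip, not a zero.
print(daily_availability([("IN2P3", "2010-03-01", False),
                          ("IN2P3", "2010-03-01", True)]))
```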

[Site availability plots for ATLAS, ALICE, CMS and LHCb - week of 8 March]

Analysis of the availability plots - week of 8 March

Common for all the experiments:
0.1 TAIWAN: Unscheduled power cut

ATLAS:
1.1 INFN: Checksums enabled for INFN-T1 in FTS. Problems were observed, so checksums were switched off again
1.2 NIKHEF: Unscheduled downtime (FTS is down). Unable to start the transfer agents
1.3 RAL: Disk server out of action, part of ATLAS MCDISK. SRM overload (tests hitting 10-minute timeouts)
1.4 NDGF: LFC host certificates had expired; fixed. LFC daemon giving a core dump, under investigation

ALICE: Nothing to report

CMS:
3.1 CNAF: CE SAM test failures - LSF master dying
3.2 IN2P3: SRM test failure (authentication problems)
3.3 CERN: Temporary SAM test failure (timeout)

LHCb:
4.1 IN2P3: Temporary test failures (software missing)
4.2 NIKHEF: Temporary test failures due to migrating to and testing a new test code
4.3 PIC: Application-related issues during the weekend on the certification system were accidentally published in SAM. Experts contacted; fixed

[Site availability plots for ATLAS, ALICE, CMS and LHCb - week of 15 March]

Analysis of the availability plots - week of 15 March

Common for all the experiments:
0.1 IN2P3: FTS upgrade
0.2 NIKHEF: Two of the disk servers had problems accessing their file systems

ATLAS:
1.1 NDGF: LFC certificate expired. Cloud set offline in Panda and blacklisted in DDM
1.2 NIKHEF: FTS down until the morning of 15 March. Downgraded to FTS 2.1; put back in DDM
1.3 NDGF: New certificate installed on the server, but occasional crashes of the LFC daemon
1.4 SARA: Some SRM problems observed at SARA, quickly fixed
1.5 NDGF: SAM test failure (timeout when executing test SRMv2-ATLAS-lcg-cp after 600 seconds)

ALICE:
2.1 KIT: Proxy registration not working; the user is not allowed to register his proxy within the VOBOX

CMS:
3.1 KIT: SAM test failure (connection errors to SRM)
3.2 RAL: SAM test failure (timeout when executing test CE-cms-analysis after 1800 seconds)

LHCb:
4.1 CNAF: Authentication issues on SRM and gridftp after the StoRM upgrade
4.2 PIC: SAM test failure (DaVinci installation in progress)
4.3 RAL: SAM test failure (SRMv2-lhcb-DiracUnitTestUSER)
4.4 GRIDKA: SAM test failure (SRMv2-lhcb-DiracUnitTestUSER; some authentication problem)
4.5 CNAF: SAM test failure (missing software)
4.6 CERN: SAM test failure (SRMv2-lhcb-DiracUnitTestUSER)
4.7 NIKHEF: Problem with a disk server

GGUS SUMMARIES & ALARM TESTS

Alarm Ticket Tests (March 10)
- Initiated by the GGUS developers towards the Tier1s, as part of the service verification procedure, in 3 slices: Asia/Pacific right after the release, European sites early afternoon (~12:00 UTC), US sites and Canada late afternoon (~18:00 UTC); see the sketch after this list
- The alarm tests in general went well except that, for BNL and FNAL, the alarm notifications were sent correctly but the ticket creation in the OSG FP system failed, due to a missing submitter name in the tickets: the alarm tickets for BNL and FNAL had been submitted from home without using a certificate (which the interface had assumed). "Improved" - retested & OK…
- One genuine alarm in the last 4 weeks - drill-down later
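A toy sketch of the three-slice scheme described above; the exact submission hours are assumptions read off this slide (release morning for Asia/Pacific, ~12:00 UTC for Europe, ~18:00 UTC for US + Canada).

```python
# Toy sketch of the timezone-sliced alarm-test schedule; hours are assumptions.
from datetime import datetime, timezone

SLICE_HOURS_UTC = {
    "Asia/Pacific": 8,    # right after the release (hour assumed)
    "Europe": 12,         # early afternoon, ~12:00 UTC
    "US + Canada": 18,    # late afternoon, ~18:00 UTC
}

def alarm_test_schedule(release_day):
    """Map each Tier1 slice to its alarm-test submission time on release day."""
    return {region: release_day.replace(hour=h, minute=0, second=0, microsecond=0)
            for region, h in SLICE_HOURS_UTC.items()}

for region, when in alarm_test_schedule(
        datetime(2010, 3, 10, tzinfo=timezone.utc)).items():
    print(f"{region}: submit test alarms at {when:%H:%M} UTC")
```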

GGUS summary (3 weeks)

VO     | User | Team | Alarm | Total
ALICE  | 2    | 0    | 1     | 3
ATLAS  |      |      |       |
CMS    |      |      |       |
LHCb   |      |      |       |
Totals |      |      |       |

Alarm tickets
The ALARM tickets were mostly tests following the GGUS release of March 10th. This now happens every month, right at the moment of release completion and at a reasonable local time for the relevant Tier1s.

GGUS Alarms
Of the 20 alarm tickets in the last month, all are tests except for one: TAIWAN LFC not working.

Detailed description:
  "Dear TAIWAN site admin, we found that you come out from your downtime but your LFC is not working:

    [lxplus224] /afs/cern.ch/user/d/digirola > lfc-ping -h lfc.grid.sinica.edu.tw
    send2nsd: NS000 - name server not available on lfc.grid.sinica.edu.tw
    nsping: Name server not active

  Can you please urgently check? To contact the Expert On Call"

Submitted :57 UTC by Alessandro Di Girolamo. Solved :52, but with a new ticket for possibly missing files…
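For comparison with the lfc-ping check quoted above, a crude reachability probe might look like the sketch below. It only verifies that the host accepts TCP connections on the usual LFC port (5010 is assumed here); the real lfc-ping also speaks the name-server protocol, which is how it produced the "name server not available" error.

```python
# Crude reachability probe, loosely comparable to the lfc-ping check above.
# Port 5010 (the usual LFC port) is an assumption; lfc-ping does more than this.
import socket

def lfc_reachable(host, port=5010, timeout=5.0):
    """Return True if something accepts TCP connections on the LFC port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(lfc_reachable("lfc.grid.sinica.edu.tw"))  # False during the incident
```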

WLCG T1SCM OVERVIEW

WLCG T1SCM Summary
- 4 meetings held so far this year (the 5th is this week…)
- Have managed to stick to agenda times - minutes available within a few days (max) of the meeting
- "Standing agenda" (next slide) + topical issues: FTS status and rollout; CMS initiative on handling prolonged site downtimes; review of alarm handling & problem escalation; Oracle 11g client issues
- Good attendance from the experiments and from service providers at CERN and the Tier1s - an attendance list will be added

WLCG T1SCM Standing Agenda
- Data Management & Other Tier1 Service Issues: includes updates on "baseline versions", outstanding problems, release status etc.
- Conditions Data Access & related services
- Experiment Database Service issues
- AOB

A full report on the meetings so far will be given to tomorrow's GDB.

Summary
- Stable service during the period of this report
- Streamlined daily operations meetings & MB reporting
- Successful use of a "Change Assessment" for a significant upgrade of ATLAS SRM - the positive feedback is encouraging
- The WLCG T1SCM is addressing a wide range of important service-related problems on a timescale of 1-2 weeks+ (longer in the case of prolonged-site-downtime strategies)
- More details on the T1SCM in tomorrow's report to the GDB
- Critical Services support at Tier0 was reviewed at the T1SCM; we propose an alarm test for all such services to check the end-to-end flow

BACKUP SLIDES

ATLAS SRM 2.9

Description
SRM 2.9 fixes issues seen several times by ATLAS, but has major parts rewritten. The deployment should be tested, but ATLAS cannot drive sufficient traffic via the PPS instance to gain enough confidence in the new code. As discussed, one way to test is to update the SRM-ATLAS production instance for a defined period (during the day) and then roll back (unless ATLAS decides to stay on the new version). The change date is 9 Mar 2010, with a current risk assessment of Medium (potentially high impact, but the rollback has been tested).

Testing Performed
- Standard functional test suite on the development testbed [FIXME: pointer to test scope]
- Standard stress test on the development testbed [FIXME: pointer to test scope]
- SRM-PPS has been running the new version (but receives little load); a short functional test passed
- The SRM downgrade procedures (config/RPM/database) have been validated on SRM-PPS
- The SRM DB upgrade script was successfully (modulo DB jobs) applied to a DB snapshot

Extended downtime during upgrade

Q: Is the upgrade transparent, or is downtime required?
A: DOWNTIME (both on upgrade and downgrade); outstanding SRM transactions will be failed.

Q: Are there major schema changes? What is the timing of the downtime?
A: Yes, major schema changes. Expected downtime: 30 min each way.

Q: What is the impact if the change execution overruns, e.g. with a limited change window before production has to restart?
A: No ATLAS data import/export; the impact depends on the exact date.

Risk if the upgrade is not performed

Q: What problems have been encountered with the current version which are fixed in the new one (tickets/incidents)?
A: Major change: removal of the synchronous stager callbacks, which were responsible for PostMortem12Dec09 and IncidentsSrmSlsRed24Jan2010 (as well as several other periods of unavailability). Feature: bug #60503 allows ATLAS to determine whether a file is accessible; this should avoid scheduling some FTS transfers that would otherwise fail (i.e. a lower error rate).

Q: Divergence between the current tested/supported version and the installed versions?
A: No divergence yet - SRM 2.9 isn't yet rolled out widely.

Change Completion Report
Did the change go as planned? What were the problems?
- The update took longer than expected (the full 30 min slot; expected: 5 min)
- Mismatch between the "production" DB privileges and those on PPS (and on the "snapshot"): the DB upgrade script failed with "insufficient privileges". Fixed by Nilo (also on the other instances)
- The "scheduled" SW update introduced other, unrelated RPM changes into the test; servers were rebooted to apply a new kernel (within the update window)
- During the test: the SLS probe showed "lcg_cp: Invalid argument" at 9:40 (NAGIOS dteam/ops tests at 9:30 were OK) - understood, the upgrade procedure had not been fully followed
- A peak of activity was observed at 9:47:30 (mail from Stephane "we start the test" at 9:35), resulting in ~700 requests being rejected because of thread exhaustion
- High DB row lock contention was observed, which cleared up by itself - due to user activity (looping on bringOnline for a few files, plus an SRM 2.9 inefficiency addressed via a hotfix). This led to a number of extra requests being rejected because of thread exhaustion
- Result from the initial tests: ChangesCASTORATLASSRMAtlas29TestParamChange, applied on Thu morning
- The downgrade was postponed after the meeting on Tue 15:30 (ATLAS ELOG), as the new version had been running smoothly
- The downgrade was reviewed Thu 15:30 -> keep 2.9 until after the weekend; consider 2.9 to be "production" if no major issue until Tuesday 16 March
- Tue 16th: agreement that ATLAS remains on SRM 2.9 as their production version