WLCG Service Report ~~~ WLCG Management Board, 23rd March 2010


1 WLCG Service Report Maria.Girone@cern.ch Jamie.Shiers@cern.ch ~~~ WLCG Management Board, 23rd March 2010

2 Introduction
- Period covered: since the beginning of March.
- Only one Service Incident Report: replication of LHCb conditions Tier0->Tier1 and Tier0->online partially down (18 hours).
- An important "Change Assessment" for the upgrade of ATLAS SRM to 2.9.
- Site availability plots show good service availability throughout this period.
- Alarm tests: generally OK, but…
- Update from the WLCG Tier1 Service Coordination meeting(s).

3 Service Incidents & Change Assessments

Site | When  | Issue (as reported in Service Incident Report)
CERN | 3 Mar | Replication of LHCb conditions Tier0->Tier1, Tier0->online partially down (18 hours). Details in the next slides.

Site | When  | Change Assessment & Results
CERN | 9 Mar | Update SRM-ATLAS to 2.9, likely rollback.

Full Change Assessment here. A summary from the assessment (IT-DSS) and the ATLAS + IT-ES viewpoints follows.

4 Streams SIR

Description
The main schema containing conditions data of the LHCb experiment (LHCB_CONDDB@LHCBR) had to be restored to a point in time in the past as a result of a logical corruption that happened on Tuesday morning. The problem was reported to the PhyDB team around 4 pm on Wednesday. The restore was completed successfully at around 6 pm, but unfortunately it disturbed Streams replication to LHCb online and 5 Tier1 sites (GridKa, IN2P3, PIC, RAL and SARA). Only replication to CNAF kept working.

Impact
Conditions data of the LHCb experiment was effectively not available:
- at LHCb online, GridKa and SARA between 4 pm and 10 pm on Wednesday 3rd March
- at IN2P3, PIC and RAL between 4 pm on 3rd March and 10 am on Thursday 4th March

5 Timeline of the incident
- 4 pm, Wednesday 3rd March - request to restore the central conditions data schema submitted to the PhyDB service
- 6 pm, 3rd March - restore successfully completed
- 6 pm - 8 pm, 3rd March - several Streams apply processes aborted; investigation started
- 8 pm - 10 pm, 3rd March - replication to LHCb online, GridKa and SARA fixed
- 9 pm - 10 am, 4th March - replication to IN2P3, PIC and RAL re-instantiated

Analysis
The problem was traced down to the default settings of the apply processes on the affected sites. In the past, we had to define apply rules on the destinations running two apply processes due to an Oracle bug. These rules, by default, ignore all LCRs (changes) which are tagged. Unfortunately, a Data Pump import session sets a Streams tag, and as a result the data import operation (part of the schema restore procedure) was filtered out by the Streams apply processes.

Follow-up
The configuration of the apply processes is being reviewed and will be made uniform (the apply rules will be removed at all the destination sites).
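The filtering behaviour described above can be illustrated with a small, purely conceptual sketch: an apply rule that ignores tagged LCRs (the default for such rules) drops every change record carrying a tag, and since a Data Pump import sets a Streams tag, the restored rows never reach the destination. The Python below is an illustration with hypothetical names, not the actual Oracle Streams implementation; the parameter name include_tagged_lcr is borrowed from the usual Oracle Streams rule option only for readability.

```python
# Conceptual sketch (hypothetical names, not PL/SQL): why a Data Pump import
# is filtered out by an apply rule that ignores tagged LCRs.
from dataclasses import dataclass
from typing import Optional

@dataclass
class LCR:
    """A logical change record: one replicated change with an optional Streams tag."""
    table: str
    operation: str
    tag: Optional[bytes] = None  # a Data Pump import session sets a tag on its changes

def apply_rule_accepts(lcr: LCR, include_tagged_lcr: bool) -> bool:
    """Mimics a schema apply rule: tagged LCRs are dropped unless explicitly included."""
    if lcr.tag is not None and not include_tagged_lcr:
        return False
    return True

# Normal user change: no tag, so it is applied at the destination site.
user_change = LCR("LHCB_CONDDB.SOME_TABLE", "INSERT")
# Change produced by the Data Pump import used for the schema restore: tagged.
restore_change = LCR("LHCB_CONDDB.SOME_TABLE", "INSERT", tag=b"\x01")

for lcr in (user_change, restore_change):
    applied = apply_rule_accepts(lcr, include_tagged_lcr=False)  # default rule behaviour
    print(lcr.operation, "tag:", lcr.tag, "->", "applied" if applied else "filtered out")
```

Removing the extra apply rules, as in the follow-up above, is what lets such a restore propagate to all destinations again.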

6 ATLAS SRM 2.9 Upgrade
- A full Change Assessment was prepared by IT-DSS and discussed with ATLAS.
- This follows on from discussions at the January GDB, prompted by ATLAS' experience with interventions on CASTOR and related services in 2009.
- Highlights: an upgrade schedule was agreed, with the possibility to downgrade if the change did not work satisfactorily; the downgrade was tested prior to giving the green light to upgrade the production instance.
- Details from the change assessment are in the "backup slides".

7 CASTOR SRM 2.9 upgrade (IT/ES)
IT/ES contributed to the upgrade:
- Testing new functionality in Pre-Production and validating it against the ATLAS stack
- Organizing the stress testing right after the upgrade
- Monitoring performance/availability in the subsequent days
From the IT/ES perspective, the procedure is a good compromise between an aggressive and a conservative approach:
- It allows the new functionality to be fully validated, thanks to the Pre-Production SRM endpoint pointing to the CASTOR production instance
- It allows scale testing: not at the same scale as CCRC08 or STEP09, but in more realistic situations (recent problems were caused by chaotic activity rather than by high transfer rates)
- The possibility of a quick rollback guarantees a safety margin

8 SRM 2.9 – Post Change Assessment
- After the upgrade and testing by ATLAS, the change assessment was updated with the experience gained.
- It details what happened during the upgrade and during the testing period.
- Not everything went according to plan – one of the main motivations for producing and updating such assessments!
- Globally successful: Tue 16th, agreement that ATLAS remains on SRM 2.9 as their production version.

9 STANDARD REPORTS

10 Meeting Attendance Summary
[Table: per-site attendance record (Y/N entries) at the daily WLCG operations meetings, Monday to Friday, for CERN, ASGC, BNL, CNAF, FNAL, KIT, IN2P3, NDGF, NL-T1, PIC, RAL and TRIUMF.]

11 [Site availability plots for ATLAS, ALICE, CMS and LHCb, week of 1 March; the numbered annotations (0.x, 1.x, 3.x, 4.x) are explained on the next slide.]

12 Analysis of the availability plots (week of 1 March)

COMMON FOR ALL THE EXPERIMENTS
0.1 IN2P3: Planned outage for maintenance of batch and mass storage.
0.2 TAIWAN: Scheduled downtime Wednesday morning. Most services recovered quickly except for LFC and FTS; due to an Oracle block error a 2-hour unscheduled downtime was declared for them. These 2 services recovered at 14:15.
0.3 IN2P3: SAM SRM test failures, disappeared after ~4 hours (problems with the BDII; known performance problem of the SL5 BDII).

ATLAS
1.1 RAL: SRM overload (tests hitting 10-minute timeouts). Two ATLASDATADISK servers out with independent problems.
1.2 NIKHEF: Problem with one disk server (seems to be due to an Infiniband driver; kernel timeout values need to be increased).

ALICE
Nothing to report.

CMS
3.1 RAL: Temporary test failures due to deployment of the new version of the File Catalog.

LHCb
4.1 GRIDKA: SQLite problems due to the usual nfslock mechanism getting stuck. Restarted the NFS server.
4.2 CNAF: Problems with the local batch system, investigating.
4.3 NIKHEF: The critical File Access SAM test failure has been understood by the core application developers: some libraries (libgsitunnel) for SLC5 platforms were not properly deployed in the AA.

13 [Site availability plots for ATLAS, ALICE, CMS and LHCb, week of 8 March; the numbered annotations are explained on the next slide.]

14 Analysis of the availability plots (week of 8 March)

COMMON FOR ALL THE EXPERIMENTS
0.1 TAIWAN: Unscheduled power cut.

ATLAS
1.1 INFN: Enabled checksums for INFN-T1 in FTS. Problems were observed, so checksums were switched off.
1.2 NIKHEF: Unscheduled downtime (FTS is down). Unable to start the transfer agents.
1.3 RAL: Disk server out of action, part of ATLAS MC DISK. SRM overload (tests hitting 10-minute timeouts).
1.4 NDGF: LFC's host certificates had expired, fixed. LFC daemon giving a core dump, under investigation.

ALICE
Nothing to report.

CMS
3.1 CNAF: CE SAM test failures - LSF master dying.
3.2 IN2P3: SRM test failure (authentication problems).
3.3 CERN: Temporary SAM test failure (timeout).

LHCb
4.1 IN2P3: Temporary test failures (software missing).
4.2 NIKHEF: Temporary test failures due to migrating and testing new test code.
4.3 PIC: Application-related issues during the weekend on the certification system were accidentally published in SAM. Experts contacted, fixed.

15 [Site availability plots for ATLAS, ALICE, CMS and LHCb, week of 15 March; the numbered annotations are explained on the next slide.]

16 Analysis of the availability plots (week of 15 March)

COMMON FOR ALL THE EXPERIMENTS
0.1 IN2P3: FTS upgrade to 2.2.3.
0.2 NIKHEF: Two of the disk servers have problems accessing their file systems.

ATLAS
1.1 NDGF: LFC's certificate expired. Cloud set offline in Panda and blacklisted in DDM.
1.2 NIKHEF: FTS down until the morning of 15 March. Downgraded to FTS 2.1; put back in DDM.
1.3 NDGF: Installed new certificate on the server, but occasional crashes of the LFC daemon.
1.4 SARA: Some SRM problems observed at SARA, quickly fixed.
1.5 NDGF: SAM test failure (timeout when executing test SRMv2-ATLAS-lcg-cp after 600 seconds).

ALICE
2.1 KIT: Proxy registration not working. The user is not allowed to register his proxy within the VOBOX.

CMS
3.1 KIT: SAM test failure (connection errors to SRM).
3.2 RAL: SAM test failure (timeout when executing test CE-cms-analysis after 1800 seconds).

LHCb
4.1 CNAF: Authentication issues on SRM and GridFTP after the StoRM upgrade.
4.2 PIC: SAM test failure (DaVinci installation ongoing).
4.3 RAL: SAM test failure (SRMv2-lhcb-DiracUnitTestUSER).
4.4 GRIDKA: SAM test failure (SRMv2-lhcb-DiracUnitTestUSER, some authentication problem).
4.5 CNAF: SAM test failure (missing software).
4.6 CERN: SAM test failure (SRMv2-lhcb-DiracUnitTestUSER).
4.7 NIKHEF: Problem with a disk server.
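Several of the failures above are SAM probes timing out (SRMv2-ATLAS-lcg-cp after 600 seconds, CE-cms-analysis after 1800 seconds). The sketch below shows the general pattern of enforcing such a hard timeout around a grid client command; it is not the actual SAM test code, and the lcg-cp source/destination URLs are placeholders, not the real test configuration.

```python
# Minimal sketch (not the actual SAM probe): run a grid copy command with a hard
# timeout, in the style of the SRMv2-ATLAS-lcg-cp test and its 600-second limit.
import subprocess

def timed_probe(cmd, timeout_seconds):
    """Run cmd; return (status, detail) where status is OK, FAILED or TIMEOUT."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return "TIMEOUT", f"Timeout when executing test after {timeout_seconds} seconds"
    except FileNotFoundError:
        return "FAILED", f"{cmd[0]}: command not found"
    if result.returncode != 0:
        return "FAILED", result.stderr.strip()
    return "OK", result.stdout.strip()

# Placeholder SURL and destination, purely for illustration.
status, detail = timed_probe(
    ["lcg-cp", "-v",
     "srm://example-se.site.org/dpm/site.org/home/atlas/sam-test-file",
     "file:///tmp/sam-test-file"],
    timeout_seconds=600)
print(status, detail)
```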

17 GGUS SUMMARIES & ALARM TESTS

18 Alarm Ticket Tests (March 10)
- Initiated by the GGUS developers towards the Tier1s, as part of the service verification procedure, in 3 slices: Asia/Pacific right after the release, European sites in the early afternoon (~12:00 UTC), US sites and Canada in the late afternoon (~18:00 UTC).
- The alarm tests in general went well, except that for BNL and FNAL the alarm notifications were sent correctly but the ticket creation in the OSG FP system failed due to a missing submitter name in the tickets: the alarm tickets for BNL and FNAL had been submitted from home without using a certificate (which had been assumed). "Improved" – retested & OK…
- One genuine alarm in the last 4 weeks – drill-down later.

19 GGUS summary (3 weeks)

VO     | User | Team | Alarm | Total
ALICE  |    2 |    0 |     1 |     3
ATLAS  |   40 |  104 |    16 |   160
CMS    |   13 |    1 |     1 |    15
LHCb   |    2 |   25 |     1 |    28
Totals |   57 |  130 |    19 |   206

20 Alarm tickets
The ALARM tickets were mostly tests following the GGUS release of March 10th. This now happens every month, right at the moment of release completion and at a reasonable time of day for the relevant Tier1s' timezones.

21 GGUS Alarms
Of the 20 alarm tickets in the last month, all are tests except for one: https://gus.fzk.de/ws/ticket_info.php?ticket=56152 - TAIWAN LFC not working.

Detailed description: "Dear TAIWAN site admin, we found that you have come out of your downtime https://goc.gridops.org/downtime/list?id=71605446 but your LFC is not working:

[lxplus224] /afs/cern.ch/user/d/digirola > lfc-ping -h lfc.grid.sinica.edu.tw
send2nsd: NS000 - name server not available on lfc.grid.sinica.edu.tw
nsping: Name server not active

Can you please urgently check? To contact the Expert On Call: +41 76 487 5907"

Submitted 2010-03-03 10:57 UTC by Alessandro Di Girolamo. Solved 2010-03-03 19:52, but with a new ticket for possibly missing files…
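The check quoted in the ticket is a manual lfc-ping from lxplus. A minimal sketch of the same probe wrapped in Python, so it could be run repeatedly against one or more LFC hosts, is shown below; lfc-ping and its -h option are exactly as used in the ticket, while the host list, the timeout and the "raise an alarm" message are assumptions for illustration, not the actual ATLAS monitoring tooling.

```python
# Minimal sketch (assumption: the lfc-ping client is on PATH, as on lxplus above;
# host list and alerting are illustrative only).
import subprocess

LFC_HOSTS = ["lfc.grid.sinica.edu.tw"]  # host taken from the alarm ticket

def lfc_alive(host, timeout_seconds=30):
    """Return True if 'lfc-ping -h <host>' succeeds, False otherwise."""
    try:
        result = subprocess.run(["lfc-ping", "-h", host],
                                capture_output=True, text=True,
                                timeout=timeout_seconds)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
    # When the name server is down, lfc-ping exits non-zero and prints
    # e.g. "nsping: Name server not active", as in the ticket.
    return result.returncode == 0

for host in LFC_HOSTS:
    status = "OK" if lfc_alive(host) else "NOT RESPONDING - consider raising an alarm"
    print(f"{host}: {status}")
```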

22 WLCG T1SCM OVERVIEW

23 WLCG T1SCM Summary
- 4 meetings held so far this year (the 5th is this week…)
- Have managed to stick to agenda times – minutes available within a few days (max) of the meeting
- "Standing agenda" (next slide) + topical issues: FTS 2.2.3 status and rollout; CMS initiative on handling prolonged site downtimes; review of alarm handling & problem escalation; Oracle 11g client issues
- Good attendance from the experiments, service providers at CERN and Tier1s – attendance list will be added

24 WLCG T1SCM Standing Agenda
- Data Management & Other Tier1 Service Issues: includes updates on "baseline versions", outstanding problems, release updates etc.
- Conditions Data Access & related services
- Experiment Database Service issues
- AOB
A full report on the meetings so far will be given to tomorrow's GDB.

25 Summary
- Stable service during the period of this report
- Streamlined daily operations meetings & MB reporting
- Successful use of a "Change Assessment" for a significant upgrade of ATLAS SRM – the positive feedback is encouraging
- WLCG T1SCM is addressing a wide range of important service-related problems on a timescale of 1-2 weeks+ (longer in the case of prolonged site downtime strategies)
- More details on the T1SCM in tomorrow's report to the GDB
- Critical Services support at Tier0 reviewed at the T1SCM; propose an Alarm Test for all such services to check the end-to-end flow

26 BACKUP SLIDES

27 ATLAS SRM 2.9
Description: SRM-2.9 fixes issues seen several times by ATLAS, but has major parts rewritten. The deployment should be tested, but ATLAS cannot drive sufficient traffic via the PPS instance to gain enough confidence in the new code. As discussed on 2010-02-11, one way to test would be to update the SRM-ATLAS production instance for a defined period (during the day) and then roll back (unless ATLAS decides to stay on the new version). The change date is 9 Mar 2010, with a current risk assessment of Medium (potentially high impact, but the rollback has been tested).

28 Testing Performed
- Standard functional test suite on the development testbed for 2.9-1 and 2.9-2 [FIXME: pointer to test scope]
- Standard stress test on the development testbed for 2.9-1 [FIXME: pointer to test scope]
- SRM-PPS has run 2.9-1 since 2010-02-12 (but receives little load) and 2.9-2 since 2010-03-05; short functional test passed
- SRM-2.9-1 to 2.8-6 downgrade procedure (config/RPM/database) has been validated on SRM-PPS
- SRM-2.9-2 to 2.8-6 downgrade procedure (config/RPM/database) has been validated on SRM-PPS
- SRM-2.9-1 DB upgrade script successfully applied (modulo DB jobs) to a DB snapshot

29 Extended downtime during upgrade
Q: Is the upgrade transparent or is downtime required?
A: DOWNTIME (both on update and on downgrade); outstanding SRM transactions will be failed.
Q: Are there major schema changes? What is the timing of the downtime?
A: Yes, major schema changes. Expected downtime: 30 min each.
Q: What is the impact if the change execution overruns, given a limited change window before production has to restart?
A: No ATLAS data import/export; the impact depends on the exact date.
Q: Risk if the upgrade is not performed - what problems have been encountered with the current version which are fixed in the new one (tickets/incidents)?
A: Major change: removal of the synchronous stager callbacks, which are responsible for PostMortem12Dec09 and IncidentsSrmSlsRed24Jan2010 (as well as several other periods of unavailability). Feature: bug#60503 allows ATLAS to determine whether a file is accessible; this should avoid scheduling some FTS transfers that would otherwise fail (i.e. lower error rate).
Q: Divergence between the current tested/supported version and installed versions?
A: No divergence yet - SRM-2.9 isn't yet rolled out widely.
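As a sketch of how the bug#60503 feature could be exploited on the experiment side: query the file's SRM locality first and only schedule an FTS transfer when the source replica is actually accessible. The code below is an illustration only; the locality values, the get_srm_locality stand-in and the example SURLs are assumptions, not the actual ATLAS DDM or FTS code.

```python
# Illustrative sketch only (hypothetical names): skip scheduling transfers whose
# source file the SRM reports as inaccessible, which is the error-rate reduction
# that the bug#60503 feature aims at.
ACCESSIBLE_LOCALITIES = {"ONLINE", "ONLINE_AND_NEARLINE"}

# Stand-in for a real SRM Ls / locality query against the storage element.
EXAMPLE_LOCALITIES = {
    "srm://example-se/atlas/file_a": "ONLINE",       # placeholder SURL
    "srm://example-se/atlas/file_b": "UNAVAILABLE",  # placeholder SURL
}

def get_srm_locality(surl):
    """Hypothetical helper: in reality this would be an SRM client call."""
    return EXAMPLE_LOCALITIES.get(surl, "UNKNOWN")

def plan_transfers(surls):
    """Split SURLs into those worth submitting to FTS and those to skip."""
    to_submit, skipped = [], []
    for surl in surls:
        locality = get_srm_locality(surl)
        if locality in ACCESSIBLE_LOCALITIES:
            to_submit.append(surl)
        else:
            skipped.append((surl, locality))
    return to_submit, skipped

submit, skip = plan_transfers(list(EXAMPLE_LOCALITIES))
print("submit to FTS:", submit)
print("skip (would otherwise fail):", skip)
```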

30 Change Completion Report
Did the change go as planned? What were the problems?
- The update took longer than expected (full 30-min slot; expected: 5 min).
- Mismatch between "production" DB privileges and those on PPS (and on the "snapshot"): the DB upgrade script failed with "insufficient privileges". Fixed by Nilo (also on the other instances).
- The "scheduled" SW update introduced other, unrelated RPM changes into the test. Servers were rebooted to apply a new kernel (within the update window).
- During the test: the SLS probe reported "lcg_cp: Invalid argument" at 9:40 (Nagios dteam/ops tests at 9:30 were OK) - understood, the upgrade procedure had not been fully followed.
- Peak of activity observed at 9:47:30 (mail from Stephane "we start the test" at 9:35), resulting in ~700 requests being rejected because of thread exhaustion.
- DB high row lock contention observed, cleared up by itself - due to user activity (looping on bringOnline for a few files, plus an SRM-2.9 inefficiency addressed via a hotfix). This led to a number of extra requests being rejected because of thread exhaustion.
- Result from the initial tests: ChangesCASTORATLASSRMAtlas29TestParamChange, applied on Thursday morning.
- The downgrade was postponed after the Tue 15:30 meeting (ATLAS ELOG) as the new version had been running smoothly.
- Downgrade reviewed Thu 15:30 → keep on 2.9 until after the weekend. Consider 2.9 to be "production" if no major issue until Tuesday 16 March.
- Tue 16th: agreement that ATLAS remains on SRM 2.9 as their production version.

